Re: Build fail...

2015-05-08 Thread Andrew Or
Thanks for pointing this out. I reverted that commit.

2015-05-08 19:01 GMT-07:00 Ted Yu :

> Looks like you're right:
>
>
> https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-1.3-Maven-with-YARN/HADOOP_PROFILE=hadoop-2.4,label=centos/427/console
>
> [error]
> /home/jenkins/workspace/Spark-1.3-Maven-with-YARN/HADOOP_PROFILE/hadoop-2.4/label/centos/core/src/main/scala/org/apache/spark/MapOutputTracker.scala:370:
> value tryWithSafeFinally is not a member of object
> org.apache.spark.util.Utils
> [error] Utils.tryWithSafeFinally {
> [error]   ^
>
>
> FYI
>
>
> On Fri, May 8, 2015 at 6:53 PM, rtimp  wrote:
>
> > Hi,
> >
> > From what I myself noticed a few minutes ago, I think branch-1.3 might be
> > failing to compile due to the most recent commit. I tried reverting to
> > commit 7fd212b575b6227df5068844416e51f11740e771 (the commit prior to the
> > head) on that branch and recompiling, and was successful. As Ferris would
> > say, it is so choice.
> >
> >
> >
> > --
> > View this message in context:
> >
> http://apache-spark-developers-list.1001551.n3.nabble.com/Build-fail-tp12170p12171.html
> > Sent from the Apache Spark Developers List mailing list archive at
> > Nabble.com.
> >
> > -
> > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> > For additional commands, e-mail: dev-h...@spark.apache.org
> >
> >
>


Re: Build fail...

2015-05-08 Thread Ted Yu
Looks like you're right:

https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-1.3-Maven-with-YARN/HADOOP_PROFILE=hadoop-2.4,label=centos/427/console

[error] 
/home/jenkins/workspace/Spark-1.3-Maven-with-YARN/HADOOP_PROFILE/hadoop-2.4/label/centos/core/src/main/scala/org/apache/spark/MapOutputTracker.scala:370:
value tryWithSafeFinally is not a member of object
org.apache.spark.util.Utils
[error] Utils.tryWithSafeFinally {
[error]   ^


FYI


On Fri, May 8, 2015 at 6:53 PM, rtimp  wrote:

> Hi,
>
> From what I myself noticed a few minutes ago, I think branch-1.3 might be
> failing to compile due to the most recent commit. I tried reverting to
> commit 7fd212b575b6227df5068844416e51f11740e771 (the commit prior to the
> head) on that branch and recompiling, and was successful. As Ferris would
> say, it is so choice.
>
>
>
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/Build-fail-tp12170p12171.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: Build fail...

2015-05-08 Thread rtimp
Hi,

From what I myself noticed a few minutes ago, I think branch-1.3 might be
failing to compile due to the most recent commit. I tried reverting to
commit 7fd212b575b6227df5068844416e51f11740e771 (the commit prior to the
head) on that branch and recompiling, and was successful. As Ferris would
say, it is so choice.



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Build-fail-tp12170p12171.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Recent Spark test failures

2015-05-08 Thread Ted Yu
Andrew:
Do you think the -M and -A options described here can be used in test runs ?
http://scalatest.org/user_guide/using_the_runner

Cheers

On Wed, May 6, 2015 at 5:41 PM, Andrew Or  wrote:

> Dear all,
>
> I'm sure you have all noticed that the Spark tests have been fairly
> unstable recently. I wanted to share a tool that I use to track which tests
> have been failing most often in order to prioritize fixing these flaky
> tests.
>
> Here is an output of the tool. This spreadsheet reports the top 10 failed
> tests this week (ending yesterday 5/5):
>
> https://docs.google.com/spreadsheets/d/1Iv_UDaTFGTMad1sOQ_s4ddWr6KD3PuFIHmTSzL7LSb4
>
> It is produced by a small project:
> https://github.com/andrewor14/spark-test-failures
>
> I have been filing JIRAs on flaky tests based on this tool. Hopefully we
> can collectively stabilize the build a little more as we near the release
> for Spark 1.4.
>
> -Andrew
>


Intellij Spark Source Compilation

2015-05-08 Thread rtimp
Hello,

I'm trying to compile the master branch of the Spark source (25889d8) in
IntelliJ. I followed the instructions in the wiki
https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools,
namely I downloaded IntelliJ 14.1.2 with jre 1.7.0_55, imported pom.xml,
generated all sources in the maven toolbar, and compiled. I receive 3
errors:

Error:(133, 10) java:
org.apache.spark.network.sasl.SaslEncryption.EncryptedMessage is not
abstract and does not override abstract method touch(java.lang.Object) in
io.netty.util.ReferenceCounted
/home/loki11/code/spark/spark/network/common/src/main/java/org/apache/spark/network/buffer/LazyFileRegion.java
Error:(39, 14) java: org.apache.spark.network.buffer.LazyFileRegion is not
abstract and does not override abstract method touch(java.lang.Object) in
io.netty.util.ReferenceCounted
/home/loki11/code/spark/spark/network/common/src/main/java/org/apache/spark/network/protocol/MessageWithHeader.java
Error:(34, 1) java: org.apache.spark.network.protocol.MessageWithHeader is
not abstract and does not override abstract method touch(java.lang.Object)
in io.netty.util.ReferenceCounted

On the command line, build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0
-DskipTests clean package succeeds, as do build/sbt clean assembly and
build/sbt compile.

It seems to me like I'm missing some trivial IntelliJ option (I'm normally
an Eclipse user, but was having even more trouble with that). Any advice?

Thanks!




--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Intellij-Spark-Source-Compilation-tp12168.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Having pyspark.sql.types.StructType implement __iter__()

2015-05-08 Thread Reynold Xin
Sure.


On Fri, May 8, 2015 at 2:43 PM, Nicholas Chammas  wrote:

> StructType looks an awful lot like a Python dictionary.
>
> However, it doesn’t implement __iter__(), so doing
> a quick conversion like this doesn’t work:
>
> >>> df = sqlContext.jsonRDD(sc.parallelize(['{"name": "El Magnifico"}']))
> >>> df.schema
> StructType(List(StructField(name,StringType,true)))
> >>> dict(df.schema)
> Traceback (most recent call last):
>   File "", line 1, in 
> TypeError: 'StructType' object is not iterable
>
> This would be super helpful for doing any custom schema manipulations
> without having to go through the whole .json() -> json.loads() ->
> manipulate() -> json.dumps() -> .fromJson() charade.
>
> Same goes for Row, which offers an asDict()
> <
> https://spark.apache.org/docs/1.3.1/api/python/pyspark.sql.html#pyspark.sql.Row.asDict
> >
> method but doesn’t support the more Pythonic dict(Row).
>
> Does this make sense?
>
> Nick
> ​
>
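
For anyone who wants the dict view today, a minimal PySpark workaround sketch (illustrative only, not from the thread): StructType already exposes a fields list, so a quick conversion does not need __iter__(). The app name and sample data below are made up.

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext("local[*]", "structtype-dict-sketch")
sqlContext = SQLContext(sc)

df = sqlContext.jsonRDD(sc.parallelize(['{"name": "El Magnifico"}']))

# StructType exposes a .fields list of StructField objects, so a dict view
# of the schema is one expression away even without __iter__():
schema_as_dict = dict((f.name, f.dataType) for f in df.schema.fields)
print(schema_as_dict)          # {'name': StringType}

# Row already offers asDict(), as noted above:
print(df.first().asDict())     # {'name': u'El Magnifico'}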


Having pyspark.sql.types.StructType implement __iter__()

2015-05-08 Thread Nicholas Chammas
StructType looks an awful lot like a Python dictionary.

However, it doesn’t implement __iter__(), so doing
a quick conversion like this doesn’t work:

>>> df = sqlContext.jsonRDD(sc.parallelize(['{"name": "El Magnifico"}']))
>>> df.schema
StructType(List(StructField(name,StringType,true)))
>>> dict(df.schema)
Traceback (most recent call last):
  File "", line 1, in 
TypeError: 'StructType' object is not iterable

This would be super helpful for doing any custom schema manipulations
without having to go through the whole .json() -> json.loads() ->
manipulate() -> json.dumps() -> .fromJson() charade.

Same goes for Row, which offers an asDict()

method but doesn’t support the more Pythonic dict(Row).

Does this make sense?

Nick
​


Re: branch-1.4 nightly builds?

2015-05-08 Thread Nicholas Chammas
https://issues.apache.org/jira/browse/SPARK-1517

That issue should probably be unassigned since I am not actively working on
it. (I can't unassign myself.)

Nick

On Fri, May 8, 2015 at 5:38 PM Punyashloka Biswal 
wrote:

> Dear Spark devs,
>
> Does anyone maintain nightly builds for branch-1.4? I'd like to start
> testing against it, and having a regularly updated build on a
> well-publicized repository would be a great help!
>
> Punya
>


branch-1.4 nightly builds?

2015-05-08 Thread Punyashloka Biswal
Dear Spark devs,

Does anyone maintain nightly builds for branch-1.4? I'd like to start
testing against it, and having a regularly updated build on a
well-publicized repository would be a great help!

Punya


Re: large volume spark job spends most of the time in AppendOnlyMap.changeValue

2015-05-08 Thread Josh Rosen
Do you have any more specific profiling data that you can share?  I'm
curious to know where AppendOnlyMap.changeValue is being called from.

On Fri, May 8, 2015 at 1:26 PM, Michal Haris 
wrote:

> +dev
> On 6 May 2015 10:45, "Michal Haris"  wrote:
>
> > Just wanted to check if somebody has seen similar behaviour or knows what
> > we might be doing wrong. We have a relatively complex spark application
> > which processes half a terabyte of data at various stages. We have
> profiled
> > it in several ways and everything seems to point to one place where 90%
> of
> > the time is spent:  AppendOnlyMap.changeValue. The job scales and is
> > relatively faster than its map-reduce alternative but it still feels
> slower
> > than it should be. I am suspecting too much spill but I haven't seen any
> > improvement by increasing number of partitions to 10k. Any idea would be
> > appreciated.
> >
> > --
> > Michal Haris
> > Technical Architect
> > direct line: +44 (0) 207 749 0229
> > www.visualdna.com | t: +44 (0) 207 734 7033,
> >
>


Re: large volume spark job spends most of the time in AppendOnlyMap.changeValue

2015-05-08 Thread Michal Haris
+dev
On 6 May 2015 10:45, "Michal Haris"  wrote:

> Just wanted to check if somebody has seen similar behaviour or knows what
> we might be doing wrong. We have a relatively complex spark application
> which processes half a terabyte of data at various stages. We have profiled
> it in several ways and everything seems to point to one place where 90% of
> the time is spent:  AppendOnlyMap.changeValue. The job scales and is
> relatively faster than its map-reduce alternative but it still feels slower
> than it should be. I am suspecting too much spill but I haven't seen any
> improvement by increasing number of partitions to 10k. Any idea would be
> appreciated.
>
> --
> Michal Haris
> Technical Architect
> direct line: +44 (0) 207 749 0229
> www.visualdna.com | t: +44 (0) 207 734 7033,
>


Re: DataFrame distinct vs RDD distinct

2015-05-08 Thread Olivier Girardot
I'll try to reproduce what has been reported to me first :) and I'll let
you know. Thanks !

On Thu, May 7, 2015 at 9:16 PM, Michael Armbrust  wrote:

> I'd happily merge a PR that changes the distinct implementation to be more
> like Spark core, assuming it includes benchmarks that show better
> performance for both the "fits in memory case" and the "too big for memory
> case".
>
> On Thu, May 7, 2015 at 2:23 AM, Olivier Girardot <
> o.girar...@lateral-thoughts.com> wrote:
>
>> Ok, but for the moment, this seems to be killing performances on some
>> computations...
>> I'll try to give you precise figures on this between rdd and dataframe.
>>
>> Olivier.
>>
>> On Thu, May 7, 2015 at 10:08 AM, Reynold Xin  wrote:
>>
>> > In 1.5, we will most likely just rewrite distinct in SQL to either use
>> the
>> > Aggregate operator which will benefit from all the Tungsten
>> optimizations,
>> > or have a Tungsten version of distinct for SQL/DataFrame.
>> >
>> > On Thu, May 7, 2015 at 1:32 AM, Olivier Girardot <
>> > o.girar...@lateral-thoughts.com> wrote:
>> >
>> >> Hi everyone,
>> >> there seems to be different implementations of the "distinct" feature
>> in
>> >> DataFrames and RDD and some performance issue with the DataFrame
>> distinct
>> >> API.
>> >>
>> >> In RDD.scala :
>> >>
>> >> def distinct(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
>> >>   map(x => (x, null)).reduceByKey((x, y) => x, numPartitions).map(_._1)
>> >> }
>> >>
>> >> And in DataFrame :
>> >>
>> >> case class Distinct(partial: Boolean, child: SparkPlan) extends UnaryNode {
>> >>   override def output: Seq[Attribute] = child.output
>> >>
>> >>   override def requiredChildDistribution: Seq[Distribution] =
>> >>     if (partial) UnspecifiedDistribution :: Nil else ClusteredDistribution(child.output) :: Nil
>> >>
>> >>   override def execute(): RDD[Row] = {
>> >>     child.execute().mapPartitions { iter =>
>> >>       val hashSet = new scala.collection.mutable.HashSet[Row]()
>> >>
>> >>       var currentRow: Row = null
>> >>       while (iter.hasNext) {
>> >>         currentRow = iter.next()
>> >>         if (!hashSet.contains(currentRow)) {
>> >>           hashSet.add(currentRow.copy())
>> >>         }
>> >>       }
>> >>
>> >>       hashSet.iterator
>> >>     }
>> >>   }
>> >> }
>> >
>> >
>> >>
>> >>
>> >>
>> >>
>> >> I can try to reproduce more clearly the performance issue, but do you
>> have
>> >> any insights into why we can't have the same distinct strategy between
>> >> DataFrame and RDD ?
>> >>
>> >> Regards,
>> >>
>> >> Olivier.
>> >>
>> >
>>
>
>
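
For illustration only (not from the thread), a minimal PySpark sketch of a stopgap: apply the RDD-style, reduceByKey-based distinct to a DataFrame's rows instead of the per-partition HashSet used by the SQL Distinct operator quoted above. The sample data is made up; Row extends tuple, so rows can serve as keys.

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext("local[*]", "rdd-style-distinct-sketch")
sqlContext = SQLContext(sc)

df = sqlContext.jsonRDD(sc.parallelize(
    ['{"a": 1, "b": "x"}', '{"a": 1, "b": "x"}', '{"a": 2, "b": "y"}']))

# Mimic RDD.distinct: map each row to a (row, null) pair, reduceByKey to
# drop duplicates during the shuffle, then keep only the rows.
distinct_rows = (df.rdd
                 .map(lambda row: (row, None))
                 .reduceByKey(lambda x, _: x)
                 .map(lambda kv: kv[0]))

print(distinct_rows.collect())   # 2 rows instead of 3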


Re: DataFrames equivalent to SQL table namespacing and aliases

2015-05-08 Thread Nicholas Chammas
Ah, neat. So in the example I gave earlier, I’d do this to get columns from
specific dataframes:

>>> df12.select(df1['a'], df2['other'])
DataFrame[a: bigint, other: string]
>>> df12.select(df1['a'], df2['other']).show()
a other
4 I dunno

This perhaps should be documented in an example in the docs somewhere. I’ll
open a PR for that I suppose.

Nick
​

On Fri, May 8, 2015 at 3:01 PM Reynold Xin  wrote:

> You can actually just use df1['a'] in projection to differentiate.
>
> e.g. in Scala (similar things work in Python):
>
>
> scala> val df1 = Seq((1, "one")).toDF("a", "b")
> df1: org.apache.spark.sql.DataFrame = [a: int, b: string]
>
> scala> val df2 = Seq((2, "two")).toDF("a", "b")
> df2: org.apache.spark.sql.DataFrame = [a: int, b: string]
>
> scala> df1.join(df2, df1("a") === df2("a") - 1).select(df1("a")).show()
> +-+
> |a|
> +-+
> |1|
> +-+
>
>
>
>
> On Fri, May 8, 2015 at 11:53 AM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> Oh, I didn't know about that. Thanks for the pointer, Rakesh.
>>
>> I wonder why they did that, as opposed to taking the cue from SQL and
>> prefixing column names with a specifiable dataframe alias. The suffix
>> approach seems quite ugly.
>>
>> Nick
>>
>> On Fri, May 8, 2015 at 2:47 PM Rakesh Chalasani 
>> wrote:
>>
>> > To add to the above discussion, Pandas, allows suffixing and prefixing
>> to
>> > solve this issue
>> >
>> >
>> >
>> http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.join.html
>> >
>> > Rakesh
>> >
>> > On Fri, May 8, 2015 at 2:42 PM Nicholas Chammas <
>> > nicholas.cham...@gmail.com> wrote:
>> >
>> >> DataFrames, as far as I can tell, don’t have an equivalent to SQL’s
>> table
>> >> aliases.
>> >>
>> >> This is essential when joining dataframes that have identically named
>> >> columns.
>> >>
>> >> >>> # PySpark 1.3.1>>> df1 = sqlContext.jsonRDD(sc.parallelize(['{"a":
>> 4,
>> >> "other": "I know"}']))>>> df2 =
>> sqlContext.jsonRDD(sc.parallelize(['{"a":
>> >> 4, "other": "I dunno"}']))>>> df12 = df1.join(df2, df1['a'] ==
>> df2['a'])>>>
>> >> df12
>> >> DataFrame[a: bigint, other: string, a: bigint, other: string]>>>
>> >> df12.printSchema()
>> >> root
>> >>  |-- a: long (nullable = true)
>> >>  |-- other: string (nullable = true)
>> >>  |-- a: long (nullable = true)
>> >>  |-- other: string (nullable = true)
>> >>
>> >> Now, trying any one of the following:
>> >>
>> >> df12.select('a')
>> >> df12['a']
>> >> df12.a
>> >>
>> >> yields this:
>> >>
>> >> org.apache.spark.sql.AnalysisException: Reference 'a' is ambiguous,
>> >> could be: a#360L, a#358L.;
>> >>
>> >> Same goes for accessing the other field.
>> >>
>> >> This is good, but what are we supposed to do in this case?
>> >>
>> >> SQL solves this by fully qualifying the column name with the table
>> name,
>> >>
>> > and also offering table aliasing <
>> http://dba.stackexchange.com/a/5991/2660
>> >> >
>> >
>> >
>> >> in the case where you are joining a table to itself.
>> >>
>> >> If we translate this directly into DataFrames lingo, perhaps it would
>> look
>> >> something like:
>> >>
>> >> df12['df1.a']
>> >> df12['df2.other']
>> >>
>> >> But I’m not sure how this fits into the larger API. This certainly
>> isn’t
>> >> backwards compatible with how joins are done now.
>> >>
>> >> So what’s the recommended course of action here?
>> >>
>> >> Having to unique-ify all your column names before joining doesn’t sound
>> >> like a nice solution.
>> >>
>> >> Nick
>> >> ​
>> >>
>> >
>>
>
>


Re: DataFrames equivalent to SQL table namespacing and aliases

2015-05-08 Thread Reynold Xin
You can actually just use df1['a'] in projection to differentiate.

e.g. in Scala (similar things work in Python):


scala> val df1 = Seq((1, "one")).toDF("a", "b")
df1: org.apache.spark.sql.DataFrame = [a: int, b: string]

scala> val df2 = Seq((2, "two")).toDF("a", "b")
df2: org.apache.spark.sql.DataFrame = [a: int, b: string]

scala> df1.join(df2, df1("a") === df2("a") - 1).select(df1("a")).show()
+-+
|a|
+-+
|1|
+-+




On Fri, May 8, 2015 at 11:53 AM, Nicholas Chammas <
nicholas.cham...@gmail.com> wrote:

> Oh, I didn't know about that. Thanks for the pointer, Rakesh.
>
> I wonder why they did that, as opposed to taking the cue from SQL and
> prefixing column names with a specifiable dataframe alias. The suffix
> approach seems quite ugly.
>
> Nick
>
> On Fri, May 8, 2015 at 2:47 PM Rakesh Chalasani 
> wrote:
>
> > To add to the above discussion, Pandas, allows suffixing and prefixing to
> > solve this issue
> >
> >
> >
> http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.join.html
> >
> > Rakesh
> >
> > On Fri, May 8, 2015 at 2:42 PM Nicholas Chammas <
> > nicholas.cham...@gmail.com> wrote:
> >
> >> DataFrames, as far as I can tell, don’t have an equivalent to SQL’s
> table
> >> aliases.
> >>
> >> This is essential when joining dataframes that have identically named
> >> columns.
> >>
> >> >>> # PySpark 1.3.1>>> df1 = sqlContext.jsonRDD(sc.parallelize(['{"a":
> 4,
> >> "other": "I know"}']))>>> df2 =
> sqlContext.jsonRDD(sc.parallelize(['{"a":
> >> 4, "other": "I dunno"}']))>>> df12 = df1.join(df2, df1['a'] ==
> df2['a'])>>>
> >> df12
> >> DataFrame[a: bigint, other: string, a: bigint, other: string]>>>
> >> df12.printSchema()
> >> root
> >>  |-- a: long (nullable = true)
> >>  |-- other: string (nullable = true)
> >>  |-- a: long (nullable = true)
> >>  |-- other: string (nullable = true)
> >>
> >> Now, trying any one of the following:
> >>
> >> df12.select('a')
> >> df12['a']
> >> df12.a
> >>
> >> yields this:
> >>
> >> org.apache.spark.sql.AnalysisException: Reference 'a' is ambiguous,
> >> could be: a#360L, a#358L.;
> >>
> >> Same goes for accessing the other field.
> >>
> >> This is good, but what are we supposed to do in this case?
> >>
> >> SQL solves this by fully qualifying the column name with the table name,
> >>
> > and also offering table aliasing <
> http://dba.stackexchange.com/a/5991/2660
> >> >
> >
> >
> >> in the case where you are joining a table to itself.
> >>
> >> If we translate this directly into DataFrames lingo, perhaps it would
> look
> >> something like:
> >>
> >> df12['df1.a']
> >> df12['df2.other']
> >>
> >> But I’m not sure how this fits into the larger API. This certainly isn’t
> >> backwards compatible with how joins are done now.
> >>
> >> So what’s the recommended course of action here?
> >>
> >> Having to unique-ify all your column names before joining doesn’t sound
> >> like a nice solution.
> >>
> >> Nick
> >> ​
> >>
> >
>


Re: DataFrames equivalent to SQL table namespacing and aliases

2015-05-08 Thread Nicholas Chammas
Oh, I didn't know about that. Thanks for the pointer, Rakesh.

I wonder why they did that, as opposed to taking the cue from SQL and
prefixing column names with a specifiable dataframe alias. The suffix
approach seems quite ugly.

Nick

On Fri, May 8, 2015 at 2:47 PM Rakesh Chalasani 
wrote:

> To add to the above discussion, Pandas, allows suffixing and prefixing to
> solve this issue
>
>
> http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.join.html
>
> Rakesh
>
> On Fri, May 8, 2015 at 2:42 PM Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> DataFrames, as far as I can tell, don’t have an equivalent to SQL’s table
>> aliases.
>>
>> This is essential when joining dataframes that have identically named
>> columns.
>>
>> >>> # PySpark 1.3.1>>> df1 = sqlContext.jsonRDD(sc.parallelize(['{"a": 4,
>> "other": "I know"}']))>>> df2 = sqlContext.jsonRDD(sc.parallelize(['{"a":
>> 4, "other": "I dunno"}']))>>> df12 = df1.join(df2, df1['a'] == df2['a'])>>>
>> df12
>> DataFrame[a: bigint, other: string, a: bigint, other: string]>>>
>> df12.printSchema()
>> root
>>  |-- a: long (nullable = true)
>>  |-- other: string (nullable = true)
>>  |-- a: long (nullable = true)
>>  |-- other: string (nullable = true)
>>
>> Now, trying any one of the following:
>>
>> df12.select('a')
>> df12['a']
>> df12.a
>>
>> yields this:
>>
>> org.apache.spark.sql.AnalysisException: Reference 'a' is ambiguous,
>> could be: a#360L, a#358L.;
>>
>> Same goes for accessing the other field.
>>
>> This is good, but what are we supposed to do in this case?
>>
>> SQL solves this by fully qualifying the column name with the table name,
>>
> and also offering table aliasing > >
>
>
>> in the case where you are joining a table to itself.
>>
>> If we translate this directly into DataFrames lingo, perhaps it would look
>> something like:
>>
>> df12['df1.a']
>> df12['df2.other']
>>
>> But I’m not sure how this fits into the larger API. This certainly isn’t
>> backwards compatible with how joins are done now.
>>
>> So what’s the recommended course of action here?
>>
>> Having to unique-ify all your column names before joining doesn’t sound
>> like a nice solution.
>>
>> Nick
>> ​
>>
>


Re: DataFrames equivalent to SQL table namespacing and aliases

2015-05-08 Thread Rakesh Chalasani
To add to the above discussion, Pandas, allows suffixing and prefixing to
solve this issue

http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.join.html

Rakesh

On Fri, May 8, 2015 at 2:42 PM Nicholas Chammas 
wrote:

> DataFrames, as far as I can tell, don’t have an equivalent to SQL’s table
> aliases.
>
> This is essential when joining dataframes that have identically named
> columns.
>
> >>> # PySpark 1.3.1>>> df1 = sqlContext.jsonRDD(sc.parallelize(['{"a": 4,
> "other": "I know"}']))>>> df2 = sqlContext.jsonRDD(sc.parallelize(['{"a":
> 4, "other": "I dunno"}']))>>> df12 = df1.join(df2, df1['a'] == df2['a'])>>>
> df12
> DataFrame[a: bigint, other: string, a: bigint, other: string]>>>
> df12.printSchema()
> root
>  |-- a: long (nullable = true)
>  |-- other: string (nullable = true)
>  |-- a: long (nullable = true)
>  |-- other: string (nullable = true)
>
> Now, trying any one of the following:
>
> df12.select('a')
> df12['a']
> df12.a
>
> yields this:
>
> org.apache.spark.sql.AnalysisException: Reference 'a' is ambiguous,
> could be: a#360L, a#358L.;
>
> Same goes for accessing the other field.
>
> This is good, but what are we supposed to do in this case?
>
> SQL solves this by fully qualifying the column name with the table name,
> and also offering table aliasing  >
> in the case where you are joining a table to itself.
>
> If we translate this directly into DataFrames lingo, perhaps it would look
> something like:
>
> df12['df1.a']
> df12['df2.other']
>
> But I’m not sure how this fits into the larger API. This certainly isn’t
> backwards compatible with how joins are done now.
>
> So what’s the recommended course of action here?
>
> Having to unique-ify all your column names before joining doesn’t sound
> like a nice solution.
>
> Nick
> ​
>


Re: [SparkSQL] cannot filter by a DateType column

2015-05-08 Thread Michael Armbrust
What version of Spark are you using?  It appears that at least in master we
are doing the conversion correctly, but it's possible older versions of
applySchema do not.  If you can reproduce the same bug in master, can you
open a JIRA?

On Fri, May 8, 2015 at 1:36 AM, Haopu Wang  wrote:

>  I want to filter a DataFrame based on a Date column.
>
>
>
> If the DataFrame object is constructed from a scala case class, it's
> working (either compare as String or Date). But if the DataFrame is
> generated by specifying a Schema to an RDD, it doesn't work. Below is the
> exception and test code.
>
>
>
> Do you have any idea about the error? Thank you very much!
>
>
>
> exception=
>
> java.lang.ClassCastException: java.sql.Date cannot be cast to java.lang.Integer
>   at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:106)
>   at org.apache.spark.sql.catalyst.expressions.Cast$$anonfun$castToString$2$$anonfun$apply$6.apply(Cast.scala:116)
>   at org.apache.spark.sql.catalyst.expressions.Cast.org$apache$spark$sql$catalyst$expressions$Cast$$buildCast(Cast.scala:111)
>   at org.apache.spark.sql.catalyst.expressions.Cast$$anonfun$castToString$2.apply(Cast.scala:116)
>   at org.apache.spark.sql.catalyst.expressions.Cast.eval(Cast.scala:426)
>   at org.apache.spark.sql.catalyst.expressions.GreaterThanOrEqual.eval(predicates.scala:305)
>   at org.apache.spark.sql.catalyst.expressions.InterpretedPredicate$$anonfun$apply$1.apply(predicates.scala:30)
>   at org.apache.spark.sql.catalyst.expressions.InterpretedPredicate$$anonfun$apply$1.apply(predicates.scala:30)
>   at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:390)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>
>
>
> code=
>
>
>
> val conf = new SparkConf().setAppName("DFTest").setMaster("local[*]")
> val sc = new SparkContext(conf)
> val sqlCtx = new HiveContext(sc)
> import sqlCtx.implicits._
>
> case class Test(dt: java.sql.Date)
>
> val df = sc.makeRDD(Seq(Test(new java.sql.Date(115, 4, 7)))).toDF
>
> var r = df.filter("dt >= '2015-05-06'")
> r.explain(true)
> r.show
> println("==")
>
> var r2 = df.filter("dt >= cast('2015-05-06' as DATE)")
> r2.explain(true)
> r2.show
> println("==")
>
> // "df2" doesn't do filter correct!!
> val rdd2 = sc.makeRDD(Seq(Row(new java.sql.Date(115, 4, 7))))
>
> val schema = StructType(Array(StructField("dt", DateType, false)))
>
> val df2 = sqlCtx.applySchema(rdd2, schema)
>
> r = df2.filter("dt >= '2015-05-06'")
> r.explain(true)
> r.show
> println("==")
>
> r2 = df2.filter("dt >= cast('2015-05-06' as DATE)")
> r2.explain(true)
> r2.show
>
>
>


DataFrames equivalent to SQL table namespacing and aliases

2015-05-08 Thread Nicholas Chammas
DataFrames, as far as I can tell, don’t have an equivalent to SQL’s table
aliases.

This is essential when joining dataframes that have identically named
columns.

>>> # PySpark 1.3.1
>>> df1 = sqlContext.jsonRDD(sc.parallelize(['{"a": 4, "other": "I know"}']))
>>> df2 = sqlContext.jsonRDD(sc.parallelize(['{"a": 4, "other": "I dunno"}']))
>>> df12 = df1.join(df2, df1['a'] == df2['a'])
>>> df12
DataFrame[a: bigint, other: string, a: bigint, other: string]
>>> df12.printSchema()
root
 |-- a: long (nullable = true)
 |-- other: string (nullable = true)
 |-- a: long (nullable = true)
 |-- other: string (nullable = true)

Now, trying any one of the following:

df12.select('a')
df12['a']
df12.a

yields this:

org.apache.spark.sql.AnalysisException: Reference 'a' is ambiguous,
could be: a#360L, a#358L.;

Same goes for accessing the other field.

This is good, but what are we supposed to do in this case?

SQL solves this by fully qualifying the column name with the table name,
and also offering table aliasing 
in the case where you are joining a table to itself.

If we translate this directly into DataFrames lingo, perhaps it would look
something like:

df12['df1.a']
df12['df2.other']

But I’m not sure how this fits into the larger API. This certainly isn’t
backwards compatible with how joins are done now.

So what’s the recommended course of action here?

Having to unique-ify all your column names before joining doesn’t sound
like a nice solution.

Nick
​


Re: [build system] QA infrastructure wiki updated w/latest package installs/versions

2015-05-08 Thread Patrick Wendell
Thanks Shane - really useful as I know that several companies are
interested in having in-house replicas of our QA infra.

On Fri, May 8, 2015 at 7:15 PM, shane knapp  wrote:
> so i spent a good part of the morning parsing out all of the packages and
> versions of things that we have installed on our jenkins workers:
>
> https://cwiki.apache.org/confluence/display/SPARK/Spark+QA+Infrastructure
>
> if you're looking to set up something to mimic our build system, this should
> be a great starting point!
>
> let me know if there's anything else i can do on this page.
>
> thanks!
>
> shane
>
> --
> You received this message because you are subscribed to the Google Groups
> "amp-infra" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to amp-infra+unsubscr...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



[build system] QA infrastructure wiki updated w/latest package installs/versions

2015-05-08 Thread shane knapp
so i spent a good part of the morning parsing out all of the packages and
versions of things that we have installed on our jenkins workers:

https://cwiki.apache.org/confluence/display/SPARK/Spark+QA+Infrastructure

if you're looking to set up something to mimic our build system, this
should be a great starting point!

let me know if there's anything else i can do on this page.

thanks!

shane


Re: Easy way to convert Row back to case class

2015-05-08 Thread Reynold Xin
In 1.4, you can do

row.getInt("colName")

In 1.5, some variant of this will come to allow you to turn a DataFrame
into a typed RDD, where the case class's field names match the column
names. https://github.com/apache/spark/pull/5713



On Fri, May 8, 2015 at 11:01 AM, Will Benton  wrote:

> This might not be the easiest way, but it's pretty easy:  you can use
> Row(field_1, ..., field_n) as a pattern in a case match.  So if you have a
> data frame with foo as an int column and bar as a String columns and you
> want to construct instances of a case class that wraps these up, you can do
> something like this:
>
> // assuming Record is declared as case class Record(foo: Int, bar:
> String)
> // and df is a data frame
>
> df.map {
>   case Row(foo: Int, bar: String) => Record(foo, bar)
> }
>
>
>
> best,
> wb
>
>
> - Original Message -
> > From: "Alexander Ulanov" 
> > To: dev@spark.apache.org
> > Sent: Friday, May 8, 2015 11:50:53 AM
> > Subject: Easy way to convert Row back to case class
> >
> > Hi,
> >
> > I created a dataset RDD[MyCaseClass], converted it to DataFrame and
> saved to
> > Parquet file, following
> >
> https://spark.apache.org/docs/latest/sql-programming-guide.html#interoperating-with-rdds
> >
> > When I load this dataset with sqlContext.parquetFile, I get DataFrame
> with
> > column names as in initial case class. I want to convert this DataFrame
> to
> > RDD to perform RDD operations. However, when I convert it I get RDD[Row]
> and
> > all information about row names gets lost. Could you suggest an easy way
> to
> > convert DataFrame to RDD[MyCaseClass]?
> >
> > Best regards, Alexander
> >
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: [SparkR] is toDF() necessary

2015-05-08 Thread Shivaram Venkataraman
Agree that toDF is not very useful. In fact it was removed from the
namespace in a recent change
https://github.com/apache/spark/commit/4e930420c19ae7773b138dfc7db8fc03b4660251

Thanks
Shivaram

On Fri, May 8, 2015 at 1:10 AM, Sun, Rui  wrote:

> toDF() is defined to convert an RDD to a DataFrame. But it is just a very
> thin wrapper around createDataFrame() that saves the caller from passing in a
> SQLContext.
>
> Since Scala/PySpark does not have toDF(), and we'd better keep the API as
> narrow and simple as possible, is toDF() really necessary? Could we
> eliminate it?
>
>
>


Re: Easy way to convert Row back to case class

2015-05-08 Thread Will Benton
This might not be the easiest way, but it's pretty easy:  you can use 
Row(field_1, ..., field_n) as a pattern in a case match.  So if you have a data 
frame with foo as an int column and bar as a String columns and you want to 
construct instances of a case class that wraps these up, you can do something 
like this:

// assuming Record is declared as case class Record(foo: Int, bar: String)
// and df is a data frame

df.map {
  case Row(foo: Int, bar: String) => Record(foo, bar)
}



best,
wb


- Original Message -
> From: "Alexander Ulanov" 
> To: dev@spark.apache.org
> Sent: Friday, May 8, 2015 11:50:53 AM
> Subject: Easy way to convert Row back to case class
> 
> Hi,
> 
> I created a dataset RDD[MyCaseClass], converted it to DataFrame and saved to
> Parquet file, following
> https://spark.apache.org/docs/latest/sql-programming-guide.html#interoperating-with-rdds
> 
> When I load this dataset with sqlContext.parquetFile, I get DataFrame with
> column names as in initial case class. I want to convert this DataFrame to
> RDD to perform RDD operations. However, when I convert it I get RDD[Row] and
> all information about row names gets lost. Could you suggest an easy way to
> convert DataFrame to RDD[MyCaseClass]?
> 
> Best regards, Alexander
> 

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Back-pressure for Spark Streaming

2015-05-08 Thread Akhil Das
We had a similar issue while working on one of our use cases, where we were
processing at a moderate throughput (around 500MB/s). When the processing
time exceeded the batch duration, it started to throw BlockNotFound
exceptions. I made a workaround for that issue, explained here:
http://apache-spark-developers-list.1001551.n3.nabble.com/SparkStreaming-Workaround-for-BlockNotFound-Exceptions-td12096.html

Basically, instead of generating blocks blindly, I made the receiver sleep
if there's an increase in the scheduling delay (specifically, if the scheduling
delay exceeds 3 times the batch duration). This prototype is working nicely and
the speed is encouraging, as it's processing at 500MB/s without any failures so
far.


Thanks
Best Regards

On Fri, May 8, 2015 at 8:11 PM, François Garillot <
francois.garil...@typesafe.com> wrote:

> Hi guys,
>
> We[1] are doing a bit of work on Spark Streaming, to help it face
> situations where the throughput of data on an InputStream is (momentarily)
> susceptible to overwhelm the Receiver(s) memory.
>
> The JIRA & design doc is here:
> https://issues.apache.org/jira/browse/SPARK-7398
>
> We'd sure appreciate your comments !
>
> --
> François Garillot
> [1]: Typesafe & some helpful collaborators on benchmarking 'at scale'
>
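
For illustration only, a plain-Python sketch of the throttling idea Akhil describes above; the callback names (fetch_batch, store_block, get_scheduling_delay_ms) are hypothetical stand-ins, not Spark receiver API.

import time

# Hypothetical callbacks -- none of these are Spark API names:
#   fetch_batch()              pulls the next chunk of records from the source
#   store_block(records)       hands the chunk to the block generator
#   get_scheduling_delay_ms()  reads the current streaming scheduling delay
def generate_blocks(fetch_batch, store_block, get_scheduling_delay_ms,
                    batch_duration_ms, backoff_factor=3,
                    keep_running=lambda: True):
    while keep_running():
        if get_scheduling_delay_ms() > backoff_factor * batch_duration_ms:
            # The job has fallen behind: back off instead of pushing more
            # blocks that may later be evicted and trigger BlockNotFound.
            time.sleep(batch_duration_ms / 1000.0)
            continue
        store_block(fetch_batch())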


Re: Spark Streaming with Tachyon : Some findings

2015-05-08 Thread Haoyuan Li
Thanks for the updates!

Best,

Haoyuan

On Fri, May 8, 2015 at 8:40 AM, Dibyendu Bhattacharya <
dibyendu.bhattach...@gmail.com> wrote:

> Just a followup on this Thread .
>
> I tried Hierarchical Storage on Tachyon (
> http://tachyon-project.org/Hierarchy-Storage-on-Tachyon.html ) , and that
> seems to have worked and I did not see any any Spark Job failed due to
> BlockNotFoundException.
> below is my  Hierarchical Storage settings..
>
>   -Dtachyon.worker.hierarchystore.level.max=2
>   -Dtachyon.worker.hierarchystore.level0.alias=MEM
>   -Dtachyon.worker.hierarchystore.level0.dirs.path=$TACHYON_RAM_FOLDER
>
>
> -Dtachyon.worker.hierarchystore.level0.dirs.quota=$TACHYON_WORKER_MEMORY_SIZE
>   -Dtachyon.worker.hierarchystore.level1.alias=HDD
>   -Dtachyon.worker.hierarchystore.level1.dirs.path=/mnt/tachyon
>   -Dtachyon.worker.hierarchystore.level1.dirs.quota=50GB
>   -Dtachyon.worker.allocate.strategy=MAX_FREE
>   -Dtachyon.worker.evict.strategy=LRU
>
> Regards,
> Dibyendu
>
> On Thu, May 7, 2015 at 1:46 PM, Dibyendu Bhattacharya <
> dibyendu.bhattach...@gmail.com> wrote:
>
> > Dear All ,
> >
> > I have been playing with Spark Streaming on Tachyon as the OFF_HEAP block
> > store  . Primary reason for evaluating Tachyon is to find if Tachyon can
> > solve the Spark BlockNotFoundException .
> >
> > In traditional MEMORY_ONLY StorageLevel, when blocks are evicted , jobs
> > failed due to block not found exception and storing blocks in
> > MEMORY_AND_DISK is not a good option either as it impact the throughput a
> > lot .
> >
> >
> > To test how Tachyon behave , I took the latest spark 1.4 from master ,
> and
> > used Tachyon 0.6.4 and configured Tachyon in Fault Tolerant Mode .
> Tachyon
> > is running in 3 Node AWS x-large cluster and Spark is running in 3 node
> AWS
> > x-large cluster.
> >
> > I have used the low level Receiver based Kafka consumer (
> > https://github.com/dibbhatt/kafka-spark-consumer)  which I have written
> > to pull from Kafka and write Blocks to Tachyon
> >
> >
> > I found there is similar improvement in throughput (as MEMORY_ONLY case )
> > but very good overall memory utilization (as it is off heap store) .
> >
> >
> > But I found one issue on which I need clarification.
> >
> >
> > In the Tachyon case also, I find BlockNotFoundException, but due to a
> > different reason. What I see is that TachyonBlockManager.scala puts the blocks
> > in WriteType.TRY_CACHE configuration. Because of this, blocks are evicted
> > from the Tachyon cache, and when Spark tries to find the block it throws
> > BlockNotFoundException.
> >
> > I see a pull request which discuss the same ..
> >
> > https://github.com/apache/spark/pull/158#discussion_r11195271
> >
> >
> > When I modified the WriteType to CACHE_THROUGH , BlockDropException is
> > gone , but it again impact the throughput ..
> >
> >
> > Just curious to know , if Tachyon has any settings which can solve the
> > Block Eviction from Cache to Disk, other than explicitly setting
> > CACHE_THROUGH  ?
> >
> > Regards,
> > Dibyendu
> >
> >
> >
>



-- 
Haoyuan Li
CEO, Tachyon Nexus 
AMPLab, EECS, UC Berkeley http://www.cs.berkeley.edu/~haoyuan/


Easy way to convert Row back to case class

2015-05-08 Thread Ulanov, Alexander
Hi,

I created a dataset RDD[MyCaseClass], converted it to DataFrame and saved to 
Parquet file, following 
https://spark.apache.org/docs/latest/sql-programming-guide.html#interoperating-with-rdds

When I load this dataset with sqlContext.parquetFile, I get DataFrame with 
column names as in initial case class. I want to convert this DataFrame to RDD 
to perform RDD operations. However, when I convert it I get RDD[Row] and all 
information about row names gets lost. Could you suggest an easy way to convert 
DataFrame to RDD[MyCaseClass]?

Best regards, Alexander


Re: Spark Streaming with Tachyon : Some findings

2015-05-08 Thread Dibyendu Bhattacharya
Just a followup on this Thread .

I tried Hierarchical Storage on Tachyon (
http://tachyon-project.org/Hierarchy-Storage-on-Tachyon.html ) , and that
seems to have worked and I did not see any any Spark Job failed due to
BlockNotFoundException.
below is my  Hierarchical Storage settings..

  -Dtachyon.worker.hierarchystore.level.max=2
  -Dtachyon.worker.hierarchystore.level0.alias=MEM
  -Dtachyon.worker.hierarchystore.level0.dirs.path=$TACHYON_RAM_FOLDER

-Dtachyon.worker.hierarchystore.level0.dirs.quota=$TACHYON_WORKER_MEMORY_SIZE
  -Dtachyon.worker.hierarchystore.level1.alias=HDD
  -Dtachyon.worker.hierarchystore.level1.dirs.path=/mnt/tachyon
  -Dtachyon.worker.hierarchystore.level1.dirs.quota=50GB
  -Dtachyon.worker.allocate.strategy=MAX_FREE
  -Dtachyon.worker.evict.strategy=LRU

Regards,
Dibyendu

On Thu, May 7, 2015 at 1:46 PM, Dibyendu Bhattacharya <
dibyendu.bhattach...@gmail.com> wrote:

> Dear All ,
>
> I have been playing with Spark Streaming on Tachyon as the OFF_HEAP block
> store  . Primary reason for evaluating Tachyon is to find if Tachyon can
> solve the Spark BlockNotFoundException .
>
> In traditional MEMORY_ONLY StorageLevel, when blocks are evicted , jobs
> failed due to block not found exception and storing blocks in
> MEMORY_AND_DISK is not a good option either as it impact the throughput a
> lot .
>
>
> To test how Tachyon behave , I took the latest spark 1.4 from master , and
> used Tachyon 0.6.4 and configured Tachyon in Fault Tolerant Mode . Tachyon
> is running in 3 Node AWS x-large cluster and Spark is running in 3 node AWS
> x-large cluster.
>
> I have used the low level Receiver based Kafka consumer (
> https://github.com/dibbhatt/kafka-spark-consumer)  which I have written
> to pull from Kafka and write Blocks to Tachyon
>
>
> I found there is similar improvement in throughput (as MEMORY_ONLY case )
> but very good overall memory utilization (as it is off heap store) .
>
>
> But I found one issue on which I need clarification.
>
>
> In the Tachyon case also, I find BlockNotFoundException, but due to a
> different reason. What I see is that TachyonBlockManager.scala puts the blocks in
> WriteType.TRY_CACHE configuration. Because of this, blocks are evicted
> from the Tachyon cache, and when Spark tries to find the block it throws
> BlockNotFoundException.
>
> I see a pull request which discuss the same ..
>
> https://github.com/apache/spark/pull/158#discussion_r11195271
>
>
> When I modified the WriteType to CACHE_THROUGH , BlockDropException is
> gone , but it again impact the throughput ..
>
>
> Just curious to know , if Tachyon has any settings which can solve the
> Block Eviction from Cache to Disk, other than explicitly setting
> CACHE_THROUGH  ?
>
> Regards,
> Dibyendu
>
>
>


Re: [build infra] quick downtime again tomorrow morning for DOCKER

2015-05-08 Thread shane knapp
...and this is done.  thanks for your patience!

On Fri, May 8, 2015 at 7:00 AM, shane knapp  wrote:

> this is happening now.
>
> On Thu, May 7, 2015 at 3:40 PM, shane knapp  wrote:
>
>> yes, docker.  that wonderful little wrapper for linux containers will be
>> installed and ready for play on all of the jenkins workers tomorrow morning.
>>
>> the downtime will be super quick:  i just need to kill the jenkins
>> slaves' ssh connections and relaunch to add the jenkins user to the docker
>> group.
>>
>> this will begin at around 7am PDT and shouldn't take long at all.
>>
>> shane
>>
>
>


Back-pressure for Spark Streaming

2015-05-08 Thread François Garillot
Hi guys,

We[1] are doing a bit of work on Spark Streaming, to help it face
situations where the throughput of data on an InputStream is (momentarily)
susceptible to overwhelm the Receiver(s) memory.

The JIRA & design doc is here:
https://issues.apache.org/jira/browse/SPARK-7398

We'd sure appreciate your comments !

-- 
François Garillot
[1]: Typesafe & some helpful collaborators on benchmarking 'at scale'


Re: [build infra] quick downtime again tomorrow morning for DOCKER

2015-05-08 Thread shane knapp
yes, absolutely.  right now i'm just getting the basics set up for a
student's build in the lab.  later on today i will be updating the spark
wiki qa infrastructure page w/more information.

On Fri, May 8, 2015 at 7:06 AM, Punyashloka Biswal 
wrote:

> Just curious: will docker allow new capabilities for the Spark build?
> (Where can I read more?)
>
> Punya
>
> On Fri, May 8, 2015 at 10:00 AM shane knapp  wrote:
>
>> this is happening now.
>>
>> On Thu, May 7, 2015 at 3:40 PM, shane knapp  wrote:
>>
>> > yes, docker.  that wonderful little wrapper for linux containers will be
>> > installed and ready for play on all of the jenkins workers tomorrow
>> morning.
>> >
>> > the downtime will be super quick:  i just need to kill the jenkins
>> slaves'
>> > ssh connections and relaunch to add the jenkins user to the docker
>> group.
>> >
>> > this will begin at around 7am PDT and shouldn't take long at all.
>> >
>> > shane
>> >
>>
>


Re: Collect inputs on SPARK-7035: compatibility issue with DataFrame.__getattr__

2015-05-08 Thread Punyashloka Biswal
Is there a foolproof way to access methods exclusively (instead of picking
between columns and methods at runtime)? Here are two ideas, neither of
which seems particularly Pythonic

   - pyspark.sql.methods(df).name()
   - df.__methods__.name()

Punya

On Fri, May 8, 2015 at 10:06 AM Nicholas Chammas 
wrote:

> And a link to SPARK-7035
>  (which
> Xiangrui mentioned in his initial email) for the lazy.
>
> On Fri, May 8, 2015 at 3:41 AM Xiangrui Meng  wrote:
>
> > On Fri, May 8, 2015 at 12:18 AM, Shivaram Venkataraman
> >  wrote:
> > > I dont know much about Python style, but I think the point Wes made
> about
> > > usability on the JIRA is pretty powerful. IMHO the number of methods
> on a
> > > Spark DataFrame might not be much more compared to Pandas. Given that
> it
> > > looks like users are okay with the possibility of collisions in Pandas
> I
> > > think sticking (1) is not a bad idea.
> > >
> >
> > This is true for interactive work. Spark's DataFrames can handle
> > really large datasets, which might be used in production workflows. So
> > I think it is reasonable for us to care more about compatibility
> > issues than Pandas.
> >
> > > Also is it possible to detect such collisions in Python ? A (4)th
> option
> > > might be to detect that `df` contains a column named `name` and print a
> > > warning in `df.name` which tells the user that the method is
> overriding
> > the
> > > column.
> >
> > Maybe we can inspect the frame `df.name` gets called and warn users in
> > `df.select(df.name)` but not in `name = df.name`. This could be tricky
> > to implement.
> >
> > -Xiangrui
> >
> > >
> > > Thanks
> > > Shivaram
> > >
> > >
> > > On Thu, May 7, 2015 at 11:59 PM, Xiangrui Meng 
> wrote:
> > >>
> > >> Hi all,
> > >>
> > >> In PySpark, a DataFrame column can be referenced using df["abcd"]
> > >> (__getitem__) and df.abcd (__getattr__). There is a discussion on
> > >> SPARK-7035 on compatibility issues with the __getattr__ approach, and
> > >> I want to collect more inputs on this.
> > >>
> > >> Basically, if in the future we introduce a new method to DataFrame, it
> > >> may break user code that uses the same attr to reference a column or
> > >> silently changes its behavior. For example, if we add name() to
> > >> DataFrame in the next release, all existing code using `df.name` to
> > >> reference a column called "name" will break. If we add `name()` as a
> > >> property instead of a method, all existing code using `df.name` may
> > >> still work but with a different meaning. `df.select(df.name)` no
> > >> longer selects the column called "name" but the column that has the
> > >> same name as `df.name`.
> > >>
> > >> There are several proposed solutions:
> > >>
> > >> 1. Keep both df.abcd and df["abcd"], and encourage users to use the
> > >> latter that is future proof. This is the current solution in master
> > >> (https://github.com/apache/spark/pull/5971). But I think users may be
> > >> still unaware of the compatibility issue and prefer `df.abcd` to
> > >> `df["abcd"]` because the former could be auto-completed.
> > >> 2. Drop df.abcd and support df["abcd"] only. From Wes' comment on the
> >> JIRA page: "I actually dragged my feet on the __getattr__ issue for
> >> several months back in the day, then finally added it (and tab
> >> completion in IPython with __dir__), and immediately noticed a huge
> > >> quality-of-life improvement when using pandas for actual (esp.
> > >> interactive) work."
> > >> 3. Replace df.abcd by df.abcd_ (with a suffix "_"). Both df.abcd_ and
> > >> df["abcd"] would be future proof, and df.abcd_ could be
> > >> auto-completed. The tradeoff is apparently the extra "_" appearing in
> > >> the code.
> > >>
> > >> My preference is 3 > 1 > 2. Your inputs would be greatly appreciated.
> > >> Thanks!
> > >>
> > >> Best,
> > >> Xiangrui
> > >>
> > >> -
> > >> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> > >> For additional commands, e-mail: dev-h...@spark.apache.org
> > >>
> > >
> >
> > -
> > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> > For additional commands, e-mail: dev-h...@spark.apache.org
> >
> >
>
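
For illustration only, a toy Python sketch (not Spark's implementation) of why the collision is silent: __getattr__ only runs when normal attribute lookup fails, so the moment a real method or property named "name" exists on the class, df.name stops resolving to the column while df["name"] keeps working.

import warnings

class Column(object):
    def __init__(self, name):
        self.name = name
    def __repr__(self):
        return "Column<%s>" % self.name

class MiniDF(object):
    """Toy stand-in for a DataFrame; not Spark's implementation."""
    def __init__(self, columns):
        self.columns = list(columns)

    def __getitem__(self, item):            # df["abcd"]: always a column
        if item in self.columns:
            return Column(item)
        raise KeyError(item)

    def __getattr__(self, item):            # df.abcd: only reached when no
        if item in self.columns:            # attribute/method of that name exists
            warnings.warn("column %r accessed as an attribute; df[%r] is future proof"
                          % (item, item))
            return Column(item)
        raise AttributeError(item)

df = MiniDF(["name", "age"])
print(df.name)       # Column<name> today; a future .name method would silently shadow it
print(df["name"])    # Column<name>, regardless of what methods get added later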


Re: [build infra] quick downtime again tomorrow morning for DOCKER

2015-05-08 Thread Punyashloka Biswal
Just curious: will docker allow new capabilities for the Spark build?
(Where can I read more?)

Punya

On Fri, May 8, 2015 at 10:00 AM shane knapp  wrote:

> this is happening now.
>
> On Thu, May 7, 2015 at 3:40 PM, shane knapp  wrote:
>
> > yes, docker.  that wonderful little wrapper for linux containers will be
> > installed and ready for play on all of the jenkins workers tomorrow
> morning.
> >
> > the downtime will be super quick:  i just need to kill the jenkins
> slaves'
> > ssh connections and relaunch to add the jenkins user to the docker group.
> >
> > this will begin at around 7am PDT and shouldn't take long at all.
> >
> > shane
> >
>


Re: Collect inputs on SPARK-7035: compatibility issue with DataFrame.__getattr__

2015-05-08 Thread Nicholas Chammas
And a link to SPARK-7035
 (which
Xiangrui mentioned in his initial email) for the lazy.

On Fri, May 8, 2015 at 3:41 AM Xiangrui Meng  wrote:

> On Fri, May 8, 2015 at 12:18 AM, Shivaram Venkataraman
>  wrote:
> > I dont know much about Python style, but I think the point Wes made about
> > usability on the JIRA is pretty powerful. IMHO the number of methods on a
> > Spark DataFrame might not be much more compared to Pandas. Given that it
> > looks like users are okay with the possibility of collisions in Pandas I
> > think sticking (1) is not a bad idea.
> >
>
> This is true for interactive work. Spark's DataFrames can handle
> really large datasets, which might be used in production workflows. So
> I think it is reasonable for us to care more about compatibility
> issues than Pandas.
>
> > Also is it possible to detect such collisions in Python ? A (4)th option
> > might be to detect that `df` contains a column named `name` and print a
> > warning in `df.name` which tells the user that the method is overriding
> the
> > column.
>
> Maybe we can inspect the frame `df.name` gets called and warn users in
> `df.select(df.name)` but not in `name = df.name`. This could be tricky
> to implement.
>
> -Xiangrui
>
> >
> > Thanks
> > Shivaram
> >
> >
> > On Thu, May 7, 2015 at 11:59 PM, Xiangrui Meng  wrote:
> >>
> >> Hi all,
> >>
> >> In PySpark, a DataFrame column can be referenced using df["abcd"]
> >> (__getitem__) and df.abcd (__getattr__). There is a discussion on
> >> SPARK-7035 on compatibility issues with the __getattr__ approach, and
> >> I want to collect more inputs on this.
> >>
> >> Basically, if in the future we introduce a new method to DataFrame, it
> >> may break user code that uses the same attr to reference a column or
> >> silently changes its behavior. For example, if we add name() to
> >> DataFrame in the next release, all existing code using `df.name` to
> >> reference a column called "name" will break. If we add `name()` as a
> >> property instead of a method, all existing code using `df.name` may
> >> still work but with a different meaning. `df.select(df.name)` no
> >> longer selects the column called "name" but the column that has the
> >> same name as `df.name`.
> >>
> >> There are several proposed solutions:
> >>
> >> 1. Keep both df.abcd and df["abcd"], and encourage users to use the
> >> latter that is future proof. This is the current solution in master
> >> (https://github.com/apache/spark/pull/5971). But I think users may be
> >> still unaware of the compatibility issue and prefer `df.abcd` to
> >> `df["abcd"]` because the former could be auto-completed.
> >> 2. Drop df.abcd and support df["abcd"] only. From Wes' comment on the
> >> JIRA page: "I actually dragged my feet on the __getattr__ issue for
> >> several months back in the day, then finally added it (and tab
> >> completion in IPython with __dir__), and immediately noticed a huge
> >> quality-of-life improvement when using pandas for actual (esp.
> >> interactive) work."
> >> 3. Replace df.abcd by df.abcd_ (with a suffix "_"). Both df.abcd_ and
> >> df["abcd"] would be future proof, and df.abcd_ could be
> >> auto-completed. The tradeoff is apparently the extra "_" appearing in
> >> the code.
> >>
> >> My preference is 3 > 1 > 2. Your inputs would be greatly appreciated.
> >> Thanks!
> >>
> >> Best,
> >> Xiangrui
> >>
> >> -
> >> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> >> For additional commands, e-mail: dev-h...@spark.apache.org
> >>
> >
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: [build infra] quick downtime again tomorrow morning for DOCKER

2015-05-08 Thread shane knapp
this is happening now.

On Thu, May 7, 2015 at 3:40 PM, shane knapp  wrote:

> yes, docker.  that wonderful little wrapper for linux containers will be
> installed and ready for play on all of the jenkins workers tomorrow morning.
>
> the downtime will be super quick:  i just need to kill the jenkins slaves'
> ssh connections and relaunch to add the jenkins user to the docker group.
>
> this will begin at around 7am PDT and shouldn't take long at all.
>
> shane
>


Re: NoClassDefFoundError with Spark 1.3

2015-05-08 Thread Ganelin, Ilya
All – the issue was much more subtle. I’d accidentally included a reference to 
a static object in a class that I wasn’t actually including in my build – hence 
the unrelated run-time error.

Thanks for the clarification on what the “provided” scope means.

Ilya Ganelin


From: Olivier Girardot <ssab...@gmail.com>
Date: Friday, May 8, 2015 at 6:40 AM
To: Akhil Das <ak...@sigmoidanalytics.com>, "Ganelin, Ilya" <ilya.gane...@capitalone.com>
Cc: dev <dev@spark.apache.org>
Subject: Re: NoClassDefFoundError with Spark 1.3

You're trying to launch, via sbt run, a dependency marked as "provided";
the goal of the "provided" scope is exactly to exclude this dependency from
runtime, considering it as "provided" by the environment.

Your configuration is correct for creating an assembly jar - but not for using
sbt run to test your project.

Regards,

Olivier.

On Fri, May 8, 2015 at 10:41 AM, Akhil Das <ak...@sigmoidanalytics.com> wrote:
Looks like the jar you provided has some missing classes. Try this:

scalaVersion := "2.10.4"

libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % "1.3.0",
"org.apache.spark" %% "spark-sql" % "1.3.0" % "provided",
"org.apache.spark" %% "spark-mllib" % "1.3.0" % "provided",
"log4j" % "log4j" % "1.2.15" excludeAll(
  ExclusionRule(organization = "com.sun.jdmk"),
  ExclusionRule(organization = "com.sun.jmx"),
  ExclusionRule(organization = "javax.jms")
  )
)


Thanks
Best Regards

On Thu, May 7, 2015 at 11:28 PM, Ganelin, Ilya 
mailto:ilya.gane...@capitalone.com>>
wrote:

> Hi all – I’m attempting to build a project with SBT and run it on Spark
> 1.3 (this previously worked before we upgraded to CDH 5.4 with Spark 1.3).
>
> I have the following in my build.sbt:
>
>
> scalaVersion := "2.10.4"
>
> libraryDependencies ++= Seq(
> "org.apache.spark" %% "spark-core" % "1.3.0" % "provided",
> "org.apache.spark" %% "spark-sql" % "1.3.0" % "provided",
> "org.apache.spark" %% "spark-mllib" % "1.3.0" % "provided",
> "log4j" % "log4j" % "1.2.15" excludeAll(
>   ExclusionRule(organization = "com.sun.jdmk"),
>   ExclusionRule(organization = "com.sun.jmx"),
>   ExclusionRule(organization = "javax.jms")
>   )
> )
>
> When I attempt to run this program with sbt run, however, I get the
> following error:
> java.lang.NoClassDefFoundError: org.apache.spark.Partitioner
>
> I don’t explicitly use the Partitioner class anywhere, and this seems to
> indicate some missing Spark libraries on the install. Do I need to confirm
> anything other than the presence of the Spark assembly? I’m on CDH 5.4 and
> I’m able to run the spark-shell without any trouble.
>
> Any help would be much appreciated.
>
> Thank you,
> Ilya Ganelin
>
>


Re: NoClassDefFoundError with Spark 1.3

2015-05-08 Thread Olivier Girardot
You're trying to launch, with sbt run, a dependency marked "provided"; the goal
of the "provided" scope is exactly to exclude that dependency at runtime,
treating it as "provided" by the environment.

Your configuration is correct for building an assembly jar - but not for using
sbt run to test your project.
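
For local testing there is a commonly cited sbt 0.13 workaround (shown here
only as a sketch, not verified against this particular project): keep the Spark
artifacts "provided", but point the run task at the compile classpath, which
still contains them:

run in Compile := Defaults.runTask(
  fullClasspath in Compile,
  mainClass in (Compile, run),
  runner in (Compile, run)
).evaluated

With something like that in build.sbt, sbt run should see the provided jars
while sbt assembly keeps excluding them.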

Regards,

Olivier.

On Fri, May 8, 2015 at 10:41, Akhil Das  wrote:

> Looks like the jar you provided has some missing classes. Try this:
>
> scalaVersion := "2.10.4"
>
> libraryDependencies ++= Seq(
> "org.apache.spark" %% "spark-core" % "1.3.0",
> "org.apache.spark" %% "spark-sql" % "1.3.0" % "provided",
> "org.apache.spark" %% "spark-mllib" % "1.3.0" % "provided",
> "log4j" % "log4j" % "1.2.15" excludeAll(
>   ExclusionRule(organization = "com.sun.jdmk"),
>   ExclusionRule(organization = "com.sun.jmx"),
>   ExclusionRule(organization = "javax.jms")
>   )
> )
>
>
> Thanks
> Best Regards
>
> On Thu, May 7, 2015 at 11:28 PM, Ganelin, Ilya <
> ilya.gane...@capitalone.com>
> wrote:
>
> > Hi all – I’m attempting to build a project with SBT and run it on Spark
> > 1.3 (this previously worked before we upgraded to CDH 5.4 with Spark
> 1.3).
> >
> > I have the following in my build.sbt:
> >
> >
> > scalaVersion := "2.10.4"
> >
> > libraryDependencies ++= Seq(
> > "org.apache.spark" %% "spark-core" % "1.3.0" % "provided",
> > "org.apache.spark" %% "spark-sql" % "1.3.0" % "provided",
> > "org.apache.spark" %% "spark-mllib" % "1.3.0" % "provided",
> > "log4j" % "log4j" % "1.2.15" excludeAll(
> >   ExclusionRule(organization = "com.sun.jdmk"),
> >   ExclusionRule(organization = "com.sun.jmx"),
> >   ExclusionRule(organization = "javax.jms")
> >   )
> > )
> >
> > When I attempt to run this program with sbt run, however, I get the
> > following error:
> > java.lang.NoClassDefFoundError: org.apache.spark.Partitioner
> >
> > I don’t explicitly use the Partitioner class anywhere, and this seems to
> > indicate some missing Spark libraries on the install. Do I need to
> confirm
> > anything other than the presence of the Spark assembly? I’m on CDH 5.4
> and
> > I’m able to run the spark-shell without any trouble.
> >
> > Any help would be much appreciated.
> >
> > Thank you,
> > Ilya Ganelin
> >
> >
>


Re: Spark 1.3.1 / Hadoop 2.6 package has broken S3 access

2015-05-08 Thread Steve Loughran

> 2. I can add a hadoop-2.6 profile that sets things up for s3a, azure and 
> openstack swift.


Added: 
https://issues.apache.org/jira/browse/SPARK-7481 


One thing to consider here is testing; the s3x clients themselves have some 
tests that individuals/orgs can run against different S3 installations & 
private versions; people publish their results to see that there's been good 
coverage of the different S3 installations with their different consistency 
models & auth mechanisms. 

There are also some scale tests that take time and don't get run so often, but
which throw up surprises (RAX UK throttling DELETE, intermittent ConnectionReset
exceptions reading multi-GB s3 files).

Amazon have some public datasets that could be used to verify that spark can 
read files off S3, and maybe even find some of the scale problems.

In particular, http://datasets.elasticmapreduce.s3.amazonaws.com/ publishes 
ngrams as a set of .gz files free for all to read

Would there be a place in the code tree for some tests to run against things 
like this? They're cloud integration tests rather than unit tests and nobody 
would want them to be on by default, but it could be good for regression 
testing hadoop s3 support & spark integration
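
A minimal sketch of what such a check could look like from the Spark side
(assumptions: hadoop-aws and a matching aws-java-sdk jar are already on the
classpath, the bucket/prefix below are placeholders rather than a real layout,
and credentials are pulled from environment variables purely for illustration):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("S3AIntegrationCheck")
  .setMaster("local[2]")
  // spark.hadoop.* properties get copied into the Hadoop Configuration
  .set("spark.hadoop.fs.s3a.access.key", sys.env.getOrElse("AWS_ACCESS_KEY_ID", ""))
  .set("spark.hadoop.fs.s3a.secret.key", sys.env.getOrElse("AWS_SECRET_ACCESS_KEY", ""))
val sc = new SparkContext(conf)

// pull a few records from a gzipped dataset over s3a to prove reads work
sc.textFile("s3a://some-public-bucket/some/prefix/*.gz").take(5).foreach(println)
sc.stop()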




Re: Spark 1.3.1 / Hadoop 2.6 package has broken S3 access

2015-05-08 Thread Steve Loughran

> On 7 May 2015, at 18:02, Matei Zaharia  wrote:
> 
> We should make sure to update our docs to mention s3a as well, since many 
> people won't look at Hadoop's docs for this.
> 
> Matei
> 

1. to use s3a you'll also need an amazon toolkit JAR on the cp
2. I can add a hadoop-2.6 profile that sets things up for s3a, azure and 
openstack swift.
3. TREAT S3A on HADOOP 2.6 AS BETA-RELEASE

For anyone thinking putting that in all-caps seems excessive, consult

https://issues.apache.org/jira/browse/HADOOP-11571

in particular, anything that queries for the block size of a file before
dividing work up is dead in the water due to
HADOOP-11584: s3a file block size set to 0 in getFileStatus. There are also
thread-pooling problems if too many writes are going on in the same JVM; this
may hit output operations.

Hadoop 2.7 fixes all the phase I issues, leaving those in HADOOP-11694 to look 
at
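
For anyone wanting to check whether their Hadoop build is hit by the block-size
problem, a quick probe from the spark-shell could look like the sketch below
(bucket and object names are placeholders, and fs.s3a credentials are assumed
to be configured already):

import java.net.URI
import org.apache.hadoop.fs.{FileSystem, Path}

// sc is the spark-shell's SparkContext
val fs = FileSystem.get(new URI("s3a://some-bucket/"), sc.hadoopConfiguration)
val status = fs.getFileStatus(new Path("s3a://some-bucket/some/object"))
// per HADOOP-11584, Hadoop 2.6 reports 0 here, which breaks split calculations
println(s"reported block size: ${status.getBlockSize}")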


>> On May 7, 2015, at 12:57 PM, Nicholas Chammas  
>> wrote:
>> 
>> Ah, thanks for the pointers.
>> 
>> So as far as Spark is concerned, is this a breaking change? Is it possible
>> that people who have working code that accesses S3 will upgrade to use
>> Spark-against-Hadoop-2.6 and find their code is not working all of a sudden?
>> 
>> Nick
>> 
>> On Thu, May 7, 2015 at 12:48 PM Peter Rudenko > >
>> wrote:
>> 
>>> Yep it's a Hadoop issue:
>>> https://issues.apache.org/jira/browse/HADOOP-11863
>>> 





Re: NoClassDefFoundError with Spark 1.3

2015-05-08 Thread Akhil Das
Looks like the jar you provided has some missing classes. Try this:

scalaVersion := "2.10.4"

libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % "1.3.0",
"org.apache.spark" %% "spark-sql" % "1.3.0" % "provided",
"org.apache.spark" %% "spark-mllib" % "1.3.0" % "provided",
"log4j" % "log4j" % "1.2.15" excludeAll(
  ExclusionRule(organization = "com.sun.jdmk"),
  ExclusionRule(organization = "com.sun.jmx"),
  ExclusionRule(organization = "javax.jms")
  )
)


Thanks
Best Regards

On Thu, May 7, 2015 at 11:28 PM, Ganelin, Ilya 
wrote:

> Hi all – I’m attempting to build a project with SBT and run it on Spark
> 1.3 (this previously worked before we upgraded to CDH 5.4 with Spark 1.3).
>
> I have the following in my build.sbt:
>
>
> scalaVersion := "2.10.4"
>
> libraryDependencies ++= Seq(
> "org.apache.spark" %% "spark-core" % "1.3.0" % "provided",
> "org.apache.spark" %% "spark-sql" % "1.3.0" % "provided",
> "org.apache.spark" %% "spark-mllib" % "1.3.0" % "provided",
> "log4j" % "log4j" % "1.2.15" excludeAll(
>   ExclusionRule(organization = "com.sun.jdmk"),
>   ExclusionRule(organization = "com.sun.jmx"),
>   ExclusionRule(organization = "javax.jms")
>   )
> )
>
> When I attempt to run this program with sbt run, however, I get the
> following error:
> java.lang.NoClassDefFoundError: org.apache.spark.Partitioner
>
> I don’t explicitly use the Partitioner class anywhere, and this seems to
> indicate some missing Spark libraries on the install. Do I need to confirm
> anything other than the presence of the Spark assembly? I’m on CDH 5.4 and
> I’m able to run the spark-shell without any trouble.
>
> Any help would be much appreciated.
>
> Thank you,
> Ilya Ganelin
>
>


[SparkSQL] cannot filter by a DateType column

2015-05-08 Thread Haopu Wang
I want to filter a DataFrame based on a Date column.

If the DataFrame object is constructed from a Scala case class, filtering works
(comparing either as String or as Date). But if the DataFrame is generated by
specifying a schema for an RDD, it doesn't work. Below are the exception and
the test code.

Do you have any idea about the error? Thank you very much!

exception=

java.lang.ClassCastException: java.sql.Date cannot be cast to java.lang.Integer
at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:106)
at org.apache.spark.sql.catalyst.expressions.Cast$$anonfun$castToString$2$$anonfun$apply$6.apply(Cast.scala:116)
at org.apache.spark.sql.catalyst.expressions.Cast.org$apache$spark$sql$catalyst$expressions$Cast$$buildCast(Cast.scala:111)
at org.apache.spark.sql.catalyst.expressions.Cast$$anonfun$castToString$2.apply(Cast.scala:116)
at org.apache.spark.sql.catalyst.expressions.Cast.eval(Cast.scala:426)
at org.apache.spark.sql.catalyst.expressions.GreaterThanOrEqual.eval(predicates.scala:305)
at org.apache.spark.sql.catalyst.expressions.InterpretedPredicate$$anonfun$apply$1.apply(predicates.scala:30)
at org.apache.spark.sql.catalyst.expressions.InterpretedPredicate$$anonfun$apply$1.apply(predicates.scala:30)
at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:390)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)

code=

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.Row
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.types.{DateType, StructField, StructType}

val conf = new SparkConf().setAppName("DFTest").setMaster("local[*]")
val sc = new SparkContext(conf)
val sqlCtx = new HiveContext(sc)
import sqlCtx.implicits._

case class Test(dt: java.sql.Date)

val df = sc.makeRDD(Seq(Test(new java.sql.Date(115, 4, 7)))).toDF()

var r = df.filter("dt >= '2015-05-06'")
r.explain(true)
r.show
println("==")

var r2 = df.filter("dt >= cast('2015-05-06' as DATE)")
r2.explain(true)
r2.show
println("==")

// "df2" doesn't filter correctly!!
val rdd2 = sc.makeRDD(Seq(Row(new java.sql.Date(115, 4, 7))))

val schema = StructType(Array(StructField("dt", DateType, false)))

val df2 = sqlCtx.applySchema(rdd2, schema)

r = df2.filter("dt >= '2015-05-06'")
r.explain(true)
r.show
println("==")

r2 = df2.filter("dt >= cast('2015-05-06' as DATE)")
r2.explain(true)
r2.show
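
Since the case-class route above filters correctly, one possible workaround
(only a sketch, not verified against this build) is to map the Row RDD through
the same case class instead of going through applySchema:

// reuse the case class from the working path above
val df3 = rdd2.map(row => Test(row.getAs[java.sql.Date](0))).toDF()
df3.filter("dt >= cast('2015-05-06' as DATE)").show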

 



[SparkR] is toDF() necessary

2015-05-08 Thread Sun, Rui
toDF() is defined to convert an RDD to a DataFrame. But it is just a very thin
wrapper around createDataFrame() that only saves the caller from passing in the
SQLContext.

Since Scala/pySpark does not have toDF(), and we'd better keep the API as
narrow and simple as possible: is toDF() really necessary? Could we eliminate it?




Re: Collect inputs on SPARK-7035: compatibility issue with DataFrame.__getattr__

2015-05-08 Thread Xiangrui Meng
On Fri, May 8, 2015 at 12:18 AM, Shivaram Venkataraman
 wrote:
> I dont know much about Python style, but I think the point Wes made about
> usability on the JIRA is pretty powerful. IMHO the number of methods on a
> Spark DataFrame might not be much more compared to Pandas. Given that it
> looks like users are okay with the possibility of collisions in Pandas I
> think sticking (1) is not a bad idea.
>

This is true for interactive work. Spark's DataFrames can handle
really large datasets, which might be used in production workflows. So
I think it is reasonable for us to care more about compatibility
issues than Pandas.

> Also is it possible to detect such collisions in Python ? A (4)th option
> might be to detect that `df` contains a column named `name` and print a
> warning in `df.name` which tells the user that the method is overriding the
> column.

Maybe we can inspect the frame from which `df.name` gets called and warn users
in `df.select(df.name)` but not in `name = df.name`. This could be tricky
to implement.

-Xiangrui

>
> Thanks
> Shivaram
>
>
> On Thu, May 7, 2015 at 11:59 PM, Xiangrui Meng  wrote:
>>
>> Hi all,
>>
>> In PySpark, a DataFrame column can be referenced using df["abcd"]
>> (__getitem__) and df.abcd (__getattr__). There is a discussion on
>> SPARK-7035 on compatibility issues with the __getattr__ approach, and
>> I want to collect more inputs on this.
>>
>> Basically, if in the future we introduce a new method to DataFrame, it
>> may break user code that uses the same attr to reference a column or
>> silently changes its behavior. For example, if we add name() to
>> DataFrame in the next release, all existing code using `df.name` to
>> reference a column called "name" will break. If we add `name()` as a
>> property instead of a method, all existing code using `df.name` may
>> still work but with a different meaning. `df.select(df.name)` no
>> longer selects the column called "name" but the column that has the
>> same name as `df.name`.
>>
>> There are several proposed solutions:
>>
>> 1. Keep both df.abcd and df["abcd"], and encourage users to use the
>> latter that is future proof. This is the current solution in master
>> (https://github.com/apache/spark/pull/5971). But I think users may be
>> still unaware of the compatibility issue and prefer `df.abcd` to
>> `df["abcd"]` because the former could be auto-completed.
>> 2. Drop df.abcd and support df["abcd"] only. From Wes' comment on the
>> JIRA page: "I actually dragged my feet on the _getattr_ issue for
>> several months back in the day, then finally added it (and tab
>> completion in IPython with _dir_), and immediately noticed a huge
>> quality-of-life improvement when using pandas for actual (esp.
>> interactive) work."
>> 3. Replace df.abcd by df.abcd_ (with a suffix "_"). Both df.abcd_ and
>> df["abcd"] would be future proof, and df.abcd_ could be
>> auto-completed. The tradeoff is apparently the extra "_" appearing in
>> the code.
>>
>> My preference is 3 > 1 > 2. Your inputs would be greatly appreciated.
>> Thanks!
>>
>> Best,
>> Xiangrui
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>




Re: Collect inputs on SPARK-7035: compatibility issue with DataFrame.__getattr__

2015-05-08 Thread Shivaram Venkataraman
I dont know much about Python style, but I think the point Wes made about
usability on the JIRA is pretty powerful. IMHO the number of methods on a
Spark DataFrame might not be much more compared to Pandas. Given that it
looks like users are okay with the possibility of collisions in Pandas I
think sticking (1) is not a bad idea.

Also is it possible to detect such collisions in Python ? A (4)th option
might be to detect that `df` contains a column named `name` and print a
warning in `df.name` which tells the user that the method is overriding the
column.

Thanks
Shivaram


On Thu, May 7, 2015 at 11:59 PM, Xiangrui Meng  wrote:

> Hi all,
>
> In PySpark, a DataFrame column can be referenced using df["abcd"]
> (__getitem__) and df.abcd (__getattr__). There is a discussion on
> SPARK-7035 on compatibility issues with the __getattr__ approach, and
> I want to collect more inputs on this.
>
> Basically, if in the future we introduce a new method to DataFrame, it
> may break user code that uses the same attr to reference a column or
> silently changes its behavior. For example, if we add name() to
> DataFrame in the next release, all existing code using `df.name` to
> reference a column called "name" will break. If we add `name()` as a
> property instead of a method, all existing code using `df.name` may
> still work but with a different meaning. `df.select(df.name)` no
> longer selects the column called "name" but the column that has the
> same name as `df.name`.
>
> There are several proposed solutions:
>
> 1. Keep both df.abcd and df["abcd"], and encourage users to use the
> latter that is future proof. This is the current solution in master
> (https://github.com/apache/spark/pull/5971). But I think users may be
> still unaware of the compatibility issue and prefer `df.abcd` to
> `df["abcd"]` because the former could be auto-completed.
> 2. Drop df.abcd and support df["abcd"] only. From Wes' comment on the
> JIRA page: "I actually dragged my feet on the _getattr_ issue for
> several months back in the day, then finally added it (and tab
> completion in IPython with _dir_), and immediately noticed a huge
> quality-of-life improvement when using pandas for actual (esp.
> interactive) work."
> 3. Replace df.abcd by df.abcd_ (with a suffix "_"). Both df.abcd_ and
> df["abcd"] would be future proof, and df.abcd_ could be
> auto-completed. The tradeoff is apparently the extra "_" appearing in
> the code.
>
> My preference is 3 > 1 > 2. Your inputs would be greatly appreciated.
> Thanks!
>
> Best,
> Xiangrui
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Collect inputs on SPARK-7035: compatibility issue with DataFrame.__getattr__

2015-05-08 Thread Xiangrui Meng
Hi all,

In PySpark, a DataFrame column can be referenced using df["abcd"]
(__getitem__) and df.abcd (__getattr__). There is a discussion on
SPARK-7035 on compatibility issues with the __getattr__ approach, and
I want to collect more inputs on this.

Basically, if in the future we introduce a new method to DataFrame, it
may break user code that uses the same attr to reference a column or
silently changes its behavior. For example, if we add name() to
DataFrame in the next release, all existing code using `df.name` to
reference a column called "name" will break. If we add `name()` as a
property instead of a method, all existing code using `df.name` may
still work but with a different meaning. `df.select(df.name)` no
longer selects the column called "name" but the column that has the
same name as `df.name`.

There are several proposed solutions:

1. Keep both df.abcd and df["abcd"], and encourage users to use the
latter that is future proof. This is the current solution in master
(https://github.com/apache/spark/pull/5971). But I think users may be
still unaware of the compatibility issue and prefer `df.abcd` to
`df["abcd"]` because the former could be auto-completed.
2. Drop df.abcd and support df["abcd"] only. From Wes' comment on the
JIRA page: "I actually dragged my feet on the _getattr_ issue for
several months back in the day, then finally added it (and tab
completion in IPython with _dir_), and immediately noticed a huge
quality-of-life improvement when using pandas for actual (esp.
interactive) work."
3. Replace df.abcd by df.abcd_ (with a suffix "_"). Both df.abcd_ and
df["abcd"] would be future proof, and df.abcd_ could be
auto-completed. The tradeoff is apparently the extra "_" appearing in
the code.

My preference is 3 > 1 > 2. Your inputs would be greatly appreciated. Thanks!

Best,
Xiangrui
