Re: Collect inputs on SPARK-7035: compatibility issue with DataFrame.__getattr__

2015-05-08 Thread Shivaram Venkataraman
I don't know much about Python style, but I think the point Wes made about usability on the JIRA is pretty powerful. IMHO the number of methods on a Spark DataFrame might not be much larger than on a Pandas DataFrame. Given that it looks like users are okay with the possibility of collisions in Pandas, I

Re: Collect inputs on SPARK-7035: compatibility issue with DataFrame.__getattr__

2015-05-08 Thread Xiangrui Meng
On Fri, May 8, 2015 at 12:18 AM, Shivaram Venkataraman shiva...@eecs.berkeley.edu wrote: I don't know much about Python style, but I think the point Wes made about usability on the JIRA is pretty powerful. IMHO the number of methods on a Spark DataFrame might not be much larger than on a Pandas DataFrame.

[SparkSQL] cannot filter by a DateType column

2015-05-08 Thread Haopu Wang
I want to filter a DataFrame based on a Date column. If the DataFrame object is constructed from a scala case class, it's working (either compare as String or Date). But if the DataFrame is generated by specifying a Schema to an RDD, it doesn't work. Below is the exception and test code.

Re: Spark 1.3.1 / Hadoop 2.6 package has broken S3 access

2015-05-08 Thread Steve Loughran
2. I can add a hadoop-2.6 profile that sets things up for s3a, azure and openstack swift. Added: https://issues.apache.org/jira/browse/SPARK-7481 One thing to consider here is testing; the s3x clients themselves have some tests that individuals/orgs can run against different S3

[SparkR] is toDF() necessary

2015-05-08 Thread Sun, Rui
toDF() is defined to convert an RDD to a DataFrame, but it is just a very thin wrapper around createDataFrame() that saves the caller from passing in a SQLContext. Since Scala/PySpark does not have toDF(), and we'd better keep the API as narrow and simple as possible: is toDF() really necessary? Could we

Re: NoClassDefFoundError with Spark 1.3

2015-05-08 Thread Akhil Das
Looks like the jar you provided has some missing classes. Try this: scalaVersion := "2.10.4" libraryDependencies ++= Seq( "org.apache.spark" %% "spark-core" % "1.3.0", "org.apache.spark" %% "spark-sql" % "1.3.0" % "provided", "org.apache.spark" %% "spark-mllib" % "1.3.0" % "provided", "log4j" % "log4j" %

Re: Spark 1.3.1 / Hadoop 2.6 package has broken S3 access

2015-05-08 Thread Steve Loughran
On 7 May 2015, at 18:02, Matei Zaharia matei.zaha...@gmail.com wrote: We should make sure to update our docs to mention s3a as well, since many people won't look at Hadoop's docs for this. Matei 1. to use s3a you'll also need an amazon toolkit JAR on the cp 2. I can add a hadoop-2.6

Collect inputs on SPARK-7035: compatibility issue with DataFrame.__getattr__

2015-05-08 Thread Xiangrui Meng
Hi all, In PySpark, a DataFrame column can be referenced using df["abcd"] (__getitem__) and df.abcd (__getattr__). There is a discussion on SPARK-7035 about compatibility issues with the __getattr__ approach, and I want to collect more input on this. Basically, if in the future we introduce a new
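The ambiguity under discussion can be sketched with a toy class in plain Python, without Spark; ToyFrame and its column dict are illustrative stand-ins, not pyspark internals:

```python
# Toy model of the SPARK-7035 ambiguity: __getattr__ is only consulted when
# normal attribute lookup fails, so any method added to the class silently
# shadows a column of the same name.
class ToyFrame:
    def __init__(self, columns):
        self._columns = dict(columns)

    def __getitem__(self, name):
        # df["abcd"]: always an unambiguous column lookup
        return self._columns[name]

    def __getattr__(self, name):
        # df.abcd: reached only when no real attribute or method matches
        if name in self._columns:
            return self._columns[name]
        raise AttributeError(name)

    def count(self):
        # a method that happens to collide with a possible column name
        return len(self._columns)

df = ToyFrame({"abcd": [1, 2], "count": [3, 4]})
print(df.abcd)      # attribute access falls through to the column: [1, 2]
print(df["count"])  # bracket access still reaches the column: [3, 4]
print(df.count())   # but df.count is the method, not the column: 2
```

If a later release adds a method whose name matches an existing column, code that relied on df.name breaks, which is exactly the compatibility concern on the JIRA; df["name"] stays safe.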

Re: [build infra] quick downtime again tomorrow morning for DOCKER

2015-05-08 Thread shane knapp
yes, absolutely. right now i'm just getting the basics set up for a student's build in the lab. later on today i will be updating the spark wiki qa infrastructure page w/more information. On Fri, May 8, 2015 at 7:06 AM, Punyashloka Biswal punya.bis...@gmail.com wrote: Just curious: will

Re: Collect inputs on SPARK-7035: compatibility issue with DataFrame.__getattr__

2015-05-08 Thread Punyashloka Biswal
Is there a foolproof way to access methods exclusively (instead of picking between columns and methods at runtime)? Here are two ideas, neither of which seems particularly Pythonic: pyspark.sql.methods(df).name(), or df.__methods__.name(). Punya On Fri, May 8, 2015 at 10:06 AM Nicholas
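The first of those ideas can be sketched in plain Python; Frame, methods, and _Proxy are hypothetical names used for illustration, not real pyspark APIs:

```python
# Sketch of a methods(df) accessor that resolves names against the class
# only, so a column can never be returned where a method was intended.
class Frame:
    def __init__(self, columns):
        self._columns = dict(columns)

    def __getattr__(self, name):
        # column access, reached only when no real attribute matches
        if name in self._columns:
            return self._columns[name]
        raise AttributeError(name)

    def count(self):
        return len(self._columns)

def methods(df):
    class _Proxy:
        def __getattr__(self, name):
            # look the name up on the class, never in the column dict,
            # then bind the resulting function to the wrapped instance
            return getattr(type(df), name).__get__(df)
    return _Proxy()

df = Frame({"name": ["alice"], "count": [1]})
print(methods(df).count())  # always the method, even with a "count" column: 2
```

The proxy is deliberately dumb: it delegates to class-level lookup, so it always agrees with what the class defines today and ignores whatever columns the particular frame happens to carry.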

Re: [build infra] quick downtime again tomorrow morning for DOCKER

2015-05-08 Thread shane knapp
this is happening now. On Thu, May 7, 2015 at 3:40 PM, shane knapp skn...@berkeley.edu wrote: yes, docker. that wonderful little wrapper for linux containers will be installed and ready for play on all of the jenkins workers tomorrow morning. the downtime will be super quick: i just need

Re: NoClassDefFoundError with Spark 1.3

2015-05-08 Thread Olivier Girardot
You're trying to launch a "provided" dependency using sbt run, but the goal of the provided scope is exactly to exclude that dependency from the runtime classpath, treating it as provided by the environment. Your configuration is correct for creating an assembly jar, but not for using sbt run to test your project.

Re: [build infra] quick downtime again tomorrow morning for DOCKER

2015-05-08 Thread Punyashloka Biswal
Just curious: will docker allow new capabilities for the Spark build? (Where can I read more?) Punya On Fri, May 8, 2015 at 10:00 AM shane knapp skn...@berkeley.edu wrote: this is happening now. On Thu, May 7, 2015 at 3:40 PM, shane knapp skn...@berkeley.edu wrote: yes, docker. that

Re: Collect inputs on SPARK-7035: compatibility issue with DataFrame.__getattr__

2015-05-08 Thread Nicholas Chammas
And a link to SPARK-7035 https://issues.apache.org/jira/browse/SPARK-7035 (which Xiangrui mentioned in his initial email) for the lazy. On Fri, May 8, 2015 at 3:41 AM Xiangrui Meng men...@gmail.com wrote: On Fri, May 8, 2015 at 12:18 AM, Shivaram Venkataraman shiva...@eecs.berkeley.edu wrote:

Re: [build infra] quick downtime again tomorrow morning for DOCKER

2015-05-08 Thread shane knapp
...and this is done. thanks for your patience! On Fri, May 8, 2015 at 7:00 AM, shane knapp skn...@berkeley.edu wrote: this is happening now. On Thu, May 7, 2015 at 3:40 PM, shane knapp skn...@berkeley.edu wrote: yes, docker. that wonderful little wrapper for linux containers will be

Re: NoClassDefFoundError with Spark 1.3

2015-05-08 Thread Ganelin, Ilya
All – the issue was much more subtle. I’d accidentally included a reference to a static object in a class that I wasn’t actually including in my build – hence the unrelated run-time error. Thanks for the clarification on what the “provided” scope means. Ilya Ganelin

Re: Spark Streaming with Tachyon : Some findings

2015-05-08 Thread Dibyendu Bhattacharya
Just a follow-up on this thread. I tried Hierarchical Storage on Tachyon ( http://tachyon-project.org/Hierarchy-Storage-on-Tachyon.html ), and that seems to have worked: I did not see any Spark job fail due to BlockNotFoundException. Below are my hierarchical storage settings...

Re: Spark Streaming with Tachyon : Some findings

2015-05-08 Thread Haoyuan Li
Thanks for the updates! Best, Haoyuan On Fri, May 8, 2015 at 8:40 AM, Dibyendu Bhattacharya dibyendu.bhattach...@gmail.com wrote: Just a followup on this Thread . I tried Hierarchical Storage on Tachyon ( http://tachyon-project.org/Hierarchy-Storage-on-Tachyon.html ) , and that seems to

Re: Back-pressure for Spark Streaming

2015-05-08 Thread Akhil Das
We had a similar issue while working on one of our use cases, where we were processing at a moderate throughput (around 500 MB/s). When the processing time exceeded the batch duration, it started to throw BlockNotFoundException; I made a workaround for that issue and it is explained over here

[build system] QA infrastructure wiki updated w/latest package installs/versions

2015-05-08 Thread shane knapp
so i spent a good part of the morning parsing out all of the packages and versions of things that we have installed on our jenkins workers: https://cwiki.apache.org/confluence/display/SPARK/Spark+QA+Infrastructure if you're looking to set up something to mimic our build system, this should be a

Re: [SparkR] is toDF() necessary

2015-05-08 Thread Shivaram Venkataraman
Agree that toDF is not very useful. In fact it was removed from the namespace in a recent change https://github.com/apache/spark/commit/4e930420c19ae7773b138dfc7db8fc03b4660251 Thanks Shivaram On Fri, May 8, 2015 at 1:10 AM, Sun, Rui rui@intel.com wrote: toDF() is defined to convert an RDD

Re: DataFrames equivalent to SQL table namespacing and aliases

2015-05-08 Thread Nicholas Chammas
Oh, I didn't know about that. Thanks for the pointer, Rakesh. I wonder why they did that, as opposed to taking the cue from SQL and prefixing column names with a specifiable dataframe alias. The suffix approach seems quite ugly. Nick On Fri, May 8, 2015 at 2:47 PM Rakesh Chalasani

Re: Easy way to convert Row back to case class

2015-05-08 Thread Will Benton
This might not be the easiest way, but it's pretty easy: you can use Row(field_1, ..., field_n) as a pattern in a case match. So if you have a data frame with foo as an int column and bar as a String column and you want to construct instances of a case class that wraps these up, you can do

Re: DataFrames equivalent to SQL table namespacing and aliases

2015-05-08 Thread Reynold Xin
You can actually just use df1['a'] in a projection to differentiate. e.g. in Scala (similar things work in Python): scala> val df1 = Seq((1, "one")).toDF("a", "b") df1: org.apache.spark.sql.DataFrame = [a: int, b: string] scala> val df2 = Seq((2, "two")).toDF("a", "b") df2: org.apache.spark.sql.DataFrame =

DataFrames equivalent to SQL table namespacing and aliases

2015-05-08 Thread Nicholas Chammas
DataFrames, as far as I can tell, don’t have an equivalent to SQL’s table aliases. This is essential when joining dataframes that have identically named columns. # PySpark 1.3.1 df1 = sqlContext.jsonRDD(sc.parallelize(['{"a": 4, "other": "I know"}'])) df2 = sqlContext.jsonRDD(sc.parallelize(['{"a":
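For contrast, the suffixing approach raised later in the thread can be sketched with plain dicts standing in for DataFrames; join_columns and the _x/_y defaults are illustrative, not the pandas or pyspark implementation:

```python
# Toy illustration of suffix-based disambiguation for identically named
# columns after a join (the strategy Pandas takes with lsuffix/rsuffix).
def join_columns(left, right, lsuffix="_x", rsuffix="_y"):
    """Merge two column dicts, suffixing any name present in both."""
    overlap = left.keys() & right.keys()
    out = {}
    for name, col in left.items():
        out[name + lsuffix if name in overlap else name] = col
    for name, col in right.items():
        out[name + rsuffix if name in overlap else name] = col
    return out

df1 = {"a": [4], "other": ["I know"]}
df2 = {"a": [3], "name": ["spark"]}
print(sorted(join_columns(df1, df2)))  # ['a_x', 'a_y', 'name', 'other']
```

A SQL-style table alias, by contrast, would keep the original column names and qualify them (alias.a), which is what this message is asking for.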

Re: large volume spark job spends most of the time in AppendOnlyMap.changeValue

2015-05-08 Thread Josh Rosen
Do you have any more specific profiling data that you can share? I'm curious to know where AppendOnlyMap.changeValue is being called from. On Fri, May 8, 2015 at 1:26 PM, Michal Haris michal.ha...@visualdna.com wrote: +dev On 6 May 2015 10:45, Michal Haris michal.ha...@visualdna.com wrote:

Re: branch-1.4 nightly builds?

2015-05-08 Thread Nicholas Chammas
https://issues.apache.org/jira/browse/SPARK-1517 That issue should probably be unassigned since I am not actively working on it. (I can't unassign myself.) Nick On Fri, May 8, 2015 at 5:38 PM Punyashloka Biswal punya.bis...@gmail.com wrote: Dear Spark devs, Does anyone maintain nightly

Re: Easy way to convert Row back to case class

2015-05-08 Thread Reynold Xin
In 1.4, you can do row.getInt(colName) In 1.5, some variant of this will come to allow you to turn a DataFrame into a typed RDD, where the case class's field names match the column names. https://github.com/apache/spark/pull/5713 On Fri, May 8, 2015 at 11:01 AM, Will Benton wi...@redhat.com

Re: DataFrames equivalent to SQL table namespacing and aliases

2015-05-08 Thread Rakesh Chalasani
To add to the above discussion: Pandas allows suffixing and prefixing to solve this issue http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.join.html Rakesh On Fri, May 8, 2015 at 2:42 PM Nicholas Chammas nicholas.cham...@gmail.com wrote: DataFrames, as far as I can tell,

Re: DataFrame distinct vs RDD distinct

2015-05-08 Thread Olivier Girardot
I'll try to reproduce what has been reported to me first :) and I'll let you know. Thanks! On Thu, May 7, 2015 at 9:16 PM, Michael Armbrust mich...@databricks.com wrote: I'd happily merge a PR that changes the distinct implementation to be more like Spark core, assuming it includes benchmarks

Re: Build fail...

2015-05-08 Thread rtimp
Hi, From what I myself noticed a few minutes ago, I think branch-1.3 might be failing to compile due to the most recent commit. I tried reverting to commit 7fd212b575b6227df5068844416e51f11740e771 (the commit prior to the head) on that branch and recompiling, and was successful. As Ferris would

Re: Build fail...

2015-05-08 Thread Ted Yu
Looks like you're right: https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-1.3-Maven-with-YARN/HADOOP_PROFILE=hadoop-2.4,label=centos/427/console [error]

Re: Build fail...

2015-05-08 Thread Andrew Or
Thanks for pointing this out. I reverted that commit. 2015-05-08 19:01 GMT-07:00 Ted Yu yuzhih...@gmail.com: Looks like you're right: https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-1.3-Maven-with-YARN/HADOOP_PROFILE=hadoop-2.4,label=centos/427/console [error]

Intellij Spark Source Compilation

2015-05-08 Thread rtimp
Hello, I'm trying to compile the master branch of the spark source (25889d8) in intellij. I followed the instructions in the wiki https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools, namely I downloaded IntelliJ 14.1.2 with jre 1.7.0_55, imported pom.xml, generated all

Re: Recent Spark test failures

2015-05-08 Thread Ted Yu
Andrew: Do you think the -M and -A options described here can be used in test runs ? http://scalatest.org/user_guide/using_the_runner Cheers On Wed, May 6, 2015 at 5:41 PM, Andrew Or and...@databricks.com wrote: Dear all, I'm sure you have all noticed that the Spark tests have been fairly