Re: GraphX: New graph operator

2015-06-02 Thread Ankur Dave
I think it would be good to have more basic operators like union or difference, as long as they have an efficient distributed implementation and are plausibly useful. If they can be written in terms of the existing GraphX API, it would be best to put them into GraphOps to keep the core GraphX

Re: [VOTE] Release Apache Spark 1.4.0 (RC3)

2015-06-02 Thread Olivier Girardot
Hi everyone, I think there's a blocker on PySpark: the when function in Python seems to be broken, but the Scala API seems fine. Here's a snippet demonstrating that with Spark 1.4.0 RC3: In [1]: df = sqlCtx.createDataFrame([(1, 1), (2, 2), (1, 2), (1, 2)], ["key", "value"]) In [2]: from

about Spark MLlib StandardScaler's Implementation

2015-06-02 Thread RoyGaoVLIS
Hi, when I was trying to add a test case for ML’s StandardScaler, I found MLlib’s StandardScaler’s output differs from R with params (withMean false, withScale true), because in R’s scale function columns are divided by the root-mean-square rather than the standard deviation. I’ m
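As a quick illustration of the difference being described, here is a small sketch (not from the thread, assuming R's documented definition of the root-mean-square used by scale() when center = FALSE; the values are arbitrary):

// Two candidate scale factors for the same column.
val xs = Array(1.0, 2.0, 3.0, 4.0)
val n = xs.length
val mean = xs.sum / n
// Root-mean-square, as R's scale() uses when center = FALSE: sqrt(sum(x^2) / (n - 1)).
val rms = math.sqrt(xs.map(x => x * x).sum / (n - 1))
// Sample standard deviation: sqrt(sum((x - mean)^2) / (n - 1)).
val stddev = math.sqrt(xs.map(x => (x - mean) * (x - mean)).sum / (n - 1))
// Unless the column mean is zero, rms != stddev, so the scaled outputs differ.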

Re: GraphX: New graph operator

2015-06-02 Thread Tarek Auel
Okay, thanks for your feedback. What is the expected behavior of union? Like UNION and/or UNION ALL in SQL? UNION ALL would be more or less trivial if we just concatenate the vertices and edges (vertex ID conflicts have to be resolved). Should union look for duplicates on the actual attribute (VD)
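As a point of reference, here is a minimal sketch of that UNION ALL variant written only against the existing public GraphX API; the operator name, the signature, and the caller-supplied merge function for conflicting vertex IDs are illustrative assumptions, not an agreed design:

import scala.reflect.ClassTag
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

// Hypothetical "union all" of two graphs: concatenate vertices and edges.
// Vertex ID conflicts are resolved with a caller-supplied merge function;
// duplicate edges are kept, mirroring SQL UNION ALL semantics.
def unionAll[VD: ClassTag, ED: ClassTag](
    g1: Graph[VD, ED],
    g2: Graph[VD, ED])(mergeVertex: (VD, VD) => VD): Graph[VD, ED] = {
  val vertices: RDD[(VertexId, VD)] = (g1.vertices ++ g2.vertices).reduceByKey(mergeVertex)
  val edges: RDD[Edge[ED]] = g1.edges ++ g2.edges
  Graph(vertices, edges)
}

A duplicate-aware union (SQL UNION semantics) would additionally need to define equality over the vertex and edge attributes, which is the open question above.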

Re: CSV Support in SparkR

2015-06-02 Thread Shivaram Venkataraman
Hi Alek, as Burak said, you can already use spark-csv with SparkR in the 1.4 release. So right now I use it with something like this: # Launch SparkR ./bin/sparkR --packages com.databricks:spark-csv_2.10:1.0.3 df <- read.df(sqlContext, "./nycflights13.csv", "com.databricks.spark.csv", header="true")

DataFrame.withColumn very slow when used iteratively?

2015-06-02 Thread zsampson
Hey, I'm seeing extreme slowness in withColumn when it's used in a loop. I'm running this code: for (int i = 0; i < NUM_ITERATIONS; ++i) { df = df.withColumn("col" + i, new Column(new Literal(i, DataTypes.IntegerType))); } where df is initially a trivial dataframe. Here are the results of running

Re: Possible space improvements to shuffle

2015-06-02 Thread Josh Rosen
The relevant JIRA that springs to mind is https://issues.apache.org/jira/browse/SPARK-2926 If an aggregator and ordering are both defined, then the map side of sort-based shuffle will sort based on the key ordering so that map-side spills can be efficiently merged. We do not currently do a
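For background on why the key ordering matters, here is a generic sketch (illustrative only, not Spark's internal code) of merging spill streams that are each already sorted by key; the consumer can then iterate over a single sorted stream lazily, holding roughly one pending record per stream rather than buffering the whole dataset:

// k-way merge of key-sorted iterators; memory stays proportional to the number
// of streams because only each stream's head element is held at a time.
def mergeSorted[K, V](streams: Seq[Iterator[(K, V)]])(implicit ord: Ordering[K]): Iterator[(K, V)] = {
  val heads = streams.map(_.buffered)  // BufferedIterator lets us peek at the next element
  new Iterator[(K, V)] {
    def hasNext: Boolean = heads.exists(_.hasNext)
    def next(): (K, V) = heads.filter(_.hasNext).minBy(_.head._1).next()
  }
}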

Re: DataFrame.withColumn very slow when used iteratively?

2015-06-02 Thread Reynold Xin
We improved this in 1.4. Adding 100 columns took 4s on my laptop. https://issues.apache.org/jira/browse/SPARK-7276 Still not the fastest, but much faster. scala> Seq((1, 2)).toDF("a", "b") res6: org.apache.spark.sql.DataFrame = [a: int, b: int] scala> val start = System.nanoTime start: Long =

Re: CSV Support in SparkR

2015-06-02 Thread Eskilson,Aleksander
Ah, alright, cool. I’ll rebuild and let you know. Thanks again, Alek From: Shivaram Venkataraman shiva...@eecs.berkeley.edu Reply-To: shiva...@eecs.berkeley.edu Date:

Re: CSV Support in SparkR

2015-06-02 Thread Eskilson,Aleksander
Hey, that’s pretty convenient. Unfortunately, although the package seems to pull fine into the session, I’m getting class-not-found exceptions with: Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 6.0 failed 4 times, most recent failure: Lost task

Possible space improvements to shuffle

2015-06-02 Thread John Carrino
One thing I have noticed with ExternalSorter is that if an ordering is not defined, it does the sort using only the partition_id, instead of (partition_id, hash). This means that on the reduce side you need to pull the entire dataset into memory before you can begin iterating over the results. I

Re: DataFrame.withColumn very slow when used iteratively?

2015-06-02 Thread Andrew Ash
Would it be valuable to create a .withColumns([colName], [ColumnObject]) method that adds in bulk rather than iteratively? Alternatively, effort might be better spent making the singular .withColumn() faster. On Tue, Jun 2, 2015 at 3:46 PM, Reynold Xin r...@databricks.com wrote: We improved this

Re: Possible space improvements to shuffle

2015-06-02 Thread John Carrino
Yes, I think that bug is what I want. Thank you. So I guess the current reason is that we don't want to buffer up numMapper incoming streams. So we just iterate through each and transfer it over in full because that is more network efficient? I'm not sure I understand why you wouldn't want to

Re: CSV Support in SparkR

2015-06-02 Thread Eskilson,Aleksander
Seems to work great in the master build. It’s really good to have this functionality. Regards, Alek Eskilson From: Eskilson, Aleksander alek.eskil...@cerner.com Date: Tuesday, June 2, 2015 at 2:59 PM To:

createDataframe from s3 results in error

2015-06-02 Thread Ignacio Zendejas
I've run into an error when trying to create a dataframe. Here's the code: -- from pyspark import StorageLevel from pyspark.sql import Row table = 'blah' ssc = HiveContext(sc) data = sc.textFile('s3://bucket/some.tsv') def deserialize(s): p = s.strip().split('\t') p[-1] = float(p[-1])

Re: createDataframe from s3 results in error

2015-06-02 Thread Reynold Xin
Maybe an incompatible Hive package or Hive metastore? On Tue, Jun 2, 2015 at 3:25 PM, Ignacio Zendejas i...@node.io wrote: From RELEASE: Spark 1.3.1 built for Hadoop 2.4.0 Build flags: -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests -Pkinesis-asl -Pspark-ganglia-lgpl

Re: [SQL] Write parquet files under partition directories?

2015-06-02 Thread Reynold Xin
Almost all dataframe stuff are tracked by this umbrella ticket: https://issues.apache.org/jira/browse/SPARK-6116 For the reader/writer interface, it's here: https://issues.apache.org/jira/browse/SPARK-7654 https://github.com/apache/spark/pull/6175 On Tue, Jun 2, 2015 at 3:57 PM, Matt Cheah

Re: createDataframe from s3 results in error

2015-06-02 Thread Ignacio Zendejas
From RELEASE: Spark 1.3.1 built for Hadoop 2.4.0 Build flags: -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests -Pkinesis-asl -Pspark-ganglia-lgpl -Phadoop-provided -Phive -Phive-thriftserver And this stacktrace may be more useful: http://pastebin.ca/3016483 On Tue, Jun 2, 2015 at 3:13

Re: CSV Support in SparkR

2015-06-02 Thread Shivaram Venkataraman
Thanks for testing. We should probably include a section for this in the SparkR programming guide given how popular CSV files are in R. Feel free to open a PR for that if you get a chance. Shivaram On Tue, Jun 2, 2015 at 2:20 PM, Eskilson,Aleksander alek.eskil...@cerner.com wrote: Seems to

Re: createDataframe from s3 results in error

2015-06-02 Thread Reynold Xin
What version of Spark is this? On Tue, Jun 2, 2015 at 3:13 PM, Ignacio Zendejas i...@node.io wrote: I've run into an error when trying to create a dataframe. Here's the code: -- from pyspark import StorageLevel from pyspark.sql import Row table = 'blah' ssc = HiveContext(sc) data =

Re: [SQL] Write parquet files under partition directories?

2015-06-02 Thread Matt Cheah
Excellent! Where can I find the code, pull request, and Spark ticket where this was introduced? Thanks, -Matt Cheah From: Reynold Xin r...@databricks.com Date: Monday, June 1, 2015 at 10:25 PM To: Matt Cheah mch...@palantir.com Cc: dev@spark.apache.org, Mingyu Kim

[RESULT] [VOTE] Release Apache Spark 1.4.0 (RC3)

2015-06-02 Thread Patrick Wendell
This vote is cancelled in favor of RC4. Thanks everyone for the thorough testing of this RC. We are really close, but there were a few blockers found. I've cut a new RC to incorporate those issues. The following patches were merged during the RC3 testing period: (blockers) 4940630 [SPARK-8020]

[VOTE] Release Apache Spark 1.4.0 (RC4)

2015-06-02 Thread Patrick Wendell
Please vote on releasing the following candidate as Apache Spark version 1.4.0! The tag to be voted on is v1.4.0-rc3 (commit 22596c5): https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h= 22596c534a38cfdda91aef18aa9037ab101e4251 The release files, including signatures, digests, etc.

Re: DataFrame.withColumn very slow when used iteratively?

2015-06-02 Thread Reynold Xin
.select itself is the bulk add, right? On Tue, Jun 2, 2015 at 5:32 PM, Andrew Ash and...@andrewash.com wrote: Would it be valuable to create a .withColumns([colName], [ColumnObject]) method that adds in bulk rather than iteratively? Alternatively effort might be better spent in making
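For what it's worth, a rough sketch of that bulk path against the public DataFrame API; the helper name and the column naming scheme are made up here to mirror the loop earlier in the thread:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit}

// Add n integer literal columns with one select instead of n chained
// withColumn calls, so only a single new projection is built.
def addLiteralColumns(df: DataFrame, n: Int): DataFrame = {
  val newCols = (0 until n).map(i => lit(i).as("col" + i))
  df.select(col("*") +: newCols: _*)
}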

Re: Unit tests can generate spurious shutdown messages

2015-06-02 Thread Reynold Xin
Can you submit a pull request for it? Thanks. On Tue, Jun 2, 2015 at 4:25 AM, Mick Davies michael.belldav...@gmail.com wrote: If I write unit tests that indirectly initialize org.apache.spark.util.Utils, for example use sql types, but produce no logging, I get the following unpleasant stack

Re: [VOTE] Release Apache Spark 1.4.0 (RC4)

2015-06-02 Thread Patrick Wendell
Hi all - a tiny nit from the last e-mail. The tag is v1.4.0-rc4. The exact commit and all other information is correct. (thanks Shivaram who pointed this out). On Tue, Jun 2, 2015 at 8:53 PM, Patrick Wendell pwend...@gmail.com wrote: Please vote on releasing the following candidate as Apache

Unit tests can generate spurious shutdown messages

2015-06-02 Thread Mick Davies
If I write unit tests that indirectly initialize org.apache.spark.util.Utils, for example use sql types, but produce no logging, I get the following unpleasant stack trace in my test output. This is caused by the Utils class adding a shutdown hook which logs the message logDebug(Shutdown hook