Re: Why does SortShuffleWriter write to disk always?

2015-05-03 Thread Pramod Biligiri
Thanks for the info. I agree, it makes sense the way it is designed. Pramod On Sat, May 2, 2015 at 10:37 PM, Mridul Muralidharan mri...@gmail.com wrote: I agree, this is better handled by the filesystem cache - not to mention, being able to do zero copy writes. Regards, Mridul On Sat,

Re: [discuss] ending support for Java 6?

2015-05-03 Thread Sean Owen
Should be, but isn't what Jenkins does. https://issues.apache.org/jira/browse/SPARK-1437 At this point it might be simpler to just decide that 1.5 will require Java 7 and then the Jenkins setup is correct. (NB: you can also solve this by setting bootclasspath to JDK 6 libs even when using javac

LDA and PageRank Using GraphX

2015-05-03 Thread Praveen Kumar Muthuswamy
Hi All, I am looking to run LDA for topic modeling and the PageRank algorithms that come with GraphX for some data analysis. Are there any examples (GraphX) that I can take a look at? Thanks Praveen

Re: Speeding up Spark build during development

2015-05-03 Thread Mark Hamstra
https://spark.apache.org/docs/latest/building-spark.html#building-with-buildmvn On Sun, May 3, 2015 at 2:54 PM, Pramod Biligiri pramodbilig...@gmail.com wrote: This is great. I didn't know about the mvn script in the build directory. Pramod On Fri, May 1, 2015 at 9:51 AM, York, Brennon

Re: Speeding up Spark build during development

2015-05-03 Thread Pramod Biligiri
This is great. I didn't know about the mvn script in the build directory. Pramod On Fri, May 1, 2015 at 9:51 AM, York, Brennon brennon.y...@capitalone.com wrote: Following what Ted said, if you leverage the `mvn` from within the `build/` directory of Spark you'll get zinc for free which

Re: Submit Kill Spark Application program programmatically from another application

2015-05-03 Thread Chester Chen
Sounds like you are in Yarn-Cluster mode. I created a JIRA, SPARK-3913 https://issues.apache.org/jira/browse/SPARK-3913, and a PR, https://github.com/apache/spark/pull/2786. Is this what you are looking for? Chester On Sat, May 2, 2015 at 10:32 PM, Yijie Shen henry.yijies...@gmail.com wrote: Hi,

Re: createDataFrame allows column names as second param in Python not in Scala

2015-05-03 Thread Reynold Xin
We can't drop the existing createDataFrame one, since it breaks API compatibility, and the existing one also automatically infers the column name for case classes (in that case users most likely won't be declaring names directly). If this is really a problem, we should just create a new function

Re: Multi-Line JSON in SparkSQL

2015-05-03 Thread Reynold Xin
How does the Pivotal format decide where to split the files? It seems to me the challenge is to decide that, and off the top of my head the only way to do this is to scan from the beginning and parse the JSON properly, which makes it not possible with large files (doable for whole input with a lot
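
The point about scanning from the beginning can be illustrated with a minimal sketch. It assumes the whole input fits in memory as one string made of concatenated JSON values; `iter_json_records` is a hypothetical helper written for this illustration, not part of Spark or of the Pivotal JSONInputFormat. Record boundaries are found only by parsing each value in turn, which is why an arbitrary byte offset in the middle of a file cannot serve as a split point.

```python
import json


def iter_json_records(text):
    """Yield each top-level JSON value in `text`, in order.

    Hypothetical helper: boundaries are discovered by parsing sequentially
    from the start, not by looking for newlines or braces.
    """
    decoder = json.JSONDecoder()
    pos = 0
    end = len(text)
    while pos < end:
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < end and text[pos].isspace():
            pos += 1
        if pos >= end:
            break
        # raw_decode returns the parsed value and the index just past it.
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


# Two records, each spanning several lines; the second contains a "}"
# inside a string, which would defeat naive brace counting.
sample = '''
{"id": 1,
 "tags": ["a", "b"]}
{"id": 2,
 "nested": {"k": "}"}}
'''

records = list(iter_json_records(sample))
```

Because each record's end is known only after the previous records have been parsed, this approach is inherently sequential, which matches the concern raised in the message about large files.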

Question about PageRank with Live Journal

2015-05-03 Thread yunming zhang
Hi, I have a question about running PageRank with LiveJournal data, as suggested by the example at org.apache.spark.examples.graphx.LiveJournalPageRank. I ran with the following options: bin/run-example org.apache.spark.examples.graphx.LiveJournalPageRank data/graphx/soc-LiveJournal1.txt

Re: Multi-Line JSON in SparkSQL

2015-05-03 Thread Olivier Girardot
I'll try to study that and get back to you. Regards, Olivier. On Mon, May 4, 2015 at 04:05, Reynold Xin r...@databricks.com wrote: How does the Pivotal format decide where to split the files? It seems to me the challenge is to decide that, and off the top of my head the only way to do this

Blockers for 1.4.0

2015-05-03 Thread Sean Owen
I'd like to preemptively post the current list of 35 Blockers for release 1.4.0. (There are 53 Critical too, and a total of 273 JIRAs targeted for 1.4.0. Clearly most of that isn't accurate, so would be good to un-target most of that.) As a matter of process and hygiene, it would be best to

Re: [discuss] ending support for Java 6?

2015-05-03 Thread shane knapp
that bug predates my time at the amplab... :) anyways, just to restate: jenkins currently only builds w/java 7. if you folks need 6, i can make it happen, but it will be a (smallish) bit of work. shane On Sun, May 3, 2015 at 2:14 AM, Sean Owen so...@cloudera.com wrote: Should be, but isn't

Re: createDataFrame allows column names as second param in Python not in Scala

2015-05-03 Thread Olivier Girardot
I have the perfect counterexample, where some of the data scientists prototype in Python and the production material is done in Scala. But I get your point; as a matter of fact, I realised the toDF method took parameters a little while after posting this. However, toDF still needs you to go

Multi-Line JSON in SparkSQL

2015-05-03 Thread Olivier Girardot
Hi everyone, Is there any way in Spark SQL to load multi-line JSON data efficiently? I think there was a reference on the mailing list to http://pivotal-field-engineering.github.io/pmr-common/ for its JSONInputFormat. But it's rather inaccessible considering the dependency is not available in any