What is the location in the source code of the computation of the elements in a map transformation?

2015-05-02 Thread Tom Hubregtsen
I am trying to understand what the data and computation flow is in Spark, and believe I understand the shuffle fairly well (both map and reduce side), but I do not get what happens to the computation from the map stages. I know all maps get pipelined on the shuffle (when there is no other action in

createDataFrame allows column names as second param in Python not in Scala

2015-05-02 Thread Olivier Girardot
Hi everyone, SQLContext.createDataFrame behaves differently in Scala and Python: l = [('Alice', 1)] sqlContext.createDataFrame(l).collect() [Row(_1=u'Alice', _2=1)] sqlContext.createDataFrame(l, ['name', 'age']).collect() [Row(name=u'Alice', age=1)] and in Scala: scala val data =

Why does SortShuffleWriter write to disk always?

2015-05-02 Thread Pramod Biligiri
Hi, I was trying to see if I can make Spark avoid hitting the disk for small jobs, but I see that SortShuffleWriter.write() always writes to disk. I found an older thread (http://apache-spark-user-list.1001560.n3.nabble.com/How-does-shuffle-work-in-spark-td584.html) saying that it doesn't

Re: [discuss] ending support for Java 6?

2015-05-02 Thread Mridul Muralidharan
We could build on the minimum jdk we support for testing PRs - which will automatically cause build failures in case code uses a newer API? Regards, Mridul On Fri, May 1, 2015 at 2:46 PM, Reynold Xin r...@databricks.com wrote: It's really hard to inspect API calls since none of us have the Java

Re: [discuss] ending support for Java 6?

2015-05-02 Thread Ted Yu
+1 On Sat, May 2, 2015 at 1:09 PM, Mridul Muralidharan mri...@gmail.com wrote: We could build on the minimum jdk we support for testing PRs - which will automatically cause build failures in case code uses a newer API? Regards, Mridul On Fri, May 1, 2015 at 2:46 PM, Reynold Xin

Re: [discuss] ending support for Java 6?

2015-05-02 Thread Mridul Muralidharan
Hi Shane, Since we are still maintaining support for jdk6, jenkins should be using jdk6 [1] to ensure we do not inadvertently use jdk7 or higher APIs, which would break source-level compatibility. -source and -target are insufficient to ensure API usage is conformant with the minimum jdk version we are

Re: Why does SortShuffleWriter write to disk always?

2015-05-02 Thread Reynold Xin
I've personally prototyped completely in-memory shuffle for Spark 3 times. However, it is unclear how big of a gain it would be to put all of this in memory, under newer file systems (ext4, xfs). If the shuffle data is small, it is still in the file system buffer cache anyway. Note that

Submit & Kill Spark Application program programmatically from another application

2015-05-02 Thread Yijie Shen
Hi, I’ve posted this problem in user@spark but got no reply, so I moved it to dev@spark; sorry for the duplication. I am wondering if it is possible to submit, monitor & kill Spark applications from another service. I have written a service like this: parse user commands, translate them into

Re: Why does SortShuffleWriter write to disk always?

2015-05-02 Thread Mridul Muralidharan
I agree, this is better handled by the filesystem cache - not to mention, being able to do zero copy writes. Regards, Mridul On Sat, May 2, 2015 at 10:26 PM, Reynold Xin r...@databricks.com wrote: I've personally prototyped completely in-memory shuffle for Spark 3 times. However, it is unclear

Re: createDataFrame allows column names as second param in Python not in Scala

2015-05-02 Thread Reynold Xin
Part of the reason is that it is really easy to just call toDF in Scala, and we already have a lot of createDataFrame functions. (You might find some of the cross-language differences confusing, but I'd argue most real users just stick to one language, and developers or trainers are the only ones

Re: [discuss] ending support for Java 6?

2015-05-02 Thread Reynold Xin
It's really hard to inspect API calls since none of us have the Java standard library in our brain. The only way we can enforce this is to have it in Jenkins, and Tom you are currently our mini-Jenkins server :) Joking aside, looks like we should support Java 6 in 1.4, and in the release notes

Re: Pandas' Shift in Dataframe

2015-05-02 Thread Olivier Girardot
To close this thread: rxin created a broader JIRA to handle window functions in DataFrames: https://issues.apache.org/jira/browse/SPARK-7322 Thanks everyone. On Wed, Apr 29, 2015 at 22:51, Olivier Girardot o.girar...@lateral-thoughts.com wrote: To give you a broader idea of the current use

Re: What is the location in the source code of the computation of the elements in a map transformation?

2015-05-02 Thread Patrick Wendell
Maybe I can help a bit. What happens when you call .map(my func) is that you create a MapPartitionsRDD that has a reference to that closure in its compute() function. When a job is run (jobs are run as the result of RDD actions):
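
The idea in the message above (calling .map() only records the closure, and the closure runs inside compute() once an action triggers a job) can be illustrated with a small toy model. This is a sketch in plain Python, not Spark's code: the classes ToyRDD, ToySourceRDD and ToyMapPartitionsRDD are hypothetical names made up for illustration, and only the general shape (a child RDD wrapping its parent's iterator) is taken from the description above.

```python
# Toy model only: these classes are hypothetical and are NOT Spark's implementation.
from typing import Callable, Iterable, Iterator, List


class ToyRDD:
    """Hypothetical base class: a dataset split into lazily computed partitions."""

    def num_partitions(self) -> int:
        raise NotImplementedError

    def compute(self, split: int) -> Iterator:
        raise NotImplementedError

    def map(self, func: Callable) -> "ToyMapPartitionsRDD":
        # Nothing is computed here; the closure is only stored in the new RDD.
        return ToyMapPartitionsRDD(self, lambda it: (func(x) for x in it))

    def collect(self) -> List:
        # An "action": only now is compute() called, once per partition.
        out = []
        for split in range(self.num_partitions()):
            out.extend(self.compute(split))
        return out


class ToySourceRDD(ToyRDD):
    """Hypothetical leaf RDD backed by in-memory lists."""

    def __init__(self, partitions: Iterable[Iterable]):
        self._partitions = [list(p) for p in partitions]

    def num_partitions(self) -> int:
        return len(self._partitions)

    def compute(self, split: int) -> Iterator:
        return iter(self._partitions[split])


class ToyMapPartitionsRDD(ToyRDD):
    """Hypothetical child RDD holding a reference to the user's closure."""

    def __init__(self, parent: ToyRDD, f: Callable[[Iterator], Iterator]):
        self._parent = parent
        self._f = f

    def num_partitions(self) -> int:
        return self._parent.num_partitions()

    def compute(self, split: int) -> Iterator:
        # Chained maps are pipelined: each element flows through every
        # closure in turn, with no intermediate collection materialized.
        return self._f(self._parent.compute(split))


calls = []


def traced_double(x):
    calls.append(x)  # record when the closure actually runs
    return x * 2


rdd = ToySourceRDD([[1, 2], [3]]).map(traced_double).map(lambda x: x + 1)
calls_before_action = list(calls)  # still empty: map() did no work
result = rdd.collect()
```

In this toy, `calls_before_action` stays empty because the closures run only when `collect()` walks the partitions; the two chained maps are applied element by element within a single pass over each partition.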

Re: [discuss] ending support for Java 6?

2015-05-02 Thread shane knapp
that's kinda what we're doing right now, java 7 is the default/standard on our jenkins. or, i vote we buy a butler's outfit for thomas and have a second jenkins instance... ;) On Sat, May 2, 2015 at 1:09 PM, Mridul Muralidharan mri...@gmail.com wrote: We could build on minimum jdk we support

Re: [discuss] ending support for Java 6?

2015-05-02 Thread Koert Kuipers
i think i might be misunderstanding, but shouldnt java 6 currently be used in jenkins? On Sat, May 2, 2015 at 11:53 PM, shane knapp skn...@berkeley.edu wrote: that's kinda what we're doing right now, java 7 is the default/standard on our jenkins. or, i vote we buy a butler's outfit for