Re: [VOTE] Release Apache Spark 1.0.0 (rc9)

2014-05-20 Thread Tom Graves
I assume we will have an rc10 to fix the issues Matei found? Tom On Sunday, May 18, 2014 9:08 PM, Patrick Wendell pwend...@gmail.com wrote: Hey Matei - the issue you found is not related to security. A patch that went in a few days ago broke builds for Hadoop 1 with YARN support enabled. The patch

Re: BUG: graph.triplets does not return proper values

2014-05-20 Thread GlennStrycker
Oh... ha, good point. Sorry, I'm new to mapreduce programming and forgot about that... I'll have to adjust my reduce function to output a vector/RDD as the element to return. Thanks for reminding me of this!

Re: Sorting partitions in Java

2014-05-20 Thread Sean Owen
It's an Iterator in both Java and Scala. In both cases you need to copy the stream of values into something List-like to sort it. An Iterable would not change that (not sure the API can promise multiple iterations anyway). If you just want the equivalent of toArray, you can use a utility method in
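A minimal Scala sketch of what Sean describes (the thread asks about Java, but the idea is identical): materialize the partition's iterator, sort it in memory, and hand back a fresh iterator. The RDD here is hypothetical, and the approach assumes each partition fits in memory.

    // Hypothetical data; sort each partition independently
    val rdd = sc.parallelize(Seq(3, 1, 4, 1, 5, 9, 2, 6), 2)
    val sortedPartitions = rdd.mapPartitions { iter =>
      iter.toArray.sorted.iterator  // copy, in-memory sort, re-wrap
    }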

Re: Sorting partitions in Java

2014-05-20 Thread Madhu
Thanks Sean, I had seen that post you mentioned. What you suggest looks like an in-memory sort, which is fine if each partition is small enough to fit in memory. Is it true that rdd.sortByKey(...) requires partitions to fit in memory? I wasn't sure if there was some magic behind the scenes that

Re: Sorting partitions in Java

2014-05-20 Thread Andrew Ash
Voted :) https://issues.apache.org/jira/browse/SPARK-983 On Tue, May 20, 2014 at 10:21 AM, Sandy Ryza sandy.r...@cloudera.com wrote: There is: SPARK-545 On Tue, May 20, 2014 at 10:16 AM, Andrew Ash and...@andrewash.com wrote: Sandy, is there a Jira ticket for that? On Tue, May 20,

Re: Sorting partitions in Java

2014-05-20 Thread Madhu
Sean, No, I don't want to sort the whole RDD, sortByKey seems to be good enough for that. Right now, I think the code I have will work for me, but I can imagine conditions where it will run out of memory. I'm not completely sure if SPARK-983 https://issues.apache.org/jira/browse/SPARK-983

Re: BUG: graph.triplets does not return proper values

2014-05-20 Thread GlennStrycker
Wait a minute... doesn't a reduce function return 1 element PER key pair? For example, word-count mapreduce functions return a {word, count} element for every unique word. Is this supposed to be a 1-element RDD object? The .reduce functions for MappedRDD and FlatMappedRDD are both of the form

Re: BUG: graph.triplets does not return proper values

2014-05-20 Thread Reynold Xin
You are probably looking for reduceByKey in that case. reduce just reduces everything in the collection into a single element. On Tue, May 20, 2014 at 12:16 PM, GlennStrycker glenn.stryc...@gmail.com wrote: Wait a minute... doesn't a reduce function return 1 element PER key pair? For example,
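A short hedged sketch of the distinction Reynold draws, using a hypothetical word RDD:

    import org.apache.spark.SparkContext._  // adds reduceByKey to pair RDDs

    val words = sc.parallelize(Seq("a", "b", "a"))

    // reduce: collapses the whole RDD into one value
    val joined = words.reduce(_ + _)  // e.g. "aba" (combination order not guaranteed)

    // reduceByKey: one result per distinct key, as in word count
    val counts = words.map(w => (w, 1)).reduceByKey(_ + _)  // ("a",2), ("b",1)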

Re: BUG: graph.triplets does not return proper values

2014-05-20 Thread GlennStrycker
I don't seem to have this function in my Spark installation for this object, or the classes MappedRDD, FlatMappedRDD, EdgeRDD, VertexRDD, or Graph. Which class should have the reduceByKey function, and how do I cast my current RDD as this class? Perhaps this is still due to my Spark installation

Re: BUG: graph.triplets does not return proper values

2014-05-20 Thread Mark Hamstra
That's all very old functionality in Spark terms, so it shouldn't have anything to do with your installation being out-of-date. There is also no need to cast as long as the relevant implicit conversions are in scope: import org.apache.spark.SparkContext._ On Tue, May 20, 2014 at 1:00 PM,

Re: BUG: graph.triplets does not return proper values

2014-05-20 Thread Sean Owen
http://spark.apache.org/docs/0.9.1/api/core/index.html#org.apache.spark.rdd.PairRDDFunctions It becomes automagically available when your RDD contains pairs. On Tue, May 20, 2014 at 9:00 PM, GlennStrycker glenn.stryc...@gmail.com wrote: I don't seem to have this function in my Spark
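To make that concrete for Glenn's case, a hypothetical sketch: reduceByKey shows up only once the elements are key/value pairs and the implicit from SparkContext._ is in scope, so no cast is needed.

    import org.apache.spark.SparkContext._  // implicit conversion to PairRDDFunctions

    // Hypothetical: count parallel edges keyed by (src, dst)
    val edgeCounts = orig_graph.edges
      .map(e => ((e.srcId, e.dstId), 1))  // now an RDD of pairs
      .reduceByKey(_ + _)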

Re: BUG: graph.triplets does not return proper values

2014-05-20 Thread GlennStrycker
For some reason it does not appear when I hit tab in Spark shell, but when I put everything together in one line, it DOES WORK! orig_graph.edges.map(_.copy()).cartesian(orig_graph.edges.map(_.copy())).flatMap( A => Seq(if (A._1.srcId == A._2.dstId) Edge(A._2.srcId,A._1.dstId,1) else if (A._1.dstId
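For readability, the same one-liner reformatted; the message is cut off mid-expression, so the second branch and the fallback below are assumptions (a symmetric case and an empty result), not a verbatim copy:

    import org.apache.spark.graphx.Edge

    val result = orig_graph.edges.map(_.copy())
      .cartesian(orig_graph.edges.map(_.copy()))
      .flatMap { A =>
        if (A._1.srcId == A._2.dstId) Seq(Edge(A._2.srcId, A._1.dstId, 1))
        else if (A._1.dstId == A._2.srcId) Seq(Edge(A._1.srcId, A._2.dstId, 1))  // assumed branch
        else Seq.empty  // assumed fallback
      }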

[VOTE] Release Apache Spark 1.0.0 (RC10)

2014-05-20 Thread Tathagata Das
Please vote on releasing the following candidate as Apache Spark version 1.0.0! This has a few bug fixes on top of rc9: SPARK-1875: https://github.com/apache/spark/pull/824 SPARK-1876: https://github.com/apache/spark/pull/819 SPARK-1878: https://github.com/apache/spark/pull/822 SPARK-1879:

Re: Scala examples for Spark do not work as written in documentation

2014-05-20 Thread Andy Konwinski
I fixed the bug, but I kept the parameter i instead of _ since that (1) keeps it more parallel to the Python and Java versions, which also use functions with a named variable, and (2) doesn't require readers to know this particular use of the _ syntax in Scala. Thanks for catching this, Glenn. Andy
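For readers new to the shorthand, the two forms in question are equivalent (hypothetical snippet):

    val xs = sc.parallelize(1 to 4)
    xs.map(i => i * 2)  // named parameter, as kept in the docs
    xs.map(_ * 2)       // the _ placeholder syntax the fix avoids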

Re: [VOTE] Release Apache Spark 1.0.0 (RC10)

2014-05-20 Thread Sandy Ryza
+1 On Tue, May 20, 2014 at 5:26 PM, Andrew Or and...@databricks.com wrote: +1 2014-05-20 13:13 GMT-07:00 Tathagata Das tathagata.das1...@gmail.com: Please vote on releasing the following candidate as Apache Spark version 1.0.0! This has a few bug fixes on top of rc9: SPARK-1875:

Re: [VOTE] Release Apache Spark 1.0.0 (RC10)

2014-05-20 Thread Marcelo Vanzin
+1 (non-binding) I have: - checked signatures and checksums of the files - built the code from the git repo using both sbt and mvn (against hadoop 2.3.0) - ran a few simple jobs in local, yarn-client and yarn-cluster mode. I haven't explicitly tested any of the recent fixes, streaming, or SQL. On

Re: Calling external classes added by sc.addJar needs to be through reflection

2014-05-20 Thread Xiangrui Meng
Talked with Sandy and DB offline. I think the best solution is sending the secondary jars to the distributed cache of all containers rather than just the master, and setting the classpath to include the Spark jar, the primary app jar, and the secondary jars before the executor starts. In this way, the user only needs to
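A hedged sketch of the resulting user-facing workflow, using spark-submit's existing --jars flag; the paths and class name are hypothetical:

    # Secondary jars listed via --jars would be shipped through the
    # distributed cache and land on every executor's classpath.
    ./bin/spark-submit --class com.example.App \
      --master yarn-cluster \
      --jars /path/libA.jar,/path/libB.jar \
      myapp.jar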

Re: [VOTE] Release Apache Spark 1.0.0 (RC10)

2014-05-20 Thread Matei Zaharia
+1 Tested it on both Windows and Mac OS X, with both Scala and Python. Confirmed that the issues in the previous RC were fixed. Matei On May 20, 2014, at 5:28 PM, Marcelo Vanzin van...@cloudera.com wrote: +1 (non-binding) I have: - checked signatures and checksums of the files - built