Re: [VOTE] Release Apache Spark 1.4.1

2015-06-30 Thread Joseph Bradley
+1 On Tue, Jun 30, 2015 at 5:27 PM, Reynold Xin r...@databricks.com wrote: +1 On Tue, Jun 23, 2015 at 10:37 PM, Patrick Wendell pwend...@gmail.com wrote: Please vote on releasing the following candidate as Apache Spark version 1.4.1! This release fixes a handful of known issues in Spark

Re: [VOTE] Release Apache Spark 1.4.1

2015-06-30 Thread Reynold Xin
+1 On Tue, Jun 23, 2015 at 10:37 PM, Patrick Wendell pwend...@gmail.com wrote: Please vote on releasing the following candidate as Apache Spark version 1.4.1! This release fixes a handful of known issues in Spark 1.4.0, listed here: http://s.apache.org/spark-1.4.1 The tag to be voted on

DStream.reduce

2015-06-30 Thread Zoltán Zvara
Why is reduce in DStream implemented with a map, reduceByKey and another map, given that we have an RDD.reduce?
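
The pattern described in the question can be mimicked in plain Python. This is only a sketch under stated assumptions: a batch of the stream is modelled as a list, and the helpers `reduce_by_key`, `dstream_style_reduce` and `rdd_style_reduce` are hypothetical names, not Spark APIs. One visible difference is that the map / reduceByKey / map pipeline returns a collection with at most one element, while a direct reduce returns a bare value. Whether that is the motivation for Spark's design is not stated in this message.

```python
from functools import reduce

def reduce_by_key(pairs, func):
    """Combine values that share a key (hypothetical stand-in for reduceByKey)."""
    combined = {}
    for key, value in pairs:
        combined[key] = func(combined[key], value) if key in combined else value
    return list(combined.items())

def dstream_style_reduce(batch, func):
    """map -> reduceByKey -> map, as described in the question."""
    keyed = [(None, x) for x in batch]        # first map: attach a dummy key
    reduced = reduce_by_key(keyed, func)      # all keys are equal, so at most one entry
    return [value for _, value in reduced]    # second map: drop the dummy key

def rdd_style_reduce(batch, func):
    """Direct reduce: returns a bare value and raises on an empty batch."""
    return reduce(func, batch)
```

With this sketch, an empty batch gives an empty list from `dstream_style_reduce`, whereas `rdd_style_reduce` raises a `TypeError`.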

Re: [DataFrame] partitionBy issues

2015-06-30 Thread rake
I ran into a similar problem, reading a CSV file into a DataFrame and saving to Parquet with 'partitionBy', and getting an OutOfMemory error even though it's not a large data file. I discovered that by default Spark appears to be allocating a block of 128MB in memory for each output Parquet
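
A back-of-envelope calculation shows why a small input could still exhaust memory if the 128MB figure above holds. This is a sketch only: it assumes one 128MB buffer per output that a task has open at the same time, which is a reading of the truncated message rather than a confirmed description of Spark's behaviour, and `estimated_buffer_mb` is a hypothetical helper.

```python
BUFFER_MB_PER_OUTPUT = 128  # figure quoted in the message above

def estimated_buffer_mb(open_outputs):
    """Estimated buffer memory if every open output holds its own block (assumption)."""
    return open_outputs * BUFFER_MB_PER_OUTPUT
```

Under that assumption, a task writing to 50 partition values at once would need about 6400MB of buffer space, regardless of how small the input file is.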

Re: Grouping runs of elements in a RDD

2015-06-30 Thread Reynold Xin
Try mapPartitions, which gives you an iterator, and you can produce an iterator back. On Tue, Jun 30, 2015 at 11:01 AM, RJ Nowling rnowl...@gmail.com wrote: Hi all, I have a problem where I have an RDD of elements: Item1 Item2 Item3 Item4 Item5 Item6 ... and I want to run a function over
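
The iterator-in, iterator-out idea can be sketched in plain Python. Assumptions: partitions are modelled as a list of lists, `map_partitions` is a hypothetical stand-in for Spark's mapPartitions, and `same_run` is a hypothetical predicate that decides whether two successive items belong to the same run.

```python
def group_runs(items, same_run):
    """Consume an iterator and yield lists of consecutive items that belong together."""
    current = []
    for item in items:
        # Start a new run when the item does not continue the previous one.
        if current and not same_run(current[-1], item):
            yield current
            current = []
        current.append(item)
    if current:
        yield current

def map_partitions(partitions, func):
    """Apply func to each partition's iterator and collect what it yields."""
    return [list(func(iter(part))) for part in partitions]
```

For example, `map_partitions([[1, 1, 2], [2, 3, 3]], lambda it: group_runs(it, lambda a, b: a == b))` groups each partition separately, so the run of 2s that straddles the two partitions comes back as two separate groups.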

Re: [DataFrame] partitionBy issues

2015-06-30 Thread vladio
https://issues.apache.org/jira/browse/SPARK-8597 A JIRA ticket discussing the same problem (with more insights than here)!

Re: Grouping runs of elements in a RDD

2015-06-30 Thread RJ Nowling
That's an interesting idea! I hadn't considered that. However, looking at the Partitioner interface, I would need to decide the partition from a single key, which doesn't fit my case, unfortunately. For my case, I need to compare successive pairs of keys. (I'm trying to re-join lines that were split

Re: Grouping runs of elements in a RDD

2015-06-30 Thread RJ Nowling
Thanks, Reynold. I still need to handle incomplete groups that fall between partition boundaries. So, I need a two-pass approach. I came up with a somewhat hacky way to handle those using the partition indices and key-value pairs as a second pass after the first. OCaml's std library provides a
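
A second pass over boundary groups could look roughly like the following plain-Python sketch. It is not the author's implementation: it assumes the first pass has already produced, for each partition in order, a list of runs, and it uses list position in place of Spark partition indices. `merge_boundaries` and `same_run` are hypothetical names.

```python
def merge_boundaries(grouped_partitions, same_run):
    """Stitch the last run of one partition to the first run of the next when they match."""
    merged = []
    for groups in grouped_partitions:           # partitions in index order
        for i, group in enumerate(groups):
            # Only the first run of a partition can continue a run from earlier partitions.
            if i == 0 and merged and same_run(merged[-1][-1], group[0]):
                merged[-1].extend(group)
            else:
                merged.append(list(group))
    return merged
```

Because the last merged run keeps growing, a run that spans more than two partitions is still joined into one group. This sketch walks every group sequentially; in a real job only the first and last run of each partition would need to be brought together.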

Re: Grouping runs of elements in a RDD

2015-06-30 Thread Abhishek R. Singh
Could you use a custom partitioner to preserve boundaries such that all related tuples end up on the same partition? On Jun 30, 2015, at 12:00 PM, RJ Nowling rnowl...@gmail.com wrote: Thanks, Reynold. I still need to handle incomplete groups that fall between partition boundaries. So, I