Re: Grouping runs of elements in an RDD

2015-07-02 Thread Mohit Jaggi
If you are joining successive lines together based on a predicate, then you
are doing a flatMap, not an aggregate. You are on the right track with a
multi-pass solution. I had the same challenge when I needed a sliding
window over an RDD (see below).

[I had suggested that the sliding window API be moved to spark-core. Not
sure if that happened.]

--- previous posts ---

http://spark.apache.org/docs/1.4.0/api/scala/index.html#org.apache.spark.mllib.rdd.RDDFunctions
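
For reference, the sliding function linked above can be used like this (a
minimal sketch, assuming an existing SparkContext named sc):

    import org.apache.spark.mllib.rdd.RDDFunctions._

    // sliding(2) turns RDD(1, 2, 3, 4) into windows Array(1, 2),
    // Array(2, 3), Array(3, 4), handling partition boundaries internally.
    val pairs = sc.parallelize(1 to 4).sliding(2)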

 On Fri, Jan 30, 2015 at 12:27 AM, Mohit Jaggi mohitja...@gmail.com
 wrote:


 http://mail-archives.apache.org/mod_mbox/spark-user/201405.mbox/%3ccalrvtpkn65rolzbetc+ddk4o+yjm+tfaf5dz8eucpl-2yhy...@mail.gmail.com%3E

 You can use the MLlib function or do the following (which is what I had
 done):

 - In the first pass over the data, using mapPartitionsWithIndex, gather
 the first item in each partition. You can use collect (or an aggregator)
 for this. “Key” them by the partition index. At the end, you will have a
 map: (partition index) -> first item.
 - In the second pass over the data, using mapPartitionsWithIndex again,
 look at two items at a time (or, in the general case, N items at a time;
 you can use Scala’s sliding iterator) and check the time difference (or
 any sliding-window computation). To this mapPartitionsWithIndex, pass the
 map created in the previous step. You will need it to check the last item
 in this partition.

 If you can tolerate a few inaccuracies, then you can just do the second
 step. You will miss the “boundaries” of the partitions, but it might be
 acceptable for your use case.
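
A rough sketch of those two passes (illustrative only; for concreteness it
assumes an RDD[Long] of timestamps and reports the gaps larger than maxGap):

    import org.apache.spark.rdd.RDD

    def gaps(rdd: RDD[Long], maxGap: Long): RDD[Long] = {
      // Pass 1: first element of every partition, keyed by partition index.
      val firstItems: Map[Int, Long] =
        rdd.mapPartitionsWithIndex { (idx, it) =>
          if (it.hasNext) Iterator((idx, it.next())) else Iterator.empty
        }.collect().toMap

      // Pass 2: slide over each partition, appending the next partition's
      // first element so the pair straddling the boundary is not missed.
      rdd.mapPartitionsWithIndex { (idx, it) =>
        val extended = firstItems.get(idx + 1) match {
          case Some(head) => it ++ Iterator(head)
          case None       => it
        }
        extended.sliding(2).collect {
          case Seq(a, b) if b - a > maxGap => b - a
        }
      }
    }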



Re: Grouping runs of elements in an RDD

2015-07-02 Thread RJ Nowling
Thanks, Mohit.  It sounds like we're on the same page -- I used a similar
approach.


Grouping runs of elements in an RDD

2015-06-30 Thread RJ Nowling
Hi all,

I have a problem where I have an RDD of elements:

Item1 Item2 Item3 Item4 Item5 Item6 ...

and I want to run a function over them to decide which runs of elements to
group together:

[Item1 Item2] [Item3] [Item4 Item5 Item6] ...

Technically, I could use aggregate to do this, but I would have to use a
List of Lists of T, which would produce a very large collection in memory.

Is there an easy way to accomplish this? E.g., it would be nice to have a
version of aggregate where the combination function can return a complete
group that is added to the new RDD and an incomplete group which is passed
to the next call of the reduce function.
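
For illustration, here is a minimal local sketch of that idea, where the
fold emits completed groups and threads the incomplete group forward
(sameRun is a stand-in for the real grouping predicate):

    def groupRuns[T](items: Seq[T])(sameRun: (T, T) => Boolean): List[List[T]] = {
      val (done, pending) = items.foldLeft((List.empty[List[T]], List.empty[T])) {
        case ((groups, Nil), x) =>
          (groups, List(x))                // start the first group
        case ((groups, cur @ (last :: _)), x) if sameRun(last, x) =>
          (groups, x :: cur)               // extend the current group
        case ((groups, cur), x) =>
          (cur.reverse :: groups, List(x)) // close it, start a new one
      }
      (if (pending.nonEmpty) pending.reverse :: done else done).reverse
    }

Doing the same over an RDD is exactly where the partition-boundary problem
comes up, as discussed in the replies.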

Thanks,
RJ


Re: Grouping runs of elements in an RDD

2015-06-30 Thread Reynold Xin
Try mapPartitions, which gives you an iterator, and you can produce an
iterator back.
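
A minimal sketch of that approach (startsNewRun is a stand-in for the real
break predicate; runs that straddle a partition boundary come out split,
which is the boundary issue discussed elsewhere in this thread):

    import org.apache.spark.rdd.RDD

    def groupRunsPerPartition(rdd: RDD[String])(
        startsNewRun: (String, String) => Boolean): RDD[List[String]] =
      rdd.mapPartitions { it =>
        val in = it.buffered
        new Iterator[List[String]] {
          def hasNext: Boolean = in.hasNext
          def next(): List[String] = {
            // Consume elements into the current run until the predicate
            // says the next element starts a new run.
            val run = scala.collection.mutable.ListBuffer(in.next())
            while (in.hasNext && !startsNewRun(run.last, in.head))
              run += in.next()
            run.toList
          }
        }
      }

Both the input and the output are iterators, so no partition-sized
collection is materialized.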





Re: Grouping runs of elements in an RDD

2015-06-30 Thread RJ Nowling
That's an interesting idea!  I hadn't considered that.  However, looking at
the Partitioner interface, I would need to determine the partition from a
single key, which doesn't fit my case, unfortunately.  For my case, I need
to compare successive pairs of keys.  (I'm trying to re-join lines that were
split prematurely.)


Re: Grouping runs of elements in an RDD

2015-06-30 Thread RJ Nowling
Thanks, Reynold.  I still need to handle incomplete groups that span
partition boundaries, so I need a two-pass approach. I came up with a
somewhat hacky way to handle those, using the partition indices and
key-value pairs in a second pass after the first.

OCaml's standard library provides a function called group() that takes a
break function that operates on pairs of successive elements.  It seems a
similar approach could be used in Spark and would be more efficient than my
approach with key-value pairs, since you know the ordering of the partitions.

Has this need been expressed by others?


Re: Grouping runs of elements in an RDD

2015-06-30 Thread Abhishek R. Singh
Could you use a custom partitioner to preserve boundaries, such that all
related tuples end up on the same partition?
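
For illustration, the suggestion would look roughly like this (boundaryId is
a hypothetical function; note that getPartition only ever sees a single key,
which is the limitation RJ raises in his reply):

    import org.apache.spark.Partitioner

    // Route every tuple whose key maps to the same boundary id to the
    // same partition, so a group is never split across partitions.
    class BoundaryPartitioner(val numPartitions: Int, boundaryId: String => Int)
        extends Partitioner {
      def getPartition(key: Any): Int =
        math.abs(boundaryId(key.asInstanceOf[String])) % numPartitions
    }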
