Re: Grouping runs of elements in a RDD

2015-07-02 Thread RJ Nowling
at 12:21 PM, RJ Nowling rnowl...@gmail.com wrote: That's an interesting idea! I hadn't considered that. However, looking at the Partitioner interface, I would need to know from looking at a single key which doesn't fit my case, unfortunately. For my case, I need to compare successive pairs

Grouping runs of elements in a RDD

2015-06-30 Thread RJ Nowling
Hi all, I have a problem where I have an RDD of elements: Item1 Item2 Item3 Item4 Item5 Item6 ... and I want to run a function over them to decide which runs of elements to group together: [Item1 Item2] [Item3] [Item4 Item5 Item6] ... Technically, I could use aggregate to do this, but I would

Re: Grouping runs of elements in a RDD

2015-06-30 Thread RJ Nowling
prematurely.) On Tue, Jun 30, 2015 at 2:07 PM, Abhishek R. Singh abhis...@tetrationanalytics.com wrote: could you use a custom partitioner to preserve boundaries such that all related tuples end up on the same partition? On Jun 30, 2015, at 12:00 PM, RJ Nowling rnowl...@gmail.com wrote

Re: Grouping runs of elements in a RDD

2015-06-30 Thread RJ Nowling
by others? On Tue, Jun 30, 2015 at 1:03 PM, Reynold Xin r...@databricks.com wrote: Try mapPartitions, which gives you an iterator, and you can produce an iterator back. On Tue, Jun 30, 2015 at 11:01 AM, RJ Nowling rnowl...@gmail.com wrote: Hi all, I have a problem where I have an RDD of elements
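
The mapPartitions suggestion above (iterator in, iterator out) could be sketched roughly as follows. This is only an illustration in plain Python, not code from the thread: the names group_runs and same_run are hypothetical, and the pairwise predicate is an assumption about how "which runs to group" is decided. It also does not handle runs that span partition boundaries, which would need separate treatment.

```python
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def group_runs(items: Iterable[T],
               same_run: Callable[[T, T], bool]) -> Iterator[List[T]]:
    """Lazily yield lists of consecutive items.

    A new run starts whenever same_run(previous, current) is false.
    Both names are hypothetical helpers for this sketch.
    """
    run: List[T] = []
    for item in items:
        # Compare each item with the last element of the current run.
        if run and not same_run(run[-1], item):
            yield run
            run = []
        run.append(item)
    if run:
        # Emit the final, still-open run.
        yield run


# Local usage: group consecutive integers that differ by exactly 1.
runs = list(group_runs([1, 2, 4, 6, 7, 8], lambda a, b: b - a == 1))

# With PySpark, the same function could be applied per partition, e.g.
#   rdd.mapPartitions(lambda it: group_runs(it, same_run))
# Caveat (assumption of this sketch): a run that crosses a partition
# boundary would be split into two groups.
```

Because the function consumes and produces iterators, it never needs to hold a whole partition in memory, only the current run.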

Best way to store RDD data?

2014-11-20 Thread RJ Nowling
Hi all, I'm working on an application that has several tables (RDDs of tuples) of data. Some of the types are complex-ish (e.g., date time objects). I'd like to use something like case classes for each entry. What is the best way to store the data to disk in a text format without writing custom

Re: Multitenancy in Spark - within/across spark context

2014-10-25 Thread RJ Nowling
Ashwin, What is your motivation for needing to share RDDs between jobs? Optimizing for reusing data across jobs? If so, you may want to look into Tachyon. My understanding is that Tachyon acts like a caching layer and you can designate when data will be reused in multiple jobs so it knows to keep

Re: New API for TFIDF generation in Spark 1.1.0

2014-09-19 Thread RJ Nowling
Jatin, If you file the JIRA and don't want to work on it, I'd be happy to step in and take a stab at it. RJ On Thu, Sep 18, 2014 at 4:08 PM, Xiangrui Meng men...@gmail.com wrote: Hi Jatin, HashingTF should be able to solve the memory problem if you use a small feature dimension in
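
The point about a small feature dimension bounding memory can be illustrated with a minimal sketch of the hashing trick. This is not Spark's HashingTF implementation: the function name hashing_tf, the num_features parameter, and the use of CRC32 as the hash are all assumptions made for the example.

```python
import zlib
from collections import Counter
from typing import Dict, Iterable


def hashing_tf(terms: Iterable[str], num_features: int) -> Dict[int, int]:
    """Map each term to a bucket in [0, num_features) and count occurrences.

    Hypothetical helper for illustration; CRC32 is used only because it is
    deterministic across runs, not because any library uses it.
    """
    counts: Counter = Counter()
    for term in terms:
        index = zlib.crc32(term.encode("utf-8")) % num_features
        counts[index] += 1
    return dict(counts)


# The vector never has more than num_features entries, regardless of how
# large the vocabulary is; distinct terms may collide in the same bucket.
vec = hashing_tf(["spark", "rdd", "spark"], num_features=16)
```

The trade-off is that a smaller dimension saves memory but increases the chance that unrelated terms share a bucket.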