Re: Grouping runs of elements in a RDD

2015-07-02 Thread RJ Nowling
> > If you can tolerate a few inaccuracies then you can just do the second
> > step. You will miss the “boundaries” of the partitions but it might be
> > acceptable for your use case.
> >
> > On Tue, Jun 30, 2015 at 12:21 PM, RJ Nowling wrote:
> >> That's

Re: Grouping runs of elements in a RDD

2015-06-30 Thread RJ Nowling
ines that were split prematurely.)
On Tue, Jun 30, 2015 at 2:07 PM, Abhishek R. Singh <abhis...@tetrationanalytics.com> wrote:
> could you use a custom partitioner to preserve boundaries such that all
> related tuples end up on the same partition?
>
> On Jun 30, 2015, at 1

Re: Grouping runs of elements in a RDD

2015-06-30 Thread RJ Nowling
ed by others?
On Tue, Jun 30, 2015 at 1:03 PM, Reynold Xin wrote:
> Try mapPartitions, which gives you an iterator, and you can produce an
> iterator back.
>
> On Tue, Jun 30, 2015 at 11:01 AM, RJ Nowling wrote:
>> Hi all,
>>
>> I have a problem where I have a

Grouping runs of elements in a RDD

2015-06-30 Thread RJ Nowling
Hi all, I have a problem where I have an RDD of elements: Item1 Item2 Item3 Item4 Item5 Item6 ... and I want to run a function over them to decide which runs of elements to group together: [Item1 Item2] [Item3] [Item4 Item5 Item6] ... Technically, I could use aggregate to do this, but I would h
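The grouping described in this thread can be sketched as a function from an iterator of items to an iterator of runs, which is the shape that mapPartitions expects. This is only a minimal sketch in plain Python, not code from the thread: the names group_runs, same_run and same_decade are hypothetical, and the rule used to decide where a run ends is an invented stand-in for whatever function the poster had in mind.

```python
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def group_runs(items: Iterable[T],
               same_run: Callable[[T, T], bool]) -> Iterator[List[T]]:
    """Yield lists of consecutive items.

    A new run starts whenever same_run(previous, current) is False.
    Both names are hypothetical; the thread does not define them.
    """
    run: List[T] = []
    for item in items:
        # Close the current run when the new item does not belong to it.
        if run and not same_run(run[-1], item):
            yield run
            run = []
        run.append(item)
    # Emit the trailing run, if any.
    if run:
        yield run


# Assumed example rule: neighbouring numbers in the same "ten" form one run.
def same_decade(a: int, b: int) -> bool:
    return a // 10 == b // 10


runs = list(group_runs([1, 2, 15, 21, 22, 23], same_decade))
print(runs)
```

With PySpark, one way to apply this per partition would be rdd.mapPartitions(lambda it: group_runs(it, same_decade)). Each partition is processed on its own, so a run that straddles a partition boundary would come out as two runs and would need a second pass to merge.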

Best way to store RDD data?

2014-11-20 Thread RJ Nowling
Hi all, I'm working on an application that has several tables (RDDs of tuples) of data. Some of the types are complex-ish (e.g., date time objects). I'd like to use something like case classes for each entry. What is the best way to store the data to disk in a text format without writing custom p

Re: Multitenancy in Spark - within/across spark context

2014-10-25 Thread RJ Nowling
Ashwin, What is your motivation for needing to share RDDs between jobs? Optimizing for reusing data across jobs? If so, you may want to look into Tachyon. My understanding is that Tachyon acts like a caching layer and you can designate when data will be reused in multiple jobs so it knows to keep

Re: New API for TFIDF generation in Spark 1.1.0

2014-09-19 Thread RJ Nowling
Jatin, If you file the JIRA and don't want to work on it, I'd be happy to step in and take a stab at it. RJ
On Thu, Sep 18, 2014 at 4:08 PM, Xiangrui Meng wrote:
> Hi Jatin,
>
> HashingTF should be able to solve the memory problem if you use a
> small feature dimension in HashingTF. Please do
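The point about a small feature dimension can be illustrated with a rough sketch of the hashing trick in plain Python. This is not MLlib's HashingTF: the function name hashing_tf, the num_features parameter and the use of zlib.crc32 as the hash are all assumptions made for illustration. The idea is that every term is mapped to one of num_features buckets, so the size of the vector is bounded by the chosen dimension rather than by the vocabulary.

```python
import zlib
from collections import Counter
from typing import Dict, Iterable


def hashing_tf(terms: Iterable[str], num_features: int) -> Dict[int, int]:
    """Map terms to bucket counts using a fixed number of buckets.

    zlib.crc32 is a stand-in hash chosen because it is stable across
    runs; it is not the hash function MLlib uses.
    """
    counts: Counter = Counter()
    for term in terms:
        # Bucket index is always in the range [0, num_features).
        index = zlib.crc32(term.encode("utf-8")) % num_features
        counts[index] += 1
    return dict(counts)


# Three terms, 16 buckets: at most 16 distinct indices can ever appear.
vec = hashing_tf(["spark", "rdd", "spark"], num_features=16)
print(vec)
```

A smaller num_features lowers memory use at the cost of more collisions, where distinct terms share a bucket and their counts are added together.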