>
> > If you can tolerate a few inaccuracies then you can just do the second
> > step. You will miss the “boundaries” of the partitions but it might be
> > acceptable for your use case.
>
>
> On Tue, Jun 30, 2015 at 12:21 PM, RJ Nowling wrote:
>
>> That's
ines that were
split prematurely.)
On Tue, Jun 30, 2015 at 2:07 PM, Abhishek R. Singh <
abhis...@tetrationanalytics.com> wrote:
> could you use a custom partitioner to preserve boundaries such that all
> related tuples end up on the same partition?
>
> On Jun 30, 2015, at 1
ed by others?
On Tue, Jun 30, 2015 at 1:03 PM, Reynold Xin wrote:
> Try mapPartitions, which gives you an iterator, and you can produce an
> iterator back.
>
>
> On Tue, Jun 30, 2015 at 11:01 AM, RJ Nowling wrote:
>
>> Hi all,
>>
>> I have a problem where I have a
Hi all,
I have a problem where I have a RDD of elements:
Item1 Item2 Item3 Item4 Item5 Item6 ...
and I want to run a function over them to decide which runs of elements to
group together:
[Item1 Item2] [Item3] [Item4 Item5 Item6] ...
Technically, I could use aggregate to do this, but I would h
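
A minimal sketch of the mapPartitions suggestion from the replies, written in plain Python so it runs without Spark. The names group_runs and same_run are hypothetical, and the "consecutive integers" rule is only an illustrative stand-in for the real grouping function. The generator takes an iterator and produces an iterator of runs, which is the shape mapPartitions expects.

```python
def group_runs(items, same_run):
    """Yield lists of consecutive items that belong together.

    same_run(prev, cur) is a hypothetical predicate: it returns True
    when cur should join the run that currently ends with prev.
    """
    run = []
    for item in items:
        # Close the current run when the predicate rejects the new item.
        if run and not same_run(run[-1], item):
            yield run
            run = []
        run.append(item)
    # Emit the final run, if the input was not empty.
    if run:
        yield run


if __name__ == "__main__":
    # Illustrative rule (an assumption): consecutive integers form a run.
    data = [1, 2, 5, 6, 7, 10]
    print(list(group_runs(data, lambda a, b: b - a == 1)))
    # prints [[1, 2], [5, 6, 7], [10]]
```

In PySpark this could presumably be applied as rdd.mapPartitions(lambda it: group_runs(it, same_run)). Each partition is processed on its own, so a run that straddles a partition boundary comes out split in two, which is the boundary inaccuracy mentioned in the replies.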
Hi all,
I'm working on an application that has several tables (RDDs of tuples) of
data. Some of the types are complex-ish (e.g., date time objects). I'd like
to use something like case classes for each entry.
What is the best way to store the data to disk in a text format without
writing custom p
Ashwin,
What is your motivation for needing to share RDDs between jobs? Optimizing
for reusing data across jobs?
If so, you may want to look into Tachyon. My understanding is that Tachyon
acts like a caching layer and you can designate when data will be reused in
multiple jobs so it knows to keep
Jatin,
If you file the JIRA and don't want to work on it, I'd be happy to step in
and take a stab at it.
RJ
On Thu, Sep 18, 2014 at 4:08 PM, Xiangrui Meng wrote:
> Hi Jatin,
>
> HashingTF should be able to solve the memory problem if you use a
> small feature dimension in HashingTF. Please do
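
A rough sketch of the idea behind that advice, in plain Python rather than Spark. This is not Spark's HashingTF implementation; the function name hashing_tf, the num_features parameter, and the use of CRC32 as the hash are all assumptions made for illustration. Each term is hashed into one of a fixed number of buckets, so memory is bounded by the feature dimension instead of the vocabulary size, at the cost of occasional collisions.

```python
import zlib
from collections import Counter


def hashing_tf(terms, num_features=16):
    """Map terms to a sparse {bucket_index: count} dict of fixed dimension.

    CRC32 is used only because it is deterministic across runs; it is an
    assumption of this sketch, not the hash Spark uses.
    """
    counts = Counter()
    for term in terms:
        index = zlib.crc32(term.encode("utf-8")) % num_features
        counts[index] += 1
    return dict(counts)


if __name__ == "__main__":
    doc = "the quick brown fox jumps over the lazy dog".split()
    vector = hashing_tf(doc, num_features=8)
    # No more than 8 buckets are used, however large the vocabulary is.
    print(len(vector) <= 8, sum(vector.values()) == len(doc))
```

A smaller num_features saves memory but makes collisions between unrelated terms more likely.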