at 12:21 PM, RJ Nowling rnowl...@gmail.com wrote:
That's an interesting idea! I hadn't considered that. However, looking
at the Partitioner interface, I would need to know from looking at a single
key which doesn't fit my case, unfortunately. For my case, I need to
compare successive pairs
Hi all,
I have a problem where I have an RDD of elements:
Item1 Item2 Item3 Item4 Item5 Item6 ...
and I want to run a function over them to decide which runs of elements to
group together:
[Item1 Item2] [Item3] [Item4 Item5 Item6] ...
Technically, I could use aggregate to do this, but I would
prematurely.)
On Tue, Jun 30, 2015 at 2:07 PM, Abhishek R. Singh
abhis...@tetrationanalytics.com wrote:
could you use a custom partitioner to preserve boundaries such that all
related tuples end up on the same partition?
On Jun 30, 2015, at 12:00 PM, RJ Nowling rnowl...@gmail.com wrote:
by others?
On Tue, Jun 30, 2015 at 1:03 PM, Reynold Xin r...@databricks.com wrote:
Try mapPartitions, which gives you an iterator, and you can produce an
iterator back.
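
A minimal sketch of that suggestion, assuming PySpark and a hypothetical pairwise rule (`close_enough`) standing in for the real comparison of successive elements. The helper `group_runs` is not part of Spark; it only builds the iterator-to-iterator function that `mapPartitions` expects, and it is exercised here on a plain list so it runs without a cluster.

```python
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def group_runs(
    same_run: Callable[[T, T], bool]
) -> Callable[[Iterable[T]], Iterator[List[T]]]:
    """Build a per-partition function that groups successive elements.

    same_run(prev, cur) decides whether cur continues the run that
    currently ends with prev. (Hypothetical helper, not a Spark API.)
    """

    def per_partition(items: Iterable[T]) -> Iterator[List[T]]:
        run: List[T] = []
        for item in items:
            # Close the current run when the successive pair does not match.
            if run and not same_run(run[-1], item):
                yield run
                run = []
            run.append(item)
        # Emit the trailing run, if the partition was not empty.
        if run:
            yield run

    return per_partition


# Assumed rule for illustration: neighbours stay together while they
# differ by at most 1.
def close_enough(prev: int, cur: int) -> bool:
    return abs(cur - prev) <= 1


grouped = list(group_runs(close_enough)([1, 2, 5, 7, 8, 9]))
print(grouped)
```

With an actual RDD the same function would be passed as `rdd.mapPartitions(group_runs(close_enough))`. Note that this only sees one partition at a time, so a run that straddles a partition boundary comes out as two groups and would need a separate merge step.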
On Tue, Jun 30, 2015 at 11:01 AM, RJ Nowling rnowl...@gmail.com wrote:
Hi all,
I have a problem where I have an RDD of elements
Hi all,
I'm working on an application that has several tables (RDDs of tuples) of
data. Some of the types are complex-ish (e.g., date time objects). I'd like
to use something like case classes for each entry.
What is the best way to store the data to disk in a text format without
writing custom
Ashwin,
What is your motivation for needing to share RDDs between jobs? Optimizing
for reusing data across jobs?
If so, you may want to look into Tachyon. My understanding is that Tachyon
acts like a caching layer and you can designate when data will be reused in
multiple jobs so it knows to keep
Jatin,
If you file the JIRA and don't want to work on it, I'd be happy to step in
and take a stab at it.
RJ
On Thu, Sep 18, 2014 at 4:08 PM, Xiangrui Meng men...@gmail.com wrote:
Hi Jatin,
HashingTF should be able to solve the memory problem if you use a
small feature dimension in