Hi Asaf,
featurestream [1] is an internal project I'm playing with that includes
support for some of this, in particular:
* 1-pass random forest construction
* schema inference
* native support for text fields
Would this be of interest? It's not open source, but if there's sufficient
demand I
Have you tried just passing a path to ssc.textFileStream()? It
monitors the path for new files by looking at mtime/atime; all
new/touched files in the time window appear as an RDD in the DStream.
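
To make the time-window idea concrete, here is a minimal sketch in plain Python. It is not Spark's implementation: the function name files_in_window is hypothetical, and it assumes that only the modification time is checked.

```python
import os

def files_in_window(directory, window_start, window_end):
    """Return files whose mtime falls in [window_start, window_end).

    Illustration only: a stand-in for the selection a file-monitoring
    stream performs on each batch interval. Times are epoch seconds.
    """
    selected = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue  # skip subdirectories
        if window_start <= os.path.getmtime(path) < window_end:
            selected.append(path)
    return selected
```

Each batch would call this with the bounds of its own interval, so a file touched during that interval shows up in exactly one batch.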
On 1 December 2014 at 14:41, Benjamin Cuthbert cuthbert@gmail.com wrote:
All,
Is it possible
file = transform file into a bunch of records
What does this function do exactly? Does it load the file locally?
Spark supports RDDs exceeding global RAM (cf. the terasort example), but if
your example just loads each file locally, then this may cause problems.
Instead, you should load each file
with a binary format. The API allows reading out a
single record at a time, but I'm not sure how to get those records into
Spark (without reading everything into memory from a single file at once).
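
One way to avoid holding a whole file in memory is a generator that yields one record at a time. The sketch below is plain Python and assumes a made-up fixed-length layout (a little-endian int32 followed by a float64). The real format in this thread is not specified, so RECORD and iter_records are hypothetical names.

```python
import struct

# Hypothetical record layout: int32 id + float64 value, 12 bytes, no padding.
RECORD = struct.Struct("<id")

def iter_records(path):
    """Lazily yield (id, value) tuples from a binary file.

    Only one record is held in memory at a time. A trailing partial
    record, if any, is ignored.
    """
    with open(path, "rb") as f:
        while True:
            raw = f.read(RECORD.size)
            if len(raw) < RECORD.size:
                break
            yield RECORD.unpack(raw)
```

Because the records arrive one by one, they could be written out line by line to a text file and read back later with sc.textFile(), rather than being collected into a list first.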
On Mon, Dec 1, 2014 at 5:07 PM, Andy Twigg andy.tw...@gmail.com wrote:
file = transform file into a bunch
if there was a simpler (and perhaps more
efficient) approach.
Keith
On Mon, Dec 1, 2014 at 6:28 PM, Andy Twigg andy.tw...@gmail.com wrote:
Could you modify your function so that it streams through the files record
by record and outputs them to HDFS, then read them all in as RDDs and take
the union? That would
at 6:44 PM, Debasish Das debasish.da...@gmail.com wrote:
If the tree is too big, build it on GraphX, but it will need thorough
analysis so that the partitions are well balanced...
On Tue, Sep 30, 2014 at 2:45 PM, Andy Twigg andy.tw
Hi Boromir,
Assuming the tree fits in memory, and what you want to do is parallelize
the computation, the 'obvious' way is the following:
* broadcast the tree T to each worker (ok since it fits in memory)
* construct an RDD for the deepest level - each element in the RDD is
(parent,data_at_node)
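
The message is cut off here, so the following is only a guess at how the level-by-level idea might continue, written as a plain-Python sketch rather than Spark code. It assumes the computation combines each node's value into its parent's, working up from the deepest level. The parent dict plays the role of the broadcast tree, and aggregate_bottom_up is a hypothetical name.

```python
from collections import defaultdict

def aggregate_bottom_up(parent, data, combine):
    """Fold each node's value into its parent's, deepest level first.

    parent: dict node -> parent node (the root maps to None)
    data:   dict node -> value stored at that node
    """
    def depth(node):
        d = 0
        while parent[node] is not None:
            node = parent[node]
            d += 1
        return d

    by_depth = defaultdict(list)
    for node in parent:
        by_depth[depth(node)].append(node)

    result = dict(data)
    for d in range(max(by_depth), 0, -1):
        # The per-level "RDD": one (parent, data_at_node) pair per node.
        level = [(parent[n], result[n]) for n in by_depth[d]]
        # Stand-in for a reduce-by-key step: fold children into parents.
        for p, value in level:
            result[p] = combine(result[p], value)
    return result
```

With addition as the combine function, each node ends up holding the sum over its subtree. In Spark the inner loop would presumably be a keyed reduction over the level's RDD, but that part is an assumption.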