Re: Spark random forest - string data

2015-01-16 Thread Andy Twigg
Hi Asaf, featurestream [1] is an internal project I'm playing with that includes support for some of this, in particular:
* 1-pass random forest construction
* schema inference
* native support for text fields
Would this be of interest? It's not open source, but if there's sufficient demand I

Re: hdfs streaming context

2014-12-01 Thread Andy Twigg
Have you tried just passing a path to ssc.textFileStream()? It monitors the path for new files by looking at mtime/atime; all new/touched files in the time window appear as an RDD in the DStream. On 1 December 2014 at 14:41, Benjamin Cuthbert cuthbert@gmail.com wrote: All, Is it possible
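The mtime-based file detection described above can be sketched in plain Python. This is a stdlib illustration of the idea, not Spark's actual implementation; the function name and window handling are my own:

```python
import os


def new_files_since(path, last_scan_time):
    """Return files under `path` whose modification time is at or after
    `last_scan_time` -- roughly how a file stream decides which files
    belong to the current batch window."""
    found = []
    for name in os.listdir(path):
        full = os.path.join(path, name)
        if os.path.isfile(full) and os.path.getmtime(full) >= last_scan_time:
            found.append(full)
    return sorted(found)
```

In Spark itself you would simply call ssc.textFileStream(path); each batch interval then yields an RDD of the lines from files detected in this fashion.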

Re: Loading RDDs in a streaming fashion

2014-12-01 Thread Andy Twigg
file = transform file into a bunch of records What does this function do exactly? Does it load the file locally? Spark supports RDDs exceeding global RAM (cf. the terasort example), but if your example just loads each file locally, then this may cause problems. Instead, you should load each file

Re: Loading RDDs in a streaming fashion

2014-12-01 Thread Andy Twigg
with a binary format. The API allows reading out a single record at a time, but I'm not sure how to get those records into Spark (without reading everything into memory from a single file at once). On Mon, Dec 1, 2014 at 5:07 PM, Andy Twigg andy.tw...@gmail.com wrote: file = transform file into a bunch

Re: Loading RDDs in a streaming fashion

2014-12-01 Thread Andy Twigg
if there was a simpler (and perhaps more efficient) approach. Keith On Mon, Dec 1, 2014 at 6:28 PM, Andy Twigg andy.tw...@gmail.com wrote: Could you modify your function so that it streams through the files record by record and writes them out to HDFS, then read them all back in as RDDs and take the union? That would
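The stream-and-union approach suggested in this thread can be illustrated with a small stdlib Python sketch, with generators standing in for RDDs (in Spark itself this would be roughly sc.union([sc.textFile(p) for p in paths]); the helper names here are mine):

```python
import itertools


def records_from_file(path):
    # Stream one record at a time rather than loading the whole file
    # into memory; here a "record" is simply a line of text.
    with open(path) as f:
        for line in f:
            yield line.rstrip("\n")


def union_of_files(paths):
    # Lazily chain the per-file record streams -- the stand-in for
    # reading each converted file back in and taking the union of
    # the resulting RDDs.
    return itertools.chain.from_iterable(records_from_file(p) for p in paths)
```

Because everything is lazy, no file is ever held in memory in full, which is the point of streaming record by record before handing the data to Spark.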

Re: Handling tree reduction algorithm with Spark in parallel

2014-10-01 Thread Andy Twigg
at 6:44 PM, Debasish Das debasish.da...@gmail.com wrote: If the tree is too big, build it on GraphX, but it will need thorough analysis so that the partitions are well balanced... On Tue, Sep 30, 2014 at 2:45 PM, Andy Twigg andy.tw

Re: Handling tree reduction algorithm with Spark in parallel

2014-09-30 Thread Andy Twigg
Hi Boromir, Assuming the tree fits in memory, and what you want to do is parallelize the computation, the 'obvious' way is the following:
* broadcast the tree T to each worker (ok since it fits in memory)
* construct an RDD for the deepest level - each element in the RDD is (parent, data_at_node)
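The level-by-level reduction outlined above can be sketched in plain Python as a single-machine stand-in for the RDD version. In Spark the frontier would be an RDD of (parent, data_at_node) pairs reduced with a groupBy on parent, with the tree itself shipped via sc.broadcast; all names below are illustrative:

```python
from collections import defaultdict


def reduce_tree(parent_of, depth_of, leaf_data, combine):
    """Bottom-up, level-by-level reduction of a tree.

    parent_of: node -> parent node (the root maps to None)
    depth_of:  node -> depth (root = 0)
    leaf_data: leaf node -> initial value
    combine:   merges a list of child values into one parent value
    """
    level = dict(leaf_data)  # current frontier: node -> data_at_node
    while not (len(level) == 1 and parent_of[next(iter(level))] is None):
        depth = max(depth_of[n] for n in level)
        # Group the deepest nodes by parent (the groupBy step) ...
        grouped = defaultdict(list)
        for n in [m for m in level if depth_of[m] == depth]:
            grouped[parent_of[n]].append(level.pop(n))
        # ... and reduce each group into a value for the level above.
        for parent, vals in grouped.items():
            level[parent] = combine(vals)
    return next(iter(level.values()))
```

For example, on a tree with root r, internal node a (children x, y) and leaf b, with leaf values {x: 1, y: 2, b: 4} and combine = sum, the frontier contracts from the leaves up to a single value at r.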