You may be able to construct an RDD directly from an iterator - I'm not sure - you may have to subclass RDD yourself.
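Something like this might work - an untested sketch of a custom RDD with one partition per file, where you pass in a function that wraps your record-at-a-time binary reader as an Iterator (openReader below is a stand-in for your API, not a real Spark call):

import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// One partition per input file, so each file is read by a single task.
class FilePartition(val index: Int, val path: String) extends Partition

class RecordRDD(sc: SparkContext,
                paths: Seq[String],
                openReader: String => Iterator[Array[Byte]])
    extends RDD[Array[Byte]](sc, Nil) {

  override protected def getPartitions: Array[Partition] =
    paths.zipWithIndex
         .map { case (p, i) => new FilePartition(i, p) }
         .toArray

  override def compute(split: Partition, context: TaskContext): Iterator[Array[Byte]] =
    // Spark consumes this iterator lazily, so only one record needs
    // to be in memory at a time.
    openReader(split.asInstanceOf[FilePartition].path)
}

// e.g. val rdd = new RecordRDD(sc, fileNames, path => myReader(path))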
On 1 December 2014 at 18:40, Keith Simmons <ke...@pulse.io> wrote:
> Yep, that's definitely possible. It's one of the workarounds I was
> considering. I was just curious if there was a simpler (and perhaps more
> efficient) approach.
>
> Keith
>
> On Mon, Dec 1, 2014 at 6:28 PM, Andy Twigg <andy.tw...@gmail.com> wrote:
>>
>> Could you modify your function so that it streams through the files
>> record by record and writes them out to HDFS, then read them all back in
>> as RDDs and take the union? That would use only bounded memory.
>>
>> On 1 December 2014 at 17:19, Keith Simmons <ke...@pulse.io> wrote:
>>>
>>> Actually, I'm working with a binary format. The API allows reading out
>>> a single record at a time, but I'm not sure how to get those records
>>> into Spark (without reading everything into memory from a single file
>>> at once).
>>>
>>> On Mon, Dec 1, 2014 at 5:07 PM, Andy Twigg <andy.tw...@gmail.com> wrote:
>>>>>
>>>>> file => transform file into a bunch of records
>>>>
>>>> What does this function do exactly? Does it load the file locally?
>>>> Spark supports RDDs exceeding global RAM (cf. the terasort example),
>>>> but if your function loads each file into local memory, that may cause
>>>> problems. Instead, you should load each file into an RDD with
>>>> context.textFile(), flatMap that, and union the resulting RDDs.
>>>>
>>>> Also see:
>>>>
>>>> http://stackoverflow.com/questions/23397907/spark-context-textfile-load-multiple-files
>>>>
>>>> On 1 December 2014 at 16:50, Keith Simmons <ke...@pulse.io> wrote:
>>>>>
>>>>> This is a long shot, but...
>>>>>
>>>>> I'm trying to load a bunch of files spread out over HDFS into an RDD,
>>>>> and in most cases it works well, but for a few very large files, I
>>>>> exceed available memory. My current workflow basically works like
>>>>> this:
>>>>>
>>>>> context.parallelize(fileNames).flatMap { file =>
>>>>>   transform file into a bunch of records
>>>>> }
>>>>>
>>>>> I'm wondering if there are any APIs to somehow "flush" the records of
>>>>> a big dataset so I don't have to load them all into memory at once. I
>>>>> know this doesn't exist, but conceptually:
>>>>>
>>>>> context.parallelize(fileNames).streamMap { (file, stream) =>
>>>>>   for every 10K records, write them to the stream and flush
>>>>> }
>>>>>
>>>>> Keith
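In case it helps, here is what the two suggestions above look like in code - a rough sketch, assuming sc and fileNames are in scope, and with openReader as a hypothetical wrapper that exposes the binary format's record-at-a-time API as an Iterator:

// Andy's suggestion: one RDD per file via textFile, then union them.
// sc.union avoids building a deep chain of pairwise unions.
val combined = sc.union(fileNames.map(path => sc.textFile(path)))

// Closest existing equivalent of the hypothetical streamMap: flatMap
// accepts a function returning an Iterator, which Spark consumes
// lazily, so records stream through each task one at a time rather
// than being materialized all at once.
val records = sc.parallelize(fileNames).flatMap { file =>
  openReader(file)  // openReader(file): Iterator[Array[Byte]] (assumed)
}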