This is a common use case, and it is how the Hadoop APIs for reading data work: they return an Iterator[YourRecord] instead of reading every record in at once.
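A rough Scala sketch of that pattern (not code from this thread; BinaryRecordReader, openReader and the example paths are placeholders standing in for whatever record-at-a-time API the binary format provides):

    import org.apache.spark.{SparkConf, SparkContext}

    // Placeholder for the format's record-at-a-time API (hypothetical names).
    trait BinaryRecordReader {
      def readRecord(): Array[Byte]   // assumed to return null at end of file
      def close(): Unit
    }
    def openReader(path: String): BinaryRecordReader = ???  // supplied by the format's library

    // Returning an Iterator lets flatMap pull records lazily, so no single
    // file has to be fully materialized in a task's memory.
    def records(path: String): Iterator[Array[Byte]] = {
      val reader = openReader(path)
      Iterator.continually(reader.readRecord()).takeWhile(_ != null)
    }

    val sc = new SparkContext(new SparkConf().setAppName("record-iterator"))
    val fileNames = Seq("hdfs:///data/part-00000", "hdfs:///data/part-00001")  // illustrative paths
    val rdd = sc.parallelize(fileNames).flatMap(records)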
On Dec 1, 2014 9:43 PM, "Andy Twigg" <andy.tw...@gmail.com> wrote:

> You may be able to construct RDDs directly from an iterator - not sure
> - you may have to subclass your own.
>
> On 1 December 2014 at 18:40, Keith Simmons <ke...@pulse.io> wrote:
> > Yep, that's definitely possible. It's one of the workarounds I was
> > considering. I was just curious if there was a simpler (and perhaps more
> > efficient) approach.
> >
> > Keith
> >
> > On Mon, Dec 1, 2014 at 6:28 PM, Andy Twigg <andy.tw...@gmail.com> wrote:
> >>
> >> Could you modify your function so that it streams through the files record
> >> by record and outputs them to hdfs, then read them all in as RDDs and take
> >> the union? That would only use bounded memory.
> >>
> >> On 1 December 2014 at 17:19, Keith Simmons <ke...@pulse.io> wrote:
> >>>
> >>> Actually, I'm working with a binary format. The api allows reading out a
> >>> single record at a time, but I'm not sure how to get those records into
> >>> spark (without reading everything into memory from a single file at once).
> >>>
> >>> On Mon, Dec 1, 2014 at 5:07 PM, Andy Twigg <andy.tw...@gmail.com> wrote:
> >>>>>
> >>>>> file => transform file into a bunch of records
> >>>>
> >>>> What does this function do exactly? Does it load the file locally?
> >>>> Spark supports RDDs exceeding global RAM (cf the terasort example), but
> >>>> if your example just loads each file locally, then this may cause problems.
> >>>> Instead, you should load each file into an rdd with context.textFile(),
> >>>> flatmap that and union these rdds.
> >>>>
> >>>> also see
> >>>> http://stackoverflow.com/questions/23397907/spark-context-textfile-load-multiple-files
> >>>>
> >>>> On 1 December 2014 at 16:50, Keith Simmons <ke...@pulse.io> wrote:
> >>>>>
> >>>>> This is a long shot, but...
> >>>>>
> >>>>> I'm trying to load a bunch of files spread out over hdfs into an RDD,
> >>>>> and in most cases it works well, but for a few very large files, I exceed
> >>>>> available memory. My current workflow basically works like this:
> >>>>>
> >>>>> context.parallelize(fileNames).flatMap { file =>
> >>>>>   transform file into a bunch of records
> >>>>> }
> >>>>>
> >>>>> I'm wondering if there are any APIs to somehow "flush" the records of a
> >>>>> big dataset so I don't have to load them all into memory at once. I know
> >>>>> this doesn't exist, but conceptually:
> >>>>>
> >>>>> context.parallelize(fileNames).streamMap { (file, stream) =>
> >>>>>   for every 10K records write records to stream and flush
> >>>>> }
> >>>>>
> >>>>> Keith
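For the plain-text case Andy describes in the quoted thread (one RDD per file via context.textFile(), flatMap each, then take the union), a minimal sketch might look like this; parseLine and the paths are illustrative, not from the thread:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.rdd.RDD

    val sc = new SparkContext(new SparkConf().setAppName("union-of-files"))
    val fileNames = Seq("hdfs:///data/a.txt", "hdfs:///data/b.txt")  // illustrative paths

    // Stand-in for whatever per-line parsing the records need.
    def parseLine(line: String): Seq[String] = line.split(',').toSeq

    // One RDD per file, flatMapped, then unioned; nothing is pulled onto the driver.
    val perFile: Seq[RDD[String]] = fileNames.map(path => sc.textFile(path).flatMap(parseLine))
    val all: RDD[String] = sc.union(perFile)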