This is a common use case, and it's how the Hadoop APIs for reading data
work: they return an Iterator[YourRecord] instead of reading every record
in at once.
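
For example, something along these lines (untested sketch; openRecords is
a placeholder for whatever your binary format's read-one-record-at-a-time
API looks like, and sc is the SparkContext, e.g. in spark-shell) keeps
memory bounded because flatMap consumes the iterator lazily:

// Hypothetical: wraps your format's record-at-a-time reader in a lazy
// Iterator. Nothing is read until Spark pulls the next record.
def openRecords(path: String): Iterator[String] = ???

val fileNames = Seq("hdfs:///data/part-00001.bin",
                    "hdfs:///data/part-00002.bin")  // placeholder paths

// flatMap accepts a TraversableOnce, so returning an Iterator means records
// flow through the task pipeline one at a time; no file is held whole in memory.
val records = sc.parallelize(fileNames).flatMap(openRecords)
records.count()  // or any downstream transformation/action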
On Dec 1, 2014 9:43 PM, "Andy Twigg" <andy.tw...@gmail.com> wrote:

> You may be able to construct RDDs directly from an iterator - not sure
> - you may have to subclass RDD yourself.
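> A rough sketch of what a subclass might look like (untested; openReader
> is a placeholder for your binary format's record-at-a-time reader, it
> must be serializable, and this gives one partition per input file):
>
> import org.apache.spark.{Partition, SparkContext, TaskContext}
> import org.apache.spark.rdd.RDD
> import scala.reflect.ClassTag
>
> // one partition per file
> class FilePartition(val index: Int, val path: String) extends Partition
>
> class BinaryFileRDD[T: ClassTag](
>     sc: SparkContext,
>     paths: Seq[String],
>     openReader: String => Iterator[T]  // placeholder: opens one file lazily
> ) extends RDD[T](sc, Nil) {
>
>   override protected def getPartitions: Array[Partition] =
>     paths.zipWithIndex
>       .map { case (p, i) => new FilePartition(i, p): Partition }.toArray
>
>   override def compute(split: Partition, context: TaskContext): Iterator[T] =
>     openReader(split.asInstanceOf[FilePartition].path)
> }
>
> new BinaryFileRDD(sc, fileNames, openReader) would then behave like any
> other RDD, and each task streams records instead of materializing a file.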
>
> On 1 December 2014 at 18:40, Keith Simmons <ke...@pulse.io> wrote:
> > Yep, that's definitely possible.  It's one of the workarounds I was
> > considering.  I was just curious if there was a simpler (and perhaps more
> > efficient) approach.
> >
> > Keith
> >
> > On Mon, Dec 1, 2014 at 6:28 PM, Andy Twigg <andy.tw...@gmail.com> wrote:
> >>
> >> Could you modify your function so that it streams through the files
> >> record by record and outputs them to hdfs, then read them all in as
> >> RDDs and take the union? That would only use bounded memory.
> >>
> >> On 1 December 2014 at 17:19, Keith Simmons <ke...@pulse.io> wrote:
> >>>
> >>> Actually, I'm working with a binary format.  The api allows reading
> >>> out a single record at a time, but I'm not sure how to get those
> >>> records into spark (without reading everything into memory from a
> >>> single file at once).
> >>>
> >>>
> >>>
> >>> On Mon, Dec 1, 2014 at 5:07 PM, Andy Twigg <andy.tw...@gmail.com>
> wrote:
> >>>>>
> >>>>> file => transform file into a bunch of records
> >>>>
> >>>>
> >>>> What does this function do exactly? Does it load the file locally?
> >>>> Spark supports RDDs exceeding global RAM (cf the terasort example),
> >>>> but if your example just loads each file locally, then this may
> >>>> cause problems. Instead, you should load each file into an rdd with
> >>>> context.textFile(), flatmap that and union these rdds.
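> >>>> e.g. something like (untested; parseLine is a placeholder for
> >>>> turning one line into your records):
> >>>>
> >>>> val rdds = fileNames.map(f => sc.textFile(f).flatMap(parseLine))
> >>>> val all = sc.union(rdds)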
> >>>>
> >>>> also see
> >>>>
> >>>> http://stackoverflow.com/questions/23397907/spark-context-textfile-load-multiple-files
> >>>>
> >>>>
> >>>> On 1 December 2014 at 16:50, Keith Simmons <ke...@pulse.io> wrote:
> >>>>>
> >>>>> This is a long shot, but...
> >>>>>
> >>>>> I'm trying to load a bunch of files spread out over hdfs into an
> >>>>> RDD, and in most cases it works well, but for a few very large
> >>>>> files, I exceed available memory.  My current workflow basically
> >>>>> works like this:
> >>>>>
> >>>>> context.parallelize(fileNames).flatMap { file =>
> >>>>>   transform file into a bunch of records
> >>>>> }
> >>>>>
> >>>>> I'm wondering if there are any APIs to somehow "flush" the records
> >>>>> of a big dataset so I don't have to load them all into memory at
> >>>>> once.  I know this doesn't exist, but conceptually:
> >>>>>
> >>>>> context.parallelize(fileNames).streamMap { (file, stream) =>
> >>>>>  for every 10K records write records to stream and flush
> >>>>> }
> >>>>>
> >>>>> Keith
> >>>>
> >>>>
> >>>
> >>
> >
>
