LZO files are not splittable by default, but there are projects that provide input and output formats for splittable LZO files. Check out Twitter's elephant-bird on GitHub.
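Roughly like this from spark-shell (an untested sketch, not a drop-in: the path is made up, and it assumes the hadoop-lzo native libraries and the elephant-bird jars are on the classpath, and that you have run the LZO indexer so each .lzo file has a matching .index file):

    import com.twitter.elephantbird.mapreduce.input.LzoTextInputFormat
    import org.apache.hadoop.io.{LongWritable, Text}

    // With .index files present, LzoTextInputFormat creates one split per
    // indexed block, so one large .lzo file becomes many partitions.
    // Without them it falls back (as far as I remember) to one split per
    // file, which would explain a single task choking on a huge file.
    val lines = sc.newAPIHadoopFile(
      "hdfs:///data/input",              // hypothetical path
      classOf[LzoTextInputFormat],
      classOf[LongWritable],
      classOf[Text]
    ).map { case (_, text) => text.toString }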
On Wednesday, October 7, 2015, Mohammed Guller <moham...@glassbeam.com> wrote:

> It is not uncommon to process datasets larger than available memory with
> Spark.
>
> I don't remember whether LZO files are splittable. Perhaps, in your case,
> Spark is running into issues while decompressing a large LZO file.
>
> See if this helps:
> http://stackoverflow.com/questions/25248170/spark-hadoop-throws-exception-for-large-lzo-files
>
> Mohammed
>
> -----Original Message-----
> From: Matt Narrell [mailto:matt.narr...@gmail.com]
> Sent: Tuesday, October 6, 2015 4:08 PM
> To: Mohammed Guller
> Cc: davidkl; user@spark.apache.org
> Subject: Re: laziness in textFile reading from HDFS?
>
> Agreed. This is Spark 1.2 on CDH 5.x. How do you mitigate when the data
> sets are larger than available memory?
>
> My jobs stall and hit GC/heap issues all over the place.
>
> ..via mobile
>
> > On Oct 6, 2015, at 4:44 PM, Mohammed Guller <moham...@glassbeam.com> wrote:
> >
> > I have not used LZO compressed files from Spark, so I am not sure why it
> > stalls without caching.
> >
> > In general, if you are going to make just one pass over the data, there
> > is not much benefit in caching it. The data gets read anyway only after
> > the first action is called. If you are calling just a map operation and
> > then a save operation, I don't see how caching would help.
> >
> > Mohammed
> >
> > -----Original Message-----
> > From: Matt Narrell [mailto:matt.narr...@gmail.com]
> > Sent: Tuesday, October 6, 2015 3:32 PM
> > To: Mohammed Guller
> > Cc: davidkl; user@spark.apache.org
> > Subject: Re: laziness in textFile reading from HDFS?
> >
> > One.
> >
> > I read in LZO compressed files from HDFS, perform a map operation, cache
> > the results of this map operation, and call saveAsHadoopFile to write LZO
> > back to HDFS.
> >
> > Without the cache, the job will stall.
> >
> > mn
> >
> >> On Oct 5, 2015, at 7:25 PM, Mohammed Guller <moham...@glassbeam.com> wrote:
> >>
> >> Is there any specific reason for caching the RDD? How many passes do you
> >> make over the dataset?
> >>
> >> Mohammed
> >>
> >> -----Original Message-----
> >> From: Matt Narrell [mailto:matt.narr...@gmail.com]
> >> Sent: Saturday, October 3, 2015 9:50 PM
> >> To: Mohammed Guller
> >> Cc: davidkl; user@spark.apache.org
> >> Subject: Re: laziness in textFile reading from HDFS?
> >>
> >> Is there any more information or best practices here? I have the exact
> >> same issues when reading large data sets from HDFS (larger than available
> >> RAM), and I cannot run without setting the RDD persistence level to
> >> MEMORY_AND_DISK_SER and using nearly all the cluster resources.
> >>
> >> Should I repartition this RDD to be equal to the number of cores?
> >>
> >> I notice that the job duration on the YARN UI is about 30 minutes longer
> >> than on the Spark UI. When the job initially starts, there are no tasks
> >> shown in the Spark UI..?
> >>
> >> All I'm doing is reading records from HDFS text files with sc.textFile,
> >> and rewriting them back to HDFS grouped by a timestamp.
> >>
> >> Thanks,
> >> mn
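Interjecting on the persistence question above: if the job really is read -> map -> write in a single pass, a cache has nothing to amortize, and with data larger than RAM the default MEMORY_ONLY level mostly buys you GC pressure. A minimal sketch of the alternative (made-up paths, and parseRecord is a placeholder for whatever transform you apply; sc is the spark-shell context):

    import org.apache.spark.storage.StorageLevel

    // Lazy: nothing is read until the action below runs.
    val records = sc.textFile("hdfs:///data/input").map(parseRecord)

    // One pass: no persist at all. Each partition streams from HDFS,
    // through the map, and back out without accumulating in memory.
    records.saveAsTextFile("hdfs:///data/output")

    // Only if a second pass is unavoidable, prefer a serialized,
    // disk-backed level over the default MEMORY_ONLY:
    // records.persist(StorageLevel.MEMORY_AND_DISK_SER)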
> >>
> >>> On Sep 29, 2015, at 8:06 PM, Mohammed Guller <moham...@glassbeam.com> wrote:
> >>>
> >>> 1) It is not required to have the same amount of memory as data.
> >>> 2) By default, the number of partitions is equal to the number of HDFS
> >>> blocks.
> >>> 3) Yes, the read operation is lazy.
> >>> 4) It is okay to have more partitions than cores.
> >>>
> >>> Mohammed
> >>>
> >>> -----Original Message-----
> >>> From: davidkl [mailto:davidkl...@hotmail.com]
> >>> Sent: Monday, September 28, 2015 1:40 AM
> >>> To: user@spark.apache.org
> >>> Subject: laziness in textFile reading from HDFS?
> >>>
> >>> Hello,
> >>>
> >>> I need to process a significant amount of data every day, about 4TB. This
> >>> will be processed in batches of about 140GB. The cluster this will be
> >>> running on doesn't have enough memory to hold the dataset at once, so I
> >>> am trying to understand how this works internally.
> >>>
> >>> When using textFile to read an HDFS folder (containing multiple files), I
> >>> understand that the number of partitions created is equal to the number
> >>> of HDFS blocks, correct? Are those created in a lazy way? I mean, if the
> >>> number of blocks/partitions is larger than the number of cores/threads
> >>> the Spark driver was launched with (N), are N partitions created
> >>> initially and then the rest when required? Or are all those partitions
> >>> created up front?
> >>>
> >>> I want to avoid reading the whole data into memory just to spill it out
> >>> to disk if there is not enough memory.
> >>>
> >>> Thanks!
> >>>
> >>> --
> >>> View this message in context:
> >>> http://apache-spark-user-list.1001560.n3.nabble.com/laziness-in-textFile-reading-from-HDFS-tp24837.html
> >>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
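To tie this back to the original four questions: the read is lazy and the partition count comes from the HDFS splits, but you are not stuck with one partition per block. Asking textFile for more partitions up front keeps individual tasks, and their memory footprint, small, which matters when the dataset is larger than RAM. A quick illustration (the path and the 2000 are arbitrary):

    // The second argument is a *minimum* partition count. This only
    // affects split planning; no data is read until an action runs.
    val rdd = sc.textFile("hdfs:///data/input", 2000)

    // Computes the splits (file listing only), still reads no records.
    println(rdd.partitions.length)

Having many more partitions than cores is fine, per point 4: each core just works through a queue of small tasks rather than a few huge ones.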