LZO files are not splittable by default, but there are projects that provide input and output formats for splittable LZO files. Check out Twitter's elephant-bird on GitHub.
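Roughly like this from spark-shell (an untested sketch, not a drop-in: the path is made up, and it assumes the hadoop-lzo native libraries and the elephant-bird jars are on the classpath, and that you have run the LZO indexer so each .lzo file has a matching .index file):

    import com.twitter.elephantbird.mapreduce.input.LzoTextInputFormat
    import org.apache.hadoop.io.{LongWritable, Text}

    // With .index files present, LzoTextInputFormat creates one split per
    // indexed block, so one large .lzo file becomes many partitions.
    // Without them it falls back (as far as I remember) to one split per
    // file, which would explain a single task choking on a huge file.
    val lines = sc.newAPIHadoopFile(
      "hdfs:///data/input",              // hypothetical path
      classOf[LzoTextInputFormat],
      classOf[LongWritable],
      classOf[Text]
    ).map { case (_, text) => text.toString }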
On Wednesday, October 7, 2015, Mohammed Guller <moham...@glassbeam.com> wrote:

> It is not uncommon to process datasets larger than available memory with
> Spark.
>
> I don't remember whether LZO files are splittable. Perhaps, in your case,
> Spark is running into issues while decompressing a large LZO file.
>
> See if this helps:
> http://stackoverflow.com/questions/25248170/spark-hadoop-throws-exception-for-large-lzo-files
>
> Mohammed
>
> -----Original Message-----
> From: Matt Narrell [mailto:matt.narr...@gmail.com]
> Sent: Tuesday, October 6, 2015 4:08 PM
> To: Mohammed Guller
> Cc: davidkl; user@spark.apache.org
> Subject: Re: laziness in textFile reading from HDFS?
>
> Agreed. This is Spark 1.2 on CDH 5.x. How do you mitigate when the data
> sets are larger than available memory?
>
> My jobs stall and hit GC/heap issues all over the place.
>
> ..via mobile
>
> > On Oct 6, 2015, at 4:44 PM, Mohammed Guller <moham...@glassbeam.com> wrote:
> >
> > I have not used LZO compressed files from Spark, so I am not sure why it
> > stalls without caching.
> >
> > In general, if you are going to make just one pass over the data, there
> > is not much benefit in caching it. The data gets read anyway only after
> > the first action is called. If you are calling just a map operation and
> > then a save operation, I don't see how caching would help.
> >
> > Mohammed
> >
> > -----Original Message-----
> > From: Matt Narrell [mailto:matt.narr...@gmail.com]
> > Sent: Tuesday, October 6, 2015 3:32 PM
> > To: Mohammed Guller
> > Cc: davidkl; user@spark.apache.org
> > Subject: Re: laziness in textFile reading from HDFS?
> >
> > One.
> >
> > I read in LZO compressed files from HDFS, perform a map operation, cache
> > the results of this map operation, and call saveAsHadoopFile to write LZO
> > back to HDFS.
> >
> > Without the cache, the job will stall.
> >
> > mn
> >
> >> On Oct 5, 2015, at 7:25 PM, Mohammed Guller <moham...@glassbeam.com> wrote:
> >>
> >> Is there any specific reason for caching the RDD? How many passes do you
> >> make over the dataset?
> >>
> >> Mohammed
> >>
> >> -----Original Message-----
> >> From: Matt Narrell [mailto:matt.narr...@gmail.com]
> >> Sent: Saturday, October 3, 2015 9:50 PM
> >> To: Mohammed Guller
> >> Cc: davidkl; user@spark.apache.org
> >> Subject: Re: laziness in textFile reading from HDFS?
> >>
> >> Is there any more information or best practices here? I have the exact
> >> same issues when reading large data sets from HDFS (larger than available
> >> RAM), and I cannot run without setting the RDD persistence level to
> >> MEMORY_AND_DISK_SER and using nearly all the cluster resources.
> >>
> >> Should I repartition this RDD to be equal to the number of cores?
> >>
> >> I notice that the job duration on the YARN UI is about 30 minutes longer
> >> than on the Spark UI. When the job initially starts, there are no tasks
> >> shown in the Spark UI..?
> >>
> >> All I'm doing is reading records from HDFS text files with sc.textFile,
> >> and rewriting them back to HDFS grouped by a timestamp.
> >>
> >> Thanks,
> >> mn
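Interjecting on the persistence question above: if the job really is read -> map -> write in a single pass, a cache has nothing to amortize, and with data larger than RAM the default MEMORY_ONLY level mostly buys you GC pressure. A minimal sketch of the alternative (made-up paths, and parseRecord is a placeholder for whatever transform you apply; sc is the spark-shell context):

    import org.apache.spark.storage.StorageLevel

    // Lazy: nothing is read until the action below runs.
    val records = sc.textFile("hdfs:///data/input").map(parseRecord)

    // One pass: no persist at all. Each partition streams from HDFS,
    // through the map, and back out without accumulating in memory.
    records.saveAsTextFile("hdfs:///data/output")

    // Only if a second pass is unavoidable, prefer a serialized,
    // disk-backed level over the default MEMORY_ONLY:
    // records.persist(StorageLevel.MEMORY_AND_DISK_SER)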
> >>
> >>> On Sep 29, 2015, at 8:06 PM, Mohammed Guller <moham...@glassbeam.com> wrote:
> >>>
> >>> 1) It is not required to have the same amount of memory as data.
> >>> 2) By default, the number of partitions is equal to the number of HDFS
> >>> blocks.
> >>> 3) Yes, the read operation is lazy.
> >>> 4) It is okay to have more partitions than cores.
> >>>
> >>> Mohammed
> >>>
> >>> -----Original Message-----
> >>> From: davidkl [mailto:davidkl...@hotmail.com]
> >>> Sent: Monday, September 28, 2015 1:40 AM
> >>> To: user@spark.apache.org
> >>> Subject: laziness in textFile reading from HDFS?
> >>>
> >>> Hello,
> >>>
> >>> I need to process a significant amount of data every day, about 4TB. This
> >>> will be processed in batches of about 140GB. The cluster this will be
> >>> running on doesn't have enough memory to hold the dataset at once, so I
> >>> am trying to understand how this works internally.
> >>>
> >>> When using textFile to read an HDFS folder (containing multiple files), I
> >>> understand that the number of partitions created is equal to the number
> >>> of HDFS blocks, correct? Are those created in a lazy way? I mean, if the
> >>> number of blocks/partitions is larger than the number of cores/threads
> >>> the Spark driver was launched with (N), are N partitions created
> >>> initially and then the rest when required? Or are all those partitions
> >>> created up front?
> >>>
> >>> I want to avoid reading the whole data into memory just to spill it out
> >>> to disk if there is not enough memory.
> >>>
> >>> Thanks!
> >>>
> >>> --
> >>> View this message in context:
> >>> http://apache-spark-user-list.1001560.n3.nabble.com/laziness-in-textFile-reading-from-HDFS-tp24837.html
> >>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
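To tie this back to the original four questions: the read is lazy and the partition count comes from the HDFS splits, but you are not stuck with one partition per block. Asking textFile for more partitions up front keeps individual tasks, and their memory footprint, small, which matters when the dataset is larger than RAM. A quick illustration (the path and the 2000 are arbitrary):

    // The second argument is a *minimum* partition count. This only
    // affects split planning; no data is read until an action runs.
    val rdd = sc.textFile("hdfs:///data/input", 2000)

    // Computes the splits (file listing only), still reads no records.
    println(rdd.partitions.length)

Having many more partitions than cores is fine, per point 4: each core just works through a queue of small tasks rather than a few huge ones.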