I have not used LZO-compressed files from Spark, so I am not sure why the job 
stalls without caching. 

In general, if you are going to make just one pass over the data, there is not 
much benefit in caching it; the data is not read until the first action is 
called anyway. If you are only calling a map operation followed by a save, I 
don't see how caching would help.
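
For example (a minimal sketch; the paths and the map functions are placeholders):

    // Single pass: caching would add nothing here, because the data is read
    // exactly once, when the save action runs.
    val records = sc.textFile("hdfs:///data/input")     // lazy, nothing read yet
    val parsed  = records.map(_.toUpperCase)            // still lazy
    parsed.saveAsTextFile("hdfs:///data/output")        // the one and only pass over the data

    // Caching pays off only when the same RDD feeds two or more actions:
    val reused = sc.textFile("hdfs:///data/input").map(_.toLowerCase).cache()
    val total  = reused.count()                         // first action: reads HDFS and fills the cache
    reused.saveAsTextFile("hdfs:///data/output2")       // second action: served from the cached partitions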

Mohammed


-----Original Message-----
From: Matt Narrell [mailto:matt.narr...@gmail.com] 
Sent: Tuesday, October 6, 2015 3:32 PM
To: Mohammed Guller
Cc: davidkl; user@spark.apache.org
Subject: Re: laziness in textFile reading from HDFS?

One.

I read in LZO-compressed files from HDFS, perform a map operation, cache the 
results of that map operation, and call saveAsHadoopFile to write LZO back to 
HDFS.

Without the cache, the job will stall.  
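
Roughly, the job looks like this (paths are placeholders, transform() stands in 
for the real map logic, and I'm sketching the write as saveAsTextFile with the 
hadoop-lzo codec rather than the saveAsHadoopFile call I actually use):

    import com.hadoop.compression.lzo.LzopCodec        // from the hadoop-lzo package

    val raw    = sc.textFile("hdfs:///data/in-lzo")    // LZO input, decompressed via the configured codec
    val mapped = raw.map(line => transform(line))      // transform() is a placeholder for the real map logic
    val cached = mapped.cache()                        // removing this line is what makes the job stall
    cached.saveAsTextFile("hdfs:///data/out-lzo", classOf[LzopCodec])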

mn

> On Oct 5, 2015, at 7:25 PM, Mohammed Guller <moham...@glassbeam.com> wrote:
> 
> Is there any specific reason for caching the RDD? How many passes do you make 
> over the dataset? 
> 
> Mohammed
> 
> -----Original Message-----
> From: Matt Narrell [mailto:matt.narr...@gmail.com]
> Sent: Saturday, October 3, 2015 9:50 PM
> To: Mohammed Guller
> Cc: davidkl; user@spark.apache.org
> Subject: Re: laziness in textFile reading from HDFS?
> 
> Is there any more information, or are there best practices here?  I have the 
> exact same issue when reading large data sets from HDFS (larger than available 
> RAM): the job cannot run unless I set the RDD persistence level to 
> MEMORY_AND_DISK_SER and use nearly all the cluster resources.
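> 
> Concretely, this is roughly the shape of the job (paths are placeholders, and 
> extractTimestamp() stands in for my real parsing):
> 
>     import org.apache.spark.storage.StorageLevel
> 
>     val lines = sc.textFile("hdfs:///data/in")
>       .persist(StorageLevel.MEMORY_AND_DISK_SER)      // the only level that lets the job finish
>     lines
>       .map(line => (extractTimestamp(line), line))    // extractTimestamp() is a stand-in
>       .groupByKey()
>       .saveAsTextFile("hdfs:///data/out-by-timestamp")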
> 
> Should I repartition this RDD to be equal to the number of cores?  
> 
> I notice that the job duration on the YARN UI is about 30 minutes longer than 
> on the Spark UI.  Also, when the job initially starts, there are no tasks 
> shown in the Spark UI. Is that expected?
> 
> All I'm doing is reading records from HDFS text files with sc.textFile, and 
> rewriting them back to HDFS grouped by a timestamp.
> 
> Thanks,
> mn
> 
>> On Sep 29, 2015, at 8:06 PM, Mohammed Guller <moham...@glassbeam.com> wrote:
>> 
>> 1) It is not required to have the same amount of memory as data. 
>> 2) By default, the number of partitions is equal to the number of HDFS 
>> blocks.
>> 3) Yes, the read operation is lazy.
>> 4) It is okay to have more partitions than cores. 
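>> 
>> For example (a minimal sketch; the path is a placeholder):
>> 
>>     val rdd = sc.textFile("hdfs:///data/day1")   // lazy: nothing is read from HDFS yet
>>     println(rdd.partitions.length)               // roughly one partition per HDFS block
>>     rdd.count()                                  // the first action triggers the actual read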
>> 
>> Mohammed
>> 
>> -----Original Message-----
>> From: davidkl [mailto:davidkl...@hotmail.com]
>> Sent: Monday, September 28, 2015 1:40 AM
>> To: user@spark.apache.org
>> Subject: laziness in textFile reading from HDFS?
>> 
>> Hello,
>> 
>> I need to process a significant amount of data every day, about 4TB. This 
>> will be processed in batches of about 140GB. The cluster this will be 
>> running on doesn't have enough memory to hold the dataset at once, so I am 
>> trying to understand how this works internally.
>> 
>> When using textFile to read an HDFS folder (containing multiple files), I 
>> understand that the number of partitions created is equal to the number of 
>> HDFS blocks, correct? Are those partitions created lazily? That is, if the 
>> number of blocks/partitions is larger than the number of cores/threads the 
>> Spark driver was launched with (N), are N partitions created initially and 
>> the rest only when required, or are all of the partitions created up front?
>> 
>> I want to avoid reading the whole dataset into memory just to spill it out 
>> to disk when there is not enough memory.
>> 
>> Thanks! 
>> 
>> 
>> 
>> --
>> View this message in context: 
>> http://apache-spark-user-list.1001560.n3.nabble.com/laziness-in-textFile-reading-from-HDFS-tp24837.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>> 
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
