Re: Reading Really Big File Stream from HDFS

2015-06-12 Thread Saisai Shao
Using sc.textFile will also read the file from HDFS line by line through an iterator, so the whole file doesn't need to fit into memory; it still works even if you only have a small amount of memory. 2015-06-12 13:19 GMT+08:00 SLiZn Liu sliznmail...@gmail.com: Hmm, you have a good point. So should I load the file
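A minimal sketch of what this reply describes, assuming a hypothetical HDFS path and app name; `sc.textFile` is lazy and each partition is consumed through an iterator, so a per-line transformation plus an action never needs the whole file in memory:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumptions: the app name and HDFS path below are placeholders.
val conf = new SparkConf().setAppName("BigFileLineCount")
val sc = new SparkContext(conf)

// textFile is lazy: each partition is read line by line through an iterator,
// so nothing forces the whole file into memory at once.
val lines = sc.textFile("hdfs:///data/big-file.gz")

// A per-line filter followed by an action; only small per-partition state
// (a running count) is kept in memory.
val errorCount = lines.filter(_.contains("ERROR")).count()
println(s"Lines containing ERROR: $errorCount")
```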

Reading Really Big File Stream from HDFS

2015-06-11 Thread SLiZn Liu
Hi Spark Users, I'm trying to load a really big file (50 GB when compressed as a gzip file, stored in HDFS) by receiving a DStream using `ssc.textFileStream`, as this file cannot fit in my memory. However, it looks like no RDD will be received until I copy this big file to a prior-specified
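For context, a minimal sketch of the streaming approach described above, assuming a hypothetical monitored HDFS directory and batch interval; `textFileStream` only produces RDDs for files that appear in the monitored directory after the streaming context starts, which is consistent with the behaviour reported here:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Assumptions: app name, batch interval, and monitored directory are placeholders.
val conf = new SparkConf().setAppName("BigFileStream")
val ssc = new StreamingContext(conf, Seconds(30))

// textFileStream watches a directory; files must be created or moved into it
// after the streaming context starts in order to be picked up as RDDs.
val lines = ssc.textFileStream("hdfs:///data/incoming/")

lines.foreachRDD { rdd =>
  println(s"Received ${rdd.count()} lines in this batch")
}

ssc.start()
ssc.awaitTermination()
```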

Re: Reading Really Big File Stream from HDFS

2015-06-11 Thread SLiZn Liu
Hmm, you have a good point. So should I load the file with `sc.textFile()` and specify a high number of partitions, so that the file is split into partitions in memory across the cluster? On Thu, Jun 11, 2015 at 9:27 PM ayan guha guha.a...@gmail.com wrote: Why do you need to use stream in this
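A minimal sketch of that idea, assuming a hypothetical path and partition counts; the second argument to `sc.textFile` is a minimum number of partitions (a hint, not an exact count), and since a single gzip file is not splittable, an explicit repartition after reading is one way to spread the data across the cluster:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumptions: the path and partition counts below are placeholders.
val conf = new SparkConf().setAppName("BigFilePartitions")
val sc = new SparkContext(conf)

// The second argument is a *minimum* number of partitions, passed as a hint
// to the underlying Hadoop input format.
val lines = sc.textFile("hdfs:///data/big-file.gz", 400)

// A single gzip file arrives as one partition (gzip is not splittable), so
// repartition spreads the records across the cluster before further work.
val spread = lines.repartition(400)
println(s"Partitions after repartition: ${spread.partitions.length}")
```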