Hmm, you have a good point. So should I load the file with `sc.textFile()` and specify a high number of partitions, so that the file is split into partitions in memory across the cluster?
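For what it's worth, here is a minimal sketch of that approach in spark-shell (the HDFS path and the partition count of 1000 are placeholders, not from this thread). One caveat: gzip is not a splittable codec, so Spark reads a single .gz file as one partition no matter what minPartitions hint is passed to `sc.textFile()`; it is the `repartition()` after loading that actually spreads the lines across the cluster:

```scala
// In spark-shell, where `sc` is the predefined SparkContext.
// Hypothetical path; substitute your own file.
val lines = sc.textFile("hdfs:///data/big.gz")

// A single .gz file is not splittable, so it arrives as ONE partition
// regardless of any minPartitions hint. repartition() shuffles the
// decompressed lines across the cluster.
val spread = lines.repartition(1000)

println(spread.partitions.length) // => 1000

// Example action: count lines without collecting them to the driver.
spread.count()
```

Since nothing is cached here, the full 50GB never needs to fit in memory at once; each task processes its own partition and moves on.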
On Thu, Jun 11, 2015 at 9:27 PM ayan guha <guha.a...@gmail.com> wrote:

> Why do you need to use a stream in this use case? 50g need not be in
> memory. Give it a try with a high number of partitions.
>
> On 11 Jun 2015 23:09, "SLiZn Liu" <sliznmail...@gmail.com> wrote:
>
>> Hi Spark Users,
>>
>> I'm trying to load a very large file (50GB when compressed as gzip,
>> stored in HDFS) by receiving a DStream via `ssc.textFileStream`, as
>> this file cannot fit in my memory. However, it looks like no RDD is
>> received until I copy this big file to a pre-specified location on
>> HDFS. Ideally, I'd like to read this file a small number of lines at a
>> time, but receiving a file stream requires additional writing to HDFS.
>> Any ideas on how to achieve this?
>>
>> BEST REGARDS,
>> Todd Leo