Why do you need to use streaming for this use case? The 50 GB does not need to fit in memory. Give it a try with a high number of partitions.
On 11 Jun 2015 23:09, "SLiZn Liu" <sliznmail...@gmail.com> wrote:

> Hi Spark Users,
>
> I'm trying to load a literally big file (50 GB when compressed as a gzip
> file, stored in HDFS) by receiving a DStream using `ssc.textFileStream`,
> as this file cannot fit in my memory. However, it looks like no RDD will
> be received until I copy this big file to a prior-specified location on
> HDFS. Ideally, I'd like to read this file a small number of lines at a
> time, but receiving a file stream requires additional writing to HDFS.
> Any idea how to achieve this?
>
> BEST REGARDS,
> Todd Leo