Hmm, you have a good point. So should I load the file with `sc.textFile()`
and specify a high number of partitions, so that the file is split into
partitions in memory across the cluster?
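
Something like the following is what I have in mind, assuming a spark-shell
session where `sc` is already available (a minimal sketch; the HDFS path and
partition counts are placeholders):

```scala
// Minimal sketch -- the path and partition counts are placeholders.
val rdd = sc.textFile("hdfs:///path/to/big-file.gz", minPartitions = 1000)

// Caveat: gzip is not a splittable codec, so a single .gz file loads as one
// partition regardless of the minPartitions hint. An explicit repartition
// shuffles the data out across the cluster afterwards.
val spread = rdd.repartition(1000)

println(spread.partitions.length)
```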

On Thu, Jun 11, 2015 at 9:27 PM ayan guha <guha.a...@gmail.com> wrote:

> Why do you need to use a stream in this use case? The 50 GB does not need
> to fit in memory. Give it a try with a high number of partitions.
> On 11 Jun 2015 23:09, "SLiZn Liu" <sliznmail...@gmail.com> wrote:
>
>> Hi Spark Users,
>>
>> I'm trying to load a genuinely large file (50GB when compressed as gzip,
>> stored in HDFS) by receiving a DStream via `ssc.textFileStream`, as this
>> file cannot fit in my memory. However, it looks like no RDD will be
>> received until I copy this big file to a pre-specified location on HDFS.
>> Ideally, I'd like to read this file a small number of lines at a time,
>> but receiving a file stream requires additional writing to HDFS. Any idea
>> how to achieve this?
>>
>> BEST REGARDS,
>> Todd Leo
>>
>
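
For reference, the streaming approach described above would look roughly like
this (a minimal sketch; the app name, batch interval, and monitored directory
are placeholders), which also illustrates why nothing arrives until the file
lands in the watched directory:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("FileStreamDemo")
// Batch interval is a placeholder; tune it for the workload.
val ssc = new StreamingContext(conf, Seconds(30))

// textFileStream only picks up files newly created or moved into the
// monitored directory after the stream starts, which is why no RDD is
// received until the 50GB file is copied into this location.
val lines = ssc.textFileStream("hdfs:///path/to/watched/dir")
lines.foreachRDD(rdd => println(s"Received ${rdd.count()} lines"))

ssc.start()
ssc.awaitTermination()
```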
