Using sc.textFile will also read the file from HDFS line by line through
an iterator, so it doesn't need to fit entirely into memory; even with a
small amount of memory it will still work.
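
For illustration only, a minimal Scala sketch of that approach, assuming an
existing SparkContext `sc`; the HDFS path, partition count, and the ERROR
filter below are placeholders, not taken from this thread:

// Illustrative sketch; the path and numbers are placeholders.
// Read the compressed file from HDFS lazily: each partition streams its
// lines through an iterator, so the whole 50 GB never has to fit in memory.
val lines = sc.textFile("hdfs:///data/big.gz", minPartitions = 200)

// A single .gz file is not splittable, so it may still arrive as one
// partition; repartition() can spread the records across the cluster.
val spread = lines.repartition(200)

// Example action that processes the data without collecting it to the driver.
val errorCount = spread.filter(_.contains("ERROR")).count()
println(s"Lines containing ERROR: $errorCount")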

2015-06-12 13:19 GMT+08:00 SLiZn Liu <sliznmail...@gmail.com>:

> Hmm, you have a good point. So should I load the file with `sc.textFile()`
> and specify a high number of partitions, so that the file is then split into
> partitions in memory across the cluster?
>
> On Thu, Jun 11, 2015 at 9:27 PM ayan guha <guha.a...@gmail.com> wrote:
>
>> Why do you need to use a stream in this use case? 50 GB need not be in
>> memory. Give it a try with a high number of partitions.
>> On 11 Jun 2015 23:09, "SLiZn Liu" <sliznmail...@gmail.com> wrote:
>>
>>> Hi Spark Users,
>>>
>>> I'm trying to load a genuinely large file (50 GB when compressed as gzip,
>>> stored in HDFS) by receiving a DStream using `ssc.textFileStream`, as this
>>> file cannot fit in my memory. However, it looks like no RDD will be
>>> received until I copy this big file to a pre-specified location on HDFS.
>>> Ideally, I'd like to read this file a small number of lines at a time, but
>>> receiving a file stream requires additional writing to HDFS. Any idea how
>>> to achieve this?
>>>
>>> BEST REGARDS,
>>> Todd Leo
>>>
>>
