On 3 Dec 2015, at 00:42, Michele Freschi <mfres...@palantir.com> wrote:

Hi all,

I have an app streaming from S3 (textFileStream), and recently I've observed 
increasing delays and a long time to list files:

INFO dstream.FileInputDStream: Finding new files took 394160 ms
...
INFO scheduler.JobScheduler: Total delay: 404.796 s for time 1449100200000 ms 
(execution: 10.154 s)

At this time I have about 13K files under the key prefix that I'm monitoring - 
Hadoop takes about 6 minutes to list all the files, while the AWS CLI takes 
only seconds.
My understanding is that this is a current limitation of Hadoop, but I wanted 
to confirm it in case it's a misconfiguration on my part.

Not a known issue.

Usual questions: which Hadoop version, and are you using the s3n or s3a 
connector? The latter does use the AWS SDK, but it has only been stable enough 
to use since Hadoop 2.7.
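If you do try s3a on Hadoop 2.7, the switch is mostly the URL scheme plus credentials config; a sketch from pyspark (assumes the hadoop-aws jar is on the classpath; the property names are the standard fs.s3a.* ones, the bucket/prefix and key values are placeholders):

```python
# Configuration sketch only - requires a running SparkContext (sc) and
# StreamingContext (ssc); credential values are placeholders.
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.access.key", "YOUR_ACCESS_KEY")
hadoop_conf.set("fs.s3a.secret.key", "YOUR_SECRET_KEY")

# Same call as before, s3a:// scheme instead of s3n://
lines = ssc.textFileStream("s3a://my-bucket/my-prefix/")
```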


Some alternatives I'm considering:
1. copy old files to a different key prefix
2. use one of the available SQS receivers 
(https://github.com/imapi/spark-sqs-receiver ?)
3. implement the S3 listing outside of Spark and use socketTextStream, but I 
couldn't find out whether it's reliable or not
4. create a custom S3 receiver using the AWS SDK (though it doesn't look like 
it's possible to use custom receivers from pyspark)

Has anyone experienced the same issue and found a better way to solve it?

Thanks,
Michele

