Hi Steve,

I'm on Hadoop 2.7.1, using the s3n connector.

From:  Steve Loughran <ste...@hortonworks.com>
Date:  Thursday, December 3, 2015 at 4:12 AM
Cc:  SPARK-USERS <user@spark.apache.org>
Subject:  Re: Spark Streaming from S3


> On 3 Dec 2015, at 00:42, Michele Freschi <mfres...@palantir.com> wrote:
> 
> Hi all,
> 
> I have an app streaming from s3 (textFileStream) and recently I've observed
> increasing delay and long time to list files:
> 
> INFO dstream.FileInputDStream: Finding new files took 394160 ms
> ...
> INFO scheduler.JobScheduler: Total delay: 404.796 s for time 1449100200000 ms
> (execution: 10.154 s)
> 
> At this time I have about 13K files under the key prefix that I'm monitoring -
> hadoop takes about 6 minutes to list all the files while aws cli takes only
> seconds. 
> My understanding is that this is a current limitation of hadoop but I wanted
> to confirm it in case it's a misconfiguration on my part.

Not a known issue.

Usual questions: which Hadoop version, and are you using the s3n or s3a
connector? The latter does use the AWS SDK, but it's only been stable
enough to use since Hadoop 2.7.
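
If you do try s3a, a minimal sketch of the switch, assuming the standard
Hadoop 2.7 fs.s3a.* credential keys (the bucket and prefix below are
placeholders):

```
# spark-defaults.conf -- route s3a:// URIs through the S3A connector
spark.hadoop.fs.s3a.access.key   YOUR_ACCESS_KEY
spark.hadoop.fs.s3a.secret.key   YOUR_SECRET_KEY
```

then point the stream at the same data via an s3a:// URI instead of s3n://,
e.g. ssc.textFileStream("s3a://your-bucket/your-prefix/").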

> 
> Some alternatives I'm considering:
> 1. copy old files to a different key prefix
> 2. use one of the available SQS receivers
> (https://github.com/imapi/spark-sqs-receiver) ?
> 3. implement the S3 listing outside of Spark and use socketTextStream, but I
> couldn't find out whether that's reliable
> 4. create a custom S3 receiver using the AWS SDK (though it doesn't look like
> custom receivers can be used from pyspark)
> 
> Has anyone experienced the same issue and found a better way to solve it?
> 
> Thanks,
> Michele
> 


