Re: Spark task hangs infinitely when accessing S3 from AWS

Michael Cutler Thu, 12 Nov 2015 01:59:09 -0800

Reading files directly from Amazon S3 can be frustrating, especially if
you're dealing with a large number of input files. Could you please
elaborate on your use case? Does the S3 bucket in question already
contain a large number of files?


The implementation of the * wildcard operator in S3 input paths requires an
AWS S3 API call to list everything under the common prefix. So if your
input is something like:

  s3://my-bucket/<year>/<month>/<date>/*.json

then the prefix "<year>/<month>/<date>/" will be passed to the API, and the
listing should be fairly efficient.
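
To make the "common prefix" idea concrete, here is a rough Python sketch of
how the literal part of a key pattern can be split off before the first
wildcard. This is a simplified model of the behaviour described above, not
the actual Hadoop/S3 filesystem code; the function name literal_prefix and
the concrete date are my own, for illustration only.

```python
def literal_prefix(key_pattern: str) -> str:
    """Return the part of a key pattern that precedes its first glob character."""
    wildcard_chars = "*?[{"  # characters treated as glob syntax in this sketch
    for index, char in enumerate(key_pattern):
        if char in wildcard_chars:
            return key_pattern[:index]
    # No wildcard at all: the whole pattern is a literal key.
    return key_pattern


# Hypothetical concrete date standing in for <year>/<month>/<date>.
print(literal_prefix("2015/11/12/*.json"))
```

The longer the literal prefix, the fewer keys the listing call has to return
before any client-side filtering happens.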

However, if you're doing something more adventurous like:

  s3://my-bucket/*/*/*/*.json

there is no common prefix to give the API here. It will literally list
every object in the bucket and then filter client-side to find anything
that matches "*.json". These types of requests are prone to timeouts and
other intermittent issues, and they take a ridiculous amount of time
before the job can start.
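
One possible way around this, if you know which dates you need, is to build
the list of prefixed paths yourself instead of using wildcards for the date
components. The sketch below is only an illustration: the helper name
daily_input_paths is hypothetical, and it assumes the keys are laid out as
zero-padded <year>/<month>/<date>/ under a bucket called my-bucket.

```python
from datetime import date, timedelta
from typing import List


def daily_input_paths(bucket: str, start: date, end: date) -> List[str]:
    """Build one prefixed glob per day, from start to end inclusive."""
    paths = []
    current = start
    while current <= end:
        # Assumed layout: s3://<bucket>/<YYYY>/<MM>/<DD>/*.json
        paths.append(f"s3://{bucket}/{current:%Y/%m/%d}/*.json")
        current += timedelta(days=1)
    return paths


paths = daily_input_paths("my-bucket", date(2015, 11, 10), date(2015, 11, 12))
print(",".join(paths))
```

Each resulting path keeps a full date prefix, so every listing stays narrow.
As far as I know, SparkContext.textFile accepts a comma-separated list of
paths, so the joined string can be handed to it directly.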
