I've been thinking a lot about this. I definitely think there should be a
clean fix, but I haven't had the cycles to suggest something. Are you up for
looking at the code and trying to suggest something?

thanks!

--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Thu, Mar 10, 2016 at 8:06 AM, Oscar Morante <spacep...@gmail.com> wrote:

> I've been checking the logs, and I think that the problem is that it's
> walking through the "directories" in S3 recursively, doing lots of small
> HTTP requests.
>
> My files are organized like this which amplifies the issue:
>
>    /category/random-hash/year/month/day/hour/data-chunk-000.json.gz
>
> The random hash is there to trick S3 into using a different
> partition/shard for each put [1].  But it looks like this structure is
> clashing with the way Drill/hadoop.fs.s3a get the list of files.
>
> I think that it should be possible to get the complete list of files under
> a given "directory" (e.g. `/category`) doing just one HTTP query, but I
> don't know how hard it would be to incorporate that behavior.
>
> Any ideas?  How are you organizing your S3 files to get good performance?
>
> Thanks!
>
> [1]:
> http://docs.aws.amazon.com/AmazonS3/latest/dev/request-rate-perf-considerations.html
>
>
>
> On Thu, Mar 10, 2016 at 12:27:42PM +0200, Oscar Morante wrote:
>
>> I'm querying 20 GB of gzipped JSON split across ~5600 small files, with
>> sizes ranging from 1 MB to 30 MB.  Drill is running on AWS on four
>> m4.xlarge nodes, and it takes around 50 minutes before the query starts
>> executing.
>>
>> Any idea what could be causing this delay?  What's the best way to debug
>> this?
>>
>> Thanks,
>>
>
> --
> Oscar Morante
> "Self-education is, I firmly believe, the only kind of education there is."
>                                                          -- Isaac Asimov.
>
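For context, the partition-spreading layout described above (a random hash
segment between the category and the date path) can be sketched roughly as
follows. The key shape matches the example in the mail; the choice of MD5,
the 8-character prefix length, and the `make_key` helper are illustrative
assumptions, not anything from the original thread:

```python
# Sketch of a key layout that spreads writes across S3 partitions by
# inserting a hash-derived segment near the front of the key, per the AWS
# request-rate guidance linked in the quoted mail. Details are assumptions.
import hashlib
from datetime import datetime

def make_key(category, ts, chunk):
    # Derive a short, stable pseudo-random prefix from the date path so
    # that keys for different hours land in different key ranges.
    path = ts.strftime("%Y/%m/%d/%H")
    prefix = hashlib.md5(path.encode()).hexdigest()[:8]
    return "{}/{}/{}/data-chunk-{:03d}.json.gz".format(category, prefix, path, chunk)

print(make_key("category", datetime(2016, 3, 10, 8), 0))
```

The trade-off the thread runs into: the hash segment helps write throughput,
but it defeats any listing strategy that walks the "directory" tree level by
level, since the hash level fans out into many prefixes.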
