I've been thinking a lot about this. I definitely think there should be a clean fix, but I haven't had the cycles to dig in. You up for looking at the code and suggesting something?
thanks!

--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Thu, Mar 10, 2016 at 8:06 AM, Oscar Morante <spacep...@gmail.com> wrote:
> I've been checking the logs, and I think the problem is that it's
> walking through the "directories" in S3 recursively, doing lots of small
> HTTP requests.
>
> My files are organized like this, which amplifies the issue:
>
>   /category/random-hash/year/month/day/hour/data-chunk-000.json.gz
>
> The random hash is there to trick S3 into using a different
> partition/shard for each put [1]. But it looks like this structure is
> clashing with the way Drill/hadoop.fs.s3a gets the list of files.
>
> I think it should be possible to get the complete list of files under
> a given "directory" (e.g. `/category`) with just one HTTP query, but I
> don't know how hard it would be to incorporate that behavior.
>
> Any ideas? How are you organizing your S3 files to get good performance?
>
> Thanks!
>
> [1]: http://docs.aws.amazon.com/AmazonS3/latest/dev/request-rate-perf-considerations.html
>
> On Thu, Mar 10, 2016 at 12:27:42PM +0200, Oscar Morante wrote:
>> I'm querying 20GB of gzipped JSON split across ~5600 small files with
>> sizes ranging from 1MB to 30MB. Drill is running in AWS on 4 m4.xlarge
>> nodes, and it's taking around 50 minutes before the query starts
>> executing.
>>
>> Any idea what could be causing this delay? What's the best way to debug
>> this?
>>
>> Thanks,
>
> --
> Oscar Morante
> "Self-education is, I firmly believe, the only kind of education there
> is." -- Isaac Asimov.
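[Editor's note: the key observation in the thread — S3's keyspace is flat, so one paginated LIST with a prefix can return every object under `/category` no matter how deep the pseudo-directories go, while a directory-style walk pays one round trip per level — can be sketched without touching S3. This is a toy in-memory model; the bucket contents, page size, and request counting are illustrative assumptions, not Drill's or s3a's actual code:]

```python
# Toy model of an S3 bucket: keys shaped like the layout in the thread,
# category/random-hash/year/month/day/hour/data-chunk-NNN.json.gz
# (the category and hash names below are made up for illustration).
keys = [
    "events/a1b2/2016/03/10/08/data-chunk-000.json.gz",
    "events/a1b2/2016/03/10/09/data-chunk-000.json.gz",
    "events/f9e8/2016/03/10/08/data-chunk-000.json.gz",
    "metrics/c3d4/2016/03/10/08/data-chunk-000.json.gz",
]

def list_objects_flat(all_keys, prefix, page_size=2):
    """Flat listing: every key under `prefix` in one paginated stream.
    Each page stands in for one HTTP round trip, regardless of how many
    pseudo-directory levels the keys contain."""
    matches = sorted(k for k in all_keys if k.startswith(prefix))
    requests, out = 0, []
    for i in range(0, len(matches), page_size):
        requests += 1  # one "request" per page
        out.extend(matches[i:i + page_size])
    return out, max(requests, 1)

def list_objects_recursive(all_keys, prefix, delimiter="/"):
    """Directory-style walk: one "request" per pseudo-directory visited,
    recursing into each child prefix the way a filesystem walk would."""
    requests = 1
    children, leaves = set(), []
    for k in all_keys:
        if not k.startswith(prefix):
            continue
        rest = k[len(prefix):]
        if delimiter in rest:
            children.add(prefix + rest.split(delimiter, 1)[0] + delimiter)
        else:
            leaves.append(k)
    for child in sorted(children):
        sub_leaves, sub_requests = list_objects_recursive(all_keys, child)
        leaves.extend(sub_leaves)
        requests += sub_requests
    return leaves, requests

flat_keys, flat_reqs = list_objects_flat(keys, "events/")
walk_keys, walk_reqs = list_objects_recursive(keys, "events/")
print("flat:", flat_reqs, "requests; walk:", walk_reqs, "requests")
# Same 3 keys either way, but the walk needs a round trip for every
# year/month/day/hour level under every random hash.
```

With only 3 files in the toy bucket the walk already makes 12 "requests" versus 2 pages for the flat listing; with ~5600 files spread under per-hash/year/month/day/hour prefixes, the gap is what turns planning into tens of minutes.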