My guess is that spilling to S3 will be disastrously slow.
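
For anyone who wants to try it anyway: the spill target is set in
drill-override.conf (the doc Paul links as [1] below covers the options).
A minimal sketch, if I remember the option names right, assuming the s3a
client libraries are on the classpath; the bucket name and path here are
made up, and as Paul notes, I don't think anyone has actually tested
spilling to S3:

    drill.exec.spill: {
      # File system to spill to; "file:///" (local disk) is the default.
      # Pointing this at S3 is untested -- expect it to be very slow.
      fs: "s3a://my-spill-bucket",
      # Directories (on that file system) to rotate spill files across.
      directories: [ "/drill/spill" ]
    }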
On Fri, Aug 16, 2019 at 9:37 AM Paul Rogers <[email protected]> wrote:

> Hi Manu,
>
> To add a bit more background... Drill uses local storage only for
> spilling result sets when they are too large for memory. Otherwise,
> data never touches disk once read from S3.
>
> Unlike Snowflake, Drill does not cache S3 data locally. This means
> that, if you query the same file multiple times, Drill will hit S3 for
> each query. Adding Snowflake-like S3 caching is an open project looking
> for volunteers.
>
> Spilling can be configured to go to the DFS (distributed file system).
> Presumably, this can be S3, though I don't think anyone has tried this.
> Information about configuring the spill directory is in [1].
>
> Drill does not need Hadoop; it only needs ZK (and, as Nitin pointed
> out, the proper configuration for your cloud vendor).
>
> As it turns out, there is some information on AWS and S3 setup in the
> "Learning Apache Drill" book. Probably not as much detail as you would
> like, but enough to get you started. The book does not include GCE
> setup, but the details should be similar.
>
> Drill uses the HDFS client (not server) to access a cloud vendor. So,
> as long as you install the correct HDFS client libraries, you are
> mostly good to go. Note that the S3 libraries have evolved over time.
> The book explains the most recent library at the time we wrote the book
> last year. Please check the HDFS project for which library you need for
> GCE access.
>
> Now a request: you will learn quite a number of important details as
> you set up your cloud-agnostic solution. Please post your findings
> here, and/or file JIRA tickets, so we can update documentation or fix
> any issues that you discover. You are benefiting from the work of
> others who created Drill; please share your findings with the community
> so others can benefit from your work.
>
> Thanks,
> - Paul
>
> [1]
> https://drill.apache.org/docs/sort-based-and-hash-based-memory-constrained-operators/
>
>
> On Friday, August 16, 2019, 05:10:00 AM PDT, Nitin Pawar <
> [email protected]> wrote:
>
> From my learning (and I could be wrong on a few things, so wait for
> others to answer as well):
>
> 1. When setting up the drill cluster in a prod environment to query
> data ranging from several gigabytes to a few terabytes hosted in
> s3/blob storage/cloud storage, what are the considerations for disk
> space? I understand drillbits make use of data locality, but how does
> that work in the case of cloud storage like s3? Will the entire data
> from s3 be moved to the drill cluster before starting the query
> processing?
>
> It is advised to use parquet as your file format; it improves
> performance a lot. Drill will bring in all the data it needs to process
> a given query. This can be reduced if you arrange your folder structure
> around filterable columns such as dates. When you are using parquet
> files, each file or block is downloaded separately by the drillbit
> servers, and the data is then redistributed across the cluster
> according to your query, for example when you group by, or filter and
> then sum. All the data generally resides in memory and spills to disk
> depending on your query patterns.
>
> 2. Is it possible to use s3 or other cloud storage solutions for the
> Sort, Hash Aggregate, and Hash Join operators' spill data rather than
> using local disk?
>
> As per my understanding, only local disks are used for non-memory-based
> aggregations. Using cloud-based storage systems for intermediate
> outputs causes heavy network IO and huge delays in queries.
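
To make Nitin's point about folder layout concrete: Drill exposes
directory levels as the implicit columns dir0, dir1, and so on, so if
you lay the data out as, say, /sales/2019/08/*.parquet, a filter on
those columns lets the planner prune whole prefixes before any S3 reads
happen. A quick sketch; the `s3` plugin name and the paths are
hypothetical:

    -- dir0 = first directory level (year), dir1 = second (month)
    SELECT dir0 AS yr, dir1 AS mth, COUNT(*) AS cnt
    FROM s3.`/sales`
    WHERE dir0 = '2019' AND dir1 = '08'   -- pruned to one prefix
    GROUP BY dir0, dir1;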
> 3. Is it ok to run a drill production cluster without hadoop? Is just
> a zookeeper quorum enough?
>
> You do NOT need to set up a hadoop cluster. Apache Drill has no
> prerequisite on hadoop for execution purposes unless you are using
> those feature sets of apache drill. To run a drill cluster, a zookeeper
> quorum is more than sufficient. From there on, based on what storage
> systems you use, you will need to create storage plugins and use them.
>
> On Fri, Aug 16, 2019 at 10:38 AM Manu Mukundan
> <[email protected]> wrote:
>
> > Hi,
> >
> > My name is Manu and I am working as a Bigdata architect in a small
> > startup company in Kochi, India. Our new project handles visualizing
> > large volumes of unstructured data in cloud storage (it can be S3,
> > Azure blob storage, or Google cloud storage). We are planning to use
> > Apache Drill as the SQL query execution engine so that we will be
> > cloud agnostic. Unfortunately, we are finding some key questions
> > unanswered before moving ahead with Drill as our platform. Hoping you
> > can provide some clarity; it will be much appreciated.
> >
> > 1. When setting up the drill cluster in a prod environment to query
> > data ranging from several gigabytes to a few terabytes hosted in
> > s3/blob storage/cloud storage, what are the considerations for disk
> > space? I understand drillbits make use of data locality, but how does
> > that work in the case of cloud storage like s3? Will the entire data
> > from s3 be moved to the drill cluster before starting the query
> > processing?
> > 2. Is it possible to use s3 or other cloud storage solutions for the
> > Sort, Hash Aggregate, and Hash Join operators' spill data rather than
> > using local disk?
> > 3. Is it ok to run a drill production cluster without hadoop? Is just
> > a zookeeper quorum enough?
> >
> > I totally understand how busy you can be, but if you get a chance,
> > please help me get clarity on these items. It will be really helpful.
> >
> > Thanks again!
> > Manu Mukundan
> > Bigdata Architect,
> > Prevalent AI,
> > [email protected]
>
> --
> Nitin Pawar
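
And on question 3: agreed, ZooKeeper is all the coordination Drill
needs. The relevant part of drill-override.conf is just the cluster id
and the ZK connect string; the hostnames below are placeholders:

    drill.exec: {
      cluster-id: "drillbits1",
      # Comma-separated ZooKeeper quorum; no Hadoop services required.
      zk.connect: "zk1:2181,zk2:2181,zk3:2181"
    }

Storage plugins for the cloud stores are then registered through the web
UI or the REST API, with the matching HDFS client jars on the classpath
as Paul describes.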
