My guess is that spilling to S3 will be disastrously slow.
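
For anyone who wants to try it anyway: the spill target is set in
drill-override.conf (the doc Paul links as [1] below covers the options).
A minimal sketch, if I remember the option names right, assuming the s3a
client libraries are on the classpath; the bucket name and path here are
made up, and as Paul notes, I don't think anyone has actually tested
spilling to S3:

    drill.exec.spill: {
      # File system to spill to; "file:///" (local disk) is the default.
      # Pointing this at S3 is untested -- expect it to be very slow.
      fs: "s3a://my-spill-bucket",
      # Directories (on that file system) to rotate spill files across.
      directories: [ "/drill/spill" ]
    }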
On Fri, Aug 16, 2019 at 9:37 AM Paul Rogers <[email protected]> wrote:

> Hi Manu,
>
> To add a bit more background... Drill uses local storage only for
> spilling result sets when they are too large for memory. Otherwise,
> data never touches disk once read from S3.
>
> Unlike Snowflake, Drill does not cache S3 data locally. This means
> that, if you query the same file multiple times, Drill will hit S3 for
> each query. Adding Snowflake-like S3 caching is an open project looking
> for volunteers.
>
> Spilling can be configured to go to the DFS (distributed file system).
> Presumably, this can be S3, though I don't think anyone has tried this.
> Information about configuring the spill directory is in [1].
>
> Drill does not need Hadoop; it only needs ZK (and, as Nitin pointed
> out, the proper configuration for your cloud vendor).
>
> As it turns out, there is some information on AWS and S3 setup in the
> "Learning Apache Drill" book. Probably not as much detail as you would
> like, but enough to get you started. The book does not include GCE
> setup, but the details should be similar.
>
> Drill uses the HDFS client (not server) to access a cloud vendor. So,
> as long as you install the correct HDFS client libraries, you are
> mostly good to go. Note that the S3 libraries have evolved over time.
> The book explains the most recent library at the time we wrote the book
> last year. Please check the HDFS project for which library you need for
> GCE access.
>
> Now a request: you will learn quite a number of important details as
> you set up your cloud-agnostic solution. Please post your findings
> here, and/or file JIRA tickets, so we can update documentation or fix
> any issues that you discover. You are benefiting from the work of
> others who created Drill; please share your findings with the community
> so others can benefit from your work.
>
> Thanks,
> - Paul
>
> [1]
> https://drill.apache.org/docs/sort-based-and-hash-based-memory-constrained-operators/
>
>
> On Friday, August 16, 2019, 05:10:00 AM PDT, Nitin Pawar <
> [email protected]> wrote:
>
> From my learning (and I could be wrong on a few things, so wait for
> others to answer as well):
>
> 1. When setting up the drill cluster in a prod environment to query
> data ranging from several gigabytes to a few terabytes hosted in
> s3/blob storage/cloud storage, what are the considerations for disk
> space? I understand drillbits make use of data locality, but how does
> that work in the case of cloud storage like s3? Will the entire data
> from s3 be moved to the drill cluster before starting the query
> processing?
>
> It is advised to use parquet as your file format; it improves
> performance a lot. Drill will bring in all the data it needs to process
> a given query. This can be reduced if you arrange your folder structure
> around filterable columns such as dates. When you are using parquet
> files, each file or block is downloaded separately by the drillbit
> servers, and the data is then redistributed across the cluster
> according to your query, for example when you group by, or filter and
> then sum. All the data generally resides in memory and spills to disk
> depending on your query patterns.
>
> 2. Is it possible to use s3 or other cloud storage solutions for the
> Sort, Hash Aggregate, and Hash Join operators' spill data rather than
> using local disk?
>
> As per my understanding, only local disks are used for non-memory-based
> aggregations. Using cloud-based storage systems for intermediate
> outputs causes heavy network IO and huge delays in queries.
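
To make Nitin's point about folder layout concrete: Drill exposes
directory levels as the implicit columns dir0, dir1, and so on, so if
you lay the data out as, say, /sales/2019/08/*.parquet, a filter on
those columns lets the planner prune whole prefixes before any S3 reads
happen. A quick sketch; the `s3` plugin name and the paths are
hypothetical:

    -- dir0 = first directory level (year), dir1 = second (month)
    SELECT dir0 AS yr, dir1 AS mth, COUNT(*) AS cnt
    FROM s3.`/sales`
    WHERE dir0 = '2019' AND dir1 = '08'   -- pruned to one prefix
    GROUP BY dir0, dir1;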
> 3. Is it ok to run a drill production cluster without hadoop? Is just
> a zookeeper quorum enough?
>
> You do NOT need to set up a hadoop cluster. Apache Drill has no
> prerequisite on hadoop for execution purposes unless you are using
> those feature sets of apache drill. To run a drill cluster, a zookeeper
> quorum is more than sufficient. From there on, based on what storage
> systems you use, you will need to create storage plugins and use them.
>
> On Fri, Aug 16, 2019 at 10:38 AM Manu Mukundan
> <[email protected]> wrote:
>
> > Hi,
> >
> > My name is Manu and I am working as a Bigdata architect in a small
> > startup company in Kochi, India. Our new project handles visualizing
> > large volumes of unstructured data in cloud storage (it can be S3,
> > Azure blob storage, or Google cloud storage). We are planning to use
> > Apache Drill as the SQL query execution engine so that we will be
> > cloud agnostic. Unfortunately, we are finding some key questions
> > unanswered before moving ahead with Drill as our platform. Hoping you
> > can provide some clarity; it will be much appreciated.
> >
> > 1. When setting up the drill cluster in a prod environment to query
> > data ranging from several gigabytes to a few terabytes hosted in
> > s3/blob storage/cloud storage, what are the considerations for disk
> > space? I understand drillbits make use of data locality, but how does
> > that work in the case of cloud storage like s3? Will the entire data
> > from s3 be moved to the drill cluster before starting the query
> > processing?
> > 2. Is it possible to use s3 or other cloud storage solutions for the
> > Sort, Hash Aggregate, and Hash Join operators' spill data rather than
> > using local disk?
> > 3. Is it ok to run a drill production cluster without hadoop? Is just
> > a zookeeper quorum enough?
> >
> > I totally understand how busy you can be, but if you get a chance,
> > please help me get clarity on these items. It will be really helpful.
> >
> > Thanks again!
> > Manu Mukundan
> > Bigdata Architect,
> > Prevalent AI,
> > [email protected]
>
> --
> Nitin Pawar
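
And on question 3: agreed, ZooKeeper is all the coordination Drill
needs. The relevant part of drill-override.conf is just the cluster id
and the ZK connect string; the hostnames below are placeholders:

    drill.exec: {
      cluster-id: "drillbits1",
      # Comma-separated ZooKeeper quorum; no Hadoop services required.
      zk.connect: "zk1:2181,zk2:2181,zk3:2181"
    }

Storage plugins for the cloud stores are then registered through the web
UI or the REST API, with the matching HDFS client jars on the classpath
as Paul describes.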
