Hi Paul,

This sounds interesting.

Can you elaborate a bit more on:

Do all the files need to be loaded into memory, or will it swap hot files
in and out of memory?

Have you used it with Drill / Parquet?

Is the built-in columnar store relevant (the native store for row tables)?
(I would think not.)

Regards,
 -Stefan


On Tue, Jul 14, 2015 at 7:13 PM, Paul Mogren <pmog...@commercehub.com>
wrote:

> Stefan,
>
> You might be interested in http://tachyon-project.org
>
>
>
>
> On 7/14/15, 1:12 PM, "Stefán Baxter" <ste...@activitystream.com> wrote:
>
> >Hi,
> >
> >Thank you.
> >
> >I was not suggesting this to be a part of Drill, only asking if any
> >experience exists in this area. :)
> >
> >I'm trying to evaluate an S3-almost-only setup vs. HDFS, so your points
> >are handy.
> >
> >Regards,
> > -Stefan
> >
> >
> >
> >On Tue, Jul 14, 2015 at 5:08 PM, Jason Altekruse
> ><altekruseja...@gmail.com>
> >wrote:
> >
> >> I am not aware of anyone doing something like this today, but it seems
> >> like something best handled outside of Drill right now. Drill considers
> >> itself essentially stateless; we do not manage indexes, table
> >> constraints, or cached data for any of our current storage systems.
> >> There has been some work on caching Parquet metadata, in which all of
> >> the Parquet footers are placed in a single file that has to be
> >> refreshed manually. This work has not made it into the mainline yet,
> >> but you can follow the progress here:
> >>
> >> https://issues.apache.org/jira/browse/DRILL-2743
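
For illustration, here is a minimal Java sketch of the expensive step such
a footer cache would amortize: reading every Parquet footer under a
directory in one pass. The class name and the ".parquet" suffix check are
assumptions for the example; this is not what DRILL-2743 implements.

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.hadoop.ParquetFileReader;
    import org.apache.parquet.hadoop.metadata.ParquetMetadata;

    public class FooterScanSketch {
      // Reads the footer of every Parquet file directly under `dir`.
      // A single-file metadata cache would persist this result so the
      // footers do not have to be re-read from S3 on every query.
      public static List<ParquetMetadata> readFooters(Configuration conf, Path dir)
          throws IOException {
        FileSystem fs = dir.getFileSystem(conf);
        List<ParquetMetadata> footers = new ArrayList<>();
        for (FileStatus status : fs.listStatus(dir)) {
          if (status.getPath().getName().endsWith(".parquet")) {
            footers.add(ParquetFileReader.readFooter(conf, status.getPath()));
          }
        }
        return footers;
      }
    }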
> >>
> >> I would take a look around for general-purpose local caching systems
> >> for S3. To make these work with Drill today, they would have to
> >> re-expose the HDFS API. There might be something out there that already
> >> does this, but since some of the primary users of S3 are web
> >> application developers, they might not have worried about providing the
> >> HDFS API on top of any caching systems developed to date.
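
As a concrete picture of what "re-exposing the HDFS API" would mean, here
is a hypothetical Java sketch of a caching layer built on Hadoop's
FilterFileSystem, which wraps another FileSystem and lets you intercept
reads. The lookupLocalCopy() helper is invented for the example; no such
caching system is implied to exist.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.FilterFileSystem;
    import org.apache.hadoop.fs.Path;

    public class CachingS3FileSystem extends FilterFileSystem {

      public CachingS3FileSystem(FileSystem s3) {
        super(s3); // delegate everything to the real S3 FileSystem by default
      }

      @Override
      public FSDataInputStream open(Path path, int bufferSize) throws IOException {
        Path local = lookupLocalCopy(path);
        if (local != null) {
          // Hot segment: serve the read from the node-local copy.
          return FileSystem.getLocal(new Configuration()).open(local, bufferSize);
        }
        // Cache miss: fall through to S3. A real system would also
        // populate the local cache here.
        return fs.open(path, bufferSize);
      }

      // Invented helper: map an S3 path to a local cache path, if present.
      private Path lookupLocalCopy(Path path) {
        return null; // stub for the example
      }
    }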
> >>
> >> One thing to note: the HDFS API is already available on top of the
> >> local file system; this is what enables us to read from the local disk
> >> in embedded mode. If you can get a caching system to expose NFS, you
> >> could mount it at the same path on all of your nodes, and Drill should
> >> be able to read from that path on the local FS.
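
A quick sketch of that last point, assuming a cache exposed over NFS and
mounted at /mnt/s3cache on every node (the path and file name are made up):
the standard HDFS API can read it through the file:// scheme, which is
exactly how embedded-mode Drill reads local disk.

    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class LocalMountCheck {
      public static void main(String[] args) throws Exception {
        // The HDFS API over the local file system: file:// instead of hdfs://.
        FileSystem localFs = FileSystem.get(URI.create("file:///"), new Configuration());
        // Hypothetical NFS mount point, identical on all drillbit nodes.
        Path segment = new Path("/mnt/s3cache/historical/segment-000.parquet");
        System.out.println(segment + " length: " + localFs.getFileStatus(segment).getLen());
      }
    }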
> >>
> >>
> >>
> >> On Tue, Jul 14, 2015 at 1:06 AM, Stefán Baxter
> >><ste...@activitystream.com>
> >> wrote:
> >>
> >> > Hi,
> >> >
> >> > I'm wondering if the people who use Drill with S3 are using some sort
> >> > of local cache on the drillbit nodes for historical, non-changing
> >> > Parquet segments.
> >> >
> >> > I'm pretty sure that I'm not using the correct terminology and that
> >> > the correct question is this: are there any ways to optimize S3 with
> >> > Drill so that "hot segments" are stored locally while they are hot
> >> > and then dropped from the local nodes when they are not?
> >> >
> >> > I guess this only really matters where the network speed between the
> >> > drillbit nodes and S3 is not optimal.
> >> >
> >> > Regards,
> >> >  -Stefan
> >> >
> >>
>
>
