What are your thoughts on combining all the data in the individual directories? Is there an option to replace, like in Hive?
On Mon, Nov 2, 2015 at 9:27 AM, Andries Engelbrecht <aengelbre...@maprtech.com> wrote:

> You are probably on the right track to create a workspace for querying one or 2 levels up on the directory structure, and then use CTAS on subdirectories for incremental data loads.
>
> —Andries
>
> > On Nov 2, 2015, at 5:29 AM, John Omernik <j...@omernik.com> wrote:
> >
> > So as INSERT INTO isn't really an option, I would love some feedback on options with Drill for ETL. I understand Drill is very good with raw text; at the same time, for a highly used table, the advantages of taking text data and "final" storing it as Parquet would be handy. I was thinking through some options, and I thought I'd lay them out here, partially to organize my thoughts and partially to get feedback or look for other options.
> >
> > We have data coming in on a regular basis, say every 30 minutes. It's quasi-batched already, and we are looking to reduce the lag time of the data as much as possible. To start with, this data makes sense to have a "directory" partition based on the date. It can be fairly large data per day, and that will be the easiest to prune on.
> >
> > I "may" have this data already loaded as Parquet files from a Hadoop system (I think Impala), but I am not sure how well these files are formed, and Drill may have some advantage to loading them itself.
> >
> > So, with Drill, can I just copy the files into the "current" day partition and have it be good? Should I put subdirectories (dir1) with the time of the load, or does that not give me anything?
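As a rough illustration of the layout Andries describes (a workspace one level above date-named subdirectories, with incremental CTAS loads), a Drill SQL sketch might look like the following. The workspace, table, and path names here are hypothetical examples, not from the thread:

```sql
-- Assume a dfs workspace `dfs.etl` whose root sits one level above the
-- table directory, with one subdirectory per date:
--   /data/etl/mytable/2015-11-01/*.parquet
--   /data/etl/mytable/2015-11-02/*.parquet
-- Drill exposes each directory level as an implicit column (dir0, dir1, ...),
-- so filtering on dir0 prunes whole day directories:
SELECT dir0 AS load_date, COUNT(*) AS row_cnt
FROM dfs.etl.`mytable`
WHERE dir0 = '2015-11-02'
GROUP BY dir0;

-- An incremental load is then a CTAS targeting the day's subdirectory
-- (source workspace `dfs.raw` is likewise hypothetical):
CREATE TABLE dfs.etl.`mytable/2015-11-02` AS
SELECT * FROM dfs.raw.`incoming/2015-11-02`;
```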
> >
> > At the end of the day, is there anything in Drill like Hive where I can say "INSERT INTO table (partition=curday) SELECT * FROM table WHERE partition=curday" that could read all the loaded Parquet files as a set and rewrite them, potentially optimized for queries (adding file-level partitions via PARTITIONED BY to the directory-level partitions)?
> >
> > Obviously, without INSERT INTO functionality that wouldn't work, but I could create a table as a tmp partition like CREATE TABLE `tablename/dailyload` AS SELECT * FROM `tablename` WHERE dir0 = 'curday'. If that completes, rm 'curday' and mv dailyload curday. That should work, right?
> >
> > Will Drill work if I have all my days with only dir0 but the daily load has dir1? That may be a challenge, but maybe I don't have to have subdirectories for each hour. I will admit I am coming at this from a Hive perspective, but I'm trying to explain my thinking to see if folks can point out other options or ideas in the Drill mindset.
> >
> > Thanks!
> >
> > John
> >
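The CTAS-then-swap flow John proposes (rewrite the day into a temporary subdirectory, then replace the original) can be sketched in shell. All paths and file names below are hypothetical demo values; in practice step 1 would be the Drill CTAS shown in the comment rather than `touch`:

```shell
# Step 1 (in Drill, not shell) would rewrite the day into a tmp subdirectory:
#   CREATE TABLE dfs.etl.`tablename/dailyload` AS
#   SELECT * FROM dfs.etl.`tablename` WHERE dir0 = 'curday';
# Here we simulate the resulting directories with plain filesystem ops.
TABLE_DIR=/tmp/drill_etl_demo/tablename
mkdir -p "$TABLE_DIR/curday" "$TABLE_DIR/dailyload"
touch "$TABLE_DIR/curday/raw_0.parquet"       # the original raw-loaded files
touch "$TABLE_DIR/dailyload/opt_0.parquet"    # the CTAS-rewritten files

# Step 2: only after the rewrite completes, swap the new directory into place.
rm -rf "$TABLE_DIR/curday"                    # drop the raw copy of the day
mv "$TABLE_DIR/dailyload" "$TABLE_DIR/curday" # promote the rewritten copy
ls "$TABLE_DIR/curday"
```

Note the swap is not atomic: between the `rm` and the `mv`, queries touching that day see no data, so this is best run in a maintenance window or against a low-traffic partition.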