What are your thoughts on combining all the data in the individual directories? Is there an option to replace, like in Hive?
On Mon, Nov 2, 2015 at 9:27 AM, Andries Engelbrecht <aengelbre...@maprtech.com> wrote:

> You are probably on the right track to create a workspace for querying one or 2 levels up on the directory structure, and then use CTAS on subdirectories for incremental data loads.
>
> —Andries
>
> > On Nov 2, 2015, at 5:29 AM, John Omernik <j...@omernik.com> wrote:
> >
> > So as INSERT INTO isn't really an option, I would love some feedback on options with Drill for ETL. I understand Drill is very good with raw text; at the same time, for a highly used table, the advantages of taking text data and "final" storing it as Parquet would be handy. I was thinking through some options, and I thought I'd lay them out here, partially to organize my thoughts and partially to get feedback or look for other options.
> >
> > We have data coming in on a regular basis, say every 30 minutes. It's quasi-batched already, and we are looking to reduce the lag time of the data as much as possible. To start with, this data makes sense to have a "directory" partition based on the date. It can be fairly large data per day, and that will be the easiest to prune on.
> >
> > I "may" have this data already loaded as Parquet files from a Hadoop system (I think Impala), but I am not sure how well these files are formed, and Drill may have some advantage to loading them itself.
> >
> > So, with Drill, can I just copy the files into the "current" day partition and have it be good? Should I put subdirectories (dir1) with the time of the load, or does that not give me anything?
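As a rough illustration of the layout Andries describes (a workspace one level above date-named subdirectories, with incremental CTAS loads), a Drill SQL sketch might look like the following. The workspace, table, and path names here are hypothetical examples, not from the thread:

```sql
-- Assume a dfs workspace `dfs.etl` whose root sits one level above the
-- table directory, with one subdirectory per date:
--   /data/etl/mytable/2015-11-01/*.parquet
--   /data/etl/mytable/2015-11-02/*.parquet
-- Drill exposes each directory level as an implicit column (dir0, dir1, ...),
-- so filtering on dir0 prunes whole day directories:
SELECT dir0 AS load_date, COUNT(*) AS row_cnt
FROM dfs.etl.`mytable`
WHERE dir0 = '2015-11-02'
GROUP BY dir0;

-- An incremental load is then a CTAS targeting the day's subdirectory
-- (source workspace `dfs.raw` is likewise hypothetical):
CREATE TABLE dfs.etl.`mytable/2015-11-02` AS
SELECT * FROM dfs.raw.`incoming/2015-11-02`;
```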
> >
> > At the end of the day, is there anything in Drill like Hive where I can say "INSERT INTO table (partition=curday) SELECT * FROM table WHERE partition=curday" that could read all the loaded Parquet files as a set and rewrite them, potentially optimized for queries (adding file-level partitions via PARTITIONED BY to the directory-level partitions)?
> >
> > Obviously, without INSERT INTO functionality that wouldn't work, but I could create a table as a tmp partition like CREATE TABLE `tablename/dailyload` AS SELECT * FROM `tablename` WHERE dir0 = 'curday'. If that completes, rm 'curday' and mv dailyload curday. That should work, right?
> >
> > Will Drill work if I have all my days with only dir0 but the daily load has dir1? That may be a challenge, but maybe I don't have to have subdirectories for each hour. I will admit I am coming at this from a Hive perspective, but I'm trying to explain my thinking to see if folks can point out other options or ideas in the Drill mindset.
> >
> > Thanks!
> >
> > John
> >
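The CTAS-then-swap flow John proposes (rewrite the day into a temporary subdirectory, then replace the original) can be sketched in shell. All paths and file names below are hypothetical demo values; in practice step 1 would be the Drill CTAS shown in the comment rather than `touch`:

```shell
# Step 1 (in Drill, not shell) would rewrite the day into a tmp subdirectory:
#   CREATE TABLE dfs.etl.`tablename/dailyload` AS
#   SELECT * FROM dfs.etl.`tablename` WHERE dir0 = 'curday';
# Here we simulate the resulting directories with plain filesystem ops.
TABLE_DIR=/tmp/drill_etl_demo/tablename
mkdir -p "$TABLE_DIR/curday" "$TABLE_DIR/dailyload"
touch "$TABLE_DIR/curday/raw_0.parquet"       # the original raw-loaded files
touch "$TABLE_DIR/dailyload/opt_0.parquet"    # the CTAS-rewritten files

# Step 2: only after the rewrite completes, swap the new directory into place.
rm -rf "$TABLE_DIR/curday"                    # drop the raw copy of the day
mv "$TABLE_DIR/dailyload" "$TABLE_DIR/curday" # promote the rewritten copy
ls "$TABLE_DIR/curday"
```

Note the swap is not atomic: between the `rm` and the `mv`, queries touching that day see no data, so this is best run in a maintenance window or against a low-traffic partition.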