John, you mentioned the following (I have quoted it again below, after a quick sketch). If you are trying to incrementally load one day's worth of data into an existing Parquet table, wouldn't that be a new partition/directory for every day rather than overwriting an existing partition?
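In other words, couldn't the daily load just be another dated CTAS into the final table, along the lines of your step 3? A rough sketch of what I mean (the date is only illustrative, and the table, field, and directory names are borrowed from your own example):

    -- run by the ETL job, not by end users, so all_text_mode only applies here
    ALTER SESSION SET `store.json.all_text_mode` = true;
    CREATE TABLE `parquetfinal/2015-11-04` AS
    SELECT field1, field2, field3, field4
    FROM `streaminglocation`
    WHERE dir0 = '2015-11-04';

Each completed day would land in its own directory under `parquetfinal`, so nothing ever has to be overwritten. For reference, the statement I am referring to: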
INSERT OVERWRITE TABLE mytable (partition=2015-01-01)
select field1, field2, field3, field4, '11:40' as loadtime
from mytable where partition = '2015-01-01'

On Wed, Nov 4, 2015 at 8:00 AM, John Omernik <j...@omernik.com> wrote:

> I am trying to manage ETL of some data. Many of my questions over the last
> few days revolve around this data set, so here I am trying to put a
> manifesto post together. (Manifesto may be too strong a word, but I have a
> feeling it will be a large post :)
>
> So here are some points:
>
> 1. I am pulling data from an Amazon SQS queue using python and writing
> files to a MapRFS file location. This is working well, including writing
> files to filenames prefaced by a . until fully loaded so Drill ignores
> them. (Thanks Andries E for the tip there!)
>
> 2. I CAN directly query the JSON if I issue ALTER SESSION SET
> `store.json.all_text_mode` = true. This is because there are some array
> fields that use JSON nulls, and I've accepted that I have to read them as
> strings. (Note my questions on the list about setting this in a view
> rather than for the system (all queries, all users) or for the session
> (all queries for that user).)
>
> 3. My initial idea was to take data for a day and then load it into a
> parquet table with directory based partitions, i.e. I'd take the json from
> a complete day of data:
>
> ALTER SESSION SET `store.json.all_text_mode` = true;
> CREATE TABLE `parquetfinal/2015-11-03` AS
> SELECT field1, field2, field3, field4 FROM `streaminglocation`
> WHERE dir0 = '2015-11-03';
>
> This works great. I have all my data for the previous day and going back
> in time in an efficient storage format that does not require my users to
> set the store.json.all_text_mode setting in order to query it. Mission
> accomplished on day -1 and older data.
>
> 4. I then needed a way to "join" that data (yesterday and going backwards)
> with the current data. My initial thought was to use a view to just union
> the parquet table with the json table. The json table has the same date
> format in directories (i.e. dir0 = '2015-11-04'). However, we ran into a
> snag because I really don't want to have my users run the alter session
> command.
>
> I know this is a small point, but if a user is running a query and gets an
> error, that's a situation where my user either has to know about alter
> session, or they call the helpdesk, or they just give up and say "well
> this isn't ready" and don't use the data. Not all my users will live in
> the Drill docs and do so happily. Yet, I don't want to set that at a
> system level, or even by default at a session level, due to unforeseen
> consequences of having that setting set while querying a different data
> set and getting inconsistent results. So just joining the two as-is
> wouldn't work.
>
> 5. My next idea was to have a parquet backed incremental load table.
> Basically it takes around 20 seconds to load the json to parquet. I'd
> create a temp location outside the final parquet location, load a whole
> day's worth of data every 10 minutes into a new uniquely named directory,
> and update a view. Then I would create another view that would take the
> "final" location backed parquet table and UNION ALL it with the
> "incremental_load" parquet table (that has only one day). This worked, but
> I am getting no optimizations. A select count(*) on either of the two
> parts of the UNION ALL returns in 0.5 seconds; a select count(*) on the
> unioned view returns in 20 seconds. That is rough.
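Just to check that I am reading your step 5 right, I take it the slow view looks roughly like this (a sketch only; the view name and workspace are placeholders, `incremental_load` stands in for your uniquely named temp directory, and the field list is from your example):

    CREATE OR REPLACE VIEW dfs.tmp.`all_data_v` AS
    SELECT field1, field2, field3, field4 FROM `parquetfinal`
    UNION ALL
    SELECT field1, field2, field3, field4 FROM `incremental_load`;

i.e. the count(*) that takes 20 seconds is the one against this view, while a count(*) against either branch on its own comes back in about half a second.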
> I thought about having the temp data sit directly in the parquet backed
> final table, but I don't see a way to create my Parquet files and then
> replace the partition that exists in the table with the new data. (This
> is something I used to do in Hive with
>
> INSERT OVERWRITE TABLE mytable (partition=2015-01-01)
> select field1, field2, field3, field4, '11:40' as loadtime
> from mytable where partition = '2015-01-01'
>
> basically updating the partition with a new loadtime on every record in
> this example.)
>
> So I am now in a pickle. I want to be able to provide the streaming data
> to my users in as timely a manner as possible, I have the issue with the
> ALTER SESSION statement and the law of unintended consequences from
> changing a setting that will affect more than the table I am querying, and
> the UNIONs are not being optimized by Drill, causing slow queries. I am
> interested in feedback on how to address this ETL concern in the best way
> with Drill.
>
> And on the plus side, people can use my long email as a sleep aid... I'll
> call that my contribution to the devs who had too much caffeine :)
>
> John
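On the INSERT OVERWRITE piece specifically: Drill has no equivalent statement, so the closest workaround I can think of is to CTAS the replacement data into a staging directory and then swap directories on the filesystem. A sketch with purely illustrative names (`staging/...` is nothing Drill creates for you, and dir0 stands in for your date partition column):

    -- build the replacement copy of one day's partition next to the table
    CREATE TABLE `staging/2015-01-01` AS
    SELECT field1, field2, field3, field4, '11:40' AS loadtime
    FROM `mytable` WHERE dir0 = '2015-01-01';

and then move `staging/2015-01-01` over `mytable/2015-01-01` outside of Drill (hadoop fs -mv, or a plain mv on the MapR NFS mount). Not as tidy as Hive's INSERT OVERWRITE, but it gives the same effect of replacing a single day's partition in place.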