John, you mentioned the following (I have quoted it again below, after a quick sketch). If you are trying to incrementally load one day's worth of data into an existing Parquet table, wouldn't that be a new partition/directory for every day rather than overwriting an existing partition?
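In other words, couldn't the daily load just be another dated CTAS into the final table, along the lines of your step 3? A rough sketch of what I mean (the date is only illustrative, and the table, field, and directory names are borrowed from your own example):

    -- run by the ETL job, not by end users, so all_text_mode only applies here
    ALTER SESSION SET `store.json.all_text_mode` = true;
    CREATE TABLE `parquetfinal/2015-11-04` AS
    SELECT field1, field2, field3, field4
    FROM `streaminglocation`
    WHERE dir0 = '2015-11-04';

Each completed day would land in its own directory under `parquetfinal`, so nothing ever has to be overwritten. For reference, the statement I am referring to: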
INSERT OVERWRITE TABLE mytable (partition=2015-01-01)
select field1, field2, field3, field4, '11:40' as loadtime
from mytable where partition = '2015-01-01'

On Wed, Nov 4, 2015 at 8:00 AM, John Omernik <j...@omernik.com> wrote:

> I am trying to manage ETL of some data. Many of my questions over the last
> few days revolve around this data set, so here I am trying to put a
> manifesto post together. (Manifesto may be too strong a word, but I have a
> feeling it will be a large post :)
>
> So here are some points:
>
> 1. I am pulling data from an Amazon SQS queue using python and writing
> files to a MapRFS file location. This is working well, including writing
> files to filenames prefaced by a . until fully loaded so Drill ignores
> them. (Thanks Andries E for the tip there!)
>
> 2. I CAN directly query the JSON if I issue ALTER SESSION SET
> `store.json.all_text_mode` = true. This is because there are some array
> fields that use JSON nulls, and I've accepted that I have to read them as
> strings. (Note my questions on the list about setting this in a view
> rather than for the system (all queries, all users) or for the session
> (all queries for that user).)
>
> 3. My initial idea was to take data for a day and then load it into a
> parquet table with directory based partitions, i.e. I'd take the json from
> a complete day of data:
>
> ALTER SESSION SET `store.json.all_text_mode` = true;
> CREATE TABLE `parquetfinal/2015-11-03` AS
> SELECT field1, field2, field3, field4 FROM `streaminglocation`
> WHERE dir0 = '2015-11-03';
>
> This works great. I have all my data for the previous day and going back
> in time in an efficient storage format that does not require my users to
> set the store.json.all_text_mode setting in order to query it. Mission
> accomplished on day -1 and older data.
>
> 4. I then needed a way to "join" that data (yesterday and going backwards)
> with the current data. My initial thought was to use a view to just union
> the parquet table with the json table. The json table has the same date
> format in directories (i.e. dir0 = '2015-11-04'). However, we ran into a
> snag because I really don't want to have my users run the alter session
> command.
>
> I know this is a small point, but if a user is running a query and gets an
> error, that's a situation where my user either has to know about alter
> session, or they call the helpdesk, or they just give up and say "well
> this isn't ready" and don't use the data. Not all my users will live in
> the Drill docs and do so happily. Yet, I don't want to set that at a
> system level, or even by default at a session level, due to unforeseen
> consequences of having that setting set while querying a different data
> set and getting inconsistent results. So just joining the two as-is
> wouldn't work.
>
> 5. My next idea was to have a parquet backed incremental load table.
> Basically it takes around 20 seconds to load the json to parquet. I'd
> create a temp location outside the final parquet location, load a whole
> day's worth of data every 10 minutes into a new uniquely named directory,
> and update a view. Then I would create another view that would take the
> "final" location backed parquet table and UNION ALL it with the
> "incremental_load" parquet table (that has only one day). This worked, but
> I am getting no optimizations. A select count(*) on either of the two
> parts of the UNION ALL returns in 0.5 seconds; a select count(*) on the
> unioned view returns in 20 seconds. That is rough.
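Just to check that I am reading your step 5 right, I take it the slow view looks roughly like this (a sketch only; the view name and workspace are placeholders, `incremental_load` stands in for your uniquely named temp directory, and the field list is from your example):

    CREATE OR REPLACE VIEW dfs.tmp.`all_data_v` AS
    SELECT field1, field2, field3, field4 FROM `parquetfinal`
    UNION ALL
    SELECT field1, field2, field3, field4 FROM `incremental_load`;

i.e. the count(*) that takes 20 seconds is the one against this view, while a count(*) against either branch on its own comes back in about half a second.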
> I thought about having the temp data sit directly in the parquet backed
> final table, but I don't see a way to create my Parquet files and then
> replace the partition that exists in the table with the new data. (This
> is something I used to do in Hive with
>
> INSERT OVERWRITE TABLE mytable (partition=2015-01-01)
> select field1, field2, field3, field4, '11:40' as loadtime
> from mytable where partition = '2015-01-01'
>
> basically updating the partition with a new loadtime on every record in
> this example.)
>
> So I am now in a pickle. I want to be able to provide the streaming data
> to my users in as timely a manner as possible, I have the issue with the
> ALTER SESSION statement and the law of unintended consequences from
> changing a setting that will affect more than the table I am querying, and
> the UNIONs are not being optimized by Drill, causing slow queries. I am
> interested in feedback on how to address this ETL concern in the best way
> with Drill.
>
> And on the plus side, people can use my long email as a sleep aid... I'll
> call that my contribution to the devs who had too much caffeine :)
>
> John
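On the INSERT OVERWRITE piece specifically: Drill has no equivalent statement, so the closest workaround I can think of is to CTAS the replacement data into a staging directory and then swap directories on the filesystem. A sketch with purely illustrative names (`staging/...` is nothing Drill creates for you, and dir0 stands in for your date partition column):

    -- build the replacement copy of one day's partition next to the table
    CREATE TABLE `staging/2015-01-01` AS
    SELECT field1, field2, field3, field4, '11:40' AS loadtime
    FROM `mytable` WHERE dir0 = '2015-01-01';

and then move `staging/2015-01-01` over `mytable/2015-01-01` outside of Drill (hadoop fs -mv, or a plain mv on the MapR NFS mount). Not as tidy as Hive's INSERT OVERWRITE, but it gives the same effect of replacing a single day's partition in place.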