I do not know what made me think back to this thread, but it is a
rather cool idea. I guess NoSQL is now more attractive for this, but
if you wanted to do it...

a1
pk  first  last     lastmod  deleted
1   A      lincoln  1        0
2   bob    barker   2        0

a2
pk  first  last     lastmod  deleted
1   A      Lincoln  4        0
2   X      X        6        1

FROM (
  SELECT u.* FROM (SELECT * FROM a1 UNION ALL SELECT * FROM a2) u
  DISTRIBUTE BY pk SORT BY pk, lastmod ASC
) s
SELECT TRANSFORM (s.pk, s.first, s.last, s.lastmod, s.deleted)
  USING 'latest_not_delete.pl' AS (pk1, first1, last1, lastmod1, deleted1)

Where 'latest_not_delete.pl' would be a script that reads the rows for each pk
(which arrive grouped by pk and sorted by lastmod) and returns only the most
recent one that is not deleted.
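
For the toy tables above, the script would receive the two pk 1 rows (lastmod 1
and 4) together and emit only the lastmod 4 row; for pk 2 it would emit nothing,
since the latest row (lastmod 6) is a delete. For comparison, roughly the same
result can be had in plain HiveQL without a streaming script -- an untested
sketch:

SELECT u.pk, u.first, u.last, u.lastmod, u.deleted
FROM (SELECT * FROM a1 UNION ALL SELECT * FROM a2) u
JOIN (SELECT pk, MAX(lastmod) AS max_lastmod
      FROM (SELECT * FROM a1 UNION ALL SELECT * FROM a2) v
      GROUP BY pk) m
  ON (u.pk = m.pk AND u.lastmod = m.max_lastmod)
WHERE u.deleted = 0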

Has anyone ever tried something like this?

On Wed, Jun 9, 2010 at 6:32 PM, atreju <n.atr...@gmail.com> wrote:
> As an ideal solution, I have a suggestion to Hive contributors to make it
> look like Hive is doing insert/update/delete:
>
>
> This will require a special type of table creation syntax; I will call it an
> "UPDATABLE TABLE". The table will have 3 special columns defined in the create
> script (a rough sketch follows the list):
> 1. The primary key column. (let's say: col_pk)
> 2. A BIGINT date column storing the milliseconds from Jan 1st, 1970 to the
> actual data manipulation date/time in the RDBMS. (dml_date)
> 3. A TINYINT or BOOLEAN column that stores 0 if the record is deleted and 1
> if it is inserted or updated. (dml_action)
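>
> Purely as an illustration, the DDL might look something like this (hypothetical
> syntax -- "UPDATABLE" is not real Hive, and the business columns are made up):
>
> CREATE UPDATABLE TABLE customers (     -- hypothetical keyword
>   col_pk     BIGINT,                   -- primary key from the RDBMS
>   first_name STRING,                   -- ordinary business columns
>   last_name  STRING,
>   dml_date   BIGINT,                   -- ms since Jan 1st, 1970 of the change
>   dml_action TINYINT                   -- 0 = deleted, 1 = inserted/updated
> );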
>
> This will require the RDBMS table to have a PK and a last-update-date column,
> and deletes to be recorded in some other table by pk and date.
>
> On Day 1, the entire table is put into Hadoop, with the addition of 2 extra
> columns: dml_date (bigint) and dml_action.
>
> On Day 2, we first find the max of dml_date from the Hive table. Then we query
> the RDBMS for inserts/updates/deletes since that date/time and write them into
> a file with the correct dml_date/dml_action. The file goes into the same
> folder that our Hive table is in.
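>
> In SQL terms this might look roughly like the following (hypothetical
> table/column names; ${max_dml_date} is just a placeholder for the value read
> from Hive):
>
> -- in Hive: find the high-water mark
> SELECT MAX(dml_date) FROM customers;
>
> -- in the RDBMS: pull everything changed or deleted since then
> SELECT pk AS col_pk, first_name, last_name,
>        last_update_ms AS dml_date, 1 AS dml_action
> FROM customers
> WHERE last_update_ms > ${max_dml_date}
> UNION ALL
> SELECT pk AS col_pk, NULL, NULL, delete_ms AS dml_date, 0 AS dml_action
> FROM customer_deletes
> WHERE delete_ms > ${max_dml_date};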
>
> Right now, if on Day 1 we had 100 rows and on Day 2 we get 10 more, a regular
> Hive table would show 110 rows. But since this is a special table (UPDATABLE
> TABLE), every time it is queried in Hive, Hive first runs a map-reduce that
> finds the most recent (max(dml_date)) row per pk (group by col_pk) that is not
> deleted (dml_action!=0) and uses that output in the user's query. That is the
> big idea!!
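>
> Roughly, the query Hive would run behind the scenes could look like this (a
> sketch only, reusing the hypothetical customers table from above; old Hive has
> no window functions, so a self-join against the per-pk max):
>
> SELECT t.*
> FROM customers t
> JOIN (SELECT col_pk, MAX(dml_date) AS max_dml_date
>       FROM customers
>       GROUP BY col_pk) latest
>   ON (t.col_pk = latest.col_pk AND t.dml_date = latest.max_dml_date)
> WHERE t.dml_action != 0;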
>
> Hive can have Insert/Update/Delete commands that do nothing but create a file
> with the manipulated rows and the correct date and action.
>
> There can be a special "flush" kind of command that runs the MR and replaces
> all files in the table directory with a single file. That could run weekly,
> monthly, or maybe after each time DML data is received from the RDBMS.
>
> Sqoop could have a Hive interface that saves certain table attributes like the
> pk column, RDBMS connection info,... so that with one command from Hive, the
> Hive table gets updated from the RDBMS....
>
> What do you think?
>
>
>
> On Tue, Jun 8, 2010 at 3:58 PM, Aaron Kimball <aa...@cloudera.com> wrote:
>>
>> I think that this might be the way to go. In general, folding updates and
>> deletes into existing datasets is a difficult problem due to their
>> append-only nature.
>>
>> Something that might help you here is to partition your tables in Hive
>> based on some well-distributed key. Then, if a relatively small number of
>> partitions is affected by an incremental import (perhaps more
>> recently-imported records are more likely to be updated? in that case,
>> partition the tables by the month/week you imported them), you only need to
>> perform the fold-in of the new deltas on the affected partitions. This
>> should be much faster than a full table scan.
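>>
>> A rough sketch of the idea, with made-up table and partition names:
>>
>> -- the Hive copy, partitioned by the period the rows were imported in
>> CREATE TABLE base_table (pk BIGINT, col1 STRING, col2 STRING,
>>                          lastmod BIGINT, deleted TINYINT)
>> PARTITIONED BY (import_month STRING);
>>
>> -- when a delta arrives, rebuild only the partitions it touches;
>> -- 'merged_2010_06' stands for that month's rows with the delta
>> -- already folded in (hypothetical)
>> INSERT OVERWRITE TABLE base_table PARTITION (import_month='2010-06')
>> SELECT pk, col1, col2, lastmod, deleted
>> FROM merged_2010_06;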
>>
>> Have you seen the Sqoop tool? It handles imports and exports between HDFS
>> (and Hive) and RDBMS systems -- but currently it can only import new records
>> (and subsequent INSERTs); it can't handle updates/deletes. Sqoop is
>> available at http://github.com/cloudera/sqoop -- it doesn't run on Apache
>> Hadoop 0.20.3, but works on CDH (Cloudera's Distribution for Hadoop) and
>> Hadoop 0.21/trunk.
>>
>> This sort of capability is something I'm really interested in adding to
>> Sqoop. If you've got a well-run process for doing this, I'd really
>> appreciate your help adding this feature :) Send me an email off-list if
>> you're interested. At the very least, I'd urge you to try out the tool.
>>
>> Cheers,
>> - Aaron Kimball
>>
>>
>> On Tue, Jun 8, 2010 at 8:54 PM, atreju <n.atr...@gmail.com> wrote:
>>>
>>> To generate smart output from base data, we need to copy some base tables
>>> from the relational database into Hadoop. Some of them are big. Dumping the
>>> entire table into Hadoop every day is not an option, since there are 30+
>>> tables and each would take several hours.
>>>
>>> The approach we took is to get an entire table dump first. Then, each day or
>>> every 4-6 hours, we get only the inserts/updates/deletes since the last copy
>>> from the RDBMS (based on a date field in the table). Using Hive, we outer
>>> join + union the new data with the existing data and write the result into a
>>> new file. For example, if there are 100 rows in Hadoop and, since the last
>>> Hadoop copy, 3 records were inserted, 2 updated and 1 deleted in the RDBMS,
>>> then the Hive query will take the 97 unchanged rows + the 3 inserts + the 2
>>> updates and write them into a new file. The other applications like Pig or
>>> Hive will pick the most recent file to use when selecting/loading data from
>>> those base table data files.
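>>>
>>> A rough HiveQL sketch of that fold-in (names are made up: 'base' is the
>>> current Hadoop copy, 'delta' holds the new inserts/updates, 'deleted_keys'
>>> the pks deleted since the last copy, and 'base_new' stands for the new file):
>>>
>>> INSERT OVERWRITE TABLE base_new
>>> SELECT u.pk, u.col1, u.col2, u.lastmod
>>> FROM (
>>>   -- unchanged rows: in base, not updated and not deleted
>>>   SELECT b.pk AS pk, b.col1 AS col1, b.col2 AS col2, b.lastmod AS lastmod
>>>   FROM base b
>>>   LEFT OUTER JOIN delta d ON (b.pk = d.pk)
>>>   LEFT OUTER JOIN deleted_keys k ON (b.pk = k.pk)
>>>   WHERE d.pk IS NULL AND k.pk IS NULL
>>>   UNION ALL
>>>   -- new and updated rows come straight from the delta
>>>   SELECT d.pk AS pk, d.col1 AS col1, d.col2 AS col2, d.lastmod AS lastmod
>>>   FROM delta d
>>> ) u;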
>>>
>>> This logic is working fine in lower environments for small tables. With
>>> production data, for a table of about 30GB, the incremental re-generation of
>>> the file in Hadoop is still taking several hours. I tried using a zipped
>>> version and it took even longer. I am not convinced that this is the best we
>>> can do to handle updates and deletes, since we had to re-write 29GB of
>>> unchanged data from the 30GB file into a new file. ...and this is not the
>>> biggest table.
>>>
>>> I am thinking that this must be a problem for many companies. What are the
>>> other approaches to applying updates and deletes on base tables to the
>>> Hadoop data files?
>>>
>>> We have 4 data nodes and are using version 20.3.
>>>
>>> Thanks!
>>>
>>
>>
>
