As far as I know an UPDATE statement is planned for Spark, but I am not sure for 
which release. Alternatively you could use Hive + ORC, which already supports 
updates via ACID transactions.
Another option is to write the deltas to a separate file and, when accessing the 
table, filter out the duplicate entries. From time to time you could run a merge 
process that compacts the initial load and all the deltas into one file.
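For the filtering, a rough sketch in Spark (Scala) could look like the lines below. It assumes a spark-shell style sqlContext, Parquet files, and that every row carries a primary-key column "id" plus a modification timestamp "last_modified"; the column names and the paths are only placeholders for whatever your Oracle table actually has:

  import org.apache.spark.sql.functions.max

  // read the initial full dump plus all delta files from the same HDFS folder
  val all = sqlContext.read.parquet("/data/mytable/")

  // for every key, find the timestamp of its newest version
  val newest = all.groupBy("id").agg(max("last_modified").as("last_modified"))

  // keep only the newest row per key (the altered rows win over their old versions)
  val latest = all.join(newest, Seq("id", "last_modified"))

  latest.registerTempTable("mytable")   // no duplicate keys left when you query this

  // the periodic merge process: write one compacted snapshot, then swap it in
  // for the initial dump + delta files
  latest.write.mode("overwrite").parquet("/data/mytable_compacted/")

If two versions of a row can share the same timestamp you would need an extra tiebreaker column, and once the compacted snapshot is in place the old delta files can simply be deleted.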

> On 19 Jul 2016, at 21:27, Aakash Basu <raj2coo...@gmail.com> wrote:
> 
> Hi all,
> 
> I'm trying to pull a full table from Oracle, which is huge (some 10 million 
> records), as the initial load to HDFS.
> 
> Then I will do delta loads every day into the same folder in HDFS.
> 
> Now, my query here is,
> 
> DAY 0 - I did the initial load (full dump).
> 
> DAY 1 - I'll load only that day's data, which has, say, 10 records (5 old ones 
> with some columns' values altered and 5 new ones).
> 
> Here, my question is: how will I push this file to HDFS through Spark code? If 
> I do an append, it will create duplicates (which I don't want). If I keep 
> separate files and, when using them in another program, give the path of the 
> folder which contains all the files, then the registerTempTable will also have 
> duplicates for those 5 old rows.
> 
> What is the BEST logic to be applied here?
> 
> I tried to resolve this by searching that file for matching records, loading 
> the new ones and deleting the old, but that will be time consuming for such a 
> huge number of records, right?
> 
> Please help!
> 
> Thanks,
> Aakash.
