Well, as far as I know an UPDATE statement is planned for Spark, but I'm not sure for which release. You could alternatively use Hive + ORC, which already supports ACID updates. Another alternative is to write the deltas to separate files and, when accessing the table, filter out the superseded entries (a minimal sketch of that follows below). From time to time you could run a merge process that compacts all the deltas into one file.
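To make the "filter out the superseded entries" idea concrete, here is a minimal sketch in Spark (Scala, 2.x API; on 1.x, SQLContext and registerTempTable work the same way). It assumes each row carries a primary-key column (`id` here) and that every load is written with a `load_ts` column recording when it arrived; those names, the paths, and the Parquet format are all placeholders for illustration, not anything from your setup. On read, a window function keeps only the newest version of each key, and the same result can periodically be written back as the new consolidated base:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions._

    object DeltaDedup {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("DeltaDedup").getOrCreate()

        // Initial full dump and the daily delta files (placeholder paths).
        val base   = spark.read.parquet("/data/mytable/base")
        val deltas = spark.read.parquet("/data/mytable/deltas")

        // Newest record per primary key wins; `id` and `load_ts` are assumed columns.
        val byKeyNewestFirst = Window.partitionBy("id").orderBy(col("load_ts").desc)

        val current = base.union(deltas)
          .withColumn("rn", row_number().over(byKeyNewestFirst))
          .filter(col("rn") === 1)
          .drop("rn")

        // Downstream jobs query this view instead of the raw folder,
        // so the 5 altered rows appear only once.
        current.createOrReplaceTempView("mytable")

        // Periodic compaction: persist the merged result as the new base.
        current.write.mode("overwrite").parquet("/data/mytable/base_compacted")

        spark.stop()
      }
    }

Note that the compacted output goes to a fresh path rather than over the folder being read, since Spark cannot safely overwrite its own input in the same job; once it finishes you would swap the directories and clear out the deltas.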
> On 19 Jul 2016, at 21:27, Aakash Basu <raj2coo...@gmail.com> wrote:
>
> Hi all,
>
> I'm trying to pull a full table from Oracle, which is huge, with some 10
> million records; this will be the initial load to HDFS.
>
> Then I will do delta loads every day into the same folder in HDFS.
>
> Now, my query here is:
>
> DAY 0 - I did the initial load (full dump).
>
> DAY 1 - I'll load only that day's data, which has, suppose, 10 records (5 old
> ones with some column's value altered and 5 new ones).
>
> Here, my question is: how will I push this file to HDFS through Spark code?
> If I do append, it will create duplicates (which I don't want). If I keep
> separate files, then when using them in another program I give the path of the
> folder which contains all the files. But in that case the registerTempTable
> will also have duplicates for those 5 old rows.
>
> What is the BEST logic to be applied here?
>
> I tried to resolve this by searching that file for matching records and
> loading the new ones after deleting the old, but this will be time
> consuming for such a huge number of records, right?
>
> Please help!
>
> Thanks,
> Aakash.