Re: Little idea needed

2016-07-20 Thread Aakash Basu
Thanks for the detailed description buddy. But this will actually be done through NiFi (end to end), so we need to add the delta logic inside NiFi to automate the whole process. That's why we need a good (best) solution to this problem. Since this is a classic issue which we can face any

Re: Little idea needed

2016-07-20 Thread Aakash Basu
On your second point: that's going to be a bottleneck for all the programs which will fetch the data from that folder and then have to add extra filters on the DataFrame. I want to finish that off there itself. And that merge logic is weak when one table is huge and the other is very small (which is the case
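
To make the size asymmetry concrete, here is a minimal Spark (Scala) sketch of one way to fold a small delta into a large base so downstream jobs need no extra filters; the paths and the "id" key column are assumptions for the example, not anything agreed in this thread. Broadcasting the small delta keeps the large table from being shuffled during the anti-join (Spark 2.x assumed).

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.broadcast

    val spark = SparkSession.builder().appName("delta-merge").getOrCreate()

    // Assumed locations: a large base snapshot and a small daily delta.
    val base  = spark.read.parquet("/data/customers/base")
    val delta = spark.read.parquet("/data/customers/delta")

    // Broadcast the small delta so only the delta is shipped to the executors;
    // the anti-join drops the base rows that the delta supersedes.
    val merged = base
      .join(broadcast(delta), Seq("id"), "left_anti")
      .union(delta)

    // Write the consolidated snapshot (to a new path, since Spark cannot
    // overwrite a path while reading from it) so readers see a clean table.
    merged.write.mode("overwrite").parquet("/data/customers/base_new")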

Re: Little idea needed

2016-07-20 Thread Mich Talebzadeh
In reality, true real-time analytics will require interrogating the transaction (redo) log of the RDBMS to look for changes. An RDBMS will only keep one current record (the most recent), so if a record has been deleted since the last import into HDFS, that record will not exist. If the record has been

Re: Little idea needed

2016-07-19 Thread ayan guha
Well, this one keeps cropping up in every project, especially when Hadoop is implemented alongside an MPP. In fact, there is no reliable out-of-the-box update operation available in HDFS, Hive or Spark. Hence, one approach is what Mich suggested: do not update. Rather, just keep all source
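
As a rough illustration of this "never update, keep every version" approach, here is a Spark (Scala) sketch; the folder path, the id key and the load_ts timestamp column are assumptions for the example. The current view is rebuilt from the append-only history with a window function that keeps the newest row per key.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{col, row_number}

    val spark = SparkSession.builder().appName("latest-view").getOrCreate()

    // Full dump plus every delta, all appended into one folder.
    val history = spark.read.parquet("/data/customers/history")

    // Rank each key's versions by load time, newest first.
    val newestFirst = Window.partitionBy("id").orderBy(col("load_ts").desc)

    val current = history
      .withColumn("rn", row_number().over(newestFirst))
      .filter(col("rn") === 1)   // keep only the most recent version of each key
      .drop("rn")

    current.createOrReplaceTempView("customers_current")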

Re: Little idea needed

2016-07-19 Thread Mich Talebzadeh
Well, this is a classic. The initial load can be done through Sqoop (outside of Spark) or through a JDBC connection in Spark. 10 million rows is nothing. Then you have to think of updates and deletes in addition to new rows. With Sqoop you can load from the last ID in the source table, assuming
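
For reference, a minimal Spark (Scala) sketch of the initial load over JDBC; the connection string, credentials, table name and ID bounds are placeholders. Partitioning the read on the numeric primary key lets Spark pull the 10 million rows in parallel slices.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("initial-load").getOrCreate()

    // Placeholder connection details; partition on the primary key so the
    // full table is read in parallel.
    val full = spark.read
      .format("jdbc")
      .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCL")
      .option("dbtable", "CUSTOMERS")
      .option("user", "scott")
      .option("password", "tiger")
      .option("driver", "oracle.jdbc.OracleDriver")
      .option("partitionColumn", "ID")
      .option("lowerBound", "1")
      .option("upperBound", "10000000")
      .option("numPartitions", "8")
      .load()

    // For later delta runs the same reader can take a filtered subquery as
    // dbtable, e.g. "(SELECT * FROM CUSTOMERS WHERE ID > <last_value>) t",
    // which mirrors Sqoop's --check-column / --last-value idea.
    full.write.mode("overwrite").parquet("/data/customers/history")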

Re: Little idea needed

2016-07-19 Thread Jörn Franke
Well, as far as I know there is an update statement planned for Spark, but I am not sure for which release. You could alternatively use Hive + ORC. Another alternative would be to add the deltas in a separate file and, when accessing the table, filter out the duplicate entries. From time to time you could
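
A rough Spark (Scala) sketch of the periodic compaction this describes; the paths, the id key and the load_ts column are assumed for the example. It folds the accumulated delta files back into the base and keeps only the newest row per key, so readers no longer have to filter the duplicates themselves.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.max

    val spark = SparkSession.builder().appName("compaction").getOrCreate()

    val base   = spark.read.parquet("/data/customers/base")
    val deltas = spark.read.parquet("/data/customers/deltas")  // accumulated delta files

    val all = base.union(deltas)

    // Keep only the newest load_ts per key, i.e. drop the duplicate entries.
    val newest    = all.groupBy("id").agg(max("load_ts").as("load_ts"))
    val compacted = all.join(newest, Seq("id", "load_ts"))

    // Rewrite the consolidated base to a fresh path; the delta folder can
    // then be cleared until the next compaction run.
    compacted.write.mode("overwrite").parquet("/data/customers/base_compacted")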

Little idea needed

2016-07-19 Thread Aakash Basu
Hi all, I'm trying to pull a full table from Oracle, which is huge, with some 10 million records; this will be the initial load to HDFS. Then I will do delta loads every day into the same folder in HDFS. Now, my query here is: DAY 0 - I did the initial load (full dump). DAY 1 - I'll load only
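
To illustrate the DAY 0 / DAY 1 layout being asked about, here is a minimal Spark (Scala) sketch; the staging paths and the load_date column are assumptions for the example. Tagging each load with its date and appending into the same folder keeps every version available for later reconstruction.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.lit

    val spark = SparkSession.builder().appName("daily-load").getOrCreate()

    // DAY 0: full dump, tagged with its load date (column name is assumed).
    val fullDump = spark.read.parquet("/staging/customers/full")
      .withColumn("load_date", lit("2016-07-19"))
    fullDump.write.mode("overwrite").partitionBy("load_date")
      .parquet("/data/customers/history")

    // DAY 1 onwards: append each delta into the same folder under its own date.
    val delta = spark.read.parquet("/staging/customers/delta")
      .withColumn("load_date", lit("2016-07-20"))
    delta.write.mode("append").partitionBy("load_date")
      .parquet("/data/customers/history")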