Hi, I am new to Spark and NoSQL databases, so please correct me if I am wrong.
Since I will be accessing multiple columns (almost 20-30 columns) of a row, I
will have to go with a row-based format instead of a column-based one, right?
Maybe I can use Avro in this case. Does Spark go well with Avro? I will do my
research on this, but please let me know your opinion.

Thanks,
Prasad

On Fri 5 Apr, 2019, 1:09 AM Teemu Heikkilä <te...@emblica.fi> wrote:

> So basically you could have a base dump/snapshot of the full database - or
> all the required data - stored into HDFS or a similar system as partitioned
> files (i.e. ORC/Parquet).
>
> Then you use the change stream after the dump and join it on the snapshot -
> similarly to what your database is doing. After that you can build the
> aggregates and reports from that table.
>
> - T
>
> On 4 Apr 2019, at 22.35, Prasad Bhalerao <prasadbhalerao1...@gmail.com> wrote:
>
> I did not understand this "update actual snapshots, i.e. by joining the data".
>
> There is another microservice which updates these Oracle tables. I can have
> this microservice send the update data feed on Kafka topics.
>
> Thanks,
> Prasad
>
> On Fri 5 Apr, 2019, 12:57 AM Teemu Heikkilä <te...@emblica.fi> wrote:
>
>> Based on your answers, I would consider using the update stream to update
>> the actual snapshots, i.e. by joining the data.
>>
>> Of course, how to get the data into Spark now depends on how the update
>> stream has been implemented.
>>
>> Could you tell a little bit more about that?
>> - Teemu
>>
>> On 4 Apr 2019, at 22.23, Prasad Bhalerao <prasadbhalerao1...@gmail.com> wrote:
>>
>> Hi,
>>
>> I can create a view on these tables, but the thing is I am going to need
>> almost every column from these tables, and I have faced issues with Oracle
>> views on such large tables involving joins. Somehow Oracle used to choose
>> a not-so-good execution plan.
>>
>> Can you please tell me how creating views will help in this scenario?
>>
>> Can you please tell me if I am thinking in the right direction?
>>
>> I have two challenges:
>> 1) First, to load 2-4 TB of data into Spark very quickly.
>> 2) Then, to keep this data updated in Spark whenever DB updates are done.
>>
>> Thanks,
>> Prasad
>>
>> On Fri, Apr 5, 2019 at 12:35 AM Jason Nerothin <jasonnerot...@gmail.com> wrote:
>>
>>> Hi Prasad,
>>>
>>> Could you create an Oracle-side view that captures only the relevant
>>> records and then use the Spark JDBC connector to load the view into Spark?
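[Editor's note: a minimal sketch of what Jason's suggestion could look like. The
view name, connection URL, partition column, bounds and output path are all
assumptions for illustration, not details from this thread.]

import org.apache.spark.sql.SparkSession

object LoadReportView {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("report-view-load")
      .getOrCreate()

    // Hypothetical Oracle-side view that pre-filters the rows a report needs.
    // Partitioning on a numeric key lets Spark open several parallel JDBC
    // connections instead of a single huge single-threaded fetch.
    val reportRows = spark.read
      .format("jdbc")
      .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCLPDB")   // assumed host/service
      .option("dbtable", "REPORTING.REPORT_VIEW")                 // assumed view name
      .option("user", "report_user")                              // assumed credentials
      .option("password", sys.env.getOrElse("ORACLE_PWD", ""))
      .option("partitionColumn", "ROW_ID")                        // assumed numeric key
      .option("lowerBound", "1")
      .option("upperBound", "1500000000")
      .option("numPartitions", "64")
      .option("fetchsize", "10000")
      .load()

    // Persist the loaded view as a partitioned snapshot for later report runs.
    reportRows.write.mode("overwrite").parquet("hdfs:///snapshots/report_view")

    spark.stop()
  }
}

The Oracle JDBC driver jar (e.g. ojdbc8.jar) would need to be on the executor
and driver classpaths, for example via --jars on spark-submit.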
>>> On Thu, Apr 4, 2019 at 1:48 PM Prasad Bhalerao <prasadbhalerao1...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> I am exploring Spark for my reporting application. My use case is as
>>>> follows...
>>>>
>>>> I have 4-5 Oracle tables which contain more than 1.5 billion rows. These
>>>> tables are updated very frequently every day. I don't have the choice to
>>>> change database technology, so this data is going to remain in Oracle
>>>> only. To generate one report, on average 15-50 million rows have to be
>>>> fetched from the Oracle tables. These rows contain some BLOB columns.
>>>> Most of the time is spent fetching these rows from the DB over the
>>>> network; the data processing itself is not that complex. Currently these
>>>> reports take around 3-8 hours to complete, and I am trying to speed up
>>>> this report generation process.
>>>>
>>>> Can I use Spark as a caching layer in this case to avoid fetching data
>>>> from Oracle over the network every time? I am thinking of submitting a
>>>> Spark job for each report request and using Spark SQL to fetch the data,
>>>> then processing it and writing it to a file. I am trying to use a kind of
>>>> data locality in this case.
>>>>
>>>> Whenever data is updated in the Oracle tables, can I refresh the data in
>>>> Spark storage? I can get the update feed using messaging technology.
>>>>
>>>> Can someone from the community help me with this? Suggestions are welcome.
>>>>
>>>> Thanks,
>>>> Prasad
>>>
>>> --
>>> Thanks,
>>> Jason
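[Editor's note: for the snapshot-plus-change-stream approach Teemu describes
above, the merge step might look roughly like the following sketch. The key
column ROW_ID, the UPDATED_AT timestamp and the HDFS paths are assumptions;
it also assumes the snapshot and the update feed share the same schema and
that deletes are handled separately.]

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

object RefreshSnapshot {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("refresh-snapshot")
      .getOrCreate()

    // The previous full snapshot, stored as partitioned Parquet on HDFS.
    val snapshot = spark.read.parquet("hdfs:///snapshots/report_view")

    // The change feed captured since that snapshot (e.g. landed from Kafka),
    // assumed to carry the same columns plus an UPDATED_AT timestamp.
    val updates = spark.read.parquet("hdfs:///landing/report_view_updates")

    // Union both sources and keep only the newest version of each row,
    // ranked by UPDATED_AT within each primary key.
    val latestPerKey = Window.partitionBy("ROW_ID").orderBy(col("UPDATED_AT").desc)

    val refreshed = snapshot.unionByName(updates)
      .withColumn("rn", row_number().over(latestPerKey))
      .filter(col("rn") === 1)
      .drop("rn")

    // Write the refreshed snapshot to a new location, then swap it in for
    // the next report run.
    refreshed.write.mode("overwrite").parquet("hdfs:///snapshots/report_view_new")

    spark.stop()
  }
}

On the Avro question at the top of the thread: Spark can read and write Avro
through the external spark-avro module, so a row-oriented format is an option
for the change feed, while columnar formats like Parquet or ORC generally suit
the large scanned snapshot better even when many columns are read.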