Hi Prasad,

Could you create an Oracle-side view that captures only the relevant records, and then use the Spark JDBC connector to load the view into Spark?
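A minimal sketch of what I mean, in PySpark. The view name (REPORT_DATA_V), host, credentials, and the partition column are all placeholders you'd replace for your schema; the partitioning options split the read across executors so the fetch from Oracle runs in parallel rather than through a single connection:

```python
# Assumes an Oracle-side view already created, e.g.:
#   CREATE OR REPLACE VIEW report_data_v AS
#     SELECT id, payload_blob FROM big_table WHERE report_filter = 1;
# All connection details below are placeholders, not real values.

jdbc_options = {
    "url": "jdbc:oracle:thin:@//dbhost:1521/ORCLPDB1",
    "dbtable": "REPORT_DATA_V",           # read the view, not the base tables
    "user": "report_user",
    "password": "***",
    "driver": "oracle.jdbc.OracleDriver",
    # Split the read into parallel slices keyed on a numeric column,
    # so multiple executors pull rows from Oracle concurrently:
    "partitionColumn": "ID",
    "lowerBound": "1",
    "upperBound": "1500000000",
    "numPartitions": "32",
}

# With a live SparkSession and the Oracle JDBC jar on the classpath,
# the load itself is:
#   from pyspark.sql import SparkSession
#   spark = SparkSession.builder.appName("oracle-report").getOrCreate()
#   df = spark.read.format("jdbc").options(**jdbc_options).load()
```

Note that numPartitions/partitionColumn only help if the column's values are reasonably evenly distributed between lowerBound and upperBound; otherwise some partitions will do most of the work.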
On Thu, Apr 4, 2019 at 1:48 PM Prasad Bhalerao <prasadbhalerao1...@gmail.com> wrote:
> Hi,
>
> I am exploring Spark for my reporting application.
> My use case is as follows: I have 4-5 Oracle tables which contain more
> than 1.5 billion rows. These tables are updated very frequently every day.
> I don't have the option to change database technology, so this data is
> going to remain in Oracle.
> To generate one report, on average 15-50 million rows have to be fetched
> from the Oracle tables. These rows contain some BLOB columns. Most of the
> time is spent fetching these rows from the DB over the network. The data
> processing is not that complex. Currently these reports take around 3-8
> hours to complete. I am trying to speed up this report generation process.
>
> Can I use Spark as a caching layer in this case, to avoid fetching data
> from Oracle over the network every time? I am thinking of submitting a
> Spark job for each report request and using Spark SQL to fetch the data,
> then processing it and writing it to a file. I am trying to exploit a kind
> of data locality in this case.
>
> Whenever data is updated in the Oracle tables, can I refresh the data in
> Spark storage? I can get the update feed using messaging technology.
>
> Can someone from the community help me with this?
> Suggestions are welcome.
>
> Thanks,
> Prasad

--
Thanks,
Jason