Based on your answers, I would consider using the update stream to update the 
actual snapshots, i.e. by joining the update data with the snapshot data.

Of course, how to get the data into Spark depends on how the update stream 
has been implemented.
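
If the feed carries the changed rows themselves, a rough sketch of the merge 
could look like this (just a sketch -- I am assuming both sides share the same 
schema and a primary-key column named ID):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, lit, row_number}

// snapshot: the full table currently held in Spark
// updates:  changed rows taken from the update feed
// Both are assumed to share the same schema and a primary-key column "ID".
def applyUpdates(snapshot: DataFrame, updates: DataFrame): DataFrame = {
  val combined = snapshot.withColumn("is_update", lit(false))
    .unionByName(updates.withColumn("is_update", lit(true)))

  // When a key exists in both sets, keep the row coming from the update feed.
  combined
    .withColumn("rn", row_number().over(
      Window.partitionBy("ID").orderBy(col("is_update").desc)))
    .filter(col("rn") === 1)
    .drop("rn", "is_update")
}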

Could you tell me a little bit more about that?
- Teemu

> On 4 Apr 2019, at 22.23, Prasad Bhalerao <prasadbhalerao1...@gmail.com> wrote:
> 
> Hi ,
> 
> I can create a view on these tables, but the thing is I am going to need 
> almost every column from these tables, and I have faced issues with Oracle 
> views on such large tables when joins are involved. Somehow Oracle tended to 
> pick a suboptimal execution plan.
> 
> Can you please tell me how creating a view will help in this scenario?
> 
> Can you please tell me if I am thinking in the right direction?
> 
> I have two challenges
> 1) First, to load 2-4 TB of data into Spark very quickly.
> 2) And then to keep this data updated in Spark whenever DB updates are made.
> 
> Thanks,
> Prasad
> 
> On Fri, Apr 5, 2019 at 12:35 AM Jason Nerothin <jasonnerot...@gmail.com 
> <mailto:jasonnerot...@gmail.com>> wrote:
> Hi Prasad, 
> 
> Could you create an Oracle-side view that captures only the relevant records 
> and then use the Spark JDBC connector to load the view into Spark?
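> 
> Something along these lines, for example (just a sketch -- the connection 
> details, view name and partition bounds below are made up; partitioning on a 
> numeric key is what lets Spark open parallel JDBC connections):
> 
> val reportRows = spark.read
>   .format("jdbc")
>   .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCL") // placeholder URL
>   .option("dbtable", "REPORTING_VIEW")                   // the Oracle-side view
>   .option("user", "report_user")
>   .option("password", "...")
>   .option("partitionColumn", "ID")    // numeric key exposed by the view
>   .option("lowerBound", "1")
>   .option("upperBound", "1500000000")
>   .option("numPartitions", "64")      // 64 parallel reads
>   .option("fetchsize", "10000")       // bigger fetch size helps over the network
>   .load()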
> 
> On Thu, Apr 4, 2019 at 1:48 PM Prasad Bhalerao <prasadbhalerao1...@gmail.com 
> <mailto:prasadbhalerao1...@gmail.com>> wrote:
> Hi,
> 
> I am exploring Spark for my reporting application.
> My use case is as follows...
> I have 4-5 Oracle tables which contain more than 1.5 billion rows. These 
> tables are updated very frequently every day. I don't have the option to 
> change the database technology, so this data is going to remain in Oracle.
> To generate one report, on average 15-50 million rows have to be fetched 
> from the Oracle tables. These rows contain some BLOB columns. Most of the 
> time is spent fetching this many rows from the DB over the network. The data 
> processing is not that complex. Currently these reports take around 3-8 
> hours to complete. I am trying to speed up this report generation process.
> 
> Can I use Spark as a caching layer in this case to avoid fetching data from 
> Oracle over the network every time? I am thinking of submitting a Spark job 
> for each report request and using Spark SQL to fetch the data, then process 
> it and write it to a file. I am trying to get a kind of data locality this way.
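> 
> Roughly what I have in mind (just a sketch -- the view name and output path 
> are made up, and I am assuming the Oracle data is already loaded into a 
> DataFrame over JDBC):
> 
> import org.apache.spark.storage.StorageLevel
> 
> // oracleDf: the DataFrame loaded from Oracle over JDBC (placeholder name)
> oracleDf.persist(StorageLevel.MEMORY_AND_DISK)
> oracleDf.createOrReplaceTempView("asset_data")
> 
> // Each report request then queries the cached copy instead of going to Oracle.
> val report = spark.sql("SELECT * FROM asset_data WHERE scan_id = 42")
> report.write.option("header", "true").csv("/tmp/report_42")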
> 
> Whenever data is updated in the Oracle tables, can I refresh the data in 
> Spark storage? I can get the update feed using messaging technology.
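> 
> For example, if the feed were Kafka, I imagine reading it into Spark roughly 
> like this (the topic name, broker and message schema below are made up):
> 
> import org.apache.spark.sql.functions.{col, from_json}
> import org.apache.spark.sql.types.{LongType, StringType, StructType, TimestampType}
> 
> // Illustrative schema for the change messages.
> val changeSchema = new StructType()
>   .add("ID", LongType)
>   .add("STATUS", StringType)
>   .add("UPDATED_AT", TimestampType)
> 
> val updates = spark.read
>   .format("kafka")
>   .option("kafka.bootstrap.servers", "broker:9092")
>   .option("subscribe", "oracle-change-feed")
>   .option("startingOffsets", "earliest")
>   .load()
>   .select(from_json(col("value").cast("string"), changeSchema).as("row"))
>   .select("row.*")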
> 
> Can someone from the community help me with this?
> Suggestions are welcome.
>  
> 
> Thanks,
> Prasad
> 
> 
> 
> -- 
> Thanks,
> Jason
