Sorry, I was assuming that you wanted to build the data lake in Hadoop rather 
than just reading from DB2. (Data lakes need to be built correctly.)

So, slightly different answer.

Yes, you can do this… 

You will end up with an immutable copy of the data that you read in serially. 
Then you will probably need to repartition the data, depending on its size and 
how much parallelism you want, and then run the batch processing. 
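
For what it’s worth, a minimal sketch of that first step in Scala (connection 
details, table and column names are made-up placeholders, not anything from your 
setup). The partition options let the JDBC source read in parallel rather than 
through a single connection:

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)   // sc: your existing SparkContext

    // Pull a hypothetical ORDERS table out of DB2 over JDBC.
    val orders = sqlContext.read
      .format("jdbc")
      .option("url", "jdbc:db2://db2host:50000/SAMPLE")
      .option("driver", "com.ibm.db2.jcc.DB2Driver")
      .option("dbtable", "ORDERS")
      .option("user", "dbuser")
      .option("password", "dbpass")
      // Split the read across 16 concurrent queries on a numeric key.
      .option("partitionColumn", "ORDER_ID")
      .option("lowerBound", "1")
      .option("upperBound", "1000000")
      .option("numPartitions", "16")
      .load()

    // The loaded DataFrame is your immutable snapshot of the DB2 data;
    // repartition (and usually cache) it before the batch processing steps.
    val working = orders.repartition(64).cache()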

But I have to ask: why? 
Are you having issues with DB2? 
Are your batch jobs interfering with your transactional work? 

You will take a hit up front as you read the data from DB2, but then, depending 
on how you use the data… you may be faster overall. 

Please don’t misunderstand: Spark is a viable solution, however… there’s a bit 
of heavy lifting that has to occur (e.g. building and maintaining a Spark 
cluster), and there are alternatives out there that work. 

Performance of the DB2 tables will vary based on indexing, so make sure you 
have the appropriate indexes in place. 
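
On that note, a hedged example of letting DB2 (and its indexes) do the filtering 
rather than dragging whole tables across: simple predicates on a JDBC DataFrame 
are pushed down as a WHERE clause, and you can also hand the source a subquery 
as the “table”. Same made-up names as the sketch above:

    // Simple filters on a JDBC-backed DataFrame are pushed down to DB2,
    // so the appropriate index can be used on the database side.
    val open = orders.filter("STATUS = 'OPEN'")

    // Or push the restriction explicitly by giving the source a subquery
    // (the alias is required); only the matching rows ever leave DB2.
    val openOrders = sqlContext.read
      .format("jdbc")
      .option("url", "jdbc:db2://db2host:50000/SAMPLE")
      .option("driver", "com.ibm.db2.jcc.DB2Driver")
      .option("dbtable", "(SELECT ORDER_ID, STATUS, TOTAL FROM ORDERS WHERE STATUS = 'OPEN') AS open_orders")
      .option("user", "dbuser")
      .option("password", "dbpass")
      .load()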

You could also look at Apache Drill. 

HTH 
-Mike



> On Jul 6, 2016, at 3:24 PM, Andreas Bauer <dabuks...@gmail.com> wrote:
> 
> Thanks for the advice. I have to retrieve the basic data from the DB2 tables 
> but afterwards I'm pretty free to transform the data as needed. 
> 
> 
> 
> On Jul 6, 2016, at 10:12 PM CEST, Michael Segel <msegel_had...@hotmail.com> 
> wrote:
>> I think you need to learn the basics of how to build a ‘data 
>> lake/pond/sewer’ first. 
>> 
>> The short answer is yes. 
>> The longer answer is that you need to think more about translating a 
>> relational model into a hierarchical model, something that I seriously 
>> doubt has been taught in schools in a very long time. 
>> 
>> Then there’s more to the design, including indexing. 
>> Do you want to stick with SQL, or do you want to hand-code the work to allow 
>> for indexing / secondary indexing to help with the filtering, since Spark SQL 
>> doesn’t really handle indexing? Note that you could actually still use an 
>> index table (narrow/thin inverted table) and join against the base table to 
>> get better performance. 
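>> 
>> A rough, illustrative sketch of that index-table idea (hypothetical 
>> registered tables and columns, not anything from your schema):
>> 
>>     // Thin "index" table: just the join key plus the column you filter on.
>>     val idx  = sqlContext.table("orders_by_status_idx")   // (order_id, status)
>>     val base = sqlContext.table("orders")                 // wide base table
>> 
>>     // Filter the narrow table first, then join back to the base table so
>>     // the wide rows are only read for keys that survive the filter.
>>     val hits   = idx.filter("status = 'OPEN'").select("order_id")
>>     val result = hits.join(base, Seq("order_id"))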
>> 
>> There’s more to this, but you get the idea.
>> 
>> HTH
>> 
>> -Mike
>> 
>> > On Jul 6, 2016, at 2:25 PM, dabuki wrote:
>> > 
>> > I was thinking about replacing a legacy batch job with Spark, but I'm not
>> > sure if Spark is suited for this use case. Before I start the proof of
>> > concept, I wanted to ask for opinions.
>> > 
>> > The legacy job works as follows: a file (100k - 1 million entries) is iterated.
>> > Every row contains a (book) order with an id, and for each row approx. 15
>> > processing steps have to be performed that involve access to multiple
>> > database tables. In total, approx. 25 tables (each containing 10k-700k
>> > entries) have to be scanned using the book's id, and the retrieved data is
>> > joined together. 
>> > 
>> > As I'm new to Spark I'm not sure if I can leverage Spark's processing model
>> > for this use case.
>> > 
>> > 
>> > 
>> > 
>> > 
>> > --
>> > View this message in context: 
>> > http://apache-spark-user-list.1001560.n3.nabble.com/Is-Spark-suited-for-replacing-a-batch-job-using-many-database-tables-tp27300.html
>> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
>> > 
>> 
>> 


---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
