Well, your mileage will vary depending on what you want to do. I suggest you do a PoC to find out exactly what benefits you are going to get and whether the approach is going to pay off.
Spark does not have a CBO like DB2 or Oracle, but it provides a DAG execution model and in-memory capabilities. Use something basic like spark-shell to start experimenting and take it from there.

HTH

Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.

On 6 July 2016 at 21:24, Andreas Bauer <dabuks...@gmail.com> wrote:

> Thanks for the advice. I have to retrieve the basic data from the DB2
> tables, but afterwards I'm pretty free to transform the data as needed.
>
> On 6 July 2016 at 22:12:26 MESZ, Michael Segel <msegel_had...@hotmail.com>
> wrote:
>
> I think you need to learn the basics of how to build a 'data
> lake/pond/sewer' first.
>
> The short answer is yes.
> The longer answer is that you need to think more about translating a
> relational model into a hierarchical model, something that I seriously
> doubt has been taught in schools in a very long time.
>
> Then there's more to the design, including indexing.
> Do you want to stick with SQL, or do you want to hand-code the work to
> allow for indexing / secondary indexing to help with the filtering, since
> Spark SQL doesn't really handle indexing? Note that you could actually
> still use an index table (narrow/thin inverted table) and join against the
> base table to get better performance.
>
> There's more to this, but you get the idea.
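Mike's suggestion of an index table — a narrow/thin inverted table joined against the base table so you only fetch matching rows instead of scanning everything — can be sketched in plain Python for illustration (no Spark here; the table contents and names are made up for the example):

```python
# Narrow "index table" join pattern: the index maps a filter key to
# base-table row ids, so only matching base rows are fetched instead
# of scanning the whole base table. All data below is invented.

base_table = {
    1: {"id": 1, "title": "Spark in Action", "category": "tech"},
    2: {"id": 2, "title": "War and Peace",   "category": "fiction"},
    3: {"id": 3, "title": "Learning SQL",    "category": "tech"},
}

# Thin inverted table: filter key -> list of base-table row ids.
index_table = {
    "tech":    [1, 3],
    "fiction": [2],
}

def lookup(category):
    """Join the index table against the base table for one key."""
    return [base_table[row_id] for row_id in index_table.get(category, [])]

tech_books = lookup("tech")
```

In Spark the same shape would be two DataFrames joined on the row id, with the small index table a good candidate for a broadcast join; the point is that the filter touches the narrow table, not the wide one.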
>
> HTH
>
> -Mike
>
> > On Jul 6, 2016, at 2:25 PM, dabuki wrote:
> >
> > I was thinking about replacing a legacy batch job with Spark, but I'm not
> > sure if Spark is suited for this use case. Before I start the proof of
> > concept, I wanted to ask for opinions.
> >
> > The legacy job works as follows: a file (100k - 1 mio entries) is
> > iterated. Every row contains a (book) order with an id, and for each row
> > approx. 15 processing steps have to be performed that involve access to
> > multiple database tables. In total approx. 25 tables (each containing
> > 10k-700k entries) have to be scanned using the book's id, and the
> > retrieved data is joined together.
> >
> > As I'm new to Spark, I'm not sure if I can leverage Spark's processing
> > model for this use case.
> >
> > --
> > View this message in context:
> > http://apache-spark-user-list.1001560.n3.nabble.com/Is-Spark-suited-for-replacing-a-batch-job-using-many-database-tables-tp27300.html
> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
> >
> > ---------------------------------------------------------------------
> > To unsubscribe e-mail: user-unsubscr...@spark.apache.org
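The job shape described in the quoted message — iterate an orders file, look up roughly 25 reference tables by the book's id, and join the results — maps naturally onto lookups/joins. A toy version of that flow in plain Python (invented data, two reference tables standing in for the ~25 real ones):

```python
# Toy version of the legacy batch job: each order row is enriched by
# looking up reference tables keyed by book id and merging the results.
# All names and data are made up for illustration.

orders = [
    {"order_id": 100, "book_id": 1},
    {"order_id": 101, "book_id": 2},
]

# Reference tables keyed by book id (stand-ins for the real ~25 tables).
prices  = {1: 9.99, 2: 14.50}
authors = {1: "A. Author", 2: "B. Writer"}

def enrich(order):
    """One 'processing step' per reference table: look up by id, merge."""
    book_id = order["book_id"]
    return {
        **order,
        "price":  prices.get(book_id),
        "author": authors.get(book_id),
    }

enriched = [enrich(o) for o in orders]
```

In Spark the orders file and each DB2 table would become DataFrames (e.g. loaded over JDBC) and the per-row lookups would become joins on the book id, which is exactly the access pattern Spark parallelizes well; whether it pays off versus the legacy job is what the PoC should measure.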