Re: Is Spark suited for replacing a batch job using many database tables?

Michael Segel Wed, 06 Jul 2016 13:12:55 -0700

I think you need to learn the basics of how to build a ‘data lake/pond/sewer’ 
first.

The short answer is yes. 
The longer answer is that you need to think more about translating a relational 
model into a hierarchical model, something that I seriously doubt has been 
taught in schools in a very long time.  
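
To give a rough idea of what that translation looks like, here is a minimal 
sketch in plain Python (not Spark). The table names, column names and sample 
rows are all made up for illustration; the point is only that the flat rows 
from several tables get folded into one nested record per order, keyed on the 
book's id.

```python
from collections import defaultdict

# Hypothetical flat (relational) rows; table and column names are invented.
orders = [
    {"order_id": 1, "book_id": "b1"},
    {"order_id": 2, "book_id": "b2"},
]
prices = [
    {"book_id": "b1", "price": 10.0},
    {"book_id": "b2", "price": 25.0},
]
stock = [
    {"book_id": "b1", "warehouse": "east", "qty": 3},
    {"book_id": "b1", "warehouse": "west", "qty": 0},
]


def group_by_key(rows, key):
    """Group child rows by key, dropping the key column from each child."""
    grouped = defaultdict(list)
    for row in rows:
        child = {k: v for k, v in row.items() if k != key}
        grouped[row[key]].append(child)
    return grouped


def nest(parent_rows, children, key="book_id"):
    """Build one nested record per parent row, embedding matching child rows."""
    grouped = {name: group_by_key(rows, key) for name, rows in children.items()}
    nested = []
    for parent in parent_rows:
        record = dict(parent)
        for name, by_key in grouped.items():
            # A parent with no matching children gets an empty list.
            record[name] = by_key.get(parent[key], [])
        nested.append(record)
    return nested


nested_orders = nest(orders, {"prices": prices, "stock": stock})
```

Each entry of nested_orders then carries its own prices and stock lists, so 
the per-order processing steps read from one record instead of going back to 
each table.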

Then there’s more to the design, including indexing. 
Do you want to stick with SQL, or do you want to hand-code the work to allow 
for indexing / secondary indexing to help with the filtering, since Spark SQL 
doesn’t really handle indexing? Note that you could still use an index table 
(a narrow/thin inverted table) and join it against the base table to get 
better performance. 
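
A minimal sketch of the index table idea, again in plain Python rather than 
Spark, and with hypothetical row keys, column names and data. The inverted 
table maps a column value to the row keys that hold it, and the lookup joins 
those keys back to the wide base table instead of scanning every row.

```python
from collections import defaultdict

# Hypothetical base table: row key -> wide record. All values are invented.
base_table = {
    101: {"book_id": "b1", "title": "A", "publisher": "acme"},
    102: {"book_id": "b2", "title": "B", "publisher": "zenith"},
    103: {"book_id": "b1", "title": "A (2nd ed.)", "publisher": "acme"},
}


def build_index(table, column):
    """Build a narrow inverted table: column value -> list of row keys."""
    index = defaultdict(list)
    for row_key, record in table.items():
        index[record[column]].append(row_key)
    return dict(index)


def lookup(table, index, value):
    """Resolve row keys through the index, then join back to the base table."""
    return [table[row_key] for row_key in index.get(value, [])]


book_index = build_index(base_table, "book_id")
matches = lookup(base_table, book_index, "b1")
```

In Spark the same shape would be a small two-column data set (value, row key) 
filtered first and then joined against the base data; whether that beats a 
plain scan depends on your data sizes, so treat it as something to measure.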

There’s more to this, but you get the idea.

HTH

-Mike

> On Jul 6, 2016, at 2:25 PM, dabuki <dabuks...@gmail.com> wrote:
> 
> I was thinking about replacing a legacy batch job with Spark, but I'm not
> sure if Spark is suited for this use case. Before I start the proof of
> concept, I wanted to ask for opinions.
> 
> The legacy job works as follows: A file (100k - 1 million entries) is iterated.
> Every row contains a (book) order with an id, and for each row approx. 15
> processing steps have to be performed that involve access to multiple
> database tables. In total approx. 25 tables (each containing 10k-700k
> entries) have to be scanned using the book's id and the retrieved data is
> joined together. 
> 
> As I'm new to Spark I'm not sure if I can leverage Spark's processing model
> for this use case.
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Is-Spark-suited-for-replacing-a-batch-job-using-many-database-tables-tp27300.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> 
> 
