Re: Is Spark suited for replacing a batch job using many database tables?

2016-07-06 Thread Andreas Bauer
The job works just fine. DB2 also performs very well. But I'm supposed to 
investigate alternatives. Thanks for the advice regarding Apache Drill. I'll 
definitely have a look! Best regards,  Andreas  



Sorry,  
I was assuming that you wanted to build the data lake in Hadoop rather than 
just reading from DB2. (Data lakes need to be built correctly.)


So, slightly different answer.



Yes, you can do this…  


You will end up with an immutable copy of the data that you read in serially.
Then you will probably need to repartition it, depending on its size and how
much parallelism you want, and then run the batch processing.
 


But I have to ask why?  
Are you having issues with DB2?  
Are your batch jobs interfering with your transactional work?  


You will have a hit up front as you read the data from DB2, but then, depending 
on how you use the data… you may be faster overall.  


Please don’t misunderstand: Spark is a viable solution. However, there’s a bit
of heavy lifting that has to occur (e.g. building and maintaining a Spark
cluster), and there are alternatives out there that work.


Performance on the DB2 side will depend on indexing, i.e. on whether the
appropriate indexes are in place.


You could also look at Apache Drill.


HTH  
-Mike







> On Jul 6, 2016, at 3:24 PM, Andreas Bauer <dabuks...@gmail.com> wrote:

>  
> Thanks for the advice. I have to retrieve the basic data from the DB2 tables 
> but afterwards I'm pretty free to transform the data as needed.  
>  
>  
>  
> On 6. Juli 2016 um 22:12:26 MESZ, Michael Segel <msegel_had...@hotmail.com> 
> wrote:

>> I think you need to learn the basics of how to build a ‘data 
>> lake/pond/sewer’ first.  
>>  
>> The short answer is yes.  
>> The longer answer is that you need to think more about translating a 
>> relational model into a hierarchical model, something that I seriously 
>> doubt has been taught in schools in a very long time.  
>>  
>> Then there’s more to the design, including indexing.  
>> Do you want to stick with SQL or do you want to hand code the work to allow 
>> for indexing / secondary indexing to help with the filtering since Spark SQL 
>> doesn’t really handle indexing? Note that you could actually still use an 
>> index table (narrow/thin inverted table) and join against the base table to 
>> get better performance.  
>>  
>> There’s more to this, but you get the idea.

>>  
>> HTH

>>  
>> -Mike

>>  
>> > On Jul 6, 2016, at 2:25 PM, dabuki wrote:

>> >  
>> > I was thinking about replacing a legacy batch job with Spark, but I'm not

>> > sure if Spark is suited for this use case. Before I start the proof of

>> > concept, I wanted to ask for opinions.

>> >  
>> > The legacy job works as follows: A file (100k - 1 million entries) is iterated.

>> > Every row contains a (book) order with an id and for each row approx. 15

>> > processing steps have to be performed that involve access to multiple

>> > database tables. In total approx. 25 tables (each containing 10k-700k

>> > entries) have to be scanned using the book's id and the retrieved data is

>> > joined together.  
>> >  
>> > As I'm new to Spark I'm not sure if I can leverage Spark's processing model

>> > for this use case.

>> >  
>> >  
>> >  
>> >  
>> >  




Re: Is Spark suited for replacing a batch job using many database tables?

2016-07-06 Thread Michael Segel
Sorry, 
I was assuming that you wanted to build the data lake in Hadoop rather than 
just reading from DB2. (Data lakes need to be built correctly.)

So, slightly different answer.

Yes, you can do this… 

You will end up with an immutable copy of the data that you read in serially.
Then you will probably need to repartition it, depending on its size and how
much parallelism you want, and then run the batch processing.
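
For illustration, a minimal sketch of that flow (Scala, Spark 2.x API; on 1.6
the same calls hang off sqlContext instead of spark. The JDBC URL, credentials,
schema and table names below are made-up placeholders):

  import java.util.Properties
  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().appName("db2-batch-poc").getOrCreate()

  val connProps = new Properties()
  connProps.setProperty("user", "dbuser")          // placeholder credentials
  connProps.setProperty("password", "dbpass")
  connProps.setProperty("driver", "com.ibm.db2.jcc.DB2Driver")

  // Serial read over JDBC: one connection, one immutable DataFrame.
  val orders = spark.read.jdbc("jdbc:db2://db2host:50000/MYDB", "MYSCHEMA.ORDERS", connProps)

  // Spread the copy across the cluster before the heavy work; pick the count
  // to match the data volume and the cores you have.
  val repartitioned = orders.repartition(200)

  // ... run the batch processing against `repartitioned` ...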

But I have to ask why? 
Are you having issues with DB2? 
Are your batch jobs interfering with your transactional work? 

You will have a hit up front as you read the data from DB2, but then, depending 
on how you use the data… you may be faster overall. 

Please don’t misunderstand: Spark is a viable solution. However, there’s a bit
of heavy lifting that has to occur (e.g. building and maintaining a Spark
cluster), and there are alternatives out there that work.

Performance on the DB2 side will depend on indexing, i.e. on whether the
appropriate indexes are in place.
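
One way to let those indexes actually help is to push the filtering down to DB2
rather than pulling whole tables, e.g. by giving the JDBC source a subquery (a
sketch only; the query, names and connection details are invented, and the DB2
JDBC driver jar is assumed to be on the classpath):

  // DB2 executes the inner SELECT (and can use its indexes for the WHERE
  // clause); Spark only receives the already-filtered rows.
  val openOrders = spark.read
    .format("jdbc")
    .option("url", "jdbc:db2://db2host:50000/MYDB")                        // placeholder
    .option("dbtable", "(SELECT * FROM MYSCHEMA.ORDERS WHERE STATUS = 'OPEN') AS o")
    .option("user", "dbuser")
    .option("password", "dbpass")
    .load()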

You could also look at Apache Drill.

HTH 
-Mike



> On Jul 6, 2016, at 3:24 PM, Andreas Bauer <dabuks...@gmail.com> wrote:
> 
> Thanks for the advice. I have to retrieve the basic data from the DB2 tables 
> but afterwards I'm pretty free to transform the data as needed. 
> 
> 
> 
> On 6. Juli 2016 um 22:12:26 MESZ, Michael Segel <msegel_had...@hotmail.com> 
> wrote:
>> I think you need to learn the basics of how to build a ‘data 
>> lake/pond/sewer’ first. 
>> 
>> The short answer is yes. 
>> The longer answer is that you need to think more about translating a 
>> relational model into a hierarchical model, something that I seriously 
>> doubt has been taught in schools in a very long time. 
>> 
>> Then there’s more to the design, including indexing. 
>> Do you want to stick with SQL or do you want to hand code the work to allow 
>> for indexing / secondary indexing to help with the filtering since Spark SQL 
>> doesn’t really handle indexing? Note that you could actually still use an 
>> index table (narrow/thin inverted table) and join against the base table to 
>> get better performance. 
>> 
>> There’s more to this, but you get the idea.
>> 
>> HTH
>> 
>> -Mike
>> 
>> > On Jul 6, 2016, at 2:25 PM, dabuki wrote:
>> > 
>> > I was thinking about replacing a legacy batch job with Spark, but I'm not
>> > sure if Spark is suited for this use case. Before I start the proof of
>> > concept, I wanted to ask for opinions.
>> > 
>> > The legacy job works as follows: A file (100k - 1 million entries) is iterated.
>> > Every row contains a (book) order with an id and for each row approx. 15
>> > processing steps have to be performed that involve access to multiple
>> > database tables. In total approx. 25 tables (each containing 10k-700k
>> > entries) have to be scanned using the book's id and the retrieved data is
>> > joined together. 
>> > 
>> > As I'm new to Spark I'm not sure if I can leverage Spark's processing model
>> > for this use case.
>> > 
>> > 
>> > 
>> > 
>> > 



Re: Is Spark suited for replacing a batch job using many database tables?

2016-07-06 Thread Mich Talebzadeh
Well, your mileage will vary depending on what you want to do.

I suggest that you do a POC to find out exactly what benefits you are going
to get and whether the approach is going to pay off.

Spark does not have a cost-based optimizer (CBO) like DB2 or Oracle, but it
provides DAG execution and in-memory capabilities. Use something basic like
spark-shell to start experimenting and take it from there.
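
A first experiment can be as small as this, pasted into spark-shell (Spark 2.x;
the connection details and table are placeholders, and the DB2 JDBC driver jar
has to be on the shell's classpath, e.g. via --jars):

  // spark-shell already provides `spark` (a SparkSession)
  val books = spark.read
    .format("jdbc")
    .option("url", "jdbc:db2://db2host:50000/MYDB")     // placeholder
    .option("driver", "com.ibm.db2.jcc.DB2Driver")
    .option("dbtable", "MYSCHEMA.BOOKS")                // placeholder table
    .option("user", "dbuser")
    .option("password", "dbpass")
    .load()

  books.printSchema()
  books.createOrReplaceTempView("books")
  spark.sql("SELECT COUNT(*) FROM books").show()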

HTH

Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 6 July 2016 at 21:24, Andreas Bauer <dabuks...@gmail.com> wrote:

> Thanks for the advice. I have to retrieve the basic data from the DB2
> tables but afterwards I'm pretty free to transform the data as needed.
>
>
>
> On 6. Juli 2016 um 22:12:26 MESZ, Michael Segel <msegel_had...@hotmail.com>
> wrote:
>
> I think you need to learn the basics of how to build a ‘data
> lake/pond/sewer’ first.
>
> The short answer is yes.
> The longer answer is that you need to think more about translating a
> relational model into a hierarchical model, something that I seriously
> doubt has been taught in schools in a very long time.
>
> Then there’s more to the design, including indexing.
> Do you want to stick with SQL or do you want to hand code the work to
> allow for indexing / secondary indexing to help with the filtering since
> Spark SQL doesn’t really handle indexing? Note that you could actually
> still use an index table (narrow/thin inverted table) and join against the
> base table to get better performance.
>
> There’s more to this, but you get the idea.
>
> HTH
>
> -Mike
>
> > On Jul 6, 2016, at 2:25 PM, dabuki wrote:
> >
> > I was thinking about replacing a legacy batch job with Spark, but I'm
> not
> > sure if Spark is suited for this use case. Before I start the proof of
> > concept, I wanted to ask for opinions.
> >
> > The legacy job works as follows: A file (100k - 1 million entries) is
> iterated.
> > Every row contains a (book) order with an id and for each row approx. 15
> > processing steps have to be performed that involve access to multiple
> > database tables. In total approx. 25 tables (each containing 10k-700k
> > entries) have to be scanned using the book's id and the retrieved data is
> > joined together.
> >
> > As I'm new to Spark I'm not sure if I can leverage Spark's processing
> model
> > for this use case.
> >
> >
> >
> >
> >


Re: Is Spark suited for replacing a batch job using many database tables?

2016-07-06 Thread Andreas Bauer
Thanks for the advice. I have to retrieve the basic data from the DB2 tables 
but afterwards I'm pretty free to transform the data as needed. 



I think you need to learn the basics of how to build a ‘data lake/pond/sewer’ 
first.  


The short answer is yes.  
The longer answer is that you need to think more about translating a relational 
model into a hierarchical model, something that I seriously doubt has been 
taught in schools in a very long time.   


Then there’s more to the design, including indexing.  
Do you want to stick with SQL or do you want to hand code the work to allow for 
indexing / secondary indexing to help with the filtering since Spark SQL 
doesn’t really handle indexing? Note that you could actually still use an index 
table (narrow/thin inverted table) and join against the base table to get 
better performance.  


There’s more to this, but you get the idea.



HTH



-Mike



> On Jul 6, 2016, at 2:25 PM, dabuki <dabuks...@gmail.com> wrote:

>  
> I was thinking about replacing a legacy batch job with Spark, but I'm not

> sure if Spark is suited for this use case. Before I start the proof of

> concept, I wanted to ask for opinions.

>  
> The legacy job works as follows: A file (100k - 1 million entries) is iterated.

> Every row contains a (book) order with an id and for each row approx. 15

> processing steps have to be performed that involve access to multiple

> database tables. In total approx. 25 tables (each containing 10k-700k

> entries) have to be scanned using the book's id and the retrieved data is

> joined together.  
>  
> As I'm new to Spark I'm not sure if I can leverage Spark's processing model

> for this use case.

>  
>  
>  
>  
>  





Re: Is Spark suited for replacing a batch job using many database tables?

2016-07-06 Thread Andreas Bauer
Yes, that was the idea: cache the tables in memory, as they should fit neatly.
The loading time is no problem, as the job is not time-critical. The critical
point is the constant access to the DB2 tables, which consumes costly MIPS, and
this is what I hope to replace with the cached version. So, I'll definitely give
it a try :)
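
Roughly what I have in mind, as a sketch (Spark 2.x Scala; the connection
details are placeholders and one table stands in for the ~25 lookup tables):

  // Pull a lookup table from DB2 once, cache it, and serve every later lookup
  // from memory instead of going back to DB2 (and burning MIPS).
  val customers = spark.read
    .format("jdbc")
    .option("url", "jdbc:db2://db2host:50000/MYDB")   // placeholder
    .option("driver", "com.ibm.db2.jcc.DB2Driver")
    .option("dbtable", "MYSCHEMA.CUSTOMERS")          // one of the ~25 lookup tables
    .option("user", "dbuser")
    .option("password", "dbpass")
    .load()
    .cache()

  customers.count()   // materialises the cache up front, so later stages stay in memory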



On 6. Juli 2016 um 21:59:28 MESZ, Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:
> Well you can try it. I have done it with Oracle, SAP Sybase IQ etc but you
> need to be aware of the time that the JDBC connection is going to take to
> load the data.
>
> Sounds like your tables are pretty small so they can be cached.
>
> Where are you going to store the result set etc?
>
> HTH
>
> Dr Mich Talebzadeh
>
> On 6 July 2016 at 20:54, Andreas Bauer <dabuks...@gmail.com> wrote:
>> In fact, yes.
>>
>> On 6. Juli 2016 um 21:46:34 MESZ, Mich Talebzadeh
>> <mich.talebza...@gmail.com> wrote:
>>> So you want to use Spark as the query engine accessing DB2 tables via
>>> JDBC?
>>>
>>> On 6 July 2016 at 20:39, Andreas Bauer <dabuks...@gmail.com> wrote:
>>>> The SQL statements are embedded in a PL/1 program using DB2 running on
>>>> z/OS. Quite powerful, but expensive and foremost shared with other jobs
>>>> in the company. The whole job takes approx. 20 minutes.
>>>>
>>>> So I was thinking of using Spark and letting the Spark job run on 10 or
>>>> 20 virtual instances, which I can spawn easily, on-demand and almost for
>>>> free using a cloud infrastructure.
>>>>
>>>> On 6. Juli 2016 um 21:29:53 MESZ, Jean Georges Perrin <j...@jgp.net>
>>>> wrote:
>>>>> What are you doing it on right now?
>>>>>
>>>>> > On Jul 6, 2016, at 3:25 PM, dabuki wrote:
>>>>> > I was thinking about replacing a legacy batch job with Spark, but
>>>>> > I'm not sure if Spark is suited for this use case. Before I start
>>>>> > the proof of concept, I wanted to ask for opinions.
>>>>> >
>>>>> > The legacy job works as follows: A file (100k - 1 million entries)
>>>>> > is iterated. Every row contains a (book) order with an id and for
>>>>> > each row approx. 15 processing steps have to be performed that
>>>>> > involve access to multiple database tables. In total approx. 25
>>>>> > tables (each containing 10k-700k entries) have to be scanned using
>>>>> > the book's id and the retrieved data is joined together.
>>>>> >
>>>>> > As I'm new to Spark I'm not sure if I can leverage Spark's
>>>>> > processing model for this use case.

Re: Is Spark suited for replacing a batch job using many database tables?

2016-07-06 Thread Michael Segel
I think you need to learn the basics of how to build a ‘data lake/pond/sewer’ 
first. 

The short answer is yes. 
The longer answer is that you need to think more about translating a relational 
model into a hierarchical model, something that I seriously doubt has been 
taught in schools in a very long time.  

Then there’s more to the design, including indexing. 
Do you want to stick with SQL or do you want to hand code the work to allow for 
indexing / secondary indexing to help with the filtering since Spark SQL 
doesn’t really handle indexing? Note that you could actually still use an index 
table (narrow/thin inverted table) and join against the base table to get 
better performance. 
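
As a rough illustration of that index-table idea (a sketch only; the table and
column names are invented and assumed to already be registered as Spark tables,
and `spark` is the SparkSession, e.g. in spark-shell):

  import spark.implicits._

  val idsOfInterest = Seq("B001", "B002")          // example lookup keys

  // Thin inverted "index" table: just the lookup key and the base table's key.
  val bookIndex = spark.table("book_id_idx")       // columns: book_id, base_pk
  // Wide base table with the payload columns.
  val bookBase  = spark.table("books_base")        // columns: base_pk, ...

  // Filter the narrow table first, then join back to the base table, rather
  // than scanning the wide table with the predicate.
  val resolved = bookIndex
    .filter($"book_id".isin(idsOfInterest: _*))
    .join(bookBase, "base_pk")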

There’s more to this, but you get the idea.

HTH

-Mike

> On Jul 6, 2016, at 2:25 PM, dabuki <dabuks...@gmail.com> wrote:
> 
> I was thinking about replacing a legacy batch job with Spark, but I'm not
> sure if Spark is suited for this use case. Before I start the proof of
> concept, I wanted to ask for opinions.
> 
> The legacy job works as follows: A file (100k - 1 million entries) is iterated.
> Every row contains a (book) order with an id and for each row approx. 15
> processing steps have to be performed that involve access to multiple
> database tables. In total approx. 25 tables (each containing 10k-700k
> entries) have to be scanned using the book's id and the retrieved data is
> joined together. 
> 
> As I'm new to Spark I'm not sure if I can leverage Spark's processing model
> for this use case.
> 
> 
> 
> 
> 



Re: Is Spark suited for replacing a batch job using many database tables?

2016-07-06 Thread Mich Talebzadeh
Well, you can try it. I have done it with Oracle, SAP Sybase IQ etc., but you
need to be aware of the time that the JDBC connection is going to take to load
the data.
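
One way to keep that load time down is to split the read across parallel JDBC
partitions, e.g. (a sketch; the URL, credentials, partition column and bounds
are invented):

  val connProps = new java.util.Properties()
  connProps.setProperty("user", "dbuser")           // placeholder credentials
  connProps.setProperty("password", "dbpass")

  // Spark opens numPartitions connections, each fetching one slice of ORDER_ID,
  // instead of doing a single long serial read.
  val orders = spark.read.jdbc(
    "jdbc:db2://db2host:50000/MYDB",                // placeholder URL
    "MYSCHEMA.ORDERS",
    "ORDER_ID",                                     // partition column
    1L,                                             // lowerBound
    1000000L,                                       // upperBound
    10,                                             // numPartitions
    connProps)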

Sounds like your tables are pretty small so they can be cached.

Where are you going to store the result set etc?

HTH

Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 6 July 2016 at 20:54, Andreas Bauer <dabuks...@gmail.com> wrote:

>  In fact, yes.
>
>
>
>
> On 6. Juli 2016 um 21:46:34 MESZ, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
> So you want to use Spark as the query engine accessing DB2 tables via JDBC?
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 6 July 2016 at 20:39, Andreas Bauer <dabuks...@gmail.com> wrote:
>
> The SQL statements are embedded in a PL/1 program using DB2 running on
> z/OS. Quite powerful, but expensive and foremost shared with other jobs in
> the company. The whole job takes approx. 20 minutes.
>
> So I was thinking of using Spark and letting the Spark job run on 10 or 20
> virtual instances, which I can spawn easily, on-demand and almost for free
> using a cloud infrastructure.
>
>
>
>
> On 6. Juli 2016 um 21:29:53 MESZ, Jean Georges Perrin <j...@jgp.net> wrote:
>
> What are you doing it on right now?
>
> > On Jul 6, 2016, at 3:25 PM, dabuki wrote:
> >
> > I was thinking about replacing a legacy batch job with Spark, but I'm
> not
> > sure if Spark is suited for this use case. Before I start the proof of
> > concept, I wanted to ask for opinions.
> >
> > The legacy job works as follows: A file (100k - 1 million entries) is
> iterated.
> > Every row contains a (book) order with an id and for each row approx. 15
> > processing steps have to be performed that involve access to multiple
> > database tables. In total approx. 25 tables (each containing 10k-700k
> > entries) have to be scanned using the book's id and the retrieved data is
> > joined together.
> >
> > As I'm new to Spark I'm not sure if I can leverage Spark's processing
> model
> > for this use case.
> >
> >
> >
> >
> >
> >
>
>
>


Re: Is Spark suited for replacing a batch job using many database tables?

2016-07-06 Thread Jean Georges Perrin
Right now, I am having "fun" with Spark and 26446249960843350 datapoints on my 
MacBook Air, but my small friend is suffering... 

From my experience:
You will be able to do the job with Spark. You can try to load everything on a
dev machine; no need for a server, a workstation might be enough.
I would not recommend VMs when you go to production, unless you already have
them. Bare metal seems more suitable.

It's definitely worth a shot!
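
For the dev-machine route, the Spark-specific part is just local mode, e.g.
(a sketch, Spark 2.x):

  import org.apache.spark.sql.SparkSession

  // One JVM on the workstation, using all of its cores; enough for a proof of
  // concept before any cluster exists.
  val spark = SparkSession.builder()
    .appName("batch-poc-local")
    .master("local[*]")
    .getOrCreate()

Driver memory is set when the job is launched (e.g. spark-submit
--driver-memory), not from inside the code.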

> On Jul 6, 2016, at 3:39 PM, Andreas Bauer <dabuks...@gmail.com> wrote:
> 
> The SQL statements are embedded in a PL/1 program using DB2 running on z/OS.
> Quite powerful, but expensive and foremost shared with other jobs in the
> company. The whole job takes approx. 20 minutes.
> 
> So I was thinking of using Spark and letting the Spark job run on 10 or 20 virtual
> instances, which I can spawn easily, on-demand and almost for free using a 
> cloud infrastructure. 
> 
> 
> 
> 
> On 6. Juli 2016 um 21:29:53 MESZ, Jean Georges Perrin <j...@jgp.net> wrote:
>> What are you doing it on right now?
>> 
>> > On Jul 6, 2016, at 3:25 PM, dabuki wrote:
>> > 
>> > I was thinking about replacing a legacy batch job with Spark, but I'm not
>> > sure if Spark is suited for this use case. Before I start the proof of
>> > concept, I wanted to ask for opinions.
>> > 
>> > The legacy job works as follows: A file (100k - 1 million entries) is iterated.
>> > Every row contains a (book) order with an id and for each row approx. 15
>> > processing steps have to be performed that involve access to multiple
>> > database tables. In total approx. 25 tables (each containing 10k-700k
>> > entries) have to be scanned using the book's id and the retrieved data is
>> > joined together. 
>> > 
>> > As I'm new to Spark I'm not sure if I can leverage Spark's processing model
>> > for this use case.
>> > 
>> > 
>> > 
>> > 
>> > 
>> > 
>> 



Re: Is Spark suited for replacing a batch job using many database tables?

2016-07-06 Thread Andreas Bauer
 In fact, yes. 



On 6. Juli 2016 um 21:46:34 MESZ, Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:
> So you want to use Spark as the query engine accessing DB2 tables via JDBC?
>
> Dr Mich Talebzadeh
>
> On 6 July 2016 at 20:39, Andreas Bauer <dabuks...@gmail.com> wrote:
>> The SQL statements are embedded in a PL/1 program using DB2 running on
>> z/OS. Quite powerful, but expensive and foremost shared with other jobs in
>> the company. The whole job takes approx. 20 minutes.
>>
>> So I was thinking of using Spark and letting the Spark job run on 10 or 20
>> virtual instances, which I can spawn easily, on-demand and almost for free
>> using a cloud infrastructure.
>>
>> On 6. Juli 2016 um 21:29:53 MESZ, Jean Georges Perrin <j...@jgp.net> wrote:
>>> What are you doing it on right now?
>>>
>>> > On Jul 6, 2016, at 3:25 PM, dabuki wrote:
>>> > I was thinking about replacing a legacy batch job with Spark, but I'm
>>> > not sure if Spark is suited for this use case. Before I start the proof
>>> > of concept, I wanted to ask for opinions.
>>> >
>>> > The legacy job works as follows: A file (100k - 1 million entries) is
>>> > iterated. Every row contains a (book) order with an id and for each row
>>> > approx. 15 processing steps have to be performed that involve access to
>>> > multiple database tables. In total approx. 25 tables (each containing
>>> > 10k-700k entries) have to be scanned using the book's id and the
>>> > retrieved data is joined together.
>>> >
>>> > As I'm new to Spark I'm not sure if I can leverage Spark's processing
>>> > model for this use case.

Re: Is Spark suited for replacing a batch job using many database tables?

2016-07-06 Thread Mich Talebzadeh
So you want to use Spark as the query engine accessing DB2 tables via JDBC?

Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 6 July 2016 at 20:39, Andreas Bauer <dabuks...@gmail.com> wrote:

> The SQL statements are embedded in a PL/1 program using DB2 running on
> z/OS. Quite powerful, but expensive and foremost shared with other jobs in
> the company. The whole job takes approx. 20 minutes.
>
> So I was thinking of using Spark and letting the Spark job run on 10 or 20
> virtual instances, which I can spawn easily, on-demand and almost for free
> using a cloud infrastructure.
>
>
>
>
> On 6. Juli 2016 um 21:29:53 MESZ, Jean Georges Perrin <j...@jgp.net> wrote:
>
> What are you doing it on right now?
>
> > On Jul 6, 2016, at 3:25 PM, dabuki wrote:
> >
> > I was thinking about replacing a legacy batch job with Spark, but I'm
> not
> > sure if Spark is suited for this use case. Before I start the proof of
> > concept, I wanted to ask for opinions.
> >
> > The legacy job works as follows: A file (100k - 1 million entries) is
> iterated.
> > Every row contains a (book) order with an id and for each row approx. 15
> > processing steps have to be performed that involve access to multiple
> > database tables. In total approx. 25 tables (each containing 10k-700k
> > entries) have to be scanned using the book's id and the retrieved data is
> > joined together.
> >
> > As I'm new to Spark I'm not sure if I can leverage Spark's processing
> model
> > for this use case.
> >
> >
> >
> >
> >
> >
>
>


Re: Is Spark suited for replacing a batch job using many database tables?

2016-07-06 Thread Andreas Bauer
The SQL statements are embedded in a PL/1 program using DB2 running on z/OS.
Quite powerful, but expensive and foremost shared with other jobs in the
company. The whole job takes approx. 20 minutes. So I was thinking of using
Spark and letting the Spark job run on 10 or 20 virtual instances, which I can
spawn easily, on-demand and almost for free using a cloud infrastructure.



What are you doing it on right now?



> On Jul 6, 2016, at 3:25 PM, dabuki <dabuks...@gmail.com> wrote:

>  
> I was thinking about replacing a legacy batch job with Spark, but I'm not

> sure if Spark is suited for this use case. Before I start the proof of

> concept, I wanted to ask for opinions.

>  
> The legacy job works as follows: A file (100k - 1 million entries) is iterated.

> Every row contains a (book) order with an id and for each row approx. 15

> processing steps have to be performed that involve access to multiple

> database tables. In total approx. 25 tables (each containing 10k-700k

> entries) have to be scanned using the book's id and the retrieved data is

> joined together.  
>  
> As I'm new to Spark I'm not sure if I can leverage Spark's processing model

> for this use case.

>  
>  
>  
>  
>  




Re: Is Spark suited for replacing a batch job using many database tables?

2016-07-06 Thread Jean Georges Perrin
What are you doing it on right now?

> On Jul 6, 2016, at 3:25 PM, dabuki <dabuks...@gmail.com> wrote:
> 
> I was thinking about replacing a legacy batch job with Spark, but I'm not
> sure if Spark is suited for this use case. Before I start the proof of
> concept, I wanted to ask for opinions.
> 
> The legacy job works as follows: A file (100k - 1 million entries) is iterated.
> Every row contains a (book) order with an id and for each row approx. 15
> processing steps have to be performed that involve access to multiple
> database tables. In total approx. 25 tables (each containing 10k-700k
> entries) have to be scanned using the book's id and the retrieved data is
> joined together. 
> 
> As I'm new to Spark I'm not sure if I can leverage Spark's processing model
> for this use case.
> 
> 
> 
> 
> 



Is Spark suited for replacing a batch job using many database tables?

2016-07-06 Thread dabuki
I was thinking about replacing a legacy batch job with Spark, but I'm not
sure if Spark is suited for this use case. Before I start the proof of
concept, I wanted to ask for opinions.

The legacy job works as follows: A file (100k - 1 million entries) is iterated.
Every row contains a (book) order with an id and for each row approx. 15
processing steps have to be performed that involve access to multiple
database tables. In total approx. 25 tables (each containing 10k-700k
entries) have to be scanned using the book's id and the retrieved data is
joined together. 

As I'm new to Spark I'm not sure if I can leverage Spark's processing model
for this use case.
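
In Spark terms I imagine the shape of it would be roughly this (only a sketch
with invented names; Spark 2.x, two lookup tables stand in for the ~25, and
joins replace the per-row lookups):

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.functions.broadcast

  val spark = SparkSession.builder().appName("book-order-batch").getOrCreate()

  // The input file of orders (100k - 1 million rows).
  val orders = spark.read
    .option("header", "true")
    .csv("hdfs:///input/orders.csv")                // placeholder path

  // Lookup tables loaded earlier (e.g. over JDBC) and registered as temp views.
  val books   = spark.table("books")                // keyed by book_id
  val pricing = spark.table("pricing")              // keyed by book_id

  // Each per-row lookup becomes a join; small tables can be broadcast so the
  // order rows never have to be shuffled for them.
  val enriched = orders
    .join(broadcast(books), "book_id")
    .join(broadcast(pricing), "book_id")

  // ... the ~15 processing steps as further transformations on `enriched` ...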





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Is-Spark-suited-for-replacing-a-batch-job-using-many-database-tables-tp27300.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org