Re: Scalable JDBCRDD

2015-03-02 Thread Cody Koeninger
Have you already tried using the Vertica Hadoop input format with Spark? I don't know how it's implemented, but I'd hope that it has some notion of Vertica-specific shard locality (which JdbcRDD does not). If you're really constrained to consuming the result set in a single thread, whatever

Re: Scalable JDBCRDD

2015-03-02 Thread Michal Klos
Hi Cody, Thanks for the reply. Yeah, we thought of possibly doing this in a UDX in Vertica somehow to get the lower-level cooperation, but it's a bit daunting. We want to do this because there are things we want to do with the result set in Spark that are not possible in Vertica. The DStream

Re: Scalable JDBCRDD

2015-03-01 Thread Jörn Franke
What database are you using? On 28 Feb 2015, 18:15, Michal Klos michal.klo...@gmail.com wrote: Hi Spark community, We have a use case where we need to pull huge amounts of data from a SQL query against a database into Spark. We need to execute the query against our huge database and not

Re: Scalable JDBCRDD

2015-03-01 Thread Cody Koeninger
I'm a little confused by your comments regarding LIMIT. There's nothing about JdbcRDD that depends on LIMIT. You just need to be able to partition your data in some way such that it has numeric upper and lower bounds. Primary key range scans, not LIMIT, would ordinarily be the best way to do
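For reference, here is a minimal sketch of the range partitioning Cody describes, mirroring the stride logic used by Spark's JdbcRDD.getPartitions (the `KeyRange` case class and function name here are illustrative, not Spark API):

```scala
// Sketch of JdbcRDD-style partitioning: split a numeric key range
// [lowerBound, upperBound] into numPartitions contiguous sub-ranges.
// Each sub-range's bounds are later bound to the two "?" placeholders
// in a query like: SELECT * FROM t WHERE id >= ? AND id <= ?
case class KeyRange(index: Int, start: Long, end: Long)

def rangePartitions(lowerBound: Long, upperBound: Long,
                    numPartitions: Int): Seq[KeyRange] = {
  // BigInt avoids overflow when the range spans most of Long
  val length = BigInt(1) + upperBound - lowerBound
  (0 until numPartitions).map { i =>
    val start = lowerBound + ((i * length) / numPartitions)
    val end   = lowerBound + (((i + 1) * length) / numPartitions) - 1
    KeyRange(i, start.toLong, end.toLong)
  }
}
```

For example, `rangePartitions(1, 100, 4)` yields the sub-ranges (1,25), (26,50), (51,75), (76,100); each becomes one partition's WHERE clause, so the database sees N cheap index range scans rather than N increasingly expensive LIMIT/OFFSET queries.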

Re: Scalable JDBCRDD

2015-03-01 Thread michal.klo...@gmail.com
Jorn: Vertica. Cody: I posited the LIMIT just as an example of how JdbcRDD could be used least invasively. Let's say we used a partition on a time field -- we would still need to have N executions of those queries. The queries we have are very intense and concurrency is an issue even if the
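To make the concern concrete: even when partitioning on a time field, the JdbcRDD approach still issues one query per partition, each carrying the full expensive query body with a different window predicate. A small sketch of that predicate generation (column name and function are hypothetical, not from the thread):

```scala
import java.time.LocalDate
import java.time.temporal.ChronoUnit

// Split [start, end) into n date windows. Each window becomes the WHERE
// clause of a separate execution of the underlying query, so the
// database sees n concurrent runs of the expensive query.
def timePredicates(start: LocalDate, end: LocalDate, n: Int): Seq[String] = {
  val totalDays = ChronoUnit.DAYS.between(start, end)
  (0 until n).map { i =>
    val lo = start.plusDays(i * totalDays / n)
    val hi = start.plusDays((i + 1) * totalDays / n)
    s"event_time >= '$lo' AND event_time < '$hi'"
  }
}
```

This is exactly the N-executions problem Michal raises: the partitioning spreads the result set across tasks, but it multiplies, rather than divides, the load of the query itself on a concurrency-constrained database.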

Re: Scalable JDBCRDD

2015-03-01 Thread eric
What you're saying is that, due to the intensity of the query, you need to run a single query and partition the results, versus running one query for each partition. I assume it's not viable to throw the query results into another table in your database and then query that using the normal

Re: Scalable JDBCRDD

2015-03-01 Thread michal.klo...@gmail.com
Yes, exactly. The temp table is an approach, but then we need to manage its deletion, etc. I'm sure we won't be the only people with this crazy use case. If there isn't a feasible way to do this within the framework then that's okay. But if there is a way, we are happy to write the code and

Scalable JDBCRDD

2015-02-28 Thread Michal Klos
Hi Spark community, We have a use case where we need to pull huge amounts of data from a SQL query against a database into Spark. We need to execute the query against our huge database and not a substitute (SparkSQL, Hive, etc) because of a couple of factors including custom functions used in the