Re: Spark JDBC reads

Subhash Sriram Tue, 07 Mar 2017 06:12:13 -0800

Could you create a view of the table on your JDBC data source and just query 
that from Spark?
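
As a rough sketch of that idea (not tested against a real database): the
encoding can be written as a SQL CASE expression and wrapped in either a view
or a parenthesised subquery, so the database returns the small encoded column
instead of the raw strings. The names big_table, columnCode and t are made up
for illustration, and the helper only builds the SQL text.

```scala
// Sketch only: builds SQL text. big_table, columnCode and t are hypothetical names.
object PushdownQuery {
  // Escape single quotes so a value can sit inside a SQL string literal.
  private def sqlLiteral(value: String): String =
    "'" + value.replace("'", "''") + "'"

  // Render "CASE WHEN col = 'v' THEN code ... END AS alias" for one column.
  def caseWhen(column: String, codes: Seq[(String, Int)], alias: String): String = {
    val branches = codes
      .map { case (value, code) => s"WHEN $column = ${sqlLiteral(value)} THEN $code" }
      .mkString(" ")
    s"CASE $branches END AS $alias"
  }

  // Wrap the projections in a parenthesised, aliased subquery.
  def subquery(table: String, projections: Seq[String], alias: String): String =
    s"(SELECT ${projections.mkString(", ")} FROM $table) AS $alias"
}

object PushdownExample {
  // Same conditionals as the select/when example quoted below.
  val encoded: String =
    PushdownQuery.caseWhen("columnName", Seq("foobar" -> 0, "foobarbaz" -> 1), "columnCode")

  // Text that could stand in for a table name on the JDBC side.
  val dbtable: String = PushdownQuery.subquery("big_table", Seq(encoded), "t")
}
```

The inner SELECT could be the body of a CREATE VIEW on the database, or the
whole string could be passed as the table argument of
spark.read.jdbc(url, table, connectionProperties). Whether the "AS t" alias
syntax is accepted depends on the database.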


Thanks,
Subhash 


> On Mar 7, 2017, at 6:37 AM, El-Hassan Wanas <elhassan.wa...@gmail.com> wrote:
> 
> As an example, this is basically what I'm doing:
> 
>      import org.apache.spark.sql.functions.{col, when}
>      val myDF = originalDataFrame.select(
>        when(col(columnName) === "foobar", 0).when(col(columnName) === "foobarbaz", 1))
> 
> Except there are many more columns and many more conditionals. The generated 
> Spark workflow starts with a SQL query that basically does:
> 
>    SELECT columnName, columnName2, etc. from table;
> 
> Then the conditionals/transformations are evaluated on the cluster.
> 
> Is there a way from the Dataset API to force the computation to happen on the 
> SQL data source in this case? Or should I work with JdbcRDD and use 
> createDataFrame on that?
> 
> 
>> On 03/07/2017 02:19 PM, Jörn Franke wrote:
>> Can you provide some source code? I am not sure I understood the problem.
>> If you want to do preprocessing at the JDBC data source, then you can write 
>> your own data source. Additionally, you may want to modify the SQL statement 
>> to extract the data in the right format and push some preprocessing to the 
>> database.
>> 
>>> On 7 Mar 2017, at 12:04, El-Hassan Wanas <elhassan.wa...@gmail.com> wrote:
>>> 
>>> Hello,
>>> 
>>> There is, as usual, a big table lying on some JDBC data source. I am doing 
>>> some data processing on that data from Spark. However, in order to speed up 
>>> my analysis, I use reduced encodings and minimize the overall size of the 
>>> data before processing.
>>> 
>>> Spark has been doing a great job of generating the proper workflows that do 
>>> that preprocessing for me, but it seems to generate those workflows for 
>>> execution on the Spark cluster. The issue is that the large transfer 
>>> cost is still incurred.
>>> 
>>> Is there any way to force Spark to run the preprocessing on the JDBC data 
>>> source and get the prepared output DataFrame instead?
>>> 
>>> Thanks,
>>> 
>>> Wanas
>>> 
>>> 
>>> ---------------------------------------------------------------------
>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>> 
