Could you create a view of the table on your JDBC data source and just query that from Spark?
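
A minimal sketch of what I mean, assuming Spark 2.x and a database that accepts a parenthesised subquery with a trailing alias. The names `JdbcPushdownSketch`, `encodedSubquery`, `readFromJdbc`, `big_table` and `columnName_code` are made up for illustration, and driver/credential options are left out. The `dbtable` option of the JDBC source takes anything that is valid in a FROM clause, so either a view name or a subquery should let the database do the encoding before rows are transferred:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object JdbcPushdownSketch {

  // Hypothetical helper: builds a subquery that encodes `column` on the
  // database side with CASE WHEN, instead of doing it on the Spark cluster.
  def encodedSubquery(table: String, column: String, codes: Seq[(String, Int)]): String = {
    val branches = codes.map { case (value, code) =>
      val escaped = value.replace("'", "''") // escape single quotes for the SQL literal
      s"WHEN $column = '$escaped' THEN $code"
    }.mkString(" ")
    s"(SELECT CASE $branches END AS ${column}_code FROM $table) encoded"
  }

  // Hypothetical helper: `dbtable` may be a view name created in the database
  // or a subquery such as the one produced by encodedSubquery.
  def readFromJdbc(spark: SparkSession, jdbcUrl: String, dbtable: String): DataFrame =
    spark.read
      .format("jdbc")
      .option("url", jdbcUrl)
      .option("dbtable", dbtable)
      .load()
}
```

Usage would look something like `readFromJdbc(spark, jdbcUrl, encodedSubquery("big_table", "columnName", Seq("foobar" -> 0, "foobarbaz" -> 1)))`, or simply `readFromJdbc(spark, jdbcUrl, "my_encoded_view")` if you create the view in the database yourself.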
Thanks,
Subhash

> On Mar 7, 2017, at 6:37 AM, El-Hassan Wanas <elhassan.wa...@gmail.com> wrote:
>
> As an example, this is basically what I'm doing:
>
> val myDF = originalDataFrame.select(when(col(columnName) === "foobar", 0).when(col(columnName) === "foobarbaz", 1))
>
> Except there are many more columns and many more conditionals. The generated
> Spark workflow starts with an SQL query that basically does:
>
> SELECT columnName, columnName2, etc. from table;
>
> Then the conditionals/transformations are evaluated on the cluster.
>
> Is there a way from the Dataset API to force the computation to happen on the
> SQL data source in this case? Or should I work with JDBCRDD and use
> createDataFrame on that?
>
>> On 03/07/2017 02:19 PM, Jörn Franke wrote:
>> Can you provide some source code? I am not sure I understood the problem.
>> If you want to do preprocessing at the JDBC data source, you can write
>> your own data source. Additionally, you may want to modify the SQL statement
>> to extract the data in the right format and push some preprocessing to the
>> database.
>>
>>> On 7 Mar 2017, at 12:04, El-Hassan Wanas <elhassan.wa...@gmail.com> wrote:
>>>
>>> Hello,
>>>
>>> There is, as usual, a big table lying on some JDBC data source. I am doing
>>> some data processing on that data from Spark; however, in order to speed up
>>> my analysis, I use reduced encodings and minimize the general size of the
>>> data before processing.
>>>
>>> Spark has been doing a great job at generating the proper workflows that do
>>> that preprocessing for me, but it seems to generate those workflows for
>>> execution on the Spark cluster. The issue with that is that the large
>>> transfer cost is still incurred.
>>>
>>> Is there any way to force Spark to run the preprocessing on the JDBC data
>>> source and get the prepared output DataFrame instead?
>>>
>>> Thanks,
>>>
>>> Wanas
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org