I was hoping to use Spark in this case to generate that intermediate SQL as part of its workflow strategy, as a database-independent way of doing my preprocessing. Is there any way to capture the generated SQL from Catalyst? If so, I would just use JdbcRDD with that.
The other option is to generate that SQL as text myself, which isn't the nicest thing to do.

On Mar 7, 2017 5:02 PM, "Subhash Sriram" <subhash.sri...@gmail.com> wrote:

> Could you create a view of the table on your JDBC data source and just
> query that from Spark?
>
> Thanks,
> Subhash
>
> > On Mar 7, 2017, at 6:37 AM, El-Hassan Wanas <elhassan.wa...@gmail.com> wrote:
> >
> > As an example, this is basically what I'm doing:
> >
> > val myDF = originalDataFrame.select(
> >   when(col(columnName) === "foobar", 0).when(col(columnName) === "foobarbaz", 1))
> >
> > Except there are many more columns and many more conditionals. The
> > generated Spark workflow starts with a SQL query that basically does:
> >
> > SELECT columnName, columnName2, etc. from table;
> >
> > Then the conditionals/transformations are evaluated on the cluster.
> >
> > Is there a way from the Dataset API to force the computation to happen
> > on the SQL data source in this case? Or should I work with JdbcRDD and
> > use createDataFrame on that?
> >
> >> On 03/07/2017 02:19 PM, Jörn Franke wrote:
> >> Can you provide some source code? I am not sure I understood the
> >> problem.
> >> If you want to do preprocessing at the JDBC data source, then you can
> >> write your own data source. Additionally, you may want to modify the
> >> SQL statement to extract the data in the right format and push some
> >> preprocessing to the database.
> >>
> >>> On 7 Mar 2017, at 12:04, El-Hassan Wanas <elhassan.wa...@gmail.com> wrote:
> >>>
> >>> Hello,
> >>>
> >>> There is, as usual, a big table lying on some JDBC data source. I am
> >>> doing some data processing on that data from Spark. However, in order
> >>> to speed up my analysis, I use reduced encodings and minimize the
> >>> general size of the data before processing.
> >>>
> >>> Spark has been doing a great job at generating the proper workflows
> >>> that do that preprocessing for me, but it seems to generate those
> >>> workflows for execution on the Spark cluster. The issue with that is
> >>> that the large transfer cost is still incurred.
> >>>
> >>> Is there any way to force Spark to run the preprocessing on the JDBC
> >>> data source and get the prepared output DataFrame instead?
> >>>
> >>> Thanks,
> >>>
> >>> Wanas
> >>>
> >>> ---------------------------------------------------------------------
> >>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
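
For the "generate that SQL as text" option mentioned at the top of this message, here is a minimal sketch in plain Scala of what that could look like. It is only an illustration under assumptions: the object `CaseWhenSql` and its helpers `quoteLiteral`, `encodeColumn` and `pushdownSubquery` are hypothetical names, not part of Spark, and the `AS alias` form for a subquery is not accepted by every database, so the output may need adjusting per source.

```scala
// Hypothetical helper, not part of Spark: builds the encoding SQL as text.
object CaseWhenSql {

  // Quote a SQL string literal by doubling any embedded single quotes.
  def quoteLiteral(s: String): String =
    "'" + s.replace("'", "''") + "'"

  // Build a CASE expression that maps string values of one column to integer codes.
  // Column names are inserted as-is, so they must come from trusted input.
  def encodeColumn(column: String, codes: Seq[(String, Int)]): String = {
    val branches = codes.map { case (value, code) =>
      s"WHEN $column = ${quoteLiteral(value)} THEN $code"
    }
    s"CASE ${branches.mkString(" ")} END AS $column"
  }

  // Wrap the encoded columns in a parenthesized, aliased subquery.
  // The alias syntax is an assumption; some databases require omitting AS.
  def pushdownSubquery(
      table: String,
      encodings: Seq[(String, Seq[(String, Int)])],
      alias: String = "encoded"): String = {
    val selectList = encodings
      .map { case (column, codes) => encodeColumn(column, codes) }
      .mkString(", ")
    s"(SELECT $selectList FROM $table) AS $alias"
  }
}
```

The resulting string can be passed where the JDBC reader expects a table name, for example `spark.read.jdbc(url, subquery, connectionProperties)`, since the JDBC source accepts a parenthesized subquery in place of a table. The CASE expressions are then evaluated by the database and only the encoded columns are transferred. This is close to Subhash's view suggestion, without having to create the view on the database side.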