Hi Nishit,

Yes, the JDBC connector supports predicate pushdown and column pruning, so any 
selection you make on the DataFrame will be reflected in the query sent over 
JDBC.


You should be able to verify this by looking at the physical query plan:


val df = sqlContext.jdbc(....).select($"col1", $"col2")

df.explain(true)


Or, if you can easily log the queries submitted to your database, you can view 
the exact query that was sent.
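For example (a sketch only — the JDBC URL, table name, and columns below are hypothetical), the pruned columns and pushed filters should be visible in the JDBC relation of the physical plan:

```scala
// Sketch: assumes a Spark 1.x sqlContext; "people", jdbcUrl, col1/col2 are hypothetical.
import sqlContext.implicits._

val jdbcUrl = "jdbc:postgresql://host/db"   // assumption: your JDBC URL

val df = sqlContext.read
  .format("jdbc")
  .option("url", jdbcUrl)
  .option("dbtable", "people")              // hypothetical lookup table
  .load()
  .select($"col1", $"col2")
  .filter($"col1" > 10)

// The physical plan should list only col1 and col2 in the JDBCRelation,
// with the filter appearing as a pushed filter rather than a Spark-side Filter.
df.explain(true)
```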


Thanks,

Silvio

________________________________
From: Jain, Nishit <nja...@underarmour.com>
Sent: Thursday, November 3, 2016 12:32:48 PM
To: Denny Lee; user@spark.apache.org
Subject: Re: How do I convert a data frame to broadcast variable?

Thanks Denny! That does help. I will give that a shot.

Question: If I go this route, how can I read only a few columns of a table (not 
the whole table) over JDBC as a DataFrame?
This method on the DataFrame reader does not give an option to read only 
certain columns:
def jdbc(url: String, table: String, predicates: Array[String], 
connectionProperties: Properties): DataFrame

On the other hand, if I create a JdbcRDD I can specify a select query (instead 
of a full table):
new JdbcRDD(sc: SparkContext, getConnection: () => Connection, sql: String, 
lowerBound: Long, upperBound: Long, numPartitions: Int, mapRow: (ResultSet) => T 
= JdbcRDD.resultSetToObjectArray)(implicit arg0: ClassTag[T])

Maybe if I do df.select(col1, col2) on a DataFrame created from a table, Spark 
will be smart enough to fetch only those two columns instead of the entire table?
Is there any way to test this?
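(For what it's worth, one workaround I know of — sketched here with a hypothetical table "mytable" — is that the table argument also accepts a parenthesized subquery with an alias, so the projection runs inside the database:)

```scala
// Sketch only: jdbcUrl and "mytable" are hypothetical.
// A parenthesized subquery aliased as a table is accepted wherever a table
// name is, so only col1 and col2 ever leave the database.
val props = new java.util.Properties()
val df = sqlContext.read.jdbc(
  jdbcUrl,                                  // assumption: your JDBC URL
  "(select col1, col2 from mytable) t",     // projection pushed into the query
  props)
```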


From: Denny Lee <denny.g....@gmail.com<mailto:denny.g....@gmail.com>>
Date: Thursday, November 3, 2016 at 10:59 AM
To: "Jain, Nishit" <nja...@underarmour.com<mailto:nja...@underarmour.com>>, 
"user@spark.apache.org<mailto:user@spark.apache.org>" 
<user@spark.apache.org<mailto:user@spark.apache.org>>
Subject: Re: How do I convert a data frame to broadcast variable?

If you're able to read the data in as a DataFrame, perhaps you can use a 
BroadcastHashJoin; that way you can join to that table, presuming it's small 
enough to distribute. Here's a handy guide on BroadcastHashJoin: 
https://docs.cloud.databricks.com/docs/latest/databricks_guide/index.html#04%20SQL,%20DataFrames%20%26%20Datasets/05%20BroadcastHashJoin%20-%20scala.html
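As a quick sketch (factDf, lookupDf, and the join key "lookupKey" are hypothetical DataFrames/columns), you can hint the broadcast explicitly with the broadcast function from org.apache.spark.sql.functions:

```scala
import org.apache.spark.sql.functions.broadcast

// lookupDf is the small lookup table; broadcast() hints Spark to ship it to
// every executor, producing a BroadcastHashJoin instead of a shuffle join.
val joined = factDf.join(broadcast(lookupDf), Seq("lookupKey"))

// The physical plan should show a BroadcastHashJoin node.
joined.explain(true)
```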

HTH!


On Thu, Nov 3, 2016 at 8:53 AM Jain, Nishit 
<nja...@underarmour.com<mailto:nja...@underarmour.com>> wrote:
I have a lookup table in a HANA database, and I want to create a Spark 
broadcast variable for it.
What would be the suggested approach? Should I read it as a DataFrame and 
convert the DataFrame into a broadcast variable?
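One common pattern for this (a sketch only — lookupDf and its "id"/"value" columns are hypothetical) is to collect the small DataFrame to the driver and broadcast the resulting map:

```scala
// Sketch: assumes lookupDf has two string columns, "id" and "value",
// and is small enough to fit on the driver.
val lookupMap: Map[String, String] =
  lookupDf.collect().map(r => r.getString(0) -> r.getString(1)).toMap

// Broadcast once; each executor gets a read-only local copy.
val bcLookup = sc.broadcast(lookupMap)

// Executors can then do cheap in-memory lookups, e.g.:
// someRdd.map(x => bcLookup.value.getOrElse(x, "unknown"))
```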

Thanks,
Nishit
