I'm trying to load some rows from a big SQL table.  Here is my code:

===
jdbcDF = sqlContext.read.format("jdbc").options(
    url="jdbc:postgresql://...",
    dbtable="mytable",
    partitionColumn="t",      # numeric column used to split the read
    lowerBound=1451577600,    # stride bounds for partitioning only,
    upperBound=1454256000,    # they do not filter rows
    numPartitions=1).load()
print(jdbcDF.count())
===

The code runs very slowly because Spark tries to load the whole table:
as far as I can tell, lowerBound and upperBound only control how the
read is split into partitions, they do not filter rows.
I know there is a workaround that uses a subquery.  I can use:

dbtable="(SELECT * FROM mytable WHERE t >= 1451577600 AND t <= 1454256000) tmp"

However, it's still slow because the subquery creates a temp table.
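
For completeness, the full read call I tried with that subquery looks
like this (same elided URL as above; PostgreSQL requires an alias,
here tmp, for a derived table in the FROM clause):

===
# The subquery is sent to the database as the "table" to read from;
# the alias tmp is required for a derived table.
jdbcDF = sqlContext.read.format("jdbc").options(
    url="jdbc:postgresql://...",
    dbtable="(SELECT * FROM mytable "
            "WHERE t >= 1451577600 AND t <= 1454256000) tmp").load()
print(jdbcDF.count())
===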

I would like to know how I can specify WHERE filters so that I don't
need to load the whole table.

From the Spark source code I guess the filter support in JDBCRelation
is the solution I'm looking for.  However, I don't know how to create
a filter and pass it down to the JDBC driver:
===
https://github.com/apache/spark/blob/40ed2af587cedadc6e5249031857a922b3b234ca/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRelation.scala
===
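
For example, I assume that a plain DataFrame filter would be pushed
down to the database as a WHERE clause, something like the sketch
below, but I'm not sure whether this actually prunes rows on the
server side:

===
# Sketch: hoping filter() gets pushed down to Postgres as a WHERE
# clause so only the matching rows are read, instead of a full scan.
jdbcDF = sqlContext.read.format("jdbc").options(
    url="jdbc:postgresql://...",
    dbtable="mytable").load()
filtered = jdbcDF.filter((jdbcDF.t >= 1451577600) & (jdbcDF.t <= 1454256000))
print(filtered.count())
===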



-- 
Thanks for your help,
Jyun-Fan Tsai
