If you just want to emulate pushing down a join, you can wrap the IN-list
query in a JDBCRelation directly:

scala> val r_df = spark.read.format("jdbc").option("url", "jdbc:h2:/tmp/testdb").option("dbtable", "R").load()
r_df: org.apache.spark.sql.DataFrame = [A: int]

scala> r_df.show
+---+
|  A|
+---+
| 42|
|-42|
+---+

scala> val querystr = s"select * from R where a in (${(1 to 100000).mkString(",")})"
querystr: String = select * from R where a in
(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169,170,171,172,173,174,175,176,177,178,179,180,181,182,183,184,185,186,187,188,189,190,191,192,193,194,195,196,197,198,199,200,201,202,203,204,205,206,207,208,209,210,211,212,213,214,21...

scala> val filtered_df = spark.read.format("jdbc").option("url", "jdbc:h2:/tmp/testdb").option("dbtable", s"($querystr)").load()
filtered_df: org.apache.spark.sql.DataFrame = [A: int]

scala> filtered_df.show
+---+
|  A|
+---+
| 42|
+---+
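If you want to reuse the trick, the same thing can be wrapped in a small
helper. A minimal sketch, assuming the same H2 URL and table R as above (the
helper name readWithInList is made up for illustration; note that some
databases also require an alias on the derived table, e.g. s"($query) tmp"):

import org.apache.spark.sql.{DataFrame, SparkSession}

// Build the IN-list query and hand it to the JDBC source as a subquery,
// so the whole filter is evaluated on the database side.
def readWithInList(spark: SparkSession, url: String, table: String,
                   column: String, values: Seq[Int]): DataFrame = {
  val query = s"select * from $table where $column in (${values.mkString(",")})"
  spark.read.format("jdbc")
    .option("url", url)
    .option("dbtable", s"($query)")
    .load()
}

// Same result as the session above:
val filtered = readWithInList(spark, "jdbc:h2:/tmp/testdb", "R", "a", 1 to 100000)
filtered.show()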


Fred


On Thu, Apr 6, 2017 at 1:51 AM Maciej Bryński <mac...@brynski.pl> wrote:

> 2017-04-06 4:00 GMT+02:00 Michael Segel <msegel_had...@hotmail.com>:
> > Just out of curiosity, what would happen if you put your 10K values
> > into a temp table and then did a join against it?
>
> The answer is predicate pushdown.
> In my case I'm using this kind of query on a JDBC table, and the IN
> predicate is executed on the DB in less than 1s.
>
>
> Regards,
> --
> Maciek Bryński
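For comparison with Michael's temp-table question quoted above, here is a
rough sketch of what the join variant would look like (reusing r_df from the
session at the top; Spark's JDBC source does not push joins down to the
database, so R is still scanned from the DB and the join itself runs in
Spark):

import spark.implicits._

// The ~100K values as a local DataFrame instead of an IN list.
val keys_df = (1 to 100000).toDF("A")

// The join is planned by Spark (likely as a broadcast join), not by H2,
// so the JDBC side stays a plain scan of R.
val joined = r_df.join(keys_df, Seq("A"))
joined.explain()
joined.show()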
