Re: [Pyspark, SQL] Very slow IN operator

2017-04-06 Thread Fred Reiss
If you just want to emulate pushing a join down to the database, you can
wrap the IN-list query in a JDBCRelation directly:

scala> val r_df = spark.read.format("jdbc").option("url",
  "jdbc:h2:/tmp/testdb").option("dbtable", "R").load()
r_df: org.apache.spark.sql.DataFrame = [A: int]

scala> r_df.show
+---+
|  A|
+---+
| 42|
|-42|
+---+

scala> val querystr = s"select * from R where a in (${(1 to 100000).mkString(",")})"
querystr: String = select * from R where a in
(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169,170,171,172,173,174,175,176,177,178,179,180,181,182,183,184,185,186,187,188,189,190,191,192,193,194,195,196,197,198,199,200,201,202,203,204,205,206,207,208,209,210,211,212,213,214,21...

scala> val filtered_df = spark.read.format("jdbc").option("url",
  "jdbc:h2:/tmp/testdb").option("dbtable", s"($querystr)").load()
filtered_df: org.apache.spark.sql.DataFrame = [A: int]

scala> filtered_df.show
+---+
|  A|
+---+
| 42|
+---+
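
Since the original question was from PySpark, here is the same trick there,
as a rough, untested sketch (it assumes the same H2 database and table R as
in the Scala session above):

# Rough PySpark sketch of the same subquery-pushdown trick (assumes the
# H2 test DB and table R from the Scala session above).
querystr = 'select * from R where a in ({})'.format(
    ','.join(map(str, range(1, 100001))))

filtered_df = spark.read.format('jdbc') \
    .option('url', 'jdbc:h2:/tmp/testdb') \
    .option('dbtable', '({})'.format(querystr)) \
    .load()
filtered_df.show()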


Fred


On Thu, Apr 6, 2017 at 1:51 AM Maciej Bryński  wrote:

> 2017-04-06 4:00 GMT+02:00 Michael Segel :
> > Just out of curiosity, what would happen if you put your 10K values into
> > a temp table and then did a join against it?
>
> The answer is predicate pushdown.
> In my case I'm using this kind of query on a JDBC table, and the IN
> predicate is executed on the DB in less than 1s.
>
>
> Regards,
> --
> Maciek Bryński


Re: [Pyspark, SQL] Very slow IN operator

2017-04-06 Thread Maciej Bryński
2017-04-06 4:00 GMT+02:00 Michael Segel :
> Just out of curiosity, what would happen if you put your 10K values into a
> temp table and then did a join against it?

The answer is predicate pushdown.
In my case I'm using this kind of query on a JDBC table, and the IN
predicate is executed on the DB in less than 1s.
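
You can verify the pushdown by looking for PushedFilters in the physical
plan. An illustrative sketch (it assumes a JDBC table R with an int column
A; the plan output below is approximate):

df = spark.read.format('jdbc') \
    .option('url', 'jdbc:h2:/tmp/testdb') \
    .option('dbtable', 'R') \
    .load()
df.where(df.A.isin(1, 2, 3)).explain()
# == Physical Plan ==
# *Scan JDBCRelation(R) [A#0] PushedFilters: [In(A, [1,2,3])], ...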


Regards,
-- 
Maciek Bryński




Re: [Pyspark, SQL] Very slow IN operator

2017-04-05 Thread Michael Segel
Just out of curiosity, what would happen if you put your 10K values into a
temp table and then did a join against it?
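
Something like this, roughly (an untested sketch; names are illustrative):

# Hypothetical sketch: register the values as a temp view, then join.
values_df = spark.createDataFrame([(v,) for v in range(10000)], ['id'])
values_df.createOrReplaceTempView('vals')
df.createOrReplaceTempView('data')
spark.sql('select data.* from data join vals on data.id = vals.id').count()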

> On Apr 5, 2017, at 4:30 PM, Maciej Bryński  wrote:
> 
> Hi,
> I'm trying to run queries with many values in an IN operator.
> 
> The result is that with more than 10K values the IN operator gets slower and slower.
> 
> For example, this code runs for about 20 seconds:
> 
> df = spark.range(0, 100000, 1, 1)
> df.where('id in ({})'.format(','.join(map(str, range(100000))))).count()
> 
> Any ideas how to improve this?
> Is it a bug?
> -- 
> Maciek Bryński





Re: [Pyspark, SQL] Very slow IN operator

2017-04-05 Thread Garren Staubli
Query building time is significant here: the query is simple, but it is
almost 4,000 characters long.

When I run your test, task deserialization takes up an inordinate amount of
time (0.9s), and building the query itself takes several seconds.

I would recommend using a JOIN (a broadcast join if your data set is small
enough) instead of a massive IN statement.
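
A rough sketch of the broadcast-join alternative (untested; names are
illustrative):

from pyspark.sql.functions import broadcast

# Put the IN values into a small DataFrame and broadcast it so the join
# avoids a shuffle.
ids = spark.createDataFrame([(v,) for v in range(10000)], ['id'])
df.join(broadcast(ids), 'id').count()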

On Wed, Apr 5, 2017 at 2:31 PM, Maciej Bryński [via Apache Spark Developers
List]  wrote:

> Hi,
> I'm trying to run queries with many values in an IN operator.
>
> The result is that with more than 10K values the IN operator gets slower and slower.
>
> For example, this code runs for about 20 seconds:
>
> df = spark.range(0, 100000, 1, 1)
> df.where('id in ({})'.format(','.join(map(str, range(100000))))).count()
>
> Any ideas how to improve this?
> Is it a bug?
> --
> Maciek Bryński






[Pyspark, SQL] Very slow IN operator

2017-04-05 Thread Maciej Bryński
Hi,
I'm trying to run queries with many values in an IN operator.

The result is that with more than 10K values the IN operator gets slower and slower.

For example, this code runs for about 20 seconds:

df = spark.range(0, 100000, 1, 1)
df.where('id in ({})'.format(','.join(map(str, range(100000))))).count()

Any ideas how to improve this?
Is it a bug?
-- 
Maciek Bryński
