Re: [Pyspark, SQL] Very slow IN operator
If you just want to emulate pushing down a join, you can wrap the IN-list query in a JDBCRelation directly:

scala> val r_df = spark.read.format("jdbc").option("url", "jdbc:h2:/tmp/testdb").option("dbtable", "R").load()
r_df: org.apache.spark.sql.DataFrame = [A: int]

scala> r_df.show
+---+
|  A|
+---+
| 42|
|-42|
+---+

scala> val querystr = s"select * from R where a in (${(1 to 100000).mkString(",")})"
querystr: String = select * from R where a in (1,2,3,4,5,6,7,8,9,10,...,212,213,214,21...

scala> val filtered_df = spark.read.format("jdbc").option("url", "jdbc:h2:/tmp/testdb").option("dbtable", s"($querystr)").load()
filtered_df: org.apache.spark.sql.DataFrame = [A: int]

scala> filtered_df.show
+---+
|  A|
+---+
| 42|
+---+

Fred

On Thu, Apr 6, 2017 at 1:51 AM Maciej Bryński wrote:
> 2017-04-06 4:00 GMT+02:00 Michael Segel:
> > Just out of curiosity, what would happen if you put your 10K values
> > into a temp table and then did a join against it?
>
> The answer is predicate pushdown.
> In my case I'm using this kind of query on a JDBC table, and the IN
> predicate is executed on the DB in less than 1s.
>
> Regards,
> --
> Maciek Bryński
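[Editor's note: for PySpark readers, a minimal sketch of the same trick, assuming an H2 database at the same placeholder URL as the Scala example above; the table and column names are illustrative only, not from this thread.]

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder connection details; substitute your own JDBC URL and table.
url = "jdbc:h2:/tmp/testdb"
in_list = ",".join(str(i) for i in range(1, 100001))

# Pass the filtered query as a parenthesized subquery via `dbtable`, so the
# IN predicate runs on the database, not in Spark. Some databases require
# an alias on the subquery, hence the "AS t".
query = "(select * from R where a in ({})) AS t".format(in_list)

filtered_df = (spark.read.format("jdbc")
               .option("url", url)
               .option("dbtable", query)
               .load())
filtered_df.show()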
Re: [Pyspark, SQL] Very slow IN operator
2017-04-06 4:00 GMT+02:00 Michael Segel:
> Just out of curiosity, what would happen if you put your 10K values
> into a temp table and then did a join against it?

The answer is predicate pushdown.
In my case I'm using this kind of query on a JDBC table, and the IN predicate is executed on the DB in less than 1s.

Regards,
--
Maciek Bryński
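[Editor's note: a rough PySpark sketch of the pushdown being described, with placeholder connection details. Spark's JDBC source can translate a filter such as isin into the WHERE clause it sends to the database; look for PushedFilters in the physical plan to confirm.]

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder JDBC source; substitute real connection details.
df = (spark.read.format("jdbc")
      .option("url", "jdbc:h2:/tmp/testdb")
      .option("dbtable", "R")
      .load())

# An isin filter may be compiled to "A IN (...)" and executed by the DB.
filtered = df.where(df["A"].isin([1, 2, 3, 42]))

# Inspect PushedFilters in the output to verify the predicate moved.
filtered.explain()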
Re: [Pyspark, SQL] Very slow IN operator
Just out of curiosity, what would happen if you put your 10K values into a temp table and then did a join against it?

> On Apr 5, 2017, at 4:30 PM, Maciej Bryński wrote:
>
> Hi,
> I'm trying to run queries with many values in an IN operator.
>
> The result is that for more than 10K values the IN operator gets slower.
>
> For example, this code runs for about 20 seconds:
>
> df = spark.range(0, 100000, 1, 1)
> df.where('id in ({})'.format(','.join(map(str, range(100000))))).count()
>
> Any ideas how to improve this?
> Is it a bug?
> --
> Maciek Bryński
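[Editor's note: a minimal PySpark sketch of this temp-table-and-join idea, assuming the lookup values fit comfortably on the driver; all names here are illustrative.]

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.range(0, 100000, 1, 1)
df.createOrReplaceTempView("data")

# Put the lookup values in a temp view instead of inlining them in SQL.
values_df = spark.createDataFrame([(i,) for i in range(10000)], ["id"])
values_df.createOrReplaceTempView("lookup_values")

# The join replaces the 10K-value IN list.
result = spark.sql(
    "select d.id from data d join lookup_values v on d.id = v.id")
result.count()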
Re: [Pyspark, SQL] Very slow IN operator
Query building time is significant: it's a simple query, but a long one, at almost 400,000 characters. When I run your test, task deserialization takes an inordinate amount of time (0.9s), and building the query itself takes several seconds. I would recommend using a JOIN (a broadcast join, if your data set is small enough) when the alternative is a massive IN statement.

On Wed, Apr 5, 2017 at 2:31 PM, Maciej Bryński wrote:
> Hi,
> I'm trying to run queries with many values in an IN operator.
>
> The result is that for more than 10K values the IN operator gets slower.
>
> For example, this code runs for about 20 seconds:
>
> df = spark.range(0, 100000, 1, 1)
> df.where('id in ({})'.format(','.join(map(str, range(100000))))).count()
>
> Any ideas how to improve this?
> Is it a bug?
> --
> Maciek Bryński
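[Editor's note: a sketch of the suggested broadcast join, assuming the value set is small enough to ship to every executor; a left-semi join reproduces IN semantics, keeping each matching row once.]

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

df = spark.range(0, 100000, 1, 1)

# The small side: the values that would otherwise go into the IN list.
values_df = spark.createDataFrame([(i,) for i in range(10000)], ["id"])

# Broadcasting avoids a shuffle; "leftsemi" keeps only rows of df whose
# id appears in values_df, exactly like an IN filter.
result = df.join(broadcast(values_df), "id", "leftsemi")
print(result.count())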
[Pyspark, SQL] Very slow IN operator
Hi,
I'm trying to run queries with many values in an IN operator.

The result is that for more than 10K values the IN operator gets slower.

For example, this code runs for about 20 seconds:

df = spark.range(0, 100000, 1, 1)
df.where('id in ({})'.format(','.join(map(str, range(100000))))).count()

Any ideas how to improve this?
Is it a bug?
--
Maciek Bryński
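[Editor's note: a rough way to check where the time goes in this repro, under the assumption that explain() exercises parsing, analysis, and optimization without running a job, so the two timings separate planning overhead from execution.]

import time
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.range(0, 100000, 1, 1)
predicate = 'id in ({})'.format(','.join(map(str, range(100000))))

# Planning only: explain() builds the physical plan but runs no job.
start = time.time()
df.where(predicate).explain()
print("planning: {:.1f}s".format(time.time() - start))

# Full query, including code generation and execution.
start = time.time()
df.where(predicate).count()
print("end-to-end: {:.1f}s".format(time.time() - start))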