Just out of curiosity, what would happen if you put your 10K values into a temp table and then did a join against it?
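Something along these lines — a minimal, untested sketch assuming a live SparkSession named `spark` (as in your snippet); the `lookup_values` view name and the DataFrame-API variant at the end are just illustrative:

    from pyspark.sql.functions import broadcast

    df = spark.range(0, 100000, 1, 1)

    # Put the lookup values into their own single-column DataFrame
    # instead of inlining 100K literals into the IN predicate.
    values = spark.createDataFrame([(i,) for i in range(100000)], ['id'])

    # Register both sides and join via SQL...
    df.createOrReplaceTempView('data')
    values.createOrReplaceTempView('lookup_values')
    spark.sql(
        'SELECT COUNT(*) FROM data d JOIN lookup_values v ON d.id = v.id'
    ).show()

    # ...or the same thing with the DataFrame API; broadcasting the
    # small side hints Spark to avoid a shuffle.
    df.join(broadcast(values), 'id').count()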
> On Apr 5, 2017, at 4:30 PM, Maciej Bryński <mac...@brynski.pl> wrote:
>
> Hi,
> I'm trying to run queries with many values in the IN operator.
>
> The result is that for more than 10K values the IN operator gets slower.
>
> For example, this code runs for about 20 seconds:
>
> df = spark.range(0,100000,1,1)
> df.where('id in ({})'.format(','.join(map(str,range(100000))))).count()
>
> Any ideas on how to improve this?
> Is it a bug?
> --
> Maciek Bryński