Just out of curiosity, what would happen if you put your 10K values into a temp table and then did a join against it?
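Something along these lines — a minimal, untested sketch assuming a live SparkSession named `spark` (as in your snippet); the `lookup_values` view name and the DataFrame-API variant at the end are just illustrative:

    from pyspark.sql.functions import broadcast

    df = spark.range(0, 100000, 1, 1)

    # Put the lookup values into their own single-column DataFrame
    # instead of inlining 100K literals into the IN predicate.
    values = spark.createDataFrame([(i,) for i in range(100000)], ['id'])

    # Register both sides and join via SQL...
    df.createOrReplaceTempView('data')
    values.createOrReplaceTempView('lookup_values')
    spark.sql(
        'SELECT COUNT(*) FROM data d JOIN lookup_values v ON d.id = v.id'
    ).show()

    # ...or the same thing with the DataFrame API; broadcasting the
    # small side hints Spark to avoid a shuffle.
    df.join(broadcast(values), 'id').count()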
> On Apr 5, 2017, at 4:30 PM, Maciej Bryński <mac...@brynski.pl> wrote:
>
> Hi,
> I'm trying to run queries with many values in the IN operator.
>
> The result is that for more than 10K values the IN operator gets slower.
>
> For example, this code runs for about 20 seconds:
>
> df = spark.range(0,100000,1,1)
> df.where('id in ({})'.format(','.join(map(str,range(100000))))).count()
>
> Any ideas on how to improve this?
> Is it a bug?
> --
> Maciek Bryński