Currently, the only way to do "in" from Python is to use the string-based
filter API (where you pass us an expression as a string, and we parse it
with our SQL parser). For example:
from pyspark.sql import Row
from pyspark.sql.functions import *
df = sc.parallelize([Row(name="test")]).toDF()
df.filter("name in ('a', 'b')").collect()
Out[1]: []
df.filter("name in ('test')").collect()
Out[2]: [Row(name=u'test')]
In general you want to avoid lambda functions whenever you can do the same
thing with a DataFrame expression. This is because your lambda function is a
black box that we cannot optimize (though you should certainly use them for
the advanced stuff that expressions can't handle).
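To make the contrast concrete, here's a rough sketch (column and values are
made up): the first filter is an expression the optimizer can see and push
down, while the second drops to an RDD lambda that we have to treat as opaque.

from pyspark.sql import Row

df = sc.parallelize([Row(name="test"), Row(name="other")]).toDF()

# Expression-based filter: visible to the optimizer, can be pushed down.
df.filter(df.name == "test").collect()

# Lambda-based filter: runs as an opaque Python function over the RDD,
# so none of the usual optimizations apply.
df.rdd.filter(lambda r: r.name == "test").collect()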
I opened SPARK-6536 <https://issues.apache.org/jira/browse/SPARK-6536> to
provide a nicer interface for this.
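For what it's worth, later PySpark releases expose this as a column method
(Column.isin); a minimal sketch of the equivalent filter, assuming a Spark
version that includes it:

from pyspark.sql.functions import col

approved = ["a", "b", "test"]

# Column-expression version of the string filter above;
# requires a PySpark release that has Column.isin.
df.filter(col("name").isin(approved)).collect()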
On Wed, Mar 25, 2015 at 7:41 AM, Stuart Layton <[email protected]>
wrote:
> I have a SparkSQL dataframe with a few billion rows that I need to
> quickly filter down to a few hundred thousand rows, using an operation like
> (syntax may not be correct)
>
> df = df[ df.filter(lambda x: x.key_col in approved_keys)]
>
> I was thinking about serializing the data using parquet and saving it to
> S3; however, as I want to optimize for filtering speed, I'm not sure this is
> the best option.
>
> --
> Stuart Layton
>