Currently, the only way to do "in" from Python is to use the string-based
filter API (where you pass us an expression as a string, and we parse it
with our SQL parser). For example:
from pyspark.sql import Row
from pyspark.sql.functions import *
df = sc.parallelize([Row(name="test")]).toDF()
df.filter("name in ('a', 'b')").collect()
Out[1]: []
df.filter("name in ('test')").collect()
Out[2]: [Row(name=u'test')]
In general you want to avoid lambda functions whenever you can do the same
thing with a DataFrame expression. This is because your lambda function is a
black box that we cannot optimize (though you should certainly use them for
the advanced stuff that expressions can't handle).
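To make the contrast concrete, here's a rough sketch (column and values are
made up): the first filter is an expression the optimizer can see and push
down, while the second drops to an RDD lambda that we have to treat as opaque.

from pyspark.sql import Row

df = sc.parallelize([Row(name="test"), Row(name="other")]).toDF()

# Expression-based filter: visible to the optimizer, can be pushed down.
df.filter(df.name == "test").collect()

# Lambda-based filter: runs as an opaque Python function over the RDD,
# so none of the usual optimizations apply.
df.rdd.filter(lambda r: r.name == "test").collect()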
I opened SPARK-6536 <https://issues.apache.org/jira/browse/SPARK-6536> to
provide a nicer interface for this.
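For what it's worth, later PySpark releases expose this as a column method
(Column.isin); a minimal sketch of the equivalent filter, assuming a Spark
version that includes it:

from pyspark.sql.functions import col

approved = ["a", "b", "test"]

# Column-expression version of the string filter above;
# requires a PySpark release that has Column.isin.
df.filter(col("name").isin(approved)).collect()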
On Wed, Mar 25, 2015 at 7:41 AM, Stuart Layton <[email protected]>
wrote:
> I have a SparkSQL dataframe with a few billion rows that I need to
> quickly filter down to a few hundred thousand rows, using an operation like
> (syntax may not be correct)
>
> df = df[ df.filter(lambda x: x.key_col in approved_keys)]
>
> I was thinking about serializing the data using parquet and saving it to
> S3; however, as I want to optimize for filtering speed, I'm not sure this is
> the best option.
>
> --
> Stuart Layton
>