Thanks for the response. I was using IN as an example of the type of
operation I need to do. Is there another way to do this that lines up more
naturally with how things are supposed to be done in SparkSQL?
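
For concreteness, here is roughly what the string-based filter suggested
below would look like for my case (just a sketch; key_col and approved_keys
are placeholders from my original mail, and this assumes approved_keys is a
small list of strings):

from pyspark.sql import Row

approved_keys = ['k1', 'k2', 'k3']  # placeholder list of allowed keys
df = sc.parallelize([Row(key_col='k1'), Row(key_col='x9')]).toDF()

# build the IN list as a SQL string and let the SQL parser handle it
in_list = ", ".join("'{}'".format(k) for k in approved_keys)
df.filter("key_col in ({})".format(in_list)).collect()
# expected: [Row(key_col=u'k1')]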

On Wed, Mar 25, 2015 at 2:29 PM, Michael Armbrust <mich...@databricks.com>
wrote:

> The only way to do "in" using python currently is to use the string based
> filter API (where you pass us an expression as a string, and we parse it
> using our SQL parser).
>
> from pyspark.sql import Row
> from pyspark.sql.functions import *
>
> df = sc.parallelize([Row(name="test")]).toDF()
> df.filter("name in ('a', 'b')").collect()
> Out[1]: []
>
> df.filter("name in ('test')").collect()
> Out[2]: [Row(name=u'test')]
>
> In general you want to avoid lambda functions whenever you can do the same
> thing with a dataframe expression.  This is because your lambda function is
> a black box that we cannot optimize (though you should certainly use them
> for the advanced stuff that expressions can't handle).
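>
> Roughly, the difference looks like this (just a sketch, reusing the toy
> df from above):
>
> df.rdd.filter(lambda r: r.name == 'test').collect()  # lambda: a black box to the optimizer
> df.filter(df['name'] == 'test').collect()            # column expression: can be optimized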
>
> I opened SPARK-6536 <https://issues.apache.org/jira/browse/SPARK-6536> to
> provide a nicer interface for this.
>
>
> On Wed, Mar 25, 2015 at 7:41 AM, Stuart Layton <stuart.lay...@gmail.com>
> wrote:
>
>> I have a SparkSQL dataframe with a few billion rows that I need to
>> quickly filter down to a few hundred thousand rows, using an operation
>> like the following (syntax may not be correct):
>>
>> df = df[ df.filter(lambda x: x.key_col in approved_keys)]
>>
>> I was thinking about serializing the data using parquet and saving it to
>> S3; however, since I want to optimize for filtering speed, I'm not sure
>> this is the best option.
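>>
>> (For the parquet route I was picturing something like
>> df.saveAsParquetFile("s3n://my-bucket/path"); the bucket and path are
>> just placeholders.)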
>>
>> --
>> Stuart Layton
>>
>
>


-- 
Stuart Layton
