My example is a totally reasonable way to do it; it just requires
constructing strings.
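
For your IN case that means building the expression text yourself, e.g.
(untested sketch; uses the toy df from the quoted example below and the
approved_keys list from your original question, and keep in mind the
values get pasted into SQL text, so quote them carefully):

approved_keys = ['a', 'b', 'test']
in_expr = "name in ({0})".format(
    ", ".join("'{0}'".format(k) for k in approved_keys))
df.filter(in_expr).collect()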

In many cases you can also do it with column objects:

df[df.name == "test"].collect()

Out[15]: [Row(name=u'test')]
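
If you need the IN itself as column objects, you can OR together equality
tests (untested sketch; reuses the hypothetical approved_keys list from
above):

from functools import reduce
from operator import or_

cond = reduce(or_, [df.name == k for k in approved_keys])
df[cond].collect()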


You should check out:
http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Column
and
http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#module-pyspark.sql.functions

On Wed, Mar 25, 2015 at 11:39 AM, Stuart Layton <[email protected]>
wrote:

> Thanks for the response, I was using IN as an example of the type of
> operation I need to do. Is there another way to do this that lines up more
> naturally with the way things are supposed to be done in SparkSQL?
>
> On Wed, Mar 25, 2015 at 2:29 PM, Michael Armbrust <[email protected]>
> wrote:
>
>> The only way to do "in" using Python currently is to use the string-based
>> filter API (where you pass us an expression as a string, and we parse it
>> using our SQL parser).
>>
>> from pyspark.sql import Row
>> from pyspark.sql.functions import *
>>
>> df = sc.parallelize([Row(name="test")]).toDF()
>> df.filter("name in ('a', 'b')").collect()
>> Out[1]: []
>>
>> df.filter("name in ('test')").collect()
>> Out[2]: [Row(name=u'test')]
>>
>> In general you want to avoid lambda functions whenever you can do the
>> same thing with a dataframe expression. This is because your lambda
>> function is a black box that we cannot optimize (though you should
>> certainly use them for the advanced stuff that expressions can't handle).
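>>
>> To make that concrete, compare the two (untested sketch; the lambda
>> version has to drop to the RDD API, since DataFrame.filter only accepts
>> expressions):
>>
>> # Opaque: ships a Python function that Spark cannot inspect or optimize.
>> df.rdd.filter(lambda r: r.name == 'test').collect()
>>
>> # Transparent: an expression the optimizer can analyze.
>> df.filter(df.name == 'test').collect()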
>>
>> I opened SPARK-6536 <https://issues.apache.org/jira/browse/SPARK-6536> to
>> provide a nicer interface for this.
>>
>>
>> On Wed, Mar 25, 2015 at 7:41 AM, Stuart Layton <[email protected]>
>> wrote:
>>
>>> I have a SparkSQL dataframe with a few billion rows that I need to
>>> quickly filter down to a few hundred thousand rows, using an operation
>>> like (syntax may not be correct)
>>>
>>> df = df[ df.filter(lambda x: x.key_col in approved_keys)]
>>>
>>> I was thinking about serializing the data using parquet and saving it to
>>> S3; however, as I want to optimize for filtering speed, I'm not sure this
>>> is the best option.
>>>
>>> --
>>> Stuart Layton
>>>
>>
>>
>
>
> --
> Stuart Layton
>
