My example is a totally reasonable way to do it; it just requires constructing strings.
In many cases you can also do it with column objects:

df[df.name == "test"].collect()
Out[15]: [Row(name=u'test')]

You should check out:
http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Column
and
http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#module-pyspark.sql.functions

On Wed, Mar 25, 2015 at 11:39 AM, Stuart Layton <[email protected]> wrote:

> Thanks for the response. I was using IN as an example of the type of
> operation I need to do. Is there another way to do this that lines up more
> naturally with the way things are supposed to be done in SparkSQL?
>
> On Wed, Mar 25, 2015 at 2:29 PM, Michael Armbrust <[email protected]>
> wrote:
>
>> The only way to do "in" using python currently is to use the string-based
>> filter API (where you pass us an expression as a string, and we parse it
>> using our SQL parser).
>>
>> from pyspark.sql import Row
>> from pyspark.sql.functions import *
>>
>> df = sc.parallelize([Row(name="test")]).toDF()
>> df.filter("name in ('a', 'b')").collect()
>> Out[1]: []
>>
>> df.filter("name in ('test')").collect()
>> Out[2]: [Row(name=u'test')]
>>
>> In general you want to avoid lambda functions whenever you can do the
>> same thing with a dataframe expression. This is because your lambda
>> function is a black box that we cannot optimize (though you should
>> certainly use them for the advanced stuff that expressions can't handle).
>>
>> I opened SPARK-6536 <https://issues.apache.org/jira/browse/SPARK-6536> to
>> provide a nicer interface for this.
>>
>> On Wed, Mar 25, 2015 at 7:41 AM, Stuart Layton <[email protected]>
>> wrote:
>>
>>> I have a SparkSQL dataframe with a few billion rows that I need to
>>> quickly filter down to a few hundred thousand rows, using an operation
>>> like (syntax may not be correct):
>>>
>>> df = df[ df.filter(lambda x: x.key_col in approved_keys)]
>>>
>>> I was thinking about serializing the data using parquet and saving it to
>>> S3, however as I want to optimize for filtering speed I'm not sure this
>>> is the best option.
>>>
>>> --
>>> Stuart Layton
>
> --
> Stuart Layton
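For readers following along: stripped of Spark, the semantics of the "in" filter being discussed is just set membership, and the reason a set (rather than a list) matters with a few hundred thousand approved keys is the O(1) lookup per row. A minimal plain-Python sketch of that idea (the `rows` and `approved_keys` names here are hypothetical stand-ins, not Spark API; in PySpark at the time of this thread you would use the string filter shown above):

```python
# Hypothetical stand-in for the DataFrame's rows.
rows = [
    {"key_col": "a", "value": 1},
    {"key_col": "b", "value": 2},
    {"key_col": "test", "value": 3},
]

# Put the approved keys in a set so each membership test is O(1);
# with hundreds of thousands of keys, a list would make every row's
# check a linear scan.
approved_keys = {"b", "test"}

# The "in" filter: keep only rows whose key is in the approved set.
filtered = [row for row in rows if row["key_col"] in approved_keys]
```

This is exactly what the lambda version `x.key_col in approved_keys` computes per row; Michael's point is that when the same predicate is expressed as a dataframe expression instead of an opaque Python function, Spark's optimizer can see it and plan around it.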
