What are the best options for quickly filtering a DataFrame on a single column?

2015-03-25 Thread Stuart Layton
I have a SparkSQL DataFrame with a few billion rows that I need to quickly filter down to a few hundred thousand rows, using an operation like (syntax may not be correct): df = df[df.filter(lambda x: x.key_col in approved_keys)]. I was thinking about serializing the data using Parquet and
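
(As a rough sketch of the intent above, assuming the key_col and approved_keys names from the question: a Python lambda like this only works on the underlying RDD, not on the DataFrame API, so a runnable version would look roughly like the following.)

approved_keys = {"k1", "k2", "k3"}   # placeholder key set, not from the thread

# DataFrame.filter does not take a Python lambda, but the underlying RDD does.
filtered_rdd = df.rdd.filter(lambda row: row.key_col in approved_keys)

# Rebuild a DataFrame from the filtered RDD, reusing the original schema
# (assumes a SQLContext named sqlContext is already available).
filtered_df = sqlContext.createDataFrame(filtered_rdd, df.schema)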

Re: What are the best options for quickly filtering a DataFrame on a single column?

2015-03-25 Thread Michael Armbrust
The only way to do this in Python currently is to use the string-based filter API (where you pass us an expression as a string, and we parse it using our SQL parser).
from pyspark.sql import Row
from pyspark.sql.functions import *
df = sc.parallelize([Row(name="test")]).toDF()
df.filter("name in
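
(A slightly fuller sketch of the string-based filter described above; the key list here is a placeholder rather than a value from the thread.)

from pyspark.sql import Row

# sc is an existing SparkContext; toDF() requires a SQLContext to have been created.
df = sc.parallelize([Row(name="test"), Row(name="other")]).toDF()

# The predicate is passed as a SQL string and parsed by Spark's SQL parser.
df.filter("name in ('test', 'foo')").collect()
# -> [Row(name=u'test')]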

Re: What are the best options for quickly filtering a DataFrame on a single column?

2015-03-25 Thread Michael Armbrust
My example is a totally reasonable way to do it; it just requires constructing strings. In many cases you can also do it with column objects:
df[df.name == "test"].collect()
Out[15]: [Row(name=u'test')]
You should check out:
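
(A short sketch of the column-object style for a membership filter. Column.isin is assumed here; some older PySpark releases exposed the same operation as inSet instead.)

from pyspark.sql.functions import col

approved_keys = ["test", "foo"]   # placeholder key list

# Equality against a literal, as in the example above:
df[df.name == "test"].collect()

# Membership test built from column objects instead of a SQL string:
df.filter(col("name").isin(approved_keys)).collect()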

Re: What are the best options for quickly filtering a DataFrame on a single column?

2015-03-25 Thread Stuart Layton
Thanks for the response; I was using IN as an example of the type of operation I need to do. Is there another way to do this that lines up more naturally with the way things are supposed to be done in SparkSQL?
On Wed, Mar 25, 2015 at 2:29 PM, Michael Armbrust mich...@databricks.com wrote: The
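
(Not from the thread, but one pattern that fits the DataFrame API naturally when the approved key set is itself large: put the keys into their own DataFrame and filter with a left-semi join instead of an IN list. Names below are placeholders, and a reasonably recent PySpark is assumed.)

# Build a one-column DataFrame of approved keys (assumes sqlContext exists).
keys_df = sqlContext.createDataFrame([(k,) for k in approved_keys], ["key_col"])

# A left-semi join keeps only the rows of df whose key_col appears in keys_df.
filtered = df.join(keys_df, df.key_col == keys_df.key_col, "leftsemi")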