I have a SparkSQL DataFrame with a few billion rows that I need to
quickly filter down to a few hundred thousand rows, using an operation like
(syntax may not be correct)
df = df[ df.filter(lambda x: x.key_col in approved_keys)]
I was thinking about serializing the data using parquet and
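For context, the membership filter described above could be sketched against the DataFrame's underlying RDD, where a Python lambda does work. This is only a sketch of the intent: key_col and approved_keys are the names used in the question, and the keys shown are placeholders.

# Sketch of the intent in the question, using the DataFrame's underlying RDD.
# key_col / approved_keys come from the question; the key values are placeholders.
approved_keys = {"k1", "k2", "k3"}
filtered = df.rdd.filter(lambda row: row.key_col in approved_keys).toDF()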
The only way to do it in Python currently is to use the string-based
filter API (where you pass us an expression as a string, and we parse it
using our SQL parser).
from pyspark.sql import Row
from pyspark.sql.functions import *
df = sc.parallelize([Row(name='test')]).toDF()
# the filter expression is passed as a string and parsed by the SQL parser
df.filter("name in ('test')").collect()
My example is a totally reasonable way to do it; it just requires
constructing strings.
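Applied to the original question, the filter string could be assembled from the Python list of keys along these lines. This is a sketch: key_col and approved_keys are the names from the question, and the keys are assumed to be strings that need quoting.

# Build a "key_col in (...)" expression and hand it to the string-based filter API.
# approved_keys is assumed to be a Python list of string keys.
approved_keys = ['a', 'b', 'c']  # placeholder; a few hundred thousand keys in practice
in_expr = "key_col in ({})".format(", ".join("'{}'".format(k) for k in approved_keys))
filtered = df.filter(in_expr)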
In many cases you can also do it with column objects
df[df.name == 'test'].collect()
Out[15]: [Row(name=u'test')]
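For a membership test like the one in the question, later PySpark releases also expose this on column objects via Column.isin (an assumption relative to the Spark version discussed in this thread); a sketch using the same toy DataFrame:

# Column-object membership filter; Column.isin behaves like SQL IN.
df[df.name.isin("other", "test")].collect()
# keeps only rows whose name is in the given set, i.e. the name='test' row here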
You should check out:
Thanks for the response. I was using IN as an example of the type of
operation I need to do. Is there another way to do this that lines up more
naturally with how things are supposed to be done in SparkSQL?
On Wed, Mar 25, 2015 at 2:29 PM, Michael Armbrust mich...@databricks.com
wrote: