Hi all,

I am trying out Spark for the first time, so I am reaching out with what
is probably a very basic question.

Consider the example below:

>>> l = [("US","City1",125),("US","City2",123),("Europe","CityX",23),("Europe","CityY",17)]
>>> print l
[('US', 'City1', 125), ('US', 'City2', 123), ('Europe', 'CityX', 23),
('Europe', 'CityY', 17)]

>>> sc = SparkContext(appName="N")
>>> sqlsc = SQLContext(sc)
>>> df = sqlsc.createDataFrame(l)
>>> df.printSchema()
 |-- _1: string (nullable = true)
 |-- _2: string (nullable = true)
 |-- _3: long (nullable = true)
>>> df.registerTempTable("t1")
>>> rdf = sqlsc.sql("Select _1,sum(_3) from t1 group by _1")
>>> rdf.show()
|    _1|_c1|
|    US|248|
|Europe| 40|
>>> rdf.printSchema()
 |-- _1: string (nullable = true)
 |-- _c1: long (nullable = true)
>>> rdf.registerTempTable("t2")
>>> sqlsc.sql("Select * from t2 where _c1 > 200").show()
| _1|_c1|
| US|248|

So basically, I am trying to find every country whose total of _3 (which
could be, say, the population subscribed to some service) is above a
threshold. In the example above, an additional DataFrame (rdf) is created
to hold the per-country sums before they are filtered.

Now, how do I eliminate the rdf DataFrame and express the complete query
directly against df (registered as t1)?

I tried, but PySpark throws an error:

>>> sqlsc.sql("Select _1,sum(_3) from t1 group by _1").show()
|    _1|_c1|
|    US|248|
|Europe| 40|
>>> sqlsc.sql("Select _1,sum(_3) from t1 group by _1 where _c1 > 200").show()
Traceback (most recent call last):
line 308, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o28.sql.
: java.lang.RuntimeException: [1.39] failure: ``union'' expected but
`where' found


Is there a way to avoid creating the intermediate DataFrame (rdf) and get
the result directly from df?

I have not thought much about whether this would actually be beneficial; I
am just curious about the question.
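
One guess I have not yet tried in Spark: in standard SQL, a filter on an
aggregate goes in a HAVING clause after GROUP BY, not in a WHERE clause,
which may be why the parser rejects `where' at that position. The sketch
below checks the idea with Python's built-in sqlite3 module instead of
Spark, so it runs without a cluster. The sqlite3 table setup is only for
illustration, and it is an assumption on my part that the same query
string would work when passed to sqlsc.sql().

```python
import sqlite3

# Same sample rows as in the Spark example above.
rows = [("US", "City1", 125), ("US", "City2", 123),
        ("Europe", "CityX", 23), ("Europe", "CityY", 17)]

# In-memory SQLite table standing in for the Spark temp table t1
# (illustration only; this is not Spark).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t1 (_1 TEXT, _2 TEXT, _3 INTEGER)")
conn.executemany("INSERT INTO t1 VALUES (?, ?, ?)", rows)

# HAVING filters the groups after aggregation; WHERE would filter
# individual rows before grouping and must come before GROUP BY.
query = "SELECT _1, SUM(_3) FROM t1 GROUP BY _1 HAVING SUM(_3) > 200"
result = conn.execute(query).fetchall()
print(result)
```

If Spark accepts the same clause, the single call
sqlsc.sql("Select _1,sum(_3) from t1 group by _1 having sum(_3) > 200")
would make rdf and t2 unnecessary.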

