Thanks for reporting: https://issues.apache.org/jira/browse/SPARK-11032

You can probably work around this by aliasing the count and filtering on
that value afterwards.
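
For example, something like this (an untested sketch; the min_age and cnt
aliases are just illustrative):

df = sqlContext.read.json("/user/thj1pal/people.json")
sqlContext.registerDataFrameAsTable(df, "PEOPLE")

# Alias the count as an ordinary output column of the aggregate...
agg = sqlContext.sql(
    "SELECT MIN(t0.age) AS min_age, COUNT(1) AS cnt "
    "FROM (SELECT * FROM PEOPLE WHERE age > 0) t0")

# ...then filter on that column instead of using HAVING COUNT(1).
agg.where(agg.cnt > 0).select("min_age").show()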

On Thu, Oct 8, 2015 at 8:47 PM, Jeff Thompson <
jeffreykeatingthomp...@gmail.com> wrote:

> After upgrading from 1.4.1 to 1.5.1 I found that some of my Spark SQL
> queries no longer worked.  The failure seems to be related to using
> count(1) or count(*) in a nested query.  I can reproduce the issue in a
> pyspark shell with the sample code below.  The ‘people’ table is from
> spark-1.5.1-bin-hadoop2.4/examples/src/main/resources/people.json.
>
> Environment details: Hadoop 2.5.0-cdh5.3.0, YARN
>
> *Test code:*
>
> from pyspark.sql import SQLContext
> print(sc.version)
> sqlContext = SQLContext(sc)
>
> df = sqlContext.read.json("/user/thj1pal/people.json")
> df.show()
>
> sqlContext.registerDataFrameAsTable(df,"PEOPLE")
>
> result = sqlContext.sql("SELECT MIN(t0.age) FROM (SELECT * FROM PEOPLE WHERE age > 0) t0 HAVING(COUNT(1) > 0)")
> result.show()
>
> *Spark 1.4.1 output*
>
> 1.4.1
> +----+-------+
> | age|   name|
> +----+-------+
> |null|Michael|
> |  30|   Andy|
> |  19| Justin|
> +----+-------+
>
> +--+
> |c0|
> +--+
> |19|
> +--+
>
>
> *Spark 1.5.1 output*
>
> 1.5.1
> +----+-------+
> | age|   name|
> +----+-------+
> |null|Michael|
> |  30|   Andy|
> |  19| Justin|
> +----+-------+
>
> ---------------------------------------------------------------------------
> Py4JJavaError                             Traceback (most recent call last)
> <ipython-input-1-342b585498f7> in <module>()
>       9
>      10 result = sqlContext.sql("SELECT MIN(t0.age) FROM     (SELECT *
> FROM PEOPLE WHERE age > 0) t0     HAVING(COUNT(1) > 0)")
> ---> 11 result.show()
>
> /home/thj1pal/spark-1.5.1-bin-hadoop2.4/python/pyspark/sql/dataframe.pyc
> in show(self, n, truncate)
>     254         +---+-----+
>     255         """
> --> 256         print(self._jdf.showString(n, truncate))
>     257
>     258     def __repr__(self):
>
> /home/thj1pal/spark-1.5.1-bin-hadoop2.4/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py
> in __call__(self, *args)
>     536         answer = self.gateway_client.send_command(command)
>     537         return_value = get_return_value(answer,
> self.gateway_client,
> --> 538                 self.target_id, self.name)
>     539
>     540         for temp_arg in temp_args:
>
> /home/thj1pal/spark-1.5.1-bin-hadoop2.4/python/pyspark/sql/utils.pyc in
> deco(*a, **kw)
>      34     def deco(*a, **kw):
>      35         try:
> ---> 36             return f(*a, **kw)
>      37         except py4j.protocol.Py4JJavaError as e:
>      38             s = e.java_exception.toString()
>
> /home/thj1pal/spark-1.5.1-bin-hadoop2.4/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py
> in get_return_value(answer, gateway_client, target_id, name)
>     298                 raise Py4JJavaError(
>     299                     'An error occurred while calling {0}{1}{2}.\n'.
> --> 300                     format(target_id, '.', name), value)
>     301             else:
>     302                 raise Py4JError(
>
> Py4JJavaError: An error occurred while calling o33.showString.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task
> 0 in stage 4.0 failed 4 times, most recent failure: Lost task 0.3 in stage
> 4.0 (TID 9, pal-bd-n06-ib): java.lang.UnsupportedOperationException: Cannot
> evaluate expression: count(1)
>         at
> org.apache.spark.sql.catalyst.expressions.Unevaluable$class.eval(Expression.scala:188)
>         at
> org.apache.spark.sql.catalyst.expressions.Count.eval(aggregates.scala:156)
>         at
> org.apache.spark.sql.catalyst.expressions.BinaryExpression.eval(Expression.scala:327)
> ….
>
