Additionally, have you tried enclosing `count` in backticks?

On Wed, Jul 22, 2015 at 4:25 PM, Michael Armbrust <mich...@databricks.com> wrote:
> I believe this will be fixed in Spark 1.5
>
> https://github.com/apache/spark/pull/7237
>
> On Wed, Jul 22, 2015 at 3:04 PM, Young, Matthew T <matthew.t.yo...@intel.com> wrote:
>
>> I'm trying to do some simple counting and aggregation in an IPython
>> notebook with Spark 1.4.0, and I have encountered behavior that looks
>> like a bug.
>>
>> When I try to filter rows out of a DataFrame with a column named count,
>> I get a large error message. I would just avoid naming things count,
>> except that count is the default column name created by the count()
>> operation on pyspark.sql.GroupedData.
>>
>> The small example program below demonstrates the issue:
>>
>> from pyspark.sql import SQLContext
>> sqlContext = SQLContext(sc)
>> dataFrame = sc.parallelize([("foo",), ("foo",), ("bar",)]).toDF(["title"])
>> counts = dataFrame.groupBy('title').count()
>> counts.filter("title = 'foo'").show()  # Works
>> counts.filter("count > 1").show()      # Errors out
>>
>> I can even reproduce the issue in a PySpark shell session by entering
>> these commands.
>>
>> I suspect that the error has something to do with Spark wanting to call
>> the count() function in place of looking at the count column.
>>
>> The error message is as follows:
>>
>> Py4JJavaError                             Traceback (most recent call last)
>> <ipython-input-29-62a1b7c71f21> in <module>()
>> ----> 1 counts.filter("count > 1").show()  # Errors Out
>>
>> C:\Users\User\Downloads\spark-1.4.0-bin-hadoop2.6\python\pyspark\sql\dataframe.pyc in filter(self, condition)
>>     774         """
>>     775         if isinstance(condition, basestring):
>> --> 776             jdf = self._jdf.filter(condition)
>>     777         elif isinstance(condition, Column):
>>     778             jdf = self._jdf.filter(condition._jc)
>>
>> C:\Python27\lib\site-packages\py4j\java_gateway.pyc in __call__(self, *args)
>>     536         answer = self.gateway_client.send_command(command)
>>     537         return_value = get_return_value(answer, self.gateway_client,
>> --> 538                                         self.target_id, self.name)
>>     539
>>     540         for temp_arg in temp_args:
>>
>> C:\Python27\lib\site-packages\py4j\protocol.pyc in get_return_value(answer, gateway_client, target_id, name)
>>     298                 raise Py4JJavaError(
>>     299                     'An error occurred while calling {0}{1}{2}.\n'.
>> --> 300                     format(target_id, '.', name), value)
>>     301             else:
>>     302                 raise Py4JError(
>>
>> Py4JJavaError: An error occurred while calling o229.filter.
>> : java.lang.RuntimeException: [1.7] failure: ``('' expected but `>' found
>>
>> count > 1
>>       ^
>>         at scala.sys.package$.error(package.scala:27)
>>         at org.apache.spark.sql.catalyst.SqlParser.parseExpression(SqlParser.scala:45)
>>         at org.apache.spark.sql.DataFrame.filter(DataFrame.scala:652)
>>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>         at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
>>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
>>         at java.lang.reflect.Method.invoke(Unknown Source)
>>         at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
>>         at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
>>         at py4j.Gateway.invoke(Gateway.java:259)
>>         at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
>>         at py4j.commands.CallCommand.execute(CallCommand.java:79)
>>         at py4j.GatewayConnection.run(GatewayConnection.java:207)
>>         at java.lang.Thread.run(Unknown Source)
>>
>> Is there a recommended workaround for the inability to filter on a column
>> named count? Do I have to make a new DataFrame and rename the column just
>> to work around this bug? What's the best way to do that?
>>
>> Thanks,
>>
>> -- Matthew Young
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org