Additionally, have you tried enclosing `count` in backticks?

On Wed, Jul 22, 2015 at 4:25 PM, Michael Armbrust <mich...@databricks.com> wrote:
> I believe this will be fixed in Spark 1.5
>
> https://github.com/apache/spark/pull/7237
>
> On Wed, Jul 22, 2015 at 3:04 PM, Young, Matthew T <matthew.t.yo...@intel.com> wrote:
>
>> I'm trying to do some simple counting and aggregation in an IPython
>> notebook with Spark 1.4.0, and I have encountered behavior that looks
>> like a bug.
>>
>> When I try to filter rows out of a DataFrame with a column named count,
>> I get a large error message. I would just avoid naming things count,
>> except that count is the default column name created by the count()
>> operation on pyspark.sql.GroupedData.
>>
>> The small example program below demonstrates the issue:
>>
>> from pyspark.sql import SQLContext
>> sqlContext = SQLContext(sc)
>> dataFrame = sc.parallelize([("foo",), ("foo",), ("bar",)]).toDF(["title"])
>> counts = dataFrame.groupBy('title').count()
>> counts.filter("title = 'foo'").show()  # Works
>> counts.filter("count > 1").show()      # Errors out
>>
>> I can even reproduce the issue in a PySpark shell session by entering
>> these commands.
>>
>> I suspect that the error has something to do with Spark wanting to call
>> the count() function in place of looking at the count column.
>>
>> The error message is as follows:
>>
>> Py4JJavaError                             Traceback (most recent call last)
>> <ipython-input-29-62a1b7c71f21> in <module>()
>> ----> 1 counts.filter("count > 1").show()  # Errors Out
>>
>> C:\Users\User\Downloads\spark-1.4.0-bin-hadoop2.6\python\pyspark\sql\dataframe.pyc in filter(self, condition)
>>     774         """
>>     775         if isinstance(condition, basestring):
>> --> 776             jdf = self._jdf.filter(condition)
>>     777         elif isinstance(condition, Column):
>>     778             jdf = self._jdf.filter(condition._jc)
>>
>> C:\Python27\lib\site-packages\py4j\java_gateway.pyc in __call__(self, *args)
>>     536         answer = self.gateway_client.send_command(command)
>>     537         return_value = get_return_value(answer, self.gateway_client,
>> --> 538                                         self.target_id, self.name)
>>     539
>>     540         for temp_arg in temp_args:
>>
>> C:\Python27\lib\site-packages\py4j\protocol.pyc in get_return_value(answer, gateway_client, target_id, name)
>>     298                 raise Py4JJavaError(
>>     299                     'An error occurred while calling {0}{1}{2}.\n'.
>> --> 300                     format(target_id, '.', name), value)
>>     301             else:
>>     302                 raise Py4JError(
>>
>> Py4JJavaError: An error occurred while calling o229.filter.
>> : java.lang.RuntimeException: [1.7] failure: ``('' expected but `>' found
>>
>> count > 1
>>       ^
>>         at scala.sys.package$.error(package.scala:27)
>>         at org.apache.spark.sql.catalyst.SqlParser.parseExpression(SqlParser.scala:45)
>>         at org.apache.spark.sql.DataFrame.filter(DataFrame.scala:652)
>>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>         at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
>>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
>>         at java.lang.reflect.Method.invoke(Unknown Source)
>>         at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
>>         at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
>>         at py4j.Gateway.invoke(Gateway.java:259)
>>         at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
>>         at py4j.commands.CallCommand.execute(CallCommand.java:79)
>>         at py4j.GatewayConnection.run(GatewayConnection.java:207)
>>         at java.lang.Thread.run(Unknown Source)
>>
>> Is there a recommended workaround for the inability to filter on a column
>> named count? Do I have to make a new DataFrame and rename the column just
>> to work around this bug? What's the best way to do that?
>>
>> Thanks,
>>
>> -- Matthew Young
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org