[ https://issues.apache.org/jira/browse/SPARK-23950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16433296#comment-16433296 ]
Hyukjin Kwon commented on SPARK-23950:
--------------------------------------

Seems fixed in the current master. Let me leave this resolved, but it would be great if we can find which change fixes it and backport it if applicable.

> Coalescing an empty dataframe to 1 partition
> --------------------------------------------
>
>                 Key: SPARK-23950
>                 URL: https://issues.apache.org/jira/browse/SPARK-23950
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.2.1
>         Environment: Operating System: Windows 7
> Tested in Jupyter notebooks using Python 2.7.14 and Python 3.6.3.
> Hardware specs not relevant to the issue.
>            Reporter: João Neves
>            Priority: Major
>
> Coalescing an empty dataframe to 1 partition returns an error.
> The funny thing is that coalescing an empty dataframe to 2 or more partitions seems to work.
> The test case is the following:
> {code}
> from pyspark.sql.types import StructType
> df = spark.createDataFrame(spark.sparkContext.emptyRDD(), StructType([]))
> print(df.coalesce(2).count())
> print(df.coalesce(3).count())
> print(df.coalesce(4).count())
> df.coalesce(1).count()
> {code}
> Output:
> {code:java}
> 0
> 0
> 0
> ---------------------------------------------------------------------------
> Py4JJavaError                             Traceback (most recent call last)
> <ipython-input-5-c067400f2ef0> in <module>()
>       7 print(df.coalesce(4).count())
>       8
> ----> 9 print(df.coalesce(1).count())
>
> C:\spark-2.2.1-bin-hadoop2.7\python\pyspark\sql\dataframe.py in count(self)
>     425         2
>     426         """
> --> 427         return int(self._jdf.count())
>     428
>     429     @ignore_unicode_prefix
>
> c:\python36\lib\site-packages\py4j\java_gateway.py in __call__(self, *args)
>    1131         answer = self.gateway_client.send_command(command)
>    1132         return_value = get_return_value(
> -> 1133             answer, self.gateway_client, self.target_id, self.name)
>    1134
>    1135         for temp_arg in temp_args:
>
> C:\spark-2.2.1-bin-hadoop2.7\python\pyspark\sql\utils.py in deco(*a, **kw)
>      61     def deco(*a, **kw):
>      62         try:
> ---> 63             return f(*a, **kw)
>      64         except py4j.protocol.Py4JJavaError as e:
>      65             s = e.java_exception.toString()
>
> c:\python36\lib\site-packages\py4j\protocol.py in get_return_value(answer, gateway_client, target_id, name)
>     317                 raise Py4JJavaError(
>     318                     "An error occurred while calling {0}{1}{2}.\n".
> --> 319                     format(target_id, ".", name), value)
>     320             else:
>     321                 raise Py4JError(
>
> Py4JJavaError: An error occurred while calling o176.count.
> : java.util.NoSuchElementException: next on empty iterator
> 	at scala.collection.Iterator$$anon$2.next(Iterator.scala:39)
> 	at scala.collection.Iterator$$anon$2.next(Iterator.scala:37)
> 	at scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike.scala:63)
> 	at scala.collection.IterableLike$class.head(IterableLike.scala:107)
> 	at scala.collection.mutable.ArrayOps$ofRef.scala$collection$IndexedSeqOptimized$$super$head(ArrayOps.scala:186)
> 	at scala.collection.IndexedSeqOptimized$class.head(IndexedSeqOptimized.scala:126)
> 	at scala.collection.mutable.ArrayOps$ofRef.head(ArrayOps.scala:186)
> 	at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2435)
> 	at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2434)
> 	at org.apache.spark.sql.Dataset$$anonfun$55.apply(Dataset.scala:2842)
> 	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
> 	at org.apache.spark.sql.Dataset.withAction(Dataset.scala:2841)
> 	at org.apache.spark.sql.Dataset.count(Dataset.scala:2434)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
> 	at java.lang.reflect.Method.invoke(Unknown Source)
> 	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
> 	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
> 	at py4j.Gateway.invoke(Gateway.java:280)
> 	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
> 	at py4j.commands.CallCommand.execute(CallCommand.java:79)
> 	at py4j.GatewayConnection.run(GatewayConnection.java:214)
> 	at java.lang.Thread.run(Unknown Source)
> {code}
>
> Shouldn't this be consistent?
> Thank you very much.
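
A side note for anyone stuck on 2.2.x while the fix only exists in master: below is a minimal PySpark sketch of ways to count this DataFrame, or check that it is empty, without going through coalesce(1).count(). These are workaround ideas, not the change that fixed it; the report only shows the failure for coalesce(1), so the plain count(), rdd.isEmpty(), and take(1) calls are assumed to be unaffected.

{code}
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType

spark = SparkSession.builder.getOrCreate()

# Same empty, zero-partition DataFrame as in the report.
df = spark.createDataFrame(spark.sparkContext.emptyRDD(), StructType([]))

# Count without coalescing first (assumed unaffected; only coalesce(1).count()
# is reported to fail).
print(df.count())            # expected: 0

# Emptiness checks that avoid counting the coalesced plan entirely.
print(df.rdd.isEmpty())      # expected: True
print(len(df.take(1)) == 0)  # expected: True
{code}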