[jira] [Assigned] (SPARK-16409) regexp_extract with optional groups causes NPE
     [ https://issues.apache.org/jira/browse/SPARK-16409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen reassigned SPARK-16409:
---------------------------------

    Assignee: Sean Owen

> regexp_extract with optional groups causes NPE
> ----------------------------------------------
>
>                 Key: SPARK-16409
>                 URL: https://issues.apache.org/jira/browse/SPARK-16409
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.0.0
>            Reporter: Max Moroz
>            Assignee: Sean Owen
>             Fix For: 1.6.3, 2.0.1, 2.1.0
>
>
> df = sqlContext.createDataFrame([['c']], ['s'])
> df.select(F.regexp_extract('s', r'(a+)(b)?(c)', 2)).collect()
> causes NPE. Worse, in a large program it doesn't cause NPE instantly; it
> actually works fine, until some unpredictable (and inconsistent) moment in
> the future when (presumably) the invalid memory access occurs, and then it
> fails. For this reason, it took several hours to debug this.
> Suggestion: either fill the group with null; or raise exception immediately
> after examining the argument with a message that optional groups are not
> allowed.
> Traceback:
> ---------------------------------------------------------------------------
> Py4JJavaError                             Traceback (most recent call last)
>  in ()
> ----> 1 df.select(F.regexp_extract('s', r'(a+)(b)?(c)', 2)).collect()
>
> C:\Users\me\Downloads\spark-2.0.0-preview-bin-hadoop2.7\python\pyspark\sql\dataframe.py in collect(self)
>     294         """
>     295         with SCCallSiteSync(self._sc) as css:
> --> 296             port = self._jdf.collectToPython()
>     297         return list(_load_from_socket(port, BatchedSerializer(PickleSerializer(
>     298
>
> C:\Users\me\Downloads\spark-2.0.0-preview-bin-hadoop2.7\python\lib\py4j-0.10.1-src.zip\py4j\java_gateway.py in __call__(self, *args)
>     931         answer = self.gateway_client.send_command(command)
>     932         return_value = get_return_value(
> --> 933             answer, self.gateway_client, self.target_id, self.name)
>     934
>     935         for temp_arg in temp_args:
>
> C:\Users\me\Downloads\spark-2.0.0-preview-bin-hadoop2.7\python\pyspark\sql\utils.py in deco(*a, **kw)
>      55     def deco(*a, **kw):
>      56         try:
> ---> 57             return f(*a, **kw)
>      58         except py4j.protocol.Py4JJavaError as e:
>      59             s = e.java_exception.toString()
>
> C:\Users\me\Downloads\spark-2.0.0-preview-bin-hadoop2.7\python\lib\py4j-0.10.1-src.zip\py4j\protocol.py in get_return_value(answer, gateway_client, target_id, name)
>     310                 raise Py4JJavaError(
>     311                     "An error occurred while calling {0}{1}{2}.\n".
> --> 312                     format(target_id, ".", name), value)
>     313             else:
>     314                 raise Py4JError(
>
> Py4JJavaError: An error occurred while calling o51.collectToPython.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.NullPointerException
>       at org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter.write(UnsafeRowWriter.java:210)
>       at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
>       at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>       at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$7$$anon$1.hasNext(WholeStageCodegenExec.scala:357)
>       at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>       at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.hasNext(SerDeUtil.scala:117)
>       at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>       at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:112)
>       at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
>       at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
>       at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
>       at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
>       at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.to(SerDeUtil.scala:112)
>       at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
>       at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.toBuffer(SerDeUtil.scala:112)
>       at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
>       at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.toArray(SerDeUtil.scala:112)
>       at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:883)
>       at
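
The reporter's first suggestion (fill an unmatched optional group with a null or empty value) can be illustrated outside Spark. The following is only a minimal sketch that uses Python's `re` module as a stand-in for the JVM regex engine; it is not Spark's implementation. `extract_group` is a hypothetical helper, and the input `'aac'` is an assumption chosen so that the whole pattern matches while the optional group 2 does not participate. Returning an empty string instead of `None` is a design choice of this sketch, not a statement about what Spark does.

```python
import re

# Same pattern as in the report: group 2 is optional.
PATTERN = re.compile(r'(a+)(b)?(c)')


def extract_group(text, pattern, idx):
    """Hypothetical null-safe extraction helper (not a Spark API).

    Returns group `idx` of the first match, or '' when the pattern does
    not match at all or when the group did not participate in the match.
    """
    match = pattern.search(text)
    if match is None:
        return ''
    value = match.group(idx)
    # An optional group that did not participate yields None, which is
    # the value an unguarded caller would trip over.
    return '' if value is None else value


# The raw regex API reports the unmatched optional group as None.
raw_group = PATTERN.search('aac').group(2)
print(repr(raw_group))

# The guarded helper maps that case to an empty string.
print(repr(extract_group('aac', PATTERN, 2)))
print(repr(extract_group('aabc', PATTERN, 2)))
```

The guard lives in one place, so callers never see a missing group as a null value that only fails later, when the result is written out.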
[jira] [Assigned] (SPARK-16409) regexp_extract with optional groups causes NPE
     [ https://issues.apache.org/jira/browse/SPARK-16409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-16409:
------------------------------------

    Assignee: Apache Spark

> regexp_extract with optional groups causes NPE
> ----------------------------------------------
>
>                 Key: SPARK-16409
>                 URL: https://issues.apache.org/jira/browse/SPARK-16409
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.0.0
>            Reporter: Max Moroz
>            Assignee: Apache Spark
>
> df = sqlContext.createDataFrame([['c']], ['s'])
> df.select(F.regexp_extract('s', r'(a+)(b)?(c)', 2)).collect()
> causes NPE. Worse, in a large program it doesn't cause NPE instantly; it
> actually works fine, until some unpredictable (and inconsistent) moment in
> the future when (presumably) the invalid memory access occurs, and then it
> fails. For this reason, it took several hours to debug this.
> Suggestion: either fill the group with null; or raise exception immediately
> after examining the argument with a message that optional groups are not
> allowed.
[jira] [Assigned] (SPARK-16409) regexp_extract with optional groups causes NPE
     [ https://issues.apache.org/jira/browse/SPARK-16409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-16409:
------------------------------------

    Assignee:     (was: Apache Spark)

> regexp_extract with optional groups causes NPE
> ----------------------------------------------
>
>                 Key: SPARK-16409
>                 URL: https://issues.apache.org/jira/browse/SPARK-16409
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.0.0
>            Reporter: Max Moroz
>
> df = sqlContext.createDataFrame([['c']], ['s'])
> df.select(F.regexp_extract('s', r'(a+)(b)?(c)', 2)).collect()
> causes NPE. Worse, in a large program it doesn't cause NPE instantly; it
> actually works fine, until some unpredictable (and inconsistent) moment in
> the future when (presumably) the invalid memory access occurs, and then it
> fails. For this reason, it took several hours to debug this.
> Suggestion: either fill the group with null; or raise exception immediately
> after examining the argument with a message that optional groups are not
> allowed.