[ https://issues.apache.org/jira/browse/SPARK-24098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16453716#comment-16453716 ]
Apache Spark commented on SPARK-24098:
--------------------------------------

User 'liutang123' has created a pull request for this issue:
https://github.com/apache/spark/pull/21164

> ScriptTransformationExec should wait process exiting before output iterator finish
> ----------------------------------------------------------------------------------
>
>                 Key: SPARK-24098
>                 URL: https://issues.apache.org/jira/browse/SPARK-24098
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.2.1, 2.3.0
>        Environment: Spark Version: 2.2.1
>                     Hadoop Version: 2.7.1
>           Reporter: Lijia Liu
>           Priority: Critical
>
> In our Spark cluster, some users found that Spark may lose data when they use
> transform in SQL.
> We checked the output files and discovered that some of them were empty.
> We then checked the executor's log and found exceptions like the following:
>
> {code:java}
> 18/04/19 03:33:03 ERROR SparkUncaughtExceptionHandler: [Container in shutdown] Uncaught exception in thread Thread[Thread-ScriptTransformation-Feed,5,main]
> java.io.IOException: Broken pipe
> 	at java.io.FileOutputStream.writeBytes(Native Method)
> 	at java.io.FileOutputStream.write(FileOutputStream.java:326)
> 	at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
> 	at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
> 	at java.io.DataOutputStream.write(DataOutputStream.java:107)
> 	at org.apache.hadoop.hive.ql.exec.TextRecordWriter.write(TextRecordWriter.java:53)
> 	at org.apache.spark.sql.hive.execution.ScriptTransformationWriterThread$$anonfun$run$1$$anonfun$apply$mcV$sp$2.apply(ScriptTransformationExec.scala:341)
> 	at org.apache.spark.sql.hive.execution.ScriptTransformationWriterThread$$anonfun$run$1$$anonfun$apply$mcV$sp$2.apply(ScriptTransformationExec.scala:317)
> 	at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> 	at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> 	at org.apache.spark.sql.hive.execution.ScriptTransformationWriterThread$$anonfun$run$1.apply$mcV$sp(ScriptTransformationExec.scala:317)
> 	at org.apache.spark.sql.hive.execution.ScriptTransformationWriterThread$$anonfun$run$1.apply(ScriptTransformationExec.scala:306)
> 	at org.apache.spark.sql.hive.execution.ScriptTransformationWriterThread$$anonfun$run$1.apply(ScriptTransformationExec.scala:306)
> 	at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1952)
> 	at org.apache.spark.sql.hive.execution.ScriptTransformationWriterThread.run(ScriptTransformationExec.scala:306)
> 18/04/19 03:33:03 INFO FileOutputCommitter: Saved output of task 'attempt_20180419033229_0049_m_000001_0' to viewfs://hadoop-meituan/ghnn07/warehouse/mart_waimai_crm.db/.hive-staging_hive_2018-04-19_03-18-12_389_7869324671857417311-1/-ext-10000/_temporary/0/task_20180419033229_0049_m_000001
> 18/04/19 03:33:03 INFO SparkHadoopMapRedUtil: attempt_20180419033229_0049_m_000001_0: Committed
> 18/04/19 03:33:03 INFO Executor: Finished task 1.0 in stage 49.0 (TID 9843). 17342 bytes result sent to driver
> {code}
>
> The ScriptTransformation-Feed thread failed, but the task finished successfully.
> Finally, we analysed the class ScriptTransformationExec and found the following:
> when the feed thread has not yet set its _exception_ variable and the process
> has not completely exited, the output iterator's hasNext function returns false.
>
> Bug reproduction:
> 1. Add Thread._sleep_(1000 * 600) before the assignment to _exception_.
> 2. Create a Python script which throws an exception, like the following:
> test.py
>
> {code:python}
> import sys
> for line in sys.stdin:
>     raise Exception('error')
>     print line
> {code}
> 3. Use the script created in step 2 in transform:
> {code:sql}
> spark-sql> add files /path_to/test.py;
> spark-sql> select transform(id) using 'python test.py' as id from city;
> {code}
> The result is that Spark finishes successfully, even though the script failed.
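The race reported above can be illustrated outside Spark with a small Python sketch (an analogy, not Spark's actual code): the reader drains the child's stdout, and once the child dies that read simply hits EOF, which by itself looks like a clean finish. Only waiting for the process and checking its exit status, which is what this issue proposes for ScriptTransformationExec, surfaces the failure. The `run_transform` helper and the inline child command are hypothetical names invented for this sketch.

```python
import subprocess
import sys

# Hypothetical stand-in for the transform script: it consumes one input
# line and then dies, much like test.py above.
CHILD = [sys.executable, "-c",
         "import sys; sys.stdin.readline(); sys.exit(1)"]

def run_transform(rows, check_exit):
    """Feed rows to the child process and collect its output lines."""
    proc = subprocess.Popen(CHILD, stdin=subprocess.PIPE,
                            stdout=subprocess.PIPE)
    try:
        for row in rows:
            proc.stdin.write(row)
            proc.stdin.flush()
    except BrokenPipeError:
        pass  # the feed side fails, like Thread-ScriptTransformation-Feed
    finally:
        try:
            proc.stdin.close()
        except BrokenPipeError:
            pass
    # Draining stdout just hits EOF once the child is gone, so iteration
    # ends here and, by itself, looks exactly like a successful finish.
    out = list(proc.stdout)
    proc.stdout.close()
    proc.wait()
    # The step SPARK-24098 argues for: wait for the process and check how
    # it exited before declaring the output complete.
    if check_exit and proc.returncode != 0:
        raise RuntimeError("transform script failed (%d)" % proc.returncode)
    return out

rows = [b"1\n"] * 100
# Without the exit check the failure is swallowed: empty output, no error.
print(run_transform(rows, check_exit=False))   # []
# With the exit check the failure surfaces instead of silently losing data.
try:
    run_transform(rows, check_exit=True)
except RuntimeError as e:
    print(e)
```

The first call mirrors the bug: the feed side sees a broken pipe, yet the consumer reports an empty, "successful" result. The second call shows why checking process exit before the output iterator finishes turns silent data loss into a visible task failure.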
--
This message was sent by Atlassian JIRA (v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org