[ https://issues.apache.org/jira/browse/HADOOP-18546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17643760#comment-17643760 ]
ASF GitHub Bot commented on HADOOP-18546:
-----------------------------------------

pranavsaxena-microsoft commented on PR #5176:
URL: https://github.com/apache/hadoop/pull/5176#issuecomment-1339051008

> sorry, should have been clearer: a local spark build and spark-shell process is ideal for replication and validation - as all splits are processed in different worker threads in that process, it recreates the exact failure mode.
>
> script you can take and tune for your system; uses the mkcsv command in cloudstore JAR.
>
> I am going to add this as a scalatest suite in the same module:
> https://github.com/hortonworks-spark/cloud-integration/blob/master/spark-cloud-integration/src/scripts/validating-csv-record-io.sc

Thanks for the script. I applied the following changes to it:
https://github.com/pranavsaxena-microsoft/cloud-integration/commit/1d779f22150be3102635819e4525967573602dd9

With trunk's jar, the run failed with this exception:

```
22/12/05 23:51:27 ERROR Executor: Exception in task 4.0 in stage 1.0 (TID 5)
java.lang.NullPointerException: Null value appeared in non-nullable field:
- field (class: "scala.Long", name: "rowId")
- root class: "$line85.$read.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.CsvRecord"
If the schema is inferred from a Scala tuple/case class, or a Java bean, please try to use scala.Option[_] or other nullable types (e.g. java.lang.Integer instead of int/scala.Int).
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificSafeProjection.apply_0_0$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificSafeProjection.apply(Unknown Source)
    at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
    at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
    at scala.collection.Iterator.foreach(Iterator.scala:943)
    at scala.collection.Iterator.foreach$(Iterator.scala:943)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
    at org.apache.spark.rdd.RDD.$anonfun$foreach$2(RDD.scala:1001)
    at org.apache.spark.rdd.RDD.$anonfun$foreach$2$adapted(RDD.scala:1001)
    at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2302)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92)
    at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
    at org.apache.spark.scheduler.Task.run(Task.scala:139)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1502)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)
```
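For reference, a minimal spark-shell sketch of this failure mode (the `CsvRecord` fields, schema, and path below are assumptions for illustration, not the contents of the actual validating-csv-record-io.sc script): decoding CSV splits through a case class with a non-nullable `Long` field makes any corrupted read surface as exactly the NullPointerException above.

```
// Sketch only: run in spark-shell, where `spark` and its implicits are predefined.
// Field names mirror the stack trace; the abfs path is a placeholder.
case class CsvRecord(start: String, rowId: Long)

val rowsDS = spark.read
  .option("header", "true")
  .schema("start STRING, rowId BIGINT")
  .csv("abfs://container@account.dfs.core.windows.net/test.csv") // placeholder path
  .as[CsvRecord]

// foreach decodes every split on separate executor worker threads, which is
// what exercises the multi-threaded ABFS prefetch path; a corrupted split
// shows up as a null in the non-nullable rowId field and fails the task.
rowsDS.foreach(r => assert(r.rowId >= 0))
```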
With the jar built from the PR's code, the same run completed successfully:

```
minimums=((action_http_get_request.min=-1) (action_http_get_request.failures.min=-1));
maximums=((action_http_get_request.max=-1) (action_http_get_request.failures.max=-1));
means=((action_http_get_request.failures.mean=(samples=0, sum=0, mean=0.0000)) (action_http_get_request.mean=(samples=0, sum=0, mean=0.0000)));
}}
22/12/06 01:04:22 INFO TaskSetManager: Finished task 8.0 in stage 1.0 (TID 9) in 14727 ms on snvijaya-Virtual-Machine.mshome.net (executor driver) (9/9)
22/12/06 01:04:22 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
22/12/06 01:04:22 INFO DAGScheduler: ResultStage 1 (foreach at /home/snvijaya/Desktop/cloud-integration/spark-cloud-integration/src/scripts/validating-csv-record-io.sc:46) finished in 115.333 s
22/12/06 01:04:22 INFO DAGScheduler: Job 1 is finished. Cancelling potential speculative or zombie tasks for this job
22/12/06 01:04:22 INFO TaskSchedulerImpl: Killing all running tasks in stage 1: Stage finished
22/12/06 01:04:22 INFO DAGScheduler: Job 1 finished: foreach at /home/snvijaya/Desktop/cloud-integration/spark-cloud-integration/src/scripts/validating-csv-record-io.sc:46, took 115.337621 s
res35: String = validation completed [start: string, rowId: bigint ... 6 more fields]
```

Commands executed:

```
:load /home/snvijaya/Desktop/cloud-integration/spark-cloud-integration/src/scripts/validating-csv-record-io.sc
validateDS(rowsDS)
```

> disable purging list of in progress reads in abfs stream closed
> ---------------------------------------------------------------
>
>                 Key: HADOOP-18546
>                 URL: https://issues.apache.org/jira/browse/HADOOP-18546
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/azure
>    Affects Versions: 3.3.4
>            Reporter: Steve Loughran
>            Assignee: Steve Loughran
>            Priority: Major
>              Labels: pull-request-available
>
> Turn off the pruning of in-progress reads in ReadBufferManager::purgeBuffersForStream.
> This will ensure active prefetches for a closed stream complete. They will then move to the completed list and hang around until evicted by timeout, but at least prefetching will be safe.
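To make the described change concrete, here is a minimal, self-contained Scala sketch of the purge behaviour (the real code is the Java class ReadBufferManager in hadoop-azure; the types, list names, and structure here are assumptions for illustration): completed buffers for the closed stream are discarded immediately, while in-progress reads are deliberately left to run to completion.

```
// Sketch of the purge idea described in the issue; not the actual
// Hadoop implementation. ReadBuffer and the two lists are assumed names.
import scala.collection.mutable

final case class ReadBuffer(stream: AnyRef, offset: Long)

object ReadBufferManagerSketch {
  private val completedList  = mutable.ListBuffer.empty[ReadBuffer]
  private val inProgressList = mutable.ListBuffer.empty[ReadBuffer]

  /** Purge buffers belonging to a closed stream.
    *
    * Only completed buffers are dropped. In-progress reads are intentionally
    * NOT purged: letting an active prefetch finish means no worker thread is
    * left writing into a buffer that has been recycled for another stream.
    * Once finished, the buffer joins the completed list and is evicted later
    * by the normal timeout path.
    */
  def purgeBuffersForStream(closedStream: AnyRef): Unit = synchronized {
    completedList.filterInPlace(_.stream ne closedStream)
    // inProgressList is deliberately left untouched -- this is the change.
  }
}
```

The trade-off, as the issue notes, is that such buffers linger in the completed list until timeout eviction rather than being freed when the stream closes.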