pranavsaxena-microsoft commented on PR #5176: URL: https://github.com/apache/hadoop/pull/5176#issuecomment-1339051008
> sorry, should have been clearer: a local spark build and spark-shell process is ideal for replication and validation -as all splits are processed in different worker threads in that process, it recreates the exact failure mode.
>
> script you can take and tune for your system; uses the mkcsv command in cloudstore JAR.
>
> I am going to add this as a scalatest suite in the same module https://github.com/hortonworks-spark/cloud-integration/blob/master/spark-cloud-integration/src/scripts/validating-csv-record-io.sc

Thanks for the script. I applied the following changes to it: https://github.com/pranavsaxena-microsoft/cloud-integration/commit/1d779f22150be3102635819e4525967573602dd9.

With trunk's jar, the run failed with this exception:

```
22/12/05 23:51:27 ERROR Executor: Exception in task 4.0 in stage 1.0 (TID 5)
java.lang.NullPointerException: Null value appeared in non-nullable field:
- field (class: "scala.Long", name: "rowId")
- root class: "$line85.$read.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.CsvRecord"
If the schema is inferred from a Scala tuple/case class, or a Java bean, please try to use scala.Option[_] or other nullable types (e.g. java.lang.Integer instead of int/scala.Int).
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificSafeProjection.apply_0_0$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificSafeProjection.apply(Unknown Source)
	at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
	at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
	at scala.collection.Iterator.foreach(Iterator.scala:943)
	at scala.collection.Iterator.foreach$(Iterator.scala:943)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
	at org.apache.spark.rdd.RDD.$anonfun$foreach$2(RDD.scala:1001)
	at org.apache.spark.rdd.RDD.$anonfun$foreach$2$adapted(RDD.scala:1001)
	at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2302)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92)
	at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
	at org.apache.spark.scheduler.Task.run(Task.scala:139)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1502)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)
```

With the jar built from the PR's code, the run completed successfully:

```
minimums=((action_http_get_request.min=-1) (action_http_get_request.failures.min=-1));
maximums=((action_http_get_request.max=-1) (action_http_get_request.failures.max=-1));
means=((action_http_get_request.failures.mean=(samples=0, sum=0, mean=0.0000)) (action_http_get_request.mean=(samples=0, sum=0, mean=0.0000)));
}}
22/12/06 01:04:22 INFO TaskSetManager: Finished task 8.0 in stage 1.0 (TID 9) in 14727 ms on snvijaya-Virtual-Machine.mshome.net (executor driver) (9/9)
22/12/06 01:04:22 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
22/12/06 01:04:22 INFO DAGScheduler: ResultStage 1 (foreach at /home/snvijaya/Desktop/cloud-integration/spark-cloud-integration/src/scripts/validating-csv-record-io.sc:46) finished in 115.333 s
22/12/06 01:04:22 INFO DAGScheduler: Job 1 is finished. Cancelling potential speculative or zombie tasks for this job
22/12/06 01:04:22 INFO TaskSchedulerImpl: Killing all running tasks in stage 1: Stage finished
22/12/06 01:04:22 INFO DAGScheduler: Job 1 finished: foreach at /home/snvijaya/Desktop/cloud-integration/spark-cloud-integration/src/scripts/validating-csv-record-io.sc:46, took 115.337621 s
res35: String = validation completed
[start: string, rowId: bigint ... 6 more fields]
```

Commands executed:

```
:load /home/snvijaya/Desktop/cloud-integration/spark-cloud-integration/src/scripts/validating-csv-record-io.sc
validateDS(rowsDS)
```

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
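For context on the trunk-side failure: the NullPointerException fires when Catalyst decodes a null cell into the non-nullable primitive field `rowId: Long` of the script's `CsvRecord` case class; here the null is a symptom of the corrupted read the PR fixes, not a schema problem. A minimal, Spark-free sketch of the distinction the error message points at (the names `CsvRecordNullable` and `decodeRowId` are hypothetical, introduced only for illustration):

```scala
// A case class with a primitive `scala.Long` field has no way to hold null,
// so a null cell surfaces as an NPE at decode time. Wrapping the field in
// Option[Long] lets a null decode to None instead.
case class CsvRecordStrict(rowId: Long)           // null cell => NullPointerException
case class CsvRecordNullable(rowId: Option[Long]) // null cell => None

object NullableFieldSketch {
  // Hypothetical decode helper: converts a possibly-null boxed value
  // into an Option, the pattern Spark's error message suggests.
  def decodeRowId(raw: java.lang.Long): Option[Long] =
    Option(raw).map(_.longValue)

  def main(args: Array[String]): Unit = {
    println(decodeRowId(null))                        // None rather than an NPE
    println(decodeRowId(java.lang.Long.valueOf(42L))) // Some(42)
  }
}
```

Of course, with the PR's fix in place no null appears in the first place, so the script's strict `CsvRecord` schema is the right choice for validation: it makes any data corruption fail loudly.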