pranavsaxena-microsoft commented on PR #5176:
URL: https://github.com/apache/hadoop/pull/5176#issuecomment-1339051008

   > sorry, should have been clearer: a local spark build and spark-shell process is ideal for replication and validation -as all splits are processed in different worker threads in that process, it recreates the exact failure mode.
   > 
   > script you can take and tune for your system; uses the mkcsv command in cloudstore JAR.
   > 
   > I am going to add this as a scalatest suite in the same module https://github.com/hortonworks-spark/cloud-integration/blob/master/spark-cloud-integration/src/scripts/validating-csv-record-io.sc
   
   Thanks for the script. I applied the following changes to it: https://github.com/pranavsaxena-microsoft/cloud-integration/commit/1d779f22150be3102635819e4525967573602dd9.
   
   With trunk's jar, I got this exception:
   ```
   22/12/05 23:51:27 ERROR Executor: Exception in task 4.0 in stage 1.0 (TID 5)
   java.lang.NullPointerException: Null value appeared in non-nullable field:
   - field (class: "scala.Long", name: "rowId")
   - root class: "$line85.$read.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.CsvRecord"
   If the schema is inferred from a Scala tuple/case class, or a Java bean, please try to use scala.Option[_] or other nullable types (e.g. java.lang.Integer instead of int/scala.Int).
           at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificSafeProjection.apply_0_0$(Unknown Source)
           at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificSafeProjection.apply(Unknown Source)
           at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
           at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
           at scala.collection.Iterator.foreach(Iterator.scala:943)
           at scala.collection.Iterator.foreach$(Iterator.scala:943)
           at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
           at org.apache.spark.rdd.RDD.$anonfun$foreach$2(RDD.scala:1001)
           at org.apache.spark.rdd.RDD.$anonfun$foreach$2$adapted(RDD.scala:1001)
           at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2302)
           at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92)
           at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
           at org.apache.spark.scheduler.Task.run(Task.scala:139)
           at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
           at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1502)
           at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)
           at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
           at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
           at java.lang.Thread.run(Thread.java:750)
   ```
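   
   For context on that NPE: the script deserializes each CSV row into a `CsvRecord` case class whose `rowId` is a non-nullable `scala.Long`, so a corrupted split that yields a null in that column fails inside Catalyst's generated safe projection. Below is a minimal sketch that reproduces the same error in a spark-shell; the two-field case class is illustrative (the real one in the script has six more fields, per the schema echoed further down), and the explicit null stands in for what a corrupted read produces.
   ```scala
   // Paste into a running spark-shell (assumes the shell-provided `spark`
   // session). Illustrative only: the real CsvRecord in
   // validating-csv-record-io.sc carries more fields.
   case class CsvRecord(start: String, rowId: Long)
   
   import spark.implicits._
   
   // Build a frame in which one rowId is null, then deserialize it into the
   // case class. The non-nullable scala.Long field makes Catalyst's generated
   // safe projection throw the NullPointerException shown above.
   val df = Seq(("row-0", java.lang.Long.valueOf(1L)), ("row-1", null: java.lang.Long)).toDF("start", "rowId")
   
   df.as[CsvRecord].collect() // NullPointerException: Null value appeared in non-nullable field
   ```
   Switching the field to `scala.Option[Long]`, as the error text suggests, would suppress the NPE, but here the null is the useful signal: a column that should always parse as a long comes back empty when the read path returns bad data.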
   
   Using the jar built from this PR's code, the run completed successfully:
   ```
   minimums=((action_http_get_request.min=-1) (action_http_get_request.failures.min=-1));
   maximums=((action_http_get_request.max=-1) (action_http_get_request.failures.max=-1));
   means=((action_http_get_request.failures.mean=(samples=0, sum=0, mean=0.0000)) (action_http_get_request.mean=(samples=0, sum=0, mean=0.0000)));
   }} 
   22/12/06 01:04:22 INFO TaskSetManager: Finished task 8.0 in stage 1.0 (TID 9) in 14727 ms on snvijaya-Virtual-Machine.mshome.net (executor driver) (9/9)
   22/12/06 01:04:22 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool 
   22/12/06 01:04:22 INFO DAGScheduler: ResultStage 1 (foreach at /home/snvijaya/Desktop/cloud-integration/spark-cloud-integration/src/scripts/validating-csv-record-io.sc:46) finished in 115.333 s
   22/12/06 01:04:22 INFO DAGScheduler: Job 1 is finished. Cancelling potential speculative or zombie tasks for this job
   22/12/06 01:04:22 INFO TaskSchedulerImpl: Killing all running tasks in stage 1: Stage finished
   22/12/06 01:04:22 INFO DAGScheduler: Job 1 finished: foreach at /home/snvijaya/Desktop/cloud-integration/spark-cloud-integration/src/scripts/validating-csv-record-io.sc:46, took 115.337621 s
   res35: String = validation completed [start: string, rowId: bigint ... 6 more fields]
   
   Commands executed in the spark-shell:
   ```
   :load /home/snvijaya/Desktop/cloud-integration/spark-cloud-integration/src/scripts/validating-csv-record-io.sc
   validateDS(rowsDS)
   ```
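   
   For anyone not opening the script: `validateDS` runs a `foreach` over the Dataset (the `foreach at ...:46` in the log above), so every worker thread checks the records of its own split. A hedged sketch of the shape of such a pass, reusing the two-field `CsvRecord` from the earlier sketch; the actual per-record checks in validating-csv-record-io.sc differ:
   ```scala
   import org.apache.spark.sql.Dataset
   
   // Sketch only: the real validateDS performs its own per-record checks;
   // the assertion here is a placeholder.
   def validateDS(ds: Dataset[CsvRecord]): String = {
     // Each executor thread validates the rows of its own split, so a
     // corrupted buffer surfaces as a failed task, as in the trunk run above.
     ds.foreach(r => require(r.rowId >= 0, s"invalid rowId in record starting '${r.start}'"))
     s"validation completed ${ds}"
   }
   ```
   The returned string interpolates the Dataset itself, which is why the successful run ends with `validation completed [start: string, rowId: bigint ... 6 more fields]`.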

