Estimating Task memory
Hi, I'm looking for a way to estimate the amount of memory that will be needed for a task by looking at the size of its input data. It clearly depends on what the task is doing, but is there a place to look in the logs exported by Spark to see this information? Thanks
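For what it's worth, the event logs don't record a task's total memory footprint, but TaskMetrics exposes a few memory-related counters (bytes spilled to memory/disk, result size) that can be correlated with input size per task. Below is a minimal sketch, assuming an existing SparkContext named sc; the listener class name is made up for illustration.

import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Prints a few memory-related counters for every finished task.
class TaskMemoryListener extends SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val metrics = taskEnd.taskMetrics
    if (metrics != null) {  // metrics may be missing for some failed tasks
      println(s"task ${taskEnd.taskInfo.taskId}: " +
        s"memoryBytesSpilled=${metrics.memoryBytesSpilled}, " +
        s"diskBytesSpilled=${metrics.diskBytesSpilled}, " +
        s"resultSize=${metrics.resultSize}")
    }
  }
}

// sc is an existing SparkContext
sc.addSparkListener(new TaskMemoryListener)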
Re: Spark 1.4 release date
Does the pre-built package come with Hive support? Namely, has it been built with -Phive and -Phive-thriftserver?

On Fri, Jun 12, 2015, 9:32 AM ayan guha guha.a...@gmail.com wrote: Thanks guys, my question must look like a stupid one today :) Looking forward to testing out 1.4.0, just downloaded it. Congrats to the team for this much anticipated release.

On Fri, Jun 12, 2015 at 10:12 PM, Guru Medasani gdm...@gmail.com wrote: Here is a Spark 1.4 release blog by Databricks: https://databricks.com/blog/2015/06/11/announcing-apache-spark-1-4.html Guru Medasani gdm...@gmail.com

On Jun 12, 2015, at 7:08 AM, ayan guha guha.a...@gmail.com wrote: Thanks a lot.

On 12 Jun 2015 19:46, Todd Nist tsind...@gmail.com wrote: It was released yesterday.

On Friday, June 12, 2015, ayan guha guha.a...@gmail.com wrote: Hi, when is the official Spark 1.4 release date? Best, Ayan -- Best Regards, Ayan Guha
Job aborted
I'm running PageRank on datasets of different sizes (from 1GB to 100GB). Sometimes my job is aborted with this error:

Job aborted due to stage failure: Task 0 in stage 4.1 failed 4 times, most recent failure: Lost task 0.3 in stage 4.1 (TID 2051, 9.12.247.250): java.io.FileNotFoundException: /tmp/spark-23fcf792-e281-4d0d-a55e-4db2c31e7f10/executor-d4a67fea-3d6b-4e93-ad92-53221fa92f2b/blockmgr-96e4bd04-00ba-4893-a65a-cec09ec2dc52/17/rdd_3_0 (No such file or directory)
  at java.io.RandomAccessFile.open(Native Method)
  at java.io.RandomAccessFile.<init>(RandomAccessFile.java:241)
  at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:110)
  at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:134)
  at org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:510)
  at org.apache.spark.storage.BlockManager.getLocal(BlockManager.scala:427)
  at org.apache.spark.storage.BlockManager.get(BlockManager.scala:616)
  at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:44)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:242)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
  at org.apache.spark.scheduler.Task.run(Task.scala:70)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
  at java.lang.Thread.run(Thread.java:745)
Driver stacktrace:

I cannot find the cause of this; can someone help? I'm running Spark 1.4.0-rc3.
Problem with current spark
Hi, I'm trying to run an application that uses a Hive context to perform some queries over JSON files. The code of the application is here: https://github.com/GiovanniPaoloGibilisco/spark-log-processor/tree/fca93d95a227172baca58d51a4d799594a0429a1

I can run it on Spark 1.3.1 after rebuilding it with Hive support using:

mvn -Phive -Phive-thriftserver -DskipTests clean package

but when I try to run the same application on Spark built from the current master branch (at this commit of today: https://github.com/apache/spark/tree/bec938f777a2e18757c7d04504d86a5342e2b49e), again built with Hive support, Stage 2 is never submitted and after a while the application is killed. The logs look like this:

15/05/13 16:54:37 INFO SparkContext: Starting job: run at <unknown>:0
15/05/13 16:54:37 INFO DAGScheduler: Got job 2 (run at <unknown>:0) with 2 output partitions (allowLocal=false)
15/05/13 16:54:37 INFO DAGScheduler: Final stage: ResultStage 4 (run at <unknown>:0)
15/05/13 16:54:37 INFO DAGScheduler: Parents of final stage: List()
15/05/13 16:54:37 INFO Exchange: Using SparkSqlSerializer2.
15/05/13 16:54:37 INFO SparkContext: Starting job: run at <unknown>:0
15/05/13 16:54:37 INFO SparkContext: Starting job: run at <unknown>:0
15/05/13 16:54:37 INFO SparkContext: Starting job: run at <unknown>:0
^C15/05/13 16:54:42 INFO SparkContext: Invoking stop() from shutdown hook
15/05/13 16:54:42 INFO SparkUI: Stopped Spark web UI at http://192.168.230.130:4040
15/05/13 16:54:42 INFO DAGScheduler: Stopping DAGScheduler
15/05/13 16:54:42 INFO SparkDeploySchedulerBackend: Shutting down all executors
15/05/13 16:54:42 INFO SparkDeploySchedulerBackend: Asking each executor to shut down
15/05/13 16:54:52 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
15/05/13 16:54:52 ERROR TaskSchedulerImpl: Lost executor 0 on 192.168.230.130: remote Rpc client disassociated
15/05/13 16:54:53 INFO AppClient$ClientActor: Executor updated: app-20150513165402-/0 is now EXITED (Command exited with code 0)
15/05/13 16:54:53 INFO SparkDeploySchedulerBackend: Executor app-20150513165402-/0 removed: Command exited with code 0
15/05/13 16:54:53 ERROR SparkDeploySchedulerBackend: Asked to remove non-existent executor 0
15/05/13 16:56:42 WARN AkkaRpcEndpointRef: Error sending message [message = StopExecutors] in 1 attempts
java.util.concurrent.TimeoutException: Futures timed out after [120 seconds]
  at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
  at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
  at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
  at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
  at scala.concurrent.Await$.result(package.scala:107)
  at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102)
  at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:78)
  at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stopExecutors(CoarseGrainedSchedulerBackend.scala:257)
  at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stop(CoarseGrainedSchedulerBackend.scala:266)
  at org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend.stop(SparkDeploySchedulerBackend.scala:95)
  at org.apache.spark.scheduler.TaskSchedulerImpl.stop(TaskSchedulerImpl.scala:416)
  at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1404)
  at org.apache.spark.SparkContext.stop(SparkContext.scala:1562)
  at org.apache.spark.SparkContext$$anonfun$3.apply$mcV$sp(SparkContext.scala:551)
  at org.apache.spark.util.SparkShutdownHook.run(Utils.scala:2252)
  at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(Utils.scala:)
  at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(Utils.scala:)
  at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(Utils.scala:)
  at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1764)
  at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(Utils.scala:)
  at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(Utils.scala:)
  at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(Utils.scala:)
  at scala.util.Try$.apply(Try.scala:161)
  at org.apache.spark.util.SparkShutdownHookManager.runAll(Utils.scala:)
  at org.apache.spark.util.SparkShutdownHookManager$$anon$6.run(Utils.scala:2204)
  at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)

Should I submit an issue for this? What is the best way to do it? Best
SparkSQL Nested structure
Hi, I'm trying to parse the log files generated by Spark using SparkSQL. The JSON elements related to the StageCompleted event contain a nested structure with an array of RDD Info elements (see the example below, with some parts omitted):

{
  Event: SparkListenerStageCompleted,
  Stage Info: {
    Stage ID: 1,
    ...
    RDD Info: [
      {
        RDD ID: 5,
        Name: 5,
        Storage Level: {
          Use Disk: false,
          Use Memory: false,
          Use Tachyon: false,
          Deserialized: false,
          Replication: 1
        },
        Number of Partitions: 2,
        Number of Cached Partitions: 0,
        Memory Size: 0,
        Tachyon Size: 0,
        Disk Size: 0
      },
      ...

When I register the log as a table, SparkSQL is able to generate the correct schema, which for the RDD Info element looks like:

 |-- RDD Info: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- Disk Size: long (nullable = true)
 |    |    |-- Memory Size: long (nullable = true)
 |    |    |-- Name: string (nullable = true)

My problem is that when I query the table I can only get array buffers out of it:

SELECT `stageEndInfos.Stage Info.Stage ID`, `stageEndInfos.Stage Info.RDD Info` FROM stageEndInfos

Stage ID  RDD Info
1         ArrayBuffer([0,0,...
0         ArrayBuffer([0,0,...
2         ArrayBuffer([0,0,...

or:

SELECT `stageEndInfos.Stage Info.RDD Info.RDD ID` FROM stageEndInfos

RDD ID
ArrayBuffer(5, 4, 3)
ArrayBuffer(2, 1, 0)
ArrayBuffer(9, 6,...

Is there a way to explode the arrays in the rows in order to build a single table (knowing that the RDD ID is unique and can be used as a primary key)? Thanks!
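In case it helps, one possible approach (a sketch, not tested against your exact schema): with a HiveContext, HiveQL's LATERAL VIEW explode(...) turns each element of the RDD Info array into its own row, so the nested struct fields come out as flat columns. The table and column names below are the ones from your post; the file path and the aliases (rdds, rdd, stage_id, ...) are made up, and the back-tick quoting of field names containing spaces may need adjusting for your Spark version.

import org.apache.spark.sql.hive.HiveContext

val sqlContext = new HiveContext(sc)                           // sc: existing SparkContext
val stageEndInfos = sqlContext.jsonFile("sparkEventLog.json")  // path is a placeholder
stageEndInfos.registerTempTable("stageEndInfos")

// One row per (stage, RDD) pair instead of one ArrayBuffer per stage.
val rddPerStage = sqlContext.sql("""
  SELECT `Stage Info`.`Stage ID` AS stage_id,
         rdd.`RDD ID`            AS rdd_id,
         rdd.`Memory Size`       AS memory_size,
         rdd.`Disk Size`         AS disk_size
  FROM stageEndInfos
  LATERAL VIEW explode(`Stage Info`.`RDD Info`) rdds AS rdd
""")
rddPerStage.show()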
Building DAG from log
Hi, I'm trying to build the DAG of an application from the logs. I've had a look at SparkReplayDebugger, but it doesn't operate offline on logs. I also looked at the one in this pull request: https://github.com/apache/spark/pull/2077 which seems to operate only on logs, but it doesn't clearly show the dependencies between the stages. Is there some other tool to do this? In the log files I could not find the information needed to derive the dependencies between stages; is there any other way to obtain this information offline? Thanks,
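For what it's worth, more recent event logs do carry stage parentage: each Stage Info inside a SparkListenerJobStart event includes a "Parent IDs" list (I believe from around Spark 1.4; older logs may not have it). Below is a rough offline sketch using json4s, which ships with Spark; the object name and the assumption that the log is one JSON object per line are mine.

import scala.io.Source
import org.json4s._
import org.json4s.jackson.JsonMethods.parse

object EventLogDag {
  implicit val formats = DefaultFormats

  def main(args: Array[String]): Unit = {
    val logPath = args(0)  // path to the application's event log file
    val edges = Source.fromFile(logPath).getLines().flatMap { line =>
      val json = parse(line)
      (json \ "Event").extractOpt[String] match {
        case Some("SparkListenerJobStart") =>
          (json \ "Stage Infos").children.flatMap { stage =>
            val stageId = (stage \ "Stage ID").extract[Int]
            val parents = (stage \ "Parent IDs").extractOrElse[List[Int]](Nil)
            parents.map(parent => (parent, stageId))  // edge: parent stage -> child stage
          }
        case _ => Nil
      }
    }.toSet

    edges.toSeq.sorted.foreach { case (p, c) => println(s"stage $p -> stage $c") }
  }
}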
Metric collection
Hi, I would like to collect some metrics from Spark and plot them with Graphite. I managed to do that with the metrics provided by org.apache.spark.metrics.source.JvmSource, but I would like to know if there are other sources available besides this one. Best, Giovanni
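For reference, this is roughly the conf/metrics.properties setup being described (a sketch; the host, port, and polling period are placeholders): the Graphite sink is enabled for all instances and the JVM source is registered explicitly.

*.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
*.sink.graphite.host=graphite.example.com
*.sink.graphite.port=2003
*.sink.graphite.period=10
*.sink.graphite.unit=seconds

# Register the JVM source on the driver and executors (and optionally master/worker)
driver.source.jvm.class=org.apache.spark.metrics.source.JvmSource
executor.source.jvm.class=org.apache.spark.metrics.source.JvmSource
master.source.jvm.class=org.apache.spark.metrics.source.JvmSource
worker.source.jvm.class=org.apache.spark.metrics.source.JvmSource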
DAG
Hi, I would like to know if it is possible to build the DAG before actually executing the application. My guess is that the scheduler builds the DAG dynamically at runtime, since it might depend on the data, but I was wondering if there is a way (and maybe a tool already) to analyze the code and build the DAG. Thank you!
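A partial answer, as a sketch: transformations only record lineage, so the dependency graph of an RDD can be inspected before any action runs via RDD.toDebugString. It shows the lineage of a single RDD rather than the full job DAG, but no job is triggered. The input path and the word-count pipeline below are just placeholders; sc is assumed to be an existing SparkContext.

val counts = sc.textFile("hdfs:///path/to/input")  // placeholder path
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

// Prints the RDD lineage (including shuffle boundaries) without running a job.
println(counts.toDebugString)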