Estimating Task memory

2015-06-30 Thread Giovanni Paolo Gibilisco
Hi,
I'm looking for a way to estimate the amount of memory that will be needed
for a task by looking at the size of its input data. It clearly depends on
what the task is doing, but is there a place to look in the logs exported
by Spark to see this information?
Thanks
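
A hedged sketch rather than an answer: Spark does not record a memory estimate per task, but with event logging enabled (spark.eventLog.enabled=true) the JSON event log contains per-task input sizes and spill counters that can at least be correlated with input size after a run. The field names below follow the event-log JSON; the log path is hypothetical.

// Read the JSON event log back with Spark SQL and pull out per-task counters.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setAppName("task-memory-from-eventlog"))
val sqlContext = new HiveContext(sc)

val events = sqlContext.jsonFile("/tmp/spark-events/app-20150630120000-0000") // hypothetical path
events.registerTempTable("events")

sqlContext.sql("""
  SELECT `Task Info`.`Task ID`                       AS taskId,
         `Task Metrics`.`Input Metrics`.`Bytes Read` AS bytesRead,
         `Task Metrics`.`Memory Bytes Spilled`       AS memorySpilled,
         `Task Metrics`.`Disk Bytes Spilled`         AS diskSpilled
  FROM events
  WHERE Event = 'SparkListenerTaskEnd'
""").show()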


Re: Spark 1.4 release date

2015-06-13 Thread Giovanni Paolo Gibilisco
Does the pre-built package come with Hive support?
Namely, has it been built with -Phive and -Phive-thriftserver?
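
A quick, hedged check from spark-shell, in case it helps: a binary built without -Phive does not contain the Hive classes at all, so the import and instantiation below fail immediately on such a build.

// Run inside spark-shell of the distribution you want to check.
import org.apache.spark.sql.hive.HiveContext
val hiveContext = new HiveContext(sc) // sc is the SparkContext created by spark-shell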

On Fri, Jun 12, 2015, 9:32 AM ayan guha guha.a...@gmail.com wrote:

 Thanks guys, my question must look like a stupid one today :) Looking
 forward to testing out 1.4.0, just downloaded it.

 Congrats to the team for this much anticipated release.

 On Fri, Jun 12, 2015 at 10:12 PM, Guru Medasani gdm...@gmail.com wrote:

 Here is a Spark 1.4 release blog by Databricks.

 https://databricks.com/blog/2015/06/11/announcing-apache-spark-1-4.html


 Guru Medasani
 gdm...@gmail.com



 On Jun 12, 2015, at 7:08 AM, ayan guha guha.a...@gmail.com wrote:

 Thanks a lot.
 On 12 Jun 2015 19:46, Todd Nist tsind...@gmail.com wrote:

 It was released yesterday.

 On Friday, June 12, 2015, ayan guha guha.a...@gmail.com wrote:

 Hi

 When is the official Spark 1.4 release date?
 Best
 Ayan





 --
 Best Regards,
 Ayan Guha



Job aborted

2015-06-05 Thread Giovanni Paolo Gibilisco
I'm running PageRank on datasets of different sizes (from 1 GB to 100 GB).
Sometimes my job is aborted with this error:

Job aborted due to stage failure: Task 0 in stage 4.1 failed 4 times, most recent failure:
Lost task 0.3 in stage 4.1 (TID 2051, 9.12.247.250): java.io.FileNotFoundException:
/tmp/spark-23fcf792-e281-4d0d-a55e-4db2c31e7f10/executor-d4a67fea-3d6b-4e93-ad92-53221fa92f2b/blockmgr-96e4bd04-00ba-4893-a65a-cec09ec2dc52/17/rdd_3_0
(No such file or directory)
    at java.io.RandomAccessFile.open(Native Method)
    at java.io.RandomAccessFile.<init>(RandomAccessFile.java:241)
    at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:110)
    at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:134)
    at org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:510)
    at org.apache.spark.storage.BlockManager.getLocal(BlockManager.scala:427)
    at org.apache.spark.storage.BlockManager.get(BlockManager.scala:616)
    at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:44)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:242)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
    at org.apache.spark.scheduler.Task.run(Task.scala:70)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:


I cannot find the cause of this; can someone help?
I'm running Spark 1.4.0-rc3.
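
Not a diagnosis, only a hedged mitigation sketch: if the missing file belongs to a cached RDD block that disappeared with an executor, replicating the cached blocks and checkpointing the ranks every few iterations makes recovery from such a loss much cheaper. The paths and the edge-list format below are hypothetical, and sc is the usual SparkContext (e.g. from spark-shell).

import org.apache.spark.storage.StorageLevel

sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints") // hypothetical directory

// Hypothetical edge-list input: one "srcUrl dstUrl" pair per line.
val links = sc.textFile("hdfs:///data/edges.txt")
  .map { line => val parts = line.split("\\s+"); (parts(0), parts(1)) }
  .groupByKey()
  .persist(StorageLevel.MEMORY_AND_DISK_2) // keep a second replica of each cached block

var ranks = links.mapValues(_ => 1.0)
for (i <- 1 to 20) {
  val contribs = links.join(ranks).values.flatMap { case (urls, rank) =>
    urls.map(url => (url, rank / urls.size))
  }
  ranks = contribs.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)
  if (i % 5 == 0) {
    ranks.persist(StorageLevel.MEMORY_AND_DISK_2)
    ranks.checkpoint()
    ranks.count() // materialize so the checkpoint is actually written and lineage is truncated
  }
}
ranks.saveAsTextFile("hdfs:///data/ranks")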


Problem with current spark

2015-05-13 Thread Giovanni Paolo Gibilisco
Hi,
I'm trying to run an application that uses a Hive context to perform some
queries over JSON files.
The code of the application is here:
https://github.com/GiovanniPaoloGibilisco/spark-log-processor/tree/fca93d95a227172baca58d51a4d799594a0429a1
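
For context, a minimal sketch of the kind of setup described above, with names that are illustrative rather than taken from the linked repository: a HiveContext reading JSON files and querying them with SQL.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setAppName("spark-log-processor"))
val sqlContext = new HiveContext(sc)

val events = sqlContext.jsonFile("/tmp/spark-events/app-20150513165402-0000") // hypothetical path
events.registerTempTable("events")
sqlContext.sql("SELECT Event, count(*) FROM events GROUP BY Event").show()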

I can run it on Spark 1.3.1 after rebuilding Spark with Hive support
using: mvn -Phive -Phive-thriftserver -DskipTests clean package
but when I try to run the same application on a build of the
current master branch (at this commit from today:
https://github.com/apache/spark/tree/bec938f777a2e18757c7d04504d86a5342e2b49e),
again built with Hive support, I get an error at Stage 2, which is never
submitted, and after a while the application is killed.
The logs look like this:

15/05/13 16:54:37 INFO SparkContext: Starting job: run at unknown:0
15/05/13 16:54:37 INFO DAGScheduler: Got job 2 (run at unknown:0) with 2 output partitions (allowLocal=false)
15/05/13 16:54:37 INFO DAGScheduler: Final stage: ResultStage 4(run at unknown:0)
15/05/13 16:54:37 INFO DAGScheduler: Parents of final stage: List()
15/05/13 16:54:37 INFO Exchange: Using SparkSqlSerializer2.
15/05/13 16:54:37 INFO SparkContext: Starting job: run at unknown:0
15/05/13 16:54:37 INFO SparkContext: Starting job: run at unknown:0
15/05/13 16:54:37 INFO SparkContext: Starting job: run at unknown:0
^C15/05/13 16:54:42 INFO SparkContext: Invoking stop() from shutdown hook
15/05/13 16:54:42 INFO SparkUI: Stopped Spark web UI at http://192.168.230.130:4040
15/05/13 16:54:42 INFO DAGScheduler: Stopping DAGScheduler
15/05/13 16:54:42 INFO SparkDeploySchedulerBackend: Shutting down all executors
15/05/13 16:54:42 INFO SparkDeploySchedulerBackend: Asking each executor to shut down
15/05/13 16:54:52 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
15/05/13 16:54:52 ERROR TaskSchedulerImpl: Lost executor 0 on 192.168.230.130: remote Rpc client disassociated
15/05/13 16:54:53 INFO AppClient$ClientActor: Executor updated: app-20150513165402-/0 is now EXITED (Command exited with code 0)
15/05/13 16:54:53 INFO SparkDeploySchedulerBackend: Executor app-20150513165402-/0 removed: Command exited with code 0
15/05/13 16:54:53 ERROR SparkDeploySchedulerBackend: Asked to remove non-existent executor 0
15/05/13 16:56:42 WARN AkkaRpcEndpointRef: Error sending message [message = StopExecutors] in 1 attempts
java.util.concurrent.TimeoutException: Futures timed out after [120 seconds]
    at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
    at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
    at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
    at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
    at scala.concurrent.Await$.result(package.scala:107)
    at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102)
    at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:78)
    at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stopExecutors(CoarseGrainedSchedulerBackend.scala:257)
    at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stop(CoarseGrainedSchedulerBackend.scala:266)
    at org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend.stop(SparkDeploySchedulerBackend.scala:95)
    at org.apache.spark.scheduler.TaskSchedulerImpl.stop(TaskSchedulerImpl.scala:416)
    at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1404)
    at org.apache.spark.SparkContext.stop(SparkContext.scala:1562)
    at org.apache.spark.SparkContext$$anonfun$3.apply$mcV$sp(SparkContext.scala:551)
    at org.apache.spark.util.SparkShutdownHook.run(Utils.scala:2252)
    at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(Utils.scala:)
    at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(Utils.scala:)
    at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(Utils.scala:)
    at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1764)
    at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(Utils.scala:)
    at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(Utils.scala:)
    at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(Utils.scala:)
    at scala.util.Try$.apply(Try.scala:161)
    at org.apache.spark.util.SparkShutdownHookManager.runAll(Utils.scala:)
    at org.apache.spark.util.SparkShutdownHookManager$$anon$6.run(Utils.scala:2204)
    at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)

Should I submit an Issue for this?
What is the best way to do it?
Best


SparkSQL Nested structure

2015-05-04 Thread Giovanni Paolo Gibilisco
Hi, I'm trying to parse log files generated by Spark using SparkSQL.

The JSON elements related to the StageCompleted event contain a nested
structure with an array of RDD Info elements (see the log below as an
example, omitting some parts).

{
  "Event": "SparkListenerStageCompleted",
  "Stage Info": {
    "Stage ID": 1,
    ...
    "RDD Info": [
      {
        "RDD ID": 5,
        "Name": "5",
        "Storage Level": {
          "Use Disk": false,
          "Use Memory": false,
          "Use Tachyon": false,
          "Deserialized": false,
          "Replication": 1
        },
        "Number of Partitions": 2,
        "Number of Cached Partitions": 0,
        "Memory Size": 0,
        "Tachyon Size": 0,
        "Disk Size": 0
      },
      ...

When I register the log as a table, Spark SQL generates the correct
schema, which for the RDD Info element looks like:

 |-- RDD Info: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- Disk Size: long (nullable = true)
 |    |    |-- Memory Size: long (nullable = true)
 |    |    |-- Name: string (nullable = true)

My problem is that if I try to query the table I can only get array buffers
out of it:

SELECT `stageEndInfos.Stage Info.Stage ID`, `stageEndInfos.Stage Info.RDD Info` FROM stageEndInfos

Stage ID  RDD Info
1         ArrayBuffer([0,0,...
0         ArrayBuffer([0,0,...
2         ArrayBuffer([0,0,...

or:

SELECT `stageEndInfos.Stage Info.RDD Info.RDD ID` FROM stageEndInfos
RDD ID
ArrayBuffer(5, 4, 3)
ArrayBuffer(2, 1, 0)
ArrayBuffer(9, 6,...

Is there a way to explode the arrays in the rows in order to build a single
table (knowing that the RDD ID is unique and can be used as a primary key)?

Thanks!
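
One hedged way to do this, assuming the table was registered through a HiveContext so that Hive's LATERAL VIEW syntax is available. The column names follow the schema above, the aliases are mine, and the exact quoting of names containing spaces may need adjusting.

// Explode the nested "RDD Info" array into one row per RDD per stage.
val flattened = sqlContext.sql("""
  SELECT `Stage Info`.`Stage ID` AS stageId,
         rdd.`RDD ID`            AS rddId,
         rdd.`Memory Size`       AS memorySize,
         rdd.`Disk Size`         AS diskSize
  FROM stageEndInfos
  LATERAL VIEW explode(`Stage Info`.`RDD Info`) rdds AS rdd
""")
flattened.registerTempTable("rddInfos")
flattened.show()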



Building DAG from log

2015-05-04 Thread Giovanni Paolo Gibilisco
Hi,
I'm trying to build the DAG of an application from the logs.
I've had a look at SparkReplayDebugger, but it doesn't operate offline on
logs. I also looked at the one in this pull request:
https://github.com/apache/spark/pull/2077 which seems to operate only on
logs, but it doesn't clearly show the dependencies between the stages.
Is there some other tool to do this?

In the log files I could not find the information needed to define the
dependencies between the stages; is there any other way to derive this
information offline?
Thanks,
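
A hedged, partial sketch rather than a tool: recent event logs serialize a "Parent IDs" array inside each "Stage Info", which is enough to reconstruct stage-level edges offline; whether the field is present depends on the Spark version that wrote the log. The path is hypothetical and sqlContext is the one provided by spark-shell.

// Read the event log and print "parent -> stage" edges.
val log = sqlContext.jsonFile("/tmp/spark-events/app-20150504120000-0000")
log.registerTempTable("events")

val stages = sqlContext.sql("""
  SELECT `Stage Info`.`Stage ID`   AS stageId,
         `Stage Info`.`Parent IDs` AS parentIds
  FROM events
  WHERE Event = 'SparkListenerStageSubmitted'
""")

stages.collect().foreach { row =>
  val stageId = row.getLong(0)
  val parents = Option(row.getAs[Seq[Long]](1)).getOrElse(Seq.empty)
  parents.foreach(parent => println(s"stage $parent -> stage $stageId"))
}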


Metric collection

2015-04-28 Thread Giovanni Paolo Gibilisco
Hi,
I would like to collect some metrics from Spark and plot them with
Graphite. I managed to do that with the metrics provided by
org.apache.spark.metrics.source.JvmSource, but I would like to know if
there are other sources available besides this one.
Best,
Giovanni
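
For reference, a hedged sketch of the conf/metrics.properties wiring (the Graphite host is a placeholder): the JVM source is the only source that has to be enabled this way, while most other sources (DAGScheduler, BlockManager, executor and streaming sources) are registered by Spark itself and report through the same sink.

# conf/metrics.properties, following the GraphiteSink section of metrics.properties.template
*.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
*.sink.graphite.host=graphite.example.com
*.sink.graphite.port=2003
*.sink.graphite.period=10
*.sink.graphite.unit=seconds

# JvmSource must be enabled explicitly per instance; other built-in sources
# are registered automatically by Spark and show up in the sink on their own.
driver.source.jvm.class=org.apache.spark.metrics.source.JvmSource
executor.source.jvm.class=org.apache.spark.metrics.source.JvmSource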


DAG

2015-04-24 Thread Giovanni Paolo Gibilisco
Hi,
I would like to know if it is possible to build the DAG before actually
executing the application. My guess is that in the scheduler the DAG is
built dynamically at runtime since it might depend on the data, but I was
wondering if there is a way (and maybe an existing tool) to analyze the code
and build the DAG.

Thank you!
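
A hedged, partial answer in code form: because transformations are lazy, the RDD lineage, which is the input to the DAG, already exists before any action runs and can be printed with toDebugString; the stage boundaries themselves are only fixed by the DAGScheduler once a job is submitted. Here sc is assumed (e.g. from spark-shell) and the input path is hypothetical.

val counts = sc.textFile("hdfs:///data/input.txt")
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

println(counts.toDebugString) // prints the lineage/dependency tree without running a job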