Estimating Task memory

2015-06-30 Thread Giovanni Paolo Gibilisco
Hi, I'm looking for a way to estimate the amount of memory a task will need based on the size of its input data. It clearly depends on what the task is doing, but is there a place to look in the logs exported by Spark to find this information? Thanks
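One place this information does show up is Spark's event log (written when spark.eventLog.enabled is set), which records per-task metrics such as input bytes read and bytes spilled to disk. A minimal sketch of correlating input size with memory pressure from those logs, assuming the 1.x event-log field names ("SparkListenerTaskEnd", "Task Metrics", "Input Metrics"/"Bytes Read", "Memory Bytes Spilled" — these vary between Spark versions):

```python
import json

def task_memory_stats(lines):
    """Collect (input bytes read, memory bytes spilled) for every finished task.

    `lines` is an iterable of raw event-log lines, one JSON event per line.
    Field names follow the Spark 1.x event-log format and are assumptions here.
    """
    stats = []
    for line in lines:
        event = json.loads(line)
        if event.get("Event") != "SparkListenerTaskEnd":
            continue
        metrics = event.get("Task Metrics", {})
        read = metrics.get("Input Metrics", {}).get("Bytes Read", 0)
        spilled = metrics.get("Memory Bytes Spilled", 0)
        stats.append((read, spilled))
    return stats
```

Plotting bytes read against bytes spilled across runs with different input sizes gives a rough, empirical memory-per-input-byte estimate for a given job, even though the logs do not record peak heap usage directly.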

Re: Spark 1.4 release date

2015-06-13 Thread Giovanni Paolo Gibilisco
Does the pre-built package come with Hive support? Namely, has it been built with -Phive and -Phive-thriftserver? On Fri, Jun 12, 2015, 9:32 AM ayan guha guha.a...@gmail.com wrote: Thanks guys, my question must look like a stupid one today :) Looking forward to test out 1.4.0, just downloaded it.

Job aborted

2015-06-05 Thread Giovanni Paolo Gibilisco
I'm running PageRank on datasets of different sizes (from 1GB to 100GB). Sometimes my job is aborted with this error: Job aborted due to stage failure: Task 0 in stage 4.1 failed 4 times, most recent failure: Lost task 0.3 in stage 4.1 (TID 2051, 9.12.247.250): java.io.FileNotFoundException:
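When comparing failures like this across many runs, it helps to extract the stage, task, and attempt count from the abort message programmatically. A hypothetical helper, matched against the message format quoted above (other Spark versions may phrase the message differently):

```python
import re

# Matches e.g. "Task 0 in stage 4.1 failed 4 times" from a
# "Job aborted due to stage failure" message.
ABORT_RE = re.compile(
    r"Task (?P<task>\d+) in stage (?P<stage>[\d.]+) failed (?P<attempts>\d+) times"
)

def parse_abort(message):
    """Return {'task', 'stage', 'attempts'} from an abort message, or None."""
    m = ABORT_RE.search(message)
    if m is None:
        return None
    return {
        "task": int(m.group("task")),
        "stage": m.group("stage"),
        "attempts": int(m.group("attempts")),
    }
```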

Problem with current spark

2015-05-13 Thread Giovanni Paolo Gibilisco
Hi, I'm trying to run an application that uses a Hive context to perform some queries over JSON files. The code of the application is here: https://github.com/GiovanniPaoloGibilisco/spark-log-processor/tree/fca93d95a227172baca58d51a4d799594a0429a1 I can run it on Spark 1.3.1 after rebuilding it

SparkSQL Nested structure

2015-05-04 Thread Giovanni Paolo Gibilisco
Hi, I'm trying to parse log files generated by Spark using SparkSQL. In the JSON elements related to the StageCompleted event there is a nested structure containing an array of elements with RDD Info (see the example log below, with some parts omitted). { Event: SparkListenerStageCompleted,
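One common workaround for querying such arrays is to flatten the nested structure into one row per RDD before handing it to SparkSQL. A sketch in plain Python, assuming the 1.x event-log field names ("Stage Info", "RDD Info", "RDD ID", "Name" — these are assumptions and vary between versions):

```python
import json

def flatten_stage_event(event_json):
    """Turn one SparkListenerStageCompleted event into flat rows, one per RDD."""
    event = json.loads(event_json)
    stage = event.get("Stage Info", {})
    rows = []
    for rdd in stage.get("RDD Info", []):
        rows.append({
            "stage_id": stage.get("Stage ID"),
            "rdd_id": rdd.get("RDD ID"),
            "rdd_name": rdd.get("Name"),
        })
    return rows
```

The same flattening can also be expressed inside SparkSQL itself with a LATERAL VIEW explode over the array column, which avoids the pre-processing step.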

Building DAG from log

2015-05-04 Thread Giovanni Paolo Gibilisco
Hi, I'm trying to build the DAG of an application from its logs. I've had a look at SparkReplayDebugger, but it doesn't operate offline on logs. I also looked at the one in this pull request: https://github.com/apache/spark/pull/2077 which seems to operate only on logs, but it doesn't clearly show the
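A simple offline alternative is to reconstruct the stage DAG directly from the event log: each SparkListenerStageSubmitted event carries a "Stage Info" object whose "Parent IDs" list names the stages it depends on. A minimal sketch, with those field names assumed from the 1.x event-log format:

```python
import json

def stage_dag(lines):
    """Return {stage_id: [parent_stage_ids]} for every submitted stage.

    `lines` is an iterable of raw event-log lines, one JSON event per line.
    """
    dag = {}
    for line in lines:
        event = json.loads(line)
        if event.get("Event") != "SparkListenerStageSubmitted":
            continue
        info = event.get("Stage Info", {})
        dag[info.get("Stage ID")] = list(info.get("Parent IDs", []))
    return dag
```

The resulting adjacency map can then be rendered with any graph tool (e.g. exported as DOT for Graphviz).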

Metric collection

2015-04-28 Thread Giovanni Paolo Gibilisco
Hi, I would like to collect some metrics from Spark and plot them with Graphite. I managed to do that with the metrics provided by org.apache.spark.metrics.source.JvmSource, but I would like to know if there are other sources available besides this one. Best, Giovanni
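For reference, wiring Spark's metrics system to Graphite is done in conf/metrics.properties; most internal sources (scheduler, block manager, executors) are registered automatically, and JvmSource is the main one that has to be enabled explicitly. The sink and source class names below are real Spark classes; the host, port, and prefix values are placeholders:

```properties
# conf/metrics.properties -- send all instances' metrics to Graphite
*.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
*.sink.graphite.host=localhost
*.sink.graphite.port=2003
*.sink.graphite.period=10
*.sink.graphite.unit=seconds
*.sink.graphite.prefix=spark

# enable the JVM source per instance
driver.source.jvm.class=org.apache.spark.metrics.source.JvmSource
master.source.jvm.class=org.apache.spark.metrics.source.JvmSource
worker.source.jvm.class=org.apache.spark.metrics.source.JvmSource
```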

DAG

2015-04-24 Thread Giovanni Paolo Gibilisco
Hi, I would like to know if it is possible to build the DAG before actually executing the application. My guess is that the scheduler builds the DAG dynamically at runtime, since it might depend on the data, but I was wondering if there is a way (and maybe an existing tool) to analyze the code
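It is worth noting that because transformations are lazy, the DAG for a chain of transformations is fully known before any action runs; Spark's own RDD.toDebugString prints this lineage without executing anything. A toy model (not Spark code) illustrating the idea of recording lineage at declaration time:

```python
class LazyOp:
    """Toy lazy dataset: declaring a transformation only records lineage."""

    def __init__(self, name, parents=()):
        self.name = name
        self.parents = list(parents)

    def map(self, _fn):
        # Nothing is computed; we only extend the DAG.
        return LazyOp("map", [self])

    def join(self, other):
        return LazyOp("join", [self, other])

    def lineage(self):
        """Depth-first listing of the DAG, analogous to RDD.toDebugString."""
        out = [self.name]
        for parent in self.parents:
            out.extend("  " + line for line in parent.lineage())
        return out

a = LazyOp("textFile")
b = LazyOp("textFile")
dag = a.map(len).join(b)
# dag.lineage() lists join -> map -> textFile plus the second textFile,
# with nothing executed
```

What this cannot capture is control flow that depends on the data (e.g. an iterative algorithm whose loop count is decided at runtime), which is exactly why a purely static analysis of the code can only recover part of the final DAG.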