Executors killed in Workers with Error: invalid log directory

2016-06-22 Thread Yiannis Gkoufas
Hi there, I have been getting a strange error in spark-1.6.1. The submitted job uses only the executor launched on the master node while the other workers stay idle. When I check the web UI to investigate the killed executors, I see the error: Error: invalid log directory

Re: Spark Metrics Framework?

2016-04-01 Thread Yiannis Gkoufas
Hi Mike, I am forwarding you a mail I sent a while ago regarding some related work I did; hope you find it useful. Hi all, I recently wrote to the dev mailing list about this contribution, but I thought it might be useful to post it here as well, since I have seen a lot of people asking about OS-level

Re: spark metrics question

2016-02-03 Thread Yiannis Gkoufas
Hi Matt, there is some related work I recently did at IBM Research for visualizing the metrics produced. You can read about it here: http://www.spark.tc/sparkoscope-enabling-spark-optimization-through-cross-stack-monitoring-and-visualization-2/ We recently open-sourced it, if you are interested to

Re: spark metrics question

2016-02-03 Thread Yiannis Gkoufas
cation, or does > it have to be pre-deployed on all Executor nodes? > > On Wed, Feb 3, 2016 at 10:36 AM, Yiannis Gkoufas <johngou...@gmail.com> > wrote: > >> Hi Matt, >> >> there is some related work I recently did in IBM Research for visualizing >>

SparkOscope: Enabling Spark Optimization through Cross-stack Monitoring and Visualization

2016-02-03 Thread Yiannis Gkoufas
Hi all, I recently wrote to the dev mailing list about this contribution, but I thought it might be useful to post it here too, since I have seen a lot of people asking about OS-level metrics for Spark. This is the result of the work we have been doing recently at IBM Research around Spark.

Trying to understand dynamic resource allocation

2016-01-11 Thread Yiannis Gkoufas
Hi, I am exploring the dynamic resource allocation provided by standalone cluster mode and was wondering whether the behavior I am experiencing is expected. In my configuration I have 3 slaves with 24 cores each, and in my spark-defaults.conf I have: spark.shuffle.service.enabled true
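
A minimal sketch (not from the thread) of the settings that usually go together for dynamic allocation on a standalone cluster; the app name and the executor bounds are made-up values for illustration:

    import org.apache.spark.{SparkConf, SparkContext}

    // Settings commonly paired with spark.shuffle.service.enabled for dynamic allocation;
    // the external shuffle service must also be running on every worker.
    val conf = new SparkConf()
      .setAppName("dynamic-allocation-sketch")            // hypothetical app name
      .set("spark.dynamicAllocation.enabled", "true")     // let Spark add and remove executors
      .set("spark.shuffle.service.enabled", "true")       // keep shuffle files when executors go away
      .set("spark.dynamicAllocation.minExecutors", "1")   // illustrative lower bound
      .set("spark.dynamicAllocation.maxExecutors", "9")   // illustrative upper bound, e.g. 3 slaves x 3
    val sc = new SparkContext(conf)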

Networking problems in Spark 1.6.0

2016-01-05 Thread Yiannis Gkoufas
Hi there, I have been using Spark 1.5.2 on my cluster without a problem and wanted to try Spark 1.6.0. I have the exact same configuration on both clusters. I am able to start the standalone cluster, but I fail to submit a job, getting errors like the following: 16/01/05 14:24:14 INFO

Re: Networking problems in Spark 1.6.0

2016-01-05 Thread Yiannis Gkoufas
> Typesafe <http://typesafe.com> > @deanwampler <http://twitter.com/deanwampler> > http://polyglotprogramming.com > > On Tue, Jan 5, 2016 at 9:01 AM, Yiannis Gkoufas <johngou...@gmail.com> > wrote: > >> Hi Dean, >> >> thanks so much for the response! It works

Re: Networking problems in Spark 1.6.0

2016-01-05 Thread Yiannis Gkoufas
glotprogramming.com > > On Tue, Jan 5, 2016 at 8:29 AM, Yiannis Gkoufas <johngou...@gmail.com> > wrote: > >> Hi there, >> >> I have been using Spark 1.5.2 on my cluster without a problem and wanted >> to try Spark 1.6.0. >> I have the exact same configuratio

Avoid Shuffling on Partitioned Data

2015-12-04 Thread Yiannis Gkoufas
Hi there, I have my data stored in HDFS partitioned by month in Parquet format. The directory looks like this: month=201411, month=201412, month=201501, ... I want to compute some aggregates for every timestamp. How is it possible to achieve that by taking advantage of the existing
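
A hedged sketch of what "taking advantage of the partitioning" typically looks like with the DataFrame API of that era; the base path and the column names month, timestamp, and value are assumptions, not from the thread:

    import org.apache.spark.sql.SQLContext
    import org.apache.spark.sql.functions._

    val sqlContext = new SQLContext(sc)
    // Reading the parent directory lets Spark discover month=... as a partition column.
    val df = sqlContext.read.parquet("/data")

    // Filtering on the partition column prunes whole month directories instead of scanning them.
    val recent = df.filter(col("month") >= 201501)

    // The per-timestamp aggregate itself still shuffles by timestamp; partitioning by month
    // only saves the scan, it does not remove the shuffle for a different grouping key.
    val perTimestamp = recent.groupBy("timestamp").agg(avg("value"))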

Re: Sorted Multiple Outputs

2015-08-14 Thread Yiannis Gkoufas
distinct keys, collect them on the driver, and have a loop over the keys, filtering a new RDD out of the original one by that key: for (key : keys) { RDD.filter(key).saveAsTextFile() } It might help to cache the original RDD. On 16 Jul 2015, at 12:21, Yiannis Gkoufas johngou...@gmail.com
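
A runnable sketch of the loop suggested above; the input path, the comma-separated key extraction, and the output layout are all illustrative assumptions:

    // Build (key, line) pairs; the split on "," is just an example of extracting a key.
    val pairs = sc.textFile("/input").map(line => (line.split(",")(0), line))
    pairs.cache()                                    // avoid re-reading the input for every key

    // Assumes the set of distinct keys is small enough to collect on the driver.
    val keys = pairs.keys.distinct().collect()
    for (key <- keys) {
      pairs.filter { case (k, _) => k == key }
           .values
           .saveAsTextFile("/output/key=" + key)     // one output directory per key
    }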

Re: Sorted Multiple Outputs

2015-07-16 Thread Yiannis Gkoufas
, but it might happen that one partition contains more than one key (with values). I'm not sure about this, but it shouldn't be a big deal, as you would iterate over (key, Iterable[value]) tuples and store each key to a specific file. On 15 Jul 2015, at 03:23, Yiannis Gkoufas johngou...@gmail.com wrote

Sorted Multiple Outputs

2015-07-14 Thread Yiannis Gkoufas
Hi there, I have been using the approach described here: http://stackoverflow.com/questions/23995040/write-to-multiple-outputs-by-key-spark-one-spark-job In addition to that, I was wondering if there is a way to customize the order of the values contained in each file. Thanks a lot!
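
One possible way to get the values inside each file ordered, sketched here under assumptions: (String, String) pairs, a comma-separated input, and the MultipleTextOutputFormat trick from the linked answer. None of this code is from the thread itself.

    import org.apache.hadoop.io.NullWritable
    import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat
    import org.apache.spark.Partitioner

    // Partition by the real key only, so all values of a key end up in the same partition.
    class KeyPartitioner(partitions: Int) extends Partitioner {
      override def numPartitions: Int = partitions
      override def getPartition(composite: Any): Int = {
        val key = composite.asInstanceOf[(String, String)]._1
        (key.hashCode & Int.MaxValue) % partitions
      }
    }

    // Output format from the linked answer: file path derived from the key, key dropped from the lines.
    class KeyBasedOutput extends MultipleTextOutputFormat[Any, Any] {
      override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
        key.asInstanceOf[(String, String)]._1 + "/" + name
      override def generateActualKey(key: Any, value: Any): Any = NullWritable.get()
    }

    val pairs = sc.textFile("/input").map { line =>
      val f = line.split(","); (f(0), f(1))
    }

    // Carry the value inside a composite key so repartitionAndSortWithinPartitions
    // sorts each key's values before they are written out.
    pairs.map { case (k, v) => ((k, v), v) }
         .repartitionAndSortWithinPartitions(new KeyPartitioner(pairs.partitions.length))
         .saveAsHadoopFile("/output", classOf[Any], classOf[Any], classOf[KeyBasedOutput])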

Re: How to share large resources like dictionaries while processing data with Spark ?

2015-06-04 Thread Yiannis Gkoufas
Hi there, I would recommend checking out https://github.com/spark-jobserver/spark-jobserver which I think gives the functionality you are looking for. I haven't tested it though. BR On 5 June 2015 at 01:35, Olivier Girardot ssab...@gmail.com wrote: You can use it as a broadcast variable, but
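
A tiny sketch of the broadcast-variable route mentioned in the quoted reply; the dictionary contents, input path, and lookup are invented for illustration:

    // Load the large dictionary once on the driver, ship it once per executor via broadcast.
    val dictionary: Map[String, String] = Map("spark" -> "engine", "hdfs" -> "storage")
    val dictBc = sc.broadcast(dictionary)

    // Tasks read the shared copy instead of serializing the dictionary with every closure.
    val enriched = sc.textFile("/input").map { word =>
      (word, dictBc.value.getOrElse(word, "unknown"))
    }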

Re: DataFrame operation on parquet: GC overhead limit exceeded

2015-03-23 Thread Yiannis Gkoufas
(default: total memory minus 1 GB); note that each application's individual memory is configured using its spark.executor.memory property. On Fri, Mar 20, 2015 at 9:25 AM, Yiannis Gkoufas johngou...@gmail.com wrote: Actually I realized that the correct way is: sqlContext.sql(set

How to handle under-performing nodes in the cluster

2015-03-20 Thread Yiannis Gkoufas
Hi all, I have 6 nodes in the cluster and one of them is clearly under-performing. I was wondering what the impact of such an issue is. Also, what is the recommended way to work around it? Thanks a lot, Yiannis

Re: DataFrame operation on parquet: GC overhead limit exceeded

2015-03-20 Thread Yiannis Gkoufas
the same error. :( Thanks a lot! On 19 March 2015 at 17:40, Yiannis Gkoufas johngou...@gmail.com wrote: Hi Yin, thanks a lot for that! Will give it a shot and let you know. On 19 March 2015 at 16:30, Yin Huai yh...@databricks.com wrote: Was the OOM thrown during the execution of first stage

Re: DataFrame operation on parquet: GC overhead limit exceeded

2015-03-20 Thread Yiannis Gkoufas
Actually I realized that the correct way is: sqlContext.sql("set spark.sql.shuffle.partitions=1000") but I am still experiencing the same behavior/error. On 20 March 2015 at 16:04, Yiannis Gkoufas johngou...@gmail.com wrote: Hi Yin, the way I set the configuration is: val sqlContext = new
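
For reference, two equivalent ways of setting this on a 1.x SQLContext; 1000 is simply the value tried in the thread:

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    sqlContext.setConf("spark.sql.shuffle.partitions", "1000")   // programmatic form
    sqlContext.sql("SET spark.sql.shuffle.partitions=1000")      // SQL form used in the thread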

Re: DataFrame operation on parquet: GC overhead limit exceeded

2015-03-19 Thread Yiannis Gkoufas
it until the OOM disappears. Hopefully this will help. Thanks, Yin On Wed, Mar 18, 2015 at 4:46 PM, Yiannis Gkoufas johngou...@gmail.com wrote: Hi Yin, Thanks for your feedback. I have 1700 parquet files, sized 100MB each. The number of tasks launched is equal to the number of parquet

DataFrame operation on parquet: GC overhead limit exceeded

2015-03-18 Thread Yiannis Gkoufas
Hi there, I was trying the new DataFrame API with some basic operations on a parquet dataset. I have 7 nodes of 12 cores and 8GB RAM allocated to each worker in a standalone cluster mode. The code is the following: val people = sqlContext.parquetFile("/data.parquet"); val res =
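
The code above is cut off by the archive; a hypothetical completion in the same spirit, with the column name and the aggregation guessed rather than taken from the original mail:

    // 1.3-era API (parquetFile was later superseded by sqlContext.read.parquet).
    val people = sqlContext.parquetFile("/data.parquet")
    val res = people.groupBy("name").count()   // "name" is a made-up column
    res.show()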

Re: DataFrame operation on parquet: GC overhead limit exceeded

2015-03-18 Thread Yiannis Gkoufas
/latest/configuration.html Cheng On 3/18/15 9:15 PM, Yiannis Gkoufas wrote: Hi there, I was trying the new DataFrame API with some basic operations on a parquet dataset. I have 7 nodes of 12 cores and 8GB RAM allocated to each worker in a standalone cluster mode. The code

Re: DataFrame operation on parquet: GC overhead limit exceeded

2015-03-18 Thread Yiannis Gkoufas
in the second stage. Can you take a look at the numbers of tasks launched in these two stages? Thanks, Yin On Wed, Mar 18, 2015 at 11:42 AM, Yiannis Gkoufas johngou...@gmail.com wrote: Hi there, I set the executor memory to 8g but it didn't help On 18 March 2015 at 13:59, Cheng Lian lian.cs

Problems running version 1.3.0-rc1

2015-03-02 Thread Yiannis Gkoufas
Hi all, I have downloaded version 1.3.0-rc1 from https://github.com/apache/spark/archive/v1.3.0-rc1.zip, extracted it, and built it using: mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.5.0 -DskipTests clean package It doesn't complain about any issues, but when I call sbin/start-all.sh I get in the logs:

Re: Broadcast Variable updated from one transformation and used from another

2015-02-25 Thread Yiannis Gkoufas
, Yiannis Gkoufas johngou...@gmail.com wrote: Sorry for the mistake, I actually have it this way: val myObject = new MyObject(); val myObjectBroadcasted = sc.broadcast(myObject); val rdd1 = sc.textFile("/file1").map(e => { myObjectBroadcasted.value.insert(e._1); (e._1, 1) }); rdd.cache.count

Re: Broadcast Variable updated from one transformation and used from another

2015-02-24 Thread Yiannis Gkoufas
directly, which won't work because you are modifying the serialized copy on the executor. You want to do myObjectBroadcasted.value.insert and myObjectBroadcasted.value.lookup. Sent with Good (www.good.com) -Original Message- From: Yiannis Gkoufas [johngou...@gmail.com] Sent

Broadcast Variable updated from one transformation and used from another

2015-02-24 Thread Yiannis Gkoufas
Hi all, I am trying to do the following: val myObject = new MyObject(); val myObjectBroadcasted = sc.broadcast(myObject); val rdd1 = sc.textFile("/file1").map(e => { myObject.insert(e._1); (e._1, 1) }); rdd.cache.count(); // to make sure it is transformed. val rdd2 = sc.textFile("/file2").map(e => {
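
A sketch of the correction given in the reply above (reference the broadcast's .value on the executors instead of the captured driver-side object); MyObject and the file paths come from the thread, everything else is illustrative. Note that broadcast variables are effectively read-only: inserts performed inside tasks only mutate that executor's local copy and are never shipped back to the driver or to other executors.

    val myObject = new MyObject()                       // assumed to be Serializable
    val myObjectBroadcasted = sc.broadcast(myObject)

    val rdd1 = sc.textFile("/file1").map { e =>
      myObjectBroadcasted.value.insert(e)               // mutates only this executor's copy
      (e, 1)
    }
    rdd1.cache().count()                                // force the transformation to run

    val rdd2 = sc.textFile("/file2").map { e =>
      (e, myObjectBroadcasted.value.lookup(e))          // may miss entries inserted on other executors
    }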

Re: Running out of space (when there's no shortage)

2015-02-24 Thread Yiannis Gkoufas
Hi there, I assume you are using spark 1.2.1 right? I faced the exact same issue and switched to 1.1.1 with the same configuration and it was solved. On 24 Feb 2015 19:22, Ted Yu yuzhih...@gmail.com wrote: Here is a tool which may give you some clue: http://file-leak-detector.kohsuke.org/

Re: Running out of space (when there's no shortage)

2015-02-24 Thread Yiannis Gkoufas
if there's a bug report for this regression? For some other (possibly connected) reason I upgraded from 1.1.1 to 1.2.1, but I can't remember what the bug was. Joe On 24 February 2015 at 19:26, Yiannis Gkoufas johngou...@gmail.com wrote: Hi there, I assume you are using spark 1.2.1 right? I

Re: Worker and Nodes

2015-02-21 Thread Yiannis Gkoufas
Hi, I have experienced the same behavior. You are talking about standalone cluster mode right? BR On 21 February 2015 at 14:37, Deep Pradhan pradhandeep1...@gmail.com wrote: Hi, I have been running some jobs in my local single node stand alone cluster. I am varying the worker instances for

Setting the number of executors in standalone mode

2015-02-20 Thread Yiannis Gkoufas
Hi there, I am trying to increase the number of executors per worker in standalone mode but have failed to achieve that. I followed the instructions in this thread: http://stackoverflow.com/questions/26645293/spark-configuration-memory-instance-cores and did this: spark.executor.memory 1g
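
A sketch of the per-application knobs involved here, with illustrative values. In the standalone mode of this era an application normally got at most one executor per worker, so running more executors per machine usually meant starting additional worker instances (SPARK_WORKER_INSTANCES in spark-env.sh) rather than tuning these properties alone:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.executor.memory", "1g")   // memory granted to each executor
      .set("spark.cores.max", "8")          // cap on the total cores the application may take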

Re: Setting the number of executors in standalone mode

2015-02-20 Thread Yiannis Gkoufas
memory to Spark on each worker node. Nothing to do with # of executors. Mohammed From: Yiannis Gkoufas [mailto:johngou...@gmail.com] Sent: Friday, February 20, 2015 4:55 AM To: user@spark.apache.org Subject: Setting the number of executors in standalone mode Hi there, I