Launching EC2 instances with Spark compiled for Scala 2.11

2015-10-08 Thread Theodore Vasiloudis
Hello, I was wondering if there is an easy way to launch EC2 instances which have Spark built for Scala 2.11. The only way I can think of is to prepare the sources for 2.11 as shown in the Spark build instructions (http://spark.apache.org/docs/latest/building-spark.html#building-for-scala-211),

Disable stage logging to stdout

2015-04-01 Thread Theodore Vasiloudis
Since switching to Spark 1.2.1 I'm seeing logging for the stage progress, e.g.: [error] [Stage 2154: (14 + 8) / 48][Stage 2210: (0 + 0) / 48] Any reason why these are error-level logs? Shouldn't they be info level? In any case, is there a way to disable them other than

Re: Disable stage logging to stdout

2015-04-01 Thread Theodore Vasiloudis
to achieve the animation and this won't work via a logging framework. stderr is where log-like output goes, because stdout is for program output. On Wed, Apr 1, 2015 at 10:56 AM, Theodore Vasiloudis theodoros.vasilou...@gmail.com wrote: Since switching to Spark 1.2.1 I'm seeing logging

EC2 Having script run at startup

2015-03-24 Thread Theodore Vasiloudis
Hello, in the context of SPARK-2394, "Make it easier to read LZO-compressed files from EC2 clusters" (https://issues.apache.org/jira/browse/SPARK-2394), I was wondering: is there an easy way to make a user-provided script run on every machine in a cluster launched on EC2? Regards, Theodore --

Re: Efficient self-joins

2014-12-08 Thread Theodore Vasiloudis
all incoming edge pairs without repartitioning the data by dstID. You need to perform this shuffle for joining too. Otherwise two incoming edges could be in separate partitions and never meet. Am I missing something? On Mon, Dec 8, 2014 at 3:53 PM, Theodore Vasiloudis theodoros.vasilou

Re: Efficient self-joins

2014-12-08 Thread Theodore Vasiloudis
improves performance. Decreasing the number of partitions has a large negative effect on the runtime. On Mon, Dec 8, 2014 at 5:46 PM, Daniel Darabos daniel.dara...@lynxanalytics.com wrote: On Mon, Dec 8, 2014 at 5:26 PM, Theodore Vasiloudis theodoros.vasilou...@gmail.com wrote: @Daniel It's

Efficient way to get top K values per key in (key, value) RDD?

2014-12-04 Thread Theodore Vasiloudis
Hello everyone, I was wondering what the most efficient way is to retrieve the top K values per key in a (key, value) RDD. The simplest way I can think of is to do a groupByKey, sort the iterables and then take the top K elements for every key. But reduceByKey is an operation that can be
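
One way to avoid materializing every value for a key, as groupByKey followed by a sort would, is to keep a bounded min-heap of at most K values per key and merge those heaps across partitions. The sketch below is plain Python with no Spark dependency, so it only illustrates the idea. The names `add_value`, `merge_heaps` and `top_k_per_key` are hypothetical, and the list-of-lists standing in for RDD partitions is an assumption made for illustration. In Spark, the first two functions would roughly correspond to the per-partition and cross-partition functions passed to `aggregateByKey`. That mapping is a suggestion here, not something stated in the thread.

```python
import heapq
from collections import defaultdict


def add_value(heap, value, k):
    """Fold one value into a min-heap that never holds more than k items."""
    if k <= 0:
        return heap
    if len(heap) < k:
        heapq.heappush(heap, value)
    elif value > heap[0]:
        # The smallest retained value is evicted in favour of the larger one.
        heapq.heapreplace(heap, value)
    return heap


def merge_heaps(left, right, k):
    """Merge two bounded heaps built on different partitions."""
    for value in right:
        add_value(left, value, k)
    return left


def top_k_per_key(partitions, k):
    """Simulate a two-stage aggregation over a list of partitions.

    Each partition is a list of (key, value) pairs (a stand-in for an RDD
    partition; this is an assumption for illustration only).
    """
    # Stage 1: every partition builds its own bounded heaps independently.
    partials = []
    for partition in partitions:
        heaps = defaultdict(list)
        for key, value in partition:
            add_value(heaps[key], value, k)
        partials.append(heaps)

    # Stage 2: merge the per-partition heaps key by key.
    merged = defaultdict(list)
    for heaps in partials:
        for key, heap in heaps.items():
            merge_heaps(merged[key], heap, k)

    # Return each key's values in descending order.
    return {key: sorted(heap, reverse=True) for key, heap in merged.items()}


if __name__ == "__main__":
    data = [
        [("a", 5), ("b", 1), ("a", 3)],
        [("a", 9), ("b", 7), ("a", 1), ("b", 2)],
    ]
    print(top_k_per_key(data, 2))
```

With this approach each partition passes on at most K values per key, so the amount of data exchanged between stages is bounded by K rather than by the number of values per key.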