Re: Spark Monitoring with Ganglia

2014-10-05 Thread manasdebashiskar
Have you checked reactive monitoring (https://github.com/eigengo/monitor) or Kamon monitoring (https://github.com/kamon-io/Kamon)? Instrumenting needs absolutely no code change; all you do is weaving. In our environment we use Graphite to collect the statsd events (you can also get dtrace) and display

Re: Spark Streaming writing to HDFS

2014-10-05 Thread Sean Owen
On Sat, Oct 4, 2014 at 5:28 PM, Abraham Jacob abe.jac...@gmail.com wrote: import org.apache.hadoop.conf.Configuration; import org.apache.hadoop.io.IntWritable; import org.apache.hadoop.io.Text; import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat; Good. There is also a
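For reference, a minimal sketch of writing a streaming word count to HDFS with the new Hadoop API, assuming Spark 1.1-era APIs; the socket source, host/port, and HDFS path prefix are placeholders, and Hadoop Writables are only introduced at the output boundary:

```scala
import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._ // pair-DStream implicits in Spark 1.x

object StreamToHdfs {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StreamToHdfs")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Placeholder source: count words arriving on a socket.
    val counts = ssc.socketTextStream("localhost", 9999)
      .flatMap(_.split("\\s+"))
      .map((_, 1))
      .reduceByKey(_ + _)

    // Convert to Hadoop Writables only at the output boundary, then write one
    // output directory per batch under the given path prefix.
    counts.map { case (word, n) => (new Text(word), new IntWritable(n)) }
      .saveAsNewAPIHadoopFiles(
        "hdfs:///user/example/wordcounts", "txt",
        classOf[Text], classOf[IntWritable],
        classOf[TextOutputFormat[Text, IntWritable]])

    ssc.start()
    ssc.awaitTermination()
  }
}
```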

java.library.path

2014-10-05 Thread Tom
Hi, I am trying to call some C code; let's say the compiled file is /path/code, and it has chmod +x. When I call it directly, it works. Now I want to call it from Spark 1.1. My problem is not building it into Spark, but making sure Spark can find it. I have tried:

Re: java.library.path

2014-10-05 Thread Andrew Ash
You're putting those into spark-env.sh? Try setting LD_LIBRARY_PATH as well; that might help. Also, where is the exception coming from? You have to set this properly for both the cluster and the driver, which are configured independently. Cheers! Andrew On Sun, Oct 5, 2014 at 1:06 PM, Tom
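As a related sketch (not from the thread), one way to call an external native executable from Spark 1.x is to ship it with addFile and stream records through it with pipe, while pointing the library-path settings at any shared libraries it needs. The paths below are placeholders, and the fetched file may still need its execute bit set on the executors:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object NativeCall {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("NativeCall")
      // Placeholder paths: where shared libraries needed by the binary live on each node.
      .set("spark.executor.extraLibraryPath", "/path/lib")
      .set("spark.driver.extraLibraryPath", "/path/lib")
    val sc = new SparkContext(conf)

    // Ship the executable to every executor's working directory.
    sc.addFile("/path/code")

    val input = sc.parallelize(Seq("a", "b", "c"))
    // pipe() launches the program once per partition and streams records via stdin/stdout;
    // "./code" resolves against the executor working directory where addFile placed it.
    val output = input.pipe("./code")
    output.collect().foreach(println)

    sc.stop()
  }
}
```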

Re: window every n elements instead of time based

2014-10-05 Thread Andrew Ash
Hi Michael, I couldn't find anything in Jira for it -- https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20text%20~%20%22window%22%20AND%20component%20%3D%20Streaming Could you or Adrian please file a Jira ticket explaining the functionality and maybe a proposed API? This
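For context, the DStream.window variants in Spark Streaming are time-based; there is no count-based window in the public API at this point. A purely hypothetical sketch of what such an API could look like, named and shaped only for illustration:

```scala
import org.apache.spark.streaming.dstream.DStream

// Hypothetical extension, not an existing Spark API: window over the last
// `windowLength` elements, sliding every `slideLength` elements.
trait CountBasedWindowing[T] {
  def self: DStream[T]

  /** Returns a DStream whose RDDs each contain the most recent `windowLength`
    * elements, emitted after every `slideLength` new elements arrive. */
  def windowByCount(windowLength: Int, slideLength: Int): DStream[T]
}
```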

Re: Larger heap leads to perf degradation due to GC

2014-10-05 Thread Andrew Ash
Hi Mingyu, Maybe we should be limiting our heaps to 32GB max and running multiple workers per machine to avoid large GC issues. For a machine with 128GB of memory and 32 cores, this could look like: SPARK_WORKER_INSTANCES=4 SPARK_WORKER_MEMORY=32g SPARK_WORKER_CORES=8 Are people running with large (32GB+)

Re: still GC overhead limit exceeded after increasing heap space

2014-10-05 Thread Andrew Ash
You may also be writing your algorithm in a way that requires high peak memory usage. An example of this could be using .groupByKey() where .reduceByKey() might suffice instead, as in the sketch below. Maybe you can express the algorithm in a different way that's more efficient? On Thu, Oct 2, 2014 at 4:30 AM, Sean
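To illustrate the difference with a toy word count (a generic sketch, not the original poster's code):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._ // pair-RDD implicits in Spark 1.x

// Toy comparison of the two formulations; `sc` is an existing SparkContext.
def compareAggregations(sc: SparkContext): Unit = {
  val pairs = sc.parallelize(Seq("a", "b", "a", "c", "a")).map(word => (word, 1))

  // groupByKey shuffles and materializes every value for a key before reducing,
  // so peak memory grows with the size of the largest group.
  val viaGroup = pairs.groupByKey().mapValues(_.sum)

  // reduceByKey combines values map-side first and keeps only one running
  // value per key, which keeps peak memory low.
  val viaReduce = pairs.reduceByKey(_ + _)

  viaGroup.collect().foreach(println)
  viaReduce.collect().foreach(println)
}
```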

Re: Using GraphX with Spark Streaming?

2014-10-05 Thread Tobias Pfeiffer
Arko, On Sat, Oct 4, 2014 at 1:40 AM, Arko Provo Mukherjee arkoprovomukher...@gmail.com wrote: Apologies if this is a stupid question, but I am trying to understand why this can or cannot be done. As far as I understand, streaming algorithms need to be different from batch algorithms, as
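One hedged way to combine the two (a sketch assuming that rebuilding a graph per batch is acceptable): construct a static GraphX graph inside foreachRDD for each streaming batch and run a batch algorithm on it. The socket source and edge format below are placeholders:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingGraphX {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StreamingGraphX")
    val ssc = new StreamingContext(conf, Seconds(30))

    // Placeholder input: each line is expected to be "srcId dstId".
    val edges = ssc.socketTextStream("localhost", 9999).map { line =>
      val Array(src, dst) = line.split("\\s+")
      Edge(src.toLong, dst.toLong, 1)
    }

    edges.foreachRDD { rdd =>
      if (rdd.take(1).nonEmpty) {
        // Build a batch-local graph and run a static GraphX algorithm on it.
        val graph = Graph.fromEdges(rdd, defaultValue = 0)
        val ranks = graph.pageRank(0.001).vertices
        ranks.take(5).foreach(println)
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```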

Stuck job works well after rdd.count or rdd.collect

2014-10-05 Thread Kevin Jung
Hi all, I'm in an unusual situation. The code: ... 1: val cell = dataSet.flatMap(parse(_)).cache 2: val distinctCell = cell.keyBy(_._1).reduceByKey(removeDuplication(_, _)).mapValues(_._3).cache 3: val groupedCellByLine = distinctCell.map(cellToIterableColumn).groupByKey.cache 4: val result = (1
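For context on why inserting an rdd.count can change behavior (a generic sketch, not the poster's code): cache() is lazy, so nothing is computed or stored until an action runs; an early count() forces the lineage to execute and populates the cache that later stages reuse.

```scala
// Generic illustration, assuming `sc` is an existing SparkContext and the path is a placeholder.
val cell = sc.textFile("hdfs:///data/cells").map(_.split(",")).cache()

// count() is an action: it triggers the computation above and fills the cache,
// so downstream stages read the cached partitions instead of recomputing the lineage.
cell.count()
```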

graphx - mutable?

2014-10-05 Thread ll
I understand that GraphX is an immutable RDD. I'm working on an algorithm that requires a mutable graph. Initially, the graph starts with just a few nodes and edges; then over time, it adds more and more nodes and edges. What would be the best way to implement this growing graph with
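One common pattern (a sketch, not from the thread): keep the graph immutable and, at each growth step, build a new Graph from the union of the old and new vertex/edge RDDs. The vertex and edge attribute types below are arbitrary placeholders:

```scala
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

// Returns a new graph containing everything in `graph` plus the added vertices and edges.
// Duplicate vertex ids are resolved arbitrarily by the Graph constructor.
def grow(graph: Graph[String, Int],
         newVertices: RDD[(VertexId, String)],
         newEdges: RDD[Edge[Int]]): Graph[String, Int] = {
  val vertices = graph.vertices.union(newVertices)
  val edges = graph.edges.union(newEdges)
  Graph(vertices, edges, defaultVertexAttr = "")
}
```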