Spark ganglia ClassNotFoundException: org.apache.spark.metrics.sink.GangliaSink

2015-10-08 Thread gtanguy
I built Spark with Ganglia support: $SPARK_HOME/build/sbt -Pspark-ganglia-lgpl -Phadoop-1 -Phive -Phive-thriftserver assembly ... [info] Including from cache: metrics-ganglia-3.1.0.jar ... In the master log: ERROR actor.OneForOneStrategy: org.apache.spark.metrics.sink.GangliaSink
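
For reference, a minimal sketch of the matching $SPARK_HOME/conf/metrics.properties entry for the Ganglia sink. The host, port, and period values below are placeholders; the GangliaSink class only exists when the assembly is built with -Pspark-ganglia-lgpl, and that assembly jar must be on the classpath of every daemon that reads this file, otherwise the ClassNotFoundException above is exactly what you get:

    # Enable GangliaSink for all instances (requires the -Pspark-ganglia-lgpl build)
    *.sink.ganglia.class=org.apache.spark.metrics.sink.GangliaSink
    # Placeholder gmond address and port
    *.sink.ganglia.host=239.2.11.71
    *.sink.ganglia.port=8649
    *.sink.ganglia.period=10
    *.sink.ganglia.unit=seconds
    *.sink.ganglia.mode=multicast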

Re: extracting the top 100 values from an RDD and saving them as a text file

2015-10-06 Thread gtanguy
Hello patelmiteshn, this could do the trick (keeping everything in Scala): val rdd1 = rdd.sortBy(_._2, ascending = false) ; val rdd2 = rdd1.zipWithIndex().filter { case (_, rank) => rank < 100 } ; rdd2.saveAsTextFile(...)
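
Spelled out as a self-contained sketch (rdd is assumed to be an RDD of (key, value) pairs with a numeric value, and the output path is a placeholder). When the top 100 rows fit in driver memory, rdd.takeOrdered(100)(Ordering.by(-_._2)) gets the same result without a full distributed sort:

    // Sort by value descending, keep the first 100 rows, and write them out.
    val top100 = rdd
      .sortBy(_._2, ascending = false)
      .zipWithIndex()                          // attach a 0-based rank to each row
      .filter { case (_, rank) => rank < 100 }
      .map(_._1)                               // drop the rank again
    top100.saveAsTextFile("hdfs:///tmp/top100") // placeholder path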

Spark metrics: CPU/memory

2015-10-05 Thread gtanguy
I would like to monitor CPU/memory usage. I read the Metrics section of http://spark.apache.org/docs/1.3.1/monitoring.html. Here is my $SPARK_HOME/conf/metrics.properties: # Enable CsvSink for all instances *.sink.csv.class=org.apache.spark.metrics.sink.CsvSink # Polling period for CsvSink
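
A sketch of what a complete CsvSink block looks like, with a JvmSource added since that is where the JVM memory numbers come from (the directory is a placeholder; note the metrics system reports what Spark exposes, e.g. JVM heap usage, not OS-level CPU):

    # Enable CsvSink for all instances
    *.sink.csv.class=org.apache.spark.metrics.sink.CsvSink
    # Polling period for CsvSink
    *.sink.csv.period=10
    *.sink.csv.unit=seconds
    # Where the csv files are written (placeholder path)
    *.sink.csv.directory=/tmp/spark-metrics
    # Enable JvmSource for instance master, worker, driver and executor
    *.source.jvm.class=org.apache.spark.metrics.source.JvmSource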

DataFrame GroupBy

2015-03-26 Thread gtanguy
Hello everybody, I am trying to do a simple groupBy. *Code:* val df = hiveContext.sql("SELECT * FROM table1") df.printSchema() df.groupBy("customer_id").count().show(5) *Stacktrace:* root |-- customer_id: string (nullable = true) |-- rank: string (nullable = true) |-- reco_material_id:
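
The same code as a self-contained sketch (table and column names are taken from the message; sc is assumed to be an existing SparkContext on a Spark 1.3 build with Hive support):

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)
    val df = hiveContext.sql("SELECT * FROM table1")
    df.printSchema()                          // prints the root |-- ... schema above
    df.groupBy("customer_id").count().show(5) // row count per customer_id, first 5 rows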

Spark SQL join partitioner

2015-03-12 Thread gtanguy
Hello, I am wondering how join works in Spark SQL. Does it co-partition the two tables, or does it go through a wide dependency? I have two big tables to join, and the query creates more than 150 GB of temporary data, so it stops because I have no space left on my disk. I guess I could use a
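
For comparison, with plain pair RDDs you can force co-partitioning yourself so the join becomes a narrow dependency (a sketch; the RDD names and partition count are placeholders, and Spark SQL's planner decides its own shuffles independently of this):

    import org.apache.spark.HashPartitioner

    val partitioner = new HashPartitioner(200)               // placeholder partition count
    val left  = leftRaw.partitionBy(partitioner).persist()   // leftRaw:  RDD[(K, V)]
    val right = rightRaw.partitionBy(partitioner).persist()  // rightRaw: RDD[(K, W)]
    // Both sides now share the same partitioner, so join() needs no extra shuffle.
    val joined = left.join(right)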

Re: How does Spark handle RDDs via HDFS?

2014-04-10 Thread gtanguy
Yes, that helps me understand better how Spark works. But it is also what I was afraid of: I think the network communication will take too much time for my job. I will keep looking for a trick to avoid network communication. I saw on the Hadoop website that: To minimize global

RDD creation on HDFS

2014-04-08 Thread gtanguy
I read in the RDD paper (http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf): "For example, an RDD representing an HDFS file has a partition for each block of the file and knows which machines each block is on." And on http://hadoop.apache.org/docs/r1.2.1/hdfs_design.html: "To minimize
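
A small sketch that makes the block-per-partition mapping visible (the HDFS path is a placeholder; preferredLocations is the locality hint the scheduler uses when placing tasks):

    val rdd = sc.textFile("hdfs:///data/input.txt")  // placeholder path
    // One partition per HDFS block by default:
    println(s"partitions = ${rdd.partitions.length}")
    // The hosts holding each block, as reported to the scheduler:
    rdd.partitions.foreach { p =>
      println(s"partition ${p.index} -> ${rdd.preferredLocations(p).mkString(", ")}")
    }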