I built Spark with Ganglia support:
$SPARK_HOME/build/sbt -Pspark-ganglia-lgpl -Phadoop-1 -Phive
-Phive-thriftserver assembly
...
[info] Including from cache: metrics-ganglia-3.1.0.jar
...
In the master log:
ERROR actor.OneForOneStrategy: org.apache.spark.metrics.sink.GangliaSink
Hello patelmiteshn,
This could do the trick:
rdd1 = rdd.sortBy(lambda x: x[1], ascending=False)
rdd2 = rdd1.zipWithIndex().filter(lambda t: t[1] < 1)
rdd2.saveAsTextFile("output_path")  # saveAsTextFile requires an output path
(Note: filter takes a Python lambda, not Scala's tuple => tuple._2 syntax.)
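To show what that pipeline computes, here is the same logic in plain Python (no Spark, sample data is my own): sort descending by the second field, pair each element with its index, and keep only index 0, i.e. the single top record:

```python
# Plain-Python equivalent of sortBy + zipWithIndex + filter(index < 1).
data = [("a", 3), ("b", 7), ("c", 5)]

# sortBy(lambda x: x[1], ascending=False)
rdd1 = sorted(data, key=lambda x: x[1], reverse=True)

# zipWithIndex() pairs each element with its position after the sort;
# keeping only positions < 1 leaves just the maximum.
rdd2 = [(elem, i) for i, elem in enumerate(rdd1) if i < 1]

print(rdd2)  # [(('b', 7), 0)]
```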
I would like to monitor CPU/memory usage.
I read the Metrics section at
http://spark.apache.org/docs/1.3.1/monitoring.html.
Here is my $SPARK_HOME/conf/metrics.properties:
# Enable CsvSink for all instances
*.sink.csv.class=org.apache.spark.metrics.sink.CsvSink
# Polling period for CsvSink
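For comparison, a complete CsvSink block (a sketch based on Spark's conf/metrics.properties.template; the period, unit, and directory values below are example choices, not taken from the original message) could look like:

```properties
# Enable CsvSink for all instances
*.sink.csv.class=org.apache.spark.metrics.sink.CsvSink
# Polling period for CsvSink
*.sink.csv.period=10
*.sink.csv.unit=seconds
# Directory where the CSV metrics files are written
*.sink.csv.directory=/tmp/spark-metrics
```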
Hello everybody,
I am trying to do a simple groupBy:
*Code:*
val df = hiveContext.sql("SELECT * FROM table1")
df.printSchema()
df.groupBy("customer_id").count().show(5)
*Stacktrace*:
root
|-- customer_id: string (nullable = true)
|-- rank: string (nullable = true)
|-- reco_material_id:
Hello,
I am wondering how /join/ works in Spark SQL. Does it co-partition the two
tables, or does it use a wide dependency?
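To make the co-partitioning question concrete, here is a plain-Python sketch (no Spark; the function and variable names are mine, not Spark's API) of hash partitioning: when both sides are partitioned with the same function and the same partition count, equal keys land in the same bucket, so a join can proceed bucket by bucket without a network shuffle.

```python
# Toy hash partitioner, the same idea as Spark's HashPartitioner.
def partition(records, num_partitions):
    """Group (key, value) pairs into num_partitions buckets by key hash."""
    buckets = [[] for _ in range(num_partitions)]
    for key, value in records:
        buckets[hash(key) % num_partitions].append((key, value))
    return buckets

left = [("a", 1), ("b", 2), ("c", 3)]
right = [("a", 10), ("b", 20), ("c", 30)]

left_parts = partition(left, 4)
right_parts = partition(right, 4)

# Both sides used the same partitioner, so every key sits at the same
# bucket index on both sides: the join needs no cross-partition traffic.
for i in range(4):
    left_keys = {k for k, _ in left_parts[i]}
    right_keys = {k for k, _ in right_parts[i]}
    assert left_keys == right_keys
```

A join without co-partitioning is a wide dependency: each output partition may need data from every input partition, which is what forces the shuffle.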
I have two big tables to join; the query creates more than 150 GB of temporary
data, so it stops because I have no space left on my disk.
I guess I could use a
Yes, that helps me understand better how Spark works. But it is also what I
was afraid of: I think the network communication will take too much time for
my job.
I will continue to look for a trick to avoid the network communication.
I saw on the Hadoop website that: To minimize global
I read in the RDD paper
(http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf):
"For example, an RDD representing an HDFS file has a partition for each block
of the file and knows which machines each block is on."
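The quoted sentence can be illustrated with a toy model (plain Python; the class and method names are hypothetical, not Spark's actual API): each partition corresponds to one HDFS block and records the hosts holding that block's replicas, which the scheduler uses for data-local task placement.

```python
# Toy model of an RDD over an HDFS file: one partition per block,
# each partition knowing the hosts that store its block replicas.
class Partition:
    def __init__(self, index, hosts):
        self.index = index
        self.hosts = hosts  # preferred locations for this partition

class FileRDD:
    def __init__(self, block_locations):
        # block_locations: one host list per HDFS block of the file
        self.partitions = [Partition(i, hosts)
                           for i, hosts in enumerate(block_locations)]

    def preferred_locations(self, partition):
        return partition.hosts

# A 3-block file, each block replicated on two hosts.
rdd = FileRDD([["node1", "node2"],
               ["node2", "node3"],
               ["node1", "node3"]])

# The scheduler would try to run the task for partition 0 on node1 or node2.
print(rdd.preferred_locations(rdd.partitions[0]))  # ['node1', 'node2']
```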
And this on http://hadoop.apache.org/docs/r1.2.1/hdfs_design.html:
To minimize