Read a TextFile (1 record contains 4 lines) into an RDD

2014-10-25 Thread Parthus
Hi, It might be a naive question, but I still wish that somebody could help me handle it. I have a text file in which every 4 lines represent one record. Since the SparkContext.textFile() API treats one line as a record, it does not fit my case. I know that SparkContext.hadoopFile or

Re: spark-submit memory too large

2014-10-25 Thread marylucy
Version: Spark 1.1.0, 42 workers, 40 GB memory per worker. Running GraphX componentgraph takes five hours. On Oct 25, 2014, at 1:27, Sameer Farooqui same...@databricks.com wrote: That does seem a bit odd. How many Executors are running under this Driver? Does the spark-submit process start out using

NullPointerException when using Accumulators on cluster

2014-10-25 Thread octavian.ganea
Hi, I have a simple accumulator that needs to be passed to a foo() function inside a map job: val myCounter = sc.accumulator(0) val myRDD = sc.textFile(inputpath) // :spark.RDD[String] myRDD.flatMap(line => foo(line)) def foo(line: String) = { myCounter += 1 // line throwing
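For reference, a minimal sketch (with a hypothetical input path) of incrementing an accumulator from a helper function called inside flatMap, passing the accumulator explicitly instead of referencing a field of the enclosing object; whether this avoids the NPE depends on how that object is serialized, so treat it as an illustration rather than a confirmed fix:

    import org.apache.spark.{Accumulator, SparkConf, SparkContext}

    object AccumulatorInMap {
      // The closure only captures the accumulator, not the enclosing object.
      def foo(line: String, counter: Accumulator[Int]): Seq[String] = {
        counter += 1
        line.split(" ").toSeq
      }

      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("AccumulatorInMap"))
        val myCounter = sc.accumulator(0)
        val myRDD = sc.textFile("hdfs:///path/to/input")   // hypothetical path
        val words = myRDD.flatMap(line => foo(line, myCounter))
        words.count()   // an action must run before the accumulator value is populated
        println("lines seen: " + myCounter.value)
        sc.stop()
      }
    }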

Accumulators : Task not serializable: java.io.NotSerializableException: org.apache.spark.SparkContext

2014-10-25 Thread octavian.ganea
Hi all, I tried to use accumulators without any success so far. My code is simple: val sc = new SparkContext(conf) val accum = sc.accumulator(0) val partialStats = sc.textFile(f.getAbsolutePath()) .map(line => { val key = line.split("\t").head; (key, line) })
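A common cause of this exception is a closure that captures the SparkContext (or an enclosing class that holds it). A minimal sketch of the usual workaround, keeping the closure free of references to sc and reading only the accumulator (the input path is hypothetical):

    import org.apache.spark.{SparkConf, SparkContext}

    object KeyByFirstField {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("KeyByFirstField"))
        val accum = sc.accumulator(0)

        // Only `accum` (which is serializable) is captured by this closure;
        // `sc` and any non-serializable enclosing fields are not referenced.
        val partialStats = sc.textFile("/path/to/input.tsv")   // hypothetical path
          .map { line =>
            accum += 1
            val key = line.split("\t").head
            (key, line)
          }

        partialStats.count()
        println("lines processed: " + accum.value)
        sc.stop()
      }
    }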

Bug in Accumulators...

2014-10-25 Thread octavian.ganea
There is for sure a bug in the Accumulators code. More specifically, the following code works as expected: def main(args: Array[String]) { val conf = new SparkConf().setAppName("EL LBP SPARK") val sc = new SparkContext(conf) val accum = sc.accumulator(0)
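For comparison, the canonical accumulator pattern from the Spark programming guide, where the value is read only on the driver after an action has run (a sketch; the app name is reused from the snippet above and the data is made up):

    import org.apache.spark.{SparkConf, SparkContext}

    object AccumulatorExample {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("EL LBP SPARK")
        val sc = new SparkContext(conf)
        val accum = sc.accumulator(0)

        // foreach is an action, so the increments actually execute;
        // accum.value is only meaningful on the driver.
        sc.parallelize(1 to 100).foreach(x => accum += 1)
        println("accum = " + accum.value)   // expected: 100
        sc.stop()
      }
    }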

Re: Shuffle issues in the current master

2014-10-25 Thread DB Tsai
Hi Andrew, We were running the master after SPARK-3613. We'll give it another shot against the current master now that Josh has fixed a couple of issues in shuffle. Thanks. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn:

Re: Bug in Accumulators...

2014-10-25 Thread Rishi Yadav
Works fine on Spark 1.1.0 in the REPL. On Sat, Oct 25, 2014 at 1:41 PM, octavian.ganea octavian.ga...@inf.ethz.ch wrote: There is for sure a bug in the Accumulators code. More specifically, the following code works as expected: def main(args: Array[String]) { val conf = new

Asymmetric spark cluster memory utilization

2014-10-25 Thread Manas Kar
Hi, I have a Spark cluster that has 5 machines with 32 GB memory each and 2 machines with 24 GB each. I believe spark.executor.memory assigns the same executor memory to all executors. How can I use 32 GB of memory on the first 5 machines and 24 GB on the other 2? Thanks ..Manas
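In standalone mode the per-machine cap is set by each worker's own conf/spark-env.sh, while spark.executor.memory stays uniform per application; a hedged sketch using the figures from the question (whether this fits depends on the cluster manager in use):

    # conf/spark-env.sh on each of the five 32 GB machines
    SPARK_WORKER_MEMORY=32g

    # conf/spark-env.sh on each of the two 24 GB machines
    SPARK_WORKER_MEMORY=24g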

Re: Read a TextFile (1 record contains 4 lines) into an RDD

2014-10-25 Thread Xiangrui Meng
If your file is not very large, try sc.wholeTextFiles(...).values.flatMap(_.split("\n").grouped(4).map(_.mkString("\n"))) -Xiangrui On Sat, Oct 25, 2014 at 12:57 AM, Parthus peng.wei@gmail.com wrote: Hi, It might be a naive question, but I still wish that somebody could help me handle it.
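Expanded into a small, self-contained sketch (the input path is hypothetical); note that wholeTextFiles loads each file entirely into memory, which is why this approach only suits files that are not very large:

    import org.apache.spark.{SparkConf, SparkContext}

    object FourLineRecords {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("FourLineRecords"))

        // wholeTextFiles yields (fileName, fileContent) pairs, so each file's
        // content can be regrouped into 4-line records before further processing.
        val records = sc.wholeTextFiles("hdfs:///path/to/dir")   // hypothetical path
          .values
          .flatMap(_.split("\n").grouped(4).map(_.mkString("\n")))

        records.take(2).foreach(r => println(r + "\n----"))
        sc.stop()
      }
    }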

Re: Multitenancy in Spark - within/across spark context

2014-10-25 Thread RJ Nowling
Ashwin, What is your motivation for needing to share RDDs between jobs? Optimizing for reusing data across jobs? If so, you may want to look into Tachyon. My understanding is that Tachyon acts like a caching layer and you can designate when data will be reused in multiple jobs so it knows to keep

Spark as Relational Database

2014-10-25 Thread Peter Wolf
Hello all, We are considering Spark for our organization. It is obviously a superb platform for processing massive amounts of data... how about retrieving it? We are currently storing our data in a relational database in a star schema. Retrieving our data requires doing many complicated joins

Re: Spark as Relational Database

2014-10-25 Thread Soumya Simanta
1. What data store do you want to store your data in? HDFS, HBase, Cassandra, S3, or something else? 2. Have you looked at SparkSQL (https://spark.apache.org/sql/)? One option is to process the data in Spark and then store it in the relational database of your choice. On Sat, Oct 25, 2014 at
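If the SparkSQL route is taken, a minimal sketch of joining a fact table to a dimension table with the Spark 1.1-era API; the Sale/Product schema and the inline data are made up purely for illustration of a star-schema-style join:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    case class Sale(productId: Int, storeId: Int, amount: Double)
    case class Product(productId: Int, name: String)

    object StarSchemaJoin {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("StarSchemaJoin"))
        val sqlContext = new SQLContext(sc)
        import sqlContext.createSchemaRDD   // implicit RDD[case class] -> SchemaRDD conversion

        // Hypothetical fact and dimension data; in practice these would be loaded from storage.
        val sales = sc.parallelize(Seq(Sale(1, 10, 99.5), Sale(2, 10, 15.0)))
        val products = sc.parallelize(Seq(Product(1, "widget"), Product(2, "gadget")))

        sales.registerTempTable("sales")
        products.registerTempTable("products")

        // A typical star-schema query: join the fact table to a dimension and aggregate.
        val revenueByProduct = sqlContext.sql(
          """SELECT p.name, SUM(s.amount) AS revenue
            |FROM sales s JOIN products p ON s.productId = p.productId
            |GROUP BY p.name""".stripMargin)
        revenueByProduct.collect().foreach(println)
        sc.stop()
      }
    }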