RE: Pivot Data in Spark and Scala

2015-10-30 Thread Andrianasolo Fanilo
Hey, the question is tricky. Here is a possible answer: define the years as keys of a hashmap per client, then merge those hashmaps:

import scalaz._
import Scalaz._

val sc = new SparkContext("local[*]", "sandbox")

// Create RDD of your objects
val rdd = sc.parallelize(Seq(
  ("A", 2015, 4),
  ("A",
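A runnable sketch of that hashmap-merge idea, with plain Scala map merging standing in for scalaz's |+| so no extra dependency is needed; the (client, year, value) triples extend the sample above and the extra rows are made up:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("sandbox"))

val rdd = sc.parallelize(Seq(("A", 2015, 4), ("A", 2014, 12), ("B", 2015, 24)))

// One single-entry map per record: year -> value
val byClient = rdd.map { case (client, year, value) => (client, Map(year -> value)) }

// Merge the per-record maps client by client; each merged map is one pivoted row
val pivoted = byClient.reduceByKey { (m1, m2) =>
  m2.foldLeft(m1) { case (acc, (year, value)) =>
    acc + (year -> (acc.getOrElse(year, 0) + value))
  }
}

pivoted.collect().foreach(println)   // e.g. (A,Map(2015 -> 4, 2014 -> 12))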

RE: Loading binary files from NFS share

2015-10-26 Thread Andrianasolo Fanilo
Hi again, I found this: https://github.com/NetApp/NetApp-Hadoop-NFS-Connector Maybe it will at least let you read NFS data from Spark. Has anyone from the community used it? BR, Fanilo From: Andrianasolo Fanilo Sent: Monday, October 26, 2015 15:24 To: 'Kayode Odeyemi'; user Subject: RE
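For what it's worth, a sketch of how such a connector would be wired in, assuming it behaves like other Hadoop FileSystem plugins; the fs.nfs.impl key, the NFSv3FileSystem class name and the nfs:// URI below are assumptions taken from that project's README and should be verified there:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("nfs-read"))

// Register the NFS FileSystem implementation shipped by the connector
// (configuration key and class name assumed from the connector's README)
sc.hadoopConfiguration.set("fs.nfs.impl", "org.apache.hadoop.fs.nfs.NFSv3FileSystem")

// With the connector jar on the classpath, binaryFiles can then address
// the share through the nfs:// scheme like any other Hadoop filesystem
val files = sc.binaryFiles("nfs://nfs-server:2049/export/data")
println(files.count())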

RE: Loading binary files from NFS share

2015-10-26 Thread Andrianasolo Fanilo
Hi, I believe binaryFiles uses a custom Hadoop InputFormat, so it can only read filesystems that expose a Hadoop-supported protocol. You can find the full list of supported protocols by searching “Hadoop filesystems hdfs hftp” on Google (the link I found is a little bit long and references the Hadoop Definitive Guide,
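To make that concrete, binaryFiles accepts any URI scheme backed by a registered Hadoop FileSystem implementation; a small illustration, with hypothetical paths:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("binary-read"))

// Local filesystem: supported out of the box through the file:// scheme
val local = sc.binaryFiles("file:///tmp/images")

// HDFS: also a built-in Hadoop filesystem
val onHdfs = sc.binaryFiles("hdfs://namenode:8020/data/images")

// Each record is (path, PortableDataStream); the stream is opened lazily on the executor
local.map { case (path, stream) => (path, stream.toArray().length) }
  .collect()
  .foreach(println)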

RE: Analyzing consecutive elements

2015-10-22 Thread Andrianasolo Fanilo
Hi Sampo, there is a sliding method you could try in the org.apache.spark.mllib.rdd.RDDFunctions class, though it’s DeveloperApi stuff (https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.rdd.RDDFunctions)

import org.apache.spark.{SparkConf, SparkContext}
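A minimal example of sliding over consecutive elements; the input numbers are made up:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.rdd.RDDFunctions._

val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("sliding"))

val values = sc.parallelize(Seq(1, 3, 2, 6, 8))

// sliding(2) yields every pair of consecutive elements as an Array
val deltas = values.sliding(2).map { case Array(a, b) => b - a }

deltas.collect().foreach(println)   // 2, -1, 4, 2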

RDD caching, memory network input

2015-01-28 Thread Andrianasolo Fanilo
Hello Spark fellows :), I think I need some help to understand how .cache and task input work within a job. I have a 7 GB input matrix in HDFS that I load using .textFile(). I also have a config file which contains an array of 12 Logistic Regression model parameters, loaded as an
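A sketch of the pattern under discussion, with hypothetical names standing in for the poster's 7 GB matrix and 12 parameter sets: cache the parsed input once so the 12 model runs read it from memory instead of re-pulling it from HDFS each time.

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("cache-demo"))

// Parse the matrix once and pin it in memory; without .cache every
// pass would re-read and re-parse the text from HDFS
val features = sc.textFile("hdfs:///data/matrix")
  .map(_.split(" ").map(_.toDouble))
  .cache()

// Hypothetical stand-in for the 12 Logistic Regression parameter sets
val paramSets = (1 to 12).map(_.toDouble)

// Each pass reuses the cached RDD; only the first triggers the HDFS read
val scores = paramSets.map { w =>
  features.map(row => row.sum * w).sum()
}
scores.foreach(println)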

RE: RDD caching, memory network input

2015-01-28 Thread Andrianasolo Fanilo
= PredictionReader.getFeatures(…).cache where getFeatures() loads the file and then parses it. From: Sandy Ryza [mailto:sandy.r...@cloudera.com] Sent: Wednesday, January 28, 2015 17:12 To: Andrianasolo Fanilo Cc: user@spark.apache.org Subject: Re: RDD caching, memory network input Hi Fanilo, How many

Object serialisation inside closures

2014-09-04 Thread Andrianasolo Fanilo
Hello Spark fellows :) I'm a new user of Spark and Scala and have been using both for 6 months without too many problems. Here I'm looking for best practices for using non-serializable classes inside closures. I'm using Spark 0.9.0-incubating here with Hadoop 2.2. Suppose I am using OpenCSV
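A common workaround sketch for the pattern this thread is about: build the non-serializable object (here an OpenCSV CSVParser, per the message) once per partition inside mapPartitions, so it never has to be shipped from the driver; the CSVParser usage assumes OpenCSV's au.com.bytecode.opencsv API and the input path is hypothetical.

import org.apache.spark.{SparkConf, SparkContext}
import au.com.bytecode.opencsv.CSVParser

val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("csv"))

val lines = sc.textFile("file:///tmp/data.csv")

// Construct the parser on the executor, once per partition, instead of
// capturing a driver-side instance in the closure (which would fail
// serialization for non-serializable classes)
val parsed = lines.mapPartitions { iter =>
  val parser = new CSVParser(',')
  iter.map(line => parser.parseLine(line))
}

parsed.take(5).foreach(row => println(row.mkString("|")))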

RE: Object serialisation inside closures

2014-09-04 Thread Andrianasolo Fanilo
data within an executor, sadly... Thanks for the input. Fanilo -Original Message- From: Sean Owen [mailto:so...@cloudera.com] Sent: Thursday, September 4, 2014 15:36 To: Andrianasolo Fanilo Cc: user@spark.apache.org Subject: Re: Object serialisation inside closures In your original version