Re: Best way to transform string label into long label for classification problem

2016-06-28 Thread Jaonary Rabarisoa
indices. The indices are in [0, numLabels), ordered by label frequencies." > > Xinh > > On Tue, Jun 28, 2016 at 12:29 AM, Jaonary Rabarisoa > wrote: > >> Dear all, >> >> I'm trying to find a way to transform a DataFrame into a data that is

Best way to transform string label into long label for classification problem

2016-06-28 Thread Jaonary Rabarisoa
Dear all, I'm trying to find a way to transform a DataFrame into a data that is more suitable for a third party classification algorithm. The DataFrame has two columns: "feature", represented by a vector, and "label", represented by a string. I want the "label" to be a number between [0, number of
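A minimal sketch of the StringIndexer approach quoted in the reply above (column names are assumptions):

    import org.apache.spark.ml.feature.StringIndexer

    // maps string labels to doubles in [0, numLabels), most frequent label -> 0.0
    val indexer = new StringIndexer()
      .setInputCol("label")
      .setOutputCol("labelIndex")
    val indexed = indexer.fit(df).transform(df)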

GMM with diagonal covariance matrix

2015-12-21 Thread Jaonary Rabarisoa
Hi all, Is it possible to learn a gaussian mixture model with a diagonal covariance matrix in the GMM algorithm implemented in MLlib? It seems to be possible but I can't figure out how to do that. Cheers, Jao

ml.Pipeline without train step

2015-10-04 Thread Jaonary Rabarisoa
Hi there, The Pipeline of the ml package is really a great feature and we use it in our everyday tasks. But we have some use cases where we need a Pipeline of Transformers only, and the problem is that there's no train phase in that case. For example, we have a pipeline of image analytics with the foll
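A sketch of a transformer-only pipeline, assuming decoder/resizer/featurizer are hypothetical image Transformers; fit() has nothing to learn here, so the returned PipelineModel simply chains the stages:

    import org.apache.spark.ml.Pipeline

    val pipeline = new Pipeline().setStages(Array(decoder, resizer, featurizer))
    val model = pipeline.fit(imagesDF) // no estimator stages, so no training happens
    val features = model.transform(imagesDF)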

Why transformer from ml.Pipeline transform only a DataFrame ?

2015-08-28 Thread Jaonary Rabarisoa
Hi there, The actual API of ml.Transformer uses only DataFrame as input. I have a use case where I need to transform a single element, for example transforming an element from spark-streaming. Is there any reason for this, or will ml.Transformer support transforming a single element later? Che

Re: Build k-NN graph for large dataset

2015-08-26 Thread Jaonary Rabarisoa
>> If you don't want to compute all N^2 similarities, you need to implement >> some kind of blocking first. For example, LSH (locality-sensitive hashing). >> A quick search gave this link to a Spark implementation: >> >> >> http:

Build k-NN graph for large dataset

2015-08-26 Thread Jaonary Rabarisoa
Dear all, I'm trying to find an efficient way to build a k-NN graph for a large dataset. Precisely, I have a large set of high dimensional vectors (say d >>> 1) and I want to build a graph where those high dimensional points are the vertices and each one is linked to the k-nearest neighbor base
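For modest N, a brute-force sketch (points is assumed to be an RDD[(Long, Vector)] and dist a metric of your choice); for large N the blocking/LSH approach from the reply is the realistic route:

    val k = 10
    val knn = points.cartesian(points)
      .filter { case ((i, _), (j, _)) => i != j }               // drop self-pairs
      .map    { case ((i, u), (j, v)) => (i, (dist(u, v), j)) } // dist: hypothetical
      .groupByKey()
      .mapValues(_.toSeq.sortBy(_._1).take(k))                  // k nearest per vertex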

Re: spark mllib kmeans

2015-05-11 Thread Jaonary Rabarisoa
take a look at this https://github.com/derrickburns/generalized-kmeans-clustering Best, Jao On Mon, May 11, 2015 at 3:55 PM, Driesprong, Fokko wrote: > Hi Paul, > > I would say that it should be possible, but you'll need a different > distance measure which conforms to your coordinate system.

Re: SQL UserDefinedType can't be saved in parquet file when using assembly jar

2015-05-11 Thread Jaonary Rabarisoa
in the assembly jar) at runtime. Make sure the > full class name (with package name) is used. Btw, UDTs are not public > yet, so please use it with caution. -Xiangrui > > On Fri, Apr 17, 2015 at 12:45 AM, Jaonary Rabarisoa > wrote: > > Dear all, > > > > Here is an e

Re: SQL UserDefinedType can't be saved in parquet file when using assembly jar

2015-05-11 Thread Jaonary Rabarisoa
In this example, everything works except saving to the parquet file. On Mon, May 11, 2015 at 4:39 PM, Jaonary Rabarisoa wrote: > MyDenseVectorUDT does exist in the assembly jar and in this example all the > code is in a single file to make sure everything is included. > > On Tue, Apr 21,

SQL UserDefinedType can't be saved in parquet file when using assembly jar

2015-04-17 Thread Jaonary Rabarisoa
Dear all, Here is an example of code to reproduce the issue I mentioned in a previous mail about saving a UserDefinedType into a parquet file. The problem here is that the code works when I run it inside IntelliJ IDEA but fails when I create the assembly jar and run it with spark-submit. I use th
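For reference, a sketch of the (private, Spark 1.x-era) UDT API this thread is about — details vary by version, and the class names mirror the thread:

    import org.apache.spark.sql.types._

    @SQLUserDefinedType(udt = classOf[MyDenseVectorUDT])
    class MyDenseVector(val data: Array[Double]) extends Serializable

    class MyDenseVectorUDT extends UserDefinedType[MyDenseVector] {
      override def sqlType: DataType = ArrayType(DoubleType, containsNull = false)
      override def serialize(obj: Any): Any = obj match {
        case v: MyDenseVector => v.data.toSeq
      }
      override def deserialize(datum: Any): MyDenseVector = datum match {
        case values: Seq[_] => new MyDenseVector(values.asInstanceOf[Seq[Double]].toArray)
      }
      override def userClass: Class[MyDenseVector] = classOf[MyDenseVector]
    }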

Re: Problem with Spark SQL UserDefinedType and sbt assembly

2015-04-16 Thread Jaonary Rabarisoa
ot; % "javacpp" % "0.11-SNAPSHOT", "org.scalatest" % "scalatest_2.10" % "2.2.0" % "test")* On Thu, Apr 16, 2015 at 11:16 PM, Richard Marscher wrote: > If it fails with sbt-assembly but not without it, then there's always the > l

Re: Problem with Spark SQL UserDefinedType and sbt assembly

2015-04-16 Thread Jaonary Rabarisoa
Any ideas? On Thu, Apr 16, 2015 at 5:04 PM, Jaonary Rabarisoa wrote: > Dear all, > > Here is an issue that is driving me mad. I wrote a UserDefinedType in order to be > able to store a custom type in a parquet file. In my code I just create a > DataFrame with my custom data type and

Problem with Spark SQL UserDefinedType and sbt assembly

2015-04-16 Thread Jaonary Rabarisoa
Dear all, Here is an issue that is driving me mad. I wrote a UserDefinedType in order to be able to store a custom type in a parquet file. In my code I just create a DataFrame with my custom data type and write it into a parquet file. When I run my code directly inside IDEA everything works like a charm

How to get a clean DataFrame schema merge

2015-04-15 Thread Jaonary Rabarisoa
Hi all, If you follow the example of schema merging in the spark documentation http://spark.apache.org/docs/latest/sql-programming-guide.html#schema-merging you obtain the following results when you want to load the result data: single | triple | double, 1 | 3 | null, 2 | 6 | null, 4 | 1
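Later Spark versions (1.5+) make the merge explicit, which at least documents where the null columns come from — a sketch:

    // schema merging is off by default there and is requested per read
    val merged = sqlContext.read.option("mergeSchema", "true").parquet("data/table")
    merged.printSchema()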

Re: How DataFrame schema migration works ?

2015-04-14 Thread Jaonary Rabarisoa
I forgot to mention that the imageId field is a custom scala object. Do I need to implement some special method to make it work (equals, hashCode)? On Tue, Apr 14, 2015 at 5:00 PM, Jaonary Rabarisoa wrote: > Dear all, > > In the latest version of spark there's a feature call

How DataFrame schema migration works ?

2015-04-14 Thread Jaonary Rabarisoa
Dear all, In the latest version of spark there's a feature called automatic partition discovery and schema migration for parquet. As far as I know, this gives the ability to split the DataFrame into several parquet files, and by just loading the parent directory one can get the global schema of

Re: Unable to save dataframe with UDT created with sqlContext.createDataFrame

2015-04-03 Thread Jaonary Rabarisoa
chema() > > > > dataDF2.saveAsParquetFile("test3.parquet") // FAIL !!! > > } > > } > > > > > > On Tue, Mar 31, 2015 at 11:18 PM, Xiangrui Meng > wrote: > >> > >> I cannot reproduce this error on master, but I'm not awar

[SparkSQL] Zip DataFrame with a RDD

2015-04-01 Thread Jaonary Rabarisoa
Hi all, Is it possible to zip an existing DataFrame with an RDD[T] such that the result is a new DataFrame with one more column than the first one, where the additional column corresponds to the RDD[T]? In other words, is it possible to zip 2 DataFrames? Cheers, Jaonary
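There is no DataFrame.zip, but when both sides share partitioning and element counts, the usual workaround is to zip at the RDD level and rebuild — a sketch (extra is a hypothetical RDD[Double]):

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

    val zipped = df.rdd.zip(extra).map { case (row, x) => Row.fromSeq(row.toSeq :+ x) }
    val schema = StructType(df.schema.fields :+ StructField("extra", DoubleType))
    val df2 = sqlContext.createDataFrame(zipped, schema)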

Re: Unable to save dataframe with UDT created with sqlContext.createDataFrame

2015-04-01 Thread Jaonary Rabarisoa
quetFile("test3.parquet") // FAIL !!! }}* On Tue, Mar 31, 2015 at 11:18 PM, Xiangrui Meng wrote: > I cannot reproduce this error on master, but I'm not aware of any > recent bug fixes that are related. Could you build and try the current > master? -Xiangrui > > On Tue,

Unable to save dataframe with UDT created with sqlContext.createDataFrame

2015-03-31 Thread Jaonary Rabarisoa
Hi all, A DataFrame with a user defined type (here mllib.Vector) created with sqlContext.createDataFrame can't be saved to a parquet file and raises a ClassCastException: org.apache.spark.mllib.linalg.DenseVector cannot be cast to org.apache.spark.sql.Row error. Here is an example of code to reproduce t

Re: Some questions after playing a little with the new ml.Pipeline.

2015-03-31 Thread Jaonary Rabarisoa
Shivaram > > On Tue, Mar 31, 2015 at 12:50 AM, Jaonary Rabarisoa > wrote: > >> Following your suggestion, I end up with the following implementation : >> >> override def transform(dataSet: DataFrame, paramMap:

Re: Some questions after playing a little with the new ml.Pipeline.

2015-03-31 Thread Jaonary Rabarisoa
k with the > JNI calls and then convert back to RDD. > > Thanks > Shivaram > > On Mon, Mar 30, 2015 at 8:37 AM, Jaonary Rabarisoa > wrote: > >> Dear all, >> >> I'm still struggling to make a pre-trained caffe model transformer for >> dataframe works

Re: Some questions after playing a little with the new ml.Pipeline.

2015-03-30 Thread Jaonary Rabarisoa
a > great way to do something equivalent to mapPartitions with UDFs right now. > > On Tue, Mar 3, 2015 at 4:36 AM, Jaonary Rabarisoa > wrote: > >> Here is my current implementation with current master version of spark >> >> class De

Re: Solve least square problem of the form min norm(A x - b)^2^ + lambda * n * norm(x)^2 ?

2015-03-13 Thread Jaonary Rabarisoa
hu, Mar 12, 2015 at 11:36 PM, Shivaram Venkataraman < shiva...@eecs.berkeley.edu> wrote: > > On Thu, Mar 12, 2015 at 3:05 PM, Jaonary Rabarisoa > wrote: > >> In fact, by activating netlib with native libraries it goes faster. >> >> Glad you got it to work! Bet

Re: Solve least square problem of the form min norm(A x - b)^2^ + lambda * n * norm(x)^2 ?

2015-03-12 Thread Jaonary Rabarisoa
s://github.com/fommil/netlib-java#machine-optimised-system-libraries > > Thanks > Shivaram > > On Tue, Mar 10, 2015 at 9:57 AM, Jaonary Rabarisoa > wrote: > >> I'm trying to play with the implementation of least square solver (Ax = >> b) in mlmatrix.TSQR where A

Re: Solve least square problem of the form min norm(A x - b)^2^ + lambda * n * norm(x)^2 ?

2015-03-10 Thread Jaonary Rabarisoa
ter/src/main/scala/edu/berkeley/cs/amplab/mlmatrix/TSQR.scala > [2] > https://github.com/amplab/ml-matrix/blob/master/src/main/scala/edu/berkeley/cs/amplab/mlmatrix/NormalEquations.scala > > On Tue, Mar 3, 2015 at 9:01 AM, Jaonary Rabarisoa > wrote: > >> Dear all, >> >

Re: Solve least square problem of the form min norm(A x - b)^2^ + lambda * n * norm(x)^2 ?

2015-03-06 Thread Jaonary Rabarisoa
/amplab/ml-matrix/blob/master/src/main/scala/edu/berkeley/cs/amplab/mlmatrix/NormalEquations.scala > > On Tue, Mar 3, 2015 at 9:01 AM, Jaonary Rabarisoa > wrote: > >> Dear all, >> >> Is there a least square solver based on DistributedMatrix that we can use >> out

Re: Solve least square problem of the form min norm(A x - b)^2^ + lambda * n * norm(x)^2 ?

2015-03-06 Thread Jaonary Rabarisoa
l-matrix/blob/master/src/main/scala/edu/berkeley/cs/amplab/mlmatrix/NormalEquations.scala > > On Tue, Mar 3, 2015 at 9:01 AM, Jaonary Rabarisoa > wrote: > >> Dear all, >> >> Is there a least square solver based on DistributedMatrix that we can use >> out of

Re: Data Frame types

2015-03-06 Thread Jaonary Rabarisoa
Hi Cesar, Yes, you can define a UDT with the new DataFrame, the same way SchemaRDD did. Jaonary On Fri, Mar 6, 2015 at 4:22 PM, Cesar Flores wrote: > > The SchemaRDD supports the storage of user defined classes. However, in > order to do that, the user class needs to extend the UserDefi

Solve least square problem of the form min norm(A x - b)^2^ + lambda * n * norm(x)^2 ?

2015-03-03 Thread Jaonary Rabarisoa
Dear all, Is there a least squares solver based on DistributedMatrix that we can use out of the box in the current (or the master) version of spark? It seems that the only least squares solver available in spark is private to the recommender package. Cheers, Jao
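Locally (not distributed), the regularized normal equations x = (AᵀA + λnI)⁻¹Aᵀb solve this directly — a Breeze sketch; a distributed version would accumulate AᵀA and Aᵀb per partition and reduce, as ml-matrix's NormalEquations does:

    import breeze.linalg.{DenseMatrix, DenseVector}

    // minimizes norm(A x - b)^2 + lambda * n * norm(x)^2 via the normal equations
    def ridge(A: DenseMatrix[Double], b: DenseVector[Double], lambda: Double): DenseVector[Double] = {
      val n = A.rows.toDouble
      (A.t * A + DenseMatrix.eye[Double](A.cols) * (lambda * n)) \ (A.t * b)
    }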

Re: Some questions after playing a little with the new ml.Pipeline.

2015-03-03 Thread Jaonary Rabarisoa
r than stating it here) > because it changes between Spark 1.2 and 1.3. In 1.3, the DSL is much > improved and makes it easier to create a new column. > > Joseph > > On Sun, Mar 1, 2015 at 1:26 AM, Jaonary Rabarisoa > wrote: > >> class DeepCNNFeature extends Tran

Re: Some questions after playing a little with the new ml.Pipeline.

2015-03-01 Thread Jaonary Rabarisoa
class DeepCNNFeature extends Transformer ... { override def transform(data: DataFrame, paramMap: ParamMap): DataFrame = { // How can I do a map partition on the underlying RDD and then add the column ? } } On Sun, Mar 1, 2015 at 10:23 AM, Jaonary Rabarisoa wrote
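A sketch of the pattern this thread converges on: drop to the RDD, instantiate the heavy model once per partition, then rebuild the DataFrame (loadCNN and predict are hypothetical stand-ins for the caffe model; surrounding imports elided):

    override def transform(data: DataFrame, paramMap: ParamMap): DataFrame = {
      val rows = data.rdd.mapPartitions { iter =>
        val model = loadCNN()                    // one instantiation per partition
        iter.map(r => Row.fromSeq(r.toSeq :+ model.predict(r.getAs[Vector]("feature"))))
      }
      val schema = StructType(data.schema.fields :+ StructField("cnnFeature", DoubleType))
      data.sqlContext.createDataFrame(rows, schema)
    }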

Re: Some questions after playing a little with the new ml.Pipeline.

2015-03-01 Thread Jaonary Rabarisoa
econd question, I would modify the above call as follows: > > myRDD.mapPartitions { myDataOnPartition => > val myModel = // instantiate neural network on this partition > myDataOnPartition.map { myDatum => myModel.predict(myDatum) } > } > > I hope this helps! >

Some questions after playing a little with the new ml.Pipeline.

2015-02-27 Thread Jaonary Rabarisoa
Dear all, We mainly do large scale computer vision tasks (image classification, retrieval, ...). The pipeline is really great stuff for that. We're trying to reproduce the tutorial given on that topic during the latest spark summit ( http://ampcamp.berkeley.edu/5/exercises/image-classification-wit

Re: [ML][SQL] Select UserDefinedType attribute in a DataFrame

2015-02-24 Thread Jaonary Rabarisoa
or alias should be > `df.select($"image.data".as("features"))`. > > On Tue, Feb 24, 2015 at 3:35 PM, Xiangrui Meng wrote: > > If you make `Image` a case class, then select("image.data") should work. > > > > On Tue, Feb 24, 2015 at 3:06 PM,

[ML][SQL] Select UserDefinedType attribute in a DataFrame

2015-02-24 Thread Jaonary Rabarisoa
Hi all, I have a DataFrame that contains a user defined type. The type is an image with the following attribute: class Image(w: Int, h: Int, data: Vector) In my DataFrame, images are stored in a column named "image" that corresponds to the following case class: case class LabeledImage(label: Int

Re: Need some help to create user defined type for ML pipeline

2015-02-23 Thread Jaonary Rabarisoa
're using. > > Are there particular issues you're running into? > > Joseph > > On Mon, Jan 19, 2015 at 12:59 AM, Jaonary Rabarisoa > wrote: > >> Hi all, >> >> I'm trying to implement a pipeline for computer vision based on the >> l

Unable to run spark-shell after build

2015-02-03 Thread Jaonary Rabarisoa
Hi all, I'm trying to run the master version of spark in order to test some alpha components in the ml package. I followed the build spark documentation and built it with: $ mvn clean package The build is successful but when I try to run spark-shell I get the following error: Exception in thr

Re: Can't find spark-parent when using snapshot build

2015-02-02 Thread Jaonary Rabarisoa
That's what I did. On Mon, Feb 2, 2015 at 11:28 PM, Sean Owen wrote: > Snapshot builds are not published. Unless you build and install snapshots > locally (like with mvn install) they wont be found. > On Feb 2, 2015 10:58 AM, "Jaonary Rabarisoa" wrote: > >> Hi

Can't find spark-parent when using snapshot build

2015-02-02 Thread Jaonary Rabarisoa
Hi all, I'm trying to use the master version of spark. I build and install it with $ mvn clean install I managed to use it with the following configuration in my build.sbt: libraryDependencies ++= Seq( "org.apache.spark" %% "spark-core" % "1.3.0-SNAPSHOT" % "provided", "org.apache.s

Need some help to create user defined type for ML pipeline

2015-01-19 Thread Jaonary Rabarisoa
Hi all, I'm trying to implement a pipeline for computer vision based on the latest ML package in spark. The first step of my pipeline is to decode images (jpeg for instance) stored in a parquet file. For this, I begin by creating a UserDefinedType that represents a decoded image stored in an array of

Re: DeepLearning and Spark ?

2015-01-10 Thread Jaonary Rabarisoa
I've seen all >>>> kinds of hacking to improvise it: REST api, HDFS, tachyon, etc. >>>> Not sure if an 'official' benchmark & implementation will be released >>>> soon >>>> >>>> On 9 January 2015 at 10:59, Ma

DeepLearning and Spark ?

2015-01-09 Thread Jaonary Rabarisoa
Hi all, DeepLearning algorithms are popular and achieve state-of-the-art performance on several real world machine learning problems. Currently there are no DL implementations in spark and I wonder if there is ongoing work on this topic. We can do DL in spark with Sparkling Water and H2O but t

Re: MLLib: Saving and loading a model

2014-12-16 Thread Jaonary Rabarisoa
Hi, There is ongoing work on model export https://www.github.com/apache/spark/pull/3062 For now, since the model is serializable you can save it as an object file: sc.parallelize(Seq(model)).saveAsObjectFile("path") then val model = sc.objectFile[LinearRegressionModel]("path").first model.predict(...)

Re: Why KMeans with mllib is so slow ?

2014-12-15 Thread Jaonary Rabarisoa
www.dbtsai.com > > LinkedIn: https://www.linkedin.com/in/dbtsai > > > > > > On Mon, Dec 8, 2014 at 7:53 AM, Jaonary Rabarisoa > wrote: > >> After some investigation, I learned that I can't compare kmeans in mllib > >> with another kmeans impl

DIMSUM and ColumnSimilarity use case ?

2014-12-10 Thread Jaonary Rabarisoa
Dear all, I'm trying to understand what is the correct use case of ColumnSimilarity implemented in RowMatrix. As far as I know, this function computes the similarities between the columns of a given matrix. The DIMSUM paper says that it's efficient for large m (rows) and small n (columns). In this case the
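A usage sketch: rows are the observations (large m), columns are the items being compared (small n); the threshold variant enables DIMSUM sampling:

    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    val mat = new RowMatrix(rows)             // rows: RDD[Vector], one observation each
    val exact  = mat.columnSimilarities()     // all-pairs cosine similarities
    val approx = mat.columnSimilarities(0.1)  // DIMSUM sampling with threshold 0.1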

Re: Mllib native netlib-java/OpenBLAS

2014-12-09 Thread Jaonary Rabarisoa
+1 with 1.3-SNAPSHOT. On Mon, Dec 1, 2014 at 5:49 PM, agg212 wrote: > Thanks for your reply, but I'm still running into issues > installing/configuring the native libraries for MLlib. Here are the steps > I've taken, please let me know if anything is incorrect. > > - Download Spark source > - u

Re: Why KMeans with mllib is so slow ?

2014-12-05 Thread Jaonary Rabarisoa
application, I will have more than 248k data to cluster. On Fri, Dec 5, 2014 at 6:03 PM, Davies Liu wrote: > Could you post you script to reproduce the results (also how to > generate the dataset)? That will help us to investigate it. > > On Fri, Dec 5, 2014 at 8:40 AM, Jaonary Rabar

Re: Why KMeans with mllib is so slow ?

2014-12-05 Thread Jaonary Rabarisoa
ally use Spark when you > have a problem large enough to warrant distributing, or, your data > already lives in a distributed store like HDFS. > > But it's also possible you're not configuring the implementations the > same way, yes. There's not enough info here really to

Why KMeans with mllib is so slow ?

2014-12-05 Thread Jaonary Rabarisoa
Hi all, I'm trying to run a clustering with the kmeans algorithm. The size of my data set is about 240k vectors of dimension 384. Solving the problem with the kmeans available in julia (kmeans++) http://clusteringjl.readthedocs.org/en/latest/kmeans.html takes about 8 minutes on a single core. Solvin

Re: Why my default partition size is set to 52 ?

2014-12-05 Thread Jaonary Rabarisoa
e Hadoop > InputFormat would make 52 splits for it. Data drives partitions, not > processing resource. Really, 8 splits is the minimum parallelism you > want. Several times your # of cores is better. > > On Fri, Dec 5, 2014 at 8:51 AM, Jaonary Rabarisoa > wrote: > > Hi all, > >

Why my default partition size is set to 52 ?

2014-12-05 Thread Jaonary Rabarisoa
Hi all, I'm trying to run some spark jobs with spark-shell. What I want to do is just count the number of lines in a file. I start the spark-shell with the default arguments, i.e. just with ./bin/spark-shell, load the text file with sc.textFile("path") and then call count on my data. When I do th
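As the reply explains, splits are data-driven; a sketch of the only knob available, a lower bound on the split count:

    val lines = sc.textFile("hdfs:///big.txt", 64) // minPartitions = 64
    println(lines.partitions.size)                 // >= 64, data permitting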

Understanding and optimizing spark disk usage during a job.

2014-11-28 Thread Jaonary Rabarisoa
Dear all, I have a job that crashes before its end because of no space left on device, and I noticed that this job generates a lot of temporary data on my disk. To be precise, the job is a simple map job that takes a set of images, extracts local features and saves these local features as a seque

Store kmeans model

2014-11-24 Thread Jaonary Rabarisoa
Dear all, How can one save a kmeans model after training ? Best, Jao

Re: Got java.lang.SecurityException: class "javax.servlet.FilterRegistration"'s when running job from intellij Idea

2014-11-04 Thread Jaonary Rabarisoa
" % "2.2.0" % "test" ) resolvers += "Akka Repository" at "http://repo.akka.io/releases/"; On Tue, Nov 4, 2014 at 11:00 AM, Sean Owen wrote: > Generally this means you included some javax.servlet dependency in > your project deps. You should exclude

Got java.lang.SecurityException: class "javax.servlet.FilterRegistration"'s when running job from intellij Idea

2014-11-03 Thread Jaonary Rabarisoa
Hi all, I have a spark job that I build with sbt and I can run without any problem with sbt run. But when I run it inside IntelliJ Idea I got the following error: Exception encountered when invoking run on a nested suite - class "javax.servlet.FilterRegistration"'s signer information does not m

Re: unable to make a custom class as a key in a pairrdd

2014-10-24 Thread Jaonary Rabarisoa
e the type of the id param to Int it works for me but I don't >> know why. >> >> case class PersonID(id: Int) >> >> Looks like a strange behavior to me. Have a try. >> >> Good luck, >> Niklas >> >> >> On 23.10.2014 21:52, Jaonary

unable to make a custom class as a key in a pairrdd

2014-10-23 Thread Jaonary Rabarisoa
Hi all, I have the following case class that I want to use as a key in a key-value rdd. I defined the equals and hashCode methods but it's not working. What am I doing wrong? case class PersonID(id: String) { override def hashCode = id.hashCode override def equals(other: Any) = o
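A sketch completing the truncated definition above; note that case classes already generate equals/hashCode, and classes defined inside the spark-shell REPL are a known source of surprises when used as keys:

    case class PersonID(id: String) {
      override def hashCode = id.hashCode
      override def equals(other: Any) = other match {
        case PersonID(otherId) => otherId == id
        case _                 => false
      }
    }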

Re: Optimizing pairwise similarity computation or how to avoid RDD.cartesian operation ?

2014-10-17 Thread Jaonary Rabarisoa
lib/CosineSimilarity.scala> > . > > This implements the DIMSUM sampling scheme, recently merged into master > <https://github.com/apache/spark/pull/1778>. > > Best, > Reza > > On Fri, Oct 17, 2014 at 3:43 AM, Jaonary Rabarisoa > wrote: > >> Hi all, >>

Optimizing pairwise similarity computation or how to avoid RDD.cartesian operation ?

2014-10-17 Thread Jaonary Rabarisoa
Hi all, I need to compute a similarity between elements of two large sets of high dimensional feature vectors. Naively, I create all possible pairs of vectors with features1.cartesian(features2) and then map the produced paired rdd with my similarity function. The problem is that the cartesian

Re: key class requirement for PairedRDD ?

2014-10-17 Thread Jaonary Rabarisoa
you are > getting? > > Best Regards, > Sonal > Nube Technologies <http://www.nubetech.co> > > <http://in.linkedin.com/in/sonalgoyal> > > > > On Fri, Oct 17, 2014 at 12:28 PM, Jaonary Rabarisoa > wrote: > >> Dear all, >> >> Is it p

key class requirement for PairedRDD ?

2014-10-16 Thread Jaonary Rabarisoa
Dear all, Is it possible to use any kind of object as a key in a PairedRDD? When I use a case class key, the groupByKey operation doesn't behave as I expected. I want to use a case class to avoid using a large tuple as it is easier to manipulate. Cheers, Jaonary

Re: Interactive interface tool for spark

2014-10-12 Thread Jaonary Rabarisoa
And what about Hue http://gethue.com ? On Sun, Oct 12, 2014 at 1:26 PM, andy petrella wrote: > Dear Sparkers, > > As promised, I've just updated the repo with a new name (for the sake of > clarity), default branch but specially with a dedicated README containing: > > * explanations on how to lau

Re: java.lang.OutOfMemoryError: Java heap space when running job via spark-submit

2014-10-09 Thread Jaonary Rabarisoa
in fact with --driver-memory 2G I can get it working On Thu, Oct 9, 2014 at 6:20 PM, Xiangrui Meng wrote: > Please use --driver-memory 2g instead of --conf > spark.driver.memory=2g. I'm not sure whether this is a bug. -Xiangrui > > On Thu, Oct 9, 2014 at 9:00 AM, Jaonary R

java.lang.OutOfMemoryError: Java heap space when running job via spark-submit

2014-10-09 Thread Jaonary Rabarisoa
Dear all, I have a spark job with the following configuration val conf = new SparkConf() .setAppName("My Job") .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") .set("spark.kryo.registrator", "value.serializer.Registrator") .setMaster("local[4]")

Protocol buffers with Spark ?

2014-10-01 Thread Jaonary Rabarisoa
Dear all, I have a spark job that communicates with a C++ code using pipe. Since the data I need to send is rather complicated, I am thinking about using protobuf to serialize it. The problem is that the string form of my data outputted by protobuf contains the "\n" character, so it is a bit complicated to
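One workaround sketch: base64-encode each serialized message so the piped stream stays line-oriented despite embedded "\n" bytes (messages is a hypothetical RDD[Array[Byte]] of protobuf payloads, ./my_cpp_tool a hypothetical consumer that decodes base64 per line; java.util.Base64 needs Java 8):

    import java.util.Base64

    val encoded = messages.map(bytes => Base64.getEncoder.encodeToString(bytes))
    val results = encoded.pipe("./my_cpp_tool")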

Re: Build error when using spark with breeze

2014-09-26 Thread Jaonary Rabarisoa
chanism.html#Dependency_Scope > > Cheers > > On Fri, Sep 26, 2014 at 8:57 AM, Jaonary Rabarisoa > wrote: > >> Thank Ted. Can you tell me how to adjust the scope ? >> >> On Fri, Sep 26, 2014 at 5:47 PM, Ted Yu wrote: >> >>> spark-c

Re: Build error when using spark with breeze

2014-09-26 Thread Jaonary Rabarisoa
> Adjusting the scope should solve the problem below. > > On Fri, Sep 26, 2014 at 8:42 AM, Jaonary Rabarisoa > wrote: > >> Hi all, >> >> I'm using some functions from Breeze in a spark job but I get the >> following build error : >> >> Error:s

Build error when using spark with breeze

2014-09-26 Thread Jaonary Rabarisoa
Hi all, I'm using some functions from Breeze in a spark job but I get the following build error: Error:scalac: bad symbolic reference. A signature in RandBasis.class refers to term math3 in package org.apache.commons which is not available. It may be completely missing from the current clas
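A likely fix, assuming breeze treats commons-math3 as an optional dependency so it never lands on the classpath: declare it explicitly in build.sbt (the version is an assumption):

    libraryDependencies += "org.apache.commons" % "commons-math3" % "3.4.1"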

Better way to process large image data set ?

2014-09-18 Thread Jaonary Rabarisoa
Hi all, I'm trying to process a large image data set and need some way to optimize my implementation since it's very slow for now. In my current implementation I store my images in an object file with the following fields: case class Image(groupId: String, imageId: String, buffer: String) Images

Why I get java.lang.OutOfMemoryError: Java heap space with join ?

2014-09-12 Thread Jaonary Rabarisoa
Dear all, I'm facing the following problem and I can't figure out how to solve it. I need to join 2 rdds in order to find their intersection. The first RDD represents images encoded as base64 strings, each associated with an image id. The second RDD represents a set of geometric primitives (rectangles) associa

RDD.pipe error on context cleaning

2014-09-01 Thread Jaonary Rabarisoa
Dear all, When calling an external process with RDD.pipe I got the following error: Not interrupting system thread Thread[process reaper,10,system] Not interrupting system thread Thread[process reaper,10,system] Not interrupting system thread Thread[process reaper,10,system] 14/09/01 10

Re: Spark SQL : how to find element where a field is in a given set

2014-08-29 Thread Jaonary Rabarisoa
1.0.2 On Friday, August 29, 2014, Michael Armbrust wrote: > What version are you using? > > > > On Fri, Aug 29, 2014 at 2:22 AM, Jaonary Rabarisoa > wrote: > >> Still not working for me. I got a compilation error : value in is not a >> member of Symbol. An

Re: Spark SQL : how to find element where a field is in a given set

2014-08-29 Thread Jaonary Rabarisoa
[Expression]("a", "b", ...) > table("src").where('key in (longList: _*)) > > Also, note that I had to explicitly specify Expression as the type > parameter of Seq to ensure that the compiler converts "a" and "b" into > Spark SQ

Re: Spark SQL : how to find element where a field is in a given set

2014-08-28 Thread Jaonary Rabarisoa
le.where('name in ("foo", "bar")) > > > > On Thu, Aug 28, 2014 at 3:09 AM, Jaonary Rabarisoa > wrote: > >> Hi all, >> >> What is the expression that I should use with spark sql DSL if I need to >> retreive >> data with a fi

Spark SQL : how to find element where a field is in a given set

2014-08-28 Thread Jaonary Rabarisoa
Hi all, What is the expression that I should use with the spark sql DSL if I need to retrieve data with a field in a given set? For example: I have the following schema case class Person(name: String, age: Int) And I need to do something like: personTable.where('name in Seq("foo", "bar")) ? Ch

Re: spark and matlab

2014-08-27 Thread Jaonary Rabarisoa
forgot the second point, I found the answer myself inside the source code PipedRDD :) On Wed, Aug 27, 2014 at 1:36 PM, Jaonary Rabarisoa wrote: > Thank you Matei. > > I found a solution using pipe and matlab engine (an executable that can > call matlab behind the scene and us

Re: spark and matlab

2014-08-27 Thread Jaonary Rabarisoa
the command line. Just watch out for any environment variables > needed (you can pass them to pipe() as an optional argument if there are > some). > > On August 25, 2014 at 12:41:29 AM, Jaonary Rabarisoa (jaon...@gmail.com) > wrote: > > Hi all, > > Is there someone

External dependencies management with spark

2014-08-27 Thread Jaonary Rabarisoa
Dear all, I'm looking for an efficient way to manage external dependencies. I know that one can add .jar or .py dependencies easily, but how can I handle other types of dependencies? Specifically, I have some data processing algorithms implemented in other languages (ruby, octave, matlab, c++) and

spark and matlab

2014-08-25 Thread Jaonary Rabarisoa
Hi all, Has anyone tried to pipe an RDD into a matlab script? I'm trying to do something similar, so any hints would be appreciated. Best regards, Jao

RDD pipe partitionwise

2014-07-21 Thread Jaonary Rabarisoa
Dear all, Is there any example of mapPartitions that forks an external process, or of how to make RDD.pipe work on all the data of a partition? Cheers, Jaonary
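RDD.pipe already forks one process per partition; for full control, a mapPartitions sketch (./mytool is a hypothetical binary reading lines on stdin, writing lines on stdout):

    import java.io.ByteArrayInputStream
    import scala.sys.process._

    val out = rdd.mapPartitions { iter =>
      val input = new ByteArrayInputStream(iter.mkString("", "\n", "\n").getBytes("UTF-8"))
      val stdout = ("./mytool" #< input).!!   // feed the whole partition, capture stdout
      stdout.split("\n").iterator
    }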

Re: Ambiguous references to id : what does it mean ?

2014-07-15 Thread Jaonary Rabarisoa
is your query? Did you use the Hive Parser (your query was > submitted through hql(...)) or the basic SQL Parser (your query was > submitted through sql(...)). > > Thanks, > > Yin > > > On Tue, Jul 15, 2014 at 8:52 AM, Jaonary Rabarisoa > wrote: > >> Hi all,

Ambiguous references to id : what does it mean ?

2014-07-15 Thread Jaonary Rabarisoa
Hi all, When running a join operation with Spark SQL I got the following error : Exception in thread "main" org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Ambiguous references to id: (id#303,List()),(id#0,List()), tree: Filter ('videoId = 'id) Join Inner, None ParquetRelation

Store one to many relation ship in parquet file with spark sql

2014-07-15 Thread Jaonary Rabarisoa
Hi all, How should I store a one-to-many relationship using spark sql and the parquet format? For example, I have the following case class: case class Person(key: String, name: String, friends: Array[String]) It gives an error when I try to insert the data in a parquet file. It doesn't like the Array[String]
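A sketch of the usual workaround from that era (an assumption based on early Spark SQL's reflection-based schema inference, which handled Seq but not Array):

    case class Person(key: String, name: String, friends: Seq[String])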

Re: Need advice to create an objectfile of set of images from Spark

2014-07-09 Thread Jaonary Rabarisoa
single object in 1 rdd > would perhaps not be super optimized. > Regards > Mayur > > Mayur Rustagi > Ph: +1 (760) 203 3257 > http://www.sigmoidanalytics.com > @mayur_rustagi <https://twitter.com/mayur_rustagi> > > > > On Wed, Jul 9, 2014 at 12:17 PM, Jaonary

Need advice to create an objectfile of set of images from Spark

2014-07-08 Thread Jaonary Rabarisoa
Hi all, I need to run a spark job that needs a set of images as input. I need something that loads these images as an RDD but I just don't know how to do that. Do any of you have an idea? Cheers, Jao

Configure and run external process with RDD.pipe

2014-07-02 Thread Jaonary Rabarisoa
Hi all, I need to run a complex external process with a lot of dependencies from spark. The "pipe" and "addFile" functions seem to be my friends but there are just some issues that I need to solve. Precisely, the processes I want to run are C++ executables that may depend on some libraries and addit
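A sketch of the addFile + pipe combination for shipping the executable itself (the path is hypothetical; shared libraries would be shipped the same way and located via the environment map that pipe accepts):

    sc.addFile("hdfs:///tools/mytool")                  // ship the binary to every node
    val cmd = org.apache.spark.SparkFiles.get("mytool") // resolves as-is in local mode;
                                                        // on a cluster resolve inside the task
    val out = rdd.pipe(cmd)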

wholeTextFiles like for binary files ?

2014-06-25 Thread Jaonary Rabarisoa
Is there an equivalent of wholeTextFiles for binary files, for example a set of images? Cheers, Jaonary
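Spark 1.2 added exactly this; a sketch:

    val images = sc.binaryFiles("hdfs:///images") // RDD[(String, PortableDataStream)]
    val bytes  = images.mapValues(_.toArray())    // filename -> raw bytes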

Need help to make spark sql works in stand alone application

2014-06-25 Thread Jaonary Rabarisoa
Hi all, I'm trying to use spark sql to store data in a parquet file. I create the file and insert data into it with the following code: val conf = new SparkConf().setAppName("MCT").setMaster("local[2]") val sc = new SparkContext(conf) val sqlContext = new SQLContext(sc)

Re: Using Spark as web app backend

2014-06-25 Thread Jaonary Rabarisoa
n talk to > > > On Tue, Jun 24, 2014 at 3:12 AM, Jaonary Rabarisoa > wrote: > >> Hi all, >> >> So far, I run my spark jobs with spark-shell or spark-submit command. I'd >> like to go further and I wonder how to use spark as a backend of a web >>

Using Spark as web app backend

2014-06-24 Thread Jaonary Rabarisoa
Hi all, So far, I run my spark jobs with spark-shell or the spark-submit command. I'd like to go further and I wonder how to use spark as a backend of a web application. Specifically, I want a frontend application (built with nodejs) to communicate with spark on the backend, so that every query from

Re: Hybrid GPU CPU computation

2014-04-11 Thread Jaonary Rabarisoa
ltiple gpu machines. > > Sent from my iPhone > > > On Apr 11, 2014, at 8:38 AM, Jaonary Rabarisoa > wrote: > > > > Hi all, > > > > I'm just wondering if hybrid GPU/CPU computation is something that is > feasible with spark ? And what should be the best way to do it. > > > > > > Cheers, > > > > Jaonary >

Hybrid GPU CPU computation

2014-04-11 Thread Jaonary Rabarisoa
Hi all, I'm just wondering if hybrid GPU/CPU computation is something that is feasible with spark ? And what should be the best way to do it. Cheers, Jaonary

Re: Strange behavior of RDD.cartesian

2014-04-03 Thread Jaonary Rabarisoa
es this? >> >> Matei >> >> On Mar 28, 2014, at 3:09 AM, Jaonary Rabarisoa wrote: >> >> I forgot to mention that I don't really use all of my data. Instead I use >> a sample extracted with randomSample. >> >> >> On Fri, Mar 28, 2014 at

Use combineByKey and StatCount

2014-04-01 Thread Jaonary Rabarisoa
Hi all, Can someone give me some tips to compute the mean of an RDD by key, maybe with combineByKey and StatCounter. Cheers, Jaonary
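A sketch of the (sum, count) accumulator approach, which avoids materializing per-key groups (pairs is a hypothetical RDD[(K, Double)]):

    val means = pairs.combineByKey(
        (v: Double) => (v, 1L),                                        // createCombiner
        (acc: (Double, Long), v: Double) => (acc._1 + v, acc._2 + 1L), // mergeValue
        (a: (Double, Long), b: (Double, Long)) => (a._1 + b._1, a._2 + b._2)
      ).mapValues { case (sum, count) => sum / count }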

Re: Strange behavior of RDD.cartesian

2014-03-28 Thread Jaonary Rabarisoa
I forgot to mention that I don't really use all of my data. Instead I use a sample extracted with randomSample. On Fri, Mar 28, 2014 at 10:58 AM, Jaonary Rabarisoa wrote: > Hi all, > > I notice that RDD.cartesian has a strange behavior with cached and > uncached data. More pr

Strange behavior of RDD.cartesian

2014-03-28 Thread Jaonary Rabarisoa
Hi all, I notice that RDD.cartesian has a strange behavior with cached and uncached data. More precisely, I have a set of data that I load with objectFile val data: RDD[(Int,String,Array[Double])] = sc.objectFile("data") Then I split it in two sets depending on some criteria val part1 = data

Re: java.lang.ClassNotFoundException

2014-03-26 Thread Jaonary Rabarisoa
m, you need > to be careful - ObjectInputStream uses root classloader to load classes and > does not work with jars that are added to TCCC. Apache commons has > ClassLoaderObjectInputStream to workaround this. > > > On Wed, Mar 26, 2014 at 1:38 PM, Jaonary Rabarisoa wrote:

Re: java.lang.ClassNotFoundException

2014-03-26 Thread Jaonary Rabarisoa
wrote: > > Have you looked through the logs fully? I have seen this (in my limited > > experience) pop up as a result of previous exceptions/errors, also as a > > result of being unable to serialize objects etc. > > Ognen > > > > > > On 3/26/14, 10:39
