Re: Best way to transform string label into long label for classification problem

2016-06-28 Thread Jaonary Rabarisoa
column of label indices. The indices are in [0, numLabels), ordered by label frequencies." Xinh On Tue, Jun 28, 2016 at 12:29 AM, Jaonary Rabarisoa <jaon...@gmail.com> wrote: Dear all, I'm trying to find a way to transform a D
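
A minimal sketch of the StringIndexer approach quoted above (the DataFrame df and the column names are assumptions):

    import org.apache.spark.ml.feature.StringIndexer

    // Maps the string "label" column to a double column of indices in
    // [0, numLabels), most frequent label first, as described above.
    val indexer = new StringIndexer()
      .setInputCol("label")
      .setOutputCol("labelIndex")
    val indexed = indexer.fit(df).transform(df)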

Best way to transform string label into long label for classification problem

2016-06-28 Thread Jaonary Rabarisoa
Dear all, I'm trying to find a way to transform a DataFrame into data that is more suitable for a third-party classification algorithm. The DataFrame has two columns: "feature", represented by a vector, and "label", represented by a string. I want the "label" to be a number in [0, number of

GMM with diagonal covariance matrix

2015-12-21 Thread Jaonary Rabarisoa
Hi all, Is it possible to learn a Gaussian mixture model with a diagonal covariance matrix with the GMM algorithm implemented in MLlib? It seems to be possible, but I can't figure out how to do it. Cheers, Jao
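
A minimal MLlib GMM fit for reference, assuming an existing SparkContext sc and toy data; note that the fitted sigma of each component is a full covariance matrix, and the public API does not appear to expose a diagonal-only constraint:

    import org.apache.spark.mllib.clustering.GaussianMixture
    import org.apache.spark.mllib.linalg.Vectors

    val data = sc.parallelize(Seq(
      Vectors.dense(1.0, 2.0), Vectors.dense(1.1, 2.1),
      Vectors.dense(9.0, 8.0), Vectors.dense(9.2, 8.1)))
    val gmm = new GaussianMixture().setK(2).run(data)
    // Each component exposes its mean and (full) covariance matrix.
    gmm.gaussians.foreach(g => println(s"mu=${g.mu} sigma=${g.sigma}"))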

ml.Pipeline without train step

2015-10-04 Thread Jaonary Rabarisoa
Hi there, The Pipeline of the ml package is really a great feature and we use it in our everyday tasks. But we have some use cases where we need a Pipeline of Transformers only, and the problem is that there's no train phase in that case. For example, we have a pipeline of image analytics with the
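
A transformer-only pipeline still goes through fit(), but fit() trains nothing because no stage is an Estimator; it only produces the PipelineModel used for transform(). A sketch with two stock Transformers (df and the column names are assumptions):

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

    // Both stages are Transformers, so fit() has no train phase.
    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val tf = new HashingTF().setInputCol("words").setOutputCol("features")
    val model = new Pipeline().setStages(Array(tokenizer, tf)).fit(df)
    val transformed = model.transform(df)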

Why transformer from ml.Pipeline transform only a DataFrame ?

2015-08-28 Thread Jaonary Rabarisoa
Hi there, The current API of ml.Transformer takes only a DataFrame as input. I have a use case where I need to transform a single element, for example an element coming from spark-streaming. Is there a reason for this, or will ml.Transformer support transforming a single element later?
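
One common workaround, sketched here with hypothetical names (element, myTransformer) and an existing sqlContext: wrap the element in a one-row DataFrame, transform it, and collect the value back:

    import sqlContext.implicits._

    // Wrap the single element so the DataFrame-only API can be applied.
    val single = Seq(element).toDF("input")
    val result = myTransformer.transform(single).collect().head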

Re: Build k-NN graph for large dataset

2015-08-27 Thread Jaonary Rabarisoa
(locality-sensitive hashing). A quick search gave this link to a Spark implementation: http://stackoverflow.com/questions/2771/spark-implementation-for-locality-sensitive-hashing On Wed, Aug 26, 2015 at 7:35 AM, Jaonary Rabarisoa jaon...@gmail.com wrote: Dear all, I'm trying to find

Build k-NN graph for large dataset

2015-08-26 Thread Jaonary Rabarisoa
Dear all, I'm trying to find an efficient way to build a k-NN graph for a large dataset. Precisely, I have a large set of high-dimensional vectors (say d 1) and I want to build a graph where those high-dimensional points are the vertices and each one is linked to its k nearest neighbors based

Re: SQL UserDefinedType can't be saved in parquet file when using assembly jar

2015-05-11 Thread Jaonary Rabarisoa
In this example, everything works except saving to the parquet file. On Mon, May 11, 2015 at 4:39 PM, Jaonary Rabarisoa jaon...@gmail.com wrote: MyDenseVectorUDT does exist in the assembly jar, and in this example all the code is in a single file to make sure everything is included. On Tue, Apr 21

Re: SQL UserDefinedType can't be saved in parquet file when using assembly jar

2015-05-11 Thread Jaonary Rabarisoa
(or in the assembly jar) at runtime. Make sure the full class name (with package name) is used. Btw, UDTs are not public yet, so please use them with caution. -Xiangrui On Fri, Apr 17, 2015 at 12:45 AM, Jaonary Rabarisoa jaon...@gmail.com wrote: Dear all, Here is an example of code

Problem with Spark SQL UserDefinedType and sbt assembly

2015-04-16 Thread Jaonary Rabarisoa
Dear all, Here is an issue that drives me mad. I wrote a UserDefinedType in order to be able to store a custom type in a parquet file. In my code I just create a DataFrame with my custom data type and write it into a parquet file. When I run my code directly inside IDEA everything works like a

Re: Problem with Spark SQL UserDefinedType and sbt assembly

2015-04-16 Thread Jaonary Rabarisoa
Any ideas? On Thu, Apr 16, 2015 at 5:04 PM, Jaonary Rabarisoa jaon...@gmail.com wrote: Dear all, Here is an issue that drives me mad. I wrote a UserDefinedType in order to be able to store a custom type in a parquet file. In my code I just create a DataFrame with my custom data type and write

Re: Problem with Spark SQL UserDefinedType and sbt assembly

2015-04-16 Thread Jaonary Rabarisoa
assembly jar? On Thu, Apr 16, 2015 at 4:46 PM, Jaonary Rabarisoa jaon...@gmail.com wrote: Any ideas? On Thu, Apr 16, 2015 at 5:04 PM, Jaonary Rabarisoa jaon...@gmail.com wrote: Dear all, Here is an issue that drives me mad. I wrote a UserDefinedType in order to be able to store a custom type

How to get a clean DataFrame schema merge

2015-04-15 Thread Jaonary Rabarisoa
Hi all, If you follow the schema-merging example in the Spark documentation (http://spark.apache.org/docs/latest/sql-programming-guide.html#schema-merging), you obtain the following result when you load the merged data:

single  triple  double
1       3       null
2       6       null
4
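
For context, a condensed sketch of the guide's example that produces this table (sqlContext is an existing SQLContext; Spark 1.3-era API):

    import sqlContext.implicits._

    // Two parquet files with different but compatible schemas...
    sc.parallelize(1 to 5).map(i => (i, i * 3)).toDF("single", "triple")
      .saveAsParquetFile("data/test_table/key=1")
    sc.parallelize(6 to 10).map(i => (i, i * 2)).toDF("single", "double")
      .saveAsParquetFile("data/test_table/key=2")
    // ...reading the parent directory merges the schemas; each row gets
    // null for the columns its own file did not contain.
    val merged = sqlContext.parquetFile("data/test_table")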

Re: How DataFrame schema migration works ?

2015-04-14 Thread Jaonary Rabarisoa
I forgot to mention that the imageId field is a custom Scala object. Do I need to implement some special methods to make it work (equals, hashCode)? On Tue, Apr 14, 2015 at 5:00 PM, Jaonary Rabarisoa jaon...@gmail.com wrote: Dear all, In the latest version of Spark there's a feature called

How DataFrame schema migration works ?

2015-04-14 Thread Jaonary Rabarisoa
Dear all, In the latest version of Spark there's a feature called automatic partition discovery and schema migration for Parquet. As far as I know, this gives the ability to split a DataFrame into several parquet files, and by just loading the parent directory one can get the global schema of

Re: Unable to save dataframe with UDT created with sqlContext.createDataFrame

2015-04-03 Thread Jaonary Rabarisoa
, Jaonary Rabarisoa jaon...@gmail.com wrote: Hmm, I got the same error with the master. Here is another test example that fails. Here, I explicitly create a Row RDD which corresponds to the use case I am in : object TestDataFrame { def main(args: Array[String]): Unit

Re: Unable to save dataframe with UDT created with sqlContext.createDataFrame

2015-04-01 Thread Jaonary Rabarisoa
, Mar 31, 2015 at 11:18 PM, Xiangrui Meng men...@gmail.com wrote: I cannot reproduce this error on master, but I'm not aware of any recent bug fixes that are related. Could you build and try the current master? -Xiangrui On Tue, Mar 31, 2015 at 4:10 AM, Jaonary Rabarisoa jaon...@gmail.com

Re: Some questions after playing a little with the new ml.Pipeline.

2015-03-31 Thread Jaonary Rabarisoa
to RDD. Thanks Shivaram On Mon, Mar 30, 2015 at 8:37 AM, Jaonary Rabarisoa jaon...@gmail.com wrote: Dear all, I'm still struggling to make a pre-trained Caffe model transformer for DataFrames work. The main problem is that creating a Caffe model inside the UDF is very slow and consumes

Re: Some questions after playing a little with the new ml.Pipeline.

2015-03-31 Thread Jaonary Rabarisoa
, Jaonary Rabarisoa jaon...@gmail.com wrote: Following your suggestion, I end up with the following implementation: override def transform(dataSet: DataFrame, paramMap: ParamMap): DataFrame = { val schema = transformSchema(dataSet.schema, paramMap, logging = true) val map

Unable to save dataframe with UDT created with sqlContext.createDataFrame

2015-03-31 Thread Jaonary Rabarisoa
Hi all, A DataFrame with a user-defined type (here mllib.Vector) created with sqlContext.createDataFrame can't be saved to a parquet file and raises a ClassCastException: org.apache.spark.mllib.linalg.DenseVector cannot be cast to org.apache.spark.sql.Row error. Here is an example of code to reproduce

Re: Some questions after playing a little with the new ml.Pipeline.

2015-03-30 Thread Jaonary Rabarisoa
. There is not a great way to do something equivalent to mapPartitions with UDFs right now. On Tue, Mar 3, 2015 at 4:36 AM, Jaonary Rabarisoa jaon...@gmail.com wrote: Here is my current implementation with the current master version of Spark: class DeepCNNFeature extends Transformer with HasInputCol

Re: Solve least square problem of the form min norm(A x - b)^2 + lambda * n * norm(x)^2 ?

2015-03-13 Thread Jaonary Rabarisoa
, 2015 at 11:36 PM, Shivaram Venkataraman shiva...@eecs.berkeley.edu wrote: On Thu, Mar 12, 2015 at 3:05 PM, Jaonary Rabarisoa jaon...@gmail.com wrote: In fact, by activating netlib with native libraries it goes faster. Glad you got it to work! Better performance was one of the reasons we

Re: Solve least square problem of the form min norm(A x - b)^2 + lambda * n * norm(x)^2 ?

2015-03-12 Thread Jaonary Rabarisoa
Thanks Shivaram On Tue, Mar 10, 2015 at 9:57 AM, Jaonary Rabarisoa jaon...@gmail.com wrote: I'm trying to play with the implementation of the least-squares solver (Ax = b) in mlmatrix.TSQR, where A is a 5*1024 matrix and b a 5*10 matrix. It works, but I notice that it's 8 times slower

Re: Solve least square problem of the form min norm(A x - b)^2 + lambda * n * norm(x)^2 ?

2015-03-10 Thread Jaonary Rabarisoa
https://github.com/amplab/ml-matrix/blob/master/src/main/scala/edu/berkeley/cs/amplab/mlmatrix/NormalEquations.scala On Tue, Mar 3, 2015 at 9:01 AM, Jaonary Rabarisoa jaon...@gmail.com wrote: Dear all, Is there a least-squares solver based on DistributedMatrix that we can use out of the box

Re: Solve least square problem of the form min norm(A x - b)^2 + lambda * n * norm(x)^2 ?

2015-03-06 Thread Jaonary Rabarisoa
/mlmatrix/NormalEquations.scala On Tue, Mar 3, 2015 at 9:01 AM, Jaonary Rabarisoa jaon...@gmail.com wrote: Dear all, Is there a least-squares solver based on DistributedMatrix that we can use out of the box in the current (or the master) version of Spark? It seems that the only least-squares

Re: Data Frame types

2015-03-06 Thread Jaonary Rabarisoa
Hi Cesar, Yes, you can define a UDT with the new DataFrame API, the same way SchemaRDD did. Jaonary On Fri, Mar 6, 2015 at 4:22 PM, Cesar Flores ces...@gmail.com wrote: The SchemaRDD supports the storage of user-defined classes. However, in order to do that, the user class needs to

Re: Solve least square problem of the form min norm(A x - b)^2 + lambda * n * norm(x)^2 ?

2015-03-06 Thread Jaonary Rabarisoa
/NormalEquations.scala On Tue, Mar 3, 2015 at 9:01 AM, Jaonary Rabarisoa jaon...@gmail.com wrote: Dear all, Is there a least-squares solver based on DistributedMatrix that we can use out of the box in the current (or the master) version of Spark? It seems that the only least-squares solver available in Spark

Re: Some questions after playing a little with the new ml.Pipeline.

2015-03-03 Thread Jaonary Rabarisoa
between Spark 1.2 and 1.3. In 1.3, the DSL is much improved and makes it easier to create a new column. Joseph On Sun, Mar 1, 2015 at 1:26 AM, Jaonary Rabarisoa jaon...@gmail.com wrote: class DeepCNNFeature extends Transformer ... { override def transform(data: DataFrame, paramMap

Solve least square problem of the form min norm(A x - b)^2 + lambda * n * norm(x)^2 ?

2015-03-03 Thread Jaonary Rabarisoa
Dear all, Is there a least-squares solver based on DistributedMatrix that we can use out of the box in the current (or the master) version of Spark? It seems that the only least-squares solver available in Spark is private to the recommender package. Cheers, Jao
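
Absent a built-in solver, the normal-equations approach (the same idea as ml-matrix's NormalEquations) is easy to sketch by hand, since the minimizer satisfies (A^T A + lambda*n*I) x = A^T b. Assuming rows: RDD[(Array[Double], Double)] pairs each row of A with its entry of b, with known dimension d and regularization lambda:

    import breeze.linalg.{DenseMatrix => BDM, DenseVector => BDV}

    // Accumulate the d x d Gram matrix A^T A and the vector A^T b on the
    // executors, then solve the small regularized system on the driver.
    val (ata, atb) = rows.map { case (a, b) =>
      val v = BDV(a)
      (v * v.t, v * b) // outer product a a^T and the contribution a * b
    }.reduce((p, q) => (p._1 + q._1, p._2 + q._2))
    val n = rows.count().toDouble
    val x = (ata + BDM.eye[Double](d) * (lambda * n)) \ atb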

Re: Some questions after playing a little with the new ml.Pipeline.

2015-03-01 Thread Jaonary Rabarisoa
class DeepCNNFeature extends Transformer ... { override def transform(data: DataFrame, paramMap: ParamMap): DataFrame = { // How can I do a map partition on the underlying RDD and then add the column ? } } On Sun, Mar 1, 2015 at 10:23 AM, Jaonary Rabarisoa jaon

Re: Some questions after playing a little with the new ml.Pipeline.

2015-03-01 Thread Jaonary Rabarisoa
: myRDD.mapPartitions { myDataOnPartition => val myModel = // instantiate neural network on this partition myDataOnPartition.map { myDatum => myModel.predict(myDatum) } } I hope this helps! Joseph On Fri, Feb 27, 2015 at 10:27 PM, Jaonary Rabarisoa jaon...@gmail.com wrote: Dear all

Some questions after playing a little with the new ml.Pipeline.

2015-02-27 Thread Jaonary Rabarisoa
Dear all, We mainly do large-scale computer vision tasks (image classification, retrieval, ...). The Pipeline is really great stuff for that. We're trying to reproduce the tutorial given on that topic during the latest Spark Summit (

Re: [ML][SQL] Select UserDefinedType attribute in a DataFrame

2015-02-24 Thread Jaonary Rabarisoa
should be `df.select($"image.data".as("features"))`. On Tue, Feb 24, 2015 at 3:35 PM, Xiangrui Meng men...@gmail.com wrote: If you make `Image` a case class, then select("image.data") should work. On Tue, Feb 24, 2015 at 3:06 PM, Jaonary Rabarisoa jaon...@gmail.com wrote: Hi all, I have

[ML][SQL] Select UserDefinedType attribute in a DataFrame

2015-02-24 Thread Jaonary Rabarisoa
Hi all, I have a DataFrame that contains a user-defined type. The type is an image with the following attributes: class Image(w: Int, h: Int, data: Vector). In my DataFrame, images are stored in a column named image that corresponds to the following case class: case class LabeledImage(label: Int,

Re: Need some help to create user defined type for ML pipeline

2015-02-23 Thread Jaonary Rabarisoa
issues you're running into? Joseph On Mon, Jan 19, 2015 at 12:59 AM, Jaonary Rabarisoa jaon...@gmail.com wrote: Hi all, I'm trying to implement a pipeline for computer vision based on the latest ML package in Spark. The first step of my pipeline is to decode images (jpeg for instance) stored

Unable to run spark-shell after build

2015-02-03 Thread Jaonary Rabarisoa
Hi all, I'm trying to run the master version of Spark in order to test some alpha components of the ml package. I followed the Building Spark documentation and built it with: $ mvn clean package. The build is successful, but when I try to run spark-shell I get the following error: Exception in

Re: Can't find spark-parent when using snapshot build

2015-02-02 Thread Jaonary Rabarisoa
That's what I did. On Mon, Feb 2, 2015 at 11:28 PM, Sean Owen so...@cloudera.com wrote: Snapshot builds are not published. Unless you build and install snapshots locally (like with mvn install) they won't be found. On Feb 2, 2015 10:58 AM, Jaonary Rabarisoa jaon...@gmail.com wrote: Hi all

Can't find spark-parent when using snapshot build

2015-02-02 Thread Jaonary Rabarisoa
Hi all, I'm trying to use the master version of Spark. I build and install it with $ mvn clean install. I manage to use it with the following configuration in my build.sbt: libraryDependencies ++= Seq("org.apache.spark" %% "spark-core" % "1.3.0-SNAPSHOT" % "provided", "org.apache.spark" %%

Need some help to create user defined type for ML pipeline

2015-01-19 Thread Jaonary Rabarisoa
Hi all, I'm trying to implement a pipeline for computer vision based on the latest ML package in Spark. The first step of my pipeline is to decode images (jpeg for instance) stored in a parquet file. For this, I began by creating a UserDefinedType that represents a decoded image stored in an array of

Re: DeepLearning and Spark ?

2015-01-10 Thread Jaonary Rabarisoa
' benchmark implementation will be released soon On 9 January 2015 at 10:59, Marco Shaw marco.s...@gmail.com wrote: Pretty vague on details: http://www.datasciencecentral.com/m/blogpost?id=6448529%3ABlogPost%3A227199 On Jan 9, 2015, at 11:39 AM, Jaonary Rabarisoa jaon...@gmail.com wrote: Hi

DeepLearning and Spark ?

2015-01-09 Thread Jaonary Rabarisoa
Hi all, Deep learning algorithms are popular and achieve state-of-the-art performance on several real-world machine learning problems. Currently there is no DL implementation in Spark, and I wonder if there is ongoing work on this topic. We can do DL in Spark with Sparkling Water and H2O, but

Re: MLLib: Saving and loading a model

2014-12-16 Thread Jaonary Rabarisoa
Hi, There's ongoing work on model export: https://www.github.com/apache/spark/pull/3062. For now, since the linear regression model is serializable, you can save it as an object file: sc.parallelize(Seq(model)).saveAsObjectFile(path), then val model = sc.objectFile[LinearRegressionModel](path).first(); model.predict(...)

Re: Why is KMeans with MLlib so slow?

2014-12-15 Thread Jaonary Rabarisoa
On Mon, Dec 8, 2014 at 7:53 AM, Jaonary Rabarisoa jaon...@gmail.com wrote: After some investigation, I learned that I can't compare k-means in MLlib with another k-means implementation directly. The k-means|| initialization step takes more time than the whole algorithm implemented in Julia
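
Given that observation, one way to make such comparisons fairer is to switch off k-means|| initialization; a sketch, with vectors: RDD[Vector] and k as placeholders:

    import org.apache.spark.mllib.clustering.KMeans

    // Random init skips the k-means|| initialization passes that
    // dominate the runtime in the measurement above.
    val model = new KMeans()
      .setK(k)
      .setInitializationMode(KMeans.RANDOM)
      .setMaxIterations(20)
      .run(vectors)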

DIMSUM and ColumnSimilarity use case ?

2014-12-10 Thread Jaonary Rabarisoa
Dear all, I'm trying to understand the correct use case of the columnSimilarities method implemented in RowMatrix. As far as I know, this function computes the pairwise similarities between the columns of a given matrix. The DIMSUM paper says that it's efficient for large m (rows) and small n (columns). In this case
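
A short sketch of that use case (rows: RDD[Vector] is assumed): the items to compare must be laid out as the columns of a tall-and-skinny matrix:

    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    val mat = new RowMatrix(rows)            // large m rows, small n columns
    val exact = mat.columnSimilarities()     // all pairwise cosine similarities
    val approx = mat.columnSimilarities(0.1) // DIMSUM sampling, threshold 0.1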

Re: Mllib native netlib-java/OpenBLAS

2014-12-09 Thread Jaonary Rabarisoa
+1 with 1.3-SNAPSHOT. On Mon, Dec 1, 2014 at 5:49 PM, agg212 alexander_galaka...@brown.edu wrote: Thanks for your reply, but I'm still running into issues installing/configuring the native libraries for MLlib. Here are the steps I've taken, please let me know if anything is incorrect. -

Why is my default partition size set to 52?

2014-12-05 Thread Jaonary Rabarisoa
Hi all, I'm trying to run a spark job with spark-shell. What I want to do is just count the number of lines in a file. I start spark-shell with the default arguments, i.e. just ./bin/spark-shell, load the text file with sc.textFile(path), and then call count on my data. When I do

Re: Why is my default partition size set to 52?

2014-12-05 Thread Jaonary Rabarisoa
that the Hadoop InputFormat would make 52 splits for it. Data drives partitions, not processing resources. Really, 8 splits is the minimum parallelism you want. Several times your # of cores is better. On Fri, Dec 5, 2014 at 8:51 AM, Jaonary Rabarisoa jaon...@gmail.com wrote: Hi all, I'm trying
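
Following that advice, the split count can be raised (never lowered; use coalesce for that) through textFile's optional minPartitions argument:

    // Ask for at least 104 partitions instead of the 52 the input splits give.
    val lines = sc.textFile("path/to/file", 104)
    println(lines.partitions.length)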

Why is KMeans with MLlib so slow?

2014-12-05 Thread Jaonary Rabarisoa
Hi all, I'm trying to run clustering with the k-means algorithm. My data set is about 240k vectors of dimension 384. Solving the problem with the k-means available in Julia (kmeans++, http://clusteringjl.readthedocs.org/en/latest/kmeans.html) takes about 8 minutes on a single core.

Re: Why is KMeans with MLlib so slow?

2014-12-05 Thread Jaonary Rabarisoa
PM, Davies Liu dav...@databricks.com wrote: Could you post your script to reproduce the results (also how to generate the dataset)? That will help us to investigate it. On Fri, Dec 5, 2014 at 8:40 AM, Jaonary Rabarisoa jaon...@gmail.com wrote: Hmm, here I use spark in local mode on my laptop

Store kmeans model

2014-11-24 Thread Jaonary Rabarisoa
Dear all, How can one save a k-means model after training? Best, Jao

Got java.lang.SecurityException: class javax.servlet.FilterRegistration's when running job from IntelliJ IDEA

2014-11-03 Thread Jaonary Rabarisoa
Hi all, I have a spark job that I build with sbt and can run without any problem with sbt run. But when I run it inside IntelliJ IDEA I get the following error: Exception encountered when invoking run on a nested suite - class javax.servlet.FilterRegistration's signer information does not

Re: unable to make a custom class as a key in a pairrdd

2014-10-24 Thread Jaonary Rabarisoa
to me. Have a try. Good luck, Niklas On 23.10.2014 21:52, Jaonary Rabarisoa wrote: Hi all, I have the following case class that I want to use as a key in a key-value rdd. I defined the equals and hashCode methods but it's not working. What am I doing wrong? case class PersonID(id

key class requirement for PairedRDD ?

2014-10-17 Thread Jaonary Rabarisoa
Dear all, Is it possible to use any kind of object as a key in a PairRDD? When I use a case class key, the groupByKey operation doesn't behave as I expect. I want to use a case class instead of a large tuple because it is easier to manipulate. Cheers, Jaonary
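
For reference, a tiny sanity check (sc assumed): a plain case class key does work with groupByKey, since case classes get structural equals/hashCode for free; trouble usually starts when the key class, or one of its fields, falls back to reference equality:

    case class PersonID(id: String)

    // Both pairs share the key PersonID("a"), so they group together.
    val pairs = sc.parallelize(Seq(PersonID("a") -> 1, PersonID("a") -> 2))
    pairs.groupByKey().collect() // Array((PersonID(a), CompactBuffer(1, 2)))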

Re: key class requirement for PairedRDD ?

2014-10-17 Thread Jaonary Rabarisoa
, Sonal Nube Technologies http://www.nubetech.co http://in.linkedin.com/in/sonalgoyal On Fri, Oct 17, 2014 at 12:28 PM, Jaonary Rabarisoa jaon...@gmail.com wrote: Dear all, Is it possible to use any kind of object as key in a PairedRDD. When I use a case class key, the groupByKey operation

Optimizing pairwise similarity computation or how to avoid RDD.cartesian operation ?

2014-10-17 Thread Jaonary Rabarisoa
Hi all, I need to compute a similarity between the elements of two large sets of high-dimensional feature vectors. Naively, I create all possible pairs of vectors with features1.cartesian(features2) and then map the resulting pair RDD with my similarity function. The problem is that the cartesian

Re: Optimizing pairwise similarity computation or how to avoid RDD.cartesian operation ?

2014-10-17 Thread Jaonary Rabarisoa
. This implements the DIMSUM sampling scheme, recently merged into master: https://github.com/apache/spark/pull/1778. Best, Reza On Fri, Oct 17, 2014 at 3:43 AM, Jaonary Rabarisoa jaon...@gmail.com wrote: Hi all, I need to compute a similarity between elements of two large sets of high

Re: Interactive interface tool for spark

2014-10-12 Thread Jaonary Rabarisoa
And what about Hue http://gethue.com ? On Sun, Oct 12, 2014 at 1:26 PM, andy petrella andy.petre...@gmail.com wrote: Dear Sparkers, As promised, I've just updated the repo with a new name (for the sake of clarity), default branch but specially with a dedicated README containing: *

java.lang.OutOfMemoryError: Java heap space when running job via spark-submit

2014-10-09 Thread Jaonary Rabarisoa
Dear all, I have a spark job with the following configuration: val conf = new SparkConf() .setAppName("My Job") .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") .set("spark.kryo.registrator", "value.serializer.Registrator") .setMaster("local[4]")

Re: java.lang.OutOfMemoryError: Java heap space when running job via spark-submit

2014-10-09 Thread Jaonary Rabarisoa
in fact with --driver-memory 2G I can get it working On Thu, Oct 9, 2014 at 6:20 PM, Xiangrui Meng men...@gmail.com wrote: Please use --driver-memory 2g instead of --conf spark.driver.memory=2g. I'm not sure whether this is a bug. -Xiangrui On Thu, Oct 9, 2014 at 9:00 AM, Jaonary Rabarisoa

Build error when using spark with breeze

2014-09-26 Thread Jaonary Rabarisoa
Hi all, I'm using some functions from Breeze in a spark job but I get the following build error: Error:scalac: bad symbolic reference. A signature in RandBasis.class refers to term math3 in package org.apache.commons which is not available. It may be completely missing from the current

Re: Build error when using spark with breeze

2014-09-26 Thread Jaonary Rabarisoa
<version>3.3</version> <scope>test</scope> </dependency> Adjusting the scope should solve the problem below. On Fri, Sep 26, 2014 at 8:42 AM, Jaonary Rabarisoa jaon...@gmail.com wrote: Hi all, I'm using some functions from Breeze in a spark job but I get the following build error

Re: Build error when using spark with breeze

2014-09-26 Thread Jaonary Rabarisoa
-mechanism.html#Dependency_Scope Cheers On Fri, Sep 26, 2014 at 8:57 AM, Jaonary Rabarisoa jaon...@gmail.com wrote: Thanks Ted. Can you tell me how to adjust the scope? On Fri, Sep 26, 2014 at 5:47 PM, Ted Yu yuzhih...@gmail.com wrote: spark-core's dependency on commons-math3 is at test scope (core
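
The concrete fix, assuming an sbt build like the ones in this list's other threads, is to declare commons-math3 as a compile-scope dependency of your own project (version 3.3 per the snippet above):

    // In build.sbt: depend on commons-math3 directly instead of relying on
    // spark-core, whose own dependency on it is test-scoped.
    libraryDependencies += "org.apache.commons" % "commons-math3" % "3.3"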

Better way to process large image data set ?

2014-09-18 Thread Jaonary Rabarisoa
Hi all, I'm trying to process a large image data set and need some way to optimize my implementation, since it's very slow for now. In my current implementation I store my images in an object file with the following fields: case class Image(groupId: String, imageId: String, buffer: String)

Why do I get java.lang.OutOfMemoryError: Java heap space with join?

2014-09-12 Thread Jaonary Rabarisoa
Dear all, I'm facing the following problem and can't figure out how to solve it. I need to join 2 RDDs in order to find their intersection. The first RDD represents an image encoded as a base64 string associated with an image id. The second RDD represents a set of geometric primitives (rectangles)

Re: Spark SQL : how to find element where a field is in a given set

2014-08-29 Thread Jaonary Rabarisoa
. personTable.where('name in ("foo", "bar")) On Thu, Aug 28, 2014 at 3:09 AM, Jaonary Rabarisoa jaon...@gmail.com wrote: Hi all, What is the expression that I should use with the Spark SQL DSL if I need to retrieve data with a field in a given set? For example: I have the following schema: case

Re: Spark SQL : how to find element where a field is in a given set

2014-08-29 Thread Jaonary Rabarisoa
[Expression]("a", "b", ...) table("src").where('key in (longList: _*)) Also, note that I had to explicitly specify Expression as the type parameter of Seq to ensure that the compiler converts "a" and "b" into Spark SQL expressions. On Thu, Aug 28, 2014 at 11:52 PM, Jaonary Rabarisoa jaon...@gmail.com

Re: Spark SQL : how to find element where a field is in a given set

2014-08-29 Thread Jaonary Rabarisoa
1.0.2 On Friday, August 29, 2014, Michael Armbrust mich...@databricks.com wrote: What version are you using? On Fri, Aug 29, 2014 at 2:22 AM, Jaonary Rabarisoa jaon...@gmail.com wrote: Still not working for me. I got a compilation error

Spark SQL : how to find element where a field is in a given set

2014-08-28 Thread Jaonary Rabarisoa
Hi all, What is the expression that I should use with the Spark SQL DSL if I need to retrieve data with a field in a given set? For example: I have the following schema: case class Person(name: String, age: Int). And I need to do something like personTable.where('name in Seq("foo", "bar")) ?

External dependencies management with spark

2014-08-27 Thread Jaonary Rabarisoa
Dear all, I'm looking for an efficient way to manage external dependencies. I know that one can add .jar or .py dependencies easily, but how can I handle other types of dependencies? Specifically, I have some data processing algorithms implemented in other languages (Ruby, Octave, Matlab, C++) and

Re: spark and matlab

2014-08-27 Thread Jaonary Rabarisoa
. Just watch out for any environment variables needed (you can pass them to pipe() as an optional argument if there are some). On August 25, 2014 at 12:41:29 AM, Jaonary Rabarisoa (jaon...@gmail.com) wrote: Hi all, Has anyone tried to pipe an RDD into a Matlab script? I'm trying to do
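
The optional argument mentioned above is an environment-variable map; a hedged sketch (the Matlab invocation and path are made up):

    // The second argument of pipe() is a Map of environment variables set
    // for the forked process -- one process per partition.
    val env = Map("LD_LIBRARY_PATH" -> "/opt/matlab/runtime/lib")
    val out = rdd.pipe(Seq("matlab", "-nodisplay", "-r", "myscript"), env)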

spark and matlab

2014-08-25 Thread Jaonary Rabarisoa
Hi all, Has anyone tried to pipe an RDD into a Matlab script? I'm trying to do something similar, if one of you could give me some hints. Best regards, Jao

RDD pipe partitionwise

2014-07-21 Thread Jaonary Rabarisoa
Dear all, Is there any example of mapPartitions that forks an external process, or of how to make RDD.pipe work on all the data of a partition? Cheers, Jaonary
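
Worth noting: RDD.pipe already works partition-wise. It forks the command once per partition, feeds that partition's elements to the process's stdin one line each, and turns its stdout lines into the output RDD. A sketch with a hypothetical script path:

    // One external process per partition; elements go in and come out
    // as lines of text.
    val piped = data.pipe("/path/to/my_script.sh")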

Re: Ambiguous references to id : what does it mean ?

2014-07-16 Thread Jaonary Rabarisoa
query? Did you use the Hive Parser (your query was submitted through hql(...)) or the basic SQL Parser (your query was submitted through sql(...)). Thanks, Yin On Tue, Jul 15, 2014 at 8:52 AM, Jaonary Rabarisoa jaon...@gmail.com wrote: Hi all, When running a join operation with Spark

Need advice to create an objectfile of set of images from Spark

2014-07-09 Thread Jaonary Rabarisoa
Hi all, I need to run a spark job that takes a set of images as input. I need something that loads these images as an RDD, but I just don't know how to do that. Does anyone have an idea? Cheers, Jao

Re: Need advice to create an objectfile of set of images from Spark

2014-07-09 Thread Jaonary Rabarisoa
image as a single object in 1 rdd would perhaps not be super optimized. Regards Mayur Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Wed, Jul 9, 2014 at 12:17 PM, Jaonary Rabarisoa jaon...@gmail.com wrote: Hi all

Configure and run external process with RDD.pipe

2014-07-02 Thread Jaonary Rabarisoa
Hi all, I need to run a complex external process with lots of dependencies from Spark. The pipe and addFile functions seem to be my friends, but there are just some issues that I need to solve. Precisely, the processes I want to run are C++ executables that may depend on some libraries and
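
A sketch of the addFile/pipe combination, with a hypothetical binary name, assuming the shipped file lands in the executors' working directory and keeps its executable bit (both worth verifying on your cluster):

    // Ship the executable to every worker with the job...
    sc.addFile("/local/tools/mytool")
    // ...and pipe each partition through the worker-local copy.
    val result = data.pipe("./mytool")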

Re: Using Spark as web app backend

2014-06-25 Thread Jaonary Rabarisoa
, that the front end can talk to On Tue, Jun 24, 2014 at 3:12 AM, Jaonary Rabarisoa jaon...@gmail.com wrote: Hi all, So far, I run my spark jobs with the spark-shell or spark-submit command. I'd like to go further and I wonder how to use Spark as the backend of a web application. Specifically, I want a frontend

Need help to make spark sql works in stand alone application

2014-06-25 Thread Jaonary Rabarisoa
Hi all, I'm trying to use Spark SQL to store data in a parquet file. I create the file and insert data into it with the following code: val conf = new SparkConf().setAppName("MCT").setMaster("local[2]"); val sc = new SparkContext(conf); val sqlContext = new SQLContext(sc)

wholeTextFiles like for binary files ?

2014-06-25 Thread Jaonary Rabarisoa
Is there an equivalent of wholeTextFiles for binary files, for example a set of images? Cheers, Jaonary
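
This gap was filled later: Spark 1.2 added SparkContext.binaryFiles, the binary analogue of wholeTextFiles; a minimal sketch:

    // One record per file: (path, PortableDataStream).
    val images = sc.binaryFiles("hdfs:///data/images")
    val bytes = images.mapValues(_.toArray()) // RDD[(String, Array[Byte])]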

Using Spark as web app backend

2014-06-24 Thread Jaonary Rabarisoa
Hi all, So far, I run my spark jobs with the spark-shell or spark-submit command. I'd like to go further and I wonder how to use Spark as the backend of a web application. Specifically, I want a frontend application (built with nodejs) to communicate with Spark on the backend, so that every query

Hybrid GPU CPU computation

2014-04-11 Thread Jaonary Rabarisoa
Hi all, I'm just wondering whether hybrid GPU/CPU computation is feasible with Spark, and what would be the best way to do it. Cheers, Jaonary

Re: Hybrid GPU CPU computation

2014-04-11 Thread Jaonary Rabarisoa
machines. Sent from my iPhone On Apr 11, 2014, at 8:38 AM, Jaonary Rabarisoa jaon...@gmail.com wrote: Hi all, I'm just wondering if hybrid GPU/CPU computation is something that is feasible with spark ? And what should be the best way to do it. Cheers, Jaonary

Re: Strange behavior of RDD.cartesian

2014-04-03 Thread Jaonary Rabarisoa
On Mar 28, 2014, at 3:09 AM, Jaonary Rabarisoa jaon...@gmail.com wrote: I forgot to mention that I don't really use all of my data. Instead I use a sample extracted with randomSample. On Fri, Mar 28, 2014 at 10:58 AM, Jaonary Rabarisoa jaon...@gmail.com wrote: Hi all, I notice

Use combineByKey and StatCounter

2014-04-01 Thread Jaonary Rabarisoa
Hi all, Can someone give me some tips on computing the mean of an RDD by key, maybe with combineByKey and StatCounter? Cheers, Jaonary
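
A minimal sketch of the combineByKey approach (pairs: RDD[(String, Double)] is assumed), tracking a (sum, count) accumulator per key:

    val means = pairs.combineByKey(
      (v: Double) => (v, 1L),                                       // create combiner
      (acc: (Double, Long), v: Double) => (acc._1 + v, acc._2 + 1), // add value
      (a: (Double, Long), b: (Double, Long)) => (a._1 + b._1, a._2 + b._2)
    ).mapValues { case (sum, count) => sum / count }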

Re: Strange behavior of RDD.cartesian

2014-03-28 Thread Jaonary Rabarisoa
I forgot to mention that I don't really use all of my data. Instead I use a sample extracted with randomSample. On Fri, Mar 28, 2014 at 10:58 AM, Jaonary Rabarisoa jaon...@gmail.com wrote: Hi all, I notice that RDD.cartesian has a strange behavior with cached and uncached data. More

Re: java.lang.ClassNotFoundException

2014-03-26 Thread Jaonary Rabarisoa
fully? I have seen this (in my limited experience) pop up as a result of previous exceptions/errors, also as a result of being unable to serialize objects etc. Ognen On 3/26/14, 10:39 AM, Jaonary Rabarisoa wrote: I notice that I get this error when I'm trying to load an objectFile

mapPartitions use case

2014-03-24 Thread Jaonary Rabarisoa
Dear all, Sorry for asking such a basic question, but can someone explain when one should use mapPartitions instead of map? Thanks, Jaonary
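
The usual rule of thumb: reach for mapPartitions when there is a per-partition setup cost worth paying once rather than once per element. A sketch with hypothetical helpers:

    val results = data.mapPartitions { iter =>
      val handle = openExpensiveResource() // e.g. a connection or a model
      iter.map(x => process(handle, x))    // reuse it for every element
    }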

Yet another question on saving RDD into files

2014-03-22 Thread Jaonary Rabarisoa
Dear all, As a Spark newbie, I need some help understanding how saving an RDD to file behaves. After reading the post on saving single files efficiently (http://apache-spark-user-list.1001560.n3.nabble.com/How-to-save-as-a-single-file-efficiently-td3014.html) I understand that each partition of the

Does RDD.saveAsObjectFile appends or create a new file ?

2014-03-21 Thread Jaonary Rabarisoa
Dear all, I need to run a series of transformations that map an RDD into another RDD. The computation changes over time and so does the resulting RDD. Each result is then saved to disk in order to do further analysis (for example, variation of the result over time). The question is, if I
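
For what it's worth, saveAsObjectFile neither appends nor overwrites: each call writes a fresh output directory and fails if it already exists, so one way to keep every run's result is a per-run path:

    // Hadoop-style output: a new directory per call, so name each run.
    rdd.saveAsObjectFile(s"results/run-${System.currentTimeMillis}")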

N-Fold validation and RDD partitions

2014-03-21 Thread Jaonary Rabarisoa
Hi, I need to partition my data, represented as an RDD, into n folds, run metrics computation on each fold, and finally compute the mean of my metrics over all the folds. Can Spark do this data partitioning out of the box, or do I need to implement it myself? I know that RDD has a partitions method
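
MLUtils.kFold (available since Spark 1.0) does exactly this split out of the box; evaluate is a hypothetical metric function here:

    import org.apache.spark.mllib.util.MLUtils

    // Array of (training, validation) RDD pairs, one per fold.
    val folds = MLUtils.kFold(data, numFolds = 10, seed = 42)
    val metrics = folds.map { case (train, test) => evaluate(train, test) }
    val meanMetric = metrics.sum / metrics.length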

Re: Hadoop streaming like feature for Spark

2014-03-20 Thread Jaonary Rabarisoa
, then collect the output. This might be useful if, e.g., your external process doesn't use line-oriented input/output. -Ewen Jaonary Rabarisoa jaon...@gmail.com March 20, 2014 at 1:04 AM Dear all, Does Spark have a kind of Hadoop streaming feature to run external processes

How to distribute external executable (script) with Spark ?

2014-03-19 Thread Jaonary Rabarisoa
Hi all, I'm trying to build an evaluation platform based on Spark. The idea is to run a blackbox executable (built with C/C++ or some scripting language). This blackbox takes a set of data as input and outputs some metrics. Since I have a huge amount of data, I need to distribute the computation

Feed KMeans algorithm with a row major matrix

2014-03-18 Thread Jaonary Rabarisoa
Dear all, I'm trying to cluster data from native library code with Spark k-means||. In my native library the data are represented as a matrix (rows = number of data points, cols = dimension). For efficiency reasons, they are copied into a one-dimensional Scala Array row-major wise, so after the
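
A sketch of the reshaping step (flat: Array[Double], the number of rows n, dimension d, and the k-means parameters are the caller's data): slice the row-major buffer back into rows before feeding KMeans:

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    // Row i occupies flat(i*d) until flat((i+1)*d) in the row-major buffer.
    val vectors = sc.parallelize(
      (0 until n).map(i => Vectors.dense(flat.slice(i * d, (i + 1) * d))))
    val model = KMeans.train(vectors, k, maxIterations)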

Reading sequencefile

2014-03-11 Thread Jaonary Rabarisoa
Hi all, I'm trying to read a SequenceFile that represents a set of jpeg images generated using this tool: http://stuartsierra.com/2008/04/24/a-million-little-files. According to the documentation: Each key is the name of a file (a Hadoop “Text”), the value is the binary contents of the file (a
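
A hedged sketch of reading such a file, per the tool's documentation (keys are Text, values BytesWritable); copying the bytes out matters because Hadoop reuses Writable instances across records:

    import org.apache.hadoop.io.{BytesWritable, Text}

    val files = sc.sequenceFile("images.seq", classOf[Text], classOf[BytesWritable])
    val images = files.map { case (name, bytes) =>
      (name.toString, bytes.getBytes.take(bytes.getLength)) // defensive copy
    }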