Re: unit testing in spark

2016-12-08 Thread ndjido
Hi Pseudo,

Just use unittest https://docs.python.org/2/library/unittest.html .
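
For example, a minimal self-contained test case could look like the sketch
below (assuming pyspark is importable and a local master is fine for tests):

import unittest

from pyspark import SparkConf, SparkContext


class WordCountTest(unittest.TestCase):

    @classmethod
    def setUpClass(cls):
        # Run Spark locally with 2 threads for the duration of the test class.
        conf = SparkConf().setMaster("local[2]").setAppName("unit-tests")
        cls.sc = SparkContext(conf=conf)

    @classmethod
    def tearDownClass(cls):
        cls.sc.stop()

    def test_word_count(self):
        rdd = self.sc.parallelize(["a b", "b c"])
        counts = dict(rdd.flatMap(lambda line: line.split())
                         .map(lambda word: (word, 1))
                         .reduceByKey(lambda a, b: a + b)
                         .collect())
        self.assertEqual(counts, {"a": 1, "b": 2, "c": 1})


if __name__ == "__main__":
    unittest.main()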

> On 8 Dec 2016, at 19:14, pseudo oduesp  wrote:
> 
> Can someone tell me how I can write unit tests for PySpark?
> (book, tutorial, ...)


Re: how to generate a column using mapParition and then add it back to the df?

2016-08-08 Thread ndjido

Hi MoTao,
What about broadcasting the model?
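
For instance, here is a minimal PySpark sketch of that idea (the Scala version
is analogous); init_model() is a stand-in for your own model construction and
the model is assumed to be serializable:

from pyspark.sql.functions import col, udf
from pyspark.sql.types import DoubleType

# Build the model once on the driver and ship it to the executors.
model_bc = sc.broadcast(init_model())

# The UDF only looks up the broadcast value: no per-row initialization.
process_udf = udf(lambda v: float(model_bc.value.process(v)), DoubleType())

df2 = df.withColumn("valueWithBias", process_udf(col("value")))

If the model is not serializable, initializing it once per partition inside
mapPartitions (as you already do) remains the usual alternative.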

Cheers,
Ndjido.

> On 08 Aug 2016, at 11:00, MoTao <mo...@sensetime.com> wrote:
> 
> Hi all,
> 
> I'm trying to append a column to a df.
> I understand that the new column must be created by
> 1) using literals,
> 2) transforming an existing column of the df,
> or 3) generating it from a UDF over this df.
> 
> In my case, the column to be appended is created by processing each row,
> like
> 
> import org.apache.spark.sql.functions.{col, udf}
> import spark.implicits._
>
> val df = Seq(1.0, 2.0, 3.0).toDF("value")
> val func = udf { v: Double =>
>   val model = initModel()
>   model.process(v)
> }
> val df2 = df.withColumn("valueWithBias", func(col("value")))
> 
> This works fine. However, for performance reasons, I want to avoid calling
> initModel() for each row.
> So I came up with mapPartitions, like
> 
> val df = Seq(1.0, 2.0, 3.0).toDF("value")
> val df2 = df.mapPartitions { rows =>
>   val model = initModel()  // initialized once per partition
>   rows.map(row => model.process(row.getAs[Double](0)))
> }
> val df3 = df.withColumn("valueWithBias", df2.col("value")) // FAIL
> 
> But this is wrong as a column of df2 *CANNOT* be appended to df.
> 
> The only solution I have found is to force mapPartitions to return a whole
> row instead of just the new column
> (something like "row => Row.fromSeq(row.toSeq ++
> Array(model.process(...)))"),
> which requires a lot of copying as well.
> 
> I wonder how to deal with this problem with as little overhead as possible?
> 
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/how-to-generate-a-column-using-mapParition-and-then-add-it-back-to-the-df-tp27493.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 




Re: add spark-csv jar to ipython notebook without packages flags

2016-07-25 Thread Ndjido Ardo BAR
Hi Pseudo,

try this:

export SPARK_SUBMIT_OPTIONS="--jars spark-csv_2.10-1.4.0.jar,commons-csv-1.1.jar"

This has been working for me for a long time ;-) both in Zeppelin (for Spark
Scala) and IPython Notebook (for PySpark).

cheers,

Ardo



On Mon, Jul 25, 2016 at 1:28 PM, pseudo oduesp 
wrote:

> PYSPARK_SUBMIT_ARGS  =  --jars spark-csv_2.10-1.4.0.jar,commons-csv-1.1.jar
> without success
>
> thanks
>
>
> 2016-07-25 13:27 GMT+02:00 pseudo oduesp :
>
>> Hi,
>> can someone tell me how I can add jars to IPython? I tried with Spark
>>
>>
>>
>


Re: lift coefficient

2016-07-22 Thread ndjido
Just apply the Lift = Recall / Support formula with respect to a given threshold
on your population distribution. The computation is quite straightforward.
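
For example, a rough PySpark sketch (assuming `scored` is an RDD of
(probability, label) pairs produced by your model and `t` is the chosen
threshold):

def lift_at_threshold(scored, t):
    n = float(scored.count())                                 # population size
    positives = scored.filter(lambda p: p[1] == 1.0).count()  # actual positives
    targeted = scored.filter(lambda p: p[0] >= t).cache()     # flagged at threshold t
    tp = targeted.filter(lambda p: p[1] == 1.0).count()       # true positives
    recall = tp / float(positives)
    support = targeted.count() / n
    return recall / support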

Cheers,
Ardo

> On 20 Jul 2016, at 15:05, pseudo oduesp  wrote:
> 
> Hi,
> How can we calculate the lift coefficient from PySpark prediction results?
> 
> Thanks




Re: add multiple columns

2016-06-26 Thread ndjido
Hi!

I'm afraid you have to loop... The update of the logical plan is getting
faster in newer Spark versions, so looping over withColumn is usually fine.
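
A minimal PySpark sketch of that loop (assuming `new_cols` maps the new column
names to Column expressions and `df` already has a numeric "value" column):

from pyspark.sql.functions import col, lit

new_cols = {
    "bonus": lit(0),                # constant column
    "value_x2": col("value") * 2,   # column derived from an existing one
}

for name, expression in new_cols.items():
    df = df.withColumn(name, expression)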

Cheers, 
Ardo.

Sent from my iPhone

> On 26 Jun 2016, at 14:20, pseudo oduesp  wrote:
> 
> Hi, how can I add multiple columns to a data frame?
> 
> withColumn allows adding one column, but when I have multiple columns, do I
> have to loop over each column?
> 
> thanks




Re: Labeledpoint

2016-06-21 Thread Ndjido Ardo BAR
To answer your question more accurately: the model.fit(df) method takes in a
DataFrame of Row(label=double, features=Vectors.dense([...])).
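
For instance, a minimal sketch (Spark 1.6 style, assuming a SQLContext named
sqlContext; on Spark 2.x use pyspark.ml.linalg and the SparkSession instead):

from pyspark.ml.classification import RandomForestClassifier
from pyspark.mllib.linalg import Vectors  # pyspark.ml.linalg on Spark 2.x

data = sqlContext.createDataFrame(
    [(0.0, Vectors.dense([0.0, 1.1, 0.1])),
     (1.0, Vectors.dense([2.0, 1.0, -1.0])),
     (0.0, Vectors.dense([0.1, 1.2, 0.0])),
     (1.0, Vectors.dense([2.1, 0.9, -0.9]))],
    ["label", "features"])

train, test = data.randomSplit([0.75, 0.25], seed=42)

rf = RandomForestClassifier(labelCol="label", featuresCol="features", numTrees=10)
model = rf.fit(train)

# transform() adds "prediction", "rawPrediction" and "probability" columns.
model.transform(test).show()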

cheers,
Ardo.


On Tue, Jun 21, 2016 at 6:44 PM, Ndjido Ardo BAR <ndj...@gmail.com> wrote:

> Hi,
>
> You can use an RDD of LabeledPoints to fit your model. Check the doc for more
> examples:
> http://spark.apache.org/docs/latest/api/python/pyspark.ml.html?highlight=transform#pyspark.ml.classification.RandomForestClassificationModel.transform
>
> cheers,
> Ardo.
>
> On Tue, Jun 21, 2016 at 6:12 PM, pseudo oduesp <pseudo20...@gmail.com>
> wrote:
>
>> Hi,
>> I am a PySpark user and I want to test RandomForest.
>>
>> I have a dataframe with 100 columns.
>> I have to give an RDD or a data frame to the algorithm, so I transformed my
>> dataframe to only two columns:
>> the label and features columns
>>
>>  df.label   df.features
>>  0          (517,(0,1,2,333,56 ...
>>  1          (517,(0,11,0,33,6 ...
>>  0          (517,(0,1,0,33,8 ...
>>
>> but I have no idea how to transform my data frame into the required input
>> format; I tested the example on the official web page without success.
>>
>> Please give me an example of how to proceed, especially with the test set.
>>
>> thanks
>>
>
>


Re: Labeledpoint

2016-06-21 Thread Ndjido Ardo BAR
Hi,

You can use an RDD of LabeledPoints to fit your model. Check the doc for more
examples:
http://spark.apache.org/docs/latest/api/python/pyspark.ml.html?highlight=transform#pyspark.ml.classification.RandomForestClassificationModel.transform

cheers,
Ardo.

On Tue, Jun 21, 2016 at 6:12 PM, pseudo oduesp 
wrote:

> Hi,
> I am a PySpark user and I want to test RandomForest.
>
> I have a dataframe with 100 columns.
> I have to give an RDD or a data frame to the algorithm, so I transformed my
> dataframe to only two columns:
> the label and features columns
>
>  df.label   df.features
>  0          (517,(0,1,2,333,56 ...
>  1          (517,(0,11,0,33,6 ...
>  0          (517,(0,1,0,33,8 ...
>
> but I have no idea how to transform my data frame into the required input
> format; I tested the example on the official web page without success.
>
> Please give me an example of how to proceed, especially with the test set.
>
> thanks
>


Re: H2O + Spark Streaming?

2016-05-05 Thread ndjido
Sure! Check the following working example : 
https://github.com/h2oai/qcon2015/tree/master/05-spark-streaming/ask-craig-streaming-app
 

Cheers.
Ardo

Sent from my iPhone

> On 05 May 2016, at 17:26, diplomatic Guru  wrote:
> 
> Hello all, I was wondering if it is possible to use H2O with Spark Streaming 
> for online prediction? 




Re: Mllib using model to predict probability

2016-05-05 Thread ndjido
You can use the BinaryClassificationEvaluator class to evaluate both predicted
classes (0/1) and probabilities/scores. Check the following Spark doc:
https://spark.apache.org/docs/latest/mllib-evaluation-metrics.html
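
For reference, a minimal sketch of the spark.mllib evaluation API from that
page (assuming `score_and_labels` is an RDD of (score or probability, label)
pairs produced by your model):

from pyspark.mllib.evaluation import BinaryClassificationMetrics

score_and_labels = sc.parallelize(
    [(0.1, 0.0), (0.4, 0.0), (0.6, 1.0), (0.9, 1.0)])

metrics = BinaryClassificationMetrics(score_and_labels)
print(metrics.areaUnderROC)  # area under the ROC curve
print(metrics.areaUnderPR)   # area under the precision-recall curve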


Cheers,
Ardo 

Sent from my iPhone

> On 05 May 2016, at 07:59, colin  wrote:
> 
> In 2-class problems, when I use SVM or RandomForest models to do
> classification, they predict "0" or "1".
> And when I use ROC to evaluate the model, sometimes I need the probability
> that a record belongs to "0" or "1".
> In scikit-learn, every model has "predict" and "predict_proba", where the
> latter outputs the probability.
> I searched the documentation and didn't find this function in MLlib.
> Does MLlib have this function?
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Mllib-using-model-to-predict-probability-tp26886.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 




Re: prefix column Spark

2016-04-19 Thread Ndjido Ardo BAR
This can help:

import org.apache.spark.sql.DataFrame

def prefixDf(dataFrame: DataFrame, prefix: String): DataFrame = {
  val colNames = dataFrame.columns
  colNames.foldLeft(dataFrame) { (df, colName) =>
    df.withColumnRenamed(colName, s"${prefix}_${colName}")
  }
}

cheers,
Ardo


On Tue, Apr 19, 2016 at 10:53 AM, nihed mbarek  wrote:

> Hi,
>
> I want to prefix a set of dataframes and I tried two solutions:
> * A for loop calling withColumnRenamed based on columns()
> * transforming my DataFrame to an RDD, updating the old schema and
> recreating the dataframe.
>
>
> Both are working for me; the second one is faster with tables that contain
> 800 columns but adds an extra transformation stage (toRDD).
>
> Is there any other solution?
>
> Thank you
>
> --
>
> M'BAREK Med Nihed,
> Fedora Ambassador, TUNISIA, Northern Africa
> http://www.nihed.com
>
> 
>
>


Re: Calling Python code from Scala

2016-04-18 Thread Ndjido Ardo BAR
Hi Didier,

I think with PySpark you can wrap your legacy Python functions into UDFs
and use them in your DataFrames. But you have to use DataFrames instead of
RDDs.
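
A minimal PySpark sketch of the idea, where legacy_score() stands in for one
of your existing Python functions (hypothetical name) and sqlContext is
assumed to be available:

from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

def legacy_score(x):
    # placeholder for your existing Python logic
    return x * 2.0

score_udf = udf(legacy_score, DoubleType())

df = sqlContext.createDataFrame([(1.0,), (2.0,), (3.0,)], ["value"])
df.withColumn("score", score_udf(df["value"])).show()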

cheers,
Ardo

On Mon, Apr 18, 2016 at 7:13 PM, didmar  wrote:

> Hi,
>
> I have a Spark project in Scala and I would like to call some Python
> functions from within the program.
> Both parts are quite big, so re-coding everything in one language is not
> really an option.
>
> The workflow would be:
> - Creating a RDD with Scala code
> - Mapping a Python function over this RDD
> - Using the result directly in Scala
>
> I've read about PySpark internals, but that didn't help much.
> Is it possible to do so, and preferably in an efficient manner?
>
> Cheers,
> Didier
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Calling-Python-code-from-Scala-tp26798.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
>


Re: How to estimate the size of dataframe using pyspark?

2016-04-09 Thread Ndjido Ardo BAR
What's the size of your driver?
On Sat, 9 Apr 2016 at 20:33, Buntu Dev  wrote:

> Actually, df.show() works displaying 20 rows but df.count() is the one
> which is causing the driver to run out of memory. There are just 3 INT
> columns.
>
> Any idea what could be the reason?
>
> On Sat, Apr 9, 2016 at 10:47 AM,  wrote:
>
>> You seem to have a lot of columns :-)!
>> df.count() gives the number of rows of your data frame,
>> and len(df.columns) the number of columns.
>>
>> Finally, I suggest you check the memory size of your driver and adjust it
>> accordingly.
>>
>> Cheers,
>>
>> Ardo
>>
>> Sent from my iPhone
>>
>> > On 09 Apr 2016, at 19:37, bdev  wrote:
>> >
>> > I keep running out of memory on the driver when I attempt to do
>> df.show().
>> > Can anyone let me know how to estimate the size of the dataframe?
>> >
>> > Thanks!
>> >
>> >
>> >
>> > --
>> > View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-estimate-the-size-of-dataframe-using-pyspark-tp26729.html
>> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
>> >
>>
>
>


Re: Sample project on Image Processing

2016-02-22 Thread ndjido
Hi folks,

KeystoneML has some image processing features: 
http://keystone-ml.org/examples.html 

 Cheers,
Ardo

Sent from my iPhone

> On 22 Feb 2016, at 14:34, Sainath Palla  wrote:
> 
> Here is one simple example of Image classification in Java.
> 
> http://blogs.quovantis.com/image-classification-using-apache-spark-with-linear-svm/
> 
> Personally, I feel python provides better libraries for image processing. But 
> it mostly depends on what kind of Image processing you are doing.
> 
> If you are stuck at the initial stages to load/save images, here is sample 
> code to do the same. This is in PySpark.
> 
> 
> 
> from PIL import Image
> from StringIO import StringIO  # on Python 3: from io import BytesIO
> import numpy as np
> 
> # Load images in the form of binary files: an RDD of (file name, raw bytes)
> images = sc.binaryFiles("Path")
> 
> # Convert an image to an array of shape [x, y, 3],
> # where x, y are the image dimensions and 3 stands for the R, G, B channels.
> image_to_array = lambda rawdata: np.asarray(Image.open(StringIO(rawdata)))
> 
> # Saving the images to files after processing
> # (imageOutIMG is the processed RDD: x holds the image name, img the image)
> for x, img in imageOutIMG.toLocalIterator():
>     path = "Path" + x + ".jpg"
>     img.save(path)
> 
> 
> 
> 
> 
> 
>> On Mon, Feb 22, 2016 at 3:23 AM, Mishra, Abhishek 
>>  wrote:
>> Hello,
>> 
>> I am working on image processing samples and was wondering if anyone has
>> worked on an image processing project in Spark. Please let me know if any
>> sample project or example is available.
>> 
>>  
>> 
>> Please guide in this.
>> 
>> Sincerely,
>> 
>> Abhishek
>> 
> 


Re: Pyspark - How to add new column to dataframe based on existing column value

2016-02-10 Thread ndjido
Hi Viktor,

Try to create a UDF. It's quite simple!
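
For example, a minimal sketch for your case (assuming the column is named
"str" as in your snippet):

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

replace_udf = udf(
    lambda s: s.replace("a", "b").replace("c", "d") if s is not None else None,
    StringType())

df = df.withColumn("strReplaced", replace_udf(df["str"]))

For simple character substitutions, the built-in functions translate or
regexp_replace in pyspark.sql.functions should also work.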

Ardo.


> On 10 Feb 2016, at 10:34, Viktor ARDELEAN  wrote:
> 
> Hello,
> 
> I want to add a new String column to the dataframe based on an existing 
> column values:
> 
> from pyspark.sql.functions import lit
> df.withColumn('strReplaced', lit(df.str.replace("a", "b").replace("c", "d")))
> So basically I want to add a new column named "strReplaced", that is the same 
> as the "str" column, just with character "a" replaced with "b" and "c" 
> replaced with "d".
> When I try the code above I get following error:
> Traceback (most recent call last):
>   File "", line 1, in 
> AttributeError: 'Column' object has no attribute 'replace'
> 
> So in fact I need somehow to get the value of the column df.str in order to 
> call replace on it.
> Any ideas how to do this?
> -- 
> Viktor ARDELEAN
> 
> P   Don't print this email, unless it's really necessary. Take care of the 
> environment.


Issue with spark-shell in yarn mode

2016-01-26 Thread ndjido
Hi folks,

On Spark 1.6.0, I submitted 2 lines of code via spark-shell in Yarn-client mode:

1) sc.parallelize(Array(1,2,3,3,3,3,4)).collect()

2) sc.parallelize(Array(1,2,3,3,3,3,4)).map( x => (x, 1)).collect()

1) works well whereas 2) raises the following exception: 

Driver stacktrace:
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at 
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at scala.Option.foreach(Option.scala:236)
at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at 
org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1845)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1858)
at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1314)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
at org.apache.spark.rdd.RDD.take(RDD.scala:1288)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:28)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:33)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:35)
at $iwC$$iwC$$iwC$$iwC$$iwC.(:37)
at $iwC$$iwC$$iwC$$iwC.(:39)
at $iwC$$iwC$$iwC.(:41)
at $iwC$$iwC.(:43)
at $iwC.(:45)
at (:47)
at .(:51)
at .()
at .(:7)
at .()
at $print()
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
at 
org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1346)
at 
org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
at 
org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
at 
org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:657)
at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:665)
at 
org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:670)
at 
org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:997)
at 
org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
at 
org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
at 
scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
at 
org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059)
at org.apache.spark.repl.Main$.main(Main.scala:31)
at org.apache.spark.repl.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 

Re: GLM in ml pipeline

2016-01-03 Thread ndjido
KeystoneML could be an alternative.

Ardo.

> On 03 Jan 2016, at 15:50, Arunkumar Pillai  wrote:
> 
> Is there any roadmap for GLM in the ML pipeline?


Re: Can't filter

2015-12-10 Thread Ndjido Ardo Bar
Please send your call stack with the full description of the exception.

> On 10 Dec 2015, at 12:10, Бобров Виктор  wrote:
> 
> Hi, I can’t filter my rdd.
>  
> def filter1(tp: ((Array[String], Int), (Array[String], Int))): Boolean= {
>   tp._1._2 > tp._2._2
> }
> val mail_rdd = sc.parallelize(A.toSeq).cache()
> val step1 = mail_rdd.cartesian(mail_rdd)
> val step2 = step1.filter(filter1)
>  
> I get a “Class not found” error. What am I doing wrong? Thanks for the help.
>  
>  
>  


Re: RDD functions

2015-12-04 Thread Ndjido Ardo BAR
Hi Michal,

I think the following link could interest you. You will find a lot of
examples there!

http://homepage.cs.latrobe.edu.au/zhe/ZhenHeSparkRDDAPIExamples.html

cheers,
Ardo

On Fri, Dec 4, 2015 at 2:31 PM, Michal Klos  wrote:

> http://spark.apache.org/docs/latest/programming-guide.html#rdd-operations
>
> M
>
> On Dec 4, 2015, at 8:21 AM, Sateesh Karuturi 
> wrote:
>
> Hello Spark experts...
> I am new to Apache Spark. Can anyone send me the proper documentation to
> learn RDD functions?
> Thanks in advance...
>
>


Re: Grid search with Random Forest

2015-12-01 Thread Ndjido Ardo BAR
Thanks for the clarification. I am going to test that and give you feedback.

Ndjido
On Tue, 1 Dec 2015 at 19:29, Joseph Bradley <jos...@databricks.com> wrote:

> You can do grid search if you set the evaluator to a
> MulticlassClassificationEvaluator, which expects a prediction column, not a
> rawPrediction column.  There's a JIRA for making
> BinaryClassificationEvaluator accept prediction instead of rawPrediction.
> Joseph
>
> On Tue, Dec 1, 2015 at 5:10 AM, Benjamin Fradet <benjamin.fra...@gmail.com
> > wrote:
>
>> Someone correct me if I'm wrong but no there isn't one that I am aware of.
>>
>> Unless someone is willing to explain how to obtain the raw prediction
>> column with the GBTClassifier. In this case I'd be happy to work on a PR.
>> On 1 Dec 2015 8:43 a.m., "Ndjido Ardo BAR" <ndj...@gmail.com> wrote:
>>
>>> Hi Benjamin,
>>>
>>> Thanks, the documentation you sent is clear.
>>> Is there any other way to perform a Grid Search with GBT?
>>>
>>>
>>> Ndjido
>>> On Tue, 1 Dec 2015 at 08:32, Benjamin Fradet <benjamin.fra...@gmail.com>
>>> wrote:
>>>
>>>> Hi Ndjido,
>>>>
>>>> This is because GBTClassifier doesn't yet have a rawPredictionCol like
>>>> the. RandomForestClassifier has.
>>>> Cf:
>>>> http://spark.apache.org/docs/latest/ml-ensembles.html#output-columns-predictions-1
>>>> On 1 Dec 2015 3:57 a.m., "Ndjido Ardo BAR" <ndj...@gmail.com> wrote:
>>>>
>>>>> Hi Joseph,
>>>>>
>>>>> Yes Random Forest support Grid Search on Spark 1.5.+ . But I'm getting
>>>>> a "rawPredictionCol field does not exist exception" on Spark 1.5.2 for
>>>>> Gradient Boosting Trees classifier.
>>>>>
>>>>>
>>>>> Ardo
>>>>> On Tue, 1 Dec 2015 at 01:34, Joseph Bradley <jos...@databricks.com>
>>>>> wrote:
>>>>>
>>>>>> It should work with 1.5+.
>>>>>>
>>>>>> On Thu, Nov 26, 2015 at 12:53 PM, Ndjido Ardo Bar <ndj...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>>
>>>>>>> Hi folks,
>>>>>>>
>>>>>>> Does anyone know whether the Grid Search capability is enabled since
>>>>>>> the issue spark-9011 of version 1.4.0 ? I'm getting the 
>>>>>>> "rawPredictionCol
>>>>>>> column doesn't exist" when trying to perform a grid search with Spark 
>>>>>>> 1.4.0.
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Ardo
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>


Re: Grid search with Random Forest

2015-11-30 Thread Ndjido Ardo BAR
Hi Joseph,

Yes, Random Forest supports Grid Search on Spark 1.5+. But I'm getting a
"rawPredictionCol field does not exist" exception on Spark 1.5.2 for the
Gradient Boosting Trees classifier.


Ardo
On Tue, 1 Dec 2015 at 01:34, Joseph Bradley <jos...@databricks.com> wrote:

> It should work with 1.5+.
>
> On Thu, Nov 26, 2015 at 12:53 PM, Ndjido Ardo Bar <ndj...@gmail.com>
> wrote:
>
>>
>> Hi folks,
>>
>> Does anyone know whether the Grid Search capability is enabled since the
>> issue spark-9011 of version 1.4.0 ? I'm getting the "rawPredictionCol
>> column doesn't exist" when trying to perform a grid search with Spark 1.4.0.
>>
>> Cheers,
>> Ardo
>>
>>
>>
>>
>>
>


Re: Grid search with Random Forest

2015-11-30 Thread Ndjido Ardo BAR
Hi Benjamin,

Thanks, the documentation you sent is clear.
Is there any other way to perform a Grid Search with GBT?


Ndjido
On Tue, 1 Dec 2015 at 08:32, Benjamin Fradet <benjamin.fra...@gmail.com>
wrote:

> Hi Ndjido,
>
> This is because GBTClassifier doesn't yet have a rawPredictionCol like
> the. RandomForestClassifier has.
> Cf:
> http://spark.apache.org/docs/latest/ml-ensembles.html#output-columns-predictions-1
> On 1 Dec 2015 3:57 a.m., "Ndjido Ardo BAR" <ndj...@gmail.com> wrote:
>
>> Hi Joseph,
>>
>> Yes Random Forest support Grid Search on Spark 1.5.+ . But I'm getting a
>> "rawPredictionCol field does not exist exception" on Spark 1.5.2 for
>> Gradient Boosting Trees classifier.
>>
>>
>> Ardo
>> On Tue, 1 Dec 2015 at 01:34, Joseph Bradley <jos...@databricks.com>
>> wrote:
>>
>>> It should work with 1.5+.
>>>
>>> On Thu, Nov 26, 2015 at 12:53 PM, Ndjido Ardo Bar <ndj...@gmail.com>
>>> wrote:
>>>
>>>>
>>>> Hi folks,
>>>>
>>>> Does anyone know whether the Grid Search capability is enabled since
>>>> the issue spark-9011 of version 1.4.0 ? I'm getting the "rawPredictionCol
>>>> column doesn't exist" when trying to perform a grid search with Spark 
>>>> 1.4.0.
>>>>
>>>> Cheers,
>>>> Ardo
>>>>
>>>>
>>>>
>>>>
>>>>
>>>


Re: Debug Spark

2015-11-29 Thread Ndjido Ardo BAR
Spark JobServer allows you to submit your apps to any kind of deployment
(standalone or cluster). I think it could be suitable for your use case.
Check the following GitHub repo:
https://github.com/spark-jobserver/spark-jobserver

Ardo

On Sun, Nov 29, 2015 at 6:42 PM, Նարեկ Գալստեան <ngalsty...@gmail.com>
wrote:

> A question regarding the topic,
>
> I am using IntelliJ to write Spark applications and then have to ship the
> source code to my cluster in the cloud to compile and test.
>
> Is there a way to automate the process using IntelliJ?
>
> Narek Galstyan
>
> Նարեկ Գալստյան
>
> On 29 November 2015 at 20:51, Ndjido Ardo BAR <ndj...@gmail.com> wrote:
>
>> Masf, the following link sets the basics to start debugging your spark
>> apps in local mode:
>>
>>
>> https://medium.com/large-scale-data-processing/how-to-kick-start-spark-development-on-intellij-idea-in-4-steps-c7c8f5c2fe63#.675s86940
>>
>> Ardo
>>
>> On Sun, Nov 29, 2015 at 5:34 PM, Masf <masfwo...@gmail.com> wrote:
>>
>>> Hi Ardo
>>>
>>>
>>> Some tutorial to debug with Intellij?
>>>
>>> Thanks
>>>
>>> Regards.
>>> Miguel.
>>>
>>>
>>> On Sun, Nov 29, 2015 at 5:32 PM, Ndjido Ardo BAR <ndj...@gmail.com>
>>> wrote:
>>>
>>>> hi,
>>>>
>>>> IntelliJ is just great for that!
>>>>
>>>> cheers,
>>>> Ardo.
>>>>
>>>> On Sun, Nov 29, 2015 at 5:18 PM, Masf <masfwo...@gmail.com> wrote:
>>>>
>>>>> Hi
>>>>>
>>>>> Is it possible to debug spark locally with IntelliJ or another IDE?
>>>>>
>>>>> Thanks
>>>>>
>>>>> --
>>>>> Regards.
>>>>> Miguel Ángel
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>>
>>>
>>> Saludos.
>>> Miguel Ángel
>>>
>>
>>
>


Re: Debug Spark

2015-11-29 Thread Ndjido Ardo BAR
Masf, the following link sets the basics to start debugging your spark apps
in local mode:

https://medium.com/large-scale-data-processing/how-to-kick-start-spark-development-on-intellij-idea-in-4-steps-c7c8f5c2fe63#.675s86940

Ardo

On Sun, Nov 29, 2015 at 5:34 PM, Masf <masfwo...@gmail.com> wrote:

> Hi Ardo
>
>
> Some tutorial to debug with Intellij?
>
> Thanks
>
> Regards.
> Miguel.
>
>
> On Sun, Nov 29, 2015 at 5:32 PM, Ndjido Ardo BAR <ndj...@gmail.com> wrote:
>
>> hi,
>>
>> IntelliJ is just great for that!
>>
>> cheers,
>> Ardo.
>>
>> On Sun, Nov 29, 2015 at 5:18 PM, Masf <masfwo...@gmail.com> wrote:
>>
>>> Hi
>>>
>>> Is it possible to debug spark locally with IntelliJ or another IDE?
>>>
>>> Thanks
>>>
>>> --
>>> Regards.
>>> Miguel Ángel
>>>
>>
>>
>
>
> --
>
>
> Saludos.
> Miguel Ángel
>


Re: Debug Spark

2015-11-29 Thread Ndjido Ardo BAR
hi,

IntelliJ is just great for that!

cheers,
Ardo.

On Sun, Nov 29, 2015 at 5:18 PM, Masf  wrote:

> Hi
>
> Is it possible to debug spark locally with IntelliJ or another IDE?
>
> Thanks
>
> --
> Regards.
> Miguel Ángel
>


Grid search with Random Forest

2015-11-26 Thread Ndjido Ardo Bar

Hi folks,

Does anyone know whether the Grid Search capability has been enabled since issue
SPARK-9011 of version 1.4.0? I'm getting a "rawPredictionCol column doesn't
exist" error when trying to perform a grid search with Spark 1.4.0.

Cheers,
Ardo 







Re: can I use Spark as alternative for gem fire cache ?

2015-10-17 Thread Ndjido Ardo Bar
Hi Kali,

If I understand you well, Tachyon (http://tachyon-project.org) can be a good
alternative. You can use the Spark API to load and persist data into Tachyon.
Hope that helps.
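
A minimal sketch (Spark 1.x era, assuming a Tachyon master reachable at
tachyon-master:19998; adjust the URI to your deployment):

from pyspark import StorageLevel

# Load the dimension data from Tachyon; data in Tachyon outlives any single
# Spark application, unlike a cached RDD.
dims = sc.textFile("tachyon://tachyon-master:19998/dimensions/input")
dims.persist(StorageLevel.MEMORY_ONLY)  # fast access within this application

# Write derived/updated dimension data back to Tachyon for other jobs to use.
dims.saveAsTextFile("tachyon://tachyon-master:19998/dimensions/cached")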

Ardo 

> On 17 Oct 2015, at 15:28, "kali.tumm...@gmail.com"  
> wrote:
> 
> Hi All, 
> 
> Can Spark be used as an alternative to GemFire cache? We use GemFire
> cache to save (cache) dimension data in memory, which is later used by our
> custom-made Java ETL tool. Can I do something like the below?
> 
> Can I cache an RDD in memory for a whole day? As far as I know, an RDD will
> be emptied once the Spark code finishes executing (correct me if I am wrong).
> 
> Spark:
> create an RDD
> rdd.persist()
> 
> Thanks
> 
> 
> 
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/can-I-use-Spark-as-alternative-for-gem-fire-cache-tp25106.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 




Re: Scala api end points

2015-09-24 Thread Ndjido Ardo BAR
Hi Masoom Alam,

I successfully experimented with the following project on GitHub:
https://github.com/erisa85/WikiSparkJobServer . I do recommend it to you.

cheers,
Ardo.

On Thu, Sep 24, 2015 at 5:20 PM, masoom alam 
wrote:

> Hi everyone
>
> I am new to Scala. I have written an application using Scala in
> Spark. Now we want to interface it through REST API endpoints. What
> is the best choice? Please share your experiences.
>
> Thanks
>


Re: Small File to HDFS

2015-09-03 Thread Ndjido Ardo Bar
Hi Nibiau,

HBase seems to be a good solution to your problem. As you may know, storing
your messages as key-value pairs in HBase saves you the overhead of manually
resizing blocks of data using zip files.
An added advantage, along with the fact that HBase uses HDFS for storage, is
the capability of updating your records, for example with the "put" operation.
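
For illustration, a minimal sketch of writing the messages from Spark with the
third-party happybase client; the host, table and column-family names are
hypothetical, and `messages` is assumed to be an RDD of (message_id, payload)
string pairs:

import happybase

def save_partition(records):
    # One HBase connection per partition, not per record.
    connection = happybase.Connection("hbase-host")
    table = connection.table("events")
    for message_id, payload in records:
        table.put(message_id, {"d:payload": payload})
    connection.close()

messages.foreachPartition(save_partition)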

Cheers,
Ardo

> On 03 Sep 2015, at 13:35, nib...@free.fr wrote:
> 
> Ok, but then some questions:
> - Sometimes I have to remove some messages from HDFS (cancel/replace cases);
> is that possible?
> - In the case of a big zip file, is it possible to easily run Pig on it
> directly?
> 
> Tks
> Nicolas
> 
> - Original Message -
> From: "Tao Lu"
> To: nib...@free.fr
> Cc: "Ted Yu" , "user"
> Sent: Wednesday, 2 September 2015 19:09:23
> Subject: Re: Small File to HDFS
> 
> 
> You may consider storing it in one big HDFS file, and keep appending new
> messages to it.
> 
> 
> For instance, 
> one message -> zip it -> append it to the HDFS as one line 
> 
> 
> On Wed, Sep 2, 2015 at 12:43 PM, < nib...@free.fr > wrote: 
> 
> 
> Hi, 
> I already store them in MongoDB in parallel for operational access and don't
> want to add another database in the loop.
> Is it the only solution?
> 
> Tks 
> Nicolas 
> 
> - Original Message -
> From: "Ted Yu" < yuzhih...@gmail.com >
> To: nib...@free.fr
> Cc: "user" < user@spark.apache.org >
> Sent: Wednesday, 2 September 2015 18:34:17
> Subject: Re: Small File to HDFS
> 
> 
> 
> 
> Instead of storing those messages in HDFS, have you considered storing them
> in a key-value store (e.g. HBase)?
> 
> 
> Cheers 
> 
> 
> On Wed, Sep 2, 2015 at 9:07 AM, < nib...@free.fr > wrote: 
> 
> 
> Hello, 
> I'm currently using Spark Streaming to collect small messages (events),
> each under 50 KB in size; the volume is high (several million per day) and I
> have to store those messages in HDFS.
> I understand that storing small files can be problematic in HDFS; how can I
> manage it?
> 
> Tks 
> Nicolas 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> -- 
> 
> 
>  Thanks! 
> Tao
> 
> 
