Re: Sporadic ClassNotFoundException with Kryo

2017-01-12 Thread Nirmal Fernando
I faced a similar issue and had to do two things:

1. Submit Kryo jar with the spark-submit
2. Set spark.executor.userClassPathFirst true in Spark conf
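For reference, a rough sketch of the same two settings expressed in code (the jar
paths and the app name are made-up placeholders; the same values can equally be
passed via --jars and --conf on spark-submit):

```
import org.apache.spark.SparkConf;

// Sketch only: ship the Kryo jar (plus the jar holding the classes you register)
// to the executors, and let user-supplied classes win over Spark's own copies.
SparkConf conf = new SparkConf()
    .setAppName("kryo-streaming-job")                                    // placeholder name
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .set("spark.jars", "/path/to/kryo.jar,/path/to/your-classes.jar")    // hypothetical paths
    .set("spark.executor.userClassPathFirst", "true");
```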

On Fri, Nov 18, 2016 at 7:39 PM, chrism 
wrote:

> Regardless of the different ways we have tried deploying a jar together
> with
> Spark, when running a Spark Streaming job with Kryo as serializer on top of
> Mesos, we sporadically get the following error (I have truncated a bit):
>
> 16/11/18 08:39:10 ERROR OneForOneBlockFetcher: Failed while starting block
> fetches
> java.lang.RuntimeException: org.apache.spark.SparkException: Failed to
> register classes with Kryo
>   at org.apache.spark.serializer.KryoSerializer.newKryo(KryoSerializer.scala:129)
>   at org.apache.spark.serializer.KryoSerializerInstance.borrowKryo(KryoSerializer.scala:274)
> ...
>   at org.apache.spark.serializer.SerializerManager.dataSerializeStream(SerializerManager.scala:125)
>   at org.apache.spark.storage.BlockManager$$anonfun$dropFromMemory$3.apply(BlockManager.scala:1265)
>   at org.apache.spark.storage.BlockManager$$anonfun$dropFromMemory$3.apply(BlockManager.scala:1261)
> ...
> Caused by: java.lang.ClassNotFoundException: cics.udr.compound_ran_udr
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>
> where "cics.udr.compound_ran_udr" is a class provided by us in a jar.
>
> We know that the jar containing "cics.udr.compound_ran_udr" is being
> deployed and works because it is listed in the "Environment" tab in the
> GUI,
> and calculations using this class succeed.
>
> We have tried the following methods of deploying the jar containing the
> class
>  * Through --jars in spark-submit
>  * Through SparkConf.setJar
>  * Through spark.driver.extraClassPath and spark.executor.extraClassPath
>  * By having it as the main jar used by spark-submit
> with no luck. The logs (see attached) recognize that the jar is being added
> to the classloader.
>
> We have tried registering the class using
>  * SparkConf.registerKryoClasses.
>  * spark.kryo.classesToRegister
> with no luck.
>
> We are running on Mesos and the jar has been deployed on every machine on
> the local file system in the same location.
>
> I would be very grateful for any help or ideas :)
>
>
>
> --
> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Sporadic-ClassNotFoundException-with-Kryo-tp28104.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


-- 

Thanks & regards,
Nirmal

Associate Technical Lead - Data Technologies Team, WSO2 Inc.
Mobile: +94715779733 <+94%2071%20577%209733>
Blog: http://nirmalfdo.blogspot.com/


Re: Apply ML to grouped dataframe

2016-08-22 Thread Nirmal Fernando
On Tue, Aug 23, 2016 at 10:56 AM, Wen Pei Yu <yuw...@cn.ibm.com> wrote:

> We can group a dataframe by one column like
>
> df.groupBy(df.col("gender"))
>

On top of this DF, use a filter that lets you extract each group as a
separate DF. Then you can apply ML on top of each DF; see the sketch below.

eg: xyzDF.filter(col("x").equalTo(x))
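A rough Java sketch of that idea (assuming the Spark 1.x DataFrame API and the
"gender" column from your example; df is the DataFrame you grouped, and on Spark
2.x DataFrame becomes Dataset<Row>):

```
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import static org.apache.spark.sql.functions.col;

// Collect the distinct group keys, then build one DataFrame per group and
// fit whatever ML estimator you need on each of them.
for (Row r : df.select("gender").distinct().collect()) {
    String gender = r.getString(0);
    DataFrame genderDf = df.filter(col("gender").equalTo(gender));
    // model = estimator.fit(genderDf);  // apply your ML algorithm per group
}
```

Note that only the distinct keys are collected to the driver; each per-group
DataFrame stays distributed.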

>
> It is like splitting a dataframe into multiple dataframes. Currently, we can
> only apply simple SQL functions to this GroupedData, like agg, max, etc.
>
> What we want is apply one ML algorithm to each group.
>
> Regards.
>
>
> From: Nirmal Fernando <nir...@wso2.com>
> To: Wen Pei Yu/China/IBM@IBMCN
> Cc: User <user@spark.apache.org>
> Date: 08/23/2016 01:14 PM
>
> Subject: Re: Apply ML to grouped dataframe
> --
>
>
>
> Hi Wen,
>
> AFAIK Spark MLlib implements its machine learning algorithms on top of
> Spark dataframe API. What did you mean by a grouped dataframe?
>
> On Tue, Aug 23, 2016 at 10:42 AM, Wen Pei Yu <*yuw...@cn.ibm.com*
> <yuw...@cn.ibm.com>> wrote:
>
>Hi Nirmal
>
>I didn't get your point.
>Can you tell me more about how to use MLlib to grouped dataframe?
>
>Regards.
>Wenpei.
>
>
>From: Nirmal Fernando <*nir...@wso2.com* <nir...@wso2.com>>
>To: Wen Pei Yu/China/IBM@IBMCN
>Cc: User <*user@spark.apache.org* <user@spark.apache.org>>
>Date: 08/23/2016 10:26 AM
>Subject: Re: Apply ML to grouped dataframe
>--
>
>
>
>
>You can use Spark MLlib
>
> *http://spark.apache.org/docs/latest/ml-guide.html#announcement-dataframe-based-api-is-primary-api*
>
> <http://spark.apache.org/docs/latest/ml-guide.html#announcement-dataframe-based-api-is-primary-api>
>
>On Tue, Aug 23, 2016 at 7:34 AM, Wen Pei Yu <*yuw...@cn.ibm.com*
><yuw...@cn.ibm.com>> wrote:
>   Hi
>
>  We have a dataframe, then want group it and apply a ML algorithm
>  or statistics(say t test) to each one. Is there any efficient way 
> for this
>  situation?
>
>  Currently, we transfer to pyspark, use groupbykey and apply
>  numpy function to array. But this wasn't an efficient way, right?
>
>  Regards.
>  Wenpei.
>
>
>
>
>--
>
>Thanks & regards,
>Nirmal
>
>Team Lead - WSO2 Machine Learner
>Associate Technical Lead - Data Technologies Team, WSO2 Inc.
>Mobile: *+94715779733* <%2B94715779733>
>Blog: *http://nirmalfdo.blogspot.com/* <http://nirmalfdo.blogspot.com/>
>
>
>
>
>
>
>
> --
>
> Thanks & regards,
> Nirmal
>
> Team Lead - WSO2 Machine Learner
> Associate Technical Lead - Data Technologies Team, WSO2 Inc.
> Mobile: +94715779733
> Blog: *http://nirmalfdo.blogspot.com/* <http://nirmalfdo.blogspot.com/>
>
>
>
>


-- 

Thanks & regards,
Nirmal

Team Lead - WSO2 Machine Learner
Associate Technical Lead - Data Technologies Team, WSO2 Inc.
Mobile: +94715779733
Blog: http://nirmalfdo.blogspot.com/


Re: Apply ML to grouped dataframe

2016-08-22 Thread Nirmal Fernando
Hi Wen,

AFAIK Spark MLlib implements its machine learning algorithms on top of
Spark dataframe API. What did you mean by a grouped dataframe?

On Tue, Aug 23, 2016 at 10:42 AM, Wen Pei Yu <yuw...@cn.ibm.com> wrote:

> Hi Nirmal
>
> I didn't get your point.
> Can you tell me more about how to use MLlib to grouped dataframe?
>
> Regards.
> Wenpei.
>
>
> From: Nirmal Fernando <nir...@wso2.com>
> To: Wen Pei Yu/China/IBM@IBMCN
> Cc: User <user@spark.apache.org>
> Date: 08/23/2016 10:26 AM
> Subject: Re: Apply ML to grouped dataframe
> --
>
>
>
> You can use Spark MLlib
> *http://spark.apache.org/docs/latest/ml-guide.html#announcement-dataframe-based-api-is-primary-api*
> <http://spark.apache.org/docs/latest/ml-guide.html#announcement-dataframe-based-api-is-primary-api>
>
> On Tue, Aug 23, 2016 at 7:34 AM, Wen Pei Yu <*yuw...@cn.ibm.com*
> <yuw...@cn.ibm.com>> wrote:
>
>Hi
>
>We have a dataframe, then want group it and apply a ML algorithm or
>statistics(say t test) to each one. Is there any efficient way for this
>situation?
>
>Currently, we transfer to pyspark, use groupbykey and apply numpy
>function to array. But this wasn't an efficient way, right?
>
>Regards.
>Wenpei.
>
>
>
>
>
> --
>
> Thanks & regards,
> Nirmal
>
> Team Lead - WSO2 Machine Learner
> Associate Technical Lead - Data Technologies Team, WSO2 Inc.
> Mobile: +94715779733
> Blog: *http://nirmalfdo.blogspot.com/* <http://nirmalfdo.blogspot.com/>
>
>
>
>


-- 

Thanks & regards,
Nirmal

Team Lead - WSO2 Machine Learner
Associate Technical Lead - Data Technologies Team, WSO2 Inc.
Mobile: +94715779733
Blog: http://nirmalfdo.blogspot.com/


Re: Apply ML to grouped dataframe

2016-08-22 Thread Nirmal Fernando
You can use Spark MLlib
http://spark.apache.org/docs/latest/ml-guide.html#announcement-dataframe-based-api-is-primary-api

On Tue, Aug 23, 2016 at 7:34 AM, Wen Pei Yu  wrote:

> Hi
>
> We have a dataframe and want to group it and apply an ML algorithm or
> statistic (say a t-test) to each group. Is there any efficient way for this
> situation?
>
> Currently, we transfer to PySpark, use groupByKey and apply a numpy function
> to the array. But this isn't an efficient way, right?
>
> Regards.
> Wenpei.
>



-- 

Thanks & regards,
Nirmal

Team Lead - WSO2 Machine Learner
Associate Technical Lead - Data Technologies Team, WSO2 Inc.
Mobile: +94715779733
Blog: http://nirmalfdo.blogspot.com/


Re: thought experiment: use spark ML to real time prediction

2015-11-12 Thread Nirmal Fernando
On Fri, Nov 13, 2015 at 2:04 AM, darren <dar...@ontrenet.com> wrote:

> I agree 100%. Making the model requires large data and many cpus.
>
> Using it does not.
>
> This is a very useful side effect of ML models.
>
> If mlib can't use models outside spark that's a real shame.
>

Well, you can, as mentioned earlier. You don't need the Spark runtime for
predictions: save the serialized model and deserialize it to use. (You do need
the Spark jars in the classpath though.)
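For example, a minimal sketch using plain Java serialization (the model type,
file name and feature values are only illustrative; most MLlib models are
Serializable, and exception handling is omitted for brevity):

```
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import org.apache.spark.mllib.classification.LogisticRegressionModel;
import org.apache.spark.mllib.linalg.Vectors;

// Training side: write the trained model out with standard Java serialization.
try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream("model.ser"))) {
    out.writeObject(model);  // 'model' is the trained LogisticRegressionModel
}

// Serving side: no SparkContext needed, only the Spark MLlib jars on the classpath.
try (ObjectInputStream in = new ObjectInputStream(new FileInputStream("model.ser"))) {
    LogisticRegressionModel loaded = (LogisticRegressionModel) in.readObject();
    double prediction = loaded.predict(Vectors.dense(1.0, 2.0, 3.0));
}
```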

>
>
> Sent from my Verizon Wireless 4G LTE smartphone
>
>
>  Original message 
> From: "Kothuvatiparambil, Viju" <viju.kothuvatiparam...@bankofamerica.com>
>
> Date: 11/12/2015 3:09 PM (GMT-05:00)
> To: DB Tsai <dbt...@dbtsai.com>, Sean Owen <so...@cloudera.com>
> Cc: Felix Cheung <felixcheun...@hotmail.com>, Nirmal Fernando <
> nir...@wso2.com>, Andy Davidson <a...@santacruzintegration.com>, Adrian
> Tanase <atan...@adobe.com>, "user @spark" <user@spark.apache.org>,
> Xiangrui Meng <men...@gmail.com>, hol...@pigscanfly.ca
> Subject: RE: thought experiment: use spark ML to real time prediction
>
> I am glad to see DB’s comments; they make me feel I am not the only one
> facing these issues. If we were able to use MLlib to load the model in web
> applications (outside the Spark cluster), that would have solved the
> issue. I understand Spark is mainly for processing big data in a
> distributed mode. But there is no point in training a model using MLlib
> if we are not able to use it in the applications that need to access the
> model.
>
>
>
> Thanks
>
> Viju
>
>
>
> *From:* DB Tsai [mailto:dbt...@dbtsai.com]
> *Sent:* Thursday, November 12, 2015 11:04 AM
> *To:* Sean Owen
> *Cc:* Felix Cheung; Nirmal Fernando; Andy Davidson; Adrian Tanase; user
> @spark; Xiangrui Meng; hol...@pigscanfly.ca
> *Subject:* Re: thought experiment: use spark ML to real time prediction
>
>
>
> I think the use-case can be quite different from PMML.
>
>
>
> A Spark-platform-independent ML jar can empower users to
> do the following:
>
>
>
> 1) PMML doesn't contain all the models we have in mllib. Also, for an ML
> pipeline trained by Spark, most of the time PMML is not expressive enough to
> do all the transformations we have in Spark ML. As a result, if we were able
> to serialize the entire Spark ML pipeline after training, and then load it
> back in an app without any Spark platform for production scoring, this
> would be very useful for production deployment of Spark ML models. The only
> issue will be if a transformer involves a shuffle; we need to figure
> out a way to handle it. When I chatted with Xiangrui about this, he
> suggested that we may tag whether a transformer is shuffle-ready. Currently,
> at Netflix, we are not able to use ML pipelines because of those issues, and
> we have to write our own scorers for production, which is a lot of duplicated
> work.
>
>
>
> 2) If users could use Spark's linear algebra code, such as vector or matrix, in
> their applications, that would be very useful. It would help to share code
> between the Spark training pipeline and production deployment. Also, lots of
> good stuff in Spark's mllib doesn't depend on the Spark platform, and people
> could use it in their applications without pulling in lots of dependencies. In
> fact, in my project, I have to copy & paste code from mllib into my project to
> use those goodies in apps.
>
>
>
> 3) Currently, mllib depends on graphx which means in graphx, there is no
> way to use mllib's vector or matrix. And
>



-- 

Thanks & regards,
Nirmal

Team Lead - WSO2 Machine Learner
Associate Technical Lead - Data Technologies Team, WSO2 Inc.
Mobile: +94715779733
Blog: http://nirmalfdo.blogspot.com/


Re: thought experiment: use spark ML to real time prediction

2015-11-11 Thread Nirmal Fernando
As of now, we basically serialize the ML model and then deserialize it for
prediction at real time.

On Wed, Nov 11, 2015 at 4:39 PM, Adrian Tanase  wrote:

> I don’t think this answers your question but here’s how you would evaluate
> the model in realtime in a streaming app
>
> https://databricks.gitbooks.io/databricks-spark-reference-applications/content/twitter_classifier/predict.html
>
> Maybe you can find a way to extract portions of MLLib and run them outside
> of spark – loading the precomputed model and calling .predict on it…
>
> -adrian
>
> From: Andy Davidson
> Date: Tuesday, November 10, 2015 at 11:31 PM
> To: "user @spark"
> Subject: thought experiment: use spark ML to real time prediction
>
> Let's say I have used Spark ML to train a linear model. I know I can save
> and load the model to disk. I am not sure how I can use the model in a real
> time environment. For example I do not think I can return a “prediction” to
> the client using spark streaming easily. Also for some applications the
> extra latency created by the batch process might not be acceptable.
>
> If I was not using Spark I would re-implement the model I trained in my
> batch environment in a language like Java and implement a REST service that
> uses the model to create a prediction and return the prediction to the
> client. Many models make predictions using linear algebra. Implementing
> predictions is relatively easy if you have a good vectorized LA package. Is
> there a way to use a model I trained using spark ML outside of spark?
>
> As a motivating example, even if it's possible to return data to the client
> using Spark Streaming, I think the mini-batch latency would not be
> acceptable for a high frequency stock trading system.
>
> Kind regards
>
> Andy
>
> P.s. The examples I have seen so far use spark streaming to “preprocess”
> predictions. For example a recommender system might use what current users
> are watching to calculate “trending recommendations”. These are stored on
> disk and served up to users when they use the “movie guide”. If a
> recommendation was a couple of minutes old it would not affect the end user's
> experience.
>
>


-- 

Thanks & regards,
Nirmal

Team Lead - WSO2 Machine Learner
Associate Technical Lead - Data Technologies Team, WSO2 Inc.
Mobile: +94715779733
Blog: http://nirmalfdo.blogspot.com/


Applying transformations on a JavaRDD using reflection

2015-09-08 Thread Nirmal Fernando
Hi All,

I'd like to apply a chain of Spark transformations (map/filter) on a given
JavaRDD. I'll have the set of Spark transformations as Function<T,A>, and
even though I can determine the classes of T and A at runtime, due to
type erasure, I cannot call JavaRDD's transformations as they expect
generics. Any idea on how to resolve this?
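To illustrate, a rough sketch of the kind of chain application I mean (raw types
with an unchecked cast; the helper and its names are hypothetical, and erasure
means the generic parameters don't exist at runtime anyway):

```
import java.util.List;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;

// Hypothetical helper: applies a pre-built chain of Function objects to an RDD.
// The raw types are deliberate; the generics only exist at compile time.
@SuppressWarnings({"unchecked", "rawtypes"})
static JavaRDD<?> applyChain(JavaRDD<?> input, List<Function> chain) {
    JavaRDD current = input;
    for (Function f : chain) {
        current = current.map(f);
    }
    return current;
}
```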

-- 

Thanks & regards,
Nirmal

Team Lead - WSO2 Machine Learner
Associate Technical Lead - Data Technologies Team, WSO2 Inc.
Mobile: +94715779733
Blog: http://nirmalfdo.blogspot.com/


Re: Applying transformations on a JavaRDD using reflection

2015-09-08 Thread Nirmal Fernando
Any thoughts?

On Tue, Sep 8, 2015 at 3:37 PM, Nirmal Fernando <nir...@wso2.com> wrote:

> Hi All,
>
> I'd like to apply a chain of Spark transformations (map/filter) on a given
> JavaRDD. I'll have the set of Spark transformations as Function<T,A>, and
> even though I can determine the classes of T and A at the runtime, due to
> the type erasure, I cannot call JavaRDD's transformations as they expect
> generics. Any idea on how to resolve this?
>
> --
>
> Thanks & regards,
> Nirmal
>
> Team Lead - WSO2 Machine Learner
> Associate Technical Lead - Data Technologies Team, WSO2 Inc.
> Mobile: +94715779733
> Blog: http://nirmalfdo.blogspot.com/
>
>
>


-- 

Thanks & regards,
Nirmal

Team Lead - WSO2 Machine Learner
Associate Technical Lead - Data Technologies Team, WSO2 Inc.
Mobile: +94715779733
Blog: http://nirmalfdo.blogspot.com/


[MLLib][Kmeans] KMeansModel.computeCost takes lot of time

2015-07-13 Thread Nirmal Fernando
Hi,

For a fairly large dataset (30 MB), KMeansModel.computeCost takes a lot of time
(16+ minutes).

Most of the time is spent in this task:

org.apache.spark.rdd.DoubleRDDFunctions.sum(DoubleRDDFunctions.scala:33)
org.apache.spark.mllib.clustering.KMeansModel.computeCost(KMeansModel.scala:70)

Can this be improved?

-- 

Thanks  regards,
Nirmal

Associate Technical Lead - Data Technologies Team, WSO2 Inc.
Mobile: +94715779733
Blog: http://nirmalfdo.blogspot.com/


Re: [MLLib][Kmeans] KMeansModel.computeCost takes lot of time

2015-07-13 Thread Nirmal Fernando
Thanks Burak.

Now it takes minutes to repartition;

Active Stages (1):
  Stage 42: repartition at UnsupervisedSparkModelBuilder.java:120
  Submitted: 2015/07/14 08:59:30   Duration: 3.6 min   Tasks: 0/3   Input: 14.6 MB

org.apache.spark.api.java.JavaRDD.repartition(JavaRDD.scala:100)
org.wso2.carbon.ml.core.spark.algorithms.UnsupervisedSparkModelBuilder.buildKMeansModel(UnsupervisedSparkModelBuilder.java:120)
org.wso2.carbon.ml.core.spark.algorithms.UnsupervisedSparkModelBuilder.build(UnsupervisedSparkModelBuilder.java:84)
org.wso2.carbon.ml.core.impl.MLModelHandler$ModelBuilder.run(MLModelHandler.java:576)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)

Pending Stages (1):
  Stage 43: sum at KMeansModel.scala:70
  Tasks: 0/8

org.apache.spark.rdd.DoubleRDDFunctions.sum(DoubleRDDFunctions.scala:33)
org.apache.spark.mllib.clustering.KMeansModel.computeCost(KMeansModel.scala:70)
org.wso2.carbon.ml.core.spark.algorithms.UnsupervisedSparkModelBuilder.buildKMeansModel(UnsupervisedSparkModelBuilder.java:121)
org.wso2.carbon.ml.core.spark.algorithms.UnsupervisedSparkModelBuilder.build(UnsupervisedSparkModelBuilder.java:84)
org.wso2.carbon.ml.core.impl.MLModelHandler$ModelBuilder.run(MLModelHandler.java:576)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)

On Mon, Jul 13, 2015 at 11:44 PM, Burak Yavuz brk...@gmail.com wrote:

 Can you call repartition(8) or 16 on data.rdd(), before KMeans, and also,
 .cache()?

 something like, (I'm assuming you are using Java):
 ```
 JavaRDD<Vector> input = data.repartition(8).cache();
 org.apache.spark.mllib.clustering.KMeans.train(input.rdd(), 3, 20);
 ```

 On Mon, Jul 13, 2015 at 11:10 AM, Nirmal Fernando nir...@wso2.com wrote:

 I'm using;

 org.apache.spark.mllib.clustering.KMeans.train(data.rdd(), 3, 20);

 Cpu cores: 8 (using default Spark conf thought)

 On partitions, I'm not sure how to find that.

 On Mon, Jul 13, 2015 at 11:30 PM, Burak Yavuz brk...@gmail.com wrote:

 What are the other parameters? Are you just setting k=3? What about # of
 runs? How many partitions do you have? How many cores does your machine
 have?

 Thanks,
 Burak

 On Mon, Jul 13, 2015 at 10:57 AM, Nirmal Fernando nir...@wso2.com
 wrote:

 Hi Burak,

 k = 3
 dimension = 785 features
 Spark 1.4

 On Mon, Jul 13, 2015 at 10:28 PM, Burak Yavuz brk...@gmail.com wrote:

 Hi,

 How are you running K-Means? What is your k? What is the dimension of
 your dataset (columns)? Which Spark version are you using?

 Thanks,
 Burak

 On Mon, Jul 13, 2015 at 2:53 AM, Nirmal Fernando nir...@wso2.com
 wrote:

 Hi,

 For a fairly large dataset, 30MB, KMeansModel.computeCost takes lot
 of time (16+ mints).

 It takes lot of time at this task;

 org.apache.spark.rdd.DoubleRDDFunctions.sum(DoubleRDDFunctions.scala:33)
 org.apache.spark.mllib.clustering.KMeansModel.computeCost(KMeansModel.scala:70)

 Can this be improved?

 --

 Thanks  regards,
 Nirmal

 Associate Technical Lead - Data Technologies Team, WSO2 Inc.
 Mobile: +94715779733
 Blog: http://nirmalfdo.blogspot.com/






 --

 Thanks  regards,
 Nirmal

 Associate Technical Lead - Data Technologies Team, WSO2 Inc.
 Mobile: +94715779733
 Blog: http://nirmalfdo.blogspot.com/






 --

 Thanks  regards,
 Nirmal

 Associate Technical Lead - Data Technologies Team, WSO2 Inc.
 Mobile: +94715779733
 Blog: http://nirmalfdo.blogspot.com/






-- 

Thanks  regards,
Nirmal

Associate Technical Lead - Data Technologies Team, WSO2 Inc.
Mobile: +94715779733
Blog: http://nirmalfdo.blogspot.com/


Re: [MLLib][Kmeans] KMeansModel.computeCost takes lot of time

2015-07-13 Thread Nirmal Fernando
Can it be the limited memory causing this slowness?

On Tue, Jul 14, 2015 at 9:00 AM, Nirmal Fernando nir...@wso2.com wrote:

 Thanks Burak.

 Now it takes minutes to repartition;

 Active Stages (1):
   Stage 42: repartition at UnsupervisedSparkModelBuilder.java:120
   Submitted: 2015/07/14 08:59:30   Duration: 3.6 min   Tasks: 0/3   Input: 14.6 MB

 org.apache.spark.api.java.JavaRDD.repartition(JavaRDD.scala:100)
 org.wso2.carbon.ml.core.spark.algorithms.UnsupervisedSparkModelBuilder.buildKMeansModel(UnsupervisedSparkModelBuilder.java:120)
 org.wso2.carbon.ml.core.spark.algorithms.UnsupervisedSparkModelBuilder.build(UnsupervisedSparkModelBuilder.java:84)
 org.wso2.carbon.ml.core.impl.MLModelHandler$ModelBuilder.run(MLModelHandler.java:576)
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 java.lang.Thread.run(Thread.java:745)

 Pending Stages (1):
   Stage 43: sum at KMeansModel.scala:70
   Tasks: 0/8

 org.apache.spark.rdd.DoubleRDDFunctions.sum(DoubleRDDFunctions.scala:33)
 org.apache.spark.mllib.clustering.KMeansModel.computeCost(KMeansModel.scala:70)
 org.wso2.carbon.ml.core.spark.algorithms.UnsupervisedSparkModelBuilder.buildKMeansModel(UnsupervisedSparkModelBuilder.java:121)
 org.wso2.carbon.ml.core.spark.algorithms.UnsupervisedSparkModelBuilder.build(UnsupervisedSparkModelBuilder.java:84)
 org.wso2.carbon.ml.core.impl.MLModelHandler$ModelBuilder.run(MLModelHandler.java:576)
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 java.lang.Thread.run(Thread.java:745)

 On Mon, Jul 13, 2015 at 11:44 PM, Burak Yavuz brk...@gmail.com wrote:

 Can you call repartition(8) or 16 on data.rdd(), before KMeans, and also,
 .cache()?

 something like, (I'm assuming you are using Java):
 ```
 JavaRDD<Vector> input = data.repartition(8).cache();
 org.apache.spark.mllib.clustering.KMeans.train(input.rdd(), 3, 20);
 ```

 On Mon, Jul 13, 2015 at 11:10 AM, Nirmal Fernando nir...@wso2.com
 wrote:

 I'm using;

 org.apache.spark.mllib.clustering.KMeans.train(data.rdd(), 3, 20);

 Cpu cores: 8 (using default Spark conf thought)

 On partitions, I'm not sure how to find that.

 On Mon, Jul 13, 2015 at 11:30 PM, Burak Yavuz brk...@gmail.com wrote:

 What are the other parameters? Are you just setting k=3? What about #
 of runs? How many partitions do you have? How many cores does your machine
 have?

 Thanks,
 Burak

 On Mon, Jul 13, 2015 at 10:57 AM, Nirmal Fernando nir...@wso2.com
 wrote:

 Hi Burak,

 k = 3
 dimension = 785 features
 Spark 1.4

 On Mon, Jul 13, 2015 at 10:28 PM, Burak Yavuz brk...@gmail.com
 wrote:

 Hi,

 How are you running K-Means? What is your k? What is the dimension of
 your dataset (columns)? Which Spark version are you using?

 Thanks,
 Burak

 On Mon, Jul 13, 2015 at 2:53 AM, Nirmal Fernando nir...@wso2.com
 wrote:

 Hi,

 For a fairly large dataset, 30MB, KMeansModel.computeCost takes lot
 of time (16+ mints).

 It takes lot of time at this task;

 org.apache.spark.rdd.DoubleRDDFunctions.sum(DoubleRDDFunctions.scala:33)
 org.apache.spark.mllib.clustering.KMeansModel.computeCost(KMeansModel.scala:70)

 Can this be improved?

 --

 Thanks  regards,
 Nirmal

 Associate Technical Lead - Data Technologies Team, WSO2 Inc.
 Mobile: +94715779733
 Blog: http://nirmalfdo.blogspot.com/






 --

 Thanks  regards,
 Nirmal

 Associate Technical Lead - Data Technologies Team, WSO2 Inc.
 Mobile: +94715779733
 Blog: http://nirmalfdo.blogspot.com/






 --

 Thanks  regards,
 Nirmal

 Associate Technical Lead - Data Technologies Team, WSO2 Inc.
 Mobile: +94715779733
 Blog: http://nirmalfdo.blogspot.com/






 --

 Thanks  regards,
 Nirmal

 Associate Technical Lead - Data Technologies Team, WSO2 Inc.
 Mobile: +94715779733
 Blog: http://nirmalfdo.blogspot.com/





-- 

Thanks  regards,
Nirmal

Associate Technical Lead - Data Technologies Team, WSO2 Inc.
Mobile: +94715779733
Blog: http://nirmalfdo.blogspot.com/


Re: How to speed up Spark process

2015-07-13 Thread Nirmal Fernando
If you click on +details you can see the code that takes the time. Did
you already check it?

On Tue, Jul 14, 2015 at 9:56 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:

 Job view. Others are fast, but the first one (repartition) is taking 95%
 of job run time.

 On Mon, Jul 13, 2015 at 9:23 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com
 wrote:

 It completed in 32 minutes. Attached is screenshots. How do i speed it up
 ?


 On Mon, Jul 13, 2015 at 9:19 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com
 wrote:

 It's been 30 minutes and the repartition still has not completed; it seems
 to run forever.

 Without repartition, i see this error
 https://issues.apache.org/jira/browse/SPARK-5928


  FetchFailed(BlockManagerId(1, imran-2.ent.cloudera.com, 55028), 
 shuffleId=1, mapId=0, reduceId=0, message=
 org.apache.spark.shuffle.FetchFailedException: Adjusted frame length 
 exceeds 2147483647: 3021252889 - discarded
 at 
 org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.org$apache$spark$shuffle$hash$BlockStoreShuffleFetcher$$unpackBlock$1(BlockStoreShuffleFetcher.scala:67)
 at 
 org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83)
 at 
 org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83)
 at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
 at 
 org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)




 On Mon, Jul 13, 2015 at 8:34 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com
 wrote:

 I have 100 MB of Avro data, and the repartition(307) is taking forever.

 2. val x = input.repartition(7907).map( {k1,k2,k3,k4}, {inputRecord} )
 3. val quantiles = x.map( {k1,k2,k3,k4},  TDigest(inputRecord).asBytes
 ).reduceByKey() [ This was groupBy earlier ]
 4. x.join(quantiles).coalesce(100).writeInAvro


 Attached is full Scala code.

 I have a 340-node YARN cluster with 14G RAM on each node, and the input
 data is just 100 MB. (Hadoop takes 2.5 hours on a 1 TB dataset.)


 ./bin/spark-submit -v --master yarn-cluster  --jars
 /apache/hadoop-2.4.1-2.1.3.0-2-EBAY/share/hadoop/hdfs/hadoop-hdfs-2.4.1-EBAY-2.jar,/home/dvasthimal/spark1.4/lib/spark_reporting_dep_only-1.0-SNAPSHOT.jar
  --num-executors 330 --driver-memory 14g --driver-java-options
 -XX:MaxPermSize=512M -Xmx4096M -Xms4096M -verbose:gc -XX:+PrintGCDetails
 -XX:+PrintGCTimeStamps --executor-memory 14g --executor-cores 1 --queue
 hdmi-others --class com.ebay.ep.poc.spark.reporting.SparkApp
 /home/dvasthimal/spark1.4/lib/spark_reporting-1.0-SNAPSHOT.jar
 startDate=2015-06-20 endDate=2015-06-21
 input=/apps/hdmi-prod/b_um/epdatasets/exptsession subcommand=ppwmasterprime
 output=/user/dvasthimal/epdatasets/ppwmasterprime buffersize=128
 maxbuffersize=1068 maxResultSize=200G


 I see this in stdout of the task on that executor

 15/07/13 19:58:48 WARN hdfs.BlockReaderLocal: The short-circuit local 
 reads feature cannot be used because libhadoop cannot be loaded.
 15/07/13 20:00:08 INFO collection.ExternalSorter: Thread 47 spilling 
 in-memory map of 2.2 GB to disk (1 time so far)
 15/07/13 20:01:31 INFO collection.ExternalSorter: Thread 47 spilling 
 in-memory map of 2.2 GB to disk (2 times so far)
 15/07/13 20:03:07 INFO collection.ExternalSorter: Thread 47 spilling 
 in-memory map of 2.2 GB to disk (3 times so far)
 15/07/13 20:04:32 INFO collection.ExternalSorter: Thread 47 spilling 
 in-memory map of 2.2 GB to disk (4 times so far)
 15/07/13 20:06:21 INFO collection.ExternalSorter: Thread 47 spilling 
 in-memory map of 2.2 GB to disk (5 times so far)
 15/07/13 20:08:09 INFO collection.ExternalSorter: Thread 47 spilling 
 in-memory map of 2.2 GB to disk (6 times so far)
 15/07/13 20:09:51 INFO collection.ExternalSorter: Thread 47 spilling 
 in-memory map of 2.2 GB to disk (7 times so far)



 Also attached is the thread dump


 --
 Deepak




 --
 Deepak




 --
 Deepak




 --
 Deepak



 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




-- 

Thanks  regards,
Nirmal

Associate Technical Lead - Data Technologies Team, WSO2 Inc.
Mobile: +94715779733
Blog: http://nirmalfdo.blogspot.com/


Re: [MLLib][Kmeans] KMeansModel.computeCost takes lot of time

2015-07-13 Thread Nirmal Fernando
I'm using;

org.apache.spark.mllib.clustering.KMeans.train(data.rdd(), 3, 20);

CPU cores: 8 (using the default Spark conf though)

On partitions, I'm not sure how to find that.
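I guess something like this would show it (a rough sketch, assuming 'data' is the
JavaRDD<Vector> being passed to KMeans.train):

```
// Number of partitions of the input RDD, printed before training.
int numPartitions = data.partitions().size();
System.out.println("partitions = " + numPartitions);
```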

On Mon, Jul 13, 2015 at 11:30 PM, Burak Yavuz brk...@gmail.com wrote:

 What are the other parameters? Are you just setting k=3? What about # of
 runs? How many partitions do you have? How many cores does your machine
 have?

 Thanks,
 Burak

 On Mon, Jul 13, 2015 at 10:57 AM, Nirmal Fernando nir...@wso2.com wrote:

 Hi Burak,

 k = 3
 dimension = 785 features
 Spark 1.4

 On Mon, Jul 13, 2015 at 10:28 PM, Burak Yavuz brk...@gmail.com wrote:

 Hi,

 How are you running K-Means? What is your k? What is the dimension of
 your dataset (columns)? Which Spark version are you using?

 Thanks,
 Burak

 On Mon, Jul 13, 2015 at 2:53 AM, Nirmal Fernando nir...@wso2.com
 wrote:

 Hi,

 For a fairly large dataset, 30MB, KMeansModel.computeCost takes lot of
 time (16+ mints).

 It takes lot of time at this task;

 org.apache.spark.rdd.DoubleRDDFunctions.sum(DoubleRDDFunctions.scala:33)
 org.apache.spark.mllib.clustering.KMeansModel.computeCost(KMeansModel.scala:70)

 Can this be improved?

 --

 Thanks  regards,
 Nirmal

 Associate Technical Lead - Data Technologies Team, WSO2 Inc.
 Mobile: +94715779733
 Blog: http://nirmalfdo.blogspot.com/






 --

 Thanks  regards,
 Nirmal

 Associate Technical Lead - Data Technologies Team, WSO2 Inc.
 Mobile: +94715779733
 Blog: http://nirmalfdo.blogspot.com/






-- 

Thanks  regards,
Nirmal

Associate Technical Lead - Data Technologies Team, WSO2 Inc.
Mobile: +94715779733
Blog: http://nirmalfdo.blogspot.com/


Re: [MLLib][Kmeans] KMeansModel.computeCost takes lot of time

2015-07-13 Thread Nirmal Fernando
Hi Burak,

k = 3
dimension = 785 features
Spark 1.4

On Mon, Jul 13, 2015 at 10:28 PM, Burak Yavuz brk...@gmail.com wrote:

 Hi,

 How are you running K-Means? What is your k? What is the dimension of your
 dataset (columns)? Which Spark version are you using?

 Thanks,
 Burak

 On Mon, Jul 13, 2015 at 2:53 AM, Nirmal Fernando nir...@wso2.com wrote:

 Hi,

 For a fairly large dataset, 30MB, KMeansModel.computeCost takes lot of
 time (16+ mints).

 It takes lot of time at this task;

 org.apache.spark.rdd.DoubleRDDFunctions.sum(DoubleRDDFunctions.scala:33)
 org.apache.spark.mllib.clustering.KMeansModel.computeCost(KMeansModel.scala:70)

 Can this be improved?

 --

 Thanks  regards,
 Nirmal

 Associate Technical Lead - Data Technologies Team, WSO2 Inc.
 Mobile: +94715779733
 Blog: http://nirmalfdo.blogspot.com/






-- 

Thanks  regards,
Nirmal

Associate Technical Lead - Data Technologies Team, WSO2 Inc.
Mobile: +94715779733
Blog: http://nirmalfdo.blogspot.com/


Spark MLLib 140 - logistic regression with SGD model accuracy is different in local mode and cluster mode

2015-07-02 Thread Nirmal Fernando
Hi All,

I'm facing a quite strange case: after migrating to Spark 1.4.0, I'm seeing
Spark MLlib produce different results when run in local mode and in cluster
mode. Is there any possibility of that happening? (I feel this is an issue in
my environment, but just wanted to get it confirmed.)

Thanks.

-- 

Thanks  regards,
Nirmal

Associate Technical Lead - Data Technologies Team, WSO2 Inc.
Mobile: +94715779733
Blog: http://nirmalfdo.blogspot.com/


Run multiple Spark jobs concurrently

2015-07-01 Thread Nirmal Fernando
Hi All,

Are there any additional configs that we have to set to perform $subject?

-- 

Thanks  regards,
Nirmal

Associate Technical Lead - Data Technologies Team, WSO2 Inc.
Mobile: +94715779733
Blog: http://nirmalfdo.blogspot.com/


Re: path to hdfs

2015-06-08 Thread Nirmal Fernando
The HDFS path should be something like:
hdfs://127.0.0.1:8020/user/cloudera/inputs/
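For example, a quick sketch (the paths are taken from your stack trace, and 'sc'
is assumed to be an existing JavaSparkContext):

```
import org.apache.spark.api.java.JavaRDD;

// Use the fully qualified hdfs:// URI, not a bare host:port/path string,
// otherwise Hadoop treats "127.0.0.1:8020/..." as a relative path.
String input  = "hdfs://127.0.0.1:8020/user/cloudera/inputs/";
String output = "hdfs://127.0.0.1:8020/user/cloudera/outputs/output_spark";

JavaRDD<String> lines = sc.textFile(input);
lines.saveAsTextFile(output);
```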

On Mon, Jun 8, 2015 at 4:15 PM, Pa Rö paul.roewer1...@googlemail.com
wrote:

 hello,

 i submit my spark job with the following parameters:

 ./spark-1.1.0-bin-hadoop2.4/bin/spark-submit \
   --class mgm.tp.bigdata.ma_spark.SparkMain \
   --master spark://quickstart.cloudera:7077 \
   ma-spark.jar \
   1000

 and get the following exception:

 java.io.IOException: Mkdirs failed to create file:/
 127.0.0.1:8020/user/cloudera/outputs/output_spark
 at
 org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:438)
 at
 org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:424)
 at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:906)
 at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:887)
 at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:784)
 at mgm.tp.bigdata.ma_spark.Helper.writeCenterHistory(Helper.java:35)
 at mgm.tp.bigdata.ma_spark.SparkMain.main(SparkMain.java:100)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328)
 at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
 at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
 java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative
 path in absolute URI: 127.0.0.1:8020
 at org.apache.hadoop.fs.Path.initialize(Path.java:206)
 at org.apache.hadoop.fs.Path.<init>(Path.java:172)
 at org.apache.hadoop.fs.Path.<init>(Path.java:94)
 at org.apache.hadoop.fs.Globber.glob(Globber.java:211)
 at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1642)
 at
 org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:257)
 at
 org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)
 at
 org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:304)
 at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:179)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
 at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
 at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
 at org.apache.spark.rdd.FilteredRDD.getPartitions(FilteredRDD.scala:29)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
 at org.apache.spark.SparkContext.runJob(SparkContext.scala:1135)
 at org.apache.spark.rdd.RDD.count(RDD.scala:904)
 at org.apache.spark.rdd.RDD.takeSample(RDD.scala:401)
 at
 org.apache.spark.api.java.JavaRDDLike$class.takeSample(JavaRDDLike.scala:426)
 at org.apache.spark.api.java.JavaRDD.takeSample(JavaRDD.scala:32)
 at
 org.apache.spark.api.java.JavaRDDLike$class.takeSample(JavaRDDLike.scala:422)
 at org.apache.spark.api.java.JavaRDD.takeSample(JavaRDD.scala:32)
 at mgm.tp.bigdata.ma_spark.SparkMain.kmeans(SparkMain.java:123)
 at mgm.tp.bigdata.ma_spark.SparkMain.main(SparkMain.java:102)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328)
 at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
 at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
 Caused by: java.net.URISyntaxException: Relative path in absolute URI:
 127.0.0.1:8020
 at java.net.URI.checkPath(URI.java:1804)
 at java.net.URI.<init>(URI.java:752)
 at org.apache.hadoop.fs.Path.initialize(Path.java:203)
 ... 43 

Is there a way to disable the Spark UI?

2015-02-02 Thread Nirmal Fernando
Hi All,

Is there a way to disable the Spark UI? What I really need is to stop the
startup of the Jetty server.

-- 

Thanks  regards,
Nirmal

Senior Software Engineer- Platform Technologies Team, WSO2 Inc.
Mobile: +94715779733
Blog: http://nirmalfdo.blogspot.com/


Re: Is there a way to disable the Spark UI?

2015-02-02 Thread Nirmal Fernando
Thanks Zhan! Was this introduced in Spark 1.2, or is it available in
Spark 1.1?
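So I assume something like this would do it (a minimal sketch of Zhan's
suggestion; the app name is just a placeholder):

```
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

// Disable the web UI so that its embedded Jetty server is never started.
SparkConf conf = new SparkConf()
    .setAppName("no-ui-app")
    .set("spark.ui.enabled", "false");
JavaSparkContext sc = new JavaSparkContext(conf);
```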

On Tue, Feb 3, 2015 at 11:52 AM, Zhan Zhang zzh...@hortonworks.com wrote:

  You can set spark.ui.enabled to false to disable ui.

  Thanks.

  Zhan Zhang

  On Feb 2, 2015, at 8:06 PM, Nirmal Fernando nir...@wso2.com wrote:

  Hi All,

  Is there a way to disable the Spark UI? What I really need is to stop
 the startup of the Jetty server.

  --

 Thanks  regards,
 Nirmal

 Senior Software Engineer- Platform Technologies Team, WSO2 Inc.
 Mobile: +94715779733
 Blog: http://nirmalfdo.blogspot.com/






-- 

Thanks  regards,
Nirmal

Senior Software Engineer- Platform Technologies Team, WSO2 Inc.
Mobile: +94715779733
Blog: http://nirmalfdo.blogspot.com/