Re: Sporadic ClassNotFoundException with Kryo
I faced a similar issue and had to do two things:

1. Submit the jar containing the Kryo-registered classes with spark-submit.
2. Set spark.executor.userClassPathFirst=true in the Spark conf.

On Fri, Nov 18, 2016 at 7:39 PM, chris wrote:

> Regardless of the different ways we have tried deploying a jar together
> with Spark, when running a Spark Streaming job with Kryo as the serializer
> on top of Mesos, we sporadically get the following error (I have truncated
> it a bit):
>
> 16/11/18 08:39:10 ERROR OneForOneBlockFetcher: Failed while starting block fetches
> java.lang.RuntimeException: org.apache.spark.SparkException: Failed to register classes with Kryo
>     at org.apache.spark.serializer.KryoSerializer.newKryo(KryoSerializer.scala:129)
>     at org.apache.spark.serializer.KryoSerializerInstance.borrowKryo(KryoSerializer.scala:274)
>     ...
>     at org.apache.spark.serializer.SerializerManager.dataSerializeStream(SerializerManager.scala:125)
>     at org.apache.spark.storage.BlockManager$$anonfun$dropFromMemory$3.apply(BlockManager.scala:1265)
>     at org.apache.spark.storage.BlockManager$$anonfun$dropFromMemory$3.apply(BlockManager.scala:1261)
>     ...
> Caused by: java.lang.ClassNotFoundException: cics.udr.compound_ran_udr
>     at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
>     at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>
> where "cics.udr.compound_ran_udr" is a class provided by us in a jar.
>
> We know that the jar containing "cics.udr.compound_ran_udr" is being
> deployed and works, because it is listed in the "Environment" tab in the
> GUI and calculations using this class succeed.
>
> We have tried the following methods of deploying the jar containing the
> class:
> * Through --jars in spark-submit
> * Through SparkConf.setJars
> * Through spark.driver.extraClassPath and spark.executor.extraClassPath
> * By having it as the main jar used by spark-submit
> with no luck. The logs (see attached) recognize that the jar is being added
> to the classloader.
> We have tried registering the class using:
> * SparkConf.registerKryoClasses
> * spark.kryo.classesToRegister
> with no luck.
>
> We are running on Mesos, and the jar has been deployed on every machine on
> the local file system in the same location.
>
> I would be very grateful for any help or ideas :)
>
> --
> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Sporadic-ClassNotFoundException-with-Kryo-tp28104.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org

--
Thanks & regards,
Nirmal

Associate Technical Lead - Data Technologies Team, WSO2 Inc.
Mobile: +94715779733
Blog: http://nirmalfdo.blogspot.com/
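For reference, the two fixes combined into a single spark-submit invocation might look like the sketch below. The jar path, master URL, and main class are illustrative placeholders, not taken from the thread:

```shell
# Ship the jar that contains the Kryo-registered classes to the executors,
# and make executor class loaders consult user jars before Spark's own.
# (Paths, master URL, and the StreamingJob class are placeholder names.)
spark-submit \
  --master mesos://zk://zk1:2181/mesos \
  --jars /opt/jars/udrs.jar \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.kryo.classesToRegister=cics.udr.compound_ran_udr \
  --conf spark.executor.userClassPathFirst=true \
  --class com.example.StreamingJob \
  streaming-job.jar
```

Note that spark.executor.userClassPathFirst is marked experimental in the Spark docs, so it is worth testing in a staging environment first.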
Re: Apply ML to grouped dataframe
On Tue, Aug 23, 2016 at 10:56 AM, Wen Pei Yu <yuw...@cn.ibm.com> wrote:

> We can group a dataframe by one column like
>
> df.groupBy(df.col("gender"))

On top of this DF, use a filter that lets you extract each group as a separate DF. Then you can apply ML on top of each DF. E.g.:

xyzDF.filter(col("x").equalTo(x))

> It's like splitting a dataframe into multiple dataframes. Currently, we can
> only apply simple SQL functions to this GroupedData, like agg, max, etc.
>
> What we want is to apply one ML algorithm to each group.
>
> Regards.
>
> From: Nirmal Fernando <nir...@wso2.com>
> To: Wen Pei Yu/China/IBM@IBMCN
> Cc: User <user@spark.apache.org>
> Date: 08/23/2016 01:14 PM
> Subject: Re: Apply ML to grouped dataframe
>
> Hi Wen,
>
> AFAIK Spark MLlib implements its machine learning algorithms on top of the
> Spark dataframe API. What did you mean by a grouped dataframe?
>
> On Tue, Aug 23, 2016 at 10:42 AM, Wen Pei Yu <yuw...@cn.ibm.com> wrote:
>
> > Hi Nirmal
> >
> > I didn't get your point.
> > Can you tell me more about how to use MLlib on a grouped dataframe?
> >
> > Regards.
> > Wenpei.
> > From: Nirmal Fernando <nir...@wso2.com>
> > To: Wen Pei Yu/China/IBM@IBMCN
> > Cc: User <user@spark.apache.org>
> > Date: 08/23/2016 10:26 AM
> > Subject: Re: Apply ML to grouped dataframe
> >
> > You can use Spark MLlib:
> > http://spark.apache.org/docs/latest/ml-guide.html#announcement-dataframe-based-api-is-primary-api
> >
> > On Tue, Aug 23, 2016 at 7:34 AM, Wen Pei Yu <yuw...@cn.ibm.com> wrote:
> >
> > > Hi
> > >
> > > We have a dataframe, and want to group it and apply an ML algorithm or
> > > statistics (say a t-test) to each group. Is there any efficient way for
> > > this situation?
> > >
> > > Currently, we transfer to pyspark, use groupByKey and apply a numpy
> > > function to each array. But this isn't an efficient way, right?
> > >
> > > Regards.
> > > Wenpei.

--
Thanks & regards,
Nirmal

Team Lead - WSO2 Machine Learner
Associate Technical Lead - Data Technologies Team, WSO2 Inc.
Mobile: +94715779733
Blog: http://nirmalfdo.blogspot.com/
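To make the filter-per-group suggestion concrete, here is a minimal Java sketch, assuming the Spark 2.x Dataset API; the column name and the choice of KMeans are illustrative, not from the thread:

```java
import org.apache.spark.ml.clustering.KMeans;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import java.util.List;
import static org.apache.spark.sql.functions.col;

// Split the dataframe into one dataframe per distinct group value,
// then fit a separate model on each group's dataframe.
List<Row> groups = df.select("gender").distinct().collectAsList();
for (Row g : groups) {
    Dataset<Row> groupDf = df.filter(col("gender").equalTo(g.get(0)));
    // Apply any ML algorithm to this group's dataframe, e.g.:
    new KMeans().setK(3).fit(groupDf);
}
```

This fits each group sequentially on the driver; if the number of groups is large, the per-group overhead adds up, which matches the thread's concern about efficiency.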
Re: Apply ML to grouped dataframe
Hi Wen,

AFAIK Spark MLlib implements its machine learning algorithms on top of the Spark dataframe API. What did you mean by a grouped dataframe?

On Tue, Aug 23, 2016 at 10:42 AM, Wen Pei Yu <yuw...@cn.ibm.com> wrote:

> Hi Nirmal
>
> I didn't get your point.
> Can you tell me more about how to use MLlib on a grouped dataframe?
>
> Regards.
> Wenpei.
>
> From: Nirmal Fernando <nir...@wso2.com>
> To: Wen Pei Yu/China/IBM@IBMCN
> Cc: User <user@spark.apache.org>
> Date: 08/23/2016 10:26 AM
> Subject: Re: Apply ML to grouped dataframe
>
> You can use Spark MLlib:
> http://spark.apache.org/docs/latest/ml-guide.html#announcement-dataframe-based-api-is-primary-api
>
> On Tue, Aug 23, 2016 at 7:34 AM, Wen Pei Yu <yuw...@cn.ibm.com> wrote:
>
> > Hi
> >
> > We have a dataframe, and want to group it and apply an ML algorithm or
> > statistics (say a t-test) to each group. Is there any efficient way for
> > this situation?
> >
> > Currently, we transfer to pyspark, use groupByKey and apply a numpy
> > function to each array. But this isn't an efficient way, right?
> >
> > Regards.
> > Wenpei.

--
Thanks & regards,
Nirmal

Team Lead - WSO2 Machine Learner
Associate Technical Lead - Data Technologies Team, WSO2 Inc.
Mobile: +94715779733
Blog: http://nirmalfdo.blogspot.com/
Re: Apply ML to grouped dataframe
You can use Spark MLlib:
http://spark.apache.org/docs/latest/ml-guide.html#announcement-dataframe-based-api-is-primary-api

On Tue, Aug 23, 2016 at 7:34 AM, Wen Pei Yu wrote:

> Hi
>
> We have a dataframe, and want to group it and apply an ML algorithm or
> statistics (say a t-test) to each group. Is there any efficient way for
> this situation?
>
> Currently, we transfer to pyspark, use groupByKey and apply a numpy
> function to each array. But this isn't an efficient way, right?
>
> Regards.
> Wenpei.

--
Thanks & regards,
Nirmal

Team Lead - WSO2 Machine Learner
Associate Technical Lead - Data Technologies Team, WSO2 Inc.
Mobile: +94715779733
Blog: http://nirmalfdo.blogspot.com/
Re: thought experiment: use spark ML to real time prediction
On Fri, Nov 13, 2015 at 2:04 AM, darren <dar...@ontrenet.com> wrote:

> I agree 100%. Making the model requires large data and many CPUs.
>
> Using it does not.
>
> This is a very useful side effect of ML models.
>
> If MLlib can't use models outside Spark, that's a real shame.

Well, you can, as mentioned earlier. You don't need the Spark runtime for predictions: save the serialized model and deserialize it to use. (You do need the Spark jars on the classpath, though.)

> Sent from my Verizon Wireless 4G LTE smartphone
>
> Original message
> From: "Kothuvatiparambil, Viju" <viju.kothuvatiparam...@bankofamerica.com>
> Date: 11/12/2015 3:09 PM (GMT-05:00)
> To: DB Tsai <dbt...@dbtsai.com>, Sean Owen <so...@cloudera.com>
> Cc: Felix Cheung <felixcheun...@hotmail.com>, Nirmal Fernando <nir...@wso2.com>, Andy Davidson <a...@santacruzintegration.com>, Adrian Tanase <atan...@adobe.com>, "user @spark" <user@spark.apache.org>, Xiangrui Meng <men...@gmail.com>, hol...@pigscanfly.ca
> Subject: RE: thought experiment: use spark ML to real time prediction
>
> I am glad to see DB's comments; they make me feel I am not the only one
> facing these issues. If we were able to use MLlib to load the model in web
> applications (outside the Spark cluster), that would have solved the
> issue. I understand Spark is mainly for processing big data in a
> distributed mode. But there is no purpose in training a model using MLlib
> if we are not able to use it in the applications that need to access the
> model.
>
> Thanks
> Viju
>
> From: DB Tsai [mailto:dbt...@dbtsai.com]
> Sent: Thursday, November 12, 2015 11:04 AM
> To: Sean Owen
> Cc: Felix Cheung; Nirmal Fernando; Andy Davidson; Adrian Tanase; user @spark; Xiangrui Meng; hol...@pigscanfly.ca
> Subject: Re: thought experiment: use spark ML to real time prediction
>
> I think the use-case can be quite different from PMML.
> By having a Spark-platform-independent ML jar, this can empower users to
> do the following:
>
> 1) PMML doesn't contain all the models we have in MLlib. Also, for an ML
> pipeline trained by Spark, most of the time PMML is not expressive enough
> to do all the transformations we have in Spark ML. As a result, if we are
> able to serialize the entire Spark ML pipeline after training, and then
> load it back in an app without any Spark platform for production scoring,
> this will be very useful for production deployment of Spark ML models. The
> only issue will be if a transformer involves a shuffle; we need to figure
> out a way to handle it. When I chatted with Xiangrui about this, he
> suggested that we may tag whether a transformer is shuffle ready.
> Currently, at Netflix, we are not able to use the ML pipeline because of
> those issues, and we have to write our own scorers for production, which
> is quite duplicated work.
>
> 2) If users can use Spark's linear algebra, like the vector or matrix
> code, in their applications, this will be very useful. It can help to
> share code between the Spark training pipeline and production deployment.
> Also, lots of the good stuff in Spark's MLlib doesn't depend on the Spark
> platform, and people could use it in their applications without pulling in
> lots of dependencies. In fact, in my project, I have to copy & paste code
> from MLlib into my project to use those goodies in apps.
>
> 3) Currently, MLlib depends on GraphX, which means in GraphX there is no
> way to use MLlib's vector or matrix. And

--
Thanks & regards,
Nirmal

Team Lead - WSO2 Machine Learner
Associate Technical Lead - Data Technologies Team, WSO2 Inc.
Mobile: +94715779733
Blog: http://nirmalfdo.blogspot.com/
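For the serialize-then-score workflow discussed above, MLlib models have offered save/load since Spark 1.3. A rough sketch of the pattern (paths, the feature vector, and the surrounding variables are placeholders), with the caveat from the thread that the scoring side still needs the Spark jars on its classpath:

```java
import org.apache.spark.mllib.classification.LogisticRegressionModel;
import org.apache.spark.mllib.linalg.Vectors;

// Training side: persist the fitted model (sc is the SparkContext,
// model a trained LogisticRegressionModel; both assumed in scope).
model.save(sc, "hdfs:///models/lr-model");

// Scoring side (e.g. a web app): load once at startup, then predict
// per request without launching a Spark job.
LogisticRegressionModel loaded =
    LogisticRegressionModel.load(sc, "hdfs:///models/lr-model");
double prediction = loaded.predict(Vectors.dense(1.0, 0.5, -0.3));
```

predict on a single Vector is a local, in-JVM computation, which is what makes the low-latency serving discussed in this thread feasible.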
Re: thought experiment: use spark ML to real time prediction
As of now, we basically serialize the ML model and then deserialize it for prediction at real time.

On Wed, Nov 11, 2015 at 4:39 PM, Adrian Tanase wrote:

> I don't think this answers your question, but here's how you would
> evaluate the model in real time in a streaming app:
>
> https://databricks.gitbooks.io/databricks-spark-reference-applications/content/twitter_classifier/predict.html
>
> Maybe you can find a way to extract portions of MLlib and run them outside
> of Spark: loading the precomputed model and calling .predict on it…
>
> -adrian
>
> From: Andy Davidson
> Date: Tuesday, November 10, 2015 at 11:31 PM
> To: "user @spark"
> Subject: thought experiment: use spark ML to real time prediction
>
> Let's say I have used Spark ML to train a linear model. I know I can save
> and load the model to disk. I am not sure how I can use the model in a
> real-time environment. For example, I do not think I can easily return a
> "prediction" to the client using Spark Streaming. Also, for some
> applications the extra latency created by the batch process might not be
> acceptable.
>
> If I was not using Spark, I would re-implement the model I trained in my
> batch environment in a language like Java and implement a REST service
> that uses the model to create a prediction and return the prediction to
> the client. Many models make predictions using linear algebra.
> Implementing predictions is relatively easy if you have a good vectorized
> LA package. Is there a way to use a model I trained using Spark ML outside
> of Spark?
>
> As a motivating example: even if it's possible to return data to the
> client using Spark Streaming, I think the mini-batch latency would not be
> acceptable for a high-frequency stock trading system.
>
> Kind regards
>
> Andy
>
> P.S. The examples I have seen so far use Spark Streaming to "preprocess"
> predictions. For example, a recommender system might use what current
> users are watching to calculate "trending recommendations".
> These are stored on disk and served up to users when they use the "movie
> guide". If a recommendation were a couple of minutes old, it would not
> affect the end user's experience.

--
Thanks & regards,
Nirmal

Team Lead - WSO2 Machine Learner
Associate Technical Lead - Data Technologies Team, WSO2 Inc.
Mobile: +94715779733
Blog: http://nirmalfdo.blogspot.com/
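The serialize/deserialize approach mentioned at the top of this thread rests on plain Java serialization (MLlib models implement Serializable; newer versions also offer save/load). A Spark-free sketch of the round-trip, with a stand-in model class in place of a real MLlib model:

```java
import java.io.*;

public class ModelRoundTrip {
    // Stand-in for a trained model; MLlib models are Serializable the same way.
    static class DummyModel implements Serializable {
        private static final long serialVersionUID = 1L;
        final double weight;
        DummyModel(double weight) { this.weight = weight; }
        double predict(double x) { return weight * x; }
    }

    public static void main(String[] args) throws Exception {
        DummyModel trained = new DummyModel(2.5);

        // Serialize (e.g. at the end of the training job).
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(trained);
        }

        // Deserialize (e.g. in a web app / REST service at request time).
        DummyModel loaded;
        try (ObjectInputStream in = new ObjectInputStream(
                new ByteArrayInputStream(bytes.toByteArray()))) {
            loaded = (DummyModel) in.readObject();
        }

        System.out.println(loaded.predict(4.0)); // prints 10.0
    }
}
```

With a real MLlib model, the deserializing JVM must have the Spark jars on its classpath so the model's class can be resolved, exactly as noted in the thread.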
Applying transformations on a JavaRDD using reflection
Hi All,

I'd like to apply a chain of Spark transformations (map/filter) on a given JavaRDD. I'll have the set of Spark transformations as Function<T, A>, and even though I can determine the classes of T and A at runtime, due to type erasure I cannot call JavaRDD's transformations, as they expect generics. Any idea on how to resolve this?

--
Thanks & regards,
Nirmal

Team Lead - WSO2 Machine Learner
Associate Technical Lead - Data Technologies Team, WSO2 Inc.
Mobile: +94715779733
Blog: http://nirmalfdo.blogspot.com/
Re: Applying transformations on a JavaRDD using reflection
Any thoughts?

On Tue, Sep 8, 2015 at 3:37 PM, Nirmal Fernando <nir...@wso2.com> wrote:

> Hi All,
>
> I'd like to apply a chain of Spark transformations (map/filter) on a given
> JavaRDD. I'll have the set of Spark transformations as Function<T,A>, and
> even though I can determine the classes of T and A at runtime, due to type
> erasure I cannot call JavaRDD's transformations, as they expect generics.
> Any idea on how to resolve this?

--
Thanks & regards,
Nirmal

Team Lead - WSO2 Machine Learner
Associate Technical Lead - Data Technologies Team, WSO2 Inc.
Mobile: +94715779733
Blog: http://nirmalfdo.blogspot.com/
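One possible workaround (my suggestion, not from the thread): since erasure removes the type arguments at runtime anyway, you can drive the whole chain through raw-typed references and cast once at the end; the same unchecked-call pattern applies to a JavaRDD held as a raw or Object-typed reference. A Spark-free illustration using java.util.function.Function:

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

public class ErasureChain {
    // Apply a chain of transformations whose element types are only
    // known at runtime: every call goes through the erased (raw) API.
    @SuppressWarnings({"rawtypes", "unchecked"})
    static Object applyChain(Object input, List<Function> chain) {
        Object current = input;
        for (Function f : chain) {
            current = f.apply(current); // unchecked; types exist only at runtime
        }
        return current;
    }

    @SuppressWarnings("rawtypes")
    public static void main(String[] args) {
        Function<Integer, Integer> doubleIt = x -> x * 2;
        Function<Integer, String> toText = x -> "value=" + x;
        List<Function> chain = Arrays.<Function>asList(doubleIt, toText);

        String result = (String) applyChain(21, chain); // one cast at the end
        System.out.println(result); // prints value=42
    }
}
```

The cost is that type mismatches between adjacent functions surface as runtime ClassCastExceptions instead of compile errors, so it helps to validate the chain against the runtime Class objects you already have.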
[MLLib][Kmeans] KMeansModel.computeCost takes lot of time
Hi,

For a fairly large dataset, 30MB, KMeansModel.computeCost takes a lot of time (16+ minutes). Most of the time is spent in this task:

org.apache.spark.rdd.DoubleRDDFunctions.sum(DoubleRDDFunctions.scala:33)
org.apache.spark.mllib.clustering.KMeansModel.computeCost(KMeansModel.scala:70)

Can this be improved?

--
Thanks & regards,
Nirmal

Associate Technical Lead - Data Technologies Team, WSO2 Inc.
Mobile: +94715779733
Blog: http://nirmalfdo.blogspot.com/
Re: [MLLib][Kmeans] KMeansModel.computeCost takes lot of time
Thanks Burak. Now it takes minutes just to repartition. From the Spark UI:

Active Stages (1):

Stage 42: repartition at UnsupervisedSparkModelBuilder.java:120 (submitted 2015/07/14 08:59:30, duration 3.6 min, tasks 0/3, input 14.6 MB)
  org.apache.spark.api.java.JavaRDD.repartition(JavaRDD.scala:100)
  org.wso2.carbon.ml.core.spark.algorithms.UnsupervisedSparkModelBuilder.buildKMeansModel(UnsupervisedSparkModelBuilder.java:120)
  org.wso2.carbon.ml.core.spark.algorithms.UnsupervisedSparkModelBuilder.build(UnsupervisedSparkModelBuilder.java:84)
  org.wso2.carbon.ml.core.impl.MLModelHandler$ModelBuilder.run(MLModelHandler.java:576)
  java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
  java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
  java.lang.Thread.run(Thread.java:745)

Pending Stages (1):

Stage 43: sum at KMeansModel.scala:70 (duration unknown, tasks 0/8)
  org.apache.spark.rdd.DoubleRDDFunctions.sum(DoubleRDDFunctions.scala:33)
  org.apache.spark.mllib.clustering.KMeansModel.computeCost(KMeansModel.scala:70)
  org.wso2.carbon.ml.core.spark.algorithms.UnsupervisedSparkModelBuilder.buildKMeansModel(UnsupervisedSparkModelBuilder.java:121)
  org.wso2.carbon.ml.core.spark.algorithms.UnsupervisedSparkModelBuilder.build(UnsupervisedSparkModelBuilder.java:84)
  org.wso2.carbon.ml.core.impl.MLModelHandler$ModelBuilder.run(MLModelHandler.java:576)
  java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
  java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
  java.lang.Thread.run(Thread.java:745)

On Mon, Jul 13, 2015 at 11:44 PM, Burak Yavuz <brk...@gmail.com> wrote:

> Can you call repartition(8) or 16 on data.rdd() before KMeans, and also
> .cache()? Something like (I'm assuming you are using Java):
>
> ```
> JavaRDD<Vector> input = data.repartition(8).cache();
> org.apache.spark.mllib.clustering.KMeans.train(input.rdd(), 3, 20);
> ```
>
> On Mon, Jul 13, 2015 at 11:10 AM, Nirmal Fernando <nir...@wso2.com> wrote:
>
> > I'm using:
> >
> > org.apache.spark.mllib.clustering.KMeans.train(data.rdd(), 3, 20);
> >
> > CPU cores: 8 (using the default Spark conf, though)
> >
> > On partitions, I'm not sure how to find that.
> >
> > On Mon, Jul 13, 2015 at 11:30 PM, Burak Yavuz <brk...@gmail.com> wrote:
> >
> > > What are the other parameters? Are you just setting k=3? What about
> > > # of runs? How many partitions do you have? How many cores does your
> > > machine have?
> > >
> > > Thanks,
> > > Burak
> > >
> > > On Mon, Jul 13, 2015 at 10:57 AM, Nirmal Fernando <nir...@wso2.com> wrote:
> > >
> > > > Hi Burak,
> > > >
> > > > k = 3
> > > > dimension = 785 features
> > > > Spark 1.4
> > > >
> > > > On Mon, Jul 13, 2015 at 10:28 PM, Burak Yavuz <brk...@gmail.com> wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > How are you running K-Means? What is your k? What is the dimension
> > > > > of your dataset (columns)? Which Spark version are you using?
> > > > >
> > > > > Thanks,
> > > > > Burak
> > > > >
> > > > > On Mon, Jul 13, 2015 at 2:53 AM, Nirmal Fernando <nir...@wso2.com> wrote:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > For a fairly large dataset, 30MB, KMeansModel.computeCost takes a
> > > > > > lot of time (16+ minutes). Most of the time is spent in this task:
> > > > > >
> > > > > > org.apache.spark.rdd.DoubleRDDFunctions.sum(DoubleRDDFunctions.scala:33)
> > > > > > org.apache.spark.mllib.clustering.KMeansModel.computeCost(KMeansModel.scala:70)
> > > > > >
> > > > > > Can this be improved?

--
Thanks & regards,
Nirmal

Associate Technical Lead - Data Technologies Team, WSO2 Inc.
Mobile: +94715779733
Blog: http://nirmalfdo.blogspot.com/
Re: [MLLib][Kmeans] KMeansModel.computeCost takes lot of time
Can it be the limited memory that is causing this slowness?

On Tue, Jul 14, 2015 at 9:00 AM, Nirmal Fernando <nir...@wso2.com> wrote:

> Thanks Burak. Now it takes minutes just to repartition. From the Spark UI:
>
> Active Stages (1):
> Stage 42: repartition at UnsupervisedSparkModelBuilder.java:120 (submitted 2015/07/14 08:59:30, duration 3.6 min, tasks 0/3, input 14.6 MB)
>
> Pending Stages (1):
> Stage 43: sum at KMeansModel.scala:70 (duration unknown, tasks 0/8)
>
> On Mon, Jul 13, 2015 at 11:44 PM, Burak Yavuz <brk...@gmail.com> wrote:
>
> > Can you call repartition(8) or 16 on data.rdd() before KMeans, and also
> > .cache()? Something like (I'm assuming you are using Java):
> >
> > ```
> > JavaRDD<Vector> input = data.repartition(8).cache();
> > org.apache.spark.mllib.clustering.KMeans.train(input.rdd(), 3, 20);
> > ```

--
Thanks & regards,
Nirmal

Associate Technical Lead - Data Technologies Team, WSO2 Inc.
Mobile: +94715779733
Blog: http://nirmalfdo.blogspot.com/
Re: How to speed up Spark process
If you press on the +details you can see the code that takes the time. Did you already check it?

On Tue, Jul 14, 2015 at 9:56 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) <deepuj...@gmail.com> wrote:

> Job view. Others are fast, but the first one (repartition) is taking 95%
> of the job run time.
>
> On Mon, Jul 13, 2015 at 9:23 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) <deepuj...@gmail.com> wrote:
>
> > It completed in 32 minutes. Attached are screenshots. How do I speed it
> > up?
> >
> > On Mon, Jul 13, 2015 at 9:19 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) <deepuj...@gmail.com> wrote:
> >
> > > It's been 30 minutes and the repartition still has not completed; it
> > > seems to run forever. Without repartition, I see this error:
> > > https://issues.apache.org/jira/browse/SPARK-5928
> > >
> > > FetchFailed(BlockManagerId(1, imran-2.ent.cloudera.com, 55028), shuffleId=1, mapId=0, reduceId=0, message=
> > > org.apache.spark.shuffle.FetchFailedException: Adjusted frame length exceeds 2147483647: 3021252889 - discarded
> > >     at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.org$apache$spark$shuffle$hash$BlockStoreShuffleFetcher$$unpackBlock$1(BlockStoreShuffleFetcher.scala:67)
> > >     at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83)
> > >     at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83)
> > >     at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
> > >     at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
> > >
> > > On Mon, Jul 13, 2015 at 8:34 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) <deepuj...@gmail.com> wrote:
> > >
> > > > I have 100 MB of Avro data, and repartition(307) is taking forever.
> > > >
> > > > 2. val x = input.repartition(7907).map( {k1,k2,k3,k4}, {inputRecord} )
> > > > 3. val quantiles = x.map( {k1,k2,k3,k4}, TDigest(inputRecord).asBytes ).reduceByKey() [This was groupBy earlier]
> > > > 4. x.join(quantiles).coalesce(100).writeInAvro
> > > >
> > > > Attached is the full Scala code. I have a 340-node YARN cluster with
> > > > 14G RAM on each node, and input data of just 100 MB. (Hadoop takes
> > > > 2.5 hours on a 1 TB dataset.)
> > > >
> > > > ./bin/spark-submit -v --master yarn-cluster \
> > > >   --jars /apache/hadoop-2.4.1-2.1.3.0-2-EBAY/share/hadoop/hdfs/hadoop-hdfs-2.4.1-EBAY-2.jar,/home/dvasthimal/spark1.4/lib/spark_reporting_dep_only-1.0-SNAPSHOT.jar \
> > > >   --num-executors 330 --driver-memory 14g \
> > > >   --driver-java-options "-XX:MaxPermSize=512M -Xmx4096M -Xms4096M -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
> > > >   --executor-memory 14g --executor-cores 1 --queue hdmi-others \
> > > >   --class com.ebay.ep.poc.spark.reporting.SparkApp \
> > > >   /home/dvasthimal/spark1.4/lib/spark_reporting-1.0-SNAPSHOT.jar \
> > > >   startDate=2015-06-20 endDate=2015-06-21 input=/apps/hdmi-prod/b_um/epdatasets/exptsession \
> > > >   subcommand=ppwmasterprime output=/user/dvasthimal/epdatasets/ppwmasterprime \
> > > >   buffersize=128 maxbuffersize=1068 maxResultSize=200G
> > > >
> > > > I see this in the stdout of the task on that executor:
> > > >
> > > > 15/07/13 19:58:48 WARN hdfs.BlockReaderLocal: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
> > > > 15/07/13 20:00:08 INFO collection.ExternalSorter: Thread 47 spilling in-memory map of 2.2 GB to disk (1 time so far)
> > > > 15/07/13 20:01:31 INFO collection.ExternalSorter: Thread 47 spilling in-memory map of 2.2 GB to disk (2 times so far)
> > > > 15/07/13 20:03:07 INFO collection.ExternalSorter: Thread 47 spilling in-memory map of 2.2 GB to disk (3 times so far)
> > > > 15/07/13 20:04:32 INFO collection.ExternalSorter: Thread 47 spilling in-memory map of 2.2 GB to disk (4 times so far)
> > > > 15/07/13 20:06:21 INFO collection.ExternalSorter: Thread 47 spilling in-memory map of 2.2 GB to disk (5 times so far)
> > > > 15/07/13 20:08:09 INFO collection.ExternalSorter: Thread 47 spilling in-memory map of 2.2 GB to disk (6 times so far)
> > > > 15/07/13 20:09:51 INFO collection.ExternalSorter: Thread 47 spilling in-memory map of 2.2 GB to disk (7 times so far)
> > > >
> > > > Also attached is the thread dump.
> > > >
> > > > --
> > > > Deepak

To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

--
Thanks & regards,
Nirmal

Associate Technical Lead - Data Technologies Team, WSO2 Inc.
Mobile: +94715779733
Blog: http://nirmalfdo.blogspot.com/
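The FetchFailedException quoted in this thread ("Adjusted frame length exceeds 2147483647") is Spark's 2 GB cap on a single shuffle block (SPARK-5928); the usual mitigation, which the thread is already attempting, is to repartition so each shuffle block stays well under 2 GB. A small helper sketching the arithmetic (the 256 MB target block size is an assumption on my part, not from the thread):

```java
public class ShufflePartitions {
    // Rough heuristic: pick enough partitions that each shuffle block
    // stays well under Spark's 2 GB frame limit (SPARK-5928).
    static int partitionsFor(long totalShuffleBytes, long targetBlockBytes) {
        // Ceiling division, with a floor of 1 partition.
        return (int) Math.max(1,
            (totalShuffleBytes + targetBlockBytes - 1) / targetBlockBytes);
    }

    public static void main(String[] args) {
        // ~3 GB was the observed oversize block in the thread's failure;
        // targeting ~256 MB blocks suggests this many partitions:
        int n = partitionsFor(3_021_252_889L, 256L * 1024 * 1024);
        System.out.println(n); // then call input.repartition(n) before the wide op
    }
}
```

The heuristic only sizes the shuffle; whether the repartition itself is slow (as reported above) is a separate question of executor memory and spilling.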
Re: [MLLib][Kmeans] KMeansModel.computeCost takes lot of time
I'm using:

org.apache.spark.mllib.clustering.KMeans.train(data.rdd(), 3, 20);

CPU cores: 8 (using the default Spark conf, though)

On partitions, I'm not sure how to find that.

On Mon, Jul 13, 2015 at 11:30 PM, Burak Yavuz <brk...@gmail.com> wrote:

> What are the other parameters? Are you just setting k=3? What about # of
> runs? How many partitions do you have? How many cores does your machine
> have?
>
> Thanks,
> Burak
>
> On Mon, Jul 13, 2015 at 10:57 AM, Nirmal Fernando <nir...@wso2.com> wrote:
>
> > Hi Burak,
> >
> > k = 3
> > dimension = 785 features
> > Spark 1.4
> >
> > On Mon, Jul 13, 2015 at 10:28 PM, Burak Yavuz <brk...@gmail.com> wrote:
> >
> > > Hi,
> > >
> > > How are you running K-Means? What is your k? What is the dimension of
> > > your dataset (columns)? Which Spark version are you using?
> > >
> > > Thanks,
> > > Burak
> > >
> > > On Mon, Jul 13, 2015 at 2:53 AM, Nirmal Fernando <nir...@wso2.com> wrote:
> > >
> > > > Hi,
> > > >
> > > > For a fairly large dataset, 30MB, KMeansModel.computeCost takes a
> > > > lot of time (16+ minutes). Most of the time is spent in this task:
> > > >
> > > > org.apache.spark.rdd.DoubleRDDFunctions.sum(DoubleRDDFunctions.scala:33)
> > > > org.apache.spark.mllib.clustering.KMeansModel.computeCost(KMeansModel.scala:70)
> > > >
> > > > Can this be improved?

--
Thanks & regards,
Nirmal

Associate Technical Lead - Data Technologies Team, WSO2 Inc.
Mobile: +94715779733
Blog: http://nirmalfdo.blogspot.com/
Re: [MLLib][Kmeans] KMeansModel.computeCost takes lot of time
Hi Burak,

k = 3
dimension = 785 features
Spark 1.4

On Mon, Jul 13, 2015 at 10:28 PM, Burak Yavuz <brk...@gmail.com> wrote:

> Hi,
>
> How are you running K-Means? What is your k? What is the dimension of your
> dataset (columns)? Which Spark version are you using?
>
> Thanks,
> Burak
>
> On Mon, Jul 13, 2015 at 2:53 AM, Nirmal Fernando <nir...@wso2.com> wrote:
>
> > Hi,
> >
> > For a fairly large dataset, 30MB, KMeansModel.computeCost takes a lot of
> > time (16+ minutes). Most of the time is spent in this task:
> >
> > org.apache.spark.rdd.DoubleRDDFunctions.sum(DoubleRDDFunctions.scala:33)
> > org.apache.spark.mllib.clustering.KMeansModel.computeCost(KMeansModel.scala:70)
> >
> > Can this be improved?

--
Thanks & regards,
Nirmal

Associate Technical Lead - Data Technologies Team, WSO2 Inc.
Mobile: +94715779733
Blog: http://nirmalfdo.blogspot.com/
Spark MLlib 1.4.0 - logistic regression with SGD model accuracy is different in local mode and cluster mode
Hi All,

I'm facing a quite strange case: after migrating to Spark 1.4.0, Spark MLlib produces different results when the same job runs in local mode and in cluster mode. Is there any possibility of that happening? (I suspect this is an issue in my environment, but just wanted to get it confirmed.)

Thanks.

--
Thanks & regards,
Nirmal
Associate Technical Lead - Data Technologies Team, WSO2 Inc.
Mobile: +94715779733
Blog: http://nirmalfdo.blogspot.com/
Run multiple Spark jobs concurrently
Hi All,

Are there any additional configs that we have to set to perform $subject?

--
Thanks & regards,
Nirmal
Associate Technical Lead - Data Technologies Team, WSO2 Inc.
Mobile: +94715779733
Blog: http://nirmalfdo.blogspot.com/
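For later readers: within a single application, jobs submitted from separate threads already run concurrently; by default they are scheduled FIFO, and the fair scheduler can be enabled so long jobs don't starve short ones. A minimal sketch, assuming one shared SparkContext (the app name and workload are made up):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch: two jobs submitted from different threads of one application.
// With spark.scheduler.mode=FAIR their tasks are interleaved instead of
// the default FIFO ordering. (For concurrency *across* applications,
// each app must also be able to obtain executors, e.g. by capping
// spark.cores.max so one app doesn't grab the whole cluster.)
val conf = new SparkConf()
  .setAppName("concurrent-jobs")            // hypothetical name
  .set("spark.scheduler.mode", "FAIR")
val sc = new SparkContext(conf)

val threads = (1 to 2).map { i =>
  new Thread(new Runnable {
    def run(): Unit = {
      // Each action (sum, count, ...) is an independent Spark job.
      println(s"job $i -> " + sc.parallelize(1L to 1000000L).sum())
    }
  })
}
threads.foreach(_.start())
threads.foreach(_.join())
```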
Re: path to hdfs
The HDFS path should be something like hdfs://127.0.0.1:8020/user/cloudera/inputs/ (note: no space after "hdfs://", and the scheme must be present).

On Mon, Jun 8, 2015 at 4:15 PM, Pa Rö paul.roewer1...@googlemail.com wrote:

> hello,
>
> i submit my spark job with the following parameters:
>
> ./spark-1.1.0-bin-hadoop2.4/bin/spark-submit \
>   --class mgm.tp.bigdata.ma_spark.SparkMain \
>   --master spark://quickstart.cloudera:7077 \
>   ma-spark.jar \
>   1000
>
> and get the following exception:
>
> java.io.IOException: Mkdirs failed to create file:/127.0.0.1:8020/user/cloudera/outputs/output_spark
>     at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:438)
>     at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:424)
>     at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:906)
>     at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:887)
>     at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:784)
>     at mgm.tp.bigdata.ma_spark.Helper.writeCenterHistory(Helper.java:35)
>     at mgm.tp.bigdata.ma_spark.SparkMain.main(SparkMain.java:100)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:606)
>     at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328)
>     at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
>     at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>
> java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: 127.0.0.1:8020
>     at org.apache.hadoop.fs.Path.initialize(Path.java:206)
>     at org.apache.hadoop.fs.Path.<init>(Path.java:172)
>     at org.apache.hadoop.fs.Path.<init>(Path.java:94)
>     at org.apache.hadoop.fs.Globber.glob(Globber.java:211)
>     at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1642)
>     at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:257)
>     at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)
>     at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:304)
>     at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:179)
>     at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
>     at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
>     at scala.Option.getOrElse(Option.scala:120)
>     at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
>     at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
>     at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
>     at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
>     at scala.Option.getOrElse(Option.scala:120)
>     at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
>     at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
>     at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
>     at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
>     at scala.Option.getOrElse(Option.scala:120)
>     at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
>     at org.apache.spark.rdd.FilteredRDD.getPartitions(FilteredRDD.scala:29)
>     at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
>     at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
>     at scala.Option.getOrElse(Option.scala:120)
>     at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
>     at org.apache.spark.SparkContext.runJob(SparkContext.scala:1135)
>     at org.apache.spark.rdd.RDD.count(RDD.scala:904)
>     at org.apache.spark.rdd.RDD.takeSample(RDD.scala:401)
>     at org.apache.spark.api.java.JavaRDDLike$class.takeSample(JavaRDDLike.scala:426)
>     at org.apache.spark.api.java.JavaRDD.takeSample(JavaRDD.scala:32)
>     at org.apache.spark.api.java.JavaRDDLike$class.takeSample(JavaRDDLike.scala:422)
>     at org.apache.spark.api.java.JavaRDD.takeSample(JavaRDD.scala:32)
>     at mgm.tp.bigdata.ma_spark.SparkMain.kmeans(SparkMain.java:123)
>     at mgm.tp.bigdata.ma_spark.SparkMain.main(SparkMain.java:102)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:606)
>     at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328)
>     at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
>     at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: java.net.URISyntaxException: Relative path in absolute URI: 127.0.0.1:8020
>     at java.net.URI.checkPath(URI.java:1804)
>     at java.net.URI.<init>(URI.java:752)
>     at org.apache.hadoop.fs.Path.initialize(Path.java:203)
>     ... 43
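For what it's worth, both errors point at the same root cause: the path was given without a usable `hdfs://` scheme, so Hadoop resolved `127.0.0.1:8020/...` against the default (local) filesystem, producing `file:/127.0.0.1:8020/...` and rejecting `127.0.0.1:8020` as a relative URI. A sketch of the corrected paths, assuming the NameNode really listens on 127.0.0.1:8020 as in the Cloudera quickstart VM and that `sc` is an existing SparkContext:

```scala
// Note: no whitespace inside the URI, and the hdfs:// scheme must be
// present; otherwise the path is resolved against fs.defaultFS, which
// here fell back to the local filesystem.
val input  = "hdfs://127.0.0.1:8020/user/cloudera/inputs/"
val output = "hdfs://127.0.0.1:8020/user/cloudera/outputs/output_spark"

val lines = sc.textFile(input)   // reads from HDFS, not file:/
lines.saveAsTextFile(output)     // output directory must not already exist
```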
Is there a way to disable the Spark UI?
Hi All,

Is there a way to disable the Spark UI? What I really need is to stop the startup of the Jetty server.

--
Thanks & regards,
Nirmal
Senior Software Engineer - Platform Technologies Team, WSO2 Inc.
Mobile: +94715779733
Blog: http://nirmalfdo.blogspot.com/
Re: Is there a way to disable the Spark UI?
Thanks Zhan! Was this introduced in Spark 1.2, or is it also available in Spark 1.1?

On Tue, Feb 3, 2015 at 11:52 AM, Zhan Zhang zzh...@hortonworks.com wrote:

> You can set spark.ui.enabled to false to disable the UI.
>
> Thanks,
> Zhan Zhang
>
> On Feb 2, 2015, at 8:06 PM, Nirmal Fernando nir...@wso2.com wrote:
>
>> Hi All,
>>
>> Is there a way to disable the Spark UI? What I really need is to stop the startup of the Jetty server.

--
Thanks & regards,
Nirmal
Senior Software Engineer - Platform Technologies Team, WSO2 Inc.
Mobile: +94715779733
Blog: http://nirmalfdo.blogspot.com/
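For later readers, the setting Zhan mentions is an ordinary Spark property, so it can be set either in code or in conf/spark-defaults.conf. A minimal sketch (the app name is made up):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Setting spark.ui.enabled to false prevents the web UI (and with it
// the embedded Jetty server) from being started for this application.
val conf = new SparkConf()
  .setAppName("no-ui-app")           // hypothetical name
  .set("spark.ui.enabled", "false")
val sc = new SparkContext(conf)
```

Equivalently, add `spark.ui.enabled false` to conf/spark-defaults.conf to apply it to every application submitted from that installation.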