[jira] [Commented] (SPARK-25452) Query with where clause is giving unexpected result in case of float column
[ https://issues.apache.org/jira/browse/SPARK-25452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16628433#comment-16628433 ] Meethu Mathew commented on SPARK-25452:

This is not a duplicate of -SPARK-24829.- !image-2018-09-26-14-14-47-504.png!

> Query with where clause is giving unexpected result in case of float column
>
> Key: SPARK-25452
> URL: https://issues.apache.org/jira/browse/SPARK-25452
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.3.1
> Environment: Spark 2.3.1, Hadoop 2.7.2
> Reporter: Ayush Anubhava
> Priority: Major
> Attachments: image-2018-09-26-14-14-47-504.png
>
> *Description*: A query with a where clause gives an unexpected result in the case of a float column.
>
> A greater-than-or-equal filter returns the expected rows:
> {code}
> 0: jdbc:hive2://10.18.18.214:23040/default> create table k2 (a int, b float);
> 0: jdbc:hive2://10.18.18.214:23040/default> insert into table k2 values (0,0.0);
> 0: jdbc:hive2://10.18.18.214:23040/default> insert into table k2 values (1,1.1);
> 0: jdbc:hive2://10.18.18.214:23040/default> select * from k2 where b >= 0.0;
> +----+----------------+
> | a  | b              |
> +----+----------------+
> | 0  | 0.0            |
> | 1  | 1.10023841858  |
> +----+----------------+
> {code}
> But a less-than-or-equal filter gives an inappropriate result:
> {code}
> 0: jdbc:hive2://10.18.18.214:23040/default> select * from k2 where b <= 1.1;
> +----+------+
> | a  | b    |
> +----+------+
> | 0  | 0.0  |
> +----+------+
> 1 row selected (0.299 seconds)
> {code}
[jira] [Updated] (SPARK-25452) Query with where clause is giving unexpected result in case of float column
[ https://issues.apache.org/jira/browse/SPARK-25452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Meethu Mathew updated SPARK-25452:

Attachment: image-2018-09-26-14-14-47-504.png
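The likely cause in reports like this is numeric widening: a FLOAT column compared against the literal 1.1 is promoted to DOUBLE, and the stored float 1.1 widens to roughly 1.1000000238, which is no longer <= 1.1. A hedged sketch of the usual workaround, keeping the comparison in float by casting the literal (table name as in the report above):

{code}
# run from PySpark against the k2 table created in the report above
sqlContext.sql("select * from k2 where b <= cast(1.1 as float)").show()
# expected: both rows (a=0 and a=1) are returned
{code}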
Filtering based on a float value with more than one decimal place not working correctly in Pyspark dataframe
Hi all, I tried the following code and the output was not as expected.

{code}
schema = StructType([StructField('Id', StringType(), False),
                     StructField('Value', FloatType(), False)])
df_test = spark.createDataFrame([('a',5.0),('b',1.236),('c',-0.31)], schema)
df_test
{code}

Output:

{code}
DataFrame[Id: string, Value: float]
{code}

The equality filter on the value 1.236 did not return the expected row [screenshot]. But when the value is given as a string, it worked [screenshot]. I tried again with a floating point number with one decimal place and it worked [screenshot]. And when the equals operation is changed to greater than or less than, it works with numbers having more than one decimal place [screenshot].

Is this a bug?

Regards, Meethu Mathew
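A plausible explanation, with a hedged sketch: the Python literal 1.236 is a double, so Spark widens the float column to double for the comparison, and the stored float is no longer exactly 1.236 after widening. Casting the literal down to float makes the equality match (frame and column names as defined above):

{code}
from pyspark.sql.functions import col, lit

# misses: the float column widened to double is not exactly the double 1.236
df_test.filter(col("Value") == 1.236).show()

# workaround sketch: compare float to float
df_test.filter(col("Value") == lit(1.236).cast("float")).show()
{code}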
[jira] [Created] (ZEPPELIN-3126) More than 2 notebooks in R failing with error sparkr interpreter not responding
Meethu Mathew created ZEPPELIN-3126:

Summary: More than 2 notebooks in R failing with error sparkr interpreter not responding
Key: ZEPPELIN-3126
URL: https://issues.apache.org/jira/browse/ZEPPELIN-3126
Project: Zeppelin
Issue Type: Bug
Components: r-interpreter
Affects Versions: 0.7.2
Environment: spark version 1.6.2
Reporter: Meethu Mathew
Priority: Critical

The Spark interpreter is in per-note scoped mode. Please find the steps below to reproduce the issue:

1. Create a notebook (Note1) and run any R code in a paragraph. I ran the following code.
{code}
%r
rdf <- data.frame(c(1,2,3,4))
colnames(rdf) <- c("myCol")
sdf <- createDataFrame(sqlContext, rdf)
withColumn(sdf, "newCol", sdf$myCol * 2.0)
{code}
2. Create another notebook (Note2) and run any R code in a paragraph. I ran the same code as above. Till now everything works fine.
3. Create a third notebook (Note3) and run any R code in a paragraph. I ran the same code. This notebook fails with the error
{code}
org.apache.zeppelin.interpreter.InterpreterException: sparkr is not responding
{code}
The problem is solved by restarting the sparkr interpreter, after which another 2 notebooks can be executed successfully. But again, for the third notebook run using the sparkr interpreter, the error is thrown. Once a notebook throws the error, all further notebooks throw the same error, and each time we run those failed notebooks a new R shell process is started. These processes are not killed even if we delete the failed notebook, i.e. it does not reuse the original R shell after a failure.
Re: More than 2 notebooks in R failing with error sparkr interpreter not responding
Hi Jeff, PFB the interpreter log.

{code}
INFO [2018-01-03 12:10:05,960] ({pool-2-thread-9} Logging.scala[logInfo]:58) - Starting HTTP Server
INFO [2018-01-03 12:10:05,961] ({pool-2-thread-9} Server.java[doStart]:272) - jetty-8.y.z-SNAPSHOT
INFO [2018-01-03 12:10:05,963] ({pool-2-thread-9} AbstractConnector.java[doStart]:338) - Started SocketConnector@0.0.0.0:58989
INFO [2018-01-03 12:10:05,963] ({pool-2-thread-9} Logging.scala[logInfo]:58) - Successfully started service 'HTTP class server' on port 58989.
INFO [2018-01-03 12:10:06,094] ({dispatcher-event-loop-1} Logging.scala[logInfo]:58) - Removed broadcast_1_piece0 on localhost:42453 in memory (size: 854.0 B, free: 511.1 MB)
INFO [2018-01-03 12:10:07,049] ({pool-2-thread-9} ZeppelinR.java[createRScript]:353) - File /tmp/zeppelin_sparkr-5046601627391341672.R created
ERROR [2018-01-03 12:10:17,051] ({pool-2-thread-9} Job.java[run]:188) - Job failed
org.apache.zeppelin.interpreter.InterpreterException: sparkr is not responding

R version 3.4.1 (2017-06-30) -- "Single Candle"
Copyright (C) 2017 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

> args <- commandArgs(trailingOnly = TRUE)
> hashCode <- as.integer(args[1])
> port <- as.integer(args[2])
> libPath <- args[3]
> version <- as.integer(args[4])
> rm(args)
> print(paste("Port ", toString(port)))
[1] "Port 58063"
> print(paste("LibPath ", libPath))
[1] "LibPath /home/meethu/spark-1.6.1-bin-hadoop2.6/R/lib"
> .libPaths(c(file.path(libPath), .libPaths()))
> library(SparkR)

Attaching package: 'SparkR'

The following objects are masked from 'package:stats':
    cov, filter, lag, na.omit, predict, sd, var
The following objects are masked from 'package:base':
    colnames, colnames<-, endsWith, intersect, rank, rbind, sample,
    startsWith, subset, summary, table, transform

> SparkR:::connectBackend("localhost", port, 6000)
A connection with
description "->localhost:58063"
class       "sockconn"
mode        "wb"
text        "binary"
opened      "opened"
can read    "yes"
can write   "yes"
> # scStartTime is needed by R/pkg/R/sparkR.R
> assign(".scStartTime", as.integer(Sys.time()), envir = SparkR:::.sparkREnv)
> # getZeppelinR
> .zeppelinR = SparkR:::callJStatic("org.apache.zeppelin.spark.ZeppelinR", "getZeppelinR", hashCode)

        at org.apache.zeppelin.spark.ZeppelinR.waitForRScriptInitialized(ZeppelinR.java:285)
        at org.apache.zeppelin.spark.ZeppelinR.request(ZeppelinR.java:227)
        at org.apache.zeppelin.spark.ZeppelinR.eval(ZeppelinR.java:176)
        at org.apache.zeppelin.spark.ZeppelinR.open(ZeppelinR.java:165)
        at org.apache.zeppelin.spark.SparkRInterpreter.open(SparkRInterpreter.java:90)
        at org.apache.zeppelin.interpreter.LazyOpenInterpreter.open(LazyOpenInterpreter.java:70)
        at org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer$InterpretJob.jobRun(RemoteInterpreterServer.java:491)
        at org.apache.zeppelin.scheduler.Job.run(Job.java:175)
        at org.apache.zeppelin.scheduler.FIFOScheduler$1.run(FIFOScheduler.java:139)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
INFO [2018-01-03 12:10:17,070] ({pool-2-thread-9} SchedulerFactory.java[jobFinished]:137) - Job remoteInterpretJob_1514961605951 finished by scheduler org.apache.zeppelin.spark.SparkRInterpreter392022746
INFO [2018-01-03 12:39:22,664] ({Spark Context Cleaner} Logging.scala[logInfo]:58) - Cleaned accumulator 2
{code}

PFB the output of the command ps -ef | grep /usr/lib/R/bin/exec/R:

{code}
meethu  6647  6470  0 12:09 pts/1  00:00:00 /usr/lib/R/bin/exec/R --no-save --no-restore -f /tmp/zeppelin_sparkr-1100854828050763213.R --args 214655664 58063 /home/meethu/spark-1.6.1-bin-hadoop2.6/R/lib 10601
meethu  6701  6470  0 12:09 pts/1  00:00:00 /usr/lib/R/bin/exec/R --no-save --no-restore -f /tmp/zeppelin_sparkr-4152305170353311178.R --args 1642312173 58063 /home/meethu/spark-1.6.1-bin-hadoop2.6/R/lib 10601
meethu  6745  6470  0 12:10 pts/1  00:00:00 /usr/lib/R/bin/exec/R --no-save --no-restore -f /tmp/zeppelin_sparkr-5046601627391341672.R --args 1158632477 58063 /home/meethu/spark-1.6.1-bin-hadoop2.6/R/lib 10601
{code}

Regards, Meethu Mathew

On Wed, Jan 3, 2018 at 12:56 PM, Jeff Zhang <zjf...@gmail.com> wrote:
> Could
More than 2 notebooks in R failing with error sparkr interpreter not responding
Hi, I have met with a strange issue when running R notebooks in Zeppelin (0.7.2). The Spark interpreter is in per-note scoped mode and the Spark version is 1.6.2. Please find the steps below to reproduce the issue:

1. Create a notebook (Note1) and run any R code in a paragraph. I ran the following code.
{code}
%r
rdf <- data.frame(c(1,2,3,4))
colnames(rdf) <- c("myCol")
sdf <- createDataFrame(sqlContext, rdf)
withColumn(sdf, "newCol", sdf$myCol * 2.0)
{code}
2. Create another notebook (Note2) and run any R code in a paragraph. I ran the same code as above. Till now everything works fine.
3. Create a third notebook (Note3) and run any R code in a paragraph. I ran the same code. This notebook fails with the error
{code}
org.apache.zeppelin.interpreter.InterpreterException: sparkr is not responding
{code}
What I understood from the analysis is that the process created for the sparkr interpreter is not getting killed properly, and this makes every third notebook throw an error while executing. The process is killed on restarting the sparkr interpreter, after which another 2 notebooks can be executed successfully. That is, for every third notebook run using the sparkr interpreter, the error is thrown. We suspect this is a limitation of Zeppelin. Please help to solve this issue.

Regards, Meethu Mathew
Re: Zeppelin framework is not getting unregistered from Mesos
Hi Moon, Yes, it's fixed in 0.7.1. Thank you.

Regards, Meethu Mathew

On Wed, Apr 26, 2017 at 10:42 PM, moon soo Lee <m...@apache.org> wrote:
> Some bugs related to interpreter process management have been fixed in
> the 0.7.1 release [1]. Could you try 0.7.1 or the master branch and see if the same
> problem occurs?
>
> Thanks,
> moon
>
> [1] https://issues.apache.org/jira/browse/ZEPPELIN-1832
>
> On Wed, Apr 26, 2017 at 1:13 AM Meethu Mathew <meethu.mat...@flytxt.com> wrote:
>
>> Hi,
>>
>> We have connected our Zeppelin to Mesos. But the issue we are facing is
>> that the Zeppelin framework is not getting unregistered from Mesos even if the
>> notebook is closed.
>>
>> Another problem: if the user logs out from Zeppelin, the SparkContext is
>> stopped. When the same user logs in again, another SparkContext is created,
>> and the previous SparkContext remains behind as a dead process.
>>
>> Is this a bug in Zeppelin, or is there any other proper way to unbind the
>> Zeppelin framework?
>>
>> Zeppelin version is 0.7.0
>>
>> Regards,
>> Meethu Mathew
Zeppelin framework is not getting unregistered from Mesos
Hi, We have connected our Zeppelin to Mesos. But the issue we are facing is that the Zeppelin framework is not getting unregistered from Mesos even if the notebook is closed. Another problem: if the user logs out from Zeppelin, the SparkContext is stopped. When the same user logs in again, another SparkContext is created, and the previous SparkContext remains behind as a dead process. Is this a bug in Zeppelin, or is there any other proper way to unbind the Zeppelin framework? Zeppelin version is 0.7.0.

Regards, Meethu Mathew
Re: UnicodeDecodeError in zeppelin 0.7.1
Hi, Thanks for the response.

@moon soo Lee: The interpreter setting is the same in 0.7.0 and 0.7.1.
@Felix Cheung: The Python version is the same. The code is as follows:

*PYSPARK*
{code}
def textPreProcessor(text):
    for w in text.split():
        regex = re.compile('[%s]' % re.escape(string.punctuation))
        no_punctuation = unicode(regex.sub(' ', w), 'utf8')
        tokens = word_tokenize(no_punctuation)
        lowercased = [t.lower() for t in tokens]
        no_stopwords = [w for w in lowercased if not w in stopwordsX]
        stemmed = [stemmerX.stem(w) for w in no_stopwords]
        return [w for w in stemmed if w]

docs = sc.textFile(hdfs_path+training_data, use_unicode=False).repartition(96)
docs.map(lambda features: sentimentObject.textPreProcessor(features.split(delimiter)[text_colum])).count()
{code}

*Error:*
- UnicodeDecodeError: 'utf8' codec can't decode byte 0x9b in position 17: invalid start byte
- The same error occurs when use_unicode=False is not used.
- The error changes to 'ascii' codec can't decode byte 0x97 in position 3: ordinal not in range(128) when no_punctuation = regex.sub(' ', w) is used instead of no_punctuation = unicode(regex.sub(' ', w), 'utf8').

Note: In version 0.7.0 the code was running fine without using use_unicode and unicode(regex.sub(' ', w), 'utf8').

*PYTHON*
{code}
def textPreProcessor(text_column):
    processed_text = []
    for text in text_column:
        for w in text.split():
            regex = re.compile('[%s]' % re.escape(string.punctuation))  # reg exprn for punctuation
            no_punctuation = unicode(regex.sub(' ', text_), 'utf8')
            tokens = word_tokenize(no_punctuation)
            lowercased = [t.lower() for t in tokens]
            no_stopwords = [w for w in lowercased if not w in stopwordsX]
            stemmed = [stemmerX.stem(w) for w in no_stopwords]
            processed_text.append([w for w in stemmed if w])
    return processed_text

new_training = pd.read_csv(training_data, header=None, delimiter=delimiter, error_bad_lines=False,
                           usecols=[label_column, text_column], names=['label','msg']).dropna()
new_training['processed_msg'] = textPreProcessor(new_training['msg'])
{code}

This Python code is working and I am getting results. In version 0.7.0, I was getting output without using the unicode function. Hope the problem is clear now.

Regards, Meethu Mathew

On Fri, Apr 21, 2017 at 3:07 AM, Felix Cheung <felixcheun...@hotmail.com> wrote:
> And are they running with the same Python version? What is the Python
> version?
>
> From: moon soo Lee <m...@apache.org>
> Sent: Thursday, April 20, 2017 11:53 AM
> Subject: Re: UnicodeDecodeError in zeppelin 0.7.1
> To: <users@zeppelin.apache.org>
>
> Hi,
>
> 0.7.1 didn't change any encoding type as far as I know.
> One difference is that the 0.7.1 official artifact has been built with JDK8 while
> 0.7.0 was built with JDK7 (we'll use JDK7 to build the upcoming 0.7.2 binary). But
> I'm not sure that can make pyspark and spark encoding types change.
>
> Do you have exactly the same interpreter setting in 0.7.1 and 0.7.0?
>
> Thanks,
> moon
>
> On Wed, Apr 19, 2017 at 5:30 AM Meethu Mathew <meethu.mat...@flytxt.com> wrote:
>
>> Hi,
>>
>> I just migrated from Zeppelin 0.7.0 to Zeppelin 0.7.1 and I am facing
>> this error while creating an RDD (in pyspark):
>>
>>   UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: invalid start byte
>>
>> I was able to create the RDD without any error after adding
>> use_unicode=False as follows:
>>
>>   sc.textFile("file.csv", use_unicode=False)
>>
>> But it fails when I try to stem the text. I am getting a similar error
>> when trying to apply stemming to the text using the python interpreter:
>>
>>   UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 4: ordinal not in range(128)
>>
>> All this code was working in the 0.7.0 version. There is no change in the
>> dataset and code. Is there any change in the encoding type in the new
>> version of Zeppelin?
>>
>> Regards,
>> Meethu Mathew
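For what it's worth, a common workaround in PySpark for dirty input, sketched under the assumption that the files are UTF-8 with occasional invalid bytes: read raw bytes with use_unicode=False, then decode leniently before tokenizing.

{code}
# hedged sketch (Python 2 / Spark 1.x): decode each raw line, replacing
# undecodable bytes instead of raising UnicodeDecodeError
raw = sc.textFile(hdfs_path + training_data, use_unicode=False)
docs = raw.map(lambda line: line.decode('utf-8', 'replace'))
docs.count()
{code}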
UnicodeDecodeError in zeppelin 0.7.1
Hi, I just migrated from Zeppelin 0.7.0 to Zeppelin 0.7.1 and I am facing this error while creating an RDD (in pyspark):

{code}
UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: invalid start byte
{code}

I was able to create the RDD without any error after adding use_unicode=False as follows:

{code}
sc.textFile("file.csv", use_unicode=False)
{code}

But it fails when I try to stem the text. I am getting a similar error when trying to apply stemming to the text using the python interpreter:

{code}
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 4: ordinal not in range(128)
{code}

All this code was working in the 0.7.0 version. There is no change in the dataset and code. Is there any change in the encoding type in the new version of Zeppelin?

Regards, Meethu Mathew
sqlContext not available as hiveContext in notebook
Hi, I am running Zeppelin 0.7.0. The sqlContext already created in the Zeppelin notebook returns a SQLContext, even though my Spark is built with Hive. "zeppelin.spark.useHiveContext" in the Spark properties is set to true. As mentioned in https://issues.apache.org/jira/browse/ZEPPELIN-1728, I tried hc = HiveContext.getOrCreate(sc), but it still returns a SQLContext. My pyspark shell and Jupyter notebook return a HiveContext without my doing anything. How can I get a HiveContext in the Zeppelin notebook?

Regards, Meethu Mathew
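For reference, a minimal diagnostic sketch, assuming Spark 1.x PySpark where HiveContext is the Hive-enabled entry point (sc is the SparkContext provided by the notebook):

{code}
from pyspark.sql import HiveContext

print(type(sqlContext))   # shows whether the injected context is SQLContext or HiveContext
hc = HiveContext(sc)      # constructs a HiveContext over the existing SparkContext
print(type(hc))
{code}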
Separate interpreter running scope Per user or Per Note documentation
Hi, I couldn't find the documentation for the feature "Separate interpreter running scope Per user or Per Note" at https://zeppelin.apache.org/docs/0.7.0/manual/interpreters.html#interpreter-binding-mode. Can somebody help me understand the per-note scoped mode and the per-user scoped mode?

Regards, Meethu Mathew
[jira] [Created] (ZEPPELIN-2313) Run-a-paragraph-synchronously response documented incorrectly
Meethu Mathew created ZEPPELIN-2313:

Summary: Run-a-paragraph-synchronously response documented incorrectly
Key: ZEPPELIN-2313
URL: https://issues.apache.org/jira/browse/ZEPPELIN-2313
Project: Zeppelin
Issue Type: Bug
Components: documentation
Affects Versions: 0.7.0
Reporter: Meethu Mathew

The documentation at https://zeppelin.apache.org/docs/0.7.0/rest-api/rest-notebook.html#run-a-paragraph-synchronously gives the sample error JSON as
{code}
{
  "status": "INTERNAL_SERVER_ERROR",
  "body": {
    "code": "ERROR",
    "type": "TEXT",
    "msg": "bash: -c: line 0: unexpected EOF while looking for matching ``'\nbash: -c: line 1: syntax error: unexpected end of file\nExitValue: 2"
  }
}
{code}
But the response actually comes back like
{code}
{
  "status": "OK",
  "body": {
    "code": "SUCCESS",
    "msg": [
      {
        "type": "TEXT",
        "data": "hello world"
      }
    ]
  }
}
{code}
[jira] [Created] (ZEPPELIN-2312) Allow to undo edits in a paragraph once it's executed and undo a deleted paragraph
Meethu Mathew created ZEPPELIN-2312:

Summary: Allow to undo edits in a paragraph once it's executed and undo a deleted paragraph
Key: ZEPPELIN-2312
URL: https://issues.apache.org/jira/browse/ZEPPELIN-2312
Project: Zeppelin
Issue Type: Improvement
Components: Core
Affects Versions: 0.7.0
Reporter: Meethu Mathew
Priority: Minor

It's not possible to undo edits in a paragraph once it has been executed, but it was possible in 0.6.0. There should also be an option to undo deleting a paragraph.
[jira] [Created] (ZEPPELIN-2305) overall experience on auto-completion needs to improve
Meethu Mathew created ZEPPELIN-2305:

Summary: overall experience on auto-completion needs to improve
Key: ZEPPELIN-2305
URL: https://issues.apache.org/jira/browse/ZEPPELIN-2305
Project: Zeppelin
Issue Type: Improvement
Components: Core
Affects Versions: 0.7.0
Reporter: Meethu Mathew

There is no auto-completion or suggestion for defined variable names, which is available in other frameworks. Also, Ctrl+. gives awkward suggestions for related functions; for example, the relevant functions for a Spark RDD or DataFrame are not available in the suggestion list. The overall experience of auto-completion is something that Zeppelin needs to improve.
Auto completion for defined variable names
Hi, Is there any way to get auto-completion or suggestions for defined variable names? In Jupyter notebooks, variables show up under suggestions once they are defined. Ctrl+. gives awkward suggestions for related functions; for a Spark data frame, it won't show the relevant functions. Please improve the suggestion functionality.

Regards, Meethu Mathew
--files in SPARK_SUBMIT_OPTIONS not working - ZEPPELIN-2136
Hi, According to the Zeppelin documentation, to pass a python package to the Zeppelin pyspark interpreter, you can export it through the --files option in SPARK_SUBMIT_OPTIONS in conf/zeppelin-env.sh. When I add a .egg file through the --files option in SPARK_SUBMIT_OPTIONS, the Zeppelin notebook does not throw an error, but I am not able to import the module inside the notebook. The Spark version is 1.6.2 and the zeppelin-env.sh (version 0.7.0) file looks like:

{code}
export SPARK_HOME=/home/me/spark-1.6.1-bin-hadoop2.6
export SPARK_SUBMIT_OPTIONS="--jars /home/me/spark-csv-1.5.0-s_2.10.jar,/home/me/commons-csv-1.4.jar --files /home/me/models/Churn/package/build/dist/fly_libs-1.1-py2.7.egg"
{code}

Any progress on ticket ZEPPELIN-2136 <https://issues.apache.org/jira/browse/ZEPPELIN-2136>?

Regards, Meethu Mathew
python prints "..." in the place of comments in output
Hi, The output of the following code prints unexpected dots in the result when there is a comment in the code. Is it a bug in Zeppelin?

*Code:*
{code}
%python
v = [1,2,3]
#comment 1
#comment
print v
{code}

*Output:*
{code}
...
...
[1, 2, 3]
{code}

Regards, Meethu Mathew
Re: "spark ui" button in spark interpreter does not show Spark web-ui
Hi, I have noticed the same problem.

Regards, Meethu Mathew

On Mon, Mar 13, 2017 at 9:56 AM, Xiaohui Liu <hero...@gmail.com> wrote:
> Hi,
>
> We used 0.7.1-snapshot with our Mesos cluster, and almost all the features we need
> (ldap login, notebook acl control, livy/pyspark/rspark/scala,
> etc.) work pretty well.
>
> But one thing that does not work for us is that the 'spark ui' button does not
> respond to user clicks. No errors on the browser side.
>
> Has anyone met similar issues? Any suggestions about where I should check?
>
> Regards
> Xiaohui
Adding images in the %md interpreter
Hi all, I am trying to display images in the %md interpreter of a Zeppelin (version 0.7.0) notebook using the following code.

{code}
![](model-files/sentiment_donut_viz.png)
{code}

But I am facing the following problems:
1. I am not able to give a local path.
2. I put the file inside {zeppelin_home}/webapps/webapp and it worked. But files or folders added in this folder, which is the ZEPPELIN_WAR_TEMPDIR, are deleted after a restart.

How can I add images in the markdown interpreter without using another web server?

Regards, Meethu Mathew
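One workaround that avoids a web server altogether, sketched here as an assumption rather than a documented recipe: read the image in a %python paragraph and emit it through Zeppelin's HTML display system as a base64 data URI (the file path below is a placeholder).

{code}
%python
import base64

# hypothetical local path to the image
with open('/home/me/models/sentiment_donut_viz.png', 'rb') as f:
    encoded = base64.b64encode(f.read())

# output beginning with %html is rendered by Zeppelin's HTML display system
print("%html <img src='data:image/png;base64," + encoded + "' width='400'/>")
{code}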
[jira] [Created] (ZEPPELIN-2141) sc.addPyFile("hdfs://path/to/file") in zeppelin causing UnknownHostException
Meethu Mathew created ZEPPELIN-2141:

Summary: sc.addPyFile("hdfs://path/to/file") in zeppelin causing UnknownHostException
Key: ZEPPELIN-2141
URL: https://issues.apache.org/jira/browse/ZEPPELIN-2141
Project: Zeppelin
Issue Type: Bug
Components: pySpark
Affects Versions: 0.6.0
Reporter: Meethu Mathew
Priority: Minor

In the documentation of sc.addPyFile() it is mentioned: "Add a .py or .zip dependency for all tasks to be executed on this SparkContext in the future. The path passed can be either a local file, a file in HDFS (or other Hadoop-supported filesystems), or an HTTP, HTTPS or FTP URI." But when I passed an HDFS path to the method in Zeppelin, it resulted in the following exception:

{code}
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, demo-node4.flytxt.com): java.lang.IllegalArgumentException: java.net.UnknownHostException: flycluster
{code}

The Spark version used is 1.6.2. The same command works fine in the pyspark shell, hence I think something is wrong with Zeppelin.
[jira] [Created] (ZEPPELIN-2136) --files in SPARK_SUBMIT_OPTIONS not working
Meethu Mathew created ZEPPELIN-2136:

Summary: --files in SPARK_SUBMIT_OPTIONS not working
Key: ZEPPELIN-2136
URL: https://issues.apache.org/jira/browse/ZEPPELIN-2136
Project: Zeppelin
Issue Type: Bug
Components: pySpark
Affects Versions: 0.6.0
Reporter: Meethu Mathew

According to the Zeppelin documentation, to pass a python package to the Zeppelin pyspark interpreter, you can export it through the --files option in SPARK_SUBMIT_OPTIONS in conf/zeppelin-env.sh. When I add a .egg file through the --files option in SPARK_SUBMIT_OPTIONS, the Zeppelin notebook does not throw an error, but I am not able to import the module inside the notebook. The Spark version is 1.6.2 and the zeppelin-env.sh file looks like:

{code}
export SPARK_HOME=/home/me/spark-1.6.1-bin-hadoop2.6
export SPARK_SUBMIT_OPTIONS="--jars /home/me/spark-csv-1.5.0-s_2.10.jar,/home/me/commons-csv-1.4.jar --files /home/me/models/Churn/package/build/dist/fly_libs-1.1-py2.7.egg"
{code}

My workaround for this problem was to add the .egg file using sc.addPyFile() inside the notebook, as sketched below.
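A minimal sketch of that workaround; the egg path comes from the report above, while the importable module name (fly_libs) is an assumption based on the file name:

{code}
# distribute the egg to executors and put it on the driver's sys.path
sc.addPyFile('/home/me/models/Churn/package/build/dist/fly_libs-1.1-py2.7.egg')

import fly_libs  # hypothetical top-level module inside the egg
{code}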
Re: Failed to run spark jobs on mesos due to "hadoop" not found.
Hi, Add HADOOP_HOME=/path/to/hadoop/folder in /etc/default/mesos-slave on all Mesos agents and restart Mesos.

Regards, Meethu Mathew

On Thu, Nov 10, 2016 at 4:57 PM, Yu Wei <yu20...@hotmail.com> wrote:
> Hi Guys,
>
> I failed to launch spark jobs on mesos. Actually I submitted the job to
> the cluster successfully, but the job failed to run.
>
> I1110 18:25:11.095507 301 fetcher.cpp:498] Fetcher Info: {"cache_directory":"\/tmp\/mesos\/fetch\/slaves\/1f8e621b-3cbf-4b86-a1c1-9e2cf77265ee-S7\/root","items":[{"action":"BYPASS_CACHE","uri":{"extract":true,"value":"hdfs:\/\/192.168.111.74:9090\/bigdata\/package\/spark-examples_2.11-2.0.1.jar"}}],"sandbox_directory":"\/var\/lib\/mesos\/agent\/slaves\/1f8e621b-3cbf-4b86-a1c1-9e2cf77265ee-S7\/frameworks\/1f8e621b-3cbf-4b86-a1c1-9e2cf77265ee-0002\/executors\/driver-20161110182510-0001\/runs\/b561328e-9110-4583-b740-98f9653e7fc2","user":"root"}
> I1110 18:25:11.099799 301 fetcher.cpp:409] Fetching URI 'hdfs://192.168.111.74:9090/bigdata/package/spark-examples_2.11-2.0.1.jar'
> I1110 18:25:11.099820 301 fetcher.cpp:250] Fetching directly into the sandbox directory
> I1110 18:25:11.099862 301 fetcher.cpp:187] Fetching URI 'hdfs://192.168.111.74:9090/bigdata/package/spark-examples_2.11-2.0.1.jar'
> E1110 18:25:11.101842 301 shell.hpp:106] Command 'hadoop version 2>&1' failed; this is the output:
> sh: hadoop: command not found
> Failed to fetch 'hdfs://192.168.111.74:9090/bigdata/package/spark-examples_2.11-2.0.1.jar': Failed to create HDFS client: Failed to execute 'hadoop version 2>&1'; the command was either not found or exited with a non-zero exit status: 127
> Failed to synchronize with agent (it's probably exited
>
> Actually I installed hadoop on each agent node.
>
> Any advice?
>
> Thanks,
> Jared, (韦煜)
> Software developer
> Interested in open source software, big data, Linux
[jira] [Created] (ZEPPELIN-1562) Wrong documentation in 'Run a paragraph synchronously' REST API
Meethu Mathew created ZEPPELIN-1562:

Summary: Wrong documentation in 'Run a paragraph synchronously' REST API
Key: ZEPPELIN-1562
URL: https://issues.apache.org/jira/browse/ZEPPELIN-1562
Project: Zeppelin
Issue Type: Bug
Components: documentation
Affects Versions: 0.7.0
Reporter: Meethu Mathew
Fix For: 0.7.0

The URL for running a paragraph synchronously using the REST API is given as "http://[zeppelin-server]:[zeppelin-port]/api/notebook/job/[notebookId]/[paragraphId]" in the documentation (https://zeppelin.apache.org/docs/0.7.0-SNAPSHOT/rest-api/rest-notebook.html#run-a-paragraph-synchronously). But when I searched the GitHub code, the URL is given as "run/notebookId/paragraphId".
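For illustration, a hedged sketch of calling the endpoint as it appears in the code; the host, port, note ID and paragraph ID are placeholders:

{code}
import urllib.request

# hypothetical note and paragraph IDs; substitute your own
url = "http://localhost:8080/api/notebook/run/2A94M5J1Z/20161017-123456_1234567890"
req = urllib.request.Request(url, data=b"", method="POST")  # the synchronous run is a POST
with urllib.request.urlopen(req) as resp:
    print(resp.read().decode("utf-8"))  # JSON response, cf. ZEPPELIN-2313 above
{code}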
[jira] [Commented] (SPARK-12755) Spark may attempt to rebuild application UI before finishing writing the event logs in possible race condition
[ https://issues.apache.org/jira/browse/SPARK-12755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15282452#comment-15282452 ] Meethu Mathew commented on SPARK-12755:

Hi, I am facing similar issues again in 1.6.1 standalone.
1. My completed applications are listed in the incomplete applications list. My application was completed using sc.stop(), and the log directory contains app folders without the .inprogress suffix. There are no permission issues on the log directory.
2. From the incomplete list, I can view the UI of only those apps which have a .inprogress suffix in the folder name in the log directory. For other apps it shows the error "Application app-2015x not found".
Please help me.

> Spark may attempt to rebuild application UI before finishing writing the event logs in possible race condition
>
> Key: SPARK-12755
> URL: https://issues.apache.org/jira/browse/SPARK-12755
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 1.5.2
> Reporter: Michael Allman
> Assignee: Michael Allman
> Priority: Minor
> Fix For: 1.5.3, 1.6.1, 2.0.0
>
> As reported in SPARK-6950, it appears that sometimes the standalone master attempts to build an application's historical UI before closing the app's event log. This is still an issue for us in 1.5.2+, and I believe I've found the underlying cause.
> When stopping a {{SparkContext}}, the {{stop}} method stops the DAG scheduler:
> https://github.com/apache/spark/blob/a76cf51ed91d99c88f301ec85f3cda1288bcf346/core/src/main/scala/org/apache/spark/SparkContext.scala#L1722-L1727
> and then stops the event logger:
> https://github.com/apache/spark/blob/a76cf51ed91d99c88f301ec85f3cda1288bcf346/core/src/main/scala/org/apache/spark/SparkContext.scala#L1722-L1727
> Though it is difficult to follow the chain of events, one of the sequelae of stopping the DAG scheduler is that the master's {{rebuildSparkUI}} method is called. This method looks for the application's event logs, and its behavior varies based on the existence of an {{.inprogress}} file suffix. In particular, a warning is logged if this suffix exists:
> https://github.com/apache/spark/blob/a76cf51ed91d99c88f301ec85f3cda1288bcf346/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L935
> After calling the {{stop}} method on the DAG scheduler, the {{SparkContext}} stops the event logger:
> https://github.com/apache/spark/blob/a76cf51ed91d99c88f301ec85f3cda1288bcf346/core/src/main/scala/org/apache/spark/SparkContext.scala#L1734-L1736
> This renames the event log, dropping the {{.inprogress}} file sequence.
> As such, a race condition exists where the master may attempt to process the application log file before finalizing it.
[jira] [Commented] (SPARK-11227) Spark1.5+ HDFS HA mode throw java.net.UnknownHostException: nameservice1
[ https://issues.apache.org/jira/browse/SPARK-11227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15266237#comment-15266237 ] Meethu Mathew commented on SPARK-11227:

I am also facing the same issue when HA is set up in Cloudera HDFS. I am using Spark 1.6.1 and an IPython notebook. When HA is disabled, everything is fine.

> Spark1.5+ HDFS HA mode throw java.net.UnknownHostException: nameservice1
>
> Key: SPARK-11227
> URL: https://issues.apache.org/jira/browse/SPARK-11227
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 1.5.0, 1.5.1
> Environment: OS: CentOS 6.6, Memory: 28G, CPU: 8, Mesos: 0.22.0, HDFS: Hadoop 2.6.0-CDH5.4.0 (built by Cloudera Manager)
> Reporter: Yuri Saito
>
> When running a jar including a Spark job on an HDFS HA cluster with Mesos and Spark 1.5.1, the job throws the exception "java.net.UnknownHostException: nameservice1" and fails.
> I do the below in a terminal:
> {code}
> /opt/spark/bin/spark-submit \
>   --class com.example.Job /jobs/job-assembly-1.0.0.jar
> {code}
> So, the job throws the below message:
> {code}
> 15/10/21 15:22:12 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, spark003.example.com): java.lang.IllegalArgumentException: java.net.UnknownHostException: nameservice1
>     at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:374)
>     at org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:312)
>     at org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:178)
>     at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:665)
>     at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:601)
>     at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:148)
>     at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2596)
>     at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
>     at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)
>     at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612)
>     at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
>     at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:169)
>     at org.apache.hadoop.mapred.JobConf.getWorkingDirectory(JobConf.java:656)
>     at org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:436)
>     at org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:409)
>     at org.apache.spark.SparkContext$$anonfun$hadoopFile$1$$anonfun$32.apply(SparkContext.scala:1016)
>     at org.apache.spark.SparkContext$$anonfun$hadoopFile$1$$anonfun$32.apply(SparkContext.scala:1016)
>     at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:176)
>     at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:176)
>     at scala.Option.map(Option.scala:145)
>     at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:176)
>     at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:220)
>     at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:216)
>     at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>     at org.apache.spark.scheduler.Task.run(Task.scala:88)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurren
[jira] [Commented] (SPARK-8402) Add DP means clustering to MLlib
[ https://issues.apache.org/jira/browse/SPARK-8402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15121332#comment-15121332 ] Meethu Mathew commented on SPARK-8402:

[~mengxr] [~josephkb] This ticket has been idle for a long time. Could you please comment on what we can do next?

> Add DP means clustering to MLlib
>
> Key: SPARK-8402
> URL: https://issues.apache.org/jira/browse/SPARK-8402
> Project: Spark
> Issue Type: New Feature
> Components: MLlib
> Reporter: Meethu Mathew
> Assignee: Meethu Mathew
> Labels: features
>
> At present, all the clustering algorithms in MLlib require the number of clusters to be specified in advance.
> The Dirichlet process (DP) is a popular non-parametric Bayesian mixture model that allows for flexible clustering of data without having to specify a priori the number of clusters.
> DP means is a non-parametric clustering algorithm that uses a scale parameter 'lambda' to control the creation of new clusters ["Revisiting k-means: New Algorithms via Bayesian Nonparametrics" by Brian Kulis, Michael I. Jordan].
> We have followed the distributed implementation of DP means which has been proposed in the paper titled "MLbase: Distributed Machine Learning Made Easy" by Xinghao Pan, Evan R. Sparks, Andre Wibisono.
> A benchmark comparison between k-means and DP-means, based on Normalized Mutual Information (NMI) between ground-truth clusters and algorithm outputs, is provided in the following table. It can be seen from the table that DP-means reported a higher NMI on 5 of 8 data sets in comparison to k-means [Source: Kulis, B., Jordan, M.I.: Revisiting k-means: New algorithms via Bayesian nonparametrics (2011) Arxiv:.0352. (Table 1)]
>
> | Dataset       | DP-means | k-means |
> | Wine          | .41      | .43     |
> | Iris          | .75      | .76     |
> | Pima          | .02      | .03     |
> | Soybean       | .72      | .66     |
> | Car           | .07      | .05     |
> | Balance Scale | .17      | .11     |
> | Breast Cancer | .04      | .03     |
> | Vehicle       | .18      | .18     |
>
> Experiment on our Spark cluster setup:
> An initial benchmark study was performed on a 3-node Spark cluster set up on Mesos, where each node's config was 8 cores and 64 GB RAM, and the Spark version used was 1.5 (git branch).
> Tests were done using a mixture of 10 Gaussians with varying numbers of features and instances. The results from the benchmark study are provided below. The reported stats are averages over 5 runs.
>
> | Instances   | Dimensions | No. of clusters obtained | DP-means time | DP-means iterations | k-means (k=10) time | k-means iterations |
> | 10 million  | 10   | 10 | 43.6s | 2 | 52.2s  | 2 |
> | 1 million   | 100  | 10 | 39.8s | 2 | 43.39s | 2 |
> | 0.1 million | 1000 | 10 | 37.3s | 2 | 41.64s | 2 |
[jira] [Updated] (SPARK-8402) Add DP means clustering to MLlib
[ https://issues.apache.org/jira/browse/SPARK-8402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Meethu Mathew updated SPARK-8402:

Summary: Add DP means clustering to MLlib (was: DP means clustering)
[jira] [Commented] (SPARK-6612) Python KMeans parity
[ https://issues.apache.org/jira/browse/SPARK-6612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15050263#comment-15050263 ] Meethu Mathew commented on SPARK-6612:

[~mengxr] This issue is resolved, but it seems the "Apache Spark" bot made a wrong comment here. Could you please check it out?

> Python KMeans parity
>
> Key: SPARK-6612
> URL: https://issues.apache.org/jira/browse/SPARK-6612
> Project: Spark
> Issue Type: Improvement
> Components: MLlib, PySpark
> Affects Versions: 1.3.0
> Reporter: Joseph K. Bradley
> Assignee: Hrishikesh
> Priority: Minor
> Fix For: 1.4.0
>
> This is a subtask of [SPARK-6258] for the Python API of KMeans. These items are missing:
> KMeans
> * setEpsilon
> * setInitializationSteps
> KMeansModel
> * computeCost
> * k
[jira] [Commented] (SPARK-2572) Can't delete local dir on executor automatically when running spark over Mesos.
[ https://issues.apache.org/jira/browse/SPARK-2572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15028657#comment-15028657 ] Meethu Mathew commented on SPARK-2572:

[~srowen] We are facing this issue with Mesos fine-grained mode in Spark 1.4.1. The /tmp/spark-* and some blockmgr-* files exist even after calling sc.stop(). Is there any other way to solve this issue?

> Can't delete local dir on executor automatically when running spark over Mesos.
>
> Key: SPARK-2572
> URL: https://issues.apache.org/jira/browse/SPARK-2572
> Project: Spark
> Issue Type: Bug
> Components: Mesos
> Affects Versions: 1.0.0
> Reporter: Yadong Qi
> Priority: Minor
>
> When running Spark over Mesos in "fine-grained" or "coarse-grained" mode, after the application finishes, the local dir (/tmp/spark-local-20140718114058-834c) on the executor is not deleted automatically.
How is the predict() working in LogisticRegressionModel?
Hi all, Can somebody point me to the implementation of predict() in LogisticRegressionModel of Spark MLlib? I could find a predictPoint() in the class LogisticRegressionModel, but where is predict()?

Thanks & Regards, Meethu M
Re: Please reply if you use Mesos fine grained mode
Hi, We are using Mesos fine-grained mode because we can have multiple instances of Spark sharing machines, with each application getting resources dynamically allocated.

Thanks & Regards, Meethu M

On Wednesday, 4 November 2015 5:24 AM, Reynold Xin wrote:
If you are using Spark with Mesos fine grained mode, can you please respond to this email explaining why you use it over the coarse grained mode? Thanks.
[jira] [Commented] (SPARK-6724) Model import/export for FPGrowth
[ https://issues.apache.org/jira/browse/SPARK-6724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14958374#comment-14958374 ] Meethu Mathew commented on SPARK-6724:

I am not able to take this PR forward. Can somebody take this?

> Model import/export for FPGrowth
>
> Key: SPARK-6724
> URL: https://issues.apache.org/jira/browse/SPARK-6724
> Project: Spark
> Issue Type: Sub-task
> Components: MLlib
> Affects Versions: 1.3.0
> Reporter: Joseph K. Bradley
> Priority: Minor
>
> Note: experimental model API
Spark 1.6 Release window is not updated in Spark-wiki
Hi, On https://cwiki.apache.org/confluence/display/SPARK/Wiki+Homepage the current release window has not been changed from 1.5. Can anybody give an idea of the expected dates for the 1.6 version?

Regards, Meethu Mathew
Senior Engineer
Flytxt
Re: Best way to merge final output part files created by Spark job
Try coalesce(1) before writing.

Thanks & Regards, Meethu M

On Tuesday, 15 September 2015 6:49 AM, java8964 wrote:

For a text file this merge works fine, but for binary formats like ORC, Parquet or Avro, I am not sure this will work. These formats are in fact not appendable, as they write the detailed data information either in the head or the tail part of the file. You have to use the format-specific API to merge the data.

Yong

Date: Mon, 14 Sep 2015 09:10:33 +0200
Subject: Re: Best way to merge final output part files created by Spark job
From: gmu...@stratio.com
To: umesh.ka...@gmail.com
CC: user@spark.apache.org

Hi, check out the FileUtil.copyMerge function in the Hadoop API. It's simple:
- Get the hadoop configuration from the Spark context: FileSystem fs = FileSystem.get(sparkContext.hadoopConfiguration());
- Create a new Path with the destination and source directory.
- Call copyMerge: FileUtil.copyMerge(fs, inputPath, fs, destPath, true, sparkContext.hadoopConfiguration(), null);

2015-09-13 23:25 GMT+02:00 unk1102:

Hi, I have a spark job which creates around 500 part files inside each directory I process, and I have thousands of such directories. I need to merge these small 500 part files. I am using spark.sql.shuffle.partitions as 500 and my final small files are ORC files. Is there a way to merge ORC files in Spark? If not, please suggest the best way to merge files created by a Spark job in HDFS. Thanks much.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Best-way-to-merge-final-output-part-files-created-by-Spark-job-tp24681.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

--
Gaspar Muñoz
@gmunozsoria
Stratio // @stratiobd
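Since the thread is specifically about ORC output, a hedged DataFrame-level sketch of the coalesce(1) suggestion (the path is a placeholder):

{code}
# write a single ORC part file by collapsing to one partition before the write;
# note this funnels the whole output through a single task
df.coalesce(1).write.format("orc").save("/path/to/output")
{code}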
[jira] [Commented] (SPARK-6724) Model import/export for FPGrowth
[ https://issues.apache.org/jira/browse/SPARK-6724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14740160#comment-14740160 ] Meethu Mathew commented on SPARK-6724:

[~josephkb] I will take a look into it and update the PR accordingly. Thank you.

> Model import/export for FPGrowth
>
> Key: SPARK-6724
> URL: https://issues.apache.org/jira/browse/SPARK-6724
> Project: Spark
> Issue Type: Sub-task
> Components: MLlib
> Affects Versions: 1.3.0
> Reporter: Joseph K. Bradley
> Priority: Minor
>
> Note: experimental model API
[jira] [Updated] (SPARK-8402) DP means clustering
[ https://issues.apache.org/jira/browse/SPARK-8402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Meethu Mathew updated SPARK-8402:

Description: (the full description quoted in the comment above, adding the NMI comparison and Spark cluster benchmark tables)

was:
At present, all the clustering algorithms in MLlib require the number of clusters to be specified in advance.
The Dirichlet process (DP) is a popular non-parametric Bayesian mixture model that allows for flexible clustering of data without having to specify a priori the number of clusters.
DP means is a non-parametric clustering algorithm that uses a scale parameter 'lambda' to control the creation of new clusters ["Revisiting k-means: New Algorithms via Bayesian Nonparametrics" by Brian Kulis, Michael I. Jordan].
We have followed the distributed implementation of DP means which has been proposed in the paper titled "MLbase: Distributed Machine Learning Made Easy" by Xinghao Pan, Evan R. Sparks, Andre Wibisono.
[jira] [Commented] (SPARK-6724) Model import/export for FPGrowth
[ https://issues.apache.org/jira/browse/SPARK-6724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14722999#comment-14722999 ] Meethu Mathew commented on SPARK-6724:

[~josephkb] Could you please give your opinion on this?

> Model import/export for FPGrowth
>
> Key: SPARK-6724
> URL: https://issues.apache.org/jira/browse/SPARK-6724
> Project: Spark
> Issue Type: Sub-task
> Components: MLlib
> Affects Versions: 1.3.0
> Reporter: Joseph K. Bradley
> Priority: Minor
>
> Note: experimental model API
Re: make-distribution.sh failing at spark/R/lib/sparkr.zip
Hi, It worked after removing that line. Thank you for the response and the fix.

Thanks & Regards, Meethu M

On Thursday, 13 August 2015 4:12 AM, Burak Yavuz brk...@gmail.com wrote:

For the record:
https://github.com/apache/spark/pull/8147
https://issues.apache.org/jira/browse/SPARK-9916

On Wed, Aug 12, 2015 at 3:08 PM, Burak Yavuz brk...@gmail.com wrote:

Are you running from master? Could you delete line 222 of make-distribution.sh? We updated how we build sparkr.zip. I'll submit a fix for it for 1.5 and master.

Burak

On Wed, Aug 12, 2015 at 3:31 AM, MEETHU MATHEW meethu2...@yahoo.co.in wrote:

Hi, I am trying to create a package using the make-distribution.sh script from the github master branch, but it is not completing successfully. The last statement printed is:

+ cp /home/meethu/git/FlytxtRnD/spark/R/lib/sparkr.zip /home/meethu/git/FlytxtRnD/spark/dist/R/lib
cp: cannot stat `/home/meethu/git/FlytxtRnD/spark/R/lib/sparkr.zip': No such file or directory

My build is successful and I am trying to execute the following command:

./make-distribution.sh --tgz -Pyarn -Dyarn.version=2.6.0 -Phadoop-2.4 -Dhadoop.version=2.6.0 -Phive

Please help.

Thanks & Regards, Meethu M
Re: Combining Spark Files with saveAsTextFile
Hi, Try using coalesce(1) before calling saveAsTextFile(). Thanks & Regards, Meethu M On Wednesday, 5 August 2015 7:53 AM, Brandon White bwwintheho...@gmail.com wrote: What is the best way to make saveAsTextFile save as only a single file?
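A minimal PySpark sketch of that suggestion (an existing SparkContext `sc` is assumed, and the output path is illustrative):
{code}
# saveAsTextFile() writes one part file per partition, so collapsing
# the RDD to a single partition yields a single output file.
rdd = sc.parallelize(range(1000), 8)   # 8 partitions -> 8 part files
rdd.coalesce(1).saveAsTextFile("/tmp/single-file-output")
{code}
The trade-off: coalesce(1) funnels the entire write through one task, sacrificing write parallelism for a single file.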
RE:Building scaladoc using build/sbt unidoc failure
Hi, I am getting the assertion error while trying to run build/sbt unidoc, the same as you described in the thread "Building scaladoc using build/sbt unidoc failure" ("Hello, I am trying to build scala doc from the 1.4 branch", archived on mail-archives.apache.org). Could you tell me how you got it working? Thanks & Regards, Meethu M
[jira] [Commented] (SPARK-8402) DP means clustering
[ https://issues.apache.org/jira/browse/SPARK-8402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14591197#comment-14591197 ] Meethu Mathew commented on SPARK-8402: -- Could you please assign the ticket to me? DP means clustering Key: SPARK-8402 URL: https://issues.apache.org/jira/browse/SPARK-8402 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Meethu Mathew Labels: features At present, all the clustering algorithms in MLlib require the number of clusters to be specified in advance. The Dirichlet process (DP) is a popular non-parametric Bayesian mixture model that allows for flexible clustering of data without having to specify apriori the number of clusters. DP means is a non-parametric clustering algorithm that uses a scale parameter 'lambda' to control the creation of new clusters[Revisiting k-means: New Algorithms via Bayesian Nonparametrics by Brian Kulis, Michael I. Jordan]. We have followed the distributed implementation of DP means which has been proposed in the paper titled MLbase: Distributed Machine Learning Made Easy by Xinghao Pan, Evan R. Sparks, Andre Wibisono. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[MLlib] Contributing algorithm for DP means clustering
Hi all, At present, all the clustering algorithms in MLlib require the number of clusters to be specified in advance. The Dirichlet process (DP) is a popular non-parametric Bayesian mixture model that allows for flexible clustering of data without having to specify the number of clusters a priori. DP means is a non-parametric clustering algorithm that uses a scale parameter 'lambda' to control the creation of new clusters. We have followed the distributed implementation of DP means which has been proposed in the paper titled "MLbase: Distributed Machine Learning Made Easy" by Xinghao Pan, Evan R. Sparks, Andre Wibisono. I have raised a JIRA ticket at https://issues.apache.org/jira/browse/SPARK-8402 Suggestions and guidance are welcome. Regards, Meethu Mathew Senior Engineer Flytxt www.flytxt.com | Visit our blog http://blog.flytxt.com/ | Follow us http://www.twitter.com/flytxt | Connect on LinkedIn http://www.linkedin.com/company/22166?goback=%2Efcs_GLHD_flytxt_false_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2trk=ncsrch_hits
[jira] [Created] (SPARK-8402) DP means clustering
Meethu Mathew created SPARK-8402: Summary: DP means clustering Key: SPARK-8402 URL: https://issues.apache.org/jira/browse/SPARK-8402 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Meethu Mathew At present, all the clustering algorithms in MLlib require the number of clusters to be specified in advance. The Dirichlet process (DP) is a popular non-parametric Bayesian mixture model that allows for flexible clustering of data without having to specify apriori the number of clusters. DP means is a non-parametric clustering algorithm that uses a scale parameter 'lambda' to control the creation of new clusters[Revisiting k-means: New Algorithms via Bayesian Nonparametrics by Brian Kulis, Michael I. Jordan]. We have followed the distributed implementation of DP means which has been proposed in the paper titled MLbase: Distributed Machine Learning Made Easy by Xinghao Pan, Evan R. Sparks, Andre Wibisono. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8402) DP means clustering
[ https://issues.apache.org/jira/browse/SPARK-8402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14589392#comment-14589392 ] Meethu Mathew commented on SPARK-8402: -- Could anyone please assign this ticket to me ? DP means clustering Key: SPARK-8402 URL: https://issues.apache.org/jira/browse/SPARK-8402 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Meethu Mathew Labels: features At present, all the clustering algorithms in MLlib require the number of clusters to be specified in advance. The Dirichlet process (DP) is a popular non-parametric Bayesian mixture model that allows for flexible clustering of data without having to specify apriori the number of clusters. DP means is a non-parametric clustering algorithm that uses a scale parameter 'lambda' to control the creation of new clusters[Revisiting k-means: New Algorithms via Bayesian Nonparametrics by Brian Kulis, Michael I. Jordan]. We have followed the distributed implementation of DP means which has been proposed in the paper titled MLbase: Distributed Machine Learning Made Easy by Xinghao Pan, Evan R. Sparks, Andre Wibisono. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8018) KMeans should accept initial cluster centers as param
[ https://issues.apache.org/jira/browse/SPARK-8018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14578751#comment-14578751 ] Meethu Mathew commented on SPARK-8018: -- Should I add a new test for this in the test suite, or can I add it along with another test (like model save/load)? KMeans should accept initial cluster centers as param - Key: SPARK-8018 URL: https://issues.apache.org/jira/browse/SPARK-8018 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Joseph K. Bradley Assignee: Meethu Mathew KMeans should allow model initialization using an existing set of cluster centers. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
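To make the proposal concrete, a warm-start call in PySpark might look like the sketch below; `initialModel` is the parameter under discussion and did not exist in the API at the time, so treat both the name and the signature as hypothetical (an existing SparkContext `sc` is assumed).
{code}
from pyspark.mllib.clustering import KMeans

data = sc.parallelize([[0.0, 0.0], [1.0, 1.0], [9.0, 8.0], [8.0, 9.0]])
first = KMeans.train(data, 2, maxIterations=5)
# Hypothetical: resume training from the centers of an existing model
refined = KMeans.train(data, 2, maxIterations=5, initialModel=first)
{code}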
Re: Anyone facing problem in incremental building of individual project
Hi, I added createDependencyReducedPom in my pom.xml and the problem is solved. <!-- Work around MSHADE-148 --> + <createDependencyReducedPom>false</createDependencyReducedPom> Thank you @Steve and @Ted Regards, Meethu Mathew Senior Engineer Flytxt On Thu, Jun 4, 2015 at 9:51 PM, Ted Yu yuzhih...@gmail.com wrote: Andrew Or put in this workaround : diff --git a/pom.xml b/pom.xml index 0b1aaad..d03d33b 100644 --- a/pom.xml +++ b/pom.xml @@ -1438,6 +1438,8 @@ <version>2.3</version> <configuration> <shadedArtifactAttached>false</shadedArtifactAttached> + <!-- Work around MSHADE-148 --> + <createDependencyReducedPom>false</createDependencyReducedPom> <artifactSet> <includes> <!-- At a minimum we must include this to force effective pom generation --> FYI On Thu, Jun 4, 2015 at 6:25 AM, Steve Loughran ste...@hortonworks.com wrote: On 4 Jun 2015, at 11:16, Meethu Mathew meethu.mat...@flytxt.com wrote: Hi all, I added some new code to MLlib. When I am trying to build only the mllib project using mvn --projects mllib/ -DskipTests clean install after setting export SPARK_PREPEND_CLASSES=true, the build is getting stuck with the following message. Excluding org.jpmml:pmml-schema:jar:1.1.15 from the shaded jar. [INFO] Excluding com.sun.xml.bind:jaxb-impl:jar:2.2.7 from the shaded jar. [INFO] Excluding com.sun.xml.bind:jaxb-core:jar:2.2.7 from the shaded jar. [INFO] Excluding javax.xml.bind:jaxb-api:jar:2.2.7 from the shaded jar. [INFO] Including org.spark-project.spark:unused:jar:1.0.0 in the shaded jar. [INFO] Excluding org.scala-lang:scala-reflect:jar:2.10.4 from the shaded jar. [INFO] Replacing original artifact with shaded artifact. [INFO] Replacing /home/meethu/git/FlytxtRnD/spark/mllib/target/spark-mllib_2.10-1.4.0-SNAPSHOT.jar with /home/meethu/git/FlytxtRnD/spark/mllib/target/spark-mllib_2.10-1.4.0-SNAPSHOT-shaded.jar [INFO] Dependency-reduced POM written at: /home/meethu/git/FlytxtRnD/spark/mllib/dependency-reduced-pom.xml [INFO] Dependency-reduced POM written at: /home/meethu/git/FlytxtRnD/spark/mllib/dependency-reduced-pom.xml [INFO] Dependency-reduced POM written at: /home/meethu/git/FlytxtRnD/spark/mllib/dependency-reduced-pom.xml [INFO] Dependency-reduced POM written at: /home/meethu/git/FlytxtRnD/spark/mllib/dependency-reduced-pom.xml . I've seen something similar in a different build. It looks like MSHADE-148: https://issues.apache.org/jira/browse/MSHADE-148 if you apply Tom White's patch, does your problem go away?
[jira] [Commented] (SPARK-8018) KMeans should accept initial cluster centers as param
[ https://issues.apache.org/jira/browse/SPARK-8018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14572489#comment-14572489 ] Meethu Mathew commented on SPARK-8018: -- [~josephkb][~mengxr] Thank you for the comments. In the method suggested by Xiangrui, do we need to get the value of k as a parameter and then compare it with the value of model.k as in GMM? KMeans should accept initial cluster centers as param - Key: SPARK-8018 URL: https://issues.apache.org/jira/browse/SPARK-8018 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Joseph K. Bradley KMeans should allow model initialization using an existing set of cluster centers. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
Anyone facing problem in incremental building of individual project
Hi all, I added some new code to MLlib. When I am trying to build only the mllib project using mvn --projects mllib/ -DskipTests clean install after setting export SPARK_PREPEND_CLASSES=true, the build is getting stuck with the following message. Excluding org.jpmml:pmml-schema:jar:1.1.15 from the shaded jar. [INFO] Excluding com.sun.xml.bind:jaxb-impl:jar:2.2.7 from the shaded jar. [INFO] Excluding com.sun.xml.bind:jaxb-core:jar:2.2.7 from the shaded jar. [INFO] Excluding javax.xml.bind:jaxb-api:jar:2.2.7 from the shaded jar. [INFO] Including org.spark-project.spark:unused:jar:1.0.0 in the shaded jar. [INFO] Excluding org.scala-lang:scala-reflect:jar:2.10.4 from the shaded jar. [INFO] Replacing original artifact with shaded artifact. [INFO] Replacing /home/meethu/git/FlytxtRnD/spark/mllib/target/spark-mllib_2.10-1.4.0-SNAPSHOT.jar with /home/meethu/git/FlytxtRnD/spark/mllib/target/spark-mllib_2.10-1.4.0-SNAPSHOT-shaded.jar [INFO] Dependency-reduced POM written at: /home/meethu/git/FlytxtRnD/spark/mllib/dependency-reduced-pom.xml [INFO] Dependency-reduced POM written at: /home/meethu/git/FlytxtRnD/spark/mllib/dependency-reduced-pom.xml [INFO] Dependency-reduced POM written at: /home/meethu/git/FlytxtRnD/spark/mllib/dependency-reduced-pom.xml [INFO] Dependency-reduced POM written at: /home/meethu/git/FlytxtRnD/spark/mllib/dependency-reduced-pom.xml . But a full build completes as usual. Please help if anyone is facing the same issue. Regards, Meethu Mathew Senior Engineer Flytxt
[jira] [Comment Edited] (SPARK-8018) KMeans should accept initial cluster centers as param
[ https://issues.apache.org/jira/browse/SPARK-8018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14572489#comment-14572489 ] Meethu Mathew edited comment on SPARK-8018 at 6/4/15 10:11 AM: --- [~josephkb][~mengxr] Thank you for the comments. In the method suggested by Xiangrui, do we need to get the value of k as a parameter and then compare it with the value of model.k as in GMM? I am interested to work on this ticket. Please assign it to me was (Author: meethumathew): [~josephkb][~mengxr] Thank you for the comments. In the method suggested by Xiangrui, do we need to get the value of k as a parameter and then compare it with the value of model.k as in GMM? KMeans should accept initial cluster centers as param - Key: SPARK-8018 URL: https://issues.apache.org/jira/browse/SPARK-8018 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Joseph K. Bradley KMeans should allow model initialization using an existing set of cluster centers. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
Re: How to create fewer output files for Spark job ?
Try using coalesce. Thanks & Regards, Meethu M On Wednesday, 3 June 2015 11:26 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote: I am running a series of spark functions with 9000 executors and it is resulting in 9000+ files, which is exceeding the namespace file count quota. How can Spark be configured to use CombinedOutputFormat?
{code}
protected def writeOutputRecords(detailRecords: RDD[(AvroKey[DetailOutputRecord], NullWritable)], outputDir: String) {
  val writeJob = new Job()
  val schema = SchemaUtil.outputSchema(_detail)
  AvroJob.setOutputKeySchema(writeJob, schema)
  detailRecords.saveAsNewAPIHadoopFile(outputDir,
    classOf[AvroKey[GenericRecord]],
    classOf[org.apache.hadoop.io.NullWritable],
    classOf[AvroKeyOutputFormat[GenericRecord]],
    writeJob.getConfiguration)
}
{code} -- Deepak
[jira] [Commented] (SPARK-8018) KMeans should accept initial cluster centers as param
[ https://issues.apache.org/jira/browse/SPARK-8018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14570620#comment-14570620 ] Meethu Mathew commented on SPARK-8018: -- [~josephkb] For initialization using an existing set of cluster centers , do we need to supply centers for only 1 run ? or should we supply initial centers for multiple runs ? KMeans should accept initial cluster centers as param - Key: SPARK-8018 URL: https://issues.apache.org/jira/browse/SPARK-8018 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Joseph K. Bradley KMeans should allow model initialization using an existing set of cluster centers. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
Regarding Connecting spark to Mesos documentation
Hi List, In the documentation of Connecting Spark to Mesos http://spark.apache.org/docs/latest/running-on-mesos.html#connecting-spark-to-mesos, is it possible to expand the step "Create a binary package using make-distribution.sh --tgz" in more detail? When we use a custom compiled version of Spark, we mostly specify a hadoop version (which is not the default one). In this case, make-distribution.sh should be supplied the same maven options we used for building Spark. This is not specified in the documentation. Please correct me if I am wrong. Regards, Meethu Mathew
Re: How to run multiple jobs in one sparkcontext from separate threads in pyspark?
Hi Davies, Thank you for pointing to spark streaming. I am confused about how to return the result after running a function via a thread. I tried using Queue to add the results to it and print it at the end, but then I can see the results only after all threads have finished. How can I get the result of a function once its thread is finished, rather than waiting for all other threads to finish? Thanks & Regards, Meethu M On Tuesday, 19 May 2015 2:43 AM, Davies Liu dav...@databricks.com wrote: SparkContext can be used in multiple threads (Spark streaming works with multiple threads), for example:
import threading
import time

def show(x):
    time.sleep(1)
    print x

def job():
    sc.parallelize(range(100)).foreach(show)

threading.Thread(target=job).start()
On Mon, May 18, 2015 at 12:34 AM, ayan guha guha.a...@gmail.com wrote: Hi So to be clear, do you want to run one operation in multiple threads within a function, or do you want to run multiple jobs using multiple threads? I am wondering why the python thread module can't be used? Or have you already given it a try? On 18 May 2015 16:39, MEETHU MATHEW meethu2...@yahoo.co.in wrote: Hi Akhil, The python wrapper for Spark Job Server did not help me. I actually need the pyspark code sample which shows how I can call a function from 2 threads and execute it simultaneously. Thanks & Regards, Meethu M On Thursday, 14 May 2015 12:38 PM, Akhil Das ak...@sigmoidanalytics.com wrote: Did you happen to have a look at the spark job server? Someone wrote a python wrapper around it, give it a try. Thanks Best Regards On Thu, May 14, 2015 at 11:10 AM, MEETHU MATHEW meethu2...@yahoo.co.in wrote: Hi all, Quote "Inside a given Spark application (SparkContext instance), multiple parallel jobs can run simultaneously if they were submitted from separate threads." How to run multiple jobs in one SPARKCONTEXT using separate threads in pyspark? I found some examples in scala and java, but couldn't find python code. Can anyone help me with a pyspark example? Thanks & Regards, Meethu M - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
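One way to consume each job's result as soon as its thread finishes, rather than joining all threads first, is concurrent.futures (standard in Python 3; available for Python 2 via the `futures` backport). A sketch, assuming an existing SparkContext `sc`:
{code}
from concurrent.futures import ThreadPoolExecutor, as_completed

def job(n):
    # each call is an independent Spark action on the shared SparkContext
    return sc.parallelize(range(n)).map(lambda x: x * x).sum()

with ThreadPoolExecutor(max_workers=3) as pool:
    futures = [pool.submit(job, n) for n in (10, 100, 1000)]
    for f in as_completed(futures):  # yields each future the moment it completes
        print(f.result())
{code}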
Re: How to run multiple jobs in one sparkcontext from separate threads in pyspark?
Hi Akhil, The python wrapper for Spark Job Server did not help me. I actually need the pyspark code sample which shows how I can call a function from 2 threads and execute it simultaneously. Thanks & Regards, Meethu M On Thursday, 14 May 2015 12:38 PM, Akhil Das ak...@sigmoidanalytics.com wrote: Did you happen to have a look at the spark job server? Someone wrote a python wrapper around it, give it a try. Thanks, Best Regards On Thu, May 14, 2015 at 11:10 AM, MEETHU MATHEW meethu2...@yahoo.co.in wrote: Hi all, Quote "Inside a given Spark application (SparkContext instance), multiple parallel jobs can run simultaneously if they were submitted from separate threads." How to run multiple jobs in one SPARKCONTEXT using separate threads in pyspark? I found some examples in scala and java, but couldn't find python code. Can anyone help me with a pyspark example? Thanks & Regards, Meethu M
Re: Restricting the number of iterations in Mllib Kmeans
Hi, I think you can't supply an initial set of centroids to kmeans. Thanks & Regards, Meethu M On Friday, 15 May 2015 12:37 AM, Suman Somasundar suman.somasun...@oracle.com wrote: Hi, I want to run a definite number of iterations in Kmeans. There is a command line argument to set maxIterations, but even if I set it to a number, Kmeans runs until the centroids converge. Is there a specific way to specify it in the command line? Also, I wanted to know if we can supply the initial set of centroids to the program instead of it choosing the centroids at random? Thanks, Suman.
[jira] [Commented] (SPARK-7651) PySpark GMM predict, predictSoft should fail on bad input
[ https://issues.apache.org/jira/browse/SPARK-7651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14544910#comment-14544910 ] Meethu Mathew commented on SPARK-7651: -- [~josephkb] Yes, I will fix it asap. PySpark GMM predict, predictSoft should fail on bad input - Key: SPARK-7651 URL: https://issues.apache.org/jira/browse/SPARK-7651 Project: Spark Issue Type: Bug Components: MLlib, PySpark Affects Versions: 1.3.0, 1.3.1, 1.4.0 Reporter: Joseph K. Bradley Priority: Minor In PySpark, GaussianMixtureModel predict and predictSoft test if the argument is an RDD and operate correctly if so. But if the argument is not an RDD, they fail silently, returning nothing. [https://github.com/apache/spark/blob/11a1a135d1fe892cd48a9116acc7554846aed84c/python/pyspark/mllib/clustering.py#L176] Instead, they should raise errors. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
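For illustration, one way the check could look in clustering.py (a sketch of the fix pattern, not necessarily the exact patch that was merged):
{code}
from pyspark.rdd import RDD

class GaussianMixtureModel(object):
    # ...weights and gaussians fields elided...
    def predict(self, x):
        if isinstance(x, RDD):
            # label each point with the component of highest membership
            return self.predictSoft(x).map(lambda z: z.index(max(z)))
        # fail loudly on non-RDD input instead of silently returning None
        raise TypeError("x should be represented by an RDD, but got %s" % type(x))
{code}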
[jira] [Commented] (SPARK-7651) PySpark GMM predict, predictSoft should fail on bad input
[ https://issues.apache.org/jira/browse/SPARK-7651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14544924#comment-14544924 ] Meethu Mathew commented on SPARK-7651: -- Could you please tell me where I should make the changes? In master branch or 1.3.0? PySpark GMM predict, predictSoft should fail on bad input - Key: SPARK-7651 URL: https://issues.apache.org/jira/browse/SPARK-7651 Project: Spark Issue Type: Bug Components: MLlib, PySpark Affects Versions: 1.3.0, 1.3.1, 1.4.0 Reporter: Joseph K. Bradley Priority: Minor In PySpark, GaussianMixtureModel predict and predictSoft test if the argument is an RDD and operate correctly if so. But if the argument is not an RDD, they fail silently, returning nothing. [https://github.com/apache/spark/blob/11a1a135d1fe892cd48a9116acc7554846aed84c/python/pyspark/mllib/clustering.py#L176] Instead, they should raise errors. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7651) PySpark GMM predict, predictSoft should fail on bad input
[ https://issues.apache.org/jira/browse/SPARK-7651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14544929#comment-14544929 ] Meethu Mathew commented on SPARK-7651: -- Ok thank you PySpark GMM predict, predictSoft should fail on bad input - Key: SPARK-7651 URL: https://issues.apache.org/jira/browse/SPARK-7651 Project: Spark Issue Type: Bug Components: MLlib, PySpark Affects Versions: 1.3.0, 1.3.1, 1.4.0 Reporter: Joseph K. Bradley Priority: Minor In PySpark, GaussianMixtureModel predict and predictSoft test if the argument is an RDD and operate correctly if so. But if the argument is not an RDD, they fail silently, returning nothing. [https://github.com/apache/spark/blob/11a1a135d1fe892cd48a9116acc7554846aed84c/python/pyspark/mllib/clustering.py#L176] Instead, they should raise errors. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
How to run multiple jobs in one sparkcontext from separate threads in pyspark?
Hi all, Quote: "Inside a given Spark application (SparkContext instance), multiple parallel jobs can run simultaneously if they were submitted from separate threads." How to run multiple jobs in one SPARKCONTEXT using separate threads in pyspark? I found some examples in scala and java, but couldn't find python code. Can anyone help me with a pyspark example? Thanks & Regards, Meethu M
Re: Speeding up Spark build during development
Hi, Is it really necessary to run mvn --projects assembly/ -DskipTests install? Could you please explain why this is needed? I got the changes after running mvn --projects streaming/ -DskipTests package. Regards, Meethu On Monday 04 May 2015 02:20 PM, Emre Sevinc wrote: Just to give you an example: When I was trying to make a small change only to the Streaming component of Spark, first I built and installed the whole Spark project (this took about 15 minutes on my 4-core, 4 GB RAM laptop). Then, after having changed files only in Streaming, I ran something like (in the top-level directory): mvn --projects streaming/ -DskipTests package and then mvn --projects assembly/ -DskipTests install This was much faster than trying to build the whole Spark from scratch, because Maven was only building one component, in my case the Streaming component, of Spark. I think you can use a very similar approach. -- Emre Sevinç On Mon, May 4, 2015 at 10:44 AM, Pramod Biligiri pramodbilig...@gmail.com wrote: No, I just need to build one project at a time. Right now SparkSql. Pramod On Mon, May 4, 2015 at 12:09 AM, Emre Sevinc emre.sev...@gmail.com wrote: Hello Pramod, Do you need to build the whole project every time? Generally you don't, e.g., when I was changing some files that belong only to Spark Streaming, I was building only the streaming (of course after having built and installed the whole project, but that was done only once), and then the assembly. This was much faster than trying to build the whole Spark every time. -- Emre Sevinç On Mon, May 4, 2015 at 9:01 AM, Pramod Biligiri pramodbilig...@gmail.com wrote: Using the inbuilt maven and zinc it takes around 10 minutes for each build. Is that reasonable? My maven opts looks like this: $ echo $MAVEN_OPTS -Xmx12000m -XX:MaxPermSize=2048m I'm running it as build/mvn -DskipTests package Should I be tweaking my Zinc/Nailgun config? Pramod On Sun, May 3, 2015 at 3:40 PM, Mark Hamstra m...@clearstorydata.com wrote: https://spark.apache.org/docs/latest/building-spark.html#building-with-buildmvn On Sun, May 3, 2015 at 2:54 PM, Pramod Biligiri pramodbilig...@gmail.com wrote: This is great. I didn't know about the mvn script in the build directory. Pramod On Fri, May 1, 2015 at 9:51 AM, York, Brennon brennon.y...@capitalone.com wrote: Following what Ted said, if you leverage the `mvn` from within the `build/` directory of Spark you'll get zinc for free which should help speed up build times. On 5/1/15, 9:45 AM, Ted Yu yuzhih...@gmail.com wrote: Pramod: Please remember to run Zinc so that the build is faster. Cheers On Fri, May 1, 2015 at 9:36 AM, Ulanov, Alexander alexander.ula...@hp.com wrote: Hi Pramod, For cluster-like tests you might want to use the same code as in mllib's LocalClusterSparkContext. You can rebuild only the package that you change and then run this main class. Best regards, Alexander -----Original Message----- From: Pramod Biligiri [mailto:pramodbilig...@gmail.com] Sent: Friday, May 01, 2015 1:46 AM To: dev@spark.apache.org Subject: Speeding up Spark build during development Hi, I'm making some small changes to the Spark codebase and trying it out on a cluster. I was wondering if there's a faster way to build than running the package target each time. Currently I'm using: mvn -DskipTests package All the nodes have the same filesystem mounted at the same mount point. Pramod
-- Emre Sevinç
Spark-1.3.0 UI shows 0 cores in completed applications tab
Hi all, I started spark-shell in spark-1.3.0 and did some actions. The UI was showing 8 cores under the running applications tab. But when I exited the spark-shell using exit, the application moved to the completed applications tab and the number of cores shown was 0. When I instead exited the spark-shell using sc.stop(), it correctly showed 8 cores under the completed applications tab. Why is it showing 0 cores when I didn't use sc.stop()? Does anyone else face this issue? Thanks & Regards, Meethu M
[jira] [Commented] (SPARK-6485) Add CoordinateMatrix/RowMatrix/IndexedRowMatrix in PySpark
[ https://issues.apache.org/jira/browse/SPARK-6485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14379333#comment-14379333 ] Meethu Mathew commented on SPARK-6485: -- As you had mentioned here https://issues.apache.org/jira/browse/SPARK-6100, MatrixUDT has been merged. But MatrixUDT for PySpark seems to be under progress. Does https://issues.apache.org/jira/browse/SPARK-6390 block this task? Add CoordinateMatrix/RowMatrix/IndexedRowMatrix in PySpark -- Key: SPARK-6485 URL: https://issues.apache.org/jira/browse/SPARK-6485 Project: Spark Issue Type: Sub-task Components: MLlib, PySpark Reporter: Xiangrui Meng We should add APIs for CoordinateMatrix/RowMatrix/IndexedRowMatrix in PySpark. Internally, we can use DataFrames for serialization. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6227) PCA and SVD for PySpark
[ https://issues.apache.org/jira/browse/SPARK-6227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14358529#comment-14358529 ] Meethu Mathew commented on SPARK-6227: -- [~mengxr] Please give your inputs on the same. PCA and SVD for PySpark --- Key: SPARK-6227 URL: https://issues.apache.org/jira/browse/SPARK-6227 Project: Spark Issue Type: Sub-task Components: MLlib, PySpark Affects Versions: 1.2.1 Reporter: Julien Amelot The Dimensionality Reduction techniques are not available via Python (Scala + Java only). * Principal component analysis (PCA) * Singular value decomposition (SVD) Doc: http://spark.apache.org/docs/1.2.1/mllib-dimensionality-reduction.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6227) PCA and SVD for PySpark
[ https://issues.apache.org/jira/browse/SPARK-6227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14356428#comment-14356428 ] Meethu Mathew commented on SPARK-6227: -- I am interested in working on this ticket. Could anyone assign it to me? PCA and SVD for PySpark --- Key: SPARK-6227 URL: https://issues.apache.org/jira/browse/SPARK-6227 Project: Spark Issue Type: Improvement Components: MLlib, PySpark Affects Versions: 1.2.1 Reporter: Julien Amelot Priority: Minor The Dimensionality Reduction techniques are not available via Python (Scala + Java only). * Principal component analysis (PCA) * Singular value decomposition (SVD) Doc: http://spark.apache.org/docs/1.2.1/mllib-dimensionality-reduction.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
How to build Spark and run examples using Intellij ?
Hi, I am trying to run examples of spark (master branch from git) from Intellij (14.0.2) but facing errors. These are the steps I followed:
1. git clone the master branch of apache spark.
2. Build it using mvn -DskipTests clean install
3. In Intellij select Import Projects and choose the POM.xml of the spark root folder (Auto Import enabled)
4. Then I tried to run the SparkPi program, but got the following errors:
Information: 9/3/15 3:46 PM - Compilation completed with 44 errors and 0 warnings in 5 sec
/usr/local/spark-1.3.0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/dsl/package.scala
Error:(314, 109) polymorphic expression cannot be instantiated to expected type; found : [T(in method apply)]org.apache.spark.sql.catalyst.dsl.ScalaUdfBuilder[T(in method apply)] required: org.apache.spark.sql.catalyst.dsl.package.ScalaUdfBuilder[T(in method functionToUdfBuilder)] implicit def functionToUdfBuilder[T: TypeTag](func: Function1[_, T]): ScalaUdfBuilder[T] = ScalaUdfBuilder(func)
I am able to run examples of this built version of spark from the terminal using the ./bin/run-example script. Could someone please help me with this issue? Thanks & Regards, Meethu M
How to read from hdfs using spark-shell in Intel hadoop?
Hi, I am not able to read from HDFS (Intel distribution hadoop, Hadoop version is 1.0.3) from spark-shell (spark version is 1.2.1). I built spark using the command mvn -Dhadoop.version=1.0.3 clean package, started spark-shell, and read a HDFS file using sc.textFile(); the exception is WARN hdfs.DFSClient: Failed to connect to /10.88.6.133:50010, add to deadNodes and continue java.net.SocketTimeoutException: 12 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.88.6.131:44264 remote=/10.88.6.133:50010] The same problem is asked in the thread "RE: Spark is unable to read from HDFS" (archived on mail-archives.us.apache.org). As suggested in that mail: In addition to specifying HADOOP_VERSION=1.0.3 in the ./project/SparkBuild.scala file, you will need to specify the libraryDependencies and name spark-core resolvers. Otherwise, sbt will fetch version 1.0.3 of hadoop-core from apache instead of Intel. You can set up your own local or remote repository that you specify. Now HADOOP_VERSION is deprecated and -Dhadoop.version should be used. Can anybody please elaborate on how to specify that SBT should fetch hadoop-core from Intel, which is in our internal repository? Thanks & Regards, Meethu M
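For anyone hitting the same question: sbt resolves artifacts from whatever resolvers the build declares, so one general approach is to add the internal repository to the build definition, e.g. resolvers += "intel-repo" at "http://repo.example.internal/maven" (the name and URL here are placeholders for the actual internal repository), so that the hadoop-core version pinned in libraryDependencies is fetched from there rather than from Maven Central. This is a standard sbt mechanism, not something specific to the Spark build.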
Mail to u...@spark.apache.org failing
Hi, The mail id given in https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark seems to be failing. Can anyone tell me how to get added to the Powered By Spark list? -- Regards, *Meethu*
[jira] [Commented] (SPARK-5609) PythonMLlibAPI trainGaussianMixture seed should use Java type
[ https://issues.apache.org/jira/browse/SPARK-5609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14306593#comment-14306593 ] Meethu Mathew commented on SPARK-5609: -- Please assign the ticket to me. PythonMLlibAPI trainGaussianMixture seed should use Java type - Key: SPARK-5609 URL: https://issues.apache.org/jira/browse/SPARK-5609 Project: Spark Issue Type: Bug Components: MLlib, PySpark Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Priority: Trivial trainGaussianMixture takes parameter seed of type scala.Long but should take java.lang.Long. Otherwise, the test for whether seed is null (None in Python) will be ineffective. See compilation warning: {code} [warn] /Users/josephkb/spark/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala:304: comparing values of types Long and Null using `!=' will always yield true [warn] if (seed != null) gmmAlg.setSeed(seed) [warn] ^ {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-5609) PythonMLlibAPI trainGaussianMixture seed should use Java type
[ https://issues.apache.org/jira/browse/SPARK-5609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14306593#comment-14306593 ] Meethu Mathew edited comment on SPARK-5609 at 2/5/15 4:03 AM: -- Please assign the ticket to me. [~josephkb] was (Author: meethumathew): Please assign the ticket to me. PythonMLlibAPI trainGaussianMixture seed should use Java type - Key: SPARK-5609 URL: https://issues.apache.org/jira/browse/SPARK-5609 Project: Spark Issue Type: Bug Components: MLlib, PySpark Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Priority: Trivial trainGaussianMixture takes parameter seed of type scala.Long but should take java.lang.Long. Otherwise, the test for whether seed is null (None in Python) will be ineffective. See compilation warning: {code} [warn] /Users/josephkb/spark/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala:304: comparing values of types Long and Null using `!=' will always yield true [warn] if (seed != null) gmmAlg.setSeed(seed) [warn] ^ {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
Test suites in the python wrapper of kmeans failing
Hi, The test suites in the Kmeans class in clustering.py is not updated to take the seed value and hence it is failing. Shall I make the changes and submit it along with my PR( Python API for Gaussian Mixture Model) or create a JIRA ? Regards, Meethu - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
Re: Test suites in the python wrapper of kmeans failing
Hi, Sorry it was my mistake. My code was not properly built. Regards, Meethu _http://www.linkedin.com/home?trk=hb_tab_home_top_ On Thursday 22 January 2015 10:39 AM, Meethu Mathew wrote: Hi, The test suites in the Kmeans class in clustering.py is not updated to take the seed value and hence it is failing. Shall I make the changes and submit it along with my PR( Python API for Gaussian Mixture Model) or create a JIRA ? Regards, Meethu - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
[jira] [Commented] (SPARK-5012) Python API for Gaussian Mixture Model
[ https://issues.apache.org/jira/browse/SPARK-5012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14286942#comment-14286942 ] Meethu Mathew commented on SPARK-5012: -- [~tgaloppo] Thank you..Will update this PR asap.. Python API for Gaussian Mixture Model - Key: SPARK-5012 URL: https://issues.apache.org/jira/browse/SPARK-5012 Project: Spark Issue Type: New Feature Components: MLlib, PySpark Reporter: Xiangrui Meng Assignee: Meethu Mathew Add Python API for the Scala implementation of GMM. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5012) Python API for Gaussian Mixture Model
[ https://issues.apache.org/jira/browse/SPARK-5012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14279811#comment-14279811 ] Meethu Mathew commented on SPARK-5012: -- Once SPARK-5019 is resolved, we will make the changes accordingly.Thanks [~josephkb] [~tgaloppo] for the comments Python API for Gaussian Mixture Model - Key: SPARK-5012 URL: https://issues.apache.org/jira/browse/SPARK-5012 Project: Spark Issue Type: New Feature Components: MLlib, PySpark Reporter: Xiangrui Meng Assignee: Meethu Mathew Add Python API for the Scala implementation of GMM. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
Use of MapConverter, ListConverter in python to java object conversion
Hi all, In the python object to java conversion done in the method _py2java in spark/python/pyspark/mllib/common.py, why are we doing individual conversions using MapConverter and ListConverter? The same can be achieved using bytearray(PickleSerializer().dumps(obj)) obj = sc._jvm.SerDe.loads(bytes) Is there any performance gain or something in using individual converters rather than PickleSerializer? -- Regards, *Meethu*
[jira] [Commented] (SPARK-5012) Python API for Gaussian Mixture Model
[ https://issues.apache.org/jira/browse/SPARK-5012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14273561#comment-14273561 ] Meethu Mathew commented on SPARK-5012: -- I added a new class GaussianMixtureModel in clustering.py and the method predict in it, and I am trying to pass a List of more than one dimension to the function _py2java, but I am getting the exception 'list' object has no attribute '_get_object_id'; when I give a tuple input (Vectors.dense([0.8786, -0.7855]), Vectors.dense([-0.1863, 0.7799])) the exception is 'numpy.ndarray' object has no attribute '_get_object_id'. Can you help me solve this? My aim is to call predictSoft() in GaussianMixtureModel.scala from clustering.py by passing the values of weight, mean and sigma. Python API for Gaussian Mixture Model - Key: SPARK-5012 URL: https://issues.apache.org/jira/browse/SPARK-5012 Project: Spark Issue Type: New Feature Components: MLlib, PySpark Reporter: Xiangrui Meng Assignee: Meethu Mathew Add Python API for the Scala implementation of GMM. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
Re: Python to Java object conversion of numpy array
Hi, This is the function defined in PythonMLLibAPI.scala
def findPredict(
    data: JavaRDD[Vector],
    wt: Object,
    mu: Array[Object],
    si: Array[Object]): RDD[Array[Double]] = { }
So the parameter mu should be converted to Array[Object]. mu = (Vectors.dense([0.8786, -0.7855]), Vectors.dense([-0.1863, 0.7799]))
def _py2java(sc, obj):
    if isinstance(obj, RDD):
        ...
    elif isinstance(obj, SparkContext):
        ...
    elif isinstance(obj, dict):
        ...
    elif isinstance(obj, (list, tuple)):
        obj = ListConverter().convert(obj, sc._gateway._gateway_client)
    elif isinstance(obj, JavaObject):
        pass
    elif isinstance(obj, (int, long, float, bool, basestring)):
        pass
    else:
        bytes = bytearray(PickleSerializer().dumps(obj))
        obj = sc._jvm.SerDe.loads(bytes)
    return obj
Since it is a tuple of DenseVectors, in _py2java() it enters the isinstance(obj, (list, tuple)) branch and throws an exception (this happens because the tuple has dimension greater than 1). However, the conversion occurs correctly if the Pickle conversion is done (last else part). Hope it is clear now. Regards, Meethu On Monday 12 January 2015 11:35 PM, Davies Liu wrote: On Sun, Jan 11, 2015 at 10:21 PM, Meethu Mathew meethu.mat...@flytxt.com wrote: Hi, This is the code I am running. mu = (Vectors.dense([0.8786, -0.7855]), Vectors.dense([-0.1863, 0.7799])) membershipMatrix = callMLlibFunc(findPredict, rdd.map(_convert_to_vector), mu) What does the Java API look like? All the arguments of findPredict should be converted into java objects, so what should `mu` be converted to? Regards, Meethu On Monday 12 January 2015 11:46 AM, Davies Liu wrote: Could you post a piece of code here? On Sun, Jan 11, 2015 at 9:28 PM, Meethu Mathew meethu.mat...@flytxt.com wrote: Hi, Thanks Davies. I added a new class GaussianMixtureModel in clustering.py and the method predict in it, and I was trying to pass a numpy array from this method. I converted it to DenseVector and it is solved now. Similarly I tried passing a List of more than one dimension to the function _py2java, but now the exception is 'list' object has no attribute '_get_object_id'; when I give a tuple input (Vectors.dense([0.8786, -0.7855]), Vectors.dense([-0.1863, 0.7799])) the exception is 'numpy.ndarray' object has no attribute '_get_object_id' Regards, Meethu Mathew Engineer Flytxt www.flytxt.com | Visit our blog | Follow us | Connect on Linkedin On Friday 09 January 2015 11:37 PM, Davies Liu wrote: Hey Meethu, The Java API accepts only Vector, so you should convert the numpy array into pyspark.mllib.linalg.DenseVector. BTW, which class are you using? the KMeansModel.predict() accepts numpy.array, it will do the conversion for you. Davies On Fri, Jan 9, 2015 at 4:45 AM, Meethu Mathew meethu.mat...@flytxt.com wrote: Hi, I am trying to send a numpy array as an argument to a function predict() in a class in spark/python/pyspark/mllib/clustering.py which is passed to the function callMLlibFunc(name, *args) in spark/python/pyspark/mllib/common.py. Now the value is passed to the function _py2java(sc, obj). Here I am getting an exception Py4JJavaError: An error occurred while calling z:org.apache.spark.mllib.api.python.SerDe.loads.
: net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for numpy.core.multiarray._reconstruct) at net.razorvine.pickle.objects.ClassDictConstructor.construct(ClassDictConstructor.java:23) at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:617) at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:170) at net.razorvine.pickle.Unpickler.load(Unpickler.java:84) at net.razorvine.pickle.Unpickler.loads(Unpickler.java:97) Why common._py2java(sc, obj) is not handling numpy array type? Please help.. -- Regards, *Meethu Mathew* *Engineer* *Flytxt* www.flytxt.com | Visit our blog http://blog.flytxt.com/ | Follow us http://www.twitter.com/flytxt | _Connect on Linkedin http://www.linkedin.com/home?trk=hb_tab_home_top_
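Following the else-branch of _py2java shown above, a sketch of side-stepping the ListConverter branch by pickling the sequence explicitly (an existing SparkContext `sc` is assumed):
{code}
from pyspark.serializers import PickleSerializer
from pyspark.mllib.linalg import Vectors

mu = [Vectors.dense([0.8786, -0.7855]), Vectors.dense([-0.1863, 0.7799])]
# Serialize through the pickle path instead of letting the list/tuple
# fall into ListConverter, which cannot convert vector elements:
data = bytearray(PickleSerializer().dumps(mu))
jmu = sc._jvm.SerDe.loads(data)  # the same call _py2java's else-branch makes
{code}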
Re: Python to Java object conversion of numpy array
Hi, This is the code I am running. mu = (Vectors.dense([0.8786, -0.7855]),Vectors.dense([-0.1863, 0.7799])) membershipMatrix = callMLlibFunc(findPredict, rdd.map(_convert_to_vector), mu) Regards, Meethu On Monday 12 January 2015 11:46 AM, Davies Liu wrote: Could you post a piece of code here? On Sun, Jan 11, 2015 at 9:28 PM, Meethu Mathew meethu.mat...@flytxt.com wrote: Hi, Thanks Davies . I added a new class GaussianMixtureModel in clustering.py and the method predict in it and trying to pass numpy array from this method.I converted it to DenseVector and its solved now. Similarly I tried passing a List of more than one dimension to the function _py2java , but now the exception is 'list' object has no attribute '_get_object_id' and when I give a tuple input (Vectors.dense([0.8786, -0.7855]),Vectors.dense([-0.1863, 0.7799])) exception is like 'numpy.ndarray' object has no attribute '_get_object_id' Regards, Meethu Mathew Engineer Flytxt www.flytxt.com | Visit our blog | Follow us | Connect on Linkedin On Friday 09 January 2015 11:37 PM, Davies Liu wrote: Hey Meethu, The Java API accepts only Vector, so you should convert the numpy array into pyspark.mllib.linalg.DenseVector. BTW, which class are you using? the KMeansModel.predict() accept numpy.array, it will do the conversion for you. Davies On Fri, Jan 9, 2015 at 4:45 AM, Meethu Mathew meethu.mat...@flytxt.com wrote: Hi, I am trying to send a numpy array as an argument to a function predict() in a class in spark/python/pyspark/mllib/clustering.py which is passed to the function callMLlibFunc(name, *args) in spark/python/pyspark/mllib/common.py. Now the value is passed to the function _py2java(sc, obj) .Here I am getting an exception Py4JJavaError: An error occurred while calling z:org.apache.spark.mllib.api.python.SerDe.loads. : net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for numpy.core.multiarray._reconstruct) at net.razorvine.pickle.objects.ClassDictConstructor.construct(ClassDictConstructor.java:23) at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:617) at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:170) at net.razorvine.pickle.Unpickler.load(Unpickler.java:84) at net.razorvine.pickle.Unpickler.loads(Unpickler.java:97) Why common._py2java(sc, obj) is not handling numpy array type? Please help.. -- Regards, *Meethu Mathew* *Engineer* *Flytxt* www.flytxt.com | Visit our blog http://blog.flytxt.com/ | Follow us http://www.twitter.com/flytxt | _Connect on Linkedin http://www.linkedin.com/home?trk=hb_tab_home_top_
Re: Python to Java object conversion of numpy array
Hi, Thanks Davies. I added a new class GaussianMixtureModel in clustering.py and the method predict in it, and I was trying to pass a numpy array from this method. I converted it to DenseVector and it is solved now. Similarly I tried passing a List of more than one dimension to the function _py2java, but now the exception is 'list' object has no attribute '_get_object_id'; when I give a tuple input (Vectors.dense([0.8786, -0.7855]), Vectors.dense([-0.1863, 0.7799])) the exception is 'numpy.ndarray' object has no attribute '_get_object_id' Regards, *Meethu Mathew* *Engineer* *Flytxt* www.flytxt.com | Visit our blog http://blog.flytxt.com/ | Follow us http://www.twitter.com/flytxt | _Connect on Linkedin http://www.linkedin.com/home?trk=hb_tab_home_top_ On Friday 09 January 2015 11:37 PM, Davies Liu wrote: Hey Meethu, The Java API accepts only Vector, so you should convert the numpy array into pyspark.mllib.linalg.DenseVector. BTW, which class are you using? the KMeansModel.predict() accepts numpy.array, it will do the conversion for you. Davies On Fri, Jan 9, 2015 at 4:45 AM, Meethu Mathew meethu.mat...@flytxt.com wrote: Hi, I am trying to send a numpy array as an argument to a function predict() in a class in spark/python/pyspark/mllib/clustering.py which is passed to the function callMLlibFunc(name, *args) in spark/python/pyspark/mllib/common.py. Now the value is passed to the function _py2java(sc, obj). Here I am getting an exception Py4JJavaError: An error occurred while calling z:org.apache.spark.mllib.api.python.SerDe.loads. : net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for numpy.core.multiarray._reconstruct) at net.razorvine.pickle.objects.ClassDictConstructor.construct(ClassDictConstructor.java:23) at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:617) at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:170) at net.razorvine.pickle.Unpickler.load(Unpickler.java:84) at net.razorvine.pickle.Unpickler.loads(Unpickler.java:97) Why is common._py2java(sc, obj) not handling the numpy array type? Please help.. -- Regards, *Meethu Mathew* *Engineer* *Flytxt* www.flytxt.com | Visit our blog http://blog.flytxt.com/ | Follow us http://www.twitter.com/flytxt | _Connect on Linkedin http://www.linkedin.com/home?trk=hb_tab_home_top_
[jira] [Commented] (SPARK-5012) Python API for Gaussian Mixture Model
[ https://issues.apache.org/jira/browse/SPARK-5012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14261923#comment-14261923 ] Meethu Mathew commented on SPARK-5012: -- The python implementation of the algorithm has already been added to spark-packages http://spark-packages.org/package/11 and it would be great if we are given a chance to write the Python wrappers for the algorithm. Python API for Gaussian Mixture Model - Key: SPARK-5012 URL: https://issues.apache.org/jira/browse/SPARK-5012 Project: Spark Issue Type: New Feature Components: MLlib, PySpark Reporter: Xiangrui Meng Assignee: Travis Galoppo Add Python API for the Scala implementation of GMM. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5015) GaussianMixtureEM should take random seed parameter
[ https://issues.apache.org/jira/browse/SPARK-5015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14261936#comment-14261936 ] Meethu Mathew commented on SPARK-5015: -- Instead of using a random seed, using the cluster centers returned by kmeans++ to initialize the means in GMM would be a good strategy, as implemented in scikit-learn http://scikit-learn.org/stable/modules/mixture.html#mixture. What is your opinion? GaussianMixtureEM should take random seed parameter --- Key: SPARK-5015 URL: https://issues.apache.org/jira/browse/SPARK-5015 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.2.0 Reporter: Joseph K. Bradley Priority: Minor GaussianMixtureEM uses randomness but does not take a random seed. It should take one as a parameter. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
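A sketch of that strategy in PySpark (an existing SparkContext `sc` is assumed; the `initialMeans` parameter in the commented line is purely hypothetical, since GaussianMixtureEM exposed no such option at the time):
{code}
from pyspark.mllib.clustering import KMeans

data = sc.parallelize([[0.0, 0.0], [0.5, 0.3], [10.0, 9.0], [9.5, 10.2]])
km = KMeans.train(data, 2, maxIterations=10)
seed_means = km.clusterCenters  # use k-means centers as the GMM's initial means
# gmm = GaussianMixture.train(data, 2, initialMeans=seed_means)  # hypothetical
{code}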
[jira] [Commented] (SPARK-5015) GaussianMixtureEM should take random seed parameter
[ https://issues.apache.org/jira/browse/SPARK-5015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14261946#comment-14261946 ] Meethu Mathew commented on SPARK-5015: -- We would try to experiment with both the initialization methods and come up with a comparison on cluster quality and running time. GaussianMixtureEM should take random seed parameter --- Key: SPARK-5015 URL: https://issues.apache.org/jira/browse/SPARK-5015 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.2.0 Reporter: Joseph K. Bradley Priority: Minor GaussianMixtureEM uses randomness but does not take a random seed. It should take one as a parameter. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
Re: Problems concerning implementing machine learning algorithm from scratch based on Spark
Hi, The GMMSpark.py you mentioned is the old one. The new code is now added to spark-packages and is available at http://spark-packages.org/package/11. Have a look at the new code. We have used numpy functions in our code and didn't notice any slowdown because of this. Thanks & Regards, Meethu M On Tuesday, 30 December 2014 11:50 AM, danqing0703 danqing0...@berkeley.edu wrote: Hi all, I am trying to use some machine learning algorithms that are not included in MLlib, like Mixture Model and LDA (Latent Dirichlet Allocation), and I am using pyspark and Spark SQL. My problem is: I have some scripts that implement these algorithms, but I am not sure which parts I should change to make them fit Big Data. - Some very simple calculations may take much time if the data is too big, but constructing an RDD or SQLContext table also takes too much time. I am really not sure if I should use map(), reduce() every time I need to make a calculation. - Also, there are some matrix/array-level calculations that cannot be implemented easily using only map() and reduce(), so functions of the Numpy package must be used. When the data is too big and we simply use the numpy functions, will it take too much time? I have found some scripts that are not from Mllib and were created by other developers (credits to Meethu Mathew from Flytxt, thanks for giving me insights! :)) Many thanks and look forward to getting feedback! Best, Danqing GMMSpark.py (7K) http://apache-spark-developers-list.1001551.n3.nabble.com/attachment/9964/0/GMMSpark.py -- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Problems-concerning-implementing-machine-learning-algorithm-from-scratch-based-on-Spark-tp9964.html Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
[jira] [Commented] (SPARK-4156) Add expectation maximization for Gaussian mixture models to MLLib clustering
[ https://issues.apache.org/jira/browse/SPARK-4156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14242486#comment-14242486 ] Meethu Mathew commented on SPARK-4156: -- [~tgaloppo] The current version of the code has no predict function to return the cluster labels, i.e., the index of the cluster to which the point has maximum membership. We have written a predict function to return the cluster labels and the membership values. We would be happy to contribute this to your code. cc [~mengxr] Add expectation maximization for Gaussian mixture models to MLLib clustering Key: SPARK-4156 URL: https://issues.apache.org/jira/browse/SPARK-4156 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Travis Galoppo Assignee: Travis Galoppo As an additional clustering algorithm, implement expectation maximization for Gaussian mixture models -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
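For reference, the essence of such a predict step in plain numpy (a minimal sketch of the idea, not the contributed Scala code; names are illustrative):
{code}
import numpy as np

def predict(point, weights, means, sigmas):
    """Return (label, memberships) for one point of a fitted GMM."""
    def gaussian_pdf(x, mu, sigma):
        d = len(mu)
        diff = x - mu
        norm = np.sqrt(((2 * np.pi) ** d) * np.linalg.det(sigma))
        return np.exp(-0.5 * diff.dot(np.linalg.inv(sigma)).dot(diff)) / norm

    # responsibility of each component for the point
    p = np.array([w * gaussian_pdf(point, m, s)
                  for w, m, s in zip(weights, means, sigmas)])
    memberships = p / p.sum()
    return int(np.argmax(memberships)), memberships
{code}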
Re: Mllib Error
Hi,
Try this: change spark-mllib to spark-mllib_2.10.

libraryDependencies ++= Seq(
  "org.apache.spark" % "spark-core_2.10" % "1.1.1",
  "org.apache.spark" % "spark-mllib_2.10" % "1.1.1"
)

Thanks & Regards, Meethu M

On Friday, 12 December 2014 12:22 PM, amin mohebbi aminn_...@yahoo.com.INVALID wrote:
I'm trying to build a very simple Scala standalone app using MLlib, but I get the following error when trying to build the program: "Object Mllib is not a member of package org.apache.spark". Then I realized that I have to add MLlib as a dependency, as follows:

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.1.0",
  "org.apache.spark" %% "spark-mllib" % "1.1.0"
)

But here I got an error that says: unresolved dependency spark-core_2.10.4;1.1.1 : not found. So I had to modify it to "org.apache.spark" % "spark-core_2.10" % "1.1.1". But there is still an error that says: unresolved dependency spark-mllib;1.1.1 : not found. Does anyone know how to add the MLlib dependency in the .sbt file?
Best Regards ... Amin Mohebbi, PhD candidate in Software Engineering at University of Malaysia. Tel: +60 18 2040 017. E-Mail: tp025...@ex.apiit.edu.my amin_...@me.com
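The underlying issue is the Scala binary-version suffix: with a single %, the artifact name must carry _2.10 explicitly, while %% appends it automatically from the project's scalaVersion. A sketch of the equivalent %% form, using the same versions as the thread:

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"  % "1.1.1",
  "org.apache.spark" %% "spark-mllib" % "1.1.1"
)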
Re: How to incrementally compile spark examples using mvn
Hi all,
I made some code changes in the mllib project, and as mentioned in the previous mails I did

mvn install -pl mllib

Now when I run a program in examples using run-example, the new code is not executing; the previous code itself is running. But if I do an mvn install on the entire Spark project, I can see the new code running. Installing the entire Spark project takes a lot of time, so it's difficult to do this each time I make a change. Can someone tell me how to compile mllib alone and get the changes working?
Thanks & Regards, Meethu M

On Friday, 28 November 2014 2:39 PM, MEETHU MATHEW meethu2...@yahoo.co.in wrote:
Hi, I have a similar problem. I modified the code in mllib and examples and did

mvn install -pl mllib
mvn install -pl examples

But when I run the program in examples using run-example, the older version of mllib (before the changes were made) is getting executed. How do I get the changes made in mllib while calling it from the examples project?
Thanks & Regards, Meethu M

On Monday, 24 November 2014 3:33 PM, Yiming (John) Zhang sdi...@gmail.com wrote:
Thank you, Marcelo and Sean, mvn install is a good answer for my demands.

-----Original Message-----
From: Marcelo Vanzin [mailto:van...@cloudera.com]
Sent: 21 November 2014, 1:47
To: yiming zhang
Cc: Sean Owen; user@spark.apache.org
Subject: Re: How to incrementally compile spark examples using mvn

Hi Yiming,
On Wed, Nov 19, 2014 at 5:35 PM, Yiming (John) Zhang sdi...@gmail.com wrote:
Thank you for your reply. I was wondering whether there is a method of reusing locally-built components without installing them? That is, if I have successfully built the Spark project as a whole, how should I configure it so that I can incrementally build (only) the spark-examples subproject without the need of downloading or installation?
As Sean suggests, you shouldn't need to install anything. After mvn install, your local repo is a working Spark installation, and you can use spark-submit and other tools directly within it. You just need to remember to rebuild the assembly/ project when modifying Spark code (or the examples/ project when modifying examples). -- Marcelo
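Following Marcelo's hint, the missing step is rebuilding the assembly jar that run-example actually loads. A plausible command sequence for the Spark 1.x Maven layout discussed in this thread (an assumption, not an official recipe):

# Rebuild only the changed module...
mvn install -pl mllib -DskipTests
# ...then regenerate the assembly jar that bin/run-example puts on the classpath.
mvn install -pl assembly -DskipTests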
[jira] [Commented] (SPARK-4156) Add expectation maximization for Gaussian mixture models to MLLib clustering
[ https://issues.apache.org/jira/browse/SPARK-4156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14231226#comment-14231226 ] Meethu Mathew commented on SPARK-4156:
--
We ran the GMM code on two public datasets:
http://cs.joensuu.fi/sipu/datasets/s1.txt
http://cs.joensuu.fi/sipu/datasets/birch2.txt
In both cases the execution converged at the 3rd iteration, and the w, mu and sigma were identical for all the components. The code was run using the following commands:
./bin/run-example org.apache.spark.examples.mllib.DenseGmmEM s1.csv 15 .0001
./bin/run-example org.apache.spark.examples.mllib.DenseGmmEM birch2.csv 100 .0001
Are we missing something here?

> Add expectation maximization for Gaussian mixture models to MLLib clustering
> -----------------------------------------------------------------------------
> Key: SPARK-4156
> URL: https://issues.apache.org/jira/browse/SPARK-4156
> Project: Spark
> Issue Type: New Feature
> Components: MLlib
> Reporter: Travis Galoppo
> Assignee: Travis Galoppo
>
> As an additional clustering algorithm, implement expectation maximization for Gaussian mixture models
[jira] [Commented] (SPARK-4156) Add expectation maximization for Gaussian mixture models to MLLib clustering
[ https://issues.apache.org/jira/browse/SPARK-4156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14232639#comment-14232639 ] Meethu Mathew commented on SPARK-4156:
--
We considered only a diagonal covariance matrix, and it was initialized using the variance of each feature.

> Add expectation maximization for Gaussian mixture models to MLLib clustering
> -----------------------------------------------------------------------------
> Key: SPARK-4156
> URL: https://issues.apache.org/jira/browse/SPARK-4156
> Project: Spark
> Issue Type: New Feature
> Components: MLlib
> Reporter: Travis Galoppo
> Assignee: Travis Galoppo
>
> As an additional clustering algorithm, implement expectation maximization for Gaussian mixture models
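A minimal sketch of that initialization using MLlib's column statistics, assuming data is an RDD[Vector] of the training points (names are illustrative):

import org.apache.spark.mllib.stat.Statistics

// Per-feature variance over the whole data set; each component's diagonal
// covariance starts from this vector.
val summary = Statistics.colStats(data)
val initialDiagSigma = summary.variance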
[jira] [Commented] (SPARK-3588) Gaussian Mixture Model clustering
[ https://issues.apache.org/jira/browse/SPARK-3588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14224091#comment-14224091 ] Meethu Mathew commented on SPARK-3588:
--
[~mengxr] We have completed the PySpark implementation, which is available at https://github.com/FlytxtRnD/GMM. We are in the process of porting the code to Scala and were planning to create a PR once the coding and test cases are completed. By merging, do you mean merging the tickets or the implementations? Kindly explain how the merge would be done. Will our work be a duplicate effort if we continue with our Scala implementation? Could you please suggest the next course of action?

> Gaussian Mixture Model clustering
> ---------------------------------
> Key: SPARK-3588
> URL: https://issues.apache.org/jira/browse/SPARK-3588
> Project: Spark
> Issue Type: New Feature
> Components: MLlib, PySpark
> Reporter: Meethu Mathew
> Assignee: Meethu Mathew
> Attachments: GMMSpark.py
>
> Gaussian Mixture Models (GMM) is a popular technique for soft clustering. GMM models the entire data set as a finite mixture of Gaussian distributions, each parameterized by a mean vector µ, a covariance matrix ∑ and a mixture weight π. In this technique, the probability of each point belonging to each cluster is computed along with the cluster statistics.
> We have come up with an initial distributed implementation of GMM in PySpark where the parameters are estimated using the Expectation-Maximization algorithm. Our current implementation considers a diagonal covariance matrix for each component.
> We did an initial benchmark study on a 2-node Spark standalone cluster where each node has 8 cores and 8 GB RAM; the Spark version used is 1.0.0. We also evaluated the Python version of k-means available in Spark on the same datasets. Below are the results from this benchmark study. The reported stats are averages from 10 runs. Tests were done on multiple datasets with varying numbers of features and instances.
> ||Instances||Dimensions||GMM: avg time per iteration||GMM: time for 100 iterations||K-means (Python): avg time per iteration||K-means (Python): time for 100 iterations||
> |0.7 million|13|7 s|12 min|13 s|26 min|
> |1.8 million|11|17 s|29 min|33 s|53 min|
> |10 million|16|1.6 min|2.7 hr|1.2 min|2 hr|
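For a diagonal covariance, the per-point E-step of the EM loop benchmarked above reduces to a product of one-dimensional Gaussians. A self-contained Scala sketch with illustrative names (not the GMMSpark.py code itself); working in log space avoids underflow on high-dimensional points:

def responsibilities(x: Array[Double],
                     weights: Array[Double],
                     means: Array[Array[Double]],
                     variances: Array[Array[Double]]): Array[Double] = {
  val unnormalized = Array.tabulate(weights.length) { j =>
    var logDensity = 0.0
    for (d <- x.indices) {
      val diff = x(d) - means(j)(d)
      // log N(x_d | mu_jd, sigma_jd^2), summed over independent features
      logDensity += -0.5 * (math.log(2 * math.Pi * variances(j)(d)) +
                            diff * diff / variances(j)(d))
    }
    weights(j) * math.exp(logDensity)
  }
  val total = unnormalized.sum
  unnormalized.map(_ / total)  // posterior membership of x in each component
}

The M-step then re-estimates each w, mu and sigma from these memberships, which is what each Spark iteration aggregates.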
Re: [MLlib] Contributing Algorithm for Outlier Detection
Hi Ashutosh,
Please edit the README file. I think the following function call has changed now:

model = OutlierWithAVFModel.outliers(master: String, inputDir: String, percentage: Double)

Regards,
Meethu Mathew
Engineer, Flytxt
http://www.linkedin.com/home?trk=hb_tab_home_top

On Friday 14 November 2014 12:01 AM, Ashutosh wrote:
Hi Anant,
Please see the changes: https://github.com/codeAshu/Outlier-Detection-with-AVF-Spark/blob/master/OutlierWithAVFModel.scala
I have changed the input format to Vector of String. I think we can also make it generic. Lines 59 & 72: that counter will not affect parallelism, since it only works on one data point; it only does the indexing of the column. All other side effects have been removed.
Thanks, Ashutosh

From: slcclimber [via Apache Spark Developers List] ml-node+s1001551n9287...@n3.nabble.com
Sent: Tuesday, November 11, 2014 11:46 PM
To: Ashutosh Trivedi (MT2013030)
Subject: Re: [MLlib] Contributing Algorithm for Outlier Detection
Mayur, the LibSVM format sounds good to me. I could work on writing the tests if that helps you?
Anant

On Nov 11, 2014 11:06 AM, Ashutosh [via Apache Spark Developers List] wrote:
Hi Mayur,
Vector data types are implemented using the Breeze library; they live at .../org/apache/spark/mllib/linalg
Anant, one restriction I found is that a vector can only be of 'Double', so it actually restricts the user. What are your thoughts on the LibSVM format? Thanks for the comments; I was just trying to get away from those increment/decrement functions, they look ugly. Points are noted, I'll try to fix them soon. Tests are also required for the code.
Regards, Ashutosh

From: Mayur Rustagi [via Apache Spark Developers List]
Sent: Saturday, November 8, 2014 12:52 PM
To: Ashutosh Trivedi (MT2013030)
Subject: Re: [MLlib] Contributing Algorithm for Outlier Detection
"We should take a vector instead, giving the user flexibility to decide data source/type." What do you mean by vector datatype exactly?
Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi

On Wed, Nov 5, 2014 at 6:45 AM, slcclimber wrote:
Ashutosh, I still see a few issues.
1. On line 112 you are counting using a counter. Since this will happen in an RDD, the counter will cause issues. It is also not good functional style to use a filter function with a side effect. You could use randomSplit instead; it does the same thing without the side effect.
2. The similar shared usage of j on line 102 is going to be an issue as well. Also, the hash seed does not need to be sequential; it could be randomly generated or hashed on the values.
3. The compute function and trim scores still run on a comma-separated RDD. We should take a vector instead, giving the user flexibility to decide the data source/type. What if we want data from Hive tables, or in Parquet, JSON or Avro formats? This is a very restrictive format. With vectors the user can take in whatever data format and convert it to vectors, instead of reading JSON files, creating a CSV file and then working on that.
4. The similar use of counters on lines 54 and 65 is an issue. Basically, shared-state counters are a huge issue that does not scale, since the processing of RDDs is distributed and the value j lives on the master.
Anant

On Tue, Nov 4, 2014 at 7:22 AM, Ashutosh [via Apache Spark Developers List] wrote:
Anant, I got rid of those increment/decrement functions and now the code is much cleaner. Please check; all your comments have been looked after. https://github.com/codeAshu/Outlier-Detection-with-AVF-Spark/blob/master/OutlierWithAVFModel.scala
Ashu

From: slcclimber [via Apache Spark Developers List]
Sent: Friday, October 31, 2014 10:09 AM
To: Ashutosh Trivedi (MT2013030)
Subject: Re: [MLlib] Contributing Algorithm for Outlier Detection
You should create a JIRA ticket to go with it as well. Thanks
On Oct 30, 2014 10:38 PM, Ashutosh [via Apache Spark
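The randomSplit suggestion in point 1 looks like this in practice; a minimal sketch assuming an existing RDD named data:

// Each element is assigned to a split independently and reproducibly under
// the given seed; no shared counter, so it stays correct when the
// computation is distributed across executors.
val Array(train, holdout) = data.randomSplit(Array(0.9, 0.1), seed = 11L)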
Re: [MLlib] Contributing Algorithm for Outlier Detection
Hi,
I have a doubt regarding the input to your algorithm:

val model = OutlierWithAVFModel.outliers(data: RDD[Vector[String]], percent: Double, sc: SparkContext)

Here our input data is an RDD[Vector[String]]. How can we create this RDD from a file? sc.textFile will simply give us an RDD[String]; how do we make it an RDD[Vector[String]]? Could you please share a code snippet of this conversion if you have one?

Regards, Meethu Mathew

On Friday 14 November 2014 10:02 AM, Meethu Mathew wrote:
Hi Ashutosh,
Please edit the README file. I think the following function call has changed now:
model = OutlierWithAVFModel.outliers(master: String, inputDir: String, percentage: Double)
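One way to build that input, assuming comma-separated records and an illustrative file name (a hedged sketch; the original thread does not include this conversion):

import org.apache.spark.rdd.RDD

// Split each text line into fields, giving one Scala Vector[String] per record.
val data: RDD[Vector[String]] = sc.textFile("outlier_input.csv")
  .map(line => line.split(",").toVector)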
Re: ISpark class not found
Hi,
I was also trying ISpark, but I couldn't even start the notebook. I am getting the following error:

ERROR:tornado.access:500 POST /api/sessions (127.0.0.1) 10.15ms referer=http://localhost:/notebooks/Scala/Untitled0.ipynb

How did you start the notebook?
Thanks & Regards, Meethu M

On Wednesday, 12 November 2014 6:50 AM, Laird, Benjamin benjamin.la...@capitalone.com wrote:
I've been experimenting with the ISpark extension to IScala (https://github.com/tribbloid/ISpark). Objects created in the REPL are not being loaded correctly on worker nodes, leading to a ClassNotFoundException. This does work correctly in spark-shell. I was curious if anyone has used ISpark and has encountered this issue. Thanks!
Simple example:
In [1]: case class Circle(rad:Float)
In [2]: val rdd = sc.parallelize(1 to 1).map(i => Circle(i.toFloat)).take(10)
14/11/11 13:03:35 ERROR TaskResultGetter: Exception while getting task result
com.esotericsoftware.kryo.KryoException: Unable to find class: [L$line5.$read$$iwC$$iwC$Circle;
Full trace in my gist: https://gist.github.com/benjaminlaird/3e543a9a89fb499a3a14
Is there a step-by-step instruction on how to build Spark App with IntelliJ IDEA?
Hi,
This question was asked earlier and I did it in the way specified, but I am getting java.lang.ClassNotFoundException. Can somebody explain all the steps required to build a Spark app using IntelliJ (latest version), starting from creating the project to running it? I searched a lot but couldn't find appropriate documentation.

The earlier answer, from "Re: Is there a step-by-step instruction on how to build Spark App with IntelliJ IDEA?":
"Don't try to use spark-core as an archetype. Instead just create a plain Scala project (no archetype) and add a Maven dependency on spark-core. That should be all you need."

Thanks & Regards, Meethu M
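For the quoted suggestion, the Maven dependency would look something like the following; a sketch using artifact coordinates that appear elsewhere in this archive (match the version and Scala suffix to your cluster):

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.10</artifactId>
  <version>1.1.0</version>
</dependency>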
Re: Relation between worker memory and executor memory in standalone mode
Try to set --total-executor-cores to limit how many total cores it can use.
Thanks & Regards, Meethu M

On Thursday, 2 October 2014 2:39 AM, Akshat Aranya aara...@gmail.com wrote:
I guess one way to do so would be to run more than one worker per node; say, instead of running 1 worker and giving it 8 cores, you can run 4 workers with 2 cores each. Then you get 4 executors with 2 cores each.

On Wed, Oct 1, 2014 at 1:06 PM, Boromir Widas vcsub...@gmail.com wrote:
I have not found a way to control the cores yet. This effectively limits the cluster to a single application at a time; a subsequent application shows in the 'WAITING' state on the dashboard.

On Wed, Oct 1, 2014 at 2:49 PM, Akshat Aranya aara...@gmail.com wrote:
Experimenting with this some more, I figured out that an executor takes away spark.executor.memory amount of memory from the configured worker memory. It also takes up all the cores, so even if there is still some memory left, there are no cores left for starting another executor. Is my assessment correct? Is there no way to configure the number of cores that an executor can use?

On Wed, Oct 1, 2014 at 11:33 AM, Akshat Aranya aara...@gmail.com wrote:
By "the job" do you mean one SparkContext or one stage execution within a program? Does that also mean that two concurrent jobs will get one executor each at the same time?

On Wed, Oct 1, 2014 at 11:00 AM, Boromir Widas vcsub...@gmail.com wrote:
1. Worker memory caps executor memory.
2. With the default config, every job gets one executor per worker. This executor runs with all cores available to the worker.

On Wed, Oct 1, 2014 at 11:04 AM, Akshat Aranya aara...@gmail.com wrote:
Hi,
What's the relationship between Spark worker and executor memory settings in standalone mode? Do they work independently, or does the worker cap executor memory? Also, is the number of concurrent executors per worker capped by the number of CPU cores configured for the worker?
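The flag mentioned at the top maps to the spark.cores.max setting, which can also be set in code; a sketch with illustrative values, assuming a standalone cluster:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("spark://master:7077")    // illustrative master URL
  .setAppName("capped-app")            // hypothetical application name
  .set("spark.cores.max", "4")         // total cores this app may take, cluster-wide
  .set("spark.executor.memory", "2g")  // per-executor memory, capped by worker memory
val sc = new SparkContext(conf)

With this cap in place the remaining cores stay free, so a second application no longer sits in the WAITING state.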
Same code --works in spark 1.0.2-- but not in spark 1.1.0
Hi all,
My code was working fine in Spark 1.0.2, but after upgrading to 1.1.0 it throws exceptions and tasks fail. The code contains some map and filter transformations followed by a groupByKey (a reduceByKey in another code path). What I could find out is that the code works fine until the groupByKey or reduceByKey in both versions, but after that the following errors show up in Spark 1.1.0:

java.io.FileNotFoundException: /tmp/spark-local-20141006173014-4178/35/shuffle_6_0_5161 (Too many open files)
        java.io.FileOutputStream.openAppend(Native Method)
        java.io.FileOutputStream.<init>(FileOutputStream.java:210)
        org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:123)
        org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:192)
        org.apache.spark.shuffle.hash.HashShuffleWriter$$anonfun$write$1.apply(HashShuffleWriter.scala:67)
        org.apache.spark.shuffle.hash.HashShuffleWriter$$anonfun$write$1.apply(HashShuffleWriter.scala:65)
        scala.collection.Iterator$class.foreach(Iterator.scala:727)
        scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
        org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:65)
        org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
        org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
        org.apache.spark.scheduler.Task.run(Task.scala:54)
        org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
        java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
        java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        java.lang.Thread.run(Thread.java:701)

I cleaned my /tmp directory and changed my local directory to another folder, but nothing helped. Can anyone say what the reason could be?
Thanks & Regards, Meethu M
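Two settings commonly suggested for this hash-shuffle symptom in Spark 1.1, besides raising the OS open-file limit with ulimit -n; a sketch, worth verifying against the configuration docs for your exact version:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  // Consolidate intermediate shuffle outputs so each map task opens fewer files.
  .set("spark.shuffle.consolidateFiles", "true")
  // Or switch from the hash-based to the sort-based shuffle added in 1.1,
  // which writes one sorted output file per map task.
  .set("spark.shuffle.manager", "sort")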
[jira] [Commented] (SPARK-3588) Gaussian Mixture Model clustering
[ https://issues.apache.org/jira/browse/SPARK-3588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14154434#comment-14154434 ] Meethu Mathew commented on SPARK-3588:
--
Ok. We will start implementing the Scala version of the Gaussian Mixture Model.

> Gaussian Mixture Model clustering
> ---------------------------------
> Key: SPARK-3588
> URL: https://issues.apache.org/jira/browse/SPARK-3588
> Project: Spark
> Issue Type: New Feature
> Components: MLlib, PySpark
> Reporter: Meethu Mathew
> Assignee: Meethu Mathew
> Attachments: GMMSpark.py