[jira] [Commented] (SPARK-25452) Query with where clause is giving unexpected result in case of float column
[ https://issues.apache.org/jira/browse/SPARK-25452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16628433#comment-16628433 ] Meethu Mathew commented on SPARK-25452:

This is not a duplicate of -SPARK-24829.- !image-2018-09-26-14-14-47-504.png!

> Query with where clause is giving unexpected result in case of float column
>
> Key: SPARK-25452
> URL: https://issues.apache.org/jira/browse/SPARK-25452
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.3.1
> Environment: Spark 2.3.1, Hadoop 2.7.2
> Reporter: Ayush Anubhava
> Priority: Major
> Attachments: image-2018-09-26-14-14-47-504.png
>
> *Description*: A query with a where clause gives an unexpected result in the case of a float column.
>
> A greater-than-or-equal filter returns the expected rows:
> {code}
> 0: jdbc:hive2://10.18.18.214:23040/default> create table k2 (a int, b float);
> 0: jdbc:hive2://10.18.18.214:23040/default> insert into table k2 values (0,0.0);
> 0: jdbc:hive2://10.18.18.214:23040/default> insert into table k2 values (1,1.1);
> 0: jdbc:hive2://10.18.18.214:23040/default> select * from k2 where b >= 0.0;
> +----+----------------+
> | a  | b              |
> +----+----------------+
> | 0  | 0.0            |
> | 1  | 1.10023841858  |
> +----+----------------+
> {code}
> But a less-than-or-equal filter gives an inappropriate result:
> {code}
> 0: jdbc:hive2://10.18.18.214:23040/default> select * from k2 where b <= 1.1;
> +----+------+
> | a  | b    |
> +----+------+
> | 0  | 0.0  |
> +----+------+
> 1 row selected (0.299 seconds)
> {code}
[jira] [Updated] (SPARK-25452) Query with where clause is giving unexpected result in case of float column
[ https://issues.apache.org/jira/browse/SPARK-25452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Meethu Mathew updated SPARK-25452:

Attachment: image-2018-09-26-14-14-47-504.png
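The likely cause in reports like this is numeric widening: a FLOAT column compared against the literal 1.1 is promoted to DOUBLE, and the stored float 1.1 widens to roughly 1.1000000238, which is no longer <= 1.1. A hedged sketch of the usual workaround, keeping the comparison in float by casting the literal (table name as in the report above):

{code}
# run from PySpark against the k2 table created in the report above
sqlContext.sql("select * from k2 where b <= cast(1.1 as float)").show()
# expected: both rows (a=0 and a=1) are returned
{code}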
Filtering based on a float value with more than one decimal place not working correctly in Pyspark dataframe
Hi all, I tried the following code and the output was not as expected.

{code}
schema = StructType([StructField('Id', StringType(), False),
                     StructField('Value', FloatType(), False)])
df_test = spark.createDataFrame([('a',5.0),('b',1.236),('c',-0.31)], schema)
df_test
{code}

Output:

{code}
DataFrame[Id: string, Value: float]
{code}

The equality filter on the value 1.236 did not return the expected row [screenshot]. But when the value is given as a string, it worked [screenshot]. I tried again with a floating point number with one decimal place and it worked [screenshot]. And when the equals operation is changed to greater than or less than, it works with numbers having more than one decimal place [screenshot].

Is this a bug?

Regards, Meethu Mathew
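A plausible explanation, with a hedged sketch: the Python literal 1.236 is a double, so Spark widens the float column to double for the comparison, and the stored float is no longer exactly 1.236 after widening. Casting the literal down to float makes the equality match (frame and column names as defined above):

{code}
from pyspark.sql.functions import col, lit

# misses: the float column widened to double is not exactly the double 1.236
df_test.filter(col("Value") == 1.236).show()

# workaround sketch: compare float to float
df_test.filter(col("Value") == lit(1.236).cast("float")).show()
{code}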
[jira] [Created] (ZEPPELIN-3126) More than 2 notebooks in R failing with error sparkr interpreter not responding
Meethu Mathew created ZEPPELIN-3126:

Summary: More than 2 notebooks in R failing with error sparkr interpreter not responding
Key: ZEPPELIN-3126
URL: https://issues.apache.org/jira/browse/ZEPPELIN-3126
Project: Zeppelin
Issue Type: Bug
Components: r-interpreter
Affects Versions: 0.7.2
Environment: spark version 1.6.2
Reporter: Meethu Mathew
Priority: Critical

The Spark interpreter is in per-note scoped mode. Please find the steps below to reproduce the issue:

1. Create a notebook (Note1) and run any R code in a paragraph. I ran the following code.
{code}
%r
rdf <- data.frame(c(1,2,3,4))
colnames(rdf) <- c("myCol")
sdf <- createDataFrame(sqlContext, rdf)
withColumn(sdf, "newCol", sdf$myCol * 2.0)
{code}
2. Create another notebook (Note2) and run any R code in a paragraph. I ran the same code as above. Till now everything works fine.
3. Create a third notebook (Note3) and run any R code in a paragraph. I ran the same code. This notebook fails with the error
{code}
org.apache.zeppelin.interpreter.InterpreterException: sparkr is not responding
{code}
The problem is solved by restarting the sparkr interpreter, after which another 2 notebooks can be executed successfully. But again, for the third notebook run using the sparkr interpreter, the error is thrown. Once a notebook throws the error, all further notebooks throw the same error, and each time we run those failed notebooks a new R shell process is started. These processes are not killed even if we delete the failed notebook, i.e. it does not reuse the original R shell after a failure.
Re: More than 2 notebooks in R failing with error sparkr interpreter not responding
Hi Jeff, PFB the interpreter log.

{code}
INFO [2018-01-03 12:10:05,960] ({pool-2-thread-9} Logging.scala[logInfo]:58) - Starting HTTP Server
INFO [2018-01-03 12:10:05,961] ({pool-2-thread-9} Server.java[doStart]:272) - jetty-8.y.z-SNAPSHOT
INFO [2018-01-03 12:10:05,963] ({pool-2-thread-9} AbstractConnector.java[doStart]:338) - Started SocketConnector@0.0.0.0:58989
INFO [2018-01-03 12:10:05,963] ({pool-2-thread-9} Logging.scala[logInfo]:58) - Successfully started service 'HTTP class server' on port 58989.
INFO [2018-01-03 12:10:06,094] ({dispatcher-event-loop-1} Logging.scala[logInfo]:58) - Removed broadcast_1_piece0 on localhost:42453 in memory (size: 854.0 B, free: 511.1 MB)
INFO [2018-01-03 12:10:07,049] ({pool-2-thread-9} ZeppelinR.java[createRScript]:353) - File /tmp/zeppelin_sparkr-5046601627391341672.R created
ERROR [2018-01-03 12:10:17,051] ({pool-2-thread-9} Job.java[run]:188) - Job failed
org.apache.zeppelin.interpreter.InterpreterException: sparkr is not responding

R version 3.4.1 (2017-06-30) -- "Single Candle"
Copyright (C) 2017 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

> args <- commandArgs(trailingOnly = TRUE)
> hashCode <- as.integer(args[1])
> port <- as.integer(args[2])
> libPath <- args[3]
> version <- as.integer(args[4])
> rm(args)
> print(paste("Port ", toString(port)))
[1] "Port 58063"
> print(paste("LibPath ", libPath))
[1] "LibPath /home/meethu/spark-1.6.1-bin-hadoop2.6/R/lib"
> .libPaths(c(file.path(libPath), .libPaths()))
> library(SparkR)

Attaching package: 'SparkR'

The following objects are masked from 'package:stats':
    cov, filter, lag, na.omit, predict, sd, var
The following objects are masked from 'package:base':
    colnames, colnames<-, endsWith, intersect, rank, rbind, sample,
    startsWith, subset, summary, table, transform

> SparkR:::connectBackend("localhost", port, 6000)
A connection with
description "->localhost:58063"
class       "sockconn"
mode        "wb"
text        "binary"
opened      "opened"
can read    "yes"
can write   "yes"
> # scStartTime is needed by R/pkg/R/sparkR.R
> assign(".scStartTime", as.integer(Sys.time()), envir = SparkR:::.sparkREnv)
> # getZeppelinR
> .zeppelinR = SparkR:::callJStatic("org.apache.zeppelin.spark.ZeppelinR", "getZeppelinR", hashCode)

        at org.apache.zeppelin.spark.ZeppelinR.waitForRScriptInitialized(ZeppelinR.java:285)
        at org.apache.zeppelin.spark.ZeppelinR.request(ZeppelinR.java:227)
        at org.apache.zeppelin.spark.ZeppelinR.eval(ZeppelinR.java:176)
        at org.apache.zeppelin.spark.ZeppelinR.open(ZeppelinR.java:165)
        at org.apache.zeppelin.spark.SparkRInterpreter.open(SparkRInterpreter.java:90)
        at org.apache.zeppelin.interpreter.LazyOpenInterpreter.open(LazyOpenInterpreter.java:70)
        at org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer$InterpretJob.jobRun(RemoteInterpreterServer.java:491)
        at org.apache.zeppelin.scheduler.Job.run(Job.java:175)
        at org.apache.zeppelin.scheduler.FIFOScheduler$1.run(FIFOScheduler.java:139)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
INFO [2018-01-03 12:10:17,070] ({pool-2-thread-9} SchedulerFactory.java[jobFinished]:137) - Job remoteInterpretJob_1514961605951 finished by scheduler org.apache.zeppelin.spark.SparkRInterpreter392022746
INFO [2018-01-03 12:39:22,664] ({Spark Context Cleaner} Logging.scala[logInfo]:58) - Cleaned accumulator 2
{code}

PFB the output of the command ps -ef | grep /usr/lib/R/bin/exec/R:

{code}
meethu  6647  6470  0 12:09 pts/1  00:00:00 /usr/lib/R/bin/exec/R --no-save --no-restore -f /tmp/zeppelin_sparkr-1100854828050763213.R --args 214655664 58063 /home/meethu/spark-1.6.1-bin-hadoop2.6/R/lib 10601
meethu  6701  6470  0 12:09 pts/1  00:00:00 /usr/lib/R/bin/exec/R --no-save --no-restore -f /tmp/zeppelin_sparkr-4152305170353311178.R --args 1642312173 58063 /home/meethu/spark-1.6.1-bin-hadoop2.6/R/lib 10601
meethu  6745  6470  0 12:10 pts/1  00:00:00 /usr/lib/R/bin/exec/R --no-save --no-restore -f /tmp/zeppelin_sparkr-5046601627391341672.R --args 1158632477 58063 /home/meethu/spark-1.6.1-bin-hadoop2.6/R/lib 10601
{code}

Regards, Meethu Mathew

On Wed, Jan 3, 2018 at 12:56 PM, Jeff Zhang <zjf...@gmail.com> wrote:
> Could
More than 2 notebooks in R failing with error sparkr interpreter not responding
Hi, I have met with a strange issue when running R notebooks in Zeppelin (0.7.2). The Spark interpreter is in per-note scoped mode and the Spark version is 1.6.2. Please find the steps below to reproduce the issue:

1. Create a notebook (Note1) and run any R code in a paragraph. I ran the following code.
{code}
%r
rdf <- data.frame(c(1,2,3,4))
colnames(rdf) <- c("myCol")
sdf <- createDataFrame(sqlContext, rdf)
withColumn(sdf, "newCol", sdf$myCol * 2.0)
{code}
2. Create another notebook (Note2) and run any R code in a paragraph. I ran the same code as above. Till now everything works fine.
3. Create a third notebook (Note3) and run any R code in a paragraph. I ran the same code. This notebook fails with the error
{code}
org.apache.zeppelin.interpreter.InterpreterException: sparkr is not responding
{code}
What I understood from the analysis is that the process created for the sparkr interpreter is not getting killed properly, and this makes every third notebook throw an error while executing. The process is killed on restarting the sparkr interpreter, after which another 2 notebooks can be executed successfully. That is, for every third notebook run using the sparkr interpreter, the error is thrown. We suspect this is a limitation of Zeppelin. Please help to solve this issue.

Regards, Meethu Mathew
Re: Zeppelin framework is not getting unregistered from Mesos
Hi Moon, Yes, it's fixed in 0.7.1. Thank you.

Regards, Meethu Mathew

On Wed, Apr 26, 2017 at 10:42 PM, moon soo Lee <m...@apache.org> wrote:
> Some bugs related to interpreter process management have been fixed in
> the 0.7.1 release [1]. Could you try 0.7.1 or the master branch and see if the same
> problem occurs?
>
> Thanks,
> moon
>
> [1] https://issues.apache.org/jira/browse/ZEPPELIN-1832
>
> On Wed, Apr 26, 2017 at 1:13 AM Meethu Mathew <meethu.mat...@flytxt.com> wrote:
>
>> Hi,
>>
>> We have connected our Zeppelin to Mesos. But the issue we are facing is
>> that the Zeppelin framework is not getting unregistered from Mesos even if the
>> notebook is closed.
>>
>> Another problem: if the user logs out from Zeppelin, the SparkContext is
>> stopped. When the same user logs in again, another SparkContext is created,
>> and the previous SparkContext remains behind as a dead process.
>>
>> Is this a bug in Zeppelin, or is there any other proper way to unbind the
>> Zeppelin framework?
>>
>> Zeppelin version is 0.7.0
>>
>> Regards,
>> Meethu Mathew
Zeppelin framework is not getting unregistered from Mesos
Hi, We have connected our Zeppelin to Mesos. But the issue we are facing is that the Zeppelin framework is not getting unregistered from Mesos even if the notebook is closed. Another problem: if the user logs out from Zeppelin, the SparkContext is stopped. When the same user logs in again, another SparkContext is created, and the previous SparkContext remains behind as a dead process. Is this a bug in Zeppelin, or is there any other proper way to unbind the Zeppelin framework? Zeppelin version is 0.7.0.

Regards, Meethu Mathew
Re: UnicodeDecodeError in zeppelin 0.7.1
Hi, Thanks for the response.

@moon soo Lee: The interpreter setting is the same in 0.7.0 and 0.7.1.
@Felix Cheung: The Python version is the same. The code is as follows:

*PYSPARK*
{code}
def textPreProcessor(text):
    for w in text.split():
        regex = re.compile('[%s]' % re.escape(string.punctuation))
        no_punctuation = unicode(regex.sub(' ', w), 'utf8')
        tokens = word_tokenize(no_punctuation)
        lowercased = [t.lower() for t in tokens]
        no_stopwords = [w for w in lowercased if not w in stopwordsX]
        stemmed = [stemmerX.stem(w) for w in no_stopwords]
        return [w for w in stemmed if w]

docs = sc.textFile(hdfs_path+training_data, use_unicode=False).repartition(96)
docs.map(lambda features: sentimentObject.textPreProcessor(features.split(delimiter)[text_colum])).count()
{code}

*Error:*
- UnicodeDecodeError: 'utf8' codec can't decode byte 0x9b in position 17: invalid start byte
- The same error occurs when use_unicode=False is not used.
- The error changes to 'ascii' codec can't decode byte 0x97 in position 3: ordinal not in range(128) when no_punctuation = regex.sub(' ', w) is used instead of no_punctuation = unicode(regex.sub(' ', w), 'utf8').

Note: In version 0.7.0 the code was running fine without using use_unicode and unicode(regex.sub(' ', w), 'utf8').

*PYTHON*
{code}
def textPreProcessor(text_column):
    processed_text = []
    for text in text_column:
        for w in text.split():
            regex = re.compile('[%s]' % re.escape(string.punctuation))  # reg exprn for punctuation
            no_punctuation = unicode(regex.sub(' ', text_), 'utf8')
            tokens = word_tokenize(no_punctuation)
            lowercased = [t.lower() for t in tokens]
            no_stopwords = [w for w in lowercased if not w in stopwordsX]
            stemmed = [stemmerX.stem(w) for w in no_stopwords]
            processed_text.append([w for w in stemmed if w])
    return processed_text

new_training = pd.read_csv(training_data, header=None, delimiter=delimiter, error_bad_lines=False,
                           usecols=[label_column, text_column], names=['label','msg']).dropna()
new_training['processed_msg'] = textPreProcessor(new_training['msg'])
{code}

This Python code is working and I am getting results. In version 0.7.0, I was getting output without using the unicode function. Hope the problem is clear now.

Regards, Meethu Mathew

On Fri, Apr 21, 2017 at 3:07 AM, Felix Cheung <felixcheun...@hotmail.com> wrote:
> And are they running with the same Python version? What is the Python
> version?
>
> From: moon soo Lee <m...@apache.org>
> Sent: Thursday, April 20, 2017 11:53 AM
> Subject: Re: UnicodeDecodeError in zeppelin 0.7.1
> To: <users@zeppelin.apache.org>
>
> Hi,
>
> 0.7.1 didn't change any encoding type as far as I know.
> One difference is that the 0.7.1 official artifact has been built with JDK8 while
> 0.7.0 was built with JDK7 (we'll use JDK7 to build the upcoming 0.7.2 binary). But
> I'm not sure that can make pyspark and spark encoding types change.
>
> Do you have exactly the same interpreter setting in 0.7.1 and 0.7.0?
>
> Thanks,
> moon
>
> On Wed, Apr 19, 2017 at 5:30 AM Meethu Mathew <meethu.mat...@flytxt.com> wrote:
>
>> Hi,
>>
>> I just migrated from Zeppelin 0.7.0 to Zeppelin 0.7.1 and I am facing
>> this error while creating an RDD (in pyspark):
>>
>>   UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: invalid start byte
>>
>> I was able to create the RDD without any error after adding
>> use_unicode=False as follows:
>>
>>   sc.textFile("file.csv", use_unicode=False)
>>
>> But it fails when I try to stem the text. I am getting a similar error
>> when trying to apply stemming to the text using the python interpreter:
>>
>>   UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 4: ordinal not in range(128)
>>
>> All this code was working in the 0.7.0 version. There is no change in the
>> dataset and code. Is there any change in the encoding type in the new
>> version of Zeppelin?
>>
>> Regards,
>> Meethu Mathew
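For what it's worth, a common workaround in PySpark for dirty input, sketched under the assumption that the files are UTF-8 with occasional invalid bytes: read raw bytes with use_unicode=False, then decode leniently before tokenizing.

{code}
# hedged sketch (Python 2 / Spark 1.x): decode each raw line, replacing
# undecodable bytes instead of raising UnicodeDecodeError
raw = sc.textFile(hdfs_path + training_data, use_unicode=False)
docs = raw.map(lambda line: line.decode('utf-8', 'replace'))
docs.count()
{code}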
UnicodeDecodeError in zeppelin 0.7.1
Hi, I just migrated from Zeppelin 0.7.0 to Zeppelin 0.7.1 and I am facing this error while creating an RDD (in pyspark):

{code}
UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: invalid start byte
{code}

I was able to create the RDD without any error after adding use_unicode=False as follows:

{code}
sc.textFile("file.csv", use_unicode=False)
{code}

But it fails when I try to stem the text. I am getting a similar error when trying to apply stemming to the text using the python interpreter:

{code}
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 4: ordinal not in range(128)
{code}

All this code was working in the 0.7.0 version. There is no change in the dataset and code. Is there any change in the encoding type in the new version of Zeppelin?

Regards, Meethu Mathew
sqlContext not available as hiveContext in notebook
Hi, I am running Zeppelin 0.7.0. The sqlContext already created in the Zeppelin notebook returns a SQLContext, even though my Spark is built with Hive. "zeppelin.spark.useHiveContext" in the Spark properties is set to true. As mentioned in https://issues.apache.org/jira/browse/ZEPPELIN-1728, I tried hc = HiveContext.getOrCreate(sc), but it still returns a SQLContext. My pyspark shell and Jupyter notebook return a HiveContext without my doing anything. How can I get a HiveContext in the Zeppelin notebook?

Regards, Meethu Mathew
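For reference, a minimal diagnostic sketch, assuming Spark 1.x PySpark where HiveContext is the Hive-enabled entry point (sc is the SparkContext provided by the notebook):

{code}
from pyspark.sql import HiveContext

print(type(sqlContext))   # shows whether the injected context is SQLContext or HiveContext
hc = HiveContext(sc)      # constructs a HiveContext over the existing SparkContext
print(type(hc))
{code}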
Separate interpreter running scope Per user or Per Note documentation
Hi, I couldn't find the documentation for the feature "Separate interpreter running scope Per user or Per Note" at https://zeppelin.apache.org/docs/0.7.0/manual/interpreters.html#interpreter-binding-mode. Can somebody help me understand the per-note scoped mode and the per-user scoped mode?

Regards, Meethu Mathew
[jira] [Created] (ZEPPELIN-2313) Run-a-paragraph-synchronously response documented incorrectly
Meethu Mathew created ZEPPELIN-2313:

Summary: Run-a-paragraph-synchronously response documented incorrectly
Key: ZEPPELIN-2313
URL: https://issues.apache.org/jira/browse/ZEPPELIN-2313
Project: Zeppelin
Issue Type: Bug
Components: documentation
Affects Versions: 0.7.0
Reporter: Meethu Mathew

The documentation at https://zeppelin.apache.org/docs/0.7.0/rest-api/rest-notebook.html#run-a-paragraph-synchronously gives the sample error JSON as
{code}
{
  "status": "INTERNAL_SERVER_ERROR",
  "body": {
    "code": "ERROR",
    "type": "TEXT",
    "msg": "bash: -c: line 0: unexpected EOF while looking for matching ``'\nbash: -c: line 1: syntax error: unexpected end of file\nExitValue: 2"
  }
}
{code}
But the response actually comes back like
{code}
{
  "status": "OK",
  "body": {
    "code": "SUCCESS",
    "msg": [
      {
        "type": "TEXT",
        "data": "hello world"
      }
    ]
  }
}
{code}
[jira] [Created] (ZEPPELIN-2312) Allow to undo edits in a paragraph once it's executed and undo a deleted paragraph
Meethu Mathew created ZEPPELIN-2312:

Summary: Allow to undo edits in a paragraph once it's executed and undo a deleted paragraph
Key: ZEPPELIN-2312
URL: https://issues.apache.org/jira/browse/ZEPPELIN-2312
Project: Zeppelin
Issue Type: Improvement
Components: Core
Affects Versions: 0.7.0
Reporter: Meethu Mathew
Priority: Minor

It's not possible to undo edits in a paragraph once it has been executed, but it was possible in 0.6.0. There should also be an option to undo deleting a paragraph.
[jira] [Created] (ZEPPELIN-2305) overall experience on auto-completion needs to improve
Meethu Mathew created ZEPPELIN-2305:

Summary: overall experience on auto-completion needs to improve
Key: ZEPPELIN-2305
URL: https://issues.apache.org/jira/browse/ZEPPELIN-2305
Project: Zeppelin
Issue Type: Improvement
Components: Core
Affects Versions: 0.7.0
Reporter: Meethu Mathew

There is no auto-completion or suggestion for defined variable names, which is available in other frameworks. Also, Ctrl+. gives awkward suggestions for related functions; for example, the relevant functions for a Spark RDD or DataFrame are not available in the suggestion list. The overall experience of auto-completion is something that Zeppelin needs to improve.
Auto completion for defined variable names
Hi, Is there any way to get auto-completion or suggestions for defined variable names? In Jupyter notebooks, variables show up under suggestions once they are defined. Ctrl+. gives awkward suggestions for related functions; for a Spark data frame, it won't show the relevant functions. Please improve the suggestion functionality.

Regards, Meethu Mathew
--files in SPARK_SUBMIT_OPTIONS not working - ZEPPELIN-2136
Hi, According to the Zeppelin documentation, to pass a python package to the Zeppelin pyspark interpreter, you can export it through the --files option in SPARK_SUBMIT_OPTIONS in conf/zeppelin-env.sh. When I add a .egg file through the --files option in SPARK_SUBMIT_OPTIONS, the Zeppelin notebook does not throw an error, but I am not able to import the module inside the notebook. The Spark version is 1.6.2 and the zeppelin-env.sh (version 0.7.0) file looks like:

{code}
export SPARK_HOME=/home/me/spark-1.6.1-bin-hadoop2.6
export SPARK_SUBMIT_OPTIONS="--jars /home/me/spark-csv-1.5.0-s_2.10.jar,/home/me/commons-csv-1.4.jar --files /home/me/models/Churn/package/build/dist/fly_libs-1.1-py2.7.egg"
{code}

Any progress on ticket ZEPPELIN-2136 <https://issues.apache.org/jira/browse/ZEPPELIN-2136>?

Regards, Meethu Mathew
python prints "..." in the place of comments in output
Hi, The output of the following code prints unexpected dots in the result when there is a comment in the code. Is it a bug in Zeppelin?

*Code:*
{code}
%python
v = [1,2,3]
#comment 1
#comment
print v
{code}

*Output:*
{code}
...
...
[1, 2, 3]
{code}

Regards, Meethu Mathew
Re: "spark ui" button in spark interpreter does not show Spark web-ui
Hi, I have noticed the same problem.

Regards, Meethu Mathew

On Mon, Mar 13, 2017 at 9:56 AM, Xiaohui Liu <hero...@gmail.com> wrote:
> Hi,
>
> We used 0.7.1-snapshot with our Mesos cluster, and almost all the features we need
> (ldap login, notebook acl control, livy/pyspark/rspark/scala,
> etc.) work pretty well.
>
> But one thing that does not work for us is that the 'spark ui' button does not
> respond to user clicks. No errors on the browser side.
>
> Has anyone met similar issues? Any suggestions about where I should check?
>
> Regards
> Xiaohui
Adding images in the %md interpreter
Hi all, I am trying to display images in the %md interpreter of a Zeppelin (version 0.7.0) notebook using the following code.

{code}
![](model-files/sentiment_donut_viz.png)
{code}

But I am facing the following problems:
1. I am not able to give a local path.
2. I put the file inside {zeppelin_home}/webapps/webapp and it worked. But files or folders added in this folder, which is the ZEPPELIN_WAR_TEMPDIR, are deleted after a restart.

How can I add images in the markdown interpreter without using another web server?

Regards, Meethu Mathew
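One workaround that avoids a web server altogether, sketched here as an assumption rather than a documented recipe: read the image in a %python paragraph and emit it through Zeppelin's HTML display system as a base64 data URI (the file path below is a placeholder).

{code}
%python
import base64

# hypothetical local path to the image
with open('/home/me/models/sentiment_donut_viz.png', 'rb') as f:
    encoded = base64.b64encode(f.read())

# output beginning with %html is rendered by Zeppelin's HTML display system
print("%html <img src='data:image/png;base64," + encoded + "' width='400'/>")
{code}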
[jira] [Created] (ZEPPELIN-2141) sc.addPyFile("hdfs://path/to/file") in zeppelin causing UnknownHostException
Meethu Mathew created ZEPPELIN-2141:

Summary: sc.addPyFile("hdfs://path/to/file") in zeppelin causing UnknownHostException
Key: ZEPPELIN-2141
URL: https://issues.apache.org/jira/browse/ZEPPELIN-2141
Project: Zeppelin
Issue Type: Bug
Components: pySpark
Affects Versions: 0.6.0
Reporter: Meethu Mathew
Priority: Minor

In the documentation of sc.addPyFile() it is mentioned: "Add a .py or .zip dependency for all tasks to be executed on this SparkContext in the future. The path passed can be either a local file, a file in HDFS (or other Hadoop-supported filesystems), or an HTTP, HTTPS or FTP URI." But when I passed an HDFS path to the method in Zeppelin, it resulted in the following exception:

{code}
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, demo-node4.flytxt.com): java.lang.IllegalArgumentException: java.net.UnknownHostException: flycluster
{code}

The Spark version used is 1.6.2. The same command works fine in the pyspark shell, hence I think something is wrong with Zeppelin.
[jira] [Created] (ZEPPELIN-2136) --files in SPARK_SUBMIT_OPTIONS not working
Meethu Mathew created ZEPPELIN-2136:

Summary: --files in SPARK_SUBMIT_OPTIONS not working
Key: ZEPPELIN-2136
URL: https://issues.apache.org/jira/browse/ZEPPELIN-2136
Project: Zeppelin
Issue Type: Bug
Components: pySpark
Affects Versions: 0.6.0
Reporter: Meethu Mathew

According to the Zeppelin documentation, to pass a python package to the Zeppelin pyspark interpreter, you can export it through the --files option in SPARK_SUBMIT_OPTIONS in conf/zeppelin-env.sh. When I add a .egg file through the --files option in SPARK_SUBMIT_OPTIONS, the Zeppelin notebook does not throw an error, but I am not able to import the module inside the notebook. The Spark version is 1.6.2 and the zeppelin-env.sh file looks like:

{code}
export SPARK_HOME=/home/me/spark-1.6.1-bin-hadoop2.6
export SPARK_SUBMIT_OPTIONS="--jars /home/me/spark-csv-1.5.0-s_2.10.jar,/home/me/commons-csv-1.4.jar --files /home/me/models/Churn/package/build/dist/fly_libs-1.1-py2.7.egg"
{code}

My workaround for this problem was to add the .egg file using sc.addPyFile() inside the notebook, as sketched below.
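A minimal sketch of that workaround; the egg path comes from the report above, while the importable module name (fly_libs) is an assumption based on the file name:

{code}
# distribute the egg to executors and put it on the driver's sys.path
sc.addPyFile('/home/me/models/Churn/package/build/dist/fly_libs-1.1-py2.7.egg')

import fly_libs  # hypothetical top-level module inside the egg
{code}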
Re: Failed to run spark jobs on mesos due to "hadoop" not found.
Hi, Add HADOOP_HOME=/path/to/hadoop/folder in /etc/default/mesos-slave on all Mesos agents and restart Mesos.

Regards, Meethu Mathew

On Thu, Nov 10, 2016 at 4:57 PM, Yu Wei <yu20...@hotmail.com> wrote:
> Hi Guys,
>
> I failed to launch spark jobs on mesos. Actually I submitted the job to
> the cluster successfully, but the job failed to run.
>
> I1110 18:25:11.095507 301 fetcher.cpp:498] Fetcher Info: {"cache_directory":"\/tmp\/mesos\/fetch\/slaves\/1f8e621b-3cbf-4b86-a1c1-9e2cf77265ee-S7\/root","items":[{"action":"BYPASS_CACHE","uri":{"extract":true,"value":"hdfs:\/\/192.168.111.74:9090\/bigdata\/package\/spark-examples_2.11-2.0.1.jar"}}],"sandbox_directory":"\/var\/lib\/mesos\/agent\/slaves\/1f8e621b-3cbf-4b86-a1c1-9e2cf77265ee-S7\/frameworks\/1f8e621b-3cbf-4b86-a1c1-9e2cf77265ee-0002\/executors\/driver-20161110182510-0001\/runs\/b561328e-9110-4583-b740-98f9653e7fc2","user":"root"}
> I1110 18:25:11.099799 301 fetcher.cpp:409] Fetching URI 'hdfs://192.168.111.74:9090/bigdata/package/spark-examples_2.11-2.0.1.jar'
> I1110 18:25:11.099820 301 fetcher.cpp:250] Fetching directly into the sandbox directory
> I1110 18:25:11.099862 301 fetcher.cpp:187] Fetching URI 'hdfs://192.168.111.74:9090/bigdata/package/spark-examples_2.11-2.0.1.jar'
> E1110 18:25:11.101842 301 shell.hpp:106] Command 'hadoop version 2>&1' failed; this is the output:
> sh: hadoop: command not found
> Failed to fetch 'hdfs://192.168.111.74:9090/bigdata/package/spark-examples_2.11-2.0.1.jar': Failed to create HDFS client: Failed to execute 'hadoop version 2>&1'; the command was either not found or exited with a non-zero exit status: 127
> Failed to synchronize with agent (it's probably exited
>
> Actually I installed hadoop on each agent node.
>
> Any advice?
>
> Thanks,
> Jared, (韦煜)
> Software developer
> Interested in open source software, big data, Linux
[jira] [Created] (ZEPPELIN-1562) Wrong documentation in 'Run a paragraph synchronously' REST API
Meethu Mathew created ZEPPELIN-1562:

Summary: Wrong documentation in 'Run a paragraph synchronously' REST API
Key: ZEPPELIN-1562
URL: https://issues.apache.org/jira/browse/ZEPPELIN-1562
Project: Zeppelin
Issue Type: Bug
Components: documentation
Affects Versions: 0.7.0
Reporter: Meethu Mathew
Fix For: 0.7.0

The URL for running a paragraph synchronously using the REST API is given as "http://[zeppelin-server]:[zeppelin-port]/api/notebook/job/[notebookId]/[paragraphId]" in the documentation (https://zeppelin.apache.org/docs/0.7.0-SNAPSHOT/rest-api/rest-notebook.html#run-a-paragraph-synchronously). But when I searched the GitHub code, the URL is given as "run/notebookId/paragraphId".
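For illustration, a hedged sketch of calling the endpoint as it appears in the code; the host, port, note ID and paragraph ID are placeholders:

{code}
import urllib.request

# hypothetical note and paragraph IDs; substitute your own
url = "http://localhost:8080/api/notebook/run/2A94M5J1Z/20161017-123456_1234567890"
req = urllib.request.Request(url, data=b"", method="POST")  # the synchronous run is a POST
with urllib.request.urlopen(req) as resp:
    print(resp.read().decode("utf-8"))  # JSON response, cf. ZEPPELIN-2313 above
{code}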
[jira] [Commented] (SPARK-12755) Spark may attempt to rebuild application UI before finishing writing the event logs in possible race condition
[ https://issues.apache.org/jira/browse/SPARK-12755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15282452#comment-15282452 ] Meethu Mathew commented on SPARK-12755:

Hi, I am facing similar issues again in 1.6.1 standalone.
1. My completed applications are listed in the incomplete applications list. My application was completed using sc.stop(), and the log directory contains app folders without the .inprogress suffix. There are no permission issues on the log directory.
2. From the incomplete list, I can view the UI of only those apps which have a .inprogress suffix in the folder name in the log directory. For other apps it shows the error "Application app-2015x not found".
Please help me.

> Spark may attempt to rebuild application UI before finishing writing the event logs in possible race condition
>
> Key: SPARK-12755
> URL: https://issues.apache.org/jira/browse/SPARK-12755
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 1.5.2
> Reporter: Michael Allman
> Assignee: Michael Allman
> Priority: Minor
> Fix For: 1.5.3, 1.6.1, 2.0.0
>
> As reported in SPARK-6950, it appears that sometimes the standalone master attempts to build an application's historical UI before closing the app's event log. This is still an issue for us in 1.5.2+, and I believe I've found the underlying cause.
> When stopping a {{SparkContext}}, the {{stop}} method stops the DAG scheduler:
> https://github.com/apache/spark/blob/a76cf51ed91d99c88f301ec85f3cda1288bcf346/core/src/main/scala/org/apache/spark/SparkContext.scala#L1722-L1727
> and then stops the event logger:
> https://github.com/apache/spark/blob/a76cf51ed91d99c88f301ec85f3cda1288bcf346/core/src/main/scala/org/apache/spark/SparkContext.scala#L1722-L1727
> Though it is difficult to follow the chain of events, one of the sequelae of stopping the DAG scheduler is that the master's {{rebuildSparkUI}} method is called. This method looks for the application's event logs, and its behavior varies based on the existence of an {{.inprogress}} file suffix. In particular, a warning is logged if this suffix exists:
> https://github.com/apache/spark/blob/a76cf51ed91d99c88f301ec85f3cda1288bcf346/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L935
> After calling the {{stop}} method on the DAG scheduler, the {{SparkContext}} stops the event logger:
> https://github.com/apache/spark/blob/a76cf51ed91d99c88f301ec85f3cda1288bcf346/core/src/main/scala/org/apache/spark/SparkContext.scala#L1734-L1736
> This renames the event log, dropping the {{.inprogress}} file sequence.
> As such, a race condition exists where the master may attempt to process the application log file before finalizing it.
[jira] [Commented] (SPARK-11227) Spark1.5+ HDFS HA mode throw java.net.UnknownHostException: nameservice1
[ https://issues.apache.org/jira/browse/SPARK-11227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15266237#comment-15266237 ] Meethu Mathew commented on SPARK-11227:

I am also facing the same issue when HA is set up in Cloudera HDFS. I am using Spark 1.6.1 and an IPython notebook. When HA is disabled, everything is fine.

> Spark1.5+ HDFS HA mode throw java.net.UnknownHostException: nameservice1
>
> Key: SPARK-11227
> URL: https://issues.apache.org/jira/browse/SPARK-11227
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 1.5.0, 1.5.1
> Environment: OS: CentOS 6.6, Memory: 28G, CPU: 8, Mesos: 0.22.0, HDFS: Hadoop 2.6.0-CDH5.4.0 (built by Cloudera Manager)
> Reporter: Yuri Saito
>
> When running a jar including a Spark job on an HDFS HA cluster with Mesos and Spark 1.5.1, the job throws the exception "java.net.UnknownHostException: nameservice1" and fails.
> I do the below in a terminal:
> {code}
> /opt/spark/bin/spark-submit \
>   --class com.example.Job /jobs/job-assembly-1.0.0.jar
> {code}
> So, the job throws the below message:
> {code}
> 15/10/21 15:22:12 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, spark003.example.com): java.lang.IllegalArgumentException: java.net.UnknownHostException: nameservice1
>     at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:374)
>     at org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:312)
>     at org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:178)
>     at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:665)
>     at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:601)
>     at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:148)
>     at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2596)
>     at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
>     at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)
>     at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612)
>     at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
>     at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:169)
>     at org.apache.hadoop.mapred.JobConf.getWorkingDirectory(JobConf.java:656)
>     at org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:436)
>     at org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:409)
>     at org.apache.spark.SparkContext$$anonfun$hadoopFile$1$$anonfun$32.apply(SparkContext.scala:1016)
>     at org.apache.spark.SparkContext$$anonfun$hadoopFile$1$$anonfun$32.apply(SparkContext.scala:1016)
>     at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:176)
>     at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:176)
>     at scala.Option.map(Option.scala:145)
>     at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:176)
>     at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:220)
>     at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:216)
>     at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>     at org.apache.spark.scheduler.Task.run(Task.scala:88)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurren
[jira] [Commented] (SPARK-8402) Add DP means clustering to MLlib
[ https://issues.apache.org/jira/browse/SPARK-8402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15121332#comment-15121332 ] Meethu Mathew commented on SPARK-8402:

[~mengxr] [~josephkb] This ticket has been idle for a long time. Could you please comment on what we can do next?

> Add DP means clustering to MLlib
>
> Key: SPARK-8402
> URL: https://issues.apache.org/jira/browse/SPARK-8402
> Project: Spark
> Issue Type: New Feature
> Components: MLlib
> Reporter: Meethu Mathew
> Assignee: Meethu Mathew
> Labels: features
>
> At present, all the clustering algorithms in MLlib require the number of clusters to be specified in advance.
> The Dirichlet process (DP) is a popular non-parametric Bayesian mixture model that allows for flexible clustering of data without having to specify a priori the number of clusters.
> DP means is a non-parametric clustering algorithm that uses a scale parameter 'lambda' to control the creation of new clusters ["Revisiting k-means: New Algorithms via Bayesian Nonparametrics" by Brian Kulis, Michael I. Jordan].
> We have followed the distributed implementation of DP means which has been proposed in the paper titled "MLbase: Distributed Machine Learning Made Easy" by Xinghao Pan, Evan R. Sparks, Andre Wibisono.
> A benchmark comparison between k-means and DP-means, based on Normalized Mutual Information (NMI) between ground-truth clusters and algorithm outputs, is provided in the following table. It can be seen from the table that DP-means reported a higher NMI on 5 of 8 data sets in comparison to k-means [Source: Kulis, B., Jordan, M.I.: Revisiting k-means: New algorithms via Bayesian nonparametrics (2011) Arxiv:.0352. (Table 1)]
>
> | Dataset       | DP-means | k-means |
> | Wine          | .41      | .43     |
> | Iris          | .75      | .76     |
> | Pima          | .02      | .03     |
> | Soybean       | .72      | .66     |
> | Car           | .07      | .05     |
> | Balance Scale | .17      | .11     |
> | Breast Cancer | .04      | .03     |
> | Vehicle       | .18      | .18     |
>
> Experiment on our Spark cluster setup:
> An initial benchmark study was performed on a 3-node Spark cluster set up on Mesos, where each node's config was 8 cores and 64 GB RAM, and the Spark version used was 1.5 (git branch).
> Tests were done using a mixture of 10 Gaussians with varying numbers of features and instances. The results from the benchmark study are provided below. The reported stats are averages over 5 runs.
>
> | Instances   | Dimensions | No. of clusters obtained | DP-means time | DP-means iterations | k-means (k=10) time | k-means iterations |
> | 10 million  | 10   | 10 | 43.6s | 2 | 52.2s  | 2 |
> | 1 million   | 100  | 10 | 39.8s | 2 | 43.39s | 2 |
> | 0.1 million | 1000 | 10 | 37.3s | 2 | 41.64s | 2 |
[jira] [Updated] (SPARK-8402) Add DP means clustering to MLlib
[ https://issues.apache.org/jira/browse/SPARK-8402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Meethu Mathew updated SPARK-8402:

Summary: Add DP means clustering to MLlib (was: DP means clustering)
[jira] [Commented] (SPARK-6612) Python KMeans parity
[ https://issues.apache.org/jira/browse/SPARK-6612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15050263#comment-15050263 ] Meethu Mathew commented on SPARK-6612:

[~mengxr] This issue is resolved, but it seems the "Apache Spark" bot made a wrong comment here. Could you please check it out?

> Python KMeans parity
>
> Key: SPARK-6612
> URL: https://issues.apache.org/jira/browse/SPARK-6612
> Project: Spark
> Issue Type: Improvement
> Components: MLlib, PySpark
> Affects Versions: 1.3.0
> Reporter: Joseph K. Bradley
> Assignee: Hrishikesh
> Priority: Minor
> Fix For: 1.4.0
>
> This is a subtask of [SPARK-6258] for the Python API of KMeans. These items are missing:
> KMeans
> * setEpsilon
> * setInitializationSteps
> KMeansModel
> * computeCost
> * k
[jira] [Commented] (SPARK-2572) Can't delete local dir on executor automatically when running spark over Mesos.
[ https://issues.apache.org/jira/browse/SPARK-2572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15028657#comment-15028657 ] Meethu Mathew commented on SPARK-2572:

[~srowen] We are facing this issue with Mesos fine-grained mode in Spark 1.4.1. The /tmp/spark-* and some blockmgr-* files exist even after calling sc.stop(). Is there any other way to solve this issue?

> Can't delete local dir on executor automatically when running spark over Mesos.
>
> Key: SPARK-2572
> URL: https://issues.apache.org/jira/browse/SPARK-2572
> Project: Spark
> Issue Type: Bug
> Components: Mesos
> Affects Versions: 1.0.0
> Reporter: Yadong Qi
> Priority: Minor
>
> When running Spark over Mesos in "fine-grained" or "coarse-grained" mode, after the application finishes, the local dir (/tmp/spark-local-20140718114058-834c) on the executor is not deleted automatically.
How is the predict() working in LogisticRegressionModel?
Hi all, Can somebody point me to the implementation of predict() in LogisticRegressionModel of Spark MLlib? I could find a predictPoint() in the class LogisticRegressionModel, but where is predict()?

Thanks & Regards, Meethu M
Re: Please reply if you use Mesos fine grained mode
Hi, We are using Mesos fine-grained mode because we can have multiple instances of Spark sharing machines, with each application getting resources dynamically allocated.

Thanks & Regards, Meethu M

On Wednesday, 4 November 2015 5:24 AM, Reynold Xin wrote:
If you are using Spark with Mesos fine grained mode, can you please respond to this email explaining why you use it over the coarse grained mode? Thanks.
[jira] [Commented] (SPARK-6724) Model import/export for FPGrowth
[ https://issues.apache.org/jira/browse/SPARK-6724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14958374#comment-14958374 ] Meethu Mathew commented on SPARK-6724:

I am not able to take this PR forward. Can somebody take this?

> Model import/export for FPGrowth
>
> Key: SPARK-6724
> URL: https://issues.apache.org/jira/browse/SPARK-6724
> Project: Spark
> Issue Type: Sub-task
> Components: MLlib
> Affects Versions: 1.3.0
> Reporter: Joseph K. Bradley
> Priority: Minor
>
> Note: experimental model API
Spark 1.6 Release window is not updated in Spark-wiki
Hi, On https://cwiki.apache.org/confluence/display/SPARK/Wiki+Homepage the current release window has not been changed from 1.5. Can anybody give an idea of the expected dates for the 1.6 version?

Regards, Meethu Mathew
Senior Engineer
Flytxt
Re: Best way to merge final output part files created by Spark job
Try coalesce(1) before writing.

Thanks & Regards, Meethu M

On Tuesday, 15 September 2015 6:49 AM, java8964 wrote:

For a text file this merge works fine, but for binary formats like ORC, Parquet or Avro, I am not sure this will work. These formats are in fact not appendable, as they write the detailed data information either in the head or the tail part of the file. You have to use the format-specific API to merge the data.

Yong

Date: Mon, 14 Sep 2015 09:10:33 +0200
Subject: Re: Best way to merge final output part files created by Spark job
From: gmu...@stratio.com
To: umesh.ka...@gmail.com
CC: user@spark.apache.org

Hi, check out the FileUtil.copyMerge function in the Hadoop API. It's simple:
- Get the hadoop configuration from the Spark context: FileSystem fs = FileSystem.get(sparkContext.hadoopConfiguration());
- Create a new Path with the destination and source directory.
- Call copyMerge: FileUtil.copyMerge(fs, inputPath, fs, destPath, true, sparkContext.hadoopConfiguration(), null);

2015-09-13 23:25 GMT+02:00 unk1102:

Hi, I have a spark job which creates around 500 part files inside each directory I process, and I have thousands of such directories. I need to merge these small 500 part files. I am using spark.sql.shuffle.partitions as 500 and my final small files are ORC files. Is there a way to merge ORC files in Spark? If not, please suggest the best way to merge files created by a Spark job in HDFS. Thanks much.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Best-way-to-merge-final-output-part-files-created-by-Spark-job-tp24681.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

--
Gaspar Muñoz
@gmunozsoria
Stratio // @stratiobd
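Since the thread is specifically about ORC output, a hedged DataFrame-level sketch of the coalesce(1) suggestion (the path is a placeholder):

{code}
# write a single ORC part file by collapsing to one partition before the write;
# note this funnels the whole output through a single task
df.coalesce(1).write.format("orc").save("/path/to/output")
{code}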
[jira] [Commented] (SPARK-6724) Model import/export for FPGrowth
[ https://issues.apache.org/jira/browse/SPARK-6724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14740160#comment-14740160 ] Meethu Mathew commented on SPARK-6724:

[~josephkb] I will take a look into it and update the PR accordingly. Thank you.

> Model import/export for FPGrowth
>
> Key: SPARK-6724
> URL: https://issues.apache.org/jira/browse/SPARK-6724
> Project: Spark
> Issue Type: Sub-task
> Components: MLlib
> Affects Versions: 1.3.0
> Reporter: Joseph K. Bradley
> Priority: Minor
>
> Note: experimental model API
[jira] [Updated] (SPARK-8402) DP means clustering
[ https://issues.apache.org/jira/browse/SPARK-8402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Meethu Mathew updated SPARK-8402:

Description: (the full description quoted in the comment above, adding the NMI comparison and Spark cluster benchmark tables)

was:
At present, all the clustering algorithms in MLlib require the number of clusters to be specified in advance.
The Dirichlet process (DP) is a popular non-parametric Bayesian mixture model that allows for flexible clustering of data without having to specify a priori the number of clusters.
DP means is a non-parametric clustering algorithm that uses a scale parameter 'lambda' to control the creation of new clusters ["Revisiting k-means: New Algorithms via Bayesian Nonparametrics" by Brian Kulis, Michael I. Jordan].
We have followed the distributed implementation of DP means which has been proposed in the paper titled "MLbase: Distributed Machine Learning Made Easy" by Xinghao Pan, Evan R. Sparks, Andre Wibisono.
[jira] [Commented] (SPARK-6724) Model import/export for FPGrowth
[ https://issues.apache.org/jira/browse/SPARK-6724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14722999#comment-14722999 ] Meethu Mathew commented on SPARK-6724:

[~josephkb] Could you please give your opinion on this?

> Model import/export for FPGrowth
>
> Key: SPARK-6724
> URL: https://issues.apache.org/jira/browse/SPARK-6724
> Project: Spark
> Issue Type: Sub-task
> Components: MLlib
> Affects Versions: 1.3.0
> Reporter: Joseph K. Bradley
> Priority: Minor
>
> Note: experimental model API
Re: make-distribution.sh failing at spark/R/lib/sparkr.zip
Hi, It worked after removing that line. Thank you for the response and the fix.

Thanks & Regards, Meethu M

On Thursday, 13 August 2015 4:12 AM, Burak Yavuz brk...@gmail.com wrote:

For the record:
https://github.com/apache/spark/pull/8147
https://issues.apache.org/jira/browse/SPARK-9916

On Wed, Aug 12, 2015 at 3:08 PM, Burak Yavuz brk...@gmail.com wrote:

Are you running from master? Could you delete line 222 of make-distribution.sh? We updated how we build sparkr.zip. I'll submit a fix for it for 1.5 and master.

Burak

On Wed, Aug 12, 2015 at 3:31 AM, MEETHU MATHEW meethu2...@yahoo.co.in wrote:

Hi, I am trying to create a package using the make-distribution.sh script from the github master branch, but it is not completing successfully. The last statement printed is:

+ cp /home/meethu/git/FlytxtRnD/spark/R/lib/sparkr.zip /home/meethu/git/FlytxtRnD/spark/dist/R/lib
cp: cannot stat `/home/meethu/git/FlytxtRnD/spark/R/lib/sparkr.zip': No such file or directory

My build is successful and I am trying to execute the following command:

./make-distribution.sh --tgz -Pyarn -Dyarn.version=2.6.0 -Phadoop-2.4 -Dhadoop.version=2.6.0 -Phive

Please help.

Thanks & Regards, Meethu M
Re: Combining Spark Files with saveAsTextFile
Hi, Try using coalesce(1) before calling saveAsTextFile(). Thanks & Regards, Meethu M On Wednesday, 5 August 2015 7:53 AM, Brandon White bwwintheho...@gmail.com wrote: What is the best way to make saveAsTextFile save as only a single file?
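A minimal PySpark sketch of that suggestion (an existing SparkContext `sc` is assumed, and the output path is illustrative):
{code}
# saveAsTextFile() writes one part file per partition, so collapsing
# the RDD to a single partition yields a single output file.
rdd = sc.parallelize(range(1000), 8)   # 8 partitions -> 8 part files
rdd.coalesce(1).saveAsTextFile("/tmp/single-file-output")
{code}
The trade-off: coalesce(1) funnels the entire write through one task, sacrificing write parallelism for a single file.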
RE:Building scaladoc using build/sbt unidoc failure
Hi, I am getting the assertion error while trying to run build/sbt unidoc, the same as you described in the thread "Building scaladoc using build/sbt unidoc failure" ("Hello, I am trying to build scala doc from the 1.4 branch", archived on mail-archives.apache.org). Could you tell me how you got it working? Thanks & Regards, Meethu M
[jira] [Commented] (SPARK-8402) DP means clustering
[ https://issues.apache.org/jira/browse/SPARK-8402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14591197#comment-14591197 ] Meethu Mathew commented on SPARK-8402: -- Could you please assign the ticket to me? DP means clustering Key: SPARK-8402 URL: https://issues.apache.org/jira/browse/SPARK-8402 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Meethu Mathew Labels: features At present, all the clustering algorithms in MLlib require the number of clusters to be specified in advance. The Dirichlet process (DP) is a popular non-parametric Bayesian mixture model that allows for flexible clustering of data without having to specify apriori the number of clusters. DP means is a non-parametric clustering algorithm that uses a scale parameter 'lambda' to control the creation of new clusters[Revisiting k-means: New Algorithms via Bayesian Nonparametrics by Brian Kulis, Michael I. Jordan]. We have followed the distributed implementation of DP means which has been proposed in the paper titled MLbase: Distributed Machine Learning Made Easy by Xinghao Pan, Evan R. Sparks, Andre Wibisono. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[MLlib] Contributing algorithm for DP means clustering
Hi all, At present, all the clustering algorithms in MLlib require the number of clusters to be specified in advance. The Dirichlet process (DP) is a popular non-parametric Bayesian mixture model that allows for flexible clustering of data without having to specify the number of clusters a priori. DP means is a non-parametric clustering algorithm that uses a scale parameter 'lambda' to control the creation of new clusters. We have followed the distributed implementation of DP means which has been proposed in the paper titled "MLbase: Distributed Machine Learning Made Easy" by Xinghao Pan, Evan R. Sparks, Andre Wibisono. I have raised a JIRA ticket at https://issues.apache.org/jira/browse/SPARK-8402 Suggestions and guidance are welcome. Regards, Meethu Mathew Senior Engineer Flytxt www.flytxt.com | Visit our blog http://blog.flytxt.com/ | Follow us http://www.twitter.com/flytxt | Connect on LinkedIn http://www.linkedin.com/company/22166?goback=%2Efcs_GLHD_flytxt_false_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2trk=ncsrch_hits
[jira] [Created] (SPARK-8402) DP means clustering
Meethu Mathew created SPARK-8402: Summary: DP means clustering Key: SPARK-8402 URL: https://issues.apache.org/jira/browse/SPARK-8402 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Meethu Mathew At present, all the clustering algorithms in MLlib require the number of clusters to be specified in advance. The Dirichlet process (DP) is a popular non-parametric Bayesian mixture model that allows for flexible clustering of data without having to specify apriori the number of clusters. DP means is a non-parametric clustering algorithm that uses a scale parameter 'lambda' to control the creation of new clusters[Revisiting k-means: New Algorithms via Bayesian Nonparametrics by Brian Kulis, Michael I. Jordan]. We have followed the distributed implementation of DP means which has been proposed in the paper titled MLbase: Distributed Machine Learning Made Easy by Xinghao Pan, Evan R. Sparks, Andre Wibisono. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8402) DP means clustering
[ https://issues.apache.org/jira/browse/SPARK-8402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14589392#comment-14589392 ] Meethu Mathew commented on SPARK-8402: -- Could anyone please assign this ticket to me ? DP means clustering Key: SPARK-8402 URL: https://issues.apache.org/jira/browse/SPARK-8402 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Meethu Mathew Labels: features At present, all the clustering algorithms in MLlib require the number of clusters to be specified in advance. The Dirichlet process (DP) is a popular non-parametric Bayesian mixture model that allows for flexible clustering of data without having to specify apriori the number of clusters. DP means is a non-parametric clustering algorithm that uses a scale parameter 'lambda' to control the creation of new clusters[Revisiting k-means: New Algorithms via Bayesian Nonparametrics by Brian Kulis, Michael I. Jordan]. We have followed the distributed implementation of DP means which has been proposed in the paper titled MLbase: Distributed Machine Learning Made Easy by Xinghao Pan, Evan R. Sparks, Andre Wibisono. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8018) KMeans should accept initial cluster centers as param
[ https://issues.apache.org/jira/browse/SPARK-8018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14578751#comment-14578751 ] Meethu Mathew commented on SPARK-8018: -- Should I add a new test for this in the test suite, or can I add it along with another test (like model save/load)? KMeans should accept initial cluster centers as param - Key: SPARK-8018 URL: https://issues.apache.org/jira/browse/SPARK-8018 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Joseph K. Bradley Assignee: Meethu Mathew KMeans should allow model initialization using an existing set of cluster centers. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
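To make the proposal concrete, a warm-start call in PySpark might look like the sketch below; `initialModel` is the parameter under discussion and did not exist in the API at the time, so treat both the name and the signature as hypothetical (an existing SparkContext `sc` is assumed).
{code}
from pyspark.mllib.clustering import KMeans

data = sc.parallelize([[0.0, 0.0], [1.0, 1.0], [9.0, 8.0], [8.0, 9.0]])
first = KMeans.train(data, 2, maxIterations=5)
# Hypothetical: resume training from the centers of an existing model
refined = KMeans.train(data, 2, maxIterations=5, initialModel=first)
{code}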
Re: Anyone facing problem in incremental building of individual project
Hi, I added createDependencyReducedPom in my pom.xml and the problem is solved. <!-- Work around MSHADE-148 --> + <createDependencyReducedPom>false</createDependencyReducedPom> Thank you @Steve and @Ted Regards, Meethu Mathew Senior Engineer Flytxt On Thu, Jun 4, 2015 at 9:51 PM, Ted Yu yuzhih...@gmail.com wrote: Andrew Or put in this workaround : diff --git a/pom.xml b/pom.xml index 0b1aaad..d03d33b 100644 --- a/pom.xml +++ b/pom.xml @@ -1438,6 +1438,8 @@ <version>2.3</version> <configuration> <shadedArtifactAttached>false</shadedArtifactAttached> + <!-- Work around MSHADE-148 --> + <createDependencyReducedPom>false</createDependencyReducedPom> <artifactSet> <includes> <!-- At a minimum we must include this to force effective pom generation --> FYI On Thu, Jun 4, 2015 at 6:25 AM, Steve Loughran ste...@hortonworks.com wrote: On 4 Jun 2015, at 11:16, Meethu Mathew meethu.mat...@flytxt.com wrote: Hi all, I added some new code to MLlib. When I am trying to build only the mllib project using mvn --projects mllib/ -DskipTests clean install after setting export SPARK_PREPEND_CLASSES=true, the build is getting stuck with the following message. Excluding org.jpmml:pmml-schema:jar:1.1.15 from the shaded jar. [INFO] Excluding com.sun.xml.bind:jaxb-impl:jar:2.2.7 from the shaded jar. [INFO] Excluding com.sun.xml.bind:jaxb-core:jar:2.2.7 from the shaded jar. [INFO] Excluding javax.xml.bind:jaxb-api:jar:2.2.7 from the shaded jar. [INFO] Including org.spark-project.spark:unused:jar:1.0.0 in the shaded jar. [INFO] Excluding org.scala-lang:scala-reflect:jar:2.10.4 from the shaded jar. [INFO] Replacing original artifact with shaded artifact. [INFO] Replacing /home/meethu/git/FlytxtRnD/spark/mllib/target/spark-mllib_2.10-1.4.0-SNAPSHOT.jar with /home/meethu/git/FlytxtRnD/spark/mllib/target/spark-mllib_2.10-1.4.0-SNAPSHOT-shaded.jar [INFO] Dependency-reduced POM written at: /home/meethu/git/FlytxtRnD/spark/mllib/dependency-reduced-pom.xml [INFO] Dependency-reduced POM written at: /home/meethu/git/FlytxtRnD/spark/mllib/dependency-reduced-pom.xml [INFO] Dependency-reduced POM written at: /home/meethu/git/FlytxtRnD/spark/mllib/dependency-reduced-pom.xml [INFO] Dependency-reduced POM written at: /home/meethu/git/FlytxtRnD/spark/mllib/dependency-reduced-pom.xml . I've seen something similar in a different build. It looks like MSHADE-148: https://issues.apache.org/jira/browse/MSHADE-148 if you apply Tom White's patch, does your problem go away?
[jira] [Commented] (SPARK-8018) KMeans should accept initial cluster centers as param
[ https://issues.apache.org/jira/browse/SPARK-8018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14572489#comment-14572489 ] Meethu Mathew commented on SPARK-8018: -- [~josephkb][~mengxr] Thank you for the comments. In the method suggested by Xiangrui, do we need to get the value of k as a parameter and then compare it with the value of model.k as in GMM? KMeans should accept initial cluster centers as param - Key: SPARK-8018 URL: https://issues.apache.org/jira/browse/SPARK-8018 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Joseph K. Bradley KMeans should allow model initialization using an existing set of cluster centers. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
Anyone facing problem in incremental building of individual project
Hi all, I added some new code to MLlib. When I am trying to build only the mllib project using mvn --projects mllib/ -DskipTests clean install after setting export SPARK_PREPEND_CLASSES=true, the build is getting stuck with the following message. Excluding org.jpmml:pmml-schema:jar:1.1.15 from the shaded jar. [INFO] Excluding com.sun.xml.bind:jaxb-impl:jar:2.2.7 from the shaded jar. [INFO] Excluding com.sun.xml.bind:jaxb-core:jar:2.2.7 from the shaded jar. [INFO] Excluding javax.xml.bind:jaxb-api:jar:2.2.7 from the shaded jar. [INFO] Including org.spark-project.spark:unused:jar:1.0.0 in the shaded jar. [INFO] Excluding org.scala-lang:scala-reflect:jar:2.10.4 from the shaded jar. [INFO] Replacing original artifact with shaded artifact. [INFO] Replacing /home/meethu/git/FlytxtRnD/spark/mllib/target/spark-mllib_2.10-1.4.0-SNAPSHOT.jar with /home/meethu/git/FlytxtRnD/spark/mllib/target/spark-mllib_2.10-1.4.0-SNAPSHOT-shaded.jar [INFO] Dependency-reduced POM written at: /home/meethu/git/FlytxtRnD/spark/mllib/dependency-reduced-pom.xml [INFO] Dependency-reduced POM written at: /home/meethu/git/FlytxtRnD/spark/mllib/dependency-reduced-pom.xml [INFO] Dependency-reduced POM written at: /home/meethu/git/FlytxtRnD/spark/mllib/dependency-reduced-pom.xml [INFO] Dependency-reduced POM written at: /home/meethu/git/FlytxtRnD/spark/mllib/dependency-reduced-pom.xml . But a full build completes as usual. Please help if anyone is facing the same issue. Regards, Meethu Mathew Senior Engineer Flytxt
[jira] [Comment Edited] (SPARK-8018) KMeans should accept initial cluster centers as param
[ https://issues.apache.org/jira/browse/SPARK-8018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14572489#comment-14572489 ] Meethu Mathew edited comment on SPARK-8018 at 6/4/15 10:11 AM: --- [~josephkb][~mengxr] Thank you for the comments. In the method suggested by Xiangrui, do we need to get the value of k as a parameter and then compare it with the value of model.k as in GMM? I am interested to work on this ticket. Please assign it to me was (Author: meethumathew): [~josephkb][~mengxr] Thank you for the comments. In the method suggested by Xiangrui, do we need to get the value of k as a parameter and then compare it with the value of model.k as in GMM? KMeans should accept initial cluster centers as param - Key: SPARK-8018 URL: https://issues.apache.org/jira/browse/SPARK-8018 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Joseph K. Bradley KMeans should allow model initialization using an existing set of cluster centers. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
Re: How to create fewer output files for Spark job ?
Try using coalesce. Thanks & Regards, Meethu M On Wednesday, 3 June 2015 11:26 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote: I am running a series of spark functions with 9000 executors and it is resulting in 9000+ files, which is exceeding the namespace file count quota. How can Spark be configured to use CombinedOutputFormat?
{code}
protected def writeOutputRecords(detailRecords: RDD[(AvroKey[DetailOutputRecord], NullWritable)], outputDir: String) {
  val writeJob = new Job()
  val schema = SchemaUtil.outputSchema(_detail)
  AvroJob.setOutputKeySchema(writeJob, schema)
  detailRecords.saveAsNewAPIHadoopFile(outputDir,
    classOf[AvroKey[GenericRecord]],
    classOf[org.apache.hadoop.io.NullWritable],
    classOf[AvroKeyOutputFormat[GenericRecord]],
    writeJob.getConfiguration)
}
{code} -- Deepak
[jira] [Commented] (SPARK-8018) KMeans should accept initial cluster centers as param
[ https://issues.apache.org/jira/browse/SPARK-8018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14570620#comment-14570620 ] Meethu Mathew commented on SPARK-8018: -- [~josephkb] For initialization using an existing set of cluster centers , do we need to supply centers for only 1 run ? or should we supply initial centers for multiple runs ? KMeans should accept initial cluster centers as param - Key: SPARK-8018 URL: https://issues.apache.org/jira/browse/SPARK-8018 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Joseph K. Bradley KMeans should allow model initialization using an existing set of cluster centers. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
Regarding Connecting spark to Mesos documentation
Hi List, In the documentation of Connecting Spark to Mesos http://spark.apache.org/docs/latest/running-on-mesos.html#connecting-spark-to-mesos, is it possible to expand the step "Create a binary package using make-distribution.sh --tgz" in more detail? When we use a custom compiled version of Spark, we mostly specify a hadoop version (which is not the default one). In this case, make-distribution.sh should be supplied the same maven options we used for building Spark. This is not specified in the documentation. Please correct me if I am wrong. Regards, Meethu Mathew
Re: How to run multiple jobs in one sparkcontext from separate threads in pyspark?
Hi Davies, Thank you for pointing to spark streaming. I am confused about how to return the result after running a function via a thread. I tried using Queue to add the results to it and print it at the end, but then I can see the results only after all threads have finished. How can I get the result of a function once its thread is finished, rather than waiting for all other threads to finish? Thanks & Regards, Meethu M On Tuesday, 19 May 2015 2:43 AM, Davies Liu dav...@databricks.com wrote: SparkContext can be used in multiple threads (Spark streaming works with multiple threads), for example:
import threading
import time

def show(x):
    time.sleep(1)
    print x

def job():
    sc.parallelize(range(100)).foreach(show)

threading.Thread(target=job).start()
On Mon, May 18, 2015 at 12:34 AM, ayan guha guha.a...@gmail.com wrote: Hi So to be clear, do you want to run one operation in multiple threads within a function, or do you want to run multiple jobs using multiple threads? I am wondering why the python thread module can't be used? Or have you already given it a try? On 18 May 2015 16:39, MEETHU MATHEW meethu2...@yahoo.co.in wrote: Hi Akhil, The python wrapper for Spark Job Server did not help me. I actually need the pyspark code sample which shows how I can call a function from 2 threads and execute it simultaneously. Thanks & Regards, Meethu M On Thursday, 14 May 2015 12:38 PM, Akhil Das ak...@sigmoidanalytics.com wrote: Did you happen to have a look at the spark job server? Someone wrote a python wrapper around it, give it a try. Thanks Best Regards On Thu, May 14, 2015 at 11:10 AM, MEETHU MATHEW meethu2...@yahoo.co.in wrote: Hi all, Quote "Inside a given Spark application (SparkContext instance), multiple parallel jobs can run simultaneously if they were submitted from separate threads." How to run multiple jobs in one SPARKCONTEXT using separate threads in pyspark? I found some examples in scala and java, but couldn't find python code. Can anyone help me with a pyspark example? Thanks & Regards, Meethu M - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
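One way to consume each job's result as soon as its thread finishes, rather than joining all threads first, is concurrent.futures (standard in Python 3; available for Python 2 via the `futures` backport). A sketch, assuming an existing SparkContext `sc`:
{code}
from concurrent.futures import ThreadPoolExecutor, as_completed

def job(n):
    # each call is an independent Spark action on the shared SparkContext
    return sc.parallelize(range(n)).map(lambda x: x * x).sum()

with ThreadPoolExecutor(max_workers=3) as pool:
    futures = [pool.submit(job, n) for n in (10, 100, 1000)]
    for f in as_completed(futures):  # yields each future the moment it completes
        print(f.result())
{code}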
Re: How to run multiple jobs in one sparkcontext from separate threads in pyspark?
Hi Akhil, The python wrapper for Spark Job Server did not help me. I actually need the pyspark code sample which shows how I can call a function from 2 threads and execute it simultaneously. Thanks & Regards, Meethu M On Thursday, 14 May 2015 12:38 PM, Akhil Das ak...@sigmoidanalytics.com wrote: Did you happen to have a look at the spark job server? Someone wrote a python wrapper around it, give it a try. Thanks, Best Regards On Thu, May 14, 2015 at 11:10 AM, MEETHU MATHEW meethu2...@yahoo.co.in wrote: Hi all, Quote "Inside a given Spark application (SparkContext instance), multiple parallel jobs can run simultaneously if they were submitted from separate threads." How to run multiple jobs in one SPARKCONTEXT using separate threads in pyspark? I found some examples in scala and java, but couldn't find python code. Can anyone help me with a pyspark example? Thanks & Regards, Meethu M
Re: Restricting the number of iterations in Mllib Kmeans
Hi, I think you can't supply an initial set of centroids to kmeans. Thanks & Regards, Meethu M On Friday, 15 May 2015 12:37 AM, Suman Somasundar suman.somasun...@oracle.com wrote: Hi, I want to run a definite number of iterations in Kmeans. There is a command line argument to set maxIterations, but even if I set it to a number, Kmeans runs until the centroids converge. Is there a specific way to specify it in the command line? Also, I wanted to know if we can supply the initial set of centroids to the program instead of it choosing the centroids at random? Thanks, Suman.
[jira] [Commented] (SPARK-7651) PySpark GMM predict, predictSoft should fail on bad input
[ https://issues.apache.org/jira/browse/SPARK-7651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14544910#comment-14544910 ] Meethu Mathew commented on SPARK-7651: -- [~josephkb] Yes, I will fix it asap. PySpark GMM predict, predictSoft should fail on bad input - Key: SPARK-7651 URL: https://issues.apache.org/jira/browse/SPARK-7651 Project: Spark Issue Type: Bug Components: MLlib, PySpark Affects Versions: 1.3.0, 1.3.1, 1.4.0 Reporter: Joseph K. Bradley Priority: Minor In PySpark, GaussianMixtureModel predict and predictSoft test if the argument is an RDD and operate correctly if so. But if the argument is not an RDD, they fail silently, returning nothing. [https://github.com/apache/spark/blob/11a1a135d1fe892cd48a9116acc7554846aed84c/python/pyspark/mllib/clustering.py#L176] Instead, they should raise errors. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
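For illustration, one way the check could look in clustering.py (a sketch of the fix pattern, not necessarily the exact patch that was merged):
{code}
from pyspark.rdd import RDD

class GaussianMixtureModel(object):
    # ...weights and gaussians fields elided...
    def predict(self, x):
        if isinstance(x, RDD):
            # label each point with the component of highest membership
            return self.predictSoft(x).map(lambda z: z.index(max(z)))
        # fail loudly on non-RDD input instead of silently returning None
        raise TypeError("x should be represented by an RDD, but got %s" % type(x))
{code}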
[jira] [Commented] (SPARK-7651) PySpark GMM predict, predictSoft should fail on bad input
[ https://issues.apache.org/jira/browse/SPARK-7651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14544924#comment-14544924 ] Meethu Mathew commented on SPARK-7651: -- Could you please tell me where I should make the changes? In master branch or 1.3.0? PySpark GMM predict, predictSoft should fail on bad input - Key: SPARK-7651 URL: https://issues.apache.org/jira/browse/SPARK-7651 Project: Spark Issue Type: Bug Components: MLlib, PySpark Affects Versions: 1.3.0, 1.3.1, 1.4.0 Reporter: Joseph K. Bradley Priority: Minor In PySpark, GaussianMixtureModel predict and predictSoft test if the argument is an RDD and operate correctly if so. But if the argument is not an RDD, they fail silently, returning nothing. [https://github.com/apache/spark/blob/11a1a135d1fe892cd48a9116acc7554846aed84c/python/pyspark/mllib/clustering.py#L176] Instead, they should raise errors. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7651) PySpark GMM predict, predictSoft should fail on bad input
[ https://issues.apache.org/jira/browse/SPARK-7651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14544929#comment-14544929 ] Meethu Mathew commented on SPARK-7651: -- Ok thank you PySpark GMM predict, predictSoft should fail on bad input - Key: SPARK-7651 URL: https://issues.apache.org/jira/browse/SPARK-7651 Project: Spark Issue Type: Bug Components: MLlib, PySpark Affects Versions: 1.3.0, 1.3.1, 1.4.0 Reporter: Joseph K. Bradley Priority: Minor In PySpark, GaussianMixtureModel predict and predictSoft test if the argument is an RDD and operate correctly if so. But if the argument is not an RDD, they fail silently, returning nothing. [https://github.com/apache/spark/blob/11a1a135d1fe892cd48a9116acc7554846aed84c/python/pyspark/mllib/clustering.py#L176] Instead, they should raise errors. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
How to run multiple jobs in one sparkcontext from separate threads in pyspark?
Hi all, Quote: "Inside a given Spark application (SparkContext instance), multiple parallel jobs can run simultaneously if they were submitted from separate threads." How to run multiple jobs in one SPARKCONTEXT using separate threads in pyspark? I found some examples in scala and java, but couldn't find python code. Can anyone help me with a pyspark example? Thanks & Regards, Meethu M
Re: Speeding up Spark build during development
Hi, Is it really necessary to run mvn --projects assembly/ -DskipTests install? Could you please explain why this is needed? I got the changes after running mvn --projects streaming/ -DskipTests package. Regards, Meethu On Monday 04 May 2015 02:20 PM, Emre Sevinc wrote: Just to give you an example: When I was trying to make a small change only to the Streaming component of Spark, first I built and installed the whole Spark project (this took about 15 minutes on my 4-core, 4 GB RAM laptop). Then, after having changed files only in Streaming, I ran something like (in the top-level directory): mvn --projects streaming/ -DskipTests package and then mvn --projects assembly/ -DskipTests install This was much faster than trying to build the whole Spark from scratch, because Maven was only building one component, in my case the Streaming component, of Spark. I think you can use a very similar approach. -- Emre Sevinç On Mon, May 4, 2015 at 10:44 AM, Pramod Biligiri pramodbilig...@gmail.com wrote: No, I just need to build one project at a time. Right now SparkSql. Pramod On Mon, May 4, 2015 at 12:09 AM, Emre Sevinc emre.sev...@gmail.com wrote: Hello Pramod, Do you need to build the whole project every time? Generally you don't, e.g., when I was changing some files that belong only to Spark Streaming, I was building only the streaming (of course after having built and installed the whole project, but that was done only once), and then the assembly. This was much faster than trying to build the whole Spark every time. -- Emre Sevinç On Mon, May 4, 2015 at 9:01 AM, Pramod Biligiri pramodbilig...@gmail.com wrote: Using the inbuilt maven and zinc it takes around 10 minutes for each build. Is that reasonable? My maven opts looks like this: $ echo $MAVEN_OPTS -Xmx12000m -XX:MaxPermSize=2048m I'm running it as build/mvn -DskipTests package Should I be tweaking my Zinc/Nailgun config? Pramod On Sun, May 3, 2015 at 3:40 PM, Mark Hamstra m...@clearstorydata.com wrote: https://spark.apache.org/docs/latest/building-spark.html#building-with-buildmvn On Sun, May 3, 2015 at 2:54 PM, Pramod Biligiri pramodbilig...@gmail.com wrote: This is great. I didn't know about the mvn script in the build directory. Pramod On Fri, May 1, 2015 at 9:51 AM, York, Brennon brennon.y...@capitalone.com wrote: Following what Ted said, if you leverage the `mvn` from within the `build/` directory of Spark you'll get zinc for free which should help speed up build times. On 5/1/15, 9:45 AM, Ted Yu yuzhih...@gmail.com wrote: Pramod: Please remember to run Zinc so that the build is faster. Cheers On Fri, May 1, 2015 at 9:36 AM, Ulanov, Alexander alexander.ula...@hp.com wrote: Hi Pramod, For cluster-like tests you might want to use the same code as in mllib's LocalClusterSparkContext. You can rebuild only the package that you change and then run this main class. Best regards, Alexander -----Original Message----- From: Pramod Biligiri [mailto:pramodbilig...@gmail.com] Sent: Friday, May 01, 2015 1:46 AM To: dev@spark.apache.org Subject: Speeding up Spark build during development Hi, I'm making some small changes to the Spark codebase and trying it out on a cluster. I was wondering if there's a faster way to build than running the package target each time. Currently I'm using: mvn -DskipTests package All the nodes have the same filesystem mounted at the same mount point. Pramod
-- Emre Sevinç
Spark-1.3.0 UI shows 0 cores in completed applications tab
Hi all, I started spark-shell in spark-1.3.0 and did some actions. The UI was showing 8 cores under the running applications tab. But when I exited the spark-shell using exit, the application moved to the completed applications tab and the number of cores shown was 0. When I instead exited the spark-shell using sc.stop(), it correctly showed 8 cores under the completed applications tab. Why is it showing 0 cores when I didn't use sc.stop()? Does anyone else face this issue? Thanks & Regards, Meethu M
[jira] [Commented] (SPARK-6485) Add CoordinateMatrix/RowMatrix/IndexedRowMatrix in PySpark
[ https://issues.apache.org/jira/browse/SPARK-6485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14379333#comment-14379333 ] Meethu Mathew commented on SPARK-6485: -- As you had mentioned here https://issues.apache.org/jira/browse/SPARK-6100, MatrixUDT has been merged. But MatrixUDT for PySpark seems to be under progress. Does https://issues.apache.org/jira/browse/SPARK-6390 block this task? Add CoordinateMatrix/RowMatrix/IndexedRowMatrix in PySpark -- Key: SPARK-6485 URL: https://issues.apache.org/jira/browse/SPARK-6485 Project: Spark Issue Type: Sub-task Components: MLlib, PySpark Reporter: Xiangrui Meng We should add APIs for CoordinateMatrix/RowMatrix/IndexedRowMatrix in PySpark. Internally, we can use DataFrames for serialization. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6227) PCA and SVD for PySpark
[ https://issues.apache.org/jira/browse/SPARK-6227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14358529#comment-14358529 ] Meethu Mathew commented on SPARK-6227: -- [~mengxr] Please give your inputs on the same. PCA and SVD for PySpark --- Key: SPARK-6227 URL: https://issues.apache.org/jira/browse/SPARK-6227 Project: Spark Issue Type: Sub-task Components: MLlib, PySpark Affects Versions: 1.2.1 Reporter: Julien Amelot The Dimensionality Reduction techniques are not available via Python (Scala + Java only). * Principal component analysis (PCA) * Singular value decomposition (SVD) Doc: http://spark.apache.org/docs/1.2.1/mllib-dimensionality-reduction.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6227) PCA and SVD for PySpark
[ https://issues.apache.org/jira/browse/SPARK-6227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14356428#comment-14356428 ] Meethu Mathew commented on SPARK-6227: -- I am interested in working on this ticket. Could anyone assign it to me? PCA and SVD for PySpark --- Key: SPARK-6227 URL: https://issues.apache.org/jira/browse/SPARK-6227 Project: Spark Issue Type: Improvement Components: MLlib, PySpark Affects Versions: 1.2.1 Reporter: Julien Amelot Priority: Minor The Dimensionality Reduction techniques are not available via Python (Scala + Java only). * Principal component analysis (PCA) * Singular value decomposition (SVD) Doc: http://spark.apache.org/docs/1.2.1/mllib-dimensionality-reduction.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
How to build Spark and run examples using Intellij ?
Hi, I am trying to run examples of spark (master branch from git) from Intellij (14.0.2) but facing errors. These are the steps I followed:
1. git clone the master branch of apache spark.
2. Build it using mvn -DskipTests clean install
3. In Intellij select Import Projects and choose the POM.xml of the spark root folder (Auto Import enabled)
4. Then I tried to run the SparkPi program, but got the following errors:
Information: 9/3/15 3:46 PM - Compilation completed with 44 errors and 0 warnings in 5 sec
/usr/local/spark-1.3.0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/dsl/package.scala
Error:(314, 109) polymorphic expression cannot be instantiated to expected type; found : [T(in method apply)]org.apache.spark.sql.catalyst.dsl.ScalaUdfBuilder[T(in method apply)] required: org.apache.spark.sql.catalyst.dsl.package.ScalaUdfBuilder[T(in method functionToUdfBuilder)] implicit def functionToUdfBuilder[T: TypeTag](func: Function1[_, T]): ScalaUdfBuilder[T] = ScalaUdfBuilder(func)
I am able to run examples of this built version of spark from the terminal using the ./bin/run-example script. Could someone please help me with this issue? Thanks & Regards, Meethu M
How to read from hdfs using spark-shell in Intel hadoop?
Hi, I am not able to read from HDFS (Intel distribution hadoop, Hadoop version is 1.0.3) from spark-shell (spark version is 1.2.1). I built spark using the command mvn -Dhadoop.version=1.0.3 clean package, started spark-shell, and read a HDFS file using sc.textFile(); the exception is WARN hdfs.DFSClient: Failed to connect to /10.88.6.133:50010, add to deadNodes and continue java.net.SocketTimeoutException: 12 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.88.6.131:44264 remote=/10.88.6.133:50010] The same problem is asked in the thread "RE: Spark is unable to read from HDFS" (archived on mail-archives.us.apache.org). As suggested in that mail: In addition to specifying HADOOP_VERSION=1.0.3 in the ./project/SparkBuild.scala file, you will need to specify the libraryDependencies and name spark-core resolvers. Otherwise, sbt will fetch version 1.0.3 of hadoop-core from apache instead of Intel. You can set up your own local or remote repository that you specify. Now HADOOP_VERSION is deprecated and -Dhadoop.version should be used. Can anybody please elaborate on how to specify that SBT should fetch hadoop-core from Intel, which is in our internal repository? Thanks & Regards, Meethu M
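For anyone hitting the same question: sbt resolves artifacts from whatever resolvers the build declares, so one general approach is to add the internal repository to the build definition, e.g. resolvers += "intel-repo" at "http://repo.example.internal/maven" (the name and URL here are placeholders for the actual internal repository), so that the hadoop-core version pinned in libraryDependencies is fetched from there rather than from Maven Central. This is a standard sbt mechanism, not something specific to the Spark build.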
Mail to u...@spark.apache.org failing
Hi, The mail id given in https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark seems to be failing. Can anyone tell me how to get added to the Powered By Spark list? -- Regards, *Meethu*
[jira] [Commented] (SPARK-5609) PythonMLlibAPI trainGaussianMixture seed should use Java type
[ https://issues.apache.org/jira/browse/SPARK-5609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14306593#comment-14306593 ] Meethu Mathew commented on SPARK-5609: -- Please assign the ticket to me. PythonMLlibAPI trainGaussianMixture seed should use Java type - Key: SPARK-5609 URL: https://issues.apache.org/jira/browse/SPARK-5609 Project: Spark Issue Type: Bug Components: MLlib, PySpark Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Priority: Trivial trainGaussianMixture takes parameter seed of type scala.Long but should take java.lang.Long. Otherwise, the test for whether seed is null (None in Python) will be ineffective. See compilation warning: {code} [warn] /Users/josephkb/spark/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala:304: comparing values of types Long and Null using `!=' will always yield true [warn] if (seed != null) gmmAlg.setSeed(seed) [warn] ^ {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-5609) PythonMLlibAPI trainGaussianMixture seed should use Java type
[ https://issues.apache.org/jira/browse/SPARK-5609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14306593#comment-14306593 ] Meethu Mathew edited comment on SPARK-5609 at 2/5/15 4:03 AM: -- Please assign the ticket to me. [~josephkb] was (Author: meethumathew): Please assign the ticket to me. PythonMLlibAPI trainGaussianMixture seed should use Java type - Key: SPARK-5609 URL: https://issues.apache.org/jira/browse/SPARK-5609 Project: Spark Issue Type: Bug Components: MLlib, PySpark Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Priority: Trivial trainGaussianMixture takes parameter seed of type scala.Long but should take java.lang.Long. Otherwise, the test for whether seed is null (None in Python) will be ineffective. See compilation warning: {code} [warn] /Users/josephkb/spark/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala:304: comparing values of types Long and Null using `!=' will always yield true [warn] if (seed != null) gmmAlg.setSeed(seed) [warn] ^ {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
Test suites in the python wrapper of kmeans failing
Hi, The test suites in the Kmeans class in clustering.py is not updated to take the seed value and hence it is failing. Shall I make the changes and submit it along with my PR( Python API for Gaussian Mixture Model) or create a JIRA ? Regards, Meethu - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
Re: Test suites in the python wrapper of kmeans failing
Hi, Sorry it was my mistake. My code was not properly built. Regards, Meethu _http://www.linkedin.com/home?trk=hb_tab_home_top_ On Thursday 22 January 2015 10:39 AM, Meethu Mathew wrote: Hi, The test suites in the Kmeans class in clustering.py is not updated to take the seed value and hence it is failing. Shall I make the changes and submit it along with my PR( Python API for Gaussian Mixture Model) or create a JIRA ? Regards, Meethu - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
[jira] [Commented] (SPARK-5012) Python API for Gaussian Mixture Model
[ https://issues.apache.org/jira/browse/SPARK-5012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14286942#comment-14286942 ] Meethu Mathew commented on SPARK-5012: -- [~tgaloppo] Thank you..Will update this PR asap.. Python API for Gaussian Mixture Model - Key: SPARK-5012 URL: https://issues.apache.org/jira/browse/SPARK-5012 Project: Spark Issue Type: New Feature Components: MLlib, PySpark Reporter: Xiangrui Meng Assignee: Meethu Mathew Add Python API for the Scala implementation of GMM. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5012) Python API for Gaussian Mixture Model
[ https://issues.apache.org/jira/browse/SPARK-5012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14279811#comment-14279811 ] Meethu Mathew commented on SPARK-5012: -- Once SPARK-5019 is resolved, we will make the changes accordingly.Thanks [~josephkb] [~tgaloppo] for the comments Python API for Gaussian Mixture Model - Key: SPARK-5012 URL: https://issues.apache.org/jira/browse/SPARK-5012 Project: Spark Issue Type: New Feature Components: MLlib, PySpark Reporter: Xiangrui Meng Assignee: Meethu Mathew Add Python API for the Scala implementation of GMM. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
Use of MapConverter, ListConverter in python to java object conversion
Hi all, In the python object to java conversion done in the method _py2java in spark/python/pyspark/mllib/common.py, why are we doing individual conversions using MapConverter and ListConverter? The same can be achieved using bytearray(PickleSerializer().dumps(obj)) obj = sc._jvm.SerDe.loads(bytes) Is there any performance gain or something in using individual converters rather than PickleSerializer? -- Regards, *Meethu*
[jira] [Commented] (SPARK-5012) Python API for Gaussian Mixture Model
[ https://issues.apache.org/jira/browse/SPARK-5012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14273561#comment-14273561 ] Meethu Mathew commented on SPARK-5012: -- I added a new class GaussianMixtureModel in clustering.py and the method predict in it, and I am trying to pass a List of more than one dimension to the function _py2java, but I am getting the exception 'list' object has no attribute '_get_object_id'; when I give a tuple input (Vectors.dense([0.8786, -0.7855]), Vectors.dense([-0.1863, 0.7799])) the exception is 'numpy.ndarray' object has no attribute '_get_object_id'. Can you help me solve this? My aim is to call predictSoft() in GaussianMixtureModel.scala from clustering.py by passing the values of weight, mean and sigma. Python API for Gaussian Mixture Model - Key: SPARK-5012 URL: https://issues.apache.org/jira/browse/SPARK-5012 Project: Spark Issue Type: New Feature Components: MLlib, PySpark Reporter: Xiangrui Meng Assignee: Meethu Mathew Add Python API for the Scala implementation of GMM. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
Re: Python to Java object conversion of numpy array
Hi, This is the function defined in PythonMLLibAPI.scala
def findPredict(
    data: JavaRDD[Vector],
    wt: Object,
    mu: Array[Object],
    si: Array[Object]): RDD[Array[Double]] = { }
So the parameter mu should be converted to Array[Object]. mu = (Vectors.dense([0.8786, -0.7855]), Vectors.dense([-0.1863, 0.7799]))
def _py2java(sc, obj):
    if isinstance(obj, RDD):
        ...
    elif isinstance(obj, SparkContext):
        ...
    elif isinstance(obj, dict):
        ...
    elif isinstance(obj, (list, tuple)):
        obj = ListConverter().convert(obj, sc._gateway._gateway_client)
    elif isinstance(obj, JavaObject):
        pass
    elif isinstance(obj, (int, long, float, bool, basestring)):
        pass
    else:
        bytes = bytearray(PickleSerializer().dumps(obj))
        obj = sc._jvm.SerDe.loads(bytes)
    return obj
Since it is a tuple of DenseVectors, in _py2java() it enters the isinstance(obj, (list, tuple)) branch and throws an exception (this happens because the tuple has dimension greater than 1). However, the conversion occurs correctly if the Pickle conversion is done (last else part). Hope it is clear now. Regards, Meethu On Monday 12 January 2015 11:35 PM, Davies Liu wrote: On Sun, Jan 11, 2015 at 10:21 PM, Meethu Mathew meethu.mat...@flytxt.com wrote: Hi, This is the code I am running. mu = (Vectors.dense([0.8786, -0.7855]), Vectors.dense([-0.1863, 0.7799])) membershipMatrix = callMLlibFunc(findPredict, rdd.map(_convert_to_vector), mu) What does the Java API look like? All the arguments of findPredict should be converted into java objects, so what should `mu` be converted to? Regards, Meethu On Monday 12 January 2015 11:46 AM, Davies Liu wrote: Could you post a piece of code here? On Sun, Jan 11, 2015 at 9:28 PM, Meethu Mathew meethu.mat...@flytxt.com wrote: Hi, Thanks Davies. I added a new class GaussianMixtureModel in clustering.py and the method predict in it, and I was trying to pass a numpy array from this method. I converted it to DenseVector and it is solved now. Similarly I tried passing a List of more than one dimension to the function _py2java, but now the exception is 'list' object has no attribute '_get_object_id'; when I give a tuple input (Vectors.dense([0.8786, -0.7855]), Vectors.dense([-0.1863, 0.7799])) the exception is 'numpy.ndarray' object has no attribute '_get_object_id' Regards, Meethu Mathew Engineer Flytxt www.flytxt.com | Visit our blog | Follow us | Connect on Linkedin On Friday 09 January 2015 11:37 PM, Davies Liu wrote: Hey Meethu, The Java API accepts only Vector, so you should convert the numpy array into pyspark.mllib.linalg.DenseVector. BTW, which class are you using? the KMeansModel.predict() accepts numpy.array, it will do the conversion for you. Davies On Fri, Jan 9, 2015 at 4:45 AM, Meethu Mathew meethu.mat...@flytxt.com wrote: Hi, I am trying to send a numpy array as an argument to a function predict() in a class in spark/python/pyspark/mllib/clustering.py which is passed to the function callMLlibFunc(name, *args) in spark/python/pyspark/mllib/common.py. Now the value is passed to the function _py2java(sc, obj). Here I am getting an exception Py4JJavaError: An error occurred while calling z:org.apache.spark.mllib.api.python.SerDe.loads.
: net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for numpy.core.multiarray._reconstruct) at net.razorvine.pickle.objects.ClassDictConstructor.construct(ClassDictConstructor.java:23) at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:617) at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:170) at net.razorvine.pickle.Unpickler.load(Unpickler.java:84) at net.razorvine.pickle.Unpickler.loads(Unpickler.java:97) Why common._py2java(sc, obj) is not handling numpy array type? Please help.. -- Regards, *Meethu Mathew* *Engineer* *Flytxt* www.flytxt.com | Visit our blog http://blog.flytxt.com/ | Follow us http://www.twitter.com/flytxt | _Connect on Linkedin http://www.linkedin.com/home?trk=hb_tab_home_top_
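Following the else-branch of _py2java shown above, a sketch of side-stepping the ListConverter branch by pickling the sequence explicitly (an existing SparkContext `sc` is assumed):
{code}
from pyspark.serializers import PickleSerializer
from pyspark.mllib.linalg import Vectors

mu = [Vectors.dense([0.8786, -0.7855]), Vectors.dense([-0.1863, 0.7799])]
# Serialize through the pickle path instead of letting the list/tuple
# fall into ListConverter, which cannot convert vector elements:
data = bytearray(PickleSerializer().dumps(mu))
jmu = sc._jvm.SerDe.loads(data)  # the same call _py2java's else-branch makes
{code}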
Re: Python to Java object conversion of numpy array
Hi, This is the code I am running. mu = (Vectors.dense([0.8786, -0.7855]),Vectors.dense([-0.1863, 0.7799])) membershipMatrix = callMLlibFunc(findPredict, rdd.map(_convert_to_vector), mu) Regards, Meethu On Monday 12 January 2015 11:46 AM, Davies Liu wrote: Could you post a piece of code here? On Sun, Jan 11, 2015 at 9:28 PM, Meethu Mathew meethu.mat...@flytxt.com wrote: Hi, Thanks Davies . I added a new class GaussianMixtureModel in clustering.py and the method predict in it and trying to pass numpy array from this method.I converted it to DenseVector and its solved now. Similarly I tried passing a List of more than one dimension to the function _py2java , but now the exception is 'list' object has no attribute '_get_object_id' and when I give a tuple input (Vectors.dense([0.8786, -0.7855]),Vectors.dense([-0.1863, 0.7799])) exception is like 'numpy.ndarray' object has no attribute '_get_object_id' Regards, Meethu Mathew Engineer Flytxt www.flytxt.com | Visit our blog | Follow us | Connect on Linkedin On Friday 09 January 2015 11:37 PM, Davies Liu wrote: Hey Meethu, The Java API accepts only Vector, so you should convert the numpy array into pyspark.mllib.linalg.DenseVector. BTW, which class are you using? the KMeansModel.predict() accept numpy.array, it will do the conversion for you. Davies On Fri, Jan 9, 2015 at 4:45 AM, Meethu Mathew meethu.mat...@flytxt.com wrote: Hi, I am trying to send a numpy array as an argument to a function predict() in a class in spark/python/pyspark/mllib/clustering.py which is passed to the function callMLlibFunc(name, *args) in spark/python/pyspark/mllib/common.py. Now the value is passed to the function _py2java(sc, obj) .Here I am getting an exception Py4JJavaError: An error occurred while calling z:org.apache.spark.mllib.api.python.SerDe.loads. : net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for numpy.core.multiarray._reconstruct) at net.razorvine.pickle.objects.ClassDictConstructor.construct(ClassDictConstructor.java:23) at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:617) at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:170) at net.razorvine.pickle.Unpickler.load(Unpickler.java:84) at net.razorvine.pickle.Unpickler.loads(Unpickler.java:97) Why common._py2java(sc, obj) is not handling numpy array type? Please help.. -- Regards, *Meethu Mathew* *Engineer* *Flytxt* www.flytxt.com | Visit our blog http://blog.flytxt.com/ | Follow us http://www.twitter.com/flytxt | _Connect on Linkedin http://www.linkedin.com/home?trk=hb_tab_home_top_
Re: Python to Java object conversion of numpy array
Hi, Thanks Davies. I added a new class GaussianMixtureModel in clustering.py and the method predict in it, and I was trying to pass a numpy array from this method. I converted it to DenseVector and it is solved now. Similarly I tried passing a List of more than one dimension to the function _py2java, but now the exception is 'list' object has no attribute '_get_object_id'; when I give a tuple input (Vectors.dense([0.8786, -0.7855]), Vectors.dense([-0.1863, 0.7799])) the exception is 'numpy.ndarray' object has no attribute '_get_object_id' Regards, *Meethu Mathew* *Engineer* *Flytxt* www.flytxt.com | Visit our blog http://blog.flytxt.com/ | Follow us http://www.twitter.com/flytxt | _Connect on Linkedin http://www.linkedin.com/home?trk=hb_tab_home_top_ On Friday 09 January 2015 11:37 PM, Davies Liu wrote: Hey Meethu, The Java API accepts only Vector, so you should convert the numpy array into pyspark.mllib.linalg.DenseVector. BTW, which class are you using? the KMeansModel.predict() accepts numpy.array, it will do the conversion for you. Davies On Fri, Jan 9, 2015 at 4:45 AM, Meethu Mathew meethu.mat...@flytxt.com wrote: Hi, I am trying to send a numpy array as an argument to a function predict() in a class in spark/python/pyspark/mllib/clustering.py which is passed to the function callMLlibFunc(name, *args) in spark/python/pyspark/mllib/common.py. Now the value is passed to the function _py2java(sc, obj). Here I am getting an exception Py4JJavaError: An error occurred while calling z:org.apache.spark.mllib.api.python.SerDe.loads. : net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for numpy.core.multiarray._reconstruct) at net.razorvine.pickle.objects.ClassDictConstructor.construct(ClassDictConstructor.java:23) at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:617) at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:170) at net.razorvine.pickle.Unpickler.load(Unpickler.java:84) at net.razorvine.pickle.Unpickler.loads(Unpickler.java:97) Why is common._py2java(sc, obj) not handling the numpy array type? Please help.. -- Regards, *Meethu Mathew* *Engineer* *Flytxt* www.flytxt.com | Visit our blog http://blog.flytxt.com/ | Follow us http://www.twitter.com/flytxt | _Connect on Linkedin http://www.linkedin.com/home?trk=hb_tab_home_top_
[jira] [Commented] (SPARK-5012) Python API for Gaussian Mixture Model
[ https://issues.apache.org/jira/browse/SPARK-5012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14261923#comment-14261923 ] Meethu Mathew commented on SPARK-5012: -- The python implementation of the algorithm has already been added to spark-packages http://spark-packages.org/package/11 and it would be great if we are given a chance to write the Python wrappers for the algorithm. Python API for Gaussian Mixture Model - Key: SPARK-5012 URL: https://issues.apache.org/jira/browse/SPARK-5012 Project: Spark Issue Type: New Feature Components: MLlib, PySpark Reporter: Xiangrui Meng Assignee: Travis Galoppo Add Python API for the Scala implementation of GMM. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5015) GaussianMixtureEM should take random seed parameter
[ https://issues.apache.org/jira/browse/SPARK-5015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14261936#comment-14261936 ] Meethu Mathew commented on SPARK-5015: -- Instead of using a random seed, using the cluster centers returned by kmeans++ to initialize the means in GMM would be a good strategy, as implemented in scikit-learn http://scikit-learn.org/stable/modules/mixture.html#mixture. What is your opinion? GaussianMixtureEM should take random seed parameter --- Key: SPARK-5015 URL: https://issues.apache.org/jira/browse/SPARK-5015 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.2.0 Reporter: Joseph K. Bradley Priority: Minor GaussianMixtureEM uses randomness but does not take a random seed. It should take one as a parameter. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
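A sketch of that strategy in PySpark (an existing SparkContext `sc` is assumed; the `initialMeans` parameter in the commented line is purely hypothetical, since GaussianMixtureEM exposed no such option at the time):
{code}
from pyspark.mllib.clustering import KMeans

data = sc.parallelize([[0.0, 0.0], [0.5, 0.3], [10.0, 9.0], [9.5, 10.2]])
km = KMeans.train(data, 2, maxIterations=10)
seed_means = km.clusterCenters  # use k-means centers as the GMM's initial means
# gmm = GaussianMixture.train(data, 2, initialMeans=seed_means)  # hypothetical
{code}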
[jira] [Commented] (SPARK-5015) GaussianMixtureEM should take random seed parameter
[ https://issues.apache.org/jira/browse/SPARK-5015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14261946#comment-14261946 ] Meethu Mathew commented on SPARK-5015: -- We would try to experiment with both the initialization methods and come up with a comparison on cluster quality and running time. GaussianMixtureEM should take random seed parameter --- Key: SPARK-5015 URL: https://issues.apache.org/jira/browse/SPARK-5015 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.2.0 Reporter: Joseph K. Bradley Priority: Minor GaussianMixtureEM uses randomness but does not take a random seed. It should take one as a parameter. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
Re: Problems concerning implementing machine learning algorithm from scratch based on Spark
Hi, The GMMSpark.py you mentioned is the old one. The new code is now added to spark-packages and is available at http://spark-packages.org/package/11. Have a look at the new code. We have used numpy functions in our code and didn't notice any slowdown because of this. Thanks & Regards, Meethu M On Tuesday, 30 December 2014 11:50 AM, danqing0703 danqing0...@berkeley.edu wrote: Hi all, I am trying to use some machine learning algorithms that are not included in MLlib, like Mixture Model and LDA (Latent Dirichlet Allocation), and I am using pyspark and Spark SQL. My problem is: I have some scripts that implement these algorithms, but I am not sure which parts I should change to make them fit Big Data. - Some very simple calculations may take much time if the data is too big, but constructing an RDD or SQLContext table also takes too much time. I am really not sure if I should use map(), reduce() every time I need to make a calculation. - Also, there are some matrix/array-level calculations that cannot be implemented easily using only map() and reduce(), so functions of the Numpy package must be used. When the data is too big and we simply use the numpy functions, will it take too much time? I have found some scripts that are not from Mllib and were created by other developers (credits to Meethu Mathew from Flytxt, thanks for giving me insights! :)) Many thanks and look forward to getting feedback! Best, Danqing GMMSpark.py (7K) http://apache-spark-developers-list.1001551.n3.nabble.com/attachment/9964/0/GMMSpark.py -- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Problems-concerning-implementing-machine-learning-algorithm-from-scratch-based-on-Spark-tp9964.html Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
[jira] [Commented] (SPARK-4156) Add expectation maximization for Gaussian mixture models to MLLib clustering
[ https://issues.apache.org/jira/browse/SPARK-4156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14242486#comment-14242486 ] Meethu Mathew commented on SPARK-4156: -- [~tgaloppo] The current version of the code has no predict function to return the cluster labels, i.e., the index of the cluster to which the point has maximum membership. We have written a predict function to return the cluster labels and the membership values. We would be happy to contribute this to your code. cc [~mengxr] Add expectation maximization for Gaussian mixture models to MLLib clustering Key: SPARK-4156 URL: https://issues.apache.org/jira/browse/SPARK-4156 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Travis Galoppo Assignee: Travis Galoppo As an additional clustering algorithm, implement expectation maximization for Gaussian mixture models -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
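For reference, the essence of such a predict step in plain numpy (a minimal sketch of the idea, not the contributed Scala code; names are illustrative):
{code}
import numpy as np

def predict(point, weights, means, sigmas):
    """Return (label, memberships) for one point of a fitted GMM."""
    def gaussian_pdf(x, mu, sigma):
        d = len(mu)
        diff = x - mu
        norm = np.sqrt(((2 * np.pi) ** d) * np.linalg.det(sigma))
        return np.exp(-0.5 * diff.dot(np.linalg.inv(sigma)).dot(diff)) / norm

    # responsibility of each component for the point
    p = np.array([w * gaussian_pdf(point, m, s)
                  for w, m, s in zip(weights, means, sigmas)])
    memberships = p / p.sum()
    return int(np.argmax(memberships)), memberships
{code}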
Re: Mllib Error
Hi,
Try this: change spark-mllib to spark-mllib_2.10.

libraryDependencies ++= Seq(
  "org.apache.spark" % "spark-core_2.10" % "1.1.1",
  "org.apache.spark" % "spark-mllib_2.10" % "1.1.1"
)

Thanks & Regards, Meethu M

On Friday, 12 December 2014 12:22 PM, amin mohebbi aminn_...@yahoo.com.INVALID wrote:
I'm trying to build a very simple Scala standalone app using MLlib, but I get the following error when trying to build the program: "Object Mllib is not a member of package org.apache.spark". Then I realized that I have to add MLlib as a dependency, as follows:

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.1.0",
  "org.apache.spark" %% "spark-mllib" % "1.1.0"
)

But here I got an error that says: unresolved dependency spark-core_2.10.4;1.1.1 : not found. So I had to modify it to "org.apache.spark" % "spark-core_2.10" % "1.1.1". But there is still an error that says: unresolved dependency spark-mllib;1.1.1 : not found. Does anyone know how to add the MLlib dependency in the .sbt file?
Best Regards ... Amin Mohebbi, PhD candidate in Software Engineering at University of Malaysia. Tel: +60 18 2040 017. E-Mail: tp025...@ex.apiit.edu.my amin_...@me.com
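The underlying issue is the Scala binary-version suffix: with a single %, the artifact name must carry _2.10 explicitly, while %% appends it automatically from the project's scalaVersion. A sketch of the equivalent %% form, using the same versions as the thread:

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"  % "1.1.1",
  "org.apache.spark" %% "spark-mllib" % "1.1.1"
)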
Re: How to incrementally compile spark examples using mvn
Hi all,
I made some code changes in the mllib project, and as mentioned in the previous mails I did

mvn install -pl mllib

Now when I run a program in examples using run-example, the new code is not executing; the previous code itself is running. But if I do an mvn install on the entire Spark project, I can see the new code running. Installing the entire Spark project takes a lot of time, so it's difficult to do this each time I make a change. Can someone tell me how to compile mllib alone and get the changes working?
Thanks & Regards, Meethu M

On Friday, 28 November 2014 2:39 PM, MEETHU MATHEW meethu2...@yahoo.co.in wrote:
Hi, I have a similar problem. I modified the code in mllib and examples and did

mvn install -pl mllib
mvn install -pl examples

But when I run the program in examples using run-example, the older version of mllib (before the changes were made) is getting executed. How do I get the changes made in mllib while calling it from the examples project?
Thanks & Regards, Meethu M

On Monday, 24 November 2014 3:33 PM, Yiming (John) Zhang sdi...@gmail.com wrote:
Thank you, Marcelo and Sean, mvn install is a good answer for my demands.

-----Original Message-----
From: Marcelo Vanzin [mailto:van...@cloudera.com]
Sent: 21 November 2014, 1:47
To: yiming zhang
Cc: Sean Owen; user@spark.apache.org
Subject: Re: How to incrementally compile spark examples using mvn

Hi Yiming,
On Wed, Nov 19, 2014 at 5:35 PM, Yiming (John) Zhang sdi...@gmail.com wrote:
Thank you for your reply. I was wondering whether there is a method of reusing locally-built components without installing them? That is, if I have successfully built the Spark project as a whole, how should I configure it so that I can incrementally build (only) the spark-examples subproject without the need of downloading or installation?
As Sean suggests, you shouldn't need to install anything. After mvn install, your local repo is a working Spark installation, and you can use spark-submit and other tools directly within it. You just need to remember to rebuild the assembly/ project when modifying Spark code (or the examples/ project when modifying examples). -- Marcelo
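Following Marcelo's hint, the missing step is rebuilding the assembly jar that run-example actually loads. A plausible command sequence for the Spark 1.x Maven layout discussed in this thread (an assumption, not an official recipe):

# Rebuild only the changed module...
mvn install -pl mllib -DskipTests
# ...then regenerate the assembly jar that bin/run-example puts on the classpath.
mvn install -pl assembly -DskipTests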
[jira] [Commented] (SPARK-4156) Add expectation maximization for Gaussian mixture models to MLLib clustering
[ https://issues.apache.org/jira/browse/SPARK-4156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14231226#comment-14231226 ] Meethu Mathew commented on SPARK-4156:
--
We ran the GMM code on two public datasets:
http://cs.joensuu.fi/sipu/datasets/s1.txt
http://cs.joensuu.fi/sipu/datasets/birch2.txt
In both cases the execution converged at the 3rd iteration, and the w, mu and sigma were identical for all the components. The code was run using the following commands:
./bin/run-example org.apache.spark.examples.mllib.DenseGmmEM s1.csv 15 .0001
./bin/run-example org.apache.spark.examples.mllib.DenseGmmEM birch2.csv 100 .0001
Are we missing something here?

> Add expectation maximization for Gaussian mixture models to MLLib clustering
> -----------------------------------------------------------------------------
> Key: SPARK-4156
> URL: https://issues.apache.org/jira/browse/SPARK-4156
> Project: Spark
> Issue Type: New Feature
> Components: MLlib
> Reporter: Travis Galoppo
> Assignee: Travis Galoppo
>
> As an additional clustering algorithm, implement expectation maximization for Gaussian mixture models
[jira] [Commented] (SPARK-4156) Add expectation maximization for Gaussian mixture models to MLLib clustering
[ https://issues.apache.org/jira/browse/SPARK-4156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14232639#comment-14232639 ] Meethu Mathew commented on SPARK-4156:
--
We considered only a diagonal covariance matrix, and it was initialized using the variance of each feature.

> Add expectation maximization for Gaussian mixture models to MLLib clustering
> -----------------------------------------------------------------------------
> Key: SPARK-4156
> URL: https://issues.apache.org/jira/browse/SPARK-4156
> Project: Spark
> Issue Type: New Feature
> Components: MLlib
> Reporter: Travis Galoppo
> Assignee: Travis Galoppo
>
> As an additional clustering algorithm, implement expectation maximization for Gaussian mixture models
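A minimal sketch of that initialization using MLlib's column statistics, assuming data is an RDD[Vector] of the training points (names are illustrative):

import org.apache.spark.mllib.stat.Statistics

// Per-feature variance over the whole data set; each component's diagonal
// covariance starts from this vector.
val summary = Statistics.colStats(data)
val initialDiagSigma = summary.variance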
[jira] [Commented] (SPARK-3588) Gaussian Mixture Model clustering
[ https://issues.apache.org/jira/browse/SPARK-3588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14224091#comment-14224091 ] Meethu Mathew commented on SPARK-3588:
--
[~mengxr] We have completed the PySpark implementation, which is available at https://github.com/FlytxtRnD/GMM. We are in the process of porting the code to Scala and were planning to create a PR once the coding and test cases are completed. By merging, do you mean merging the tickets or the implementations? Kindly explain how the merge would be done. Will our work be a duplicate effort if we continue with our Scala implementation? Could you please suggest the next course of action?

> Gaussian Mixture Model clustering
> ---------------------------------
> Key: SPARK-3588
> URL: https://issues.apache.org/jira/browse/SPARK-3588
> Project: Spark
> Issue Type: New Feature
> Components: MLlib, PySpark
> Reporter: Meethu Mathew
> Assignee: Meethu Mathew
> Attachments: GMMSpark.py
>
> Gaussian Mixture Models (GMM) is a popular technique for soft clustering. GMM models the entire data set as a finite mixture of Gaussian distributions, each parameterized by a mean vector µ, a covariance matrix ∑ and a mixture weight π. In this technique, the probability of each point belonging to each cluster is computed along with the cluster statistics.
> We have come up with an initial distributed implementation of GMM in PySpark where the parameters are estimated using the Expectation-Maximization algorithm. Our current implementation considers a diagonal covariance matrix for each component.
> We did an initial benchmark study on a 2-node Spark standalone cluster where each node has 8 cores and 8 GB RAM; the Spark version used is 1.0.0. We also evaluated the Python version of k-means available in Spark on the same datasets. Below are the results from this benchmark study. The reported stats are averages from 10 runs. Tests were done on multiple datasets with varying numbers of features and instances.
> ||Instances||Dimensions||GMM: avg time per iteration||GMM: time for 100 iterations||K-means (Python): avg time per iteration||K-means (Python): time for 100 iterations||
> |0.7 million|13|7 s|12 min|13 s|26 min|
> |1.8 million|11|17 s|29 min|33 s|53 min|
> |10 million|16|1.6 min|2.7 hr|1.2 min|2 hr|
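For a diagonal covariance, the per-point E-step of the EM loop benchmarked above reduces to a product of one-dimensional Gaussians. A self-contained Scala sketch with illustrative names (not the GMMSpark.py code itself); working in log space avoids underflow on high-dimensional points:

def responsibilities(x: Array[Double],
                     weights: Array[Double],
                     means: Array[Array[Double]],
                     variances: Array[Array[Double]]): Array[Double] = {
  val unnormalized = Array.tabulate(weights.length) { j =>
    var logDensity = 0.0
    for (d <- x.indices) {
      val diff = x(d) - means(j)(d)
      // log N(x_d | mu_jd, sigma_jd^2), summed over independent features
      logDensity += -0.5 * (math.log(2 * math.Pi * variances(j)(d)) +
                            diff * diff / variances(j)(d))
    }
    weights(j) * math.exp(logDensity)
  }
  val total = unnormalized.sum
  unnormalized.map(_ / total)  // posterior membership of x in each component
}

The M-step then re-estimates each w, mu and sigma from these memberships, which is what each Spark iteration aggregates.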
Re: [MLlib] Contributing Algorithm for Outlier Detection
Hi Ashutosh,
Please edit the README file. I think the following function call has changed now:

model = OutlierWithAVFModel.outliers(master: String, inputDir: String, percentage: Double)

Regards,
Meethu Mathew
Engineer, Flytxt
http://www.linkedin.com/home?trk=hb_tab_home_top

On Friday 14 November 2014 12:01 AM, Ashutosh wrote:
Hi Anant,
Please see the changes: https://github.com/codeAshu/Outlier-Detection-with-AVF-Spark/blob/master/OutlierWithAVFModel.scala
I have changed the input format to Vector of String. I think we can also make it generic. Lines 59 & 72: that counter will not affect parallelism, since it only works on one data point; it only does the indexing of the column. All other side effects have been removed.
Thanks, Ashutosh

From: slcclimber [via Apache Spark Developers List] ml-node+s1001551n9287...@n3.nabble.com
Sent: Tuesday, November 11, 2014 11:46 PM
To: Ashutosh Trivedi (MT2013030)
Subject: Re: [MLlib] Contributing Algorithm for Outlier Detection
Mayur, the LibSVM format sounds good to me. I could work on writing the tests if that helps you?
Anant

On Nov 11, 2014 11:06 AM, Ashutosh [via Apache Spark Developers List] wrote:
Hi Mayur,
Vector data types are implemented using the Breeze library; they live at .../org/apache/spark/mllib/linalg
Anant, one restriction I found is that a vector can only be of 'Double', so it actually restricts the user. What are your thoughts on the LibSVM format? Thanks for the comments; I was just trying to get away from those increment/decrement functions, they look ugly. Points are noted, I'll try to fix them soon. Tests are also required for the code.
Regards, Ashutosh

From: Mayur Rustagi [via Apache Spark Developers List]
Sent: Saturday, November 8, 2014 12:52 PM
To: Ashutosh Trivedi (MT2013030)
Subject: Re: [MLlib] Contributing Algorithm for Outlier Detection
"We should take a vector instead, giving the user flexibility to decide data source/type." What do you mean by vector datatype exactly?
Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi

On Wed, Nov 5, 2014 at 6:45 AM, slcclimber wrote:
Ashutosh, I still see a few issues.
1. On line 112 you are counting using a counter. Since this will happen in an RDD, the counter will cause issues. It is also not good functional style to use a filter function with a side effect. You could use randomSplit instead; it does the same thing without the side effect.
2. The similar shared usage of j on line 102 is going to be an issue as well. Also, the hash seed does not need to be sequential; it could be randomly generated or hashed on the values.
3. The compute function and trim scores still run on a comma-separated RDD. We should take a vector instead, giving the user flexibility to decide the data source/type. What if we want data from Hive tables, or in Parquet, JSON or Avro formats? This is a very restrictive format. With vectors the user can take in whatever data format and convert it to vectors, instead of reading JSON files, creating a CSV file and then working on that.
4. The similar use of counters on lines 54 and 65 is an issue. Basically, shared-state counters are a huge issue that does not scale, since the processing of RDDs is distributed and the value j lives on the master.
Anant

On Tue, Nov 4, 2014 at 7:22 AM, Ashutosh [via Apache Spark Developers List] wrote:
Anant, I got rid of those increment/decrement functions and now the code is much cleaner. Please check; all your comments have been looked after. https://github.com/codeAshu/Outlier-Detection-with-AVF-Spark/blob/master/OutlierWithAVFModel.scala
Ashu

From: slcclimber [via Apache Spark Developers List]
Sent: Friday, October 31, 2014 10:09 AM
To: Ashutosh Trivedi (MT2013030)
Subject: Re: [MLlib] Contributing Algorithm for Outlier Detection
You should create a JIRA ticket to go with it as well. Thanks
On Oct 30, 2014 10:38 PM, Ashutosh [via Apache Spark
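The randomSplit suggestion in point 1 looks like this in practice; a minimal sketch assuming an existing RDD named data:

// Each element is assigned to a split independently and reproducibly under
// the given seed; no shared counter, so it stays correct when the
// computation is distributed across executors.
val Array(train, holdout) = data.randomSplit(Array(0.9, 0.1), seed = 11L)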
Re: [MLlib] Contributing Algorithm for Outlier Detection
Hi,
I have a doubt regarding the input to your algorithm:

val model = OutlierWithAVFModel.outliers(data: RDD[Vector[String]], percent: Double, sc: SparkContext)

Here our input data is an RDD[Vector[String]]. How can we create this RDD from a file? sc.textFile will simply give us an RDD[String]; how do we make it an RDD[Vector[String]]? Could you please share a code snippet of this conversion if you have one?

Regards, Meethu Mathew

On Friday 14 November 2014 10:02 AM, Meethu Mathew wrote:
Hi Ashutosh,
Please edit the README file. I think the following function call has changed now:
model = OutlierWithAVFModel.outliers(master: String, inputDir: String, percentage: Double)
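One way to build that input, assuming comma-separated records and an illustrative file name (a hedged sketch; the original thread does not include this conversion):

import org.apache.spark.rdd.RDD

// Split each text line into fields, giving one Scala Vector[String] per record.
val data: RDD[Vector[String]] = sc.textFile("outlier_input.csv")
  .map(line => line.split(",").toVector)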
Re: ISpark class not found
Hi,
I was also trying ISpark, but I couldn't even start the notebook. I am getting the following error:

ERROR:tornado.access:500 POST /api/sessions (127.0.0.1) 10.15ms referer=http://localhost:/notebooks/Scala/Untitled0.ipynb

How did you start the notebook?
Thanks & Regards, Meethu M

On Wednesday, 12 November 2014 6:50 AM, Laird, Benjamin benjamin.la...@capitalone.com wrote:
I've been experimenting with the ISpark extension to IScala (https://github.com/tribbloid/ISpark). Objects created in the REPL are not being loaded correctly on worker nodes, leading to a ClassNotFoundException. This does work correctly in spark-shell. I was curious if anyone has used ISpark and has encountered this issue. Thanks!
Simple example:
In [1]: case class Circle(rad:Float)
In [2]: val rdd = sc.parallelize(1 to 1).map(i => Circle(i.toFloat)).take(10)
14/11/11 13:03:35 ERROR TaskResultGetter: Exception while getting task result
com.esotericsoftware.kryo.KryoException: Unable to find class: [L$line5.$read$$iwC$$iwC$Circle;
Full trace in my gist: https://gist.github.com/benjaminlaird/3e543a9a89fb499a3a14
Is there a step-by-step instruction on how to build Spark App with IntelliJ IDEA?
Hi,
This question was asked earlier and I did it in the way specified, but I am getting java.lang.ClassNotFoundException. Can somebody explain all the steps required to build a Spark app using IntelliJ (latest version), starting from creating the project to running it? I searched a lot but couldn't find appropriate documentation.

The earlier answer, from "Re: Is there a step-by-step instruction on how to build Spark App with IntelliJ IDEA?":
"Don't try to use spark-core as an archetype. Instead just create a plain Scala project (no archetype) and add a Maven dependency on spark-core. That should be all you need."

Thanks & Regards, Meethu M
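For the quoted suggestion, the Maven dependency would look something like the following; a sketch using artifact coordinates that appear elsewhere in this archive (match the version and Scala suffix to your cluster):

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.10</artifactId>
  <version>1.1.0</version>
</dependency>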
Re: Relation between worker memory and executor memory in standalone mode
Try to set --total-executor-cores to limit how many total cores it can use.
Thanks & Regards, Meethu M

On Thursday, 2 October 2014 2:39 AM, Akshat Aranya aara...@gmail.com wrote:
I guess one way to do so would be to run more than one worker per node; say, instead of running 1 worker and giving it 8 cores, you can run 4 workers with 2 cores each. Then you get 4 executors with 2 cores each.

On Wed, Oct 1, 2014 at 1:06 PM, Boromir Widas vcsub...@gmail.com wrote:
I have not found a way to control the cores yet. This effectively limits the cluster to a single application at a time; a subsequent application shows in the 'WAITING' state on the dashboard.

On Wed, Oct 1, 2014 at 2:49 PM, Akshat Aranya aara...@gmail.com wrote:
Experimenting with this some more, I figured out that an executor takes away spark.executor.memory amount of memory from the configured worker memory. It also takes up all the cores, so even if there is still some memory left, there are no cores left for starting another executor. Is my assessment correct? Is there no way to configure the number of cores that an executor can use?

On Wed, Oct 1, 2014 at 11:33 AM, Akshat Aranya aara...@gmail.com wrote:
By "the job" do you mean one SparkContext or one stage execution within a program? Does that also mean that two concurrent jobs will get one executor each at the same time?

On Wed, Oct 1, 2014 at 11:00 AM, Boromir Widas vcsub...@gmail.com wrote:
1. Worker memory caps executor memory.
2. With the default config, every job gets one executor per worker. This executor runs with all cores available to the worker.

On Wed, Oct 1, 2014 at 11:04 AM, Akshat Aranya aara...@gmail.com wrote:
Hi,
What's the relationship between Spark worker and executor memory settings in standalone mode? Do they work independently, or does the worker cap executor memory? Also, is the number of concurrent executors per worker capped by the number of CPU cores configured for the worker?
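The flag mentioned at the top maps to the spark.cores.max setting, which can also be set in code; a sketch with illustrative values, assuming a standalone cluster:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("spark://master:7077")    // illustrative master URL
  .setAppName("capped-app")            // hypothetical application name
  .set("spark.cores.max", "4")         // total cores this app may take, cluster-wide
  .set("spark.executor.memory", "2g")  // per-executor memory, capped by worker memory
val sc = new SparkContext(conf)

With this cap in place the remaining cores stay free, so a second application no longer sits in the WAITING state.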
Same code --works in spark 1.0.2-- but not in spark 1.1.0
Hi all,
My code was working fine in Spark 1.0.2, but after upgrading to 1.1.0 it throws exceptions and tasks fail. The code contains some map and filter transformations followed by a groupByKey (a reduceByKey in another code path). What I could find out is that the code works fine until the groupByKey or reduceByKey in both versions, but after that the following errors show up in Spark 1.1.0:

java.io.FileNotFoundException: /tmp/spark-local-20141006173014-4178/35/shuffle_6_0_5161 (Too many open files)
        java.io.FileOutputStream.openAppend(Native Method)
        java.io.FileOutputStream.<init>(FileOutputStream.java:210)
        org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:123)
        org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:192)
        org.apache.spark.shuffle.hash.HashShuffleWriter$$anonfun$write$1.apply(HashShuffleWriter.scala:67)
        org.apache.spark.shuffle.hash.HashShuffleWriter$$anonfun$write$1.apply(HashShuffleWriter.scala:65)
        scala.collection.Iterator$class.foreach(Iterator.scala:727)
        scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
        org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:65)
        org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
        org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
        org.apache.spark.scheduler.Task.run(Task.scala:54)
        org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
        java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
        java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        java.lang.Thread.run(Thread.java:701)

I cleaned my /tmp directory and changed my local directory to another folder, but nothing helped. Can anyone say what the reason could be?
Thanks & Regards, Meethu M
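Two settings commonly suggested for this hash-shuffle symptom in Spark 1.1, besides raising the OS open-file limit with ulimit -n; a sketch, worth verifying against the configuration docs for your exact version:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  // Consolidate intermediate shuffle outputs so each map task opens fewer files.
  .set("spark.shuffle.consolidateFiles", "true")
  // Or switch from the hash-based to the sort-based shuffle added in 1.1,
  // which writes one sorted output file per map task.
  .set("spark.shuffle.manager", "sort")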
[jira] [Commented] (SPARK-3588) Gaussian Mixture Model clustering
[ https://issues.apache.org/jira/browse/SPARK-3588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14154434#comment-14154434 ] Meethu Mathew commented on SPARK-3588:
--
Ok. We will start implementing the Scala version of the Gaussian Mixture Model.

> Gaussian Mixture Model clustering
> ---------------------------------
> Key: SPARK-3588
> URL: https://issues.apache.org/jira/browse/SPARK-3588
> Project: Spark
> Issue Type: New Feature
> Components: MLlib, PySpark
> Reporter: Meethu Mathew
> Assignee: Meethu Mathew
> Attachments: GMMSpark.py