Filtering on a float value with more than one decimal place not working correctly in a PySpark DataFrame
Hi all, I tried the following code and the output was not as expected.

    schema = StructType([StructField('Id', StringType(), False),
                         StructField('Value', FloatType(), False)])
    df_test = spark.createDataFrame([('a', 5.0), ('b', 1.236), ('c', -0.31)], schema)
    df_test
    Output: DataFrame[Id: string, Value: float]

Filtering with equality on the value 1.236 returned an empty result [screenshot omitted]. But when the value is given as a string, it worked [screenshot omitted]. I tried again with a floating-point number having one decimal place and it worked [screenshot omitted]. And when the equals operation is changed to greater-than or less-than, it works with numbers having more than one decimal place as well [screenshot omitted]. Is this a bug? Regards, Meethu Mathew
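A minimal sketch of the likely cause and a workaround (assumes a SparkSession named spark; the cast-based workaround is illustrative, not taken from the thread):

    from pyspark.sql.functions import lit
    from pyspark.sql.types import StructType, StructField, StringType, FloatType

    schema = StructType([StructField('Id', StringType(), False),
                         StructField('Value', FloatType(), False)])
    df_test = spark.createDataFrame([('a', 5.0), ('b', 1.236), ('c', -0.31)], schema)

    # The column is single precision, so 1.236 is stored as roughly
    # 1.2359999..., while the Python literal 1.236 is a double; widening
    # the float column to double for the comparison never reproduces
    # the double 1.236 exactly.
    df_test.filter(df_test.Value == 1.236).count()                     # 0 rows

    # Comparing at float precision (or declaring the column as
    # DoubleType in the first place) sidesteps the mismatch.
    df_test.filter(df_test.Value == lit(1.236).cast('float')).count()  # 1 row

This also explains the other observations: a string literal like '1.236' is cast to the column's float type before comparing, 5.0 is exactly representable at both precisions, and a tiny representation error does not flip the result of a greater-than or less-than comparison.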
Re: Failed to run spark jobs on mesos due to "hadoop" not found.
Hi, Add HADOOP_HOME=/path/to/hadoop/folder in /etc/default/mesos-slave on all Mesos agents and restart Mesos. Regards, Meethu Mathew

On Thu, Nov 10, 2016 at 4:57 PM, Yu Wei wrote:
> Hi Guys,
> I failed to launch Spark jobs on Mesos. Actually I submitted the job to the cluster successfully, but the job failed to run.
> I1110 18:25:11.095507 301 fetcher.cpp:498] Fetcher Info: {"cache_directory":"\/tmp\/mesos\/fetch\/slaves\/1f8e621b-3cbf-4b86-a1c1-9e2cf77265ee-S7\/root","items":[{"action":"BYPASS_CACHE","uri":{"extract":true,"value":"hdfs:\/\/192.168.111.74:9090\/bigdata\/package\/spark-examples_2.11-2.0.1.jar"}}],"sandbox_directory":"\/var\/lib\/mesos\/agent\/slaves\/1f8e621b-3cbf-4b86-a1c1-9e2cf77265ee-S7\/frameworks\/1f8e621b-3cbf-4b86-a1c1-9e2cf77265ee-0002\/executors\/driver-20161110182510-0001\/runs\/b561328e-9110-4583-b740-98f9653e7fc2","user":"root"}
> I1110 18:25:11.099799 301 fetcher.cpp:409] Fetching URI 'hdfs://192.168.111.74:9090/bigdata/package/spark-examples_2.11-2.0.1.jar'
> I1110 18:25:11.099820 301 fetcher.cpp:250] Fetching directly into the sandbox directory
> I1110 18:25:11.099862 301 fetcher.cpp:187] Fetching URI 'hdfs://192.168.111.74:9090/bigdata/package/spark-examples_2.11-2.0.1.jar'
> E1110 18:25:11.101842 301 shell.hpp:106] Command 'hadoop version 2>&1' failed; this is the output: sh: hadoop: command not found
> Failed to fetch 'hdfs://192.168.111.74:9090/bigdata/package/spark-examples_2.11-2.0.1.jar': Failed to create HDFS client: Failed to execute 'hadoop version 2>&1'; the command was either not found or exited with a non-zero exit status: 127
> Failed to synchronize with agent (it's probably exited)
>
> Actually I installed hadoop on each agent node. Any advice?
>
> Thanks, Jared (韦煜)
> Software developer
> Interested in open source software, big data, Linux
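For concreteness, a sketch of the suggested fix (the Hadoop path and the restart command are assumptions; adjust both for your installation and init system):

    # /etc/default/mesos-slave, on every Mesos agent
    HADOOP_HOME=/usr/local/hadoop    # hypothetical path; point at your Hadoop install

    # then restart the agent so the Mesos fetcher can find the 'hadoop' binary
    sudo service mesos-slave restart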
Re: [discuss] dropping Python 2.6 support
+1 We use Python 2.7. Regards, Meethu Mathew On Tue, Jan 5, 2016 at 12:47 PM, Reynold Xin wrote: > Does anybody here care about us dropping support for Python 2.6 in Spark 2.0? > > Python 2.6 is ancient, and is pretty slow in many aspects (e.g. json parsing) when compared with Python 2.7. Some libraries that Spark depends on have stopped supporting 2.6. We can still convince the library maintainers to support 2.6, but it will be extra work. I'm curious if anybody still uses Python 2.6 to run Spark. > > Thanks.
Re: Please reply if you use Mesos fine grained mode
Hi, We are using Mesos fine-grained mode because multiple Spark instances can share machines, and each application gets resources allocated dynamically. Thanks & Regards, Meethu M On Wednesday, 4 November 2015 5:24 AM, Reynold Xin wrote: If you are using Spark with Mesos fine grained mode, can you please respond to this email explaining why you use it over the coarse grained mode? Thanks.
Spark 1.6 Release window is not updated in Spark-wiki
Hi, On https://cwiki.apache.org/confluence/display/SPARK/Wiki+Homepage the current release window has not been updated since 1.5. Can anybody give an idea of the expected dates for the 1.6 release? Regards, Meethu Mathew Senior Engineer Flytxt
[MLlib] Contributing algorithm for DP means clustering
Hi all, At present, all the clustering algorithms in MLlib require the number of clusters to be specified in advance. The Dirichlet process (DP) is a popular non-parametric Bayesian mixture model that allows for flexible clustering of data without having to specify the number of clusters a priori. DP-means is a non-parametric clustering algorithm that uses a scale parameter 'lambda' to control the creation of new clusters; a serial sketch of its assignment rule follows this message. We have followed the distributed implementation of DP-means proposed in the paper "MLbase: Distributed Machine Learning Made Easy" by Xinghao Pan, Evan R. Sparks, and Andre Wibisono. I have raised a JIRA ticket at https://issues.apache.org/jira/browse/SPARK-8402 Suggestions and guidance are welcome. Regards, Meethu Mathew Senior Engineer Flytxt www.flytxt.com | Visit our blog <http://blog.flytxt.com/> | Follow us <http://www.twitter.com/flytxt> | Connect on LinkedIn <http://www.linkedin.com/company/22166>
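A minimal serial sketch of that assignment rule (illustrative only; the distributed formulation in the paper shards this step across workers, and the function name and lambda value here are made up):

    import numpy as np

    def dp_means_assign(points, centers, lam):
        # DP-means: a point whose squared distance to every existing
        # center exceeds lambda starts a new cluster of its own.
        labels = []
        for x in points:
            d2 = [float(np.sum((x - c) ** 2)) for c in centers]
            if min(d2) > lam:
                centers.append(x.copy())
                labels.append(len(centers) - 1)
            else:
                labels.append(int(np.argmin(d2)))
        return labels, centers

    # usage: seed with one center, then alternate assignment and mean updates
    pts = [np.array(p) for p in [[0.0, 0.0], [0.1, 0.2], [5.0, 5.0]]]
    labels, centers = dp_means_assign(pts, [pts[0].copy()], lam=2.0)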
Re: Anyone facing problem in incremental building of individual project
Hi, I added <createDependencyReducedPom>false</createDependencyReducedPom> to the maven-shade-plugin configuration in my pom.xml and the problem is solved. Thank you @Steve and @Ted. Regards, Meethu Mathew Senior Engineer Flytxt

On Thu, Jun 4, 2015 at 9:51 PM, Ted Yu wrote:
> Andrew Or put in this workaround:
>
> diff --git a/pom.xml b/pom.xml
> index 0b1aaad..d03d33b 100644
> --- a/pom.xml
> +++ b/pom.xml
> @@ -1438,6 +1438,8 @@
>            <version>2.3</version>
>            <configuration>
>              <shadedArtifactAttached>false</shadedArtifactAttached>
> +            <!-- work around MSHADE-148 -->
> +            <createDependencyReducedPom>false</createDependencyReducedPom>
>
> FYI
>
> On Thu, Jun 4, 2015 at 6:25 AM, Steve Loughran wrote:
>>
>> On 4 Jun 2015, at 11:16, Meethu Mathew wrote:
>>
>> Hi all,
>>
>> I added some new code to MLlib. When I am trying to build only the mllib project using mvn --projects mllib/ -DskipTests clean install after setting export SPARK_PREPEND_CLASSES=true, the build is getting stuck with the following message.
>>
>>> Excluding org.jpmml:pmml-schema:jar:1.1.15 from the shaded jar.
>>> [INFO] Excluding com.sun.xml.bind:jaxb-impl:jar:2.2.7 from the shaded jar.
>>> [INFO] Excluding com.sun.xml.bind:jaxb-core:jar:2.2.7 from the shaded jar.
>>> [INFO] Excluding javax.xml.bind:jaxb-api:jar:2.2.7 from the shaded jar.
>>> [INFO] Including org.spark-project.spark:unused:jar:1.0.0 in the shaded jar.
>>> [INFO] Excluding org.scala-lang:scala-reflect:jar:2.10.4 from the shaded jar.
>>> [INFO] Replacing original artifact with shaded artifact.
>>> [INFO] Replacing /home/meethu/git/FlytxtRnD/spark/mllib/target/spark-mllib_2.10-1.4.0-SNAPSHOT.jar with /home/meethu/git/FlytxtRnD/spark/mllib/target/spark-mllib_2.10-1.4.0-SNAPSHOT-shaded.jar
>>> [INFO] Dependency-reduced POM written at: /home/meethu/git/FlytxtRnD/spark/mllib/dependency-reduced-pom.xml
>>> [INFO] Dependency-reduced POM written at: /home/meethu/git/FlytxtRnD/spark/mllib/dependency-reduced-pom.xml
>>> [INFO] Dependency-reduced POM written at: /home/meethu/git/FlytxtRnD/spark/mllib/dependency-reduced-pom.xml
>>> [INFO] Dependency-reduced POM written at: /home/meethu/git/FlytxtRnD/spark/mllib/dependency-reduced-pom.xml
>>
>> I've seen something similar in a different build.
>>
>> It looks like MSHADE-148: https://issues.apache.org/jira/browse/MSHADE-148
>> If you apply Tom White's patch, does your problem go away?
Anyone facing problem in incremental building of individual project
Hi all, I added some new code to MLlib. When I am trying to build only the mllib project using mvn --projects mllib/ -DskipTests clean install after setting export SPARK_PREPEND_CLASSES=true, the build is getting stuck with the following message.

> Excluding org.jpmml:pmml-schema:jar:1.1.15 from the shaded jar.
> [INFO] Excluding com.sun.xml.bind:jaxb-impl:jar:2.2.7 from the shaded jar.
> [INFO] Excluding com.sun.xml.bind:jaxb-core:jar:2.2.7 from the shaded jar.
> [INFO] Excluding javax.xml.bind:jaxb-api:jar:2.2.7 from the shaded jar.
> [INFO] Including org.spark-project.spark:unused:jar:1.0.0 in the shaded jar.
> [INFO] Excluding org.scala-lang:scala-reflect:jar:2.10.4 from the shaded jar.
> [INFO] Replacing original artifact with shaded artifact.
> [INFO] Replacing /home/meethu/git/FlytxtRnD/spark/mllib/target/spark-mllib_2.10-1.4.0-SNAPSHOT.jar with /home/meethu/git/FlytxtRnD/spark/mllib/target/spark-mllib_2.10-1.4.0-SNAPSHOT-shaded.jar
> [INFO] Dependency-reduced POM written at: /home/meethu/git/FlytxtRnD/spark/mllib/dependency-reduced-pom.xml
> [INFO] Dependency-reduced POM written at: /home/meethu/git/FlytxtRnD/spark/mllib/dependency-reduced-pom.xml
> [INFO] Dependency-reduced POM written at: /home/meethu/git/FlytxtRnD/spark/mllib/dependency-reduced-pom.xml
> [INFO] Dependency-reduced POM written at: /home/meethu/git/FlytxtRnD/spark/mllib/dependency-reduced-pom.xml

But a full build completes as usual. Please help if anyone is facing the same issue. Regards, Meethu Mathew Senior Engineer Flytxt
Regarding "Connecting spark to Mesos" documentation
Hi List, In the documentation on connecting Spark to Mesos <http://spark.apache.org/docs/latest/running-on-mesos.html#connecting-spark-to-mesos>, could the step "Create a binary package using make-distribution.sh --tgz" be expanded and described in detail? When we use a custom-compiled version of Spark, we usually specify a Hadoop version (one that is not the default). In that case, make-distribution.sh should be supplied the same Maven options that were used to build Spark, but this is not specified in the documentation. Please correct me if I am wrong. Regards, Meethu Mathew
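For example, an invocation along these lines (the profile and version flags are hypothetical; they must mirror whatever was passed to the original Maven build):

    # Spark built with -Phadoop-2.4 -Dhadoop.version=2.4.0 should be
    # packaged with the same options:
    ./make-distribution.sh --tgz -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests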
Re: Speeding up Spark build during development
Hi, Is it really necessary to run mvn --projects assembly/ -DskipTests install? Could you please explain why this is needed? I got the changes after running "mvn --projects streaming/ -DskipTests package". Regards, Meethu

On Monday 04 May 2015 02:20 PM, Emre Sevinc wrote: Just to give you an example: When I was trying to make a small change only to the Streaming component of Spark, first I built and installed the whole Spark project (this took about 15 minutes on my 4-core, 4 GB RAM laptop). Then, after having changed files only in Streaming, I ran something like (in the top-level directory): mvn --projects streaming/ -DskipTests package and then mvn --projects assembly/ -DskipTests install This was much faster than trying to build the whole Spark from scratch, because Maven was only building one component, in my case the Streaming component, of Spark. I think you can use a very similar approach. -- Emre Sevinç

On Mon, May 4, 2015 at 10:44 AM, Pramod Biligiri wrote: No, I just need to build one project at a time. Right now SparkSql. Pramod

On Mon, May 4, 2015 at 12:09 AM, Emre Sevinc wrote: Hello Pramod, Do you need to build the whole project every time? Generally you don't; e.g., when I was changing some files that belong only to Spark Streaming, I was building only the streaming component (of course after having built and installed the whole project, but that was done only once), and then the assembly. This was much faster than trying to build the whole Spark every time. -- Emre Sevinç

On Mon, May 4, 2015 at 9:01 AM, Pramod Biligiri wrote: Using the inbuilt maven and zinc it takes around 10 minutes for each build. Is that reasonable? My maven opts look like this: $ echo $MAVEN_OPTS -Xmx12000m -XX:MaxPermSize=2048m I'm running it as build/mvn -DskipTests package Should I be tweaking my Zinc/Nailgun config? Pramod

On Sun, May 3, 2015 at 3:40 PM, Mark Hamstra wrote: https://spark.apache.org/docs/latest/building-spark.html#building-with-buildmvn

On Sun, May 3, 2015 at 2:54 PM, Pramod Biligiri <pramodbilig...@gmail.com> wrote: This is great. I didn't know about the mvn script in the build directory. Pramod

On Fri, May 1, 2015 at 9:51 AM, York, Brennon <brennon.y...@capitalone.com> wrote: Following what Ted said, if you leverage the `mvn` from within the `build/` directory of Spark you'll get zinc for free, which should help speed up build times.

On 5/1/15, 9:45 AM, "Ted Yu" wrote: Pramod: Please remember to run Zinc so that the build is faster. Cheers

On Fri, May 1, 2015 at 9:36 AM, Ulanov, Alexander wrote: Hi Pramod, For cluster-like tests you might want to use the same code as in mllib's LocalClusterSparkContext. You can rebuild only the package that you change and then run this main class. Best regards, Alexander

-----Original Message-----
From: Pramod Biligiri [mailto:pramodbilig...@gmail.com]
Sent: Friday, May 01, 2015 1:46 AM
To: dev@spark.apache.org
Subject: Speeding up Spark build during development

Hi, I'm making some small changes to the Spark codebase and trying it out on a cluster. I was wondering if there's a faster way to build than running the package target each time. Currently I'm using: mvn -DskipTests package All the nodes have the same filesystem mounted at the same mount point. Pramod
-- Emre Sevinç
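In short, the incremental workflow this thread converges on (replace streaming/ with whichever module you changed):

    build/mvn --projects streaming/ -DskipTests package   # rebuild only the changed module
    build/mvn --projects assembly/ -DskipTests install    # refresh the assembly that actually runs

Using build/mvn picks up Zinc automatically, which is where most of the speedup comes from.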
Mail to u...@spark.apache.org failing
Hi, The email address given on https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark seems to be failing. Can anyone tell me how to get added to the Powered By Spark list? -- Regards, Meethu
Re: Test suites in the python wrapper of kmeans failing
Hi, Sorry, it was my mistake. My code was not built properly. Regards, Meethu

On Thursday 22 January 2015 10:39 AM, Meethu Mathew wrote: Hi, The test suites in the KMeans class in clustering.py are not updated to take the seed value and hence they are failing. Shall I make the changes and submit them along with my PR (Python API for Gaussian Mixture Model), or create a JIRA? Regards, Meethu
Test suites in the python wrapper of kmeans failing
Hi, The test suites in the KMeans class in clustering.py are not updated to take the seed value and hence they are failing. Shall I make the changes and submit them along with my PR (Python API for Gaussian Mixture Model), or create a JIRA? Regards, Meethu
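For context, the kind of call the updated tests would need to exercise (a sketch; treat the exact signature and values as assumptions against the PySpark MLlib API of that era):

    from pyspark.mllib.clustering import KMeans

    # a fixed seed makes the initialization deterministic, so the
    # doctest/test output becomes reproducible
    model = KMeans.train(rdd, k=2, maxIterations=10, seed=42)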
Use of MapConverter, ListConverter in python to java object conversion
Hi all, In the Python-object-to-Java conversion done in the method _py2java in spark/python/pyspark/mllib/common.py, why are we doing individual conversions using MapConverter and ListConverter? The same can be achieved using

    bytes = bytearray(PickleSerializer().dumps(obj))
    obj = sc._jvm.SerDe.loads(bytes)

Is there any performance gain or something in using individual converters rather than PickleSerializer? -- Regards, Meethu
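The two paths being compared, as a self-contained sketch (py_obj is a stand-in for the Python value being converted; both calls appear in the message above):

    from py4j.java_collections import ListConverter
    from pyspark.serializers import PickleSerializer

    # path 1: Py4J converts the collection element by element into a
    # java.util.List, one gateway call per element
    java_list = ListConverter().convert(py_obj, sc._gateway._gateway_client)

    # path 2: pickle the whole object in one shot and let the JVM-side
    # SerDe unpickle it
    java_obj = sc._jvm.SerDe.loads(bytearray(PickleSerializer().dumps(py_obj)))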
Re: Python to Java object conversion of numpy array
Hi, This is the function defined in PythonMLLibAPI.scala:

    def findPredict(
        data: JavaRDD[Vector],
        wt: Object,
        mu: Array[Object],
        si: Array[Object]): RDD[Array[Double]] = { }

So the parameter mu should be converted to Array[Object], where

    mu = (Vectors.dense([0.8786, -0.7855]), Vectors.dense([-0.1863, 0.7799]))

and _py2java is:

    def _py2java(sc, obj):
        if isinstance(obj, RDD):
            ...
        elif isinstance(obj, SparkContext):
            ...
        elif isinstance(obj, dict):
            ...
        elif isinstance(obj, (list, tuple)):
            obj = ListConverter().convert(obj, sc._gateway._gateway_client)
        elif isinstance(obj, JavaObject):
            pass
        elif isinstance(obj, (int, long, float, bool, basestring)):
            pass
        else:
            bytes = bytearray(PickleSerializer().dumps(obj))
            obj = sc._jvm.SerDe.loads(bytes)
        return obj

Since mu is a tuple of DenseVectors, _py2java() enters the isinstance(obj, (list, tuple)) branch and throws an exception (this happens because the tuple has more than one dimension). However, the conversion works correctly if the pickle conversion is done (the last else branch). Hope it's clear now. Regards, Meethu

On Monday 12 January 2015 11:35 PM, Davies Liu wrote: On Sun, Jan 11, 2015 at 10:21 PM, Meethu Mathew wrote: Hi, This is the code I am running. mu = (Vectors.dense([0.8786, -0.7855]), Vectors.dense([-0.1863, 0.7799])) membershipMatrix = callMLlibFunc("findPredict", rdd.map(_convert_to_vector), mu) What does the Java API look like? All the arguments of findPredict should be converted into Java objects, so what should `mu` be converted to? Regards, Meethu

On Monday 12 January 2015 11:46 AM, Davies Liu wrote: Could you post a piece of code here? On Sun, Jan 11, 2015 at 9:28 PM, Meethu Mathew wrote: Hi, Thanks Davies. I added a new class GaussianMixtureModel in clustering.py and the method predict in it, and I am trying to pass a numpy array from this method. I converted it to DenseVector and it's solved now. Similarly, I tried passing a list of more than one dimension to the function _py2java, but now the exception is 'list' object has no attribute '_get_object_id', and when I give a tuple input (Vectors.dense([0.8786, -0.7855]), Vectors.dense([-0.1863, 0.7799])) the exception is 'numpy.ndarray' object has no attribute '_get_object_id'. Regards, Meethu Mathew Engineer Flytxt www.flytxt.com | Visit our blog | Follow us | Connect on LinkedIn

On Friday 09 January 2015 11:37 PM, Davies Liu wrote: Hey Meethu, The Java API accepts only Vector, so you should convert the numpy array into pyspark.mllib.linalg.DenseVector. BTW, which class are you using? KMeansModel.predict() accepts numpy.array; it will do the conversion for you. Davies

On Fri, Jan 9, 2015 at 4:45 AM, Meethu Mathew wrote: Hi, I am trying to send a numpy array as an argument to a function predict() in a class in spark/python/pyspark/mllib/clustering.py, which is passed to the function callMLlibFunc(name, *args) in spark/python/pyspark/mllib/common.py. Now the value is passed to the function _py2java(sc, obj). Here I am getting an exception Py4JJavaError: An error occurred while calling z:org.apache.spark.mllib.api.python.SerDe.loads.
: net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for numpy.core.multiarray._reconstruct)
    at net.razorvine.pickle.objects.ClassDictConstructor.construct(ClassDictConstructor.java:23)
    at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:617)
    at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:170)
    at net.razorvine.pickle.Unpickler.load(Unpickler.java:84)
    at net.razorvine.pickle.Unpickler.loads(Unpickler.java:97)

Why is common._py2java(sc, obj) not handling the numpy array type? Please help. -- Regards, Meethu Mathew Engineer Flytxt www.flytxt.com | Visit our blog <http://blog.flytxt.com/> | Follow us <http://www.twitter.com/flytxt> | Connect on LinkedIn <http://www.linkedin.com/home?trk=hb_tab_home_top>
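A sketch of the conversion the thread converges on: wrap the numpy arrays in MLlib DenseVectors and force the pickle path rather than the element-wise ListConverter branch (illustrative; it mirrors the else branch of the _py2java code quoted above):

    import numpy as np
    from pyspark.mllib.linalg import Vectors
    from pyspark.serializers import PickleSerializer

    mu = [Vectors.dense(np.array([0.8786, -0.7855])),
          Vectors.dense(np.array([-0.1863, 0.7799]))]

    # DenseVector is registered with MLlib's SerDe, so pickling the whole
    # list round-trips to the JVM, whereas ListConverter fails on elements
    # that are not Py4J JavaObjects ('... has no attribute _get_object_id').
    java_mu = sc._jvm.SerDe.loads(bytearray(PickleSerializer().dumps(mu)))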
Re: Python to Java object conversion of numpy array
Hi, This is the code I am running:

    mu = (Vectors.dense([0.8786, -0.7855]), Vectors.dense([-0.1863, 0.7799]))
    membershipMatrix = callMLlibFunc("findPredict", rdd.map(_convert_to_vector), mu)

Regards, Meethu

On Monday 12 January 2015 11:46 AM, Davies Liu wrote: Could you post a piece of code here? On Sun, Jan 11, 2015 at 9:28 PM, Meethu Mathew wrote: Hi, Thanks Davies. I added a new class GaussianMixtureModel in clustering.py and the method predict in it, and I am trying to pass a numpy array from this method. I converted it to DenseVector and it's solved now. Similarly, I tried passing a list of more than one dimension to the function _py2java, but now the exception is 'list' object has no attribute '_get_object_id', and when I give a tuple input (Vectors.dense([0.8786, -0.7855]), Vectors.dense([-0.1863, 0.7799])) the exception is 'numpy.ndarray' object has no attribute '_get_object_id'. Regards, Meethu Mathew Engineer Flytxt www.flytxt.com | Visit our blog | Follow us | Connect on LinkedIn

On Friday 09 January 2015 11:37 PM, Davies Liu wrote: Hey Meethu, The Java API accepts only Vector, so you should convert the numpy array into pyspark.mllib.linalg.DenseVector. BTW, which class are you using? KMeansModel.predict() accepts numpy.array; it will do the conversion for you. Davies

On Fri, Jan 9, 2015 at 4:45 AM, Meethu Mathew wrote: Hi, I am trying to send a numpy array as an argument to a function predict() in a class in spark/python/pyspark/mllib/clustering.py, which is passed to the function callMLlibFunc(name, *args) in spark/python/pyspark/mllib/common.py. Now the value is passed to the function _py2java(sc, obj). Here I am getting an exception

    Py4JJavaError: An error occurred while calling z:org.apache.spark.mllib.api.python.SerDe.loads.
    : net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for numpy.core.multiarray._reconstruct)
        at net.razorvine.pickle.objects.ClassDictConstructor.construct(ClassDictConstructor.java:23)
        at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:617)
        at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:170)
        at net.razorvine.pickle.Unpickler.load(Unpickler.java:84)
        at net.razorvine.pickle.Unpickler.loads(Unpickler.java:97)

Why is common._py2java(sc, obj) not handling the numpy array type? Please help. -- Regards, Meethu Mathew Engineer Flytxt www.flytxt.com | Visit our blog <http://blog.flytxt.com/> | Follow us <http://www.twitter.com/flytxt> | Connect on LinkedIn <http://www.linkedin.com/home?trk=hb_tab_home_top>
Re: Python to Java object conversion of numpy array
Hi, Thanks Davies. I added a new class GaussianMixtureModel in clustering.py and the method predict in it, and I am trying to pass a numpy array from this method. I converted it to DenseVector and it's solved now. Similarly, I tried passing a list of more than one dimension to the function _py2java, but now the exception is 'list' object has no attribute '_get_object_id', and when I give a tuple input (Vectors.dense([0.8786, -0.7855]), Vectors.dense([-0.1863, 0.7799])) the exception is 'numpy.ndarray' object has no attribute '_get_object_id'. Regards, Meethu Mathew Engineer Flytxt www.flytxt.com | Visit our blog <http://blog.flytxt.com/> | Follow us <http://www.twitter.com/flytxt> | Connect on LinkedIn <http://www.linkedin.com/home?trk=hb_tab_home_top>

On Friday 09 January 2015 11:37 PM, Davies Liu wrote: Hey Meethu, The Java API accepts only Vector, so you should convert the numpy array into pyspark.mllib.linalg.DenseVector. BTW, which class are you using? KMeansModel.predict() accepts numpy.array; it will do the conversion for you. Davies

On Fri, Jan 9, 2015 at 4:45 AM, Meethu Mathew wrote: Hi, I am trying to send a numpy array as an argument to a function predict() in a class in spark/python/pyspark/mllib/clustering.py, which is passed to the function callMLlibFunc(name, *args) in spark/python/pyspark/mllib/common.py. Now the value is passed to the function _py2java(sc, obj). Here I am getting an exception

    Py4JJavaError: An error occurred while calling z:org.apache.spark.mllib.api.python.SerDe.loads.
    : net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for numpy.core.multiarray._reconstruct)
        at net.razorvine.pickle.objects.ClassDictConstructor.construct(ClassDictConstructor.java:23)
        at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:617)
        at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:170)
        at net.razorvine.pickle.Unpickler.load(Unpickler.java:84)
        at net.razorvine.pickle.Unpickler.loads(Unpickler.java:97)

Why is common._py2java(sc, obj) not handling the numpy array type? Please help. -- Regards, Meethu Mathew Engineer Flytxt www.flytxt.com | Visit our blog <http://blog.flytxt.com/> | Follow us <http://www.twitter.com/flytxt> | Connect on LinkedIn <http://www.linkedin.com/home?trk=hb_tab_home_top>
Python to Java object conversion of numpy array
Hi, I am trying to send a numpy array as an argument to a function predict() in a class in spark/python/pyspark/mllib/clustering.py, which is passed to the function callMLlibFunc(name, *args) in spark/python/pyspark/mllib/common.py. Now the value is passed to the function _py2java(sc, obj). Here I am getting an exception:

    Py4JJavaError: An error occurred while calling z:org.apache.spark.mllib.api.python.SerDe.loads.
    : net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for numpy.core.multiarray._reconstruct)
        at net.razorvine.pickle.objects.ClassDictConstructor.construct(ClassDictConstructor.java:23)
        at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:617)
        at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:170)
        at net.razorvine.pickle.Unpickler.load(Unpickler.java:84)
        at net.razorvine.pickle.Unpickler.loads(Unpickler.java:97)

Why is common._py2java(sc, obj) not handling the numpy array type? Please help. -- Regards, Meethu Mathew Engineer Flytxt www.flytxt.com | Visit our blog <http://blog.flytxt.com/> | Follow us <http://www.twitter.com/flytxt> | Connect on LinkedIn <http://www.linkedin.com/home?trk=hb_tab_home_top>
Re: Problems concerning implementing machine learning algorithm from scratch based on Spark
Hi, The GMMSpark.py you mentioned is the old one. The new code has been added to spark-packages and is available at http://spark-packages.org/package/11 . Have a look at the new code. We have used numpy functions in our code and didn't notice any slowdown because of this. Thanks & Regards, Meethu M

On Tuesday, 30 December 2014 11:50 AM, danqing0703 wrote: Hi all, I am trying to use some machine learning algorithms that are not included in MLlib, like Mixture Model and LDA (Latent Dirichlet Allocation), and I am using PySpark and Spark SQL. My problem is: I have some scripts that implement these algorithms, but I am not sure which parts I should change to make them fit big data. - Some very simple calculations may take much time if the data is too big, but constructing an RDD or SQLContext table also takes much time. I am really not sure if I should use map() and reduce() every time I need to make a calculation. - Also, there are some matrix/array-level calculations that cannot be implemented easily using only map() and reduce(), so functions from the numpy package have to be used. When the data is too big and we simply use the numpy functions, will it take too much time? I have found some scripts that are not from MLlib and were created by other developers (credits to Meethu Mathew from Flytxt, thanks for giving me insights! :)) Many thanks, and I look forward to getting feedback! Best, Danqing GMMSpark.py (7K) <http://apache-spark-developers-list.1001551.n3.nabble.com/attachment/9964/0/GMMSpark.py>
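On the numpy concern, a common pattern (a sketch, not taken from GMMSpark.py) is to keep numpy's vectorized math but apply it per partition, so the Python-level overhead is paid once per block of rows rather than once per row:

    import numpy as np

    def partition_sums(rows):
        # materialize one partition as a 2-D array and reduce it with a
        # single vectorized call instead of a Python loop per record
        block = np.array(list(rows))
        if block.size:
            yield block.sum(axis=0)

    # feature-wise totals over an RDD of equal-length numeric rows
    totals = rdd.mapPartitions(partition_sums).reduce(np.add)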
Re: [MLlib] Contributing Algorithm for Outlier Detection
Hi, I have a doubt regarding the input to your algorithm.

    val model = OutlierWithAVFModel.outliers(data: RDD[Vector[String]], percent: Double, sc: SparkContext)

Here our input data is an RDD[Vector[String]]. How can we create this RDD from a file? sc.textFile will simply give us an RDD[String]; how do we make it a Vector[String]? Could you please share a code snippet of this conversion if you have one. Regards, Meethu Mathew

On Friday 14 November 2014 10:02 AM, Meethu Mathew wrote: Hi Ashutosh, Please edit the README file. I think the following function call has changed now: model = OutlierWithAVFModel.outliers(master: String, input dir: String, percentage: Double) Regards, Meethu Mathew Engineer Flytxt

On Friday 14 November 2014 12:01 AM, Ashutosh wrote: Hi Anant, Please see the changes. https://github.com/codeAshu/Outlier-Detection-with-AVF-Spark/blob/master/OutlierWithAVFModel.scala I have changed the input format to Vector of String. I think we can also make it generic. Line 59 & 72: that counter will not affect parallelism, since it only works on one data point; it only does the indexing of the column. All other side effects have been removed. Thanks, Ashutosh

From: slcclimber [via Apache Spark Developers List] Sent: Tuesday, November 11, 2014 11:46 PM To: Ashutosh Trivedi (MT2013030) Subject: Re: [MLlib] Contributing Algorithm for Outlier Detection Mayur, Libsvm format sounds good to me. I could work on writing the tests if that helps you? Anant

On Nov 11, 2014 11:06 AM, "Ashutosh [via Apache Spark Developers List]" <[hidden email]> wrote: Hi Mayur, Vector data types are implemented using the breeze library; it is present at .../org/apache/spark/mllib/linalg Anant, one restriction I found is that a vector can only be of 'Double', so it actually restricts the user. What are your thoughts on the LibSVM format? Thanks for the comments. I was just trying to get away from those increment/decrement functions; they look ugly. Points are noted. I'll try to fix them soon. Tests are also required for the code. Regards, Ashutosh

From: Mayur Rustagi [via Apache Spark Developers List] Sent: Saturday, November 8, 2014 12:52 PM To: Ashutosh Trivedi (MT2013030) Subject: Re: [MLlib] Contributing Algorithm for Outlier Detection "We should take a vector instead, giving the user flexibility to decide data source/type" - What do you mean by vector datatype exactly? Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi <https://twitter.com/mayur_rustagi>

On Wed, Nov 5, 2014 at 6:45 AM, slcclimber <[hidden email]> wrote: Ashutosh, I still see a few issues. 1. On line 112 you are counting using a counter. Since this will happen in an RDD, the counter will cause issues. Also, it is not good functional style to use a filter function with a side effect. You could use randomSplit instead; this does the same thing without the side effect. 2. Similar shared usage of j in line 102 is going to be an issue as well. Also, the hash seed does not need to be sequential; it could be randomly generated or hashed on the values. 3. The compute function and trim scores still run on a comma-separated RDD. We should take a vector instead, giving the user flexibility to decide the data source/type. What if we want data from Hive tables or Parquet, JSON, or Avro formats? This is a very restrictive format.
With vectors the user has the choice of taking in whatever data format and converting it to vectors, instead of reading JSON files, creating a CSV file, and then working on that. 4. Similar use of counters in 54 and 65 is an issue. Basically, the shared-state counters are a huge issue that does not scale, since the processing of RDDs is distributed and the value j lives on the master. Anant

On Tue, Nov 4, 2014 at 7:22 AM, Ashutosh [via Apache Spark Developers List] <[hidden email]> wrote: Anant, I got rid of those increment/decrement functions and now the code is much cleaner. Please check; all your comments have been looked after. https://github.com/codeAshu/Outlier-Detection-with-AVF-Spark/blob/master/OutlierWithAVFModel.scala _Ashu
Re: [MLlib] Contributing Algorithm for Outlier Detection
Hi Ashutosh, Please edit the README file. I think the following function call has changed now:

    model = OutlierWithAVFModel.outliers(master: String, input dir: String, percentage: Double)

Regards, Meethu Mathew Engineer Flytxt

On Friday 14 November 2014 12:01 AM, Ashutosh wrote: Hi Anant, Please see the changes. https://github.com/codeAshu/Outlier-Detection-with-AVF-Spark/blob/master/OutlierWithAVFModel.scala I have changed the input format to Vector of String. I think we can also make it generic. Line 59 & 72: that counter will not affect parallelism, since it only works on one data point; it only does the indexing of the column. All other side effects have been removed. Thanks, Ashutosh

From: slcclimber [via Apache Spark Developers List] Sent: Tuesday, November 11, 2014 11:46 PM To: Ashutosh Trivedi (MT2013030) Subject: Re: [MLlib] Contributing Algorithm for Outlier Detection Mayur, Libsvm format sounds good to me. I could work on writing the tests if that helps you? Anant

On Nov 11, 2014 11:06 AM, "Ashutosh [via Apache Spark Developers List]" <[hidden email]> wrote: Hi Mayur, Vector data types are implemented using the breeze library; it is present at .../org/apache/spark/mllib/linalg Anant, one restriction I found is that a vector can only be of 'Double', so it actually restricts the user. What are your thoughts on the LibSVM format? Thanks for the comments. I was just trying to get away from those increment/decrement functions; they look ugly. Points are noted. I'll try to fix them soon. Tests are also required for the code. Regards, Ashutosh

From: Mayur Rustagi [via Apache Spark Developers List] Sent: Saturday, November 8, 2014 12:52 PM To: Ashutosh Trivedi (MT2013030) Subject: Re: [MLlib] Contributing Algorithm for Outlier Detection "We should take a vector instead, giving the user flexibility to decide data source/type" - What do you mean by vector datatype exactly? Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi <https://twitter.com/mayur_rustagi>

On Wed, Nov 5, 2014 at 6:45 AM, slcclimber <[hidden email]> wrote: Ashutosh, I still see a few issues. 1. On line 112 you are counting using a counter. Since this will happen in an RDD, the counter will cause issues. Also, it is not good functional style to use a filter function with a side effect. You could use randomSplit instead; this does the same thing without the side effect. 2. Similar shared usage of j in line 102 is going to be an issue as well. Also, the hash seed does not need to be sequential; it could be randomly generated or hashed on the values. 3. The compute function and trim scores still run on a comma-separated RDD. We should take a vector instead, giving the user flexibility to decide the data source/type. What if we want data from Hive tables or Parquet, JSON, or Avro formats? This is a very restrictive format. With vectors the user has the choice of taking in whatever data format and converting it to vectors, instead of reading JSON files, creating a CSV file, and then working on that. 4. Similar use of counters in 54 and 65 is an issue. Basically, the shared-state counters are a huge issue that does not scale, since the processing of RDDs is distributed and the value j lives on the master.
Anant

On Tue, Nov 4, 2014 at 7:22 AM, Ashutosh [via Apache Spark Developers List] <[hidden email]> wrote: Anant, I got rid of those increment/decrement functions and now the code is much cleaner. Please check; all your comments have been looked after. https://github.com/codeAshu/Outlier-Detection-with-AVF-Spark/blob/master/OutlierWithAVFModel.scala _Ashu

From: slcclimber [via Apache Spark Developers List] Sent: Friday, October 31, 2014 10:09 AM To: Ashutosh Trivedi (MT2013030) Subject: Re: [MLlib] Contributing Algorithm for Outlier Detection You should create a jira ticket to go with it as well. Thanks On Oct 30, 2014 10:38 PM, "Ashutosh [via Apache Spark Developers List]" <[hidden email
Re: Gaussian Mixture Model clustering
Hi Evan, Sorry, I forgot to mention it: I set the value of K to 10 for the benchmark study.

On Friday 19 September 2014 11:24 PM, Evan R. Sparks wrote: Hey Meethu - what are you setting "K" to in the benchmarks you show? This can greatly affect the runtime. On Thu, Sep 18, 2014 at 10:38 PM, Meethu Mathew <meethu.mat...@flytxt.com> wrote: Hi all, Please find attached the image of the benchmark results. The table in the previous mail got messed up. Thanks. On Friday 19 September 2014 10:55 AM, Meethu Mathew wrote: Hi all, We have come up with an initial distributed implementation of Gaussian Mixture Model in PySpark where the parameters are estimated using the Expectation-Maximization algorithm. Our current implementation considers a diagonal covariance matrix for each component. We did an initial benchmark study on a 2-node Spark standalone cluster where each node has 8 cores and 8 GB RAM; the Spark version used is 1.0.0. We also evaluated the Python version of k-means available in Spark on the same datasets. Below are the results from this benchmark study. The reported stats are averages over 10 runs. Tests were done on multiple datasets with varying numbers of features and instances.

    Instances    Dimensions   GMM avg time/iter   GMM 100 iters   KMeans (Python) avg time/iter   KMeans (Python) 100 iters
    0.7 million  13           7 s                 12 min          13 s                            26 min
    1.8 million  11           17 s                29 min          33 s                            53 min
    10 million   16           1.6 min             2.7 hr          1.2 min                         2 hr

We are interested in contributing this implementation as a patch to Spark. Does MLlib accept Python implementations? If not, can we contribute to the PySpark component? I have created a JIRA for the same: https://issues.apache.org/jira/browse/SPARK-3588 . How do I get the ticket assigned to myself? Please review and suggest how to take this forward. -- Regards, Meethu Mathew Engineer Flytxt www.flytxt.com | Visit our blog <http://blog.flytxt.com/> | Follow us <http://www.twitter.com/flytxt> | Connect on LinkedIn <http://www.linkedin.com/home?trk=hb_tab_home_top>
Re: Gaussian Mixture Model clustering
Hi all, Please find attached the image of the benchmark results. The table in the previous mail got messed up. Thanks.

On Friday 19 September 2014 10:55 AM, Meethu Mathew wrote: Hi all, We have come up with an initial distributed implementation of Gaussian Mixture Model in PySpark where the parameters are estimated using the Expectation-Maximization algorithm. Our current implementation considers a diagonal covariance matrix for each component. We did an initial benchmark study on a 2-node Spark standalone cluster where each node has 8 cores and 8 GB RAM; the Spark version used is 1.0.0. We also evaluated the Python version of k-means available in Spark on the same datasets. Below are the results from this benchmark study. The reported stats are averages over 10 runs. Tests were done on multiple datasets with varying numbers of features and instances.

    Instances    Dimensions   GMM avg time/iter   GMM 100 iters   KMeans (Python) avg time/iter   KMeans (Python) 100 iters
    0.7 million  13           7 s                 12 min          13 s                            26 min
    1.8 million  11           17 s                29 min          33 s                            53 min
    10 million   16           1.6 min             2.7 hr          1.2 min                         2 hr

We are interested in contributing this implementation as a patch to Spark. Does MLlib accept Python implementations? If not, can we contribute to the PySpark component? I have created a JIRA for the same: https://issues.apache.org/jira/browse/SPARK-3588 . How do I get the ticket assigned to myself? Please review and suggest how to take this forward. -- Regards, Meethu Mathew Engineer Flytxt Skype: meethu.mathew7 F: +91 471.2700202 www.flytxt.com | Visit our blog <http://blog.flytxt.com/> | Follow us <http://www.twitter.com/flytxt> | Connect on LinkedIn <http://www.linkedin.com/home?trk=hb_tab_home_top>
Gaussian Mixture Model clustering
Hi all, We have come up with an initial distributed implementation of Gaussian Mixture Model in PySpark where the parameters are estimated using the Expectation-Maximization algorithm. Our current implementation considers a diagonal covariance matrix for each component. We did an initial benchmark study on a 2-node Spark standalone cluster where each node has 8 cores and 8 GB RAM; the Spark version used is 1.0.0. We also evaluated the Python version of k-means available in Spark on the same datasets. Below are the results from this benchmark study. The reported stats are averages over 10 runs. Tests were done on multiple datasets with varying numbers of features and instances.

    Instances    Dimensions   GMM avg time/iter   GMM 100 iters   KMeans (Python) avg time/iter   KMeans (Python) 100 iters
    0.7 million  13           7 s                 12 min          13 s                            26 min
    1.8 million  11           17 s                29 min          33 s                            53 min
    10 million   16           1.6 min             2.7 hr          1.2 min                         2 hr

We are interested in contributing this implementation as a patch to Spark. Does MLlib accept Python implementations? If not, can we contribute to the PySpark component? I have created a JIRA for the same: https://issues.apache.org/jira/browse/SPARK-3588 . How do I get the ticket assigned to myself? Please review and suggest how to take this forward. -- Regards, Meethu Mathew Engineer Flytxt F: +91 471.2700202 www.flytxt.com | Visit our blog <http://blog.flytxt.com/> | Follow us <http://www.twitter.com/flytxt> | Connect on LinkedIn <http://www.linkedin.com/home?trk=hb_tab_home_top>
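For illustration, the per-point E-step under the diagonal-covariance model can be written as follows (a numpy sketch of the math only, not the distributed implementation benchmarked above):

    import numpy as np

    def log_gauss_diag(x, mean, var):
        # log density of a Gaussian with diagonal covariance 'var'
        return -0.5 * (np.sum(np.log(2.0 * np.pi * var))
                       + np.sum((x - mean) ** 2 / var))

    def responsibilities(x, weights, means, variances):
        # E-step for one point: posterior probability of each component
        logp = np.log(weights) + np.array(
            [log_gauss_diag(x, m, v) for m, v in zip(means, variances)])
        logp -= logp.max()          # stabilize before exponentiating
        p = np.exp(logp)
        return p / p.sum()

The M-step then re-estimates the weights, means, and per-dimension variances from these responsibilities; in the distributed version those sums are accumulated across partitions.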
Contribution to MLlib
Hi, I am interested in contributing a clustering algorithm to Spark's MLlib, and I am focusing on the Gaussian Mixture Model. But I saw a JIRA at https://spark-project.atlassian.net/browse/SPARK-952 regarding the same, so I would like to know whether a Gaussian Mixture Model is already implemented or not. Thanks & Regards, Meethu M
Contributions to MLlib
Hi, I would like to make some contributions to MLlib, and I have a few questions regarding the same. 1. Is there any reason the algorithms supported by MLlib are implemented in Scala? 2. Will contributions be accepted if they are written in Python or Java? Thanks, Meethu M