Mllib / kalman

2018-12-17 Thread robin . east
Pretty sure there is nothing in MLlib. This seems to be the most comprehensive coverage of implementing a Kalman filter in Spark: https://dzone.com/articles/kalman-filter-with-apache-spark-streaming-and-kafk. I’ve skimmed it rather than read it in detail, but it looks useful.

Re: Positive log-likelihood with Gaussian mixture

2018-05-30 Thread robin . east
Positive log likelihoods for continuous distributions are not unusual. You are evaluating a pdf, not a probability. For example, a univariate Gaussian pdf returns a value greater than 1 at the mean when the standard deviation goes below about 0.399 (a variance below about 0.159), at which point the log pdf is positive.
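A minimal sketch of the point above, in plain Python rather than Spark. The density of a univariate Gaussian at its mean is 1 / (std * sqrt(2*pi)), so it exceeds 1 once the standard deviation drops below 1 / sqrt(2*pi), roughly 0.399. The function name and the sample standard deviations are illustrative choices, not taken from the thread.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a univariate Gaussian with the given mean and std at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


# The density at the mean equals 1 exactly when std = 1 / sqrt(2*pi).
threshold_std = 1.0 / math.sqrt(2.0 * math.pi)

# Illustrative standard deviations on both sides of the threshold.
for std in (1.0, 0.5, 0.3, 0.1):
    density = gaussian_pdf(0.0, 0.0, std)
    print(f"std={std}: pdf at mean={density:.4f}, log pdf={math.log(density):+.4f}")
```

The log pdf changes sign between std=0.5 and std=0.3, which is why a tightly clustered mixture component can push the total log-likelihood above zero.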

Re: [Spark-Core] sc.textFile() explicit minPartitions did not work

2017-07-25 Thread Robin East
. --- Robin East Spark GraphX in Action Michael Malak and Robin East Manning Publications Co. http://www.manning.com/books/spark-graphx-in-action <http://www.manning.com/books/spark-graphx-in-action> > On 25 Jul 2017, at 13:21, Gokula Krishnan D <email2.

Re: Question on Spark's graph libraries

2017-03-10 Thread Robin East
I would love to know the answer to that too.

Re: Which streaming platform is best? Kafka or Spark Streaming?

2017-03-10 Thread Robin East

Re: Pretrained Word2Vec models

2016-12-05 Thread Robin East

Re: LinearRegressionWithSGD and Rank Features By Importance

2016-11-07 Thread Robin East

Re: LinearRegressionWithSGD and Rank Features By Importance

2016-11-04 Thread Robin East

Re: Confusion SparkSQL DataFrame OrderBy followed by GroupBY

2016-11-03 Thread Robin East
I agree with Koert. Relying on something because it appears to work when you test it can be dangerous if nothing in the API guarantees it. Going back quite a few years it used to be the case that Oracle would always return a group by with the rows in the order of the grouping key. This

Re: Confusion SparkSQL DataFrame OrderBy followed by GroupBY

2016-11-03 Thread Robin East
I don’t think the semantics of groupBy necessarily preserve ordering - whatever the implementation details or the observed behaviour. I would use a Window operation and order within the group. > On 3 Nov 2016, at 11:53, Rabin Banerjee wrote: > > Hi All , > >
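A plain-Python analogue of "order within the group", as a sketch only: rather than trusting the input order to survive grouping, each group is sorted explicitly. In Spark the corresponding construct would be a window specification partitioned by the grouping key and ordered by the sort column. The row fields (user, ts, value) and the helper name are hypothetical.

```python
from collections import defaultdict


def ordered_groups(rows, key, order_by):
    """Group rows by `key`, then sort every group explicitly by `order_by`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    # The ordering is imposed here, not inherited from the input order.
    return {k: sorted(v, key=lambda r: r[order_by]) for k, v in groups.items()}


# Hypothetical rows, deliberately out of order.
rows = [
    {"user": "a", "ts": 3, "value": "z"},
    {"user": "b", "ts": 1, "value": "x"},
    {"user": "a", "ts": 1, "value": "y"},
    {"user": "a", "ts": 2, "value": "w"},
]
result = ordered_groups(rows, key="user", order_by="ts")
print([r["ts"] for r in result["a"]])
```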

Re: Spark ML - Is IDF model reusable

2016-11-01 Thread Robin East

Re: Spark ML - Is IDF model reusable

2016-11-01 Thread Robin East
overfit your model.

Re: Need help with SVM

2016-10-26 Thread Robin East
It looks like the training is over-regularised - dropping the regParam to 0.1 or 0.01 should resolve the problem.

Re: Need help with SVM

2016-10-26 Thread Robin East
As per Aseem’s point, what do you get from data_rdd.toDF.groupBy("label").count.show? > On 25 Oct 2016, at 15:41, Aseem Bansal wrote: > > Is there any labeled point with label 0 in your dataset? > > On Tue, Oct 25, 2016 at 2:13 AM, aditya1702

Re: K-Mean retrieving Cluster Members

2016-10-19 Thread Robin East
or alternatively this should work (assuming parsedData is an RDD[Vector]): clusters.predict(parsedData) > On 18 Oct 2016, at 00:35, Reth RM wrote: > > I think I got it > > parsedData.foreach( > new VoidFunction() { > @Override

Re: Graphhopper/routing in Spark

2016-09-09 Thread Robin East

Re: MLib : Non Linear Optimization

2016-09-08 Thread Robin East
Do you have any particular algorithms in mind? If you state the most common algorithms you use then it might stimulate the appropriate comments. > On 8 Sep 2016, at 05:04, nsareen wrote: > > Any answer to this question group ? > > > > -- > View this message in context:

Re: Forecasting algorithms in spark ML

2016-09-08 Thread Robin East
Spark’s algorithms are summarised on this page (https://spark.apache.org/mllib/) and details are available from the MLlib user guide, which is linked from that page. > On 8 Sep 2016, at 05:30, Madabhattula Rajesh Kumar > wrote: > > Hi, > > Please

Re: Machine learning question (suing spark)- removing redundant factors while doing clustering

2016-08-08 Thread Robin East
Another approach is to use L1 regularisation, e.g. http://spark.apache.org/docs/latest/mllib-linear-methods.html#linear-least-squares-lasso-and-ridge-regression. This adds a penalty term to the regression equation to reduce model complexity. When you use L1 (as opposed to, say, L2) this tends to

Re: Questions about ml.random forest (only one decision tree?)

2016-08-04 Thread Robin East

Re: Is RowMatrix missing in org.apache.spark.ml package?

2016-07-27 Thread Robin East
Can you use the version from mllib?

Re: ML PipelineModel to be scored locally

2016-07-21 Thread Robin East
MLeap is another option (Apache licensed) https://github.com/TrueCar/mleap

Re: Saving data using tempTable versus save() method

2016-06-21 Thread Robin East
if you are able to trace the underlying oracle session you can see whether a commit has been called or not. > On 21 Jun 2016, at 09:57, Robin East <robin.e...@xense.co.uk> wrote: > > I’m not sure - I don’t know what those APIs do under the hood. It simply rang > a bell wit

Re: Saving data using tempTable versus save() method

2016-06-21 Thread Robin East

Re: Saving data using tempTable versus save() method

2016-06-21 Thread Robin East
random thought - do you need an explicit commit with the 2nd method? > On 20 Jun 2016, at 21:35, Mich Talebzadeh wrote: > > Hi, > > I have a DF based on a table and sorted and shown below > > This is fine and when I register as tempTable I can populate the

Re: What is the interpretation of Cores in Spark doc

2016-06-17 Thread Robin East

Re: What is the interpretation of Cores in Spark doc

2016-06-16 Thread Robin East
Mich, re "A core may have one or more threads": it would be more accurate to say that a core could run one or more threads scheduled for execution. Threads are a software/OS concept that represents executable code scheduled to run by the OS; a CPU, core or virtual core/virtual processor

Re: SPARK-13900 - Join with simple OR conditions take too long

2016-04-01 Thread Robin East
Yes and even today CBO (e.g. in Oracle) will still require hints in some cases so I think it is more like: RBO -> RBO + Hints -> CBO + Hints. Most relational databases meet significant numbers of corner cases where CBO plans simply don’t do what you would want. I don’t know enough about Spark

Re: Recommendation for a good book on Spark, beginner to moderate knowledge

2016-03-01 Thread Robin East

Re: Get all vertexes with outDegree equals to 0 with GraphX

2016-02-26 Thread Robin East
possibilities, the key point is that everything is just a graph transformation until you call an action on the resulting graph

Re: Reindexing in graphx

2016-02-24 Thread Robin East
It looks like you are adding vertices one by one; you definitely don’t want to do that. What happens when you batch together 400 vertices into an RDD and then add all 400 in one go?

Re: Constantly increasing Spark streaming heap memory

2016-02-22 Thread Robin East
Hi What you describe looks like normal behaviour for almost any Java/Scala application - objects are created on the heap until a limit point is reached and then GC clears away memory allocated to objects that are no longer referenced. Is there an issue you are experiencing? > On 21 Feb

Re: Feature importance for RandomForestRegressor in Spark 1.5

2016-01-15 Thread Robin East
Re 1: the pull requests reference the JIRA ticket, in this case https://issues.apache.org/jira/browse/SPARK-5133. The JIRA says it was released in 1.5.

Re: Shared memory between C++ process and Spark

2015-12-07 Thread Robin East
architectural sense.

Re: Shared memory between C++ process and Spark

2015-12-07 Thread Robin East
more answers in the user mailing list) >>>> >>>> First up let me say that I don’t really know how this could be done - I’m >>>> sure it would be possible with enough tinkering but it’s not clear what >>>> you are trying to achieve. Spark is a di

Re: Shared memory between C++ process and Spark

2015-12-07 Thread Robin East
memory-mapped file reading feature.

Re: Shared memory between C++ process and Spark

2015-12-07 Thread Robin East
>> list. >>> Yes, we have a distributed C++ application, that will store data on each >>> node in the cluster, and we hope to leverage Spark to do more fancy >>> analytics on those data. But we need high performance, that’s why we want >>> shared memory. >

Re: LDA topic modeling and Spark

2015-12-03 Thread Robin East
put to the actual words in each topic? A typical way is to look at the top 5, 10 or 20 words in each topic and use those to infer something about what the topic represents.
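The "top N words per topic" step can be sketched in plain Python, assuming each topic is available as a list of per-term weights aligned with a vocabulary list. The vocabulary, the weights and the helper name top_words below are all made up for illustration; they are not output from any real LDA run.

```python
def top_words(topic_weights, vocabulary, n=10):
    """Return the n vocabulary terms with the highest weight in one topic."""
    ranked = sorted(
        range(len(topic_weights)),
        key=lambda i: topic_weights[i],
        reverse=True,
    )
    return [vocabulary[i] for i in ranked[:n]]


# Hypothetical vocabulary and topic-term weights.
vocabulary = ["spark", "graph", "vertex", "model", "train", "edge"]
topics = [
    [0.05, 0.40, 0.30, 0.02, 0.03, 0.20],
    [0.10, 0.02, 0.03, 0.45, 0.35, 0.05],
]

summaries = [top_words(weights, vocabulary, n=3) for weights in topics]
for topic_id, words in enumerate(summaries):
    print(topic_id, words)
```

Reading the two lists side by side, the first topic looks graph-related and the second looks like model training, which is the kind of inference described above.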

Re: spark performance - executor computing time

2015-09-16 Thread Robin East
) that means the processing takes longer.

Re: Applying transformations on a JavaRDD using reflection

2015-09-09 Thread Robin East
Have you got some code already that demonstrates the problem? > On 9 Sep 2015, at 04:45, Nirmal Fernando wrote: > > Any thoughts? > > On Tue, Sep 8, 2015 at 3:37 PM, Nirmal Fernando > wrote: > Hi All, > > I'd like to apply a chain of

Re: Build k-NN graph for large dataset

2015-08-26 Thread Robin East
You could try dimensionality reduction (PCA or SVD) first. I would imagine that even if you could successfully compute similarities in the high-dimensional space you would probably run into the curse of dimensionality. On 26 Aug 2015, at 12:35, Jaonary Rabarisoa jaon...@gmail.com wrote: Dear

Re: Spark ec2 lunch problem

2015-08-24 Thread Robin East
then maybe someone on the list can help diagnose the specific problem.

Re: Spark return key value pair

2015-08-19 Thread Robin East
Dawid is right: if you did words.count it would be twice the number of input lines. You can use map like this:

    words = lines.map(mapper2)
    for i in words.take(10):
        msg = i[0] + ":" + i[1] + "\n"

Re: [MLLIB] Anyone tried correlation with RDD[Vector] ?

2015-07-23 Thread Robin East
The OP’s problem is that he gets this:

    <console>:47: error: type mismatch;
     found   : org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.DenseVector]
     required: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector]
    Note: org.apache.spark.mllib.linalg.DenseVector :

Re: Performance issue with Spak's foreachpartition method

2015-07-22 Thread Robin East
The first question I would ask is have you determined whether you have a performance issue writing to Oracle? In particular how many commits are you making? If you are issuing a lot of commits that would be a performance problem. Robin On 22 Jul 2015, at 19:11, diplomatic Guru

Re: Is spark suitable for real time query

2015-07-22 Thread Robin East
Real-time is, of course, relative but you’ve mentioned microsecond level. Spark is designed to process large amounts of data in a distributed fashion. No distributed system I know of could give any kind of guarantees at the microsecond level. Robin On 22 Jul 2015, at 11:14, Louis Hust

Re: Research ideas using spark

2015-07-15 Thread Robin East
Well said Will. I would add that you might want to investigate GraphChi which claims to be able to run a number of large-scale graph processing tasks on a workstation much quicker than a very large Hadoop cluster. It would be interesting to know how widely applicable the approach GraphChi takes

Re: Spark GraphX memory requirements + java.lang.OutOfMemoryError: GC overhead limit exceeded

2015-06-26 Thread Robin East
You’ll get this issue if you just take the first 2000 lines of that file. The problem is that triangleCount() expects srcId < dstId, which is not the case in the file (e.g. vertex 28). You can get round this by calling graph.convertToCanonicalEdges(), which removes bi-directional edges and ensures
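To make the canonical-orientation idea concrete, here is a plain-Python sketch (not GraphX's implementation): every edge is oriented so that the smaller vertex id comes first, and the duplicates that bi-directional pairs collapse into are dropped. The edge list is hypothetical, with vertex 28 borrowed from the example above.

```python
def canonical_edges(edges):
    """Orient each (src, dst) pair so src <= dst and drop duplicate pairs.

    Keeps the first occurrence of each undirected edge, in input order.
    """
    seen = set()
    result = []
    for src, dst in edges:
        edge = (src, dst) if src <= dst else (dst, src)
        if edge not in seen:
            seen.add(edge)
            result.append(edge)
    return result


# Hypothetical edge list: (28, 3) and (3, 28) are one bi-directional pair.
edges = [(28, 3), (3, 28), (1, 2), (5, 4)]
canonical = canonical_edges(edges)
print(canonical)
```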

Re: Extracting k-means cluster values along with centers?

2015-06-13 Thread Robin East
trying again On 13 Jun 2015, at 10:15, Robin East robin.e...@xense.co.uk wrote: Here’s a typical way to do it: import org.apache.spark.mllib.clustering.{KMeans, KMeansModel} import org.apache.spark.mllib.linalg.Vectors // Load and parse

Re: Linear Regression with SGD

2015-06-09 Thread Robin East
Hi Stephen How many is a very large number of iterations? SGD is notorious for requiring 100s or 1000s of iterations, also you may need to spend some time tweaking the step-size. In 1.4 there is an implementation of ElasticNet Linear Regression which is supposed to compare favourably with an

Re: Is LIMIT n in Spark SQL useful?

2015-05-05 Thread Robin East
4, 2015 at 8:06 AM, Robin East robin.e...@xense.co.uk wrote: and a further question - have you tried running this query in psql? what’s the performance like there? On 4 May 2015, at 16:04, Robin East robin.e...@xense.co.uk wrote: What query are you running? It may be the case that your

Re: Is LIMIT n in Spark SQL useful?

2015-05-04 Thread Robin East
What query are you running? It may be the case that your query requires PostgreSQL to do a large amount of work before identifying the first n rows On 4 May 2015, at 15:52, Yi Zhang zhangy...@yahoo.com.INVALID wrote: I am trying to query PostgreSQL using LIMIT(n) to reduce memory size and

Re: Is LIMIT n in Spark SQL useful?

2015-05-04 Thread Robin East
and a further question - have you tried running this query in psql? what’s the performance like there? On 4 May 2015, at 16:04, Robin East robin.e...@xense.co.uk wrote: What query are you running? It may be the case that your query requires PostgreSQL to do a large amount of work before

Re: GraphX path traversal

2015-03-04 Thread Robin East
because this is the root node. I'm planning to use the Pregel API but I'm not able to define messages and vertex program in that API. Could you please help me on this. Please let me know if you need more information. Regards, Rajesh On Tue, Mar 3, 2015 at 8:15 PM, Robin East robin.e

Re: PRNG in Scala

2015-03-03 Thread Robin East
And this SO post goes into details on the PRNG in Java http://stackoverflow.com/questions/9907303/does-java-util-random-implementation-differ-between-jres-or-platforms On 3 Mar 2015, at 16:15, Robin East robin.e...@xense.co.uk wrote: This is more of a java/scala question than spark - it uses

Re: GraphX path traversal

2015-03-03 Thread Robin East
Have you tried EdgeDirection.In? On 3 Mar 2015, at 16:32, Robin East robin.e...@xense.co.uk wrote: What about the following which can be run in spark shell: import org.apache.spark._ import org.apache.spark.graphx._ import org.apache.spark.rdd.RDD val vertexlist = Array((1L,One), (2L

Re: PRNG in Scala

2015-03-03 Thread Robin East
This is more of a java/scala question than spark - it uses java.util.Random : https://github.com/scala/scala/blob/2.11.x/src/library/scala/util/Random.scala On 3 Mar 2015, at 15:08, Vijayasarathy Kannan kvi...@vt.edu wrote: Hi, What pseudo-random-number generator does scala.util.Random
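For reference, the generator documented in the java.util.Random API docs is a 48-bit linear congruential generator. The following is a plain-Python sketch of that documented recurrence, for illustration only; the class and method names are this sketch's own, and it covers only the seed scrambling and the 32-bit nextInt path.

```python
class JavaRandomSketch:
    """Sketch of the 48-bit LCG documented for java.util.Random."""

    MULTIPLIER = 0x5DEECE66D
    INCREMENT = 0xB
    MASK = (1 << 48) - 1

    def __init__(self, seed):
        # The documented constructor scrambles the seed with the multiplier.
        self.state = (seed ^ self.MULTIPLIER) & self.MASK

    def next_bits(self, bits):
        """Advance the state and return its top `bits` bits (1 to 32)."""
        self.state = (self.state * self.MULTIPLIER + self.INCREMENT) & self.MASK
        value = self.state >> (48 - bits)
        # Mimic Java's cast to a signed 32-bit int.
        if value >= 1 << 31:
            value -= 1 << 32
        return value

    def next_int(self):
        return self.next_bits(32)


first = JavaRandomSketch(42)
second = JavaRandomSketch(42)
sequence_a = [first.next_int() for _ in range(5)]
sequence_b = [second.next_int() for _ in range(5)]
print(sequence_a)
```

Two instances built from the same seed produce the same sequence, which is the property most callers of a seeded scala.util.Random rely on.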

Re: GraphX path traversal

2015-03-03 Thread Robin East
to use the Pregel API but I'm not able to define messages and vertex program in that API. Could you please help me on this. Please let me know if you need more information. Regards, Rajesh On Tue, Mar 3, 2015 at 8:15 PM, Robin East robin.e...@xense.co.uk mailto:robin.e...@xense.co.uk wrote

Re: GraphX path traversal

2015-03-03 Thread Robin East
Rajesh, I'm not sure if I can help you; I don't yet understand the question. Could you restate what you are trying to do? On 2 Mar 2015, at 11:17, Madabhattula Rajesh Kumar mrajaf...@gmail.com wrote: Hi, I have a below edge list. How to find the parents

Re: Spark SQL odbc on Windows

2015-02-23 Thread Robin East
Have you looked at Kylin? http://www.ebaytechblog.com/2014/10/20/announcing-kylin-extreme-olap-engine-for-big-data/#.VOtXUUsqnUk Pretty new but has the backing of eBay. On 23 Feb 2015, at 15:38, Denny Lee denny.g@gmail.com wrote: Makes complete sense - I became a fan of Spark for pretty

Re: obtain cluster assignment in K-means

2015-02-12 Thread Robin East
KMeans.train actually returns a KMeansModel, so you can use the predict() method of the model, e.g. clusters.predict(pointToPredict) or clusters.predict(pointsToPredict). The first takes a single Vector, the second an RDD[Vector]. Robin On 12 Feb 2015, at 06:37, Shi Yu shiyu@gmail.com wrote: Hi there, I
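Conceptually, assigning a point to a cluster means picking the nearest centre. The sketch below shows that idea in plain Python for a single point and for a collection of points, mirroring the two predict() forms mentioned above. It is an illustration under the assumption of squared Euclidean distance, not MLlib's code, and the centres and points are made up.

```python
def predict(centers, point):
    """Index of the centre closest to `point` by squared Euclidean distance."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(range(len(centers)), key=lambda i: squared_distance(centers[i], point))


def predict_all(centers, points):
    """Cluster index for every point, in input order."""
    return [predict(centers, p) for p in points]


# Hypothetical cluster centres and points.
centers = [(0.0, 0.0), (10.0, 10.0)]
points = [(1.0, 0.5), (9.0, 11.0), (4.0, 4.0)]

assignments = predict_all(centers, points)
print(list(zip(points, assignments)))
```

Pairing each point with its assignment, as in the last line, is the usual way to recover cluster membership from the indices.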

Re: is there a master for spark cluster in ec2

2015-02-02 Thread Robin East
There is a file $SPARK_HOME/conf/spark-env.sh which comes readily configured with the MASTER variable. So if you start pyspark or spark-shell from the ec2 login machine you will connect to the Spark master. On 29 Jan 2015, at 01:11, Mohit Singh mohit1...@gmail.com wrote: Hi, Probably a

Re: Is Apache Spark less accurate than Scikit Learn?

2015-01-22 Thread Robin East
ask because I encountered this situation on other, larger datasets, so this is not an isolated case (though being the simplest example I could think of I would imagine that it's somewhat indicative of general behaviour) On Thu, Jan 22, 2015 at 1:57 AM, Robin East robin.e...@xense.co.uk wrote

Re: Is Apache Spark less accurate than Scikit Learn?

2015-01-21 Thread Robin East
I don’t get those results. I get: spark 0.14, scikit-learn 0.85. The scikit-learn MSE is due to the very low eta0 setting. Tweak that to 0.1 and push iterations to 400 and you get an MSE ~= 0. Of course the coefficients are both ~1 and the intercept ~0. Similarly if you change the
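The effect of a too-small step size can be reproduced with a few lines of plain Python. This is a sketch using full-batch gradient descent on a toy y = x dataset, not scikit-learn's SGDRegressor or Spark's LinearRegressionWithSGD; the low step size of 0.0001 and the data are assumptions chosen for illustration, while 0.1 and 400 iterations come from the message above.

```python
def fit_linear(xs, ys, step_size, iterations):
    """Fit y = w * x + b by full-batch gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(iterations):
        grad_w = sum(2.0 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2.0 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= step_size * grad_w
        b -= step_size * grad_b
    return w, b


def mse(xs, ys, w, b):
    """Mean squared error of the line y = w * x + b on the data."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Toy data with true slope 1 and intercept 0 (hypothetical).
xs = [i / 10.0 for i in range(-10, 11)]
ys = list(xs)

slow = fit_linear(xs, ys, step_size=0.0001, iterations=400)
fast = fit_linear(xs, ys, step_size=0.1, iterations=400)

print("low step size :", slow, mse(xs, ys, *slow))
print("step size 0.1 :", fast, mse(xs, ys, *fast))
```

With the low step size the slope has barely moved from its starting value of 0 after 400 iterations, so the error stays close to that of the untrained model; with 0.1 the slope reaches ~1 and the intercept ~0.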

Re: Current Build Gives HTTP ERROR

2015-01-13 Thread Robin East
I’ve just pulled down the latest commits from GitHub and done the following:
1) mvn clean package -DskipTests builds fine
2) ./bin/spark-shell works
3) ran the SparkPi example with no problems: ./bin/run-example SparkPi 10
4) Started a master with ./sbin/start-master.sh, grabbed the MasterWebUI