RE: Join implementation in SparkSQL

2015-01-15 Thread Cheng, Hao
Not so sure about your question, but SparkStrategies.scala and Optimizer.scala are a good start if you want details of the join implementation or optimization. -Original Message- From: Andrew Ash [mailto:and...@andrewash.com] Sent: Friday, January 16, 2015 4:52 AM To: Reynold

Re: LinearRegressionWithSGD accuracy

2015-01-15 Thread Devl Devel
It was a bug in the code; however, adding the step parameter got the results to work. Mean Squared Error = 2.610379825794694E-5. I've also opened a jira to put the step parameter in the examples so that people new to mllib have a way to improve the MSE. https://issues.apache.org/jira/browse/SPARK-

Re: Spark SQL API changes and stabilization

2015-01-15 Thread Reynold Xin
We can look into some sort of util class in sql.types for general type inference. In general many methods in JsonRDD might be useful enough to extract. Those will probably be marked as DeveloperAPI with less stability guarantees. On Thu, Jan 15, 2015 at 12:16 PM, Corey Nolet wrote: > Reynold, >

Re: Join implementation in SparkSQL

2015-01-15 Thread Andrew Ash
What Reynold is describing is a performance optimization in implementation, but the semantics of the join (cartesian product plus relational algebra filter) should be the same and produce the same results. On Thu, Jan 15, 2015 at 1:36 PM, Reynold Xin wrote: > It's a bunch of strategies defined h
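The equivalence Andrew describes can be sketched in plain Scala collections (not Spark; the object and value names here are made up for illustration): a hash-based equi-join produces exactly the same rows as the relational-algebra definition, a cartesian product followed by an equality filter.

```scala
object JoinSemantics extends App {
  // Two tiny hypothetical relations keyed by an Int.
  val left  = Seq((1, "a"), (2, "b"), (3, "c"))
  val right = Seq((2, "x"), (3, "y"), (4, "z"))

  // Relational-algebra definition: cartesian product, then filter.
  val viaCartesian =
    for ((lk, lv) <- left; (rk, rv) <- right; if lk == rk)
      yield (lk, lv, rv)

  // Implementation-style optimization: hash the right side, probe with the left.
  val rightByKey = right.groupBy(_._1)
  val viaHashJoin =
    for ((lk, lv) <- left; (_, rv) <- rightByKey.getOrElse(lk, Seq.empty))
      yield (lk, lv, rv)

  // Same semantics, very different cost on large inputs.
  assert(viaCartesian == viaHashJoin)
  println(viaCartesian)
}
```

The optimizer's job is to pick the cheap form whenever the join predicate allows it, without changing the result set.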

Re: LinearRegressionWithSGD accuracy

2015-01-15 Thread Joseph Bradley
It looks like you're training on the non-scaled data but testing on the scaled data. Have you tried this training & testing on only the scaled data? On Thu, Jan 15, 2015 at 10:42 AM, Devl Devel wrote: > Thanks, that helps a bit at least with the NaN but the MSE is still very > high even with th
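The consistency point can be shown with a minimal plain-Scala sketch (no MLlib; values are illustrative): fit the scaling statistics on the training data only, then apply that same transform to both training and test sets.

```scala
object ScalingDemo extends App {
  // Hypothetical unscaled feature values (illustrative only).
  val train = Seq(100.0, 200.0, 300.0)
  val test  = Seq(150.0, 250.0)

  // Fit the scaler on the training data only...
  val mean = train.sum / train.length
  val std  = math.sqrt(train.map(x => math.pow(x - mean, 2)).sum / train.length)

  // ...then apply the SAME transform to both sets; never refit on test data.
  def scale(xs: Seq[Double]): Seq[Double] = xs.map(x => (x - mean) / std)

  println(scale(train))
  println(scale(test))
}
```

Training on raw values but evaluating on standardized ones (or vice versa) means the model and the evaluation see different feature spaces, which inflates the MSE.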

Re: Spark SQL API changes and stabilization

2015-01-15 Thread Corey Nolet
Reynold, One thing I'd like worked into the public portion of the API is the json inferencing logic that creates a Set[(String, StructType)] out of Map[String,Any]. SPARK-5260 addresses this so that I can use Accumulators to infer my schema instead of forcing a map/reduce phase to occur on an RDD
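The kind of inference utility being asked for can be sketched in plain Scala; this is a hypothetical simplification (the type names mimic Spark SQL's DataType hierarchy, but none of the real JsonRDD code is reproduced here), mapping a Map[String, Any] record to a set of (field, type) pairs.

```scala
object InferSchema extends App {
  // Simplified stand-in for Spark SQL's DataType names (illustrative only).
  def typeName(v: Any): String = v match {
    case _: String        => "StringType"
    case _: Int | _: Long => "LongType"
    case _: Double        => "DoubleType"
    case _: Boolean       => "BooleanType"
    case _: Map[_, _]     => "StructType"
    case null             => "NullType"
    case _                => "BinaryType"
  }

  // One parsed JSON-like record.
  val record: Map[String, Any] = Map("name" -> "ada", "age" -> 36, "score" -> 9.5)

  // The Set[(String, type)] shape mentioned in the thread.
  val fields: Set[(String, String)] =
    record.map { case (k, v) => (k, typeName(v)) }.toSet

  println(fields)
}
```

Because such per-record sets can be merged commutatively, they are a natural fit for accumulation rather than a dedicated map/reduce pass.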

Re: LinearRegressionWithSGD accuracy

2015-01-15 Thread Devl Devel
Thanks, that helps a bit at least with the NaN but the MSE is still very high even with that step size and 10k iterations: training Mean Squared Error = 3.3322561285919316E7 Does this method need say 100k iterations? On Thu, Jan 15, 2015 at 5:42 PM, Robin East wrote: > -dev, +user > > You

Re: Graphx TripletFields written in Java?

2015-01-15 Thread Reynold Xin
The static fields - Scala unfortunately can't express JVM static fields. Those will be important once we provide the Java API. On Thu, Jan 15, 2015 at 8:58 AM, Jay Hutfles wrote: > Hi all, > Does anyone know the reasoning behind implementing > org.apache.spark.graphx.TripletFields in Java in

Graphx TripletFields written in Java?

2015-01-15 Thread Jay Hutfles
Hi all, Does anyone know the reasoning behind implementing org.apache.spark.graphx.TripletFields in Java instead of Scala? It doesn't look like there's anything in there that couldn't be done in Scala. Nothing serious, just curious. Thanks! -Jay

Re: Join implementation in SparkSQL

2015-01-15 Thread Reynold Xin
It's a bunch of strategies defined here: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala In most common use cases (e.g. inner equi join), filters are pushed below the join or into the join. Doing a cartesian product followed

Re: Spark SQL API changes and stabilization

2015-01-15 Thread Reynold Xin
Alex, I didn't communicate properly. By "private", I simply meant the expectation that it is not a public API. The plan is to still omit it from the scaladoc/javadoc generation, but no language visibility modifier will be applied on them. After 1.3, you will likely no longer need to use things in

Re: LinearRegressionWithSGD accuracy

2015-01-15 Thread Robin East
-dev, +user You’ll need to set the gradient descent step size to something small - a bit of trial and error shows that 0.0001 works. You’ll need to create a LinearRegressionWithSGD instance and set the step size explicitly: val lr = new LinearRegressionWithSGD() lr.optimizer.setStepSize(0.
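Why the step size dominates here can be seen with a toy full-batch gradient descent in plain Scala (no MLlib; all names and values are illustrative) on the same perfectly linear y = x data from the thread: a small step converges, a large step diverges.

```scala
object StepSizeDemo extends App {
  // y = x exactly, mirroring the perfectly linear dataset in the thread.
  val data = (1 to 10).map(i => (i.toDouble, i.toDouble))

  // Full-batch gradient descent on mean squared error for the model y_hat = w * x.
  def fit(step: Double, iters: Int): Double = {
    var w = 0.0
    for (_ <- 1 to iters) {
      // d/dw of mean((w*x - y)^2)
      val grad = data.map { case (x, y) => 2 * (w * x - y) * x }.sum / data.length
      w -= step * grad
    }
    w
  }

  println(fit(step = 0.001, iters = 1000)) // converges near w = 1.0
  println(fit(step = 0.1,   iters = 1000)) // diverges (NaN / Infinity)
}
```

With unscaled features the gradient magnitudes grow with the feature values, so the stable step-size range shrinks; scaling the features is the other half of the fix.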

LinearRegressionWithSGD accuracy

2015-01-15 Thread devl.development
From what I gather, you use LinearRegressionWithSGD to predict y or the response variable given a feature vector x. In a simple example I used a perfectly linear dataset such that x=y y,x 1,1 2,2 ... 1,1 Using the out-of-box example from the website (with and without scaling): val dat

Re: Implementing TinkerPop on top of GraphX

2015-01-15 Thread David Robinson
I am new to Spark and GraphX; however, I use Tinkerpop-backed graphs and think the idea of using Tinkerpop as the API for GraphX is a great idea, and I hope you are still headed in that direction. I noticed that Tinkerpop 3 is moving into the Apache family: http://wiki.apache.org/incubator/TinkerPopP

Re: Spark SQL API changes and stabilization

2015-01-15 Thread Alessandro Baretta
Reynold, Thanks for the heads up. In general, I strongly oppose the use of "private" to restrict access to certain parts of the API, the reason being that I might find the need to use some of the internals of a library from my own project. I find that a @DeveloperAPI annotation serves the same pur

Join implementation in SparkSQL

2015-01-15 Thread Alessandro Baretta
Hello, Where can I find docs about how joins are implemented in SparkSQL? In particular, I'd like to know whether they are implemented according to their relational algebra definition as filters on top of a cartesian product. Thanks, Alex

Spark 1.2.0: MissingRequirementError

2015-01-15 Thread PierreB
Hi guys, A few people seem to have the same problem with Spark 1.2.0 so I figured I would push it here. see: http://apache-spark-user-list.1001560.n3.nabble.com/MissingRequirementError-with-spark-td21149.html In a nutshell, for sbt test to work, we now need to fork a JVM and also give more memor
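Along the lines described, a build.sbt fragment (sbt 0.13-era syntax, matching Spark 1.2.0's timeframe) that forks a separate JVM for tests and raises its memory; the sizes are guesses for illustration, not recommendations:

```scala
// Run tests in a forked JVM rather than inside sbt's own JVM.
fork in Test := true

// javaOptions only takes effect when forking; the values are illustrative.
javaOptions in Test ++= Seq("-Xmx2g", "-XX:MaxPermSize=256m")
```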

Spark client reconnect to driver in yarn-cluster deployment mode

2015-01-15 Thread preeze
From the official spark documentation (http://spark.apache.org/docs/1.2.0/running-on-yarn.html): "In yarn-cluster mode, the Spark driver runs inside an application master process which is managed by YARN on the cluster, and the client can go away after initiating the application." Is there any d

Re: SciSpark: NASA AIST14 proposal

2015-01-15 Thread andy petrella
Hey Chris, This sounds amazing! You might have to check also with the Geotrellis team (Rob and Eugene for instance) who have already covered quite interesting ground dealing with tiles as RDD element. Some algebra operations are there, but also thingies l