GraphX TripletFields written in Java?
Hi all, Does anyone know the reasoning behind implementing org.apache.spark.graphx.TripletFields in Java instead of Scala? It doesn't look like there's anything in there that couldn't be done in Scala. Nothing serious, just curious. Thanks! -Jay
Re: GraphX TripletFields written in Java?
The static fields - Scala can't express JVM static fields, unfortunately. Those will be important once we provide the Java API.

On Thu, Jan 15, 2015 at 8:58 AM, Jay Hutfles jayhutf...@gmail.com wrote:
Hi all, Does anyone know the reasoning behind implementing org.apache.spark.graphx.TripletFields in Java instead of Scala? It doesn't look like there's anything in there that couldn't be done in Scala. Nothing serious, just curious. Thanks! -Jay
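For context: a Scala object can only emit static forwarder methods, not true static fields, so defining the constants in Java keeps them directly usable from Java call sites. A minimal sketch of how they are used from Scala, assuming a hypothetical graph: Graph[Int, Int] and the aggregateMessages API added in 1.2:

    import org.apache.spark.graphx.{Graph, TripletFields}

    // TripletFields.Src is a JVM static field defined in the Java source, so it
    // reads the same from both Java and Scala. Passing it tells GraphX that
    // sendMsg only touches the source attribute, so destination attributes
    // need not be shipped to the edge partitions.
    val summedInNeighborAttrs = graph.aggregateMessages[Int](
      sendMsg = ctx => ctx.sendToDst(ctx.srcAttr),
      mergeMsg = _ + _,
      tripletFields = TripletFields.Src)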
Re: Spark SQL API changes and stabilization
Alex, I didn't communicate properly. By private, I simply meant the expectation that it is not a public API. The plan is still to omit it from the scaladoc/javadoc generation, but no language visibility modifier will be applied. After 1.3, you will likely no longer need to use things in the sql.catalyst package directly. Programmatically constructing SchemaRDDs is going to be a first-class public API. Data types have already been moved out of the sql.catalyst package and now live in sql.types; they are becoming stable public APIs. When the data frame patch is submitted, you will see a public expression library as well. There will be little reason for end users or library developers to hook into things in sql.catalyst. The bravest and most advanced can still use them, with the expectation that they are subject to change.

On Thu, Jan 15, 2015 at 7:53 AM, Alessandro Baretta alexbare...@gmail.com wrote:
Reynold, Thanks for the heads up. In general, I strongly oppose the use of private to restrict access to certain parts of the API, the reason being that I might find the need to use some of the internals of a library from my own project. I find that a @DeveloperAPI annotation serves the same purpose as private without imposing unnecessary restrictions: it discourages people from using the annotated API and reserves the right for the core developers to change it suddenly in backwards-incompatible ways. In particular, I would like to express the desire that the APIs to programmatically construct SchemaRDDs from an RDD[Row] and a StructType remain public. All the SparkSQL data type objects should be exposed by the API, and the Jekyll build should not hide the docs as it does now. Thanks. Alex

On Wed, Jan 14, 2015 at 9:45 PM, Reynold Xin r...@databricks.com wrote:
Hi Spark devs, Given the growing number of developers that are building on Spark SQL, we would like to stabilize the API in 1.3 so users and developers can be confident building on it. This also gives us a chance to improve the API. In particular, we are proposing the following major changes. These should have no impact on most users (i.e. those running SQL through the JDBC client or the SQLContext.sql method).
1. Everything in the sql.catalyst package is private to the project.
2. Redesign the SchemaRDD DSL (SPARK-5097): We initially added the DSL for SchemaRDD and logical plans in order to construct test cases. We have received feedback from a lot of users that the DSL can be incredibly powerful. In 1.3, we’d like to refactor the DSL to make it suitable not only for constructing test cases but also for everyday data pipelines. The new SchemaRDD API is inspired by the data frame concept in Pandas and R.
3. Reconcile the Java and Scala APIs (SPARK-5193): We would like to expose one set of APIs that will work for both Java and Scala. The current Java API (sql.api.java) does not share any common ancestor with the Scala API. This has led to a high maintenance burden for us as Spark developers and for library developers. We propose to eliminate the Java-specific API and simply work on the existing Scala API to make it also usable from Java. This will make Java a first-class citizen alongside Scala. This effectively means that all public classes should be usable from both Scala and Java, including SQLContext, HiveContext, SchemaRDD, data types, and the aforementioned DSL. Again, this should have no impact on most users, since the existing DSL is rarely used by end users.
However, library developers might need to change the import statements because we are moving certain classes around. We will keep you posted as patches are merged.
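To make the construction path concrete, here is a minimal sketch of building a SchemaRDD programmatically from an RDD[Row] and a StructType, assuming an existing SparkContext sc and the sql.types layout described above (on 1.2 the same type names are exported from org.apache.spark.sql, and applySchema is the 1.x name for this API):

    import org.apache.spark.sql.{Row, SQLContext}
    import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

    val sqlContext = new SQLContext(sc)

    // Declare the schema explicitly instead of relying on case-class reflection.
    val schema = StructType(Seq(
      StructField("name", StringType, nullable = true),
      StructField("age", IntegerType, nullable = true)))

    val rows = sc.parallelize(Seq(Row("alice", 30), Row("bob", 25)))

    val people = sqlContext.applySchema(rows, schema)
    people.registerTempTable("people")
    sqlContext.sql("SELECT name FROM people WHERE age > 26").collect()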
Re: Join implementation in SparkSQL
It's a bunch of strategies defined here: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala
In the most common use cases (e.g. an inner equi-join), filters are pushed below the join or into the join; doing a cartesian product followed by a filter is too expensive.

On Thu, Jan 15, 2015 at 7:39 AM, Alessandro Baretta alexbare...@gmail.com wrote:
Hello, Where can I find docs about how joins are implemented in SparkSQL? In particular, I'd like to know whether they are implemented according to their relational algebra definition, as filters on top of a cartesian product. Thanks, Alex
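As a quick illustration (a sketch, assuming two temp tables a and b have already been registered on a sqlContext), printing the physical plan shows which of those strategies was selected:

    // For an inner equi-join the planner typically picks a hash-join variant
    // rather than CartesianProduct followed by Filter.
    val joined = sqlContext.sql("SELECT a.id, b.value FROM a JOIN b ON a.id = b.id")
    println(joined.queryExecution.executedPlan)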
Re: LinearRegressionWithSGD accuracy
Thanks, that helps a bit, at least with the NaN, but the MSE is still very high even with that step size and 10k iterations: training Mean Squared Error = 3.3322561285919316E7. Does this method need, say, 100k iterations?

On Thu, Jan 15, 2015 at 5:42 PM, Robin East robin.e...@xense.co.uk wrote:
-dev, +user
You’ll need to set the gradient descent step size to something small - a bit of trial and error shows that 0.0001 works. You’ll need to create a LinearRegressionWithSGD instance and set the step size explicitly:

    val lr = new LinearRegressionWithSGD()
    lr.optimizer.setStepSize(0.0001)
    lr.optimizer.setNumIterations(100)
    val model = lr.run(parsedData)

On 15 Jan 2015, at 16:46, devl.development devl.developm...@gmail.com wrote:
From what I gather, you use LinearRegressionWithSGD to predict y, the response variable, given a feature vector x. In a simple example I used a perfectly linear dataset such that x=y:

    y,x
    1,1
    2,2
    ...
    1,1

Using the out-of-the-box example from the website (with and without scaling):

    import org.apache.spark.mllib.feature.StandardScaler
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

    val data = sc.textFile(file)
    val parsedData = data.map { line =>
      val parts = line.split(',')
      LabeledPoint(parts(1).toDouble, Vectors.dense(parts(0).toDouble)) // y and x
    }
    val scaler = new StandardScaler(withMean = true, withStd = true)
      .fit(parsedData.map(x => x.features))
    val scaledData = parsedData
      .map(x => LabeledPoint(x.label, scaler.transform(Vectors.dense(x.features.toArray))))

    // Building the model
    val numIterations = 100
    val model = LinearRegressionWithSGD.train(parsedData, numIterations)

    // Evaluate model on training examples and compute training error
    // (tried using both scaledData and parsedData)
    val valuesAndPreds = scaledData.map { point =>
      val prediction = model.predict(point.features)
      (point.label, prediction)
    }
    val MSE = valuesAndPreds.map { case (v, p) => math.pow((v - p), 2) }.mean()
    println("training Mean Squared Error = " + MSE)

Both scaled and unscaled attempts give: training Mean Squared Error = NaN. I've even tried x, y+(sample noise from normal with mean 0 and stddev 1) and it still comes up with the same thing. Is this not supposed to work for x and y, i.e. 2-dimensional plots? Is there something I'm missing or wrong in the code above? Or is there a limitation in the method? Thanks for any advice.
Re: LinearRegressionWithSGD accuracy
-dev, +user
You’ll need to set the gradient descent step size to something small - a bit of trial and error shows that 0.0001 works. You’ll need to create a LinearRegressionWithSGD instance and set the step size explicitly:

    val lr = new LinearRegressionWithSGD()
    lr.optimizer.setStepSize(0.0001)
    lr.optimizer.setNumIterations(100)
    val model = lr.run(parsedData)

On 15 Jan 2015, at 16:46, devl.development devl.developm...@gmail.com wrote:
From what I gather, you use LinearRegressionWithSGD to predict y, the response variable, given a feature vector x. In a simple example I used a perfectly linear dataset such that x=y:

    y,x
    1,1
    2,2
    ...
    1,1

Using the out-of-the-box example from the website (with and without scaling):

    import org.apache.spark.mllib.feature.StandardScaler
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

    val data = sc.textFile(file)
    val parsedData = data.map { line =>
      val parts = line.split(',')
      LabeledPoint(parts(1).toDouble, Vectors.dense(parts(0).toDouble)) // y and x
    }
    val scaler = new StandardScaler(withMean = true, withStd = true)
      .fit(parsedData.map(x => x.features))
    val scaledData = parsedData
      .map(x => LabeledPoint(x.label, scaler.transform(Vectors.dense(x.features.toArray))))

    // Building the model
    val numIterations = 100
    val model = LinearRegressionWithSGD.train(parsedData, numIterations)

    // Evaluate model on training examples and compute training error
    // (tried using both scaledData and parsedData)
    val valuesAndPreds = scaledData.map { point =>
      val prediction = model.predict(point.features)
      (point.label, prediction)
    }
    val MSE = valuesAndPreds.map { case (v, p) => math.pow((v - p), 2) }.mean()
    println("training Mean Squared Error = " + MSE)

Both scaled and unscaled attempts give: training Mean Squared Error = NaN. I've even tried x, y+(sample noise from normal with mean 0 and stddev 1) and it still comes up with the same thing. Is this not supposed to work for x and y, i.e. 2-dimensional plots? Is there something I'm missing or wrong in the code above? Or is there a limitation in the method? Thanks for any advice.
Re: Implementing TinkerPop on top of GraphX
I am new to Spark and GraphX; however, I use TinkerPop-backed graphs and think the idea of using TinkerPop as the API for GraphX is a great one, and I hope you are still headed in that direction. I noticed that TinkerPop 3 is moving into the Apache family: http://wiki.apache.org/incubator/TinkerPopProposal which might alleviate concerns about having an API definition outside of Spark. Thanks,
Re: Spark SQL API changes and stabilization
Reynold, Thanks for the heads up. In general, I strongly oppose the use of private to restrict access to certain parts of the API, the reason being that I might find the need to use some of the internals of a library from my own project. I find that a @DeveloperAPI annotation serves the same purpose as private without imposing unnecessary restrictions: it discourages people from using the annotated API and reserves the right for the core developers to change it suddenly in backwards-incompatible ways. In particular, I would like to express the desire that the APIs to programmatically construct SchemaRDDs from an RDD[Row] and a StructType remain public. All the SparkSQL data type objects should be exposed by the API, and the Jekyll build should not hide the docs as it does now. Thanks. Alex

On Wed, Jan 14, 2015 at 9:45 PM, Reynold Xin r...@databricks.com wrote:
Hi Spark devs, Given the growing number of developers that are building on Spark SQL, we would like to stabilize the API in 1.3 so users and developers can be confident building on it. This also gives us a chance to improve the API. In particular, we are proposing the following major changes. These should have no impact on most users (i.e. those running SQL through the JDBC client or the SQLContext.sql method).
1. Everything in the sql.catalyst package is private to the project.
2. Redesign the SchemaRDD DSL (SPARK-5097): We initially added the DSL for SchemaRDD and logical plans in order to construct test cases. We have received feedback from a lot of users that the DSL can be incredibly powerful. In 1.3, we’d like to refactor the DSL to make it suitable not only for constructing test cases but also for everyday data pipelines. The new SchemaRDD API is inspired by the data frame concept in Pandas and R.
3. Reconcile the Java and Scala APIs (SPARK-5193): We would like to expose one set of APIs that will work for both Java and Scala. The current Java API (sql.api.java) does not share any common ancestor with the Scala API. This has led to a high maintenance burden for us as Spark developers and for library developers. We propose to eliminate the Java-specific API and simply work on the existing Scala API to make it also usable from Java. This will make Java a first-class citizen alongside Scala. This effectively means that all public classes should be usable from both Scala and Java, including SQLContext, HiveContext, SchemaRDD, data types, and the aforementioned DSL. Again, this should have no impact on most users, since the existing DSL is rarely used by end users. However, library developers might need to change the import statements because we are moving certain classes around. We will keep you posted as patches are merged.
Re: LinearRegressionWithSGD accuracy
It looks like you're training on the non-scaled data but testing on the scaled data. Have you tried training and testing on only the scaled data?

On Thu, Jan 15, 2015 at 10:42 AM, Devl Devel devl.developm...@gmail.com wrote:
Thanks, that helps a bit, at least with the NaN, but the MSE is still very high even with that step size and 10k iterations: training Mean Squared Error = 3.3322561285919316E7. Does this method need, say, 100k iterations?

On Thu, Jan 15, 2015 at 5:42 PM, Robin East robin.e...@xense.co.uk wrote:
-dev, +user
You’ll need to set the gradient descent step size to something small - a bit of trial and error shows that 0.0001 works. You’ll need to create a LinearRegressionWithSGD instance and set the step size explicitly:

    val lr = new LinearRegressionWithSGD()
    lr.optimizer.setStepSize(0.0001)
    lr.optimizer.setNumIterations(100)
    val model = lr.run(parsedData)

On 15 Jan 2015, at 16:46, devl.development devl.developm...@gmail.com wrote:
From what I gather, you use LinearRegressionWithSGD to predict y, the response variable, given a feature vector x. In a simple example I used a perfectly linear dataset such that x=y:

    y,x
    1,1
    2,2
    ...
    1,1

Using the out-of-the-box example from the website (with and without scaling):

    import org.apache.spark.mllib.feature.StandardScaler
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

    val data = sc.textFile(file)
    val parsedData = data.map { line =>
      val parts = line.split(',')
      LabeledPoint(parts(1).toDouble, Vectors.dense(parts(0).toDouble)) // y and x
    }
    val scaler = new StandardScaler(withMean = true, withStd = true)
      .fit(parsedData.map(x => x.features))
    val scaledData = parsedData
      .map(x => LabeledPoint(x.label, scaler.transform(Vectors.dense(x.features.toArray))))

    // Building the model
    val numIterations = 100
    val model = LinearRegressionWithSGD.train(parsedData, numIterations)

    // Evaluate model on training examples and compute training error
    // (tried using both scaledData and parsedData)
    val valuesAndPreds = scaledData.map { point =>
      val prediction = model.predict(point.features)
      (point.label, prediction)
    }
    val MSE = valuesAndPreds.map { case (v, p) => math.pow((v - p), 2) }.mean()
    println("training Mean Squared Error = " + MSE)

Both scaled and unscaled attempts give: training Mean Squared Error = NaN. I've even tried x, y+(sample noise from normal with mean 0 and stddev 1) and it still comes up with the same thing. Is this not supposed to work for x and y, i.e. 2-dimensional plots? Is there something I'm missing or wrong in the code above? Or is there a limitation in the method? Thanks for any advice.
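For what it's worth, a minimal sketch of that suggestion, fitting and evaluating on the same scaled points (this assumes the parsedData RDD[LabeledPoint] from the earlier messages; note that centering the features makes an explicit intercept necessary, since the labels are left uncentered):

    import org.apache.spark.SparkContext._  // Double RDD implicits on 1.2
    import org.apache.spark.mllib.feature.StandardScaler
    import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

    val scaler = new StandardScaler(withMean = true, withStd = true)
      .fit(parsedData.map(_.features))
    val scaledData = parsedData
      .map(p => LabeledPoint(p.label, scaler.transform(p.features)))
      .cache()

    val lr = new LinearRegressionWithSGD()
    lr.setIntercept(true) // features are now zero-mean, labels are not
    lr.optimizer.setStepSize(0.0001)
    lr.optimizer.setNumIterations(10000)
    val model = lr.run(scaledData)

    // Evaluate on the same scaled points that were used for training.
    val mse = scaledData
      .map(p => math.pow(p.label - model.predict(p.features), 2))
      .mean()
    println("training Mean Squared Error = " + mse)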
Re: Join implementation in SparkSQL
What Reynold is describing is a performance optimization in the implementation, but the semantics of the join (a cartesian product plus a relational algebra filter) should be the same and produce the same results.

On Thu, Jan 15, 2015 at 1:36 PM, Reynold Xin r...@databricks.com wrote:
It's a bunch of strategies defined here: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala
In the most common use cases (e.g. an inner equi-join), filters are pushed below the join or into the join; doing a cartesian product followed by a filter is too expensive.

On Thu, Jan 15, 2015 at 7:39 AM, Alessandro Baretta alexbare...@gmail.com wrote:
Hello, Where can I find docs about how joins are implemented in SparkSQL? In particular, I'd like to know whether they are implemented according to their relational algebra definition, as filters on top of a cartesian product. Thanks, Alex
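One way to convince yourself of that equivalence (a sketch, assuming the same hypothetical tables a and b are registered, and that the SQL dialect in use accepts comma-style joins):

    val explicitJoin = sqlContext.sql(
      "SELECT a.id, b.value FROM a JOIN b ON a.id = b.id")
    val productThenFilter = sqlContext.sql(
      "SELECT a.id, b.value FROM a, b WHERE a.id = b.id")
    // Same result set either way; only the chosen physical plans differ.
    assert(explicitJoin.collect().toSet == productThenFilter.collect().toSet)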
Re: Spark SQL API changes and stabilization
Reynold, One thing I'd like worked into the public portion of the API is the JSON inferencing logic that creates a Set[(String, StructType)] out of a Map[String,Any]. SPARK-5260 addresses this so that I can use Accumulators to infer my schema instead of forcing a map/reduce phase to occur on an RDD in order to get the final schema. Do you (or anyone else) see a path forward in exposing this to users? A utility class, perhaps?

On Thu, Jan 15, 2015 at 1:33 PM, Reynold Xin r...@databricks.com wrote:
Alex, I didn't communicate properly. By private, I simply meant the expectation that it is not a public API. The plan is still to omit it from the scaladoc/javadoc generation, but no language visibility modifier will be applied. After 1.3, you will likely no longer need to use things in the sql.catalyst package directly. Programmatically constructing SchemaRDDs is going to be a first-class public API. Data types have already been moved out of the sql.catalyst package and now live in sql.types; they are becoming stable public APIs. When the data frame patch is submitted, you will see a public expression library as well. There will be little reason for end users or library developers to hook into things in sql.catalyst. The bravest and most advanced can still use them, with the expectation that they are subject to change.

On Thu, Jan 15, 2015 at 7:53 AM, Alessandro Baretta alexbare...@gmail.com wrote:
Reynold, Thanks for the heads up. In general, I strongly oppose the use of private to restrict access to certain parts of the API, the reason being that I might find the need to use some of the internals of a library from my own project. I find that a @DeveloperAPI annotation serves the same purpose as private without imposing unnecessary restrictions: it discourages people from using the annotated API and reserves the right for the core developers to change it suddenly in backwards-incompatible ways. In particular, I would like to express the desire that the APIs to programmatically construct SchemaRDDs from an RDD[Row] and a StructType remain public. All the SparkSQL data type objects should be exposed by the API, and the Jekyll build should not hide the docs as it does now. Thanks. Alex

On Wed, Jan 14, 2015 at 9:45 PM, Reynold Xin r...@databricks.com wrote:
Hi Spark devs, Given the growing number of developers that are building on Spark SQL, we would like to stabilize the API in 1.3 so users and developers can be confident building on it. This also gives us a chance to improve the API. In particular, we are proposing the following major changes. These should have no impact on most users (i.e. those running SQL through the JDBC client or the SQLContext.sql method).
1. Everything in the sql.catalyst package is private to the project.
2. Redesign the SchemaRDD DSL (SPARK-5097): We initially added the DSL for SchemaRDD and logical plans in order to construct test cases. We have received feedback from a lot of users that the DSL can be incredibly powerful. In 1.3, we’d like to refactor the DSL to make it suitable not only for constructing test cases but also for everyday data pipelines. The new SchemaRDD API is inspired by the data frame concept in Pandas and R.
3. Reconcile the Java and Scala APIs (SPARK-5193): We would like to expose one set of APIs that will work for both Java and Scala. The current Java API (sql.api.java) does not share any common ancestor with the Scala API. This has led to a high maintenance burden for us as Spark developers and for library developers.
We propose to eliminate the Java-specific API and simply work on the existing Scala API to make it also usable from Java. This will make Java a first-class citizen alongside Scala. This effectively means that all public classes should be usable from both Scala and Java, including SQLContext, HiveContext, SchemaRDD, data types, and the aforementioned DSL. Again, this should have no impact on most users, since the existing DSL is rarely used by end users. However, library developers might need to change the import statements because we are moving certain classes around. We will keep you posted as patches are merged.
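In the meantime, the public route to an inferred JSON schema is jsonRDD, which runs the inference pass internally and exposes the merged result via .schema; a small sketch, assuming a SparkContext sc and a SQLContext sqlContext:

    // jsonRDD resolves the types observed across all records into a single
    // StructType (the map/reduce pass SPARK-5260 would like to avoid).
    val json = sc.parallelize(Seq(
      """{"name": "alice", "age": 30}""",
      """{"name": "bob", "tags": ["dev"]}"""))
    val inferred = sqlContext.jsonRDD(json)
    println(inferred.schema) // fields merged across records: age, name, tags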
RE: Join implementation in SparkSQL
Not so sure about your question, but SparkStrategies.scala and Optimizer.scala are a good start if you want the details of the join implementation or optimization.

-----Original Message-----
From: Andrew Ash [mailto:and...@andrewash.com]
Sent: Friday, January 16, 2015 4:52 AM
To: Reynold Xin
Cc: Alessandro Baretta; dev@spark.apache.org
Subject: Re: Join implementation in SparkSQL

What Reynold is describing is a performance optimization in the implementation, but the semantics of the join (a cartesian product plus a relational algebra filter) should be the same and produce the same results.

On Thu, Jan 15, 2015 at 1:36 PM, Reynold Xin r...@databricks.com wrote:
It's a bunch of strategies defined here: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala
In the most common use cases (e.g. an inner equi-join), filters are pushed below the join or into the join; doing a cartesian product followed by a filter is too expensive.

On Thu, Jan 15, 2015 at 7:39 AM, Alessandro Baretta alexbare...@gmail.com wrote:
Hello, Where can I find docs about how joins are implemented in SparkSQL? In particular, I'd like to know whether they are implemented according to their relational algebra definition, as filters on top of a cartesian product. Thanks, Alex
Spark 1.2.0: MissingRequirementError
Hi guys, A few people seem to have the same problem with Spark 1.2.0, so I figured I would raise it here. See: http://apache-spark-user-list.1001560.n3.nabble.com/MissingRequirementError-with-spark-td21149.html
In a nutshell, for sbt test to work, we now need to fork a JVM and also give it more memory to be able to run the tests. See also: https://github.com/deanwampler/spark-workshop/blob/master/project/Build.scala
This all used to work fine until 1.2.0. Could you have a look please? Thanks, P.
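For anyone hitting this, the workaround in the linked Build.scala amounts to sbt settings along these lines (a sketch using sbt 0.13 keys; the heap and PermGen sizes are guesses to tune per project):

    // Run tests in a forked JVM with more memory; forking sidesteps the sbt
    // classloader setup that reportedly triggers the MissingRequirementError.
    fork in Test := true
    javaOptions in Test ++= Seq("-Xmx2G", "-XX:MaxPermSize=256m")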