[GitHub] spark pull request: [SPARK-2783][SQL] Basic support for analyze in...

2014-08-02 Thread concretevitamin
Github user concretevitamin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1741#discussion_r15733260
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala ---
@@ -280,7 +281,7 @@ private[hive] case class MetastoreRelation
   // of RPCs are involved.  Besides `totalSize`, there are also 
`numFiles`, `numRows`,
   // `rawDataSize` keys that we can look at in the future.
   BigInt(
-Option(hiveQlTable.getParameters.get("totalSize"))
+Option(hiveQlTable.getParameters.get(StatsSetupConst.TOTAL_SIZE))
--- End diff --

Oh wow, this is a hard-to-find class!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-2783][SQL] Basic support for analyze in...

2014-08-02 Thread concretevitamin
Github user concretevitamin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1741#discussion_r15733265
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala ---
@@ -280,7 +281,7 @@ private[hive] case class MetastoreRelation
   // of RPCs are involved.  Besides `totalSize`, there are also 
`numFiles`, `numRows`,
--- End diff --

Perhaps update the comments here to say other fields in `StatsSetupConst` 
might be useful.





[GitHub] spark pull request: [SPARK-2783][SQL] Basic support for analyze in...

2014-08-02 Thread concretevitamin
Github user concretevitamin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1741#discussion_r15733255
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala ---
@@ -21,12 +21,15 @@ import java.io.{BufferedReader, File, 
InputStreamReader, PrintStream}
 import java.sql.Timestamp
 import java.util.{ArrayList => JArrayList}
 
+import org.apache.hadoop.hive.ql.stats.StatsSetupConst
--- End diff --

Alphabetize imports





[GitHub] spark pull request: SPARK-2272 [MLlib] Feature scaling which stand...

2014-08-02 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1207#discussion_r15733253
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/feature/StandardScaler.scala ---
@@ -0,0 +1,94 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.feature
+
+import breeze.linalg.{DenseVector => BDV}
+
+import org.apache.spark.annotation.DeveloperApi
+import org.apache.spark.mllib.linalg.distributed.RowMatrix
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+import org.apache.spark.rdd.RDD
+
+/**
+ * :: DeveloperApi ::
+ * Standardizes features by removing the mean and scaling to unit variance 
using column summary
+ * statistics on the samples in the training set.
+ *
+ * @param withMean True by default. Centers the data with mean before 
scaling. It will build a dense
+ * output, so this does not work on sparse input and will 
raise an exception.
+ * @param withStd True by default. Scales the data to unit standard 
deviation.
--- End diff --

It is okay to keep it. I'm thinking about the use cases. We might need 
centering without standardizing the columns. But it is a little weird to use 
this transformer for that, because it is called `StandardScaler` while 
centering is neither `standardizing` nor `scaling`.





[GitHub] spark pull request: SPARK-2272 [MLlib] Feature scaling which stand...

2014-08-02 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/1207#discussion_r15733248
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/feature/VectorTransformer.scala ---
@@ -0,0 +1,47 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.feature
+
+import org.apache.spark.annotation.DeveloperApi
+import org.apache.spark.mllib.linalg.Vector
+import org.apache.spark.rdd.RDD
+
+/**
+ * :: DeveloperApi ::
+ * Trait for transformation of a vector
+ */
+@DeveloperApi
+trait VectorTransformer {
+
+  /**
+   * Applies transformation on a vector.
+   *
+   * @param vector vector to be transformed.
+   * @return transformed vector.
+   */
+  def transform(vector: Vector): Vector
+
+  /**
+   * Applies transformation on a RDD[Vector].
+   *
+   * @param data RDD[Vector] to be transformed.
+   * @return transformed RDD[Vector].
+   */
+  def transform(data: RDD[Vector]): RDD[Vector] = data.map(x => 
this.transform(x))
--- End diff --

Can you elaborate on this?
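The `VectorTransformer` trait under discussion pairs a per-vector `transform` with a default bulk `transform` that simply maps the former over an `RDD[Vector]`. A minimal Python sketch of that pattern (class and method names are illustrative only, not MLlib's API):

```python
# Sketch of the VectorTransformer pattern: a per-vector transform plus a
# default bulk transform that maps it element-wise over a collection
# (the Scala trait does the same with data.map(this.transform)).

class VectorTransformer:
    def transform(self, vector):
        """Transform a single vector; concrete transformers override this."""
        raise NotImplementedError

    def transform_all(self, data):
        """Default bulk transform: apply transform() to every vector."""
        return [self.transform(v) for v in data]

class Doubler(VectorTransformer):
    """Trivial example transformer: scale every component by 2."""
    def transform(self, vector):
        return [2.0 * x for x in vector]

print(Doubler().transform_all([[1.0, 2.0], [3.0]]))  # [[2.0, 4.0], [6.0]]
```

Subclasses only need to implement the single-vector case; the bulk version comes for free, which is exactly why the default `data.map(...)` implementation lives in the trait.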





[GitHub] spark pull request: SPARK-2272 [MLlib] Feature scaling which stand...

2014-08-02 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/1207#discussion_r15733244
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/feature/StandardScaler.scala ---
@@ -0,0 +1,94 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.feature
+
+import breeze.linalg.{DenseVector => BDV}
+
+import org.apache.spark.annotation.DeveloperApi
+import org.apache.spark.mllib.linalg.distributed.RowMatrix
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+import org.apache.spark.rdd.RDD
+
+/**
+ * :: DeveloperApi ::
+ * Standardizes features by removing the mean and scaling to unit variance 
using column summary
+ * statistics on the samples in the training set.
+ *
+ * @param withMean True by default. Centers the data with mean before 
scaling. It will build a dense
+ * output, so this does not work on sparse input and will 
raise an exception.
+ * @param withStd True by default. Scales the data to unit standard 
deviation.
--- End diff --

sklearn.preprocessing.StandardScaler has this API. If we want to minimize 
the set of parameters now, we can remove it for this release.


http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler





[GitHub] spark pull request: [SPARK-2309][MLlib] Generalize the binary logi...

2014-08-02 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/1379#issuecomment-50983421
  
@pwendell I didn't see `Closes #1379` in the merged commit. Is something 
wrong with asfgit?





[GitHub] spark pull request: SPARK-2792. Fix reading too much or too little...

2014-08-02 Thread mateiz
Github user mateiz commented on the pull request:

https://github.com/apache/spark/pull/1722#issuecomment-50983403
  
@aarondav / @mridulm any other comments on this, or is it okay to merge?





[GitHub] spark pull request: [SPARK-2309][MLlib] Generalize the binary logi...

2014-08-02 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/1379#issuecomment-50983381
  
... I have no idea. Let me check.





[GitHub] spark pull request: [SPARK-1477]: Add the lifecycle interface

2014-08-02 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/991#issuecomment-50983352
  
QA tests have started for PR 991. This patch merges cleanly. View 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17806/consoleFull





[GitHub] spark pull request: [SPARK-1477]: Add the lifecycle interface

2014-08-02 Thread witgo
Github user witgo commented on the pull request:

https://github.com/apache/spark/pull/991#issuecomment-50983325
  
Jenkins, retest this please.





[GitHub] spark pull request: SPARK-2272 [MLlib] Feature scaling which stand...

2014-08-02 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1207#discussion_r15733224
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/feature/Normalizer.scala ---
@@ -0,0 +1,58 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.feature
+
+import org.apache.spark.annotation.DeveloperApi
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+
+/**
+ * :: DeveloperApi ::
+ * Normalizes samples individually to unit L^n norm
+ *
+ * @param n  L^2 norm by default. Normalization in L^n space.
+ */
+@DeveloperApi
+class Normalizer(n: Int) extends VectorTransformer with Serializable {
+
+  def this() = this(2)
+
+  require(n > 0)
--- End diff --

Ah, we should use Double for norm and also accept 
`Double.PositiveInfinity`. 1, 2, and `inf` are the popular norms.
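Taking the norm order as a floating-point value makes the max norm representable alongside p = 1 and p = 2. A minimal numerical sketch of the suggestion (hypothetical helpers, not the actual Spark implementation):

```python
# Sketch of the suggestion above: accept the norm order p as a float so that
# p = float("inf") (the max norm) works alongside the usual p = 1 and p = 2.

def lp_norm(values, p):
    """L^p norm of a vector; p = inf gives the max-absolute-value norm."""
    if p == float("inf"):
        return max((abs(v) for v in values), default=0.0)
    return sum(abs(v) ** p for v in values) ** (1.0 / p)

def normalize(values, p=2.0):
    """Scale a vector to unit L^p norm; leave all-zero vectors untouched."""
    n = lp_norm(values, p)
    return values if n == 0.0 else [v / n for v in values]

print(normalize([3.0, 4.0]))                 # [0.6, 0.8]
print(normalize([1.0, -2.0], float("inf")))  # [0.5, -1.0]
```

The special case for p = inf is needed because naively raising to an infinite power does not degrade gracefully; 1, 2, and inf are the orders users ask for in practice, as noted above.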





[GitHub] spark pull request: SPARK-2272 [MLlib] Feature scaling which stand...

2014-08-02 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/1207#discussion_r15733221
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/feature/Normalizer.scala ---
@@ -0,0 +1,58 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.feature
+
+import org.apache.spark.annotation.DeveloperApi
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+
+/**
+ * :: DeveloperApi ::
+ * Normalizes samples individually to unit L^n norm
+ *
+ * @param n  L^2 norm by default. Normalization in L^n space.
+ */
+@DeveloperApi
+class Normalizer(n: Int) extends VectorTransformer with Serializable {
+
+  def this() = this(2)
+
+  require(n > 0)
--- End diff --

I made it more explicit rather than saving one CPU cycle.





[GitHub] spark pull request: SPARK-2272 [MLlib] Feature scaling which stand...

2014-08-02 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/1207#discussion_r15733217
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/feature/Normalizer.scala ---
@@ -0,0 +1,58 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.feature
+
+import org.apache.spark.annotation.DeveloperApi
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+
+/**
+ * :: DeveloperApi ::
+ * Normalizes samples individually to unit L^n norm
+ *
+ * @param n  L^2 norm by default. Normalization in L^n space.
+ */
+@DeveloperApi
+class Normalizer(n: Int) extends VectorTransformer with Serializable {
+
+  def this() = this(2)
+
+  require(n > 0)
--- End diff --

This is an Int. As long as we require n > 0, that already implies n >= 1.





[GitHub] spark pull request: SPARK-2272 [MLlib] Feature scaling which stand...

2014-08-02 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1207#discussion_r15733213
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/feature/NormalizerSuite.scala ---
@@ -0,0 +1,134 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.feature
+
+import breeze.linalg.{DenseVector => BDV, SparseVector => BSV}
+import org.scalatest.FunSuite
+
+import org.apache.spark.mllib.linalg.Vectors
+import org.apache.spark.mllib.util.LocalSparkContext
+import org.apache.spark.mllib.util.TestingUtils._
+
+class NormalizerSuite extends FunSuite with LocalSparkContext {
+
+  private def norm(v: Array[Double], n: Int): Double = {
+v.foldLeft[Double](0.0)((acc, value) => acc + 
Math.pow(Math.abs(value), n))
+  }
+
+  val data = Array(
+Vectors.sparse(3, Seq((0, -2.0), (1, 2.3))),
+Vectors.dense(0.0, 0.0, 0.0),
+Vectors.dense(0.6, -1.1, -3.0),
+Vectors.sparse(3, Seq((1, 0.91), (2, 3.2))),
+Vectors.dense(5.7, 0.72, 2.7),
+Vectors.sparse(3, Seq())
+  )
+
+  lazy val dataRDD = sc.parallelize(data, 3)
+
+  test("Normalization using L1 distance") {
+val l1Normalizer = new Normalizer(1)
+
+val data1 = data.map(l1Normalizer.transform(_))
+val data1RDD = l1Normalizer.transform(dataRDD)
+
+assert((data.map(_.toBreeze), data1.map(_.toBreeze), 
data1RDD.collect().map(_.toBreeze))
+  .zipped.forall(
+(v1, v2, v3) => (v1, v2, v3) match {
+  case (v1: BDV[Double], v2: BDV[Double], v3: BDV[Double]) => true
+  case (v1: BSV[Double], v2: BSV[Double], v3: BSV[Double]) => true
+  case _ => false
+}
+  ), "The vector type should be preserved after normalization.")
+
+assert((data1, data1RDD.collect()).zipped.forall((v1, v2) => v1 ~== v2 
absTol 1E-5))
+
+assert(norm(data1(0).toArray, 1) ~== 1.0 absTol 1E-5)
+assert(norm(data1(2).toArray, 1) ~== 1.0 absTol 1E-5)
+assert(norm(data1(3).toArray, 1) ~== 1.0 absTol 1E-5)
+assert(norm(data1(4).toArray, 1) ~== 1.0 absTol 1E-5)
+
+assert(data1(0) ~== Vectors.sparse(3, Seq((0, -0.465116279), (1, 
0.53488372))) absTol 1E-5)
+assert(data1(1) ~== Vectors.dense(0.0, 0.0, 0.0) absTol 1E-5)
+assert(data1(2) ~== Vectors.dense(0.12765957, -0.23404255, 
-0.63829787) absTol 1E-5)
+assert(data1(3) ~== Vectors.sparse(3, Seq((1, 0.22141119), (2, 
0.7785888))) absTol 1E-5)
+assert(data1(4) ~== Vectors.dense(0.625, 0.07894737, 0.29605263) 
absTol 1E-5)
+assert(data1(5) ~== Vectors.sparse(3, Seq()) absTol 1E-5)
+  }
+
+  test("Normalization using L2 distance") {
+val l2Normalizer = new Normalizer()
+
+val data2 = data.map(l2Normalizer.transform(_))
+val data2RDD = l2Normalizer.transform(dataRDD)
+
+assert((data.map(_.toBreeze), data2.map(_.toBreeze), 
data2RDD.collect().map(_.toBreeze))
+  .zipped.forall(
+(v1, v2, v3) => (v1, v2, v3) match {
+  case (v1: BDV[Double], v2: BDV[Double], v3: BDV[Double]) => true
+  case (v1: BSV[Double], v2: BSV[Double], v3: BSV[Double]) => true
+  case _ => false
+}
+  ), "The vector type should be preserved after normalization.")
+
+assert((data2, data2RDD.collect()).zipped.forall((v1, v2) => v1 ~== v2 
absTol 1E-5))
+
+assert(norm(data2(0).toArray, 2) ~== 1.0 absTol 1E-5)
+assert(norm(data2(2).toArray, 2) ~== 1.0 absTol 1E-5)
+assert(norm(data2(3).toArray, 2) ~== 1.0 absTol 1E-5)
+assert(norm(data2(4).toArray, 2) ~== 1.0 absTol 1E-5)
+
+assert(data2(0) ~== Vectors.sparse(3, Seq((0, -0.65617871), (1, 
0.75460552))) absTol 1E-5)
+assert(data2(1) ~== Vectors.dense(0.0, 0.0, 0.0) absTol 1E-5)
+assert(data2(2) ~== Vectors.dense(0.184549876, -0.3383414, 
-0.922749378) absTol 1E-5)
+assert(data2(3) ~== Vectors.sparse(3, Seq((1, 0

[GitHub] spark pull request: SPARK-2272 [MLlib] Feature scaling which stand...

2014-08-02 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1207#issuecomment-50983292
  
QA results for PR 1207:
- This patch PASSES unit tests.
- This patch merges cleanly.
- This patch adds the following public classes (experimental):
  class Normalizer(n: Int) extends VectorTransformer with Serializable
  class StandardScaler(withMean: Boolean, withStd: Boolean)
  trait VectorTransformer
For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17804/consoleFull





[GitHub] spark pull request: SPARK-2272 [MLlib] Feature scaling which stand...

2014-08-02 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1207#discussion_r15733202
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/feature/StandardScaler.scala ---
@@ -0,0 +1,94 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.feature
+
+import breeze.linalg.{DenseVector => BDV}
+
+import org.apache.spark.annotation.DeveloperApi
+import org.apache.spark.mllib.linalg.distributed.RowMatrix
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+import org.apache.spark.rdd.RDD
+
+/**
+ * :: DeveloperApi ::
+ * Standardizes features by removing the mean and scaling to unit variance 
using column summary
+ * statistics on the samples in the training set.
+ *
+ * @param withMean True by default. Centers the data with mean before 
scaling. It will build a dense
--- End diff --

I would set `withMean` default to `false` because almost all datasets are 
sparse.
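The concern about sparse input is that subtracting a non-zero mean turns every implicit zero into an explicit non-zero, so centering densifies the data. A small sketch illustrating that (sparse vectors modelled here as a dict of index to value; this is illustrative, not MLlib's representation):

```python
# Why centering is risky for sparse input: subtracting a non-zero mean turns
# implicit zeros into explicit non-zeros, so the output must be dense.

def center_dense(size, entries, mean):
    """Center a sparse vector; every position must be materialized."""
    return [entries.get(i, 0.0) - mean for i in range(size)]

size, entries = 1000, {3: 5.0, 17: -2.0}   # 2 stored values out of 1000
centered = center_dense(size, entries, mean=0.003)

nonzero = sum(1 for v in centered if v != 0.0)
print(nonzero)  # 1000: every position became non-zero after centering
```

A vector that stored 2 of 1000 values now stores all 1000, which is why defaulting `withMean` to `false` (and raising an exception on sparse input when it is `true`) is the safer choice.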





[GitHub] spark pull request: SPARK-2272 [MLlib] Feature scaling which stand...

2014-08-02 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1207#discussion_r15733204
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/feature/StandardScaler.scala ---
@@ -0,0 +1,94 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.feature
+
+import breeze.linalg.{DenseVector => BDV}
+
+import org.apache.spark.annotation.DeveloperApi
+import org.apache.spark.mllib.linalg.distributed.RowMatrix
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+import org.apache.spark.rdd.RDD
+
+/**
+ * :: DeveloperApi ::
+ * Standardizes features by removing the mean and scaling to unit variance 
using column summary
+ * statistics on the samples in the training set.
+ *
+ * @param withMean True by default. Centers the data with mean before 
scaling. It will build a dense
+ * output, so this does not work on sparse input and will 
raise an exception.
+ * @param withStd True by default. Scales the data to unit standard 
deviation.
+ */
+@DeveloperApi
+class StandardScaler(withMean: Boolean, withStd: Boolean)
+  extends VectorTransformer with Serializable {
+
+  def this() = this(true, true)
+
+  var mean: Vector = _
+  var variance: Vector = _
+
+  /**
+   * Computes the mean and variance and stores as a model to be used for 
later scaling.
+   *
+   * @param data The data used to compute the mean and variance to build 
the transformation model.
+   * @return This StandardScalar object.
+   */
+  def fit(data: RDD[Vector]): this.type = {
+val summary = new RowMatrix(data).computeColumnSummaryStatistics
--- End diff --

Using `OnlineSummarizer` here?
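An online summarizer computes mean and variance in a single pass instead of building a `RowMatrix` first; the standard technique is Welford's update. A minimal single-column sketch in that spirit (an assumed illustration, not MLlib's summarizer, which works per column and can merge partial results from partitions):

```python
# One-pass mean/variance via Welford's online update, in the spirit of the
# OnlineSummarizer suggested above. Single column for brevity.

class OnlineSummarizer:
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0   # running sum of squared deviations from the mean

    def add(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def variance(self):
        # Sample variance, i.e. what a standard scaler would divide by.
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0

s = OnlineSummarizer()
for x in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    s.add(x)
print(s.mean, s.variance())  # approximately 5.0 and 4.5714
```

Welford's update is numerically stable (it never subtracts two large sums of squares), which is the usual argument for it over the naive E[x^2] - E[x]^2 formula.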





[GitHub] spark pull request: SPARK-2272 [MLlib] Feature scaling which stand...

2014-08-02 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1207#discussion_r15733206
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/feature/VectorTransformer.scala ---
@@ -0,0 +1,47 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.feature
+
+import org.apache.spark.annotation.DeveloperApi
+import org.apache.spark.mllib.linalg.Vector
+import org.apache.spark.rdd.RDD
+
+/**
+ * :: DeveloperApi ::
+ * Trait for transformation of a vector
+ */
+@DeveloperApi
+trait VectorTransformer {
--- End diff --

add `Serializable`?





[GitHub] spark pull request: SPARK-2272 [MLlib] Feature scaling which stand...

2014-08-02 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1207#discussion_r15733207
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/feature/VectorTransformer.scala ---
@@ -0,0 +1,47 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.feature
+
+import org.apache.spark.annotation.DeveloperApi
+import org.apache.spark.mllib.linalg.Vector
+import org.apache.spark.rdd.RDD
+
+/**
+ * :: DeveloperApi ::
+ * Trait for transformation of a vector
+ */
+@DeveloperApi
+trait VectorTransformer {
+
+  /**
+   * Applies transformation on a vector.
+   *
+   * @param vector vector to be transformed.
+   * @return transformed vector.
+   */
+  def transform(vector: Vector): Vector
+
+  /**
+   * Applies transformation on a RDD[Vector].
+   *
+   * @param data RDD[Vector] to be transformed.
+   * @return transformed RDD[Vector].
+   */
+  def transform(data: RDD[Vector]): RDD[Vector] = data.map(x => 
this.transform(x))
--- End diff --

    Note: to transform an RDD, we should broadcast the data we need instead of 
    serializing it into the task closure. (This may become unnecessary because we 
    broadcast RDD objects in Spark v1.1.)
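
    The capture issue can be sketched without Spark: mapping with
    `this.transform` drags the whole transformer object into each task's closure,
    while binding the needed field to a local val keeps the closure small. (In
    Spark the local val would typically be an `sc.broadcast(...)` handle.) The
    names below are illustrative, not MLlib API:

```scala
// Illustrative sketch: avoid capturing `this` in a closure shipped to tasks.
class Shifter(val offset: Array[Double]) {
  // Bad: data.map(v => this.transform(v)) serializes the entire Shifter.
  // Better: bind the field to a local val so only the array is captured.
  def transformAll(data: Seq[Array[Double]]): Seq[Array[Double]] = {
    val localOffset = offset  // in Spark: sc.broadcast(offset), then .value in the closure
    data.map(v => v.zip(localOffset).map { case (x, o) => x - o })
  }
}

val out = new Shifter(Array(1.0, 2.0)).transformAll(Seq(Array(3.0, 5.0)))
```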





[GitHub] spark pull request: SPARK-2272 [MLlib] Feature scaling which stand...

2014-08-02 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1207#discussion_r15733203
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/feature/StandardScaler.scala ---
@@ -0,0 +1,94 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.feature
+
+import breeze.linalg.{DenseVector => BDV}
+
+import org.apache.spark.annotation.DeveloperApi
+import org.apache.spark.mllib.linalg.distributed.RowMatrix
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+import org.apache.spark.rdd.RDD
+
+/**
+ * :: DeveloperApi ::
+ * Standardizes features by removing the mean and scaling to unit variance 
using column summary
+ * statistics on the samples in the training set.
+ *
+ * @param withMean True by default. Centers the data with mean before 
scaling. It will build a dense
+ * output, so this does not work on sparse input and will 
raise an exception.
+ * @param withStd True by default. Scales the data to unit standard 
deviation.
--- End diff --

Are there use cases for `withStd == false`? (I'm trying to make a minimal 
set of parameters.)





[GitHub] spark pull request: SPARK-2272 [MLlib] Feature scaling which stand...

2014-08-02 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1207#discussion_r15733205
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/feature/StandardScaler.scala ---
@@ -0,0 +1,94 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.feature
+
+import breeze.linalg.{DenseVector => BDV}
+
+import org.apache.spark.annotation.DeveloperApi
+import org.apache.spark.mllib.linalg.distributed.RowMatrix
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+import org.apache.spark.rdd.RDD
+
+/**
+ * :: DeveloperApi ::
+ * Standardizes features by removing the mean and scaling to unit variance 
using column summary
+ * statistics on the samples in the training set.
+ *
+ * @param withMean True by default. Centers the data with mean before 
scaling. It will build a dense
+ * output, so this does not work on sparse input and will 
raise an exception.
+ * @param withStd True by default. Scales the data to unit standard 
deviation.
+ */
+@DeveloperApi
+class StandardScaler(withMean: Boolean, withStd: Boolean)
+  extends VectorTransformer with Serializable {
+
+  def this() = this(true, true)
+
+  var mean: Vector = _
+  var variance: Vector = _
+
+  /**
+   * Computes the mean and variance and stores as a model to be used for 
later scaling.
+   *
+   * @param data The data used to compute the mean and variance to build 
the transformation model.
+   * @return This StandardScaler object.
+   */
+  def fit(data: RDD[Vector]): this.type = {
+val summary = new RowMatrix(data).computeColumnSummaryStatistics
+this.mean = summary.mean
+this.variance = summary.variance
+require(mean.toBreeze.length == variance.toBreeze.length)
+this
+  }
+
+  /**
+   * Applies standardization transformation on a vector.
+   *
+   * @param vector Vector to be standardized.
+   * @return Standardized vector. If the variance of a column is zero, it 
will return default `0.0`
+   * for the column with zero variance.
+   */
+  override def transform(vector: Vector): Vector = {
+require(mean != null || variance != null, s"Please `fit` the model 
with training set first.")
+require(vector.toBreeze.length == mean.toBreeze.length)
+
+if (withMean) {
+  vector.toBreeze match {
+case dv: BDV[Double] => // pass
+case v: Any =>
+  throw new IllegalArgumentException("Do not support vector type " 
+ v.getClass)
+  }
+}
+
+val output = vector.toBreeze.copy
+output.activeIterator.foreach {
+  case (i, value) => {
+val shift = if (withMean) mean(i) else 0.0
+if (variance(i) != 0.0 && withStd) {
+  output(i) = (value - shift) / Math.sqrt(variance(i))
--- End diff --

ditto: same issue with random access





[GitHub] spark pull request: SPARK-2272 [MLlib] Feature scaling which stand...

2014-08-02 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1207#discussion_r15733196
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/feature/Normalizer.scala ---
@@ -0,0 +1,58 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.feature
+
+import org.apache.spark.annotation.DeveloperApi
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+
+/**
+ * :: DeveloperApi ::
+ * Normalizes samples individually to unit L^n norm
+ *
+ * @param n  L^2 norm by default. Normalization in L^n space.
+ */
+@DeveloperApi
+class Normalizer(n: Int) extends VectorTransformer with Serializable {
--- End diff --

`n` -> `p`, which is commonly used for norms.





[GitHub] spark pull request: SPARK-2272 [MLlib] Feature scaling which stand...

2014-08-02 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1207#discussion_r15733200
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/feature/Normalizer.scala ---
@@ -0,0 +1,58 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.feature
+
+import org.apache.spark.annotation.DeveloperApi
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+
+/**
+ * :: DeveloperApi ::
+ * Normalizes samples individually to unit L^n norm
+ *
+ * @param n  L^2 norm by default. Normalization in L^n space.
+ */
+@DeveloperApi
+class Normalizer(n: Int) extends VectorTransformer with Serializable {
+
+  def this() = this(2)
+
+  require(n > 0)
+
+  /**
+   * Applies unit length normalization on a vector.
+   *
+   * @param vector vector to be normalized.
+   * @return normalized vector. If all the elements in vector are zeros, 
it will return as is.
+   */
+  override def transform(vector: Vector): Vector = {
+var sum = 0.0
+vector.toBreeze.activeIterator.foreach {
+  case (i, value) => sum += Math.pow(Math.abs(value), n)
+}
+
+val output = vector.toBreeze.copy
+if (sum != 0.0) {
+  sum = Math.pow(sum, 1.0 / n)
+  output.activeIterator.foreach {
+case (i, value) => output(i) = value / sum
--- End diff --

For sparse vectors, `apply(Int)` is implemented using binary search. So we 
should operate on the values array directly.
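
    A sketch of the suggested fix in plain Scala (hypothetical `SparseVec`, not
    MLlib's class): with parallel `indices`/`values` arrays, scaling touches each
    stored entry exactly once, whereas `output(i) = ...` on a sparse vector pays
    a binary search per access:

```scala
// Hypothetical CSR-style sparse vector: parallel index/value arrays.
case class SparseVec(size: Int, indices: Array[Int], values: Array[Double])

// Scale every stored entry by 1/norm, writing the values array directly:
// O(nnz) total, versus O(nnz * log nnz) if each update went through apply(i).
def scaleInPlace(v: SparseVec, norm: Double): SparseVec = {
  var k = 0
  while (k < v.values.length) {
    v.values(k) /= norm
    k += 1
  }
  v
}

val v = scaleInPlace(SparseVec(5, Array(0, 3), Array(3.0, 4.0)), 5.0)
```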





[GitHub] spark pull request: SPARK-2272 [MLlib] Feature scaling which stand...

2014-08-02 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1207#discussion_r15733198
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/feature/Normalizer.scala ---
@@ -0,0 +1,58 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.feature
+
+import org.apache.spark.annotation.DeveloperApi
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+
+/**
+ * :: DeveloperApi ::
+ * Normalizes samples individually to unit L^n norm
+ *
+ * @param n  L^2 norm by default. Normalization in L^n space.
+ */
+@DeveloperApi
+class Normalizer(n: Int) extends VectorTransformer with Serializable {
+
+  def this() = this(2)
+
+  require(n > 0)
+
+  /**
+   * Applies unit length normalization on a vector.
+   *
+   * @param vector vector to be normalized.
+   * @return normalized vector. If all the elements in vector are zeros, 
it will return as is.
+   */
+  override def transform(vector: Vector): Vector = {
+var sum = 0.0
+vector.toBreeze.activeIterator.foreach {
--- End diff --

We can use breeze's norm directly, e.g., `norm(v, 2.0)`.
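
    For comparison, a self-contained sketch of the same quantity in plain Scala
    (`pNorm` is a hypothetical name; `breeze.linalg.norm(v, p)` computes this
    directly for breeze vectors):

```scala
// p-norm of a dense array: (sum |x_i|^p)^(1/p).
// breeze.linalg.norm(v, p) computes the same quantity for breeze vectors.
def pNorm(xs: Array[Double], p: Double): Double = {
  require(p >= 1.0, "p-norms are norms only for p >= 1")
  math.pow(xs.map(x => math.pow(math.abs(x), p)).sum, 1.0 / p)
}

val n2 = pNorm(Array(3.0, -4.0), 2.0)  // Euclidean norm
```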





[GitHub] spark pull request: SPARK-2272 [MLlib] Feature scaling which stand...

2014-08-02 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1207#discussion_r15733197
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/feature/Normalizer.scala ---
@@ -0,0 +1,58 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.feature
+
+import org.apache.spark.annotation.DeveloperApi
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+
+/**
+ * :: DeveloperApi ::
+ * Normalizes samples individually to unit L^n norm
+ *
+ * @param n  L^2 norm by default. Normalization in L^n space.
+ */
+@DeveloperApi
+class Normalizer(n: Int) extends VectorTransformer with Serializable {
+
+  def this() = this(2)
+
+  require(n > 0)
--- End diff --

`p >= 1`? Any use case for `p \in (0, 1)`?





[GitHub] spark pull request: SPARK-2272 [MLlib] Feature scaling which stand...

2014-08-02 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1207#discussion_r15733199
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/feature/Normalizer.scala ---
@@ -0,0 +1,58 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.feature
+
+import org.apache.spark.annotation.DeveloperApi
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+
+/**
+ * :: DeveloperApi ::
+ * Normalizes samples individually to unit L^n norm
+ *
+ * @param n  L^2 norm by default. Normalization in L^n space.
+ */
+@DeveloperApi
+class Normalizer(n: Int) extends VectorTransformer with Serializable {
+
+  def this() = this(2)
+
+  require(n > 0)
+
+  /**
+   * Applies unit length normalization on a vector.
+   *
+   * @param vector vector to be normalized.
+   * @return normalized vector. If all the elements in vector are zeros, 
it will return as is.
+   */
+  override def transform(vector: Vector): Vector = {
+var sum = 0.0
+vector.toBreeze.activeIterator.foreach {
+  case (i, value) => sum += Math.pow(Math.abs(value), n)
+}
+
+val output = vector.toBreeze.copy
--- End diff --

Should be faster if we branch on the vector type here. If the vector is 
sparse, we only need to copy its value array. Also, the `activeIterator` is not 
very efficient.
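
    The suggested branching, sketched with hypothetical `Dense`/`Sparse` case
    classes standing in for MLlib's vector types:

```scala
// Hypothetical stand-ins for DenseVector / SparseVector.
sealed trait Vec
case class Dense(values: Array[Double]) extends Vec
case class Sparse(size: Int, indices: Array[Int], values: Array[Double]) extends Vec

// Normalize by branching on the concrete type: for the sparse case only the
// stored values array is copied and divided; the implicit zeros stay implicit.
def divide(v: Vec, norm: Double): Vec = v match {
  case Dense(vs)          => Dense(vs.map(_ / norm))
  case Sparse(n, idx, vs) => Sparse(n, idx, vs.map(_ / norm))
}

val d = divide(Sparse(4, Array(1, 2), Array(2.0, 4.0)), 2.0)
```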





[GitHub] spark pull request: [SPARK-1477]: Add the lifecycle interface

2014-08-02 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/991#issuecomment-50983223
  
    QA results for PR 991:
    - This patch FAILED unit tests.
    - This patch merges cleanly
    - This patch adds the following public classes (experimental):
      trait Lifecycle extends Service {
      trait Service extends java.io.Closeable {
      class SparkContext(config: SparkConf) extends Logging with Lifecycle {
      class JavaStreamingContext(val ssc: StreamingContext) extends Lifecycle {
      class JobGenerator(jobScheduler: JobScheduler) extends Logging with Lifecycle {
      class JobScheduler(val ssc: StreamingContext) extends Logging with Lifecycle {
      class ReceiverTracker(ssc: StreamingContext) extends Logging with Lifecycle {
      class ReceiverLauncher extends Lifecycle {
    For more information see test output:
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17805/consoleFull





[GitHub] spark pull request: [SPARK-2197] [mllib] Java DecisionTree bug fix...

2014-08-02 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1740#discussion_r15733162
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/configuration/Strategy.scala 
---
@@ -60,4 +62,31 @@ class Strategy (
   val isMulticlassWithCategoricalFeatures
 = isMulticlassClassification && (categoricalFeaturesInfo.size > 0)
 
+  /**
+   * Java-friendly constructor.
+   *
+   * @param algo classification or regression
+   * @param impurity criterion used for information gain calculation
+   * @param maxDepth Maximum depth of the tree.
+   * E.g., depth 0 means 1 leaf node; depth 1 means 1 
internal node + 2 leaf nodes.
+   * @param numClassesForClassification number of classes for 
classification. Default value is 2
+   *leads to binary classification
+   * @param maxBins maximum number of bins used for splitting features
+   * @param categoricalFeaturesInfo A map storing information about the 
categorical variables and
+   *the number of discrete values they 
take. For example, an entry
+   *(n -> k) implies the feature n is 
categorical with k categories
+   *0, 1, 2, ... , k-1. It's important to 
note that features are
+   *zero-indexed.
+   */
+  def this(
+  algo: Algo,
+  impurity: Impurity,
+  maxDepth: Int,
+  numClassesForClassification: Int,
+  maxBins: Int,
+  categoricalFeaturesInfo: java.util.Map[java.lang.Integer, 
java.lang.Integer]) {
+this(algo, impurity, maxDepth, numClassesForClassification, maxBins, 
Sort,
+  categoricalFeaturesInfo.map{ case (a, b) => (a.toInt, b.toInt) 
}.toMap)
--- End diff --

This seems to work:

~~~
import scala.collection.JavaConverters._

categoricalFeaturesInfo.asInstanceOf[java.util.Map[Int, Int]].asScala.toMap
~~~

`JavaConverters` is preferred because the conversion is explicit via 
`asScala`
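
    A self-contained check of the suggested conversion (standard library only,
    illustrative variable names):

```scala
import scala.collection.JavaConverters._

// Build a java.util.Map[Integer, Integer] as a Java caller would pass it.
val jmap = new java.util.HashMap[java.lang.Integer, java.lang.Integer]()
jmap.put(0, 2)
jmap.put(3, 4)

// The cast is safe at runtime because Int and java.lang.Integer erase to the
// same representation; asScala wraps without copying, and toMap then
// materializes an immutable Scala Map[Int, Int].
val smap: Map[Int, Int] = jmap.asInstanceOf[java.util.Map[Int, Int]].asScala.toMap
```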





[GitHub] spark pull request: [SPARK-1477]: Add the lifecycle interface

2014-08-02 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/991#issuecomment-50982869
  
QA tests have started for PR 991. This patch merges cleanly. View 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17805/consoleFull





[GitHub] spark pull request: SPARK-2813: [SQL] Implement SQRT() directly in...

2014-08-02 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1750#issuecomment-50982709
  
    QA results for PR 1750:
    - This patch PASSES unit tests.
    - This patch merges cleanly
    - This patch adds the following public classes (experimental):
      case class Sqrt(child: Expression) extends UnaryExpression {
    For more information see test output:
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17801/consoleFull





[GitHub] spark pull request: [SPARK-2309][MLlib] Generalize the binary logi...

2014-08-02 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/1379#issuecomment-50982699
  
    @mengxr  Is there any problem with asfgit? This is not finished yet, so why 
    did asfgit say it's merged into apache:master?





[GitHub] spark pull request: SPARK-2272 [MLlib] Feature scaling which stand...

2014-08-02 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1207#issuecomment-50982624
  
QA tests have started for PR 1207. This patch merges cleanly. View 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17804/consoleFull





[GitHub] spark pull request: SPARK-2272 [MLlib] Feature scaling which stand...

2014-08-02 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1207#issuecomment-50982574
  
    QA results for PR 1207:
    - This patch FAILED unit tests.
    - This patch merges cleanly
    - This patch adds the following public classes (experimental):
      class Normalizer(n: Int) extends VectorTransformer with Serializable {
      class StandardScaler(withMean: Boolean, withStd: Boolean)
      trait VectorTransformer {
    For more information see test output:
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17803/consoleFull





[GitHub] spark pull request: SPARK-2272 [MLlib] Feature scaling which stand...

2014-08-02 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1207#issuecomment-50982572
  
QA tests have started for PR 1207. This patch merges cleanly. View 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17803/consoleFull





[GitHub] spark pull request: SPARK-2272 [MLlib] Feature scaling which stand...

2014-08-02 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1207#issuecomment-50982521
  
QA results for PR 1207:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
  class Normalizer(n: Int) extends VectorTransformer with Serializable {
  class StandardScaler(withMean: Boolean, withStd: Boolean)
  trait VectorTransformer {

For more information see test output: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17802/consoleFull





[GitHub] spark pull request: [SPARK-2627] have the build enforce PEP 8 auto...

2014-08-02 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1744#issuecomment-50982524
  
QA results for PR 1744:
- This patch PASSES unit tests.

For more information see test output: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17800/consoleFull





[GitHub] spark pull request: SPARK-2272 [MLlib] Feature scaling which stand...

2014-08-02 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1207#issuecomment-50982519
  
QA tests have started for PR 1207. This patch merges cleanly. View 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17802/consoleFull





[GitHub] spark pull request: SPARK-2602 [BUILD] Tests steal focus under Jav...

2014-08-02 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/1747





[GitHub] spark pull request: SPARK-2414 [BUILD] Add LICENSE entry for jquer...

2014-08-02 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/1748





[GitHub] spark pull request: [Minor] Fixes on top of #1679

2014-08-02 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/1736





[GitHub] spark pull request: [SPARK-2309][MLlib] Generalize the binary logi...

2014-08-02 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/1379





[GitHub] spark pull request: [SPARK-2678][Core] Added "--" to prevent spark...

2014-08-02 Thread liancheng
Github user liancheng commented on the pull request:

https://github.com/apache/spark/pull/1715#issuecomment-50982344
  
@andrewor14 Interesting, actually this is almost exactly the same solution 
I came up with at the very beginning :)

The only difference is that we chose `--` rather than a more intuitive 
option name like `--spark-application-args`. And `--` was chosen because it's 
the idiomatic way on UNIX-like systems to pass this kind of "user application 
options".

The reason that we (@pwendell and I) gave it up after discussion is that 
this solution is not fully backward compatible: it breaks existing 
user applications that already recognize `--` as a valid option. Turning `--` 
into something more specific like `--spark-application-args` does reduce the 
probability of name collision. In particular, after this change, we won't have 
similar compatibility issues whenever we add new options to `spark-submit` 
in the future. @pwendell Maybe this is acceptable?

And I agree with your arguments about the drawbacks of putting the application 
jar into `--jars`. Similar arguments apply to Python applications. That is 
also an important reason that I introduced `--primary` in the first place.
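The UNIX `--` convention under discussion is easy to sketch: everything before the separator goes to the launcher, everything after goes to the user application untouched. A minimal illustration (not the actual SparkSubmit parser; `splitAtSeparator` is an invented name):

```scala
// Hypothetical sketch of the "--" separator convention: arguments before
// "--" are launcher options, arguments after it are passed through to the
// user application verbatim, and the "--" itself is dropped.
object ArgSplitSketch {
  def splitAtSeparator(args: Seq[String]): (Seq[String], Seq[String]) = {
    val idx = args.indexOf("--")
    if (idx < 0) (args, Seq.empty)            // no separator: all launcher args
    else (args.take(idx), args.drop(idx + 1)) // drop the "--" itself
  }
}
```

With this split, a flag like `--jars` after the separator would never collide with the launcher's own options.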





[GitHub] spark pull request: [SPARK-2783][SQL] Basic support for analyze in...

2014-08-02 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1741#issuecomment-50982297
  
QA results for PR 1741:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17799/consoleFull





[GitHub] spark pull request: SPARK-2414 [BUILD] Add LICENSE entry for jquer...

2014-08-02 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/1748#issuecomment-50982167
  
Thanks Sean - I'll merge this.





[GitHub] spark pull request: [SPARK-2627] have the build enforce PEP 8 auto...

2014-08-02 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/1744#issuecomment-50982162
  
Hey Nick - thanks for taking a crack at this. It's great to see us adding 
more automated code quality checks. Couple things:

1. Could you add `[PySpark]` to the title of this PR? We are using tags 
like that to do sorting amongst the committership and it will get noticed that 
way.
2. In terms of the dependency on pep8, we've tried really hard to avoid 
having exogenous dependencies in Spark. It makes porting things like our QA 
environment very difficult. So one idea - could this have a script that just 
lazily fetches the pep8 library directly? For instance, this is what we do with 
our sbt tool - we just wget the sbt jar... it seems like you could do something 
similar for pep8. Not sure if that totally works, but just an idea.
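The lazy-fetch idea can be sketched as follows; `ensurePresent` and the injected fetch step are illustrative helpers, not part of any actual Spark build script:

```scala
import java.nio.file.{Files, Path}

// Hypothetical sketch of lazily fetching a dev tool (e.g. the pep8 script):
// invoke the download step only when the file is absent, mirroring how the
// sbt launcher jar is wget-ed on demand. The fetch action is injected so the
// download mechanism (wget, curl, ...) stays out of the core logic.
object LazyFetchSketch {
  def ensurePresent(path: Path, fetch: Path => Unit): Path = {
    if (!Files.exists(path)) fetch(path) // e.g. wget the pep8 script here
    path
  }
}
```

Repeated calls are cheap: once the file exists, the fetch step is never run again.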





[GitHub] spark pull request: SPARK-2813: [SQL] Implement SQRT() directly in...

2014-08-02 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1750#issuecomment-50981820
  
QA tests have started for PR 1750. This patch merges cleanly. View 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17801/consoleFull





[GitHub] spark pull request: SPARK-2813: [SQL] Implement SQRT() directly in...

2014-08-02 Thread willb
GitHub user willb opened a pull request:

https://github.com/apache/spark/pull/1750

SPARK-2813:  [SQL] Implement SQRT() directly in Catalyst

This PR adds a native implementation for SQL SQRT() and thus avoids 
delegating this function to Hive.
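A native SQRT boils down to a null-safe unary evaluation over doubles. A minimal sketch of the semantics (not the actual Catalyst expression from this patch; `nullSafeSqrt` is an invented helper, with SQL NULL modeled as `Option.empty`):

```scala
// Hypothetical sketch of what a native SQRT() evaluation does: NULL input
// propagates as NULL, any other input is computed with math.sqrt directly,
// with no round trip through a Hive UDF.
object SqrtSketch {
  def nullSafeSqrt(child: Option[Double]): Option[Double] =
    child.map(math.sqrt)
}
```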

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/willb/spark spark-2813

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1750.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1750


commit 18d63f93316e56b9f0e137e272981b5a2eb84074
Author: William Benton 
Date:   2014-08-02T15:30:26Z

Added native SQRT implementation

commit bb8022612c468ae99531fbcc9ddff8a5f45bcf36
Author: William Benton 
Date:   2014-08-02T16:22:40Z

added SQRT test to SqlQuerySuite







[GitHub] spark pull request: [SPARK-2627] have the build enforce PEP 8 auto...

2014-08-02 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1744#issuecomment-50981650
  
QA tests have started for PR 1744. This patch DID NOT merge cleanly! 
View progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17800/consoleFull





[GitHub] spark pull request: [SPARK-695] In DAGScheduler's getPreferredLocs...

2014-08-02 Thread staple
Github user staple commented on the pull request:

https://github.com/apache/spark/pull/1362#issuecomment-50981523
  
Great, thanks!





[GitHub] spark pull request: [SPARK-2783][SQL] Basic support for analyze in...

2014-08-02 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1741#issuecomment-50981507
  
QA tests have started for PR 1741. This patch merges cleanly. View 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17799/consoleFull





[GitHub] spark pull request: [SPARK-2314][SQL] Override collect and take in...

2014-08-02 Thread staple
Github user staple commented on the pull request:

https://github.com/apache/spark/pull/1592#issuecomment-50981508
  
Sorry, I'm away from home and had limited time and access when I tried to do 
the merge last night - I didn't finish it, and as you mentioned I messed up the 
included commits. I'll post an explicit comment here when the merge is ready.





[GitHub] spark pull request: [SPARK-2783][SQL] Basic support for analyze in...

2014-08-02 Thread yhuai
Github user yhuai commented on the pull request:

https://github.com/apache/spark/pull/1741#issuecomment-50981485
  
Jenkins, test this please.





[GitHub] spark pull request: SPARK-2481: The environment variables SPARK_HI...

2014-08-02 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1341#issuecomment-50981393
  
QA results for PR 1341:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17798/consoleFull





[GitHub] spark pull request: [SPARK-2197] [mllib] Java DecisionTree bug fix...

2014-08-02 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1740#issuecomment-50981238
  
QA results for PR 1740:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17797/consoleFull





[GitHub] spark pull request: Spark 2017

2014-08-02 Thread carlosfuertes
Github user carlosfuertes commented on the pull request:

https://github.com/apache/spark/pull/1682#issuecomment-50980868
  
I added a configuration property "spark.ui.jsRenderingEnabled" that 
controls whether the tables are rendered with JavaScript. It is enabled by 
default. This ensures that people who cannot, or do not want to, run JavaScript 
for rendering can still use the web UI as before. 





[GitHub] spark pull request: [SPARK-2784][SQL] Deprecate hql() method in fa...

2014-08-02 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1746#issuecomment-50980687
  
QA results for PR 1746:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
  * into Spark SQL's query functions (i.e. sql()). Otherwise, users of this trait can

For more information see test output: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17795/consoleFull





[GitHub] spark pull request: SPARK-2481: The environment variables SPARK_HI...

2014-08-02 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1341#issuecomment-50980619
  
QA tests have started for PR 1341. This patch merges cleanly. View 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17798/consoleFull





[GitHub] spark pull request: [SPARK-2739][SQL] Rename registerAsTable to re...

2014-08-02 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1743#issuecomment-50980616
  
QA results for PR 1743:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17794/consoleFull





[GitHub] spark pull request: SPARK-2481: The environment variables SPARK_HI...

2014-08-02 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1341#issuecomment-50980426
  
QA results for PR 1341:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17796/consoleFull





[GitHub] spark pull request: [SPARK-2197] [mllib] Java DecisionTree bug fix...

2014-08-02 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1740#issuecomment-50980415
  
QA tests have started for PR 1740. This patch merges cleanly. View 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17797/consoleFull





[GitHub] spark pull request: [SPARK-2197] [mllib] Java DecisionTree bug fix...

2014-08-02 Thread jkbradley
Github user jkbradley commented on the pull request:

https://github.com/apache/spark/pull/1740#issuecomment-50980396
  
Thanks for the comments!  I pushed the changes.  The only remaining item is 
JavaConverters for Strategy; I'm not sure how to get it to work there.





[GitHub] spark pull request: [SPARK-2197] [mllib] Java DecisionTree bug fix...

2014-08-02 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/1740#discussion_r15732689
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/configuration/Strategy.scala 
---
@@ -60,4 +62,31 @@ class Strategy (
   val isMulticlassWithCategoricalFeatures
 = isMulticlassClassification && (categoricalFeaturesInfo.size > 0)
 
+  /**
+   * Java-friendly constructor.
+   *
+   * @param algo classification or regression
+   * @param impurity criterion used for information gain calculation
+   * @param maxDepth Maximum depth of the tree.
+   * E.g., depth 0 means 1 leaf node; depth 1 means 1 
internal node + 2 leaf nodes.
+   * @param numClassesForClassification number of classes for 
classification. Default value is 2
+   *leads to binary classification
+   * @param maxBins maximum number of bins used for splitting features
+   * @param categoricalFeaturesInfo A map storing information about the 
categorical variables and
+   *the number of discrete values they 
take. For example, an entry
+   *(n -> k) implies the feature n is 
categorical with k categories
+   *0, 1, 2, ... , k-1. It's important to 
note that features are
+   *zero-indexed.
+   */
+  def this(
+  algo: Algo,
+  impurity: Impurity,
+  maxDepth: Int,
+  numClassesForClassification: Int,
+  maxBins: Int,
+  categoricalFeaturesInfo: java.util.Map[java.lang.Integer, 
java.lang.Integer]) {
+this(algo, impurity, maxDepth, numClassesForClassification, maxBins, 
Sort,
+  categoricalFeaturesInfo.map{ case (a, b) => (a.toInt, b.toInt) 
}.toMap)
--- End diff --

I tried using that, but could not figure out how to make it work.  The 
issue is that the integer type used by the map is not converted properly.  Is 
there a good way to do that?
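For the boxed-integer issue described here, the usual pattern is `JavaConverters` plus an explicit unboxing step, matching the `a.toInt, b.toInt` in the diff; a sketch with an invented helper name `toScalaIntMap`:

```scala
import scala.collection.JavaConverters._

// Sketch: java.util.Map[java.lang.Integer, java.lang.Integer] does not
// convert to Map[Int, Int] automatically. asScala yields a mutable view
// keyed by boxed Integers, so each entry is unboxed explicitly via .toInt
// before building the immutable Scala map.
object MapConvertSketch {
  def toScalaIntMap(m: java.util.Map[java.lang.Integer, java.lang.Integer]): Map[Int, Int] =
    m.asScala.map { case (k, v) => (k.toInt, v.toInt) }.toMap
}
```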





[GitHub] spark pull request: [SPARK-2739][SQL] Rename registerAsTable to re...

2014-08-02 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/1743





[GitHub] spark pull request: SPARK-2481: The environment variables SPARK_HI...

2014-08-02 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1341#issuecomment-50979713
  
QA tests have started for PR 1341. This patch merges cleanly. View 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17796/consoleFull





[GitHub] spark pull request: [SPARK-2784][SQL] Deprecate hql() method in fa...

2014-08-02 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1746#issuecomment-50979645
  
QA tests have started for PR 1746. This patch merges cleanly. View 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17795/consoleFull





[GitHub] spark pull request: [SPARK-2784][SQL] Deprecate hql() method in fa...

2014-08-02 Thread marmbrus
Github user marmbrus commented on the pull request:

https://github.com/apache/spark/pull/1746#issuecomment-50979639
  
test this please





[GitHub] spark pull request: SPARK-2481: The environment variables SPARK_HI...

2014-08-02 Thread witgo
Github user witgo commented on the pull request:

https://github.com/apache/spark/pull/1341#issuecomment-50979611
  
@pwendell I think the PR can be merged into 1.1





[GitHub] spark pull request: [SPARK-2784][SQL] Deprecate hql() method in fa...

2014-08-02 Thread marmbrus
Github user marmbrus commented on the pull request:

https://github.com/apache/spark/pull/1746#issuecomment-50979618
  
Github is pretty confused about this one now since apache is lagging...





[GitHub] spark pull request: [SPARK-2739][SQL] Rename registerAsTable to re...

2014-08-02 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1743#issuecomment-50979512
  
QA tests have started for PR 1743. This patch merges cleanly. View 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17794/consoleFull





[GitHub] spark pull request: [SPARK-2739][SQL] Rename registerAsTable to re...

2014-08-02 Thread marmbrus
Github user marmbrus commented on the pull request:

https://github.com/apache/spark/pull/1743#issuecomment-50979490
  
Thanks for looking this over!  I've merged to master and 1.1





[GitHub] spark pull request: SPARK-2804: Remove scalalogging-slf4j dependen...

2014-08-02 Thread witgo
Github user witgo commented on the pull request:

https://github.com/apache/spark/pull/1208#issuecomment-50979440
  
Cool thanks





[GitHub] spark pull request: [SPARK-1470][SPARK-1842] Use the scala-logging...

2014-08-02 Thread witgo
Github user witgo closed the pull request at:

https://github.com/apache/spark/pull/332





[GitHub] spark pull request: [SPARK-2739][SQL] Rename registerAsTable to re...

2014-08-02 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1743#issuecomment-50979387
  
QA results for PR 1743:
- This patch PASSES unit tests.
- This patch merges cleanly.
- This patch adds the following public classes (experimental):
  - trait OverrideFunctionRegistry extends FunctionRegistry {
  - class SimpleFunctionRegistry extends FunctionRegistry {
  - protected[sql] trait UDFRegistration {
  - class JavaSQLContext(val sqlContext: SQLContext) extends UDFRegistration {
  - case class EvaluatePython(udf: PythonUDF, child: LogicalPlan) extends logical.UnaryNode {
  - case class BatchPythonEvaluation(udf: PythonUDF, output: Seq[Attribute], child: SparkPlan)

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17791/consoleFull





[GitHub] spark pull request: [SPARK-1997] mllib - upgrade to breeze 0.8.1

2014-08-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1749#issuecomment-50979270
  
Can one of the admins verify this patch?





[GitHub] spark pull request: [SPARK-1997] mllib - upgrade to breeze 0.8.1

2014-08-02 Thread avati
GitHub user avati opened a pull request:

https://github.com/apache/spark/pull/1749

[SPARK-1997] mllib - upgrade to breeze 0.8.1

Signed-off-by: Anand Avati 

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/avati/spark SPARK-1997-breeze-0.8.1

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1749.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1749


commit 5a9e6ba7694fb67e50a9cd469ce65ed5f7b91b0d
Author: Anand Avati 
Date:   2014-07-26T04:06:48Z

[SPARK-1997] mllib - upgrade to breeze 0.8.1

Signed-off-by: Anand Avati 







[GitHub] spark pull request: [SPARK-2797] [SQL] SchemaRDDs don't support un...

2014-08-02 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/1745





[GitHub] spark pull request: [SPARK-2729][SQL] Added test case for SPARK-27...

2014-08-02 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/1738





[GitHub] spark pull request: [Minor] Fixes on top of #1679

2014-08-02 Thread mridulm
Github user mridulm commented on a diff in the pull request:

https://github.com/apache/spark/pull/1736#discussion_r15732470
  
--- Diff: 
core/src/main/scala/org/apache/spark/storage/BlockManagerSource.scala ---
@@ -46,9 +46,8 @@ private[spark] class BlockManagerSource(val blockManager: 
BlockManager, sc: Spar
   metricRegistry.register(MetricRegistry.name("memory", "memUsed_MB"), new 
Gauge[Long] {
 override def getValue: Long = {
   val storageStatusList = blockManager.master.getStorageStatus
-  val maxMem = storageStatusList.map(_.maxMem).sum
-  val remainingMem = storageStatusList.map(_.memRemaining).sum
-  (maxMem - remainingMem) / 1024 / 1024
+  val memUsed = storageStatusList.map(_.memUsed).sum
+  memUsed / 1024 / 1024
--- End diff --

Btw, it is just a nit, so please don't let this block the commit!
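The simplification in the diff is algebraically equivalent to the old computation. A quick Python sketch (made-up numbers, standing in for the Scala code) showing that summing per-executor used memory matches total max minus total remaining:

```python
# Hypothetical per-executor storage statuses as (maxMem, memRemaining) pairs,
# in bytes, standing in for blockManager.master.getStorageStatus.
statuses = [(4 * 1024**2, 1 * 1024**2), (8 * 1024**2, 8 * 1024**2), (2 * 1024**2, 0)]

# Old computation: total max memory minus total remaining memory, in MB.
max_mem = sum(m for m, _ in statuses)
remaining_mem = sum(r for _, r in statuses)
old_used_mb = (max_mem - remaining_mem) // 1024 // 1024

# New computation: sum per-executor used memory (max - remaining) directly.
new_used_mb = sum(m - r for m, r in statuses) // 1024 // 1024

assert old_used_mb == new_used_mb == 5
```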





[GitHub] spark pull request: [SPARK-2784][SQL] Deprecate hql() method in fa...

2014-08-02 Thread yhuai
Github user yhuai commented on the pull request:

https://github.com/apache/spark/pull/1746#issuecomment-50979067
  
LGTM. Two places in the programming guide need to be updated.
```
.//docs/sql-programming-guide.md:the `sql` method a `JavaHiveContext` also 
provides an `hql` methods, which allows queries to be
.//docs/sql-programming-guide.md:the `sql` method a `HiveContext` also 
provides an `hql` methods, which allows queries to be
```
But since we will work on docs next week, we can update these in our docs PR.





[GitHub] spark pull request: [SPARK-2797] [SQL] SchemaRDDs don't support un...

2014-08-02 Thread marmbrus
Github user marmbrus commented on the pull request:

https://github.com/apache/spark/pull/1745#issuecomment-50979052
  
Thanks!  I've merged this to master and 1.1





[GitHub] spark pull request: SPARK-2602 [BUILD] Tests steal focus under Jav...

2014-08-02 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1747#issuecomment-50979051
  
QA results for PR 1747:
- This patch PASSES unit tests.
- This patch merges cleanly.
- This patch adds no public classes.

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17793/consoleFull





[GitHub] spark pull request: [SPARK-2797] [SQL] SchemaRDDs don't support un...

2014-08-02 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1745#issuecomment-50979025
  
QA results for PR 1745:
- This patch PASSES unit tests.
- This patch merges cleanly.
- This patch adds no public classes.

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17790/consoleFull





[GitHub] spark pull request: SPARK-2414 [BUILD] Add LICENSE entry for jquer...

2014-08-02 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1748#issuecomment-50979032
  
QA results for PR 1748:
- This patch PASSES unit tests.
- This patch merges cleanly.
- This patch adds no public classes.

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17792/consoleFull





[GitHub] spark pull request: [Minor] Fixes on top of #1679

2014-08-02 Thread mridulm
Github user mridulm commented on a diff in the pull request:

https://github.com/apache/spark/pull/1736#discussion_r15732438
  
--- Diff: 
core/src/main/scala/org/apache/spark/storage/BlockManagerSource.scala ---
@@ -46,9 +46,8 @@ private[spark] class BlockManagerSource(val blockManager: 
BlockManager, sc: Spar
   metricRegistry.register(MetricRegistry.name("memory", "memUsed_MB"), new 
Gauge[Long] {
 override def getValue: Long = {
   val storageStatusList = blockManager.master.getStorageStatus
-  val maxMem = storageStatusList.map(_.maxMem).sum
-  val remainingMem = storageStatusList.map(_.memRemaining).sum
-  (maxMem - remainingMem) / 1024 / 1024
+  val memUsed = storageStatusList.map(_.memUsed).sum
+  memUsed / 1024 / 1024
--- End diff --

bad code is bad code :-)





[GitHub] spark pull request: [SPARK-2784][SQL] Deprecate hql() method in fa...

2014-08-02 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1746#issuecomment-50978890
  
QA results for PR 1746:
- This patch FAILED unit tests.
- This patch merges cleanly.
- This patch adds the following public classes (experimental):
  - * into Spark SQL's query functions (i.e. sql()). Otherwise, users of this trait can

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17789/consoleFull





[GitHub] spark pull request: SPARK-2566. Update ShuffleWriteMetrics increme...

2014-08-02 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/1481#issuecomment-50978548
  
The best way might be to do something like this.

```
/**
 * :: DeveloperApi ::
 * Alias for WriterMetrics, kept for compatibility reasons.
 */
@DeveloperApi
class ShuffleWriteMetrics extends WriterMetrics

/**
 * :: DeveloperApi ::
 * Metrics pertaining to data written through a BlockObjectWriter.
 */
@DeveloperApi
class WriterMetrics extends Serializable {
  /**
   * Number of bytes written for this task
   */
  var shuffleBytesWritten: Long = _

  /**
   * Time the task spent blocking on writes to disk or buffer cache, in nanoseconds
   */
  var shuffleWriteTime: Long = _
}

```
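The same alias-for-compatibility pattern, sketched in Python (the names here are hypothetical stand-ins, not the actual Spark API):

```python
class WriterMetrics:
    """Metrics pertaining to data written through a writer."""
    def __init__(self):
        self.bytes_written = 0   # bytes written for this task
        self.write_time_ns = 0   # time spent blocking on writes, in nanoseconds


class ShuffleWriteMetrics(WriterMetrics):
    """Alias for WriterMetrics, kept so existing callers keep working."""
    pass


# Existing code that constructs ShuffleWriteMetrics is unaffected,
# while new code can accept the more general WriterMetrics type.
m = ShuffleWriteMetrics()
assert isinstance(m, WriterMetrics)
```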





[GitHub] spark pull request: SPARK-2294: fix locality inversion bug in Task...

2014-08-02 Thread CodingCat
Github user CodingCat commented on the pull request:

https://github.com/apache/spark/pull/1313#issuecomment-50978501
  
Let me give some more explanation of the trace printed above:

Set()
ANY,NODE_LOCAL
task 1, ArrayBuffer()
task 0, ArrayBuffer(TaskLocation(localhost, None))
miss task
==

the first line is the set of speculative tasks,
the second line shows maxLocality and allowedLocality,
the third through the second-to-last lines are the tasks in allPendingTasks and their locality preferences,
and the last line is whether the TaskSetManager finds a task.

From the trace above, we can see that the nonPref tasks indeed experience unnecessary delay, causing the test case to be interrupted.





[GitHub] spark pull request: [SPARK-2729][SQL] Added test case for SPARK-27...

2014-08-02 Thread marmbrus
Github user marmbrus commented on the pull request:

https://github.com/apache/spark/pull/1738#issuecomment-50978394
  
Thanks!  I've merged this into master and 1.1





[GitHub] spark pull request: [SPARK-2314][SQL] Override collect and take in...

2014-08-02 Thread marmbrus
Github user marmbrus commented on the pull request:

https://github.com/apache/spark/pull/1592#issuecomment-50978374
  
This seems to have captured a bunch of unrelated changes during the rebase.





[GitHub] spark pull request: SPARK-2566. Update ShuffleWriteMetrics increme...

2014-08-02 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/1481#issuecomment-50978354
  
@sryza Can't the `ExternalSorter` and `ExternalAppendOnlyMap` just pass 
their own `ShuffleWriteMetrics` when they create a disk writer and then read 
back the bytes written?

We could also change the name of `ShuffleWriteMetrics` to just be 
`WriteMetrics` - or we could leave it for now and just put a TODO.
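The suggestion above can be sketched as follows (a hypothetical Python analogue, not the actual Spark classes): each spilling component creates its own metrics object, hands it to the disk writer it creates, and reads the byte count back afterwards.

```python
class WriteMetrics:
    """Per-writer byte counter (hypothetical stand-in for ShuffleWriteMetrics)."""
    def __init__(self):
        self.bytes_written = 0


class DiskWriter:
    """Writer that records how much it wrote into the metrics it was given."""
    def __init__(self, metrics):
        self.metrics = metrics

    def write(self, data):
        self.metrics.bytes_written += len(data)


# An external sorter passes its own metrics object when creating the writer,
# then reads the bytes written back once the spill is done.
spill_metrics = WriteMetrics()
writer = DiskWriter(spill_metrics)
writer.write(b"spilled records")
assert spill_metrics.bytes_written == 15
```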





[GitHub] spark pull request: [SPARK-2478] [mllib] DecisionTree Python API

2014-08-02 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1727#discussion_r15732257
  
--- Diff: examples/src/main/python/mllib/tree.py ---
@@ -0,0 +1,129 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""
+Decision tree classification and regression using MLlib.
+"""
+
+import numpy, os, sys
+
+from operator import add
+
+from pyspark import SparkContext
+from pyspark.mllib.regression import LabeledPoint
+from pyspark.mllib.tree import DecisionTree
+from pyspark.mllib.util import MLUtils
+
+
+def getAccuracy(dtModel, data):
+"""
+Return accuracy of DecisionTreeModel on the given RDD[LabeledPoint].
+"""
+seqOp = (lambda acc, x: acc + (x[0] == x[1]))
+predictions = dtModel.predict(data.map(lambda x: x.features))
+truth = data.map(lambda p: p.label)
+trainCorrect = predictions.zip(truth).aggregate(0, seqOp, add)
+return trainCorrect / (0.0 + data.count())
+
+
+def getMSE(dtModel, data):
+"""
+Return mean squared error (MSE) of DecisionTreeModel on the given
+RDD[LabeledPoint].
+"""
+seqOp = (lambda acc, x: acc + numpy.square(x[0] - x[1]))
+predictions = dtModel.predict(data.map(lambda x: x.features))
+truth = data.map(lambda p: p.label)
+trainMSE = predictions.zip(truth).aggregate(0, seqOp, add)
+return trainMSE / (0.0 + data.count())
+
+
+def reindexClassLabels(data):
+"""
+Re-index class labels in a dataset to the range {0,...,numClasses-1}.
+If all labels in that range already appear at least once,
+ then the returned RDD is the same one (without a mapping).
+Note: If a label simply does not appear in the data,
+  the index will not include it.
+  Be aware of this when reindexing subsampled data.
+:param data: RDD of LabeledPoint where labels are integer values
+ denoting labels for a classification problem.
+:return: Pair (reindexedData, origToNewLabels) where
+ reindexedData is an RDD of LabeledPoint with labels in
+  the range {0,...,numClasses-1}, and
+ origToNewLabels is a dictionary mapping original labels
+  to new labels.
+"""
+# classCounts: class --> # examples in class
+classCounts = data.map(lambda x: x.label).countByValue()
+numExamples = sum(classCounts.values())
+sortedClasses = sorted(classCounts.keys())
+numClasses = len(classCounts)
+# origToNewLabels: class --> index in 0,...,numClasses-1
+if (numClasses < 2):
+print >> sys.stderr, \
+"Dataset for classification should have at least 2 classes." + 
\
+" The given dataset had only %d classes." % numClasses
+exit(-1)
+origToNewLabels = dict([(sortedClasses[i], i) for i in 
range(0,numClasses)])
+
+print "numClasses = %d" % numClasses
+print "Per-class example fractions, counts:"
+print "Class\tFrac\tCount"
+for c in sortedClasses:
+frac = classCounts[c] / (numExamples + 0.0)
+print "%g\t%g\t%d" % (c, frac, classCounts[c])
+
+if (sortedClasses[0] == 0 and sortedClasses[-1] == numClasses - 1):
--- End diff --

Only the first and the last were checked. The values in the middle could be 
something like `0.5`.
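A concrete illustration of the gap (plain Python, mirroring the check in the diff): looking only at the first and last sorted labels accepts a label set with a non-integer value in the middle.

```python
sorted_classes = [0, 0.5, 2]          # 0.5 is not a valid class index
num_classes = len(sorted_classes)

# The patch's check passes anyway, because it only looks at the endpoints:
passes_endpoint_check = (sorted_classes[0] == 0
                         and sorted_classes[-1] == num_classes - 1)
assert passes_endpoint_check

# A stricter check compares against the full expected range and rejects it:
assert sorted_classes != list(range(num_classes))
```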





[GitHub] spark pull request: [SPARK-2097][SQL] UDF Support

2014-08-02 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/1063





[GitHub] spark pull request: [SPARK-2785][SQL] Remove assertions that throw...

2014-08-02 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/1742





[GitHub] spark pull request: SPARK-2602 [BUILD] Tests steal focus under Jav...

2014-08-02 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1747#issuecomment-50978057
  
QA tests have started for PR 1747. This patch merges cleanly. View 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17793/consoleFull





[GitHub] spark pull request: SPARK-2414 [BUILD] Add LICENSE entry for jquer...

2014-08-02 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1748#issuecomment-50978055
  
QA tests have started for PR 1748. This patch merges cleanly. View 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17792/consoleFull





[GitHub] spark pull request: SPARK-2414 [BUILD] Add LICENSE entry for jquer...

2014-08-02 Thread srowen
GitHub user srowen opened a pull request:

https://github.com/apache/spark/pull/1748

SPARK-2414 [BUILD] Add LICENSE entry for jquery

The JIRA concerned removing jquery, and this does not remove jquery. But while 
jquery is distributed by Spark, strictly speaking it should have an accompanying 
line in LICENSE, as per http://www.apache.org/dev/licensing-howto.html

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/srowen/spark SPARK-2414

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1748.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1748


commit 2fdb03c99d6d802c85c4d2033d670eafd4bcb118
Author: Sean Owen 
Date:   2014-08-02T23:51:15Z

Add LICENSE entry for jquery







[GitHub] spark pull request: [SPARK-2739][SQL] Rename registerAsTable to re...

2014-08-02 Thread marmbrus
Github user marmbrus commented on the pull request:

https://github.com/apache/spark/pull/1743#issuecomment-50978019
  
Good catch @yhuai.  I've updated the java files as well.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-2783][SQL] Basic support for analyze in...

2014-08-02 Thread marmbrus
Github user marmbrus commented on the pull request:

https://github.com/apache/spark/pull/1741#issuecomment-50977990
  
Hmmm, linux vs mac file size problems?

```
[info] StatisticsSuite:
[info] - analyze MetastoreRelations *** FAILED ***
[info]   11768 did not equal 11624 (StatisticsSuite.scala:42)
```





[GitHub] spark pull request: SPARK-2602 [BUILD] Tests steal focus under Jav...

2014-08-02 Thread srowen
GitHub user srowen opened a pull request:

https://github.com/apache/spark/pull/1747

SPARK-2602 [BUILD] Tests steal focus under Java 6

As per https://issues.apache.org/jira/browse/SPARK-2602 , this may be 
resolved for Java 6 with the java.awt.headless system property, which never 
hurt anyone running a command-line app. I tested it and it seemed to get rid of 
the focus stealing.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/srowen/spark SPARK-2602

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1747.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1747


commit b141018061365cb42f2991506ee4ecd4bd4f377b
Author: Sean Owen 
Date:   2014-08-02T23:47:24Z

Set java.awt.headless during tests






