[GitHub] spark issue #14702: [SPARK-15694] Implement ScriptTransformation in sql/core...

2016-10-10 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/14702
  
retest this please





[GitHub] spark issue #14788: [SPARK-17174][SQL] Add the support for TimestampType for...

2016-10-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14788
  
**[Test build #66713 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66713/consoleFull)** for PR 14788 at commit [`ef67829`](https://github.com/apache/spark/commit/ef678292d104f2d7a4b637cedc0a388aeb900323).
 * This patch **fails PySpark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request #15072: [SPARK-17123][SQL] Use type-widened encoder for D...

2016-10-10 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/15072#discussion_r82728292
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -53,7 +53,15 @@ import org.apache.spark.util.Utils
 
 private[sql] object Dataset {
   def apply[T: Encoder](sparkSession: SparkSession, logicalPlan: LogicalPlan): Dataset[T] = {
-    new Dataset(sparkSession, logicalPlan, implicitly[Encoder[T]])
+    val encoder = implicitly[Encoder[T]]
+    if (encoder.clsTag.runtimeClass == classOf[Row]) {
+      // We should use the encoder generated from the executed plan rather than the existing
+      // encoder for DataFrame because the types of columns can be varied due to widening types.
+      // See SPARK-17123. This is a bit hacky. Maybe we should find a better way to do this.
+      ofRows(sparkSession, logicalPlan).asInstanceOf[Dataset[T]]
+    } else {
+      new Dataset(sparkSession, logicalPlan, encoder)
+    }
--- End diff --

Hm, I manually tested. It seems `except` fails too, but it seems fine with `intersect`.
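
For context, a minimal sketch of the widening scenario under discussion (assumes a SparkSession `spark` with `spark.implicits._` imported; illustrative only, not the patch's code):

```scala
import spark.implicits._

// Two DataFrames whose "value" columns differ only in numeric width.
val ints  = Seq(1, 2).toDF("value")   // value: int
val longs = Seq(3L).toDF("value")     // value: bigint

// A set operator widens int to bigint in the analyzed plan, so an encoder
// captured before the operation no longer matches the output schema;
// re-deriving it via ofRows (as the patch does) avoids the mismatch.
val unioned = ints.union(longs)
unioned.printSchema()  // value: long
```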





[GitHub] spark pull request #13675: [SPARK-15957] [ML] RFormula supports forcing to i...

2016-10-10 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/13675





[GitHub] spark issue #13675: [SPARK-15957] [ML] RFormula supports forcing to index la...

2016-10-10 Thread yanboliang
Github user yanboliang commented on the issue:

https://github.com/apache/spark/pull/13675
  
I'll merge this into master, thanks for review! @jkbradley @felixcheung 





[GitHub] spark issue #13675: [SPARK-15957] [ML] RFormula supports forcing to index la...

2016-10-10 Thread yanboliang
Github user yanboliang commented on the issue:

https://github.com/apache/spark/pull/13675
  
@felixcheung This PR does not affect R code; I will send another PR to fix issues like [SPARK-15153](https://issues.apache.org/jira/browse/SPARK-15153), which needs some R tests added.





[GitHub] spark issue #15425: [SPARK-17816] [Core] [Branch-2.0] Fix ConcurrentModifica...

2016-10-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15425
  
**[Test build #3321 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3321/consoleFull)** for PR 15425 at commit [`678ee6b`](https://github.com/apache/spark/commit/678ee6b1d6308a81a5c2d83a196144f29c80434b).





[GitHub] spark issue #15425: [SPARK-17816] [Core] [Branch-2.0] Fix ConcurrentModifica...

2016-10-10 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/15425
  
Jenkins, test this please.






[GitHub] spark issue #15295: [SPARK-17720][SQL] introduce static SQL conf

2016-10-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15295
  
Merged build finished. Test FAILed.





[GitHub] spark issue #15295: [SPARK-17720][SQL] introduce static SQL conf

2016-10-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15295
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66706/
Test FAILed.





[GitHub] spark issue #15295: [SPARK-17720][SQL] introduce static SQL conf

2016-10-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15295
  
**[Test build #66706 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66706/consoleFull)** for PR 15295 at commit [`8d93c4a`](https://github.com/apache/spark/commit/8d93c4aed4b32ef145f054571a6c8097d01ee5e8).
 * This patch **fails SparkR unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #15424: [SPARK-17338][SQL][follow-up] add global temp view

2016-10-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15424
  
Merged build finished. Test PASSed.





[GitHub] spark issue #15424: [SPARK-17338][SQL][follow-up] add global temp view

2016-10-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15424
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66707/
Test PASSed.





[GitHub] spark issue #15424: [SPARK-17338][SQL][follow-up] add global temp view

2016-10-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15424
  
**[Test build #66707 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66707/consoleFull)** for PR 15424 at commit [`15efca6`](https://github.com/apache/spark/commit/15efca65f3249675f7b137ffb42eb08a875c6269).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #15292: [SPARK-17719][SPARK-17776][SQL] Unify and tie up options...

2016-10-10 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/15292
  
@gatorsmile @cloud-fan Thank you both for reviewing this!





[GitHub] spark issue #15388: [SPARK-17821][SQL] Support And and Or in Expression Cano...

2016-10-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15388
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66708/
Test PASSed.





[GitHub] spark issue #15388: [SPARK-17821][SQL] Support And and Or in Expression Cano...

2016-10-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15388
  
Merged build finished. Test PASSed.





[GitHub] spark pull request #15412: [SPARK-17844] Simplify DataFrame API for defining...

2016-10-10 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/15412





[GitHub] spark issue #15388: [SPARK-17821][SQL] Support And and Or in Expression Cano...

2016-10-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15388
  
**[Test build #66708 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66708/consoleFull)** for PR 15388 at commit [`21958d7`](https://github.com/apache/spark/commit/21958d7e7b2cb0de6a5b6353afc933359e490df2).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #15412: [SPARK-17844] Simplify DataFrame API for defining frame ...

2016-10-10 Thread hvanhovell
Github user hvanhovell commented on the issue:

https://github.com/apache/spark/pull/15412
  
LGTM - merging to master.





[GitHub] spark pull request #15416: [SPARK-17849] [SQL] Fix NPE problem when using gr...

2016-10-10 Thread yangw1234
Github user yangw1234 commented on a diff in the pull request:

https://github.com/apache/spark/pull/15416#discussion_r82726593
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
@@ -298,10 +298,14 @@ class Analyzer(
   case other => Alias(other, other.toString)()
 }
 
-val nonNullBitmask = x.bitmasks.reduce(_ & _)
+// The rightmost bit in the bitmasks corresponds to the last expression in groupByAliases with 0
+// indicating this expression is in the grouping set. The following line of code calculates the
+// bitmask representing the expressions that exist in all the grouping sets (also indicated by 0).
+val nonNullBitmask = x.bitmasks.reduce(_ | _)
--- End diff --

done @davies 
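
To make the bitmask semantics concrete, a toy illustration in plain Scala (the grouping sets and masks are assumed for the example, not taken from the patch):

```scala
// GROUPING SETS ((a), (a, b)) over expressions [a, b]: the rightmost bit
// stands for the last expression b, and a 0 bit means "present in this
// grouping set".
val bitmasks = Seq(
  0x1, // set (a):    a present (0), b absent (1)
  0x0  // set (a, b): both present
)

// & keeps a 0 wherever any mask has a 0, wrongly treating b as present
// in all grouping sets (hence non-nullable).
val withAnd = bitmasks.reduce(_ & _) // 0x0
// | keeps a 0 only where every mask has a 0: exactly the expressions that
// appear in all grouping sets, which is what the fix computes.
val withOr = bitmasks.reduce(_ | _)  // 0x1 -> b may be null
```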





[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing

2016-10-10 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/15148#discussion_r82726587
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/LSH.scala ---
@@ -0,0 +1,343 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import scala.util.Random
+
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.ml.linalg.{Vector, VectorUDT}
+import org.apache.spark.ml.param.{IntParam, ParamMap, ParamValidators}
+import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol}
+import org.apache.spark.ml.util.SchemaUtils
+import org.apache.spark.sql._
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.types._
+
+/**
+ * :: Experimental ::
+ * Params for [[LSH]].
+ */
+@Since("2.1.0")
+private[ml] trait LSHParams extends HasInputCol with HasOutputCol {
+  /**
+   * Param for the dimension of LSH OR-amplification.
+   *
+   * In this implementation, we use LSH OR-amplification to reduce the false negative rate. The
+   * higher the dimension is, the lower the false negative rate.
+   * @group param
+   */
+  @Since("2.1.0")
+  final val outputDim: IntParam = new IntParam(this, "outputDim", "output dimension, where" +
+    "increasing dimensionality lowers the false negative rate, and decreasing dimensionality" +
+    " improves the running performance", ParamValidators.gt(0))
+
+  /** @group getParam */
+  @Since("2.1.0")
+  final def getOutputDim: Int = $(outputDim)
+
+  /**
+   * Transform the Schema for LSH
+   * @param schema The schema of the input dataset without [[outputCol]]
+   * @return A derived schema with [[outputCol]] added
+   */
+  @Since("2.1.0")
+  protected[this] final def validateAndTransformSchema(schema: StructType): StructType = {
+    SchemaUtils.appendColumn(schema, $(outputCol), new VectorUDT)
+  }
+}
+
+/**
+ * :: Experimental ::
+ * Model produced by [[LSH]].
+ */
+@Experimental
+@Since("2.1.0")
+private[ml] abstract class LSHModel[T <: LSHModel[T]] extends Model[T] with LSHParams {
+  self: T =>
+
+  @Since("2.1.0")
+  override def copy(extra: ParamMap): T = defaultCopy(extra)
+
+  /**
+   * The hash function of LSH, mapping a predefined KeyType to a Vector
+   * @return The mapping of LSH function.
+   */
+  @Since("2.1.0")
+  protected[this] val hashFunction: Vector => Vector
+
+  /**
+   * Calculate the distance between two different keys using the distance metric corresponding
+   * to the hashFunction
+   * @param x One input vector in the metric space
+   * @param y One input vector in the metric space
+   * @return The distance between x and y
+   */
+  @Since("2.1.0")
+  protected[ml] def keyDistance(x: Vector, y: Vector): Double
+
+  /**
+   * Calculate the distance between two different hash Vectors.
+   *
+   * @param x One of the hash vector
+   * @param y Another hash vector
+   * @return The distance between hash vectors x and y
+   */
+  @Since("2.1.0")
+  protected[ml] def hashDistance(x: Vector, y: Vector): Double
+
+  @Since("2.1.0")
+  override def transform(dataset: Dataset[_]): DataFrame = {
+    transformSchema(dataset.schema, logging = true)
+    val transformUDF = udf(hashFunction, new VectorUDT)
+    dataset.withColumn($(outputCol), transformUDF(dataset($(inputCol))))
+  }
+
+  @Since("2.1.0")
+  override def transformSchema(schema: StructType): StructType = {
+    validateAndTransformSchema(schema)
+  }
+
+  /**
+   * Given a large dataset and an item, approximately find at most k items which have the closest
+   * distance to the item. If the [[outputCol]] is missing, the method will transform the data; if
+   * the [[outputCol]] exists, it 

[GitHub] spark issue #15292: [SPARK-17719][SPARK-17776][SQL] Unify and tie up options...

2016-10-10 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/15292
  
Thanks! Merging to master!





[GitHub] spark pull request #15408: [SPARK-17839][CORE] Use Nio's directbuffer instea...

2016-10-10 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/15408#discussion_r82726120
  
--- Diff: core/src/main/java/org/apache/spark/io/NioBasedBufferedFileInputStream.java ---
@@ -0,0 +1,127 @@
+/*
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.io;
+
+import org.apache.spark.storage.StorageUtils;
+
+import java.io.File;
+import java.io.IOException;
+import java.io.InputStream;
+import java.nio.ByteBuffer;
+import java.nio.channels.FileChannel;
+import java.nio.file.StandardOpenOption;
+
+/**
+ * {@link InputStream} implementation which uses direct buffer
+ * to read a file to avoid extra copy of data between Java and
+ * native memory which happens when using {@link java.io.BufferedInputStream}.
+ * Unfortunately, this is not something already available in JDK,
+ * {@link sun.nio.ch.ChannelInputStream} supports reading a file using nio,
+ * but does not support buffering.
+ *
+ */
+public final class NioBasedBufferedFileInputStream extends InputStream {
+
+  private static int DEFAULT_BUFFER_SIZE_BYTES = 8192;
+
+  private final ByteBuffer byteBuffer;
+
+  private final FileChannel fileChannel;
+
+  public NioBasedBufferedFileInputStream(File file, int bufferSizeInBytes) throws IOException {
+    byteBuffer = ByteBuffer.allocateDirect(bufferSizeInBytes);
+    fileChannel = FileChannel.open(file.toPath(), StandardOpenOption.READ);
+    byteBuffer.flip();
+  }
+
+  public NioBasedBufferedFileInputStream(File file) throws IOException {
+    this(file, DEFAULT_BUFFER_SIZE_BYTES);
+  }
+
+  /**
+   * Checks whether data is left to be read from the input stream.
+   * @return true if data is left, false otherwise
+   * @throws IOException
+   */
+  private boolean refill() throws IOException {
+    if (!byteBuffer.hasRemaining()) {
+      byteBuffer.clear();
+      int nRead = fileChannel.read(byteBuffer);
+      if (nRead == -1) {
--- End diff --

Hm, https://docs.oracle.com/javase/7/docs/api/java/nio/channels/FileChannel.html#read(java.nio.ByteBuffer) suggests that 0 doesn't mean EOF, just 0 bytes read; but I'm also not sure what to do if the channel won't actually give any bytes at this point. I think that can only happen if the buffer is full, but that won't happen here. `<= 0` seems reasonable AFAIK.
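
A small sketch of the `<= 0` variant being discussed (plain Scala over java.nio; `buf` and `channel` stand in for the patch's fields, and this is an assumption about the eventual fix, not the merged code):

```scala
import java.nio.ByteBuffer
import java.nio.channels.FileChannel

// Returns false once no more data can be read. Treating read() <= 0 as
// end-of-stream covers the 0-bytes-read case the FileChannel javadoc
// allows for, at the cost of giving up if the channel ever returns 0.
def refill(buf: ByteBuffer, channel: FileChannel): Boolean = {
  if (!buf.hasRemaining) {
    buf.clear()
    val nRead = channel.read(buf)
    if (nRead <= 0) {
      return false
    }
    buf.flip()
  }
  true
}
```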





[GitHub] spark pull request #15072: [SPARK-17123][SQL] Use type-widened encoder for D...

2016-10-10 Thread hvanhovell
Github user hvanhovell commented on a diff in the pull request:

https://github.com/apache/spark/pull/15072#discussion_r82726273
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -53,7 +53,15 @@ import org.apache.spark.util.Utils
 
 private[sql] object Dataset {
   def apply[T: Encoder](sparkSession: SparkSession, logicalPlan: LogicalPlan): Dataset[T] = {
-    new Dataset(sparkSession, logicalPlan, implicitly[Encoder[T]])
+    val encoder = implicitly[Encoder[T]]
+    if (encoder.clsTag.runtimeClass == classOf[Row]) {
+      // We should use the encoder generated from the executed plan rather than the existing
+      // encoder for DataFrame because the types of columns can be varied due to widening types.
+      // See SPARK-17123. This is a bit hacky. Maybe we should find a better way to do this.
+      ofRows(sparkSession, logicalPlan).asInstanceOf[Dataset[T]]
+    } else {
+      new Dataset(sparkSession, logicalPlan, encoder)
+    }
--- End diff --

We only need this for Union, right? In all other cases we only return tuples from the first dataset.





[GitHub] spark pull request #15292: [SPARK-17719][SPARK-17776][SQL] Unify and tie up ...

2016-10-10 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/15292





[GitHub] spark issue #15412: [SPARK-17844] Simplify DataFrame API for defining frame ...

2016-10-10 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/15412
  
cc @hvanhovell ?





[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing

2016-10-10 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/15148#discussion_r82725489
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/LSH.scala ---
@@ -0,0 +1,343 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import scala.util.Random
+
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.ml.linalg.{Vector, VectorUDT}
+import org.apache.spark.ml.param.{IntParam, ParamMap, ParamValidators}
+import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol}
+import org.apache.spark.ml.util.SchemaUtils
+import org.apache.spark.sql._
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.types._
+
+/**
+ * :: Experimental ::
+ * Params for [[LSH]].
+ */
+@Since("2.1.0")
+private[ml] trait LSHParams extends HasInputCol with HasOutputCol {
+  /**
+   * Param for the dimension of LSH OR-amplification.
+   *
+   * In this implementation, we use LSH OR-amplification to reduce the false negative rate. The
+   * higher the dimension is, the lower the false negative rate.
+   * @group param
+   */
+  @Since("2.1.0")
+  final val outputDim: IntParam = new IntParam(this, "outputDim", "output dimension, where" +
+    "increasing dimensionality lowers the false negative rate, and decreasing dimensionality" +
+    " improves the running performance", ParamValidators.gt(0))
+
+  /** @group getParam */
+  @Since("2.1.0")
+  final def getOutputDim: Int = $(outputDim)
+
+  /**
+   * Transform the Schema for LSH
+   * @param schema The schema of the input dataset without [[outputCol]]
+   * @return A derived schema with [[outputCol]] added
+   */
+  @Since("2.1.0")
+  protected[this] final def validateAndTransformSchema(schema: StructType): StructType = {
+    SchemaUtils.appendColumn(schema, $(outputCol), new VectorUDT)
+  }
+}
+
+/**
+ * :: Experimental ::
+ * Model produced by [[LSH]].
+ */
+@Experimental
+@Since("2.1.0")
+private[ml] abstract class LSHModel[T <: LSHModel[T]] extends Model[T] with LSHParams {
+  self: T =>
+
+  @Since("2.1.0")
+  override def copy(extra: ParamMap): T = defaultCopy(extra)
+
+  /**
+   * The hash function of LSH, mapping a predefined KeyType to a Vector
+   * @return The mapping of LSH function.
+   */
+  @Since("2.1.0")
+  protected[this] val hashFunction: Vector => Vector
+
+  /**
+   * Calculate the distance between two different keys using the distance metric corresponding
+   * to the hashFunction
+   * @param x One input vector in the metric space
+   * @param y One input vector in the metric space
+   * @return The distance between x and y
+   */
+  @Since("2.1.0")
+  protected[ml] def keyDistance(x: Vector, y: Vector): Double
+
+  /**
+   * Calculate the distance between two different hash Vectors.
+   *
+   * @param x One of the hash vector
+   * @param y Another hash vector
+   * @return The distance between hash vectors x and y
+   */
+  @Since("2.1.0")
+  protected[ml] def hashDistance(x: Vector, y: Vector): Double
+
+  @Since("2.1.0")
+  override def transform(dataset: Dataset[_]): DataFrame = {
+    transformSchema(dataset.schema, logging = true)
+    val transformUDF = udf(hashFunction, new VectorUDT)
+    dataset.withColumn($(outputCol), transformUDF(dataset($(inputCol))))
+  }
+
+  @Since("2.1.0")
+  override def transformSchema(schema: StructType): StructType = {
+    validateAndTransformSchema(schema)
+  }
+
+  /**
+   * Given a large dataset and an item, approximately find at most k items which have the closest
+   * distance to the item. If the [[outputCol]] is missing, the method will transform the data; if
+   * the [[outputCol]] exists, it 

[GitHub] spark issue #15148: [SPARK-5992][ML] Locality Sensitive Hashing

2016-10-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15148
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66717/
Test PASSed.





[GitHub] spark issue #15148: [SPARK-5992][ML] Locality Sensitive Hashing

2016-10-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15148
  
Merged build finished. Test PASSed.





[GitHub] spark issue #15148: [SPARK-5992][ML] Locality Sensitive Hashing

2016-10-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15148
  
**[Test build #66717 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66717/consoleFull)** for PR 15148 at commit [`2c95e5c`](https://github.com/apache/spark/commit/2c95e5c1d89e2db0350b5d8667e2ae8d293df7a9).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
   * `class MinHash(override val uid: String) extends LSH[MinHashModel] with HasSeed `





[GitHub] spark issue #13675: [SPARK-15957] [ML] RFormula supports forcing to index la...

2016-10-10 Thread felixcheung
Github user felixcheung commented on the issue:

https://github.com/apache/spark/pull/13675
  
Does this affect R code? Could we add some R tests for this?







[GitHub] spark pull request #15072: [SPARK-17123][SQL] Use type-widened encoder for D...

2016-10-10 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/15072#discussion_r82725229
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -53,7 +53,15 @@ import org.apache.spark.util.Utils
 
 private[sql] object Dataset {
   def apply[T: Encoder](sparkSession: SparkSession, logicalPlan: LogicalPlan): Dataset[T] = {
-    new Dataset(sparkSession, logicalPlan, implicitly[Encoder[T]])
+    val encoder = implicitly[Encoder[T]]
+    if (encoder.clsTag.runtimeClass == classOf[Row]) {
+      // We should use the encoder generated from the executed plan rather than the existing
+      // encoder for DataFrame because the types of columns can be varied due to widening types.
+      // See SPARK-17123. This is a bit hacky. Maybe we should find a better way to do this.
+      ofRows(sparkSession, logicalPlan).asInstanceOf[Dataset[T]]
+    } else {
+      new Dataset(sparkSession, logicalPlan, encoder)
+    }
--- End diff --

In the transformation methods of Dataset, we normally call `withTypedPlan` to generate a new Dataset. For the set operator methods, however, we should call a different method and put this special logic in it, so that the scope of this hack is narrowed down to only the set operator methods.
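
A sketch of that suggestion (the name `withSetOperator` and its shape are assumptions about where the patch could go, not the actual API):

```scala
// Inside Dataset: only the set operators (union, intersect, except) would
// go through this helper; withTypedPlan stays unchanged for everything else.
private def withSetOperator[U: Encoder](logicalPlan: LogicalPlan): Dataset[U] = {
  if (implicitly[Encoder[U]].clsTag.runtimeClass == classOf[Row]) {
    // DataFrames: re-derive the encoder from the analyzed plan, since set
    // operators may widen column types (SPARK-17123).
    Dataset.ofRows(sparkSession, logicalPlan).asInstanceOf[Dataset[U]]
  } else {
    Dataset(sparkSession, logicalPlan)
  }
}
```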





[GitHub] spark issue #15424: [SPARK-17338][SQL][follow-up] add global temp view

2016-10-10 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/15424
  
LGTM pending Jenkins.






[GitHub] spark issue #15408: [SPARK-17839][CORE] Use Nio's directbuffer instead of Bu...

2016-10-10 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/15408
  
Yea, pooling can make sense, but we don't do it anywhere right now, so it'd make more sense to defer until we have a plan to do it more broadly.
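
For reference, a bare-bones sketch of the kind of pooling being deferred (purely illustrative; nothing like this exists in the patch):

```scala
import java.nio.ByteBuffer
import java.util.concurrent.ConcurrentLinkedQueue

// Reuse direct buffers across streams instead of allocating one each time:
// allocateDirect is relatively expensive and off-heap memory is reclaimed
// lazily, so a pool can reduce both cost and memory pressure.
object DirectBufferPool {
  private val pool = new ConcurrentLinkedQueue[ByteBuffer]()

  def acquire(size: Int): ByteBuffer = {
    val buf = pool.poll()
    if (buf != null && buf.capacity() >= size) {
      buf.clear()
      buf
    } else {
      ByteBuffer.allocateDirect(size)
    }
  }

  def release(buf: ByteBuffer): Unit = pool.offer(buf)
}
```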






[GitHub] spark issue #15426: [SPARK-17864][SQL] Mark data type APIs as stable (not De...

2016-10-10 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/15426
  
cc @marmbrus 





[GitHub] spark issue #15426: [SPARK-17864][SQL] Mark data type APIs as stable (not De...

2016-10-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15426
  
**[Test build #66721 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66721/consoleFull)** for PR 15426 at commit [`0cf7e72`](https://github.com/apache/spark/commit/0cf7e7211f4b8112c776f1ac6bc06d6d204e6fd8).





[GitHub] spark pull request #15426: [SPARK-17864][SQL] Mark data type APIs as stable ...

2016-10-10 Thread rxin
GitHub user rxin opened a pull request:

https://github.com/apache/spark/pull/15426

[SPARK-17864][SQL] Mark data type APIs as stable (not DeveloperApi)

## What changes were proposed in this pull request?
The data type APIs have not changed since Spark 1.3.0 and are ready for graduation. This patch marks them as stable APIs using the new InterfaceStability annotation.

This patch also looks at the various files in the catalyst module (not the 
"package") and marks the remaining few classes appropriately as well.

## How was this patch tested?
This is an annotation change. No functional changes.
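
For illustration, the kind of change this amounts to on a single class (a hypothetical sketch; `ExampleDataType` is made up, and only the annotation usage follows the PR title):

```scala
import org.apache.spark.annotation.InterfaceStability

// Before: @DeveloperApi signaled the class could change between releases.
// After: the annotation below marks it as a stable public API.
@InterfaceStability.Stable
abstract class ExampleDataType {
  def typeName: String
  def defaultSize: Int
}
```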

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rxin/spark SPARK-17864

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/15426.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #15426


commit 0cf7e7211f4b8112c776f1ac6bc06d6d204e6fd8
Author: Reynold Xin 
Date:   2016-10-11T04:53:35Z

[SPARK-17864][SQL] Mark data type APIs as stable (not DeveloperApi)







[GitHub] spark issue #15424: [SPARK-17338][SQL][follow-up] add global temp view

2016-10-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15424
  
**[Test build #66720 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66720/consoleFull)** for PR 15424 at commit [`0ff26d0`](https://github.com/apache/spark/commit/0ff26d0050b12917f0c801ba61d43d0ae4970f81).





[GitHub] spark pull request #15416: [SPARK-17849] [SQL] Fix NPE problem when using gr...

2016-10-10 Thread davies
Github user davies commented on a diff in the pull request:

https://github.com/apache/spark/pull/15416#discussion_r82723882
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
@@ -298,10 +298,14 @@ class Analyzer(
   case other => Alias(other, other.toString)()
 }
 
-val nonNullBitmask = x.bitmasks.reduce(_ & _)
+// The rightmost bit in the bitmasks corresponds to the last expression in groupByAliases with 0
+// indicating this expression is in the grouping set. The following line of code calculates the
+// bitmask representing the expressions that exist in all the grouping sets (also indicated by 0).
+val nonNullBitmask = x.bitmasks.reduce(_ | _)
--- End diff --

Should we call this `nullBitmask` now? (1 means it's nullable)





[GitHub] spark issue #15408: [SPARK-17839][CORE] Use Nio's directbuffer instead of Bu...

2016-10-10 Thread mridulm
Github user mridulm commented on the issue:

https://github.com/apache/spark/pull/15408
  
Barring my query to @rxin (regarding buffer pooling), I am fine with the change - pretty neat, thanks @sitalkedia!
It would be good if more eyeballs looked at it though, given how fundamental it is.





[GitHub] spark issue #15285: [SPARK-17711] Compress rolled executor log

2016-10-10 Thread loneknightpy
Github user loneknightpy commented on the issue:

https://github.com/apache/spark/pull/15285
  
@tdas Addressed your comments





[GitHub] spark issue #15292: [SPARK-17719][SPARK-17776][SQL] Unify and tie up options...

2016-10-10 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/15292
  
Ah, right. I just updated.





[GitHub] spark issue #15377: [SPARK-17802] Improved caller context logging.

2016-10-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15377
  
**[Test build #66719 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66719/consoleFull)** for PR 15377 at commit [`df28bdd`](https://github.com/apache/spark/commit/df28bdddce5e4789a02cf7ef5dedab8b7c408630).





[GitHub] spark issue #15292: [SPARK-17719][SPARK-17776][SQL] Unify and tie up options...

2016-10-10 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/15292
  
Sorry, I did not explain it in detail. This PR includes a bug fix, which needs a separate bullet in the PR description.

Previously, when attempting to make a database connection, we passed all the Spark-specific JDBC options along as connection properties. After this fix, we exclude them from the connection properties.
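
A minimal sketch of what excluding them could look like (the key names and helper are illustrative assumptions, not the PR's actual code):

```scala
import java.util.Properties

// Options that Spark itself interprets and should not be forwarded to the
// JDBC driver as connection properties.
val sparkSpecificKeys = Set("url", "dbtable", "driver", "numpartitions",
  "partitioncolumn", "lowerbound", "upperbound", "fetchsize")

def asConnectionProperties(options: Map[String, String]): Properties = {
  val props = new Properties()
  options
    .filter { case (k, _) => !sparkSpecificKeys.contains(k.toLowerCase) }
    .foreach { case (k, v) => props.setProperty(k, v) }
  props
}
```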


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...

2016-10-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15421
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66702/
Test PASSed.





[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...

2016-10-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15421
  
Merged build finished. Test PASSed.





[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...

2016-10-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15421
  
**[Test build #66702 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66702/consoleFull)** for PR 15421 at commit [`9e621eb`](https://github.com/apache/spark/commit/9e621ebb1b4d9ac20fa294937ebe87e88730f3c9).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #15414: [SPARK-17848][ML] Move LabelCol datatype cast into Predi...

2016-10-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15414
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66710/
Test PASSed.





[GitHub] spark issue #15414: [SPARK-17848][ML] Move LabelCol datatype cast into Predi...

2016-10-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15414
  
Merged build finished. Test PASSed.





[GitHub] spark issue #15414: [SPARK-17848][ML] Move LabelCol datatype cast into Predi...

2016-10-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15414
  
**[Test build #66710 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66710/consoleFull)** for PR 15414 at commit [`6c61e73`](https://github.com/apache/spark/commit/6c61e73c9b8d401f7ec9d48e9f74df7e134cec5f).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #15285: [SPARK-17711] Compress rolled executor log

2016-10-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15285
  
**[Test build #66718 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66718/consoleFull)** for PR 15285 at commit [`e5676a6`](https://github.com/apache/spark/commit/e5676a6d4e60e7b7446bf525fb7003cb26efc448).





[GitHub] spark issue #15398: [SPARK-17647][SQL] Fix backslash escaping in 'LIKE' patt...

2016-10-10 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/15398
  
Also cc @yhuai, @JoshRosen, and @mengxr. Please check whether the changes here satisfy what you want. Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15425: [SPARK-17816] [Core] [Branch-2.0] Fix ConcurrentModifica...

2016-10-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15425
  
Can one of the admins verify this patch?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15272: [SPARK-17698] [SQL] Join predicates should not contain f...

2016-10-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15272
  
Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing

2016-10-10 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/15148#discussion_r82722577
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/LSH.scala ---
@@ -0,0 +1,343 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import scala.util.Random
+
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.ml.linalg.{Vector, VectorUDT}
+import org.apache.spark.ml.param.{IntParam, ParamMap, ParamValidators}
+import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol}
+import org.apache.spark.ml.util.SchemaUtils
+import org.apache.spark.sql._
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.types._
+
+/**
+ * :: Experimental ::
+ * Params for [[LSH]].
+ */
+@Since("2.1.0")
+private[ml] trait LSHParams extends HasInputCol with HasOutputCol {
+  /**
+   * Param for the dimension of LSH OR-amplification.
+   *
+   * In this implementation, we use LSH OR-amplification to reduce the 
false negative rate. The
+   * higher the dimension is, the lower the false negative rate.
+   * @group param
+   */
+  @Since("2.1.0")
+  final val outputDim: IntParam = new IntParam(this, "outputDim", "output 
dimension, where" +
+"increasing dimensionality lowers the false negative rate, and 
decreasing dimensionality" +
+" improves the running performance", ParamValidators.gt(0))
+
+  /** @group getParam */
+  @Since("2.1.0")
+  final def getOutputDim: Int = $(outputDim)
+
+  /**
+   * Transform the Schema for LSH
+   * @param schema The schema of the input dataset without [[outputCol]]
+   * @return A derived schema with [[outputCol]] added
+   */
+  @Since("2.1.0")
+  protected[this] final def validateAndTransformSchema(schema: 
StructType): StructType = {
+SchemaUtils.appendColumn(schema, $(outputCol), new VectorUDT)
+  }
+}
+
+/**
+ * :: Experimental ::
+ * Model produced by [[LSH]].
+ */
+@Experimental
+@Since("2.1.0")
+private[ml] abstract class LSHModel[T <: LSHModel[T]] extends Model[T] 
with LSHParams {
+  self: T =>
+
+  @Since("2.1.0")
+  override def copy(extra: ParamMap): T = defaultCopy(extra)
+
+  /**
+   * The hash function of LSH, mapping a predefined KeyType to a Vector
+   * @return The mapping of LSH function.
+   */
+  @Since("2.1.0")
+  protected[this] val hashFunction: Vector => Vector
+
+  /**
+   * Calculate the distance between two different keys using the distance 
metric corresponding
+   * to the hashFunction
+   * @param x One input vector in the metric space
+   * @param y One input vector in the metric space
+   * @return The distance between x and y
+   */
+  @Since("2.1.0")
+  protected[ml] def keyDistance(x: Vector, y: Vector): Double
+
+  /**
+   * Calculate the distance between two different hash Vectors.
+   *
+   * @param x One of the hash vectors
+   * @param y Another hash vector
+   * @return The distance between hash vectors x and y
+   */
+  @Since("2.1.0")
+  protected[ml] def hashDistance(x: Vector, y: Vector): Double
+
+  @Since("2.1.0")
+  override def transform(dataset: Dataset[_]): DataFrame = {
+transformSchema(dataset.schema, logging = true)
+val transformUDF = udf(hashFunction, new VectorUDT)
+dataset.withColumn($(outputCol), transformUDF(dataset($(inputCol
+  }
+
+  @Since("2.1.0")
+  override def transformSchema(schema: StructType): StructType = {
+validateAndTransformSchema(schema)
+  }
+
+  /**
+   * Given a large dataset and an item, approximately find at most k items 
which have the closest
+   * distance to the item. If the [[outputCol]] is missing, the method 
will transform the data; if
+   * the [[outputCol]] exists, it 

[GitHub] spark issue #15272: [SPARK-17698] [SQL] Join predicates should not contain f...

2016-10-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15272
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66705/
Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15272: [SPARK-17698] [SQL] Join predicates should not contain f...

2016-10-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15272
  
**[Test build #66705 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66705/consoleFull)**
 for PR 15272 at commit 
[`e9f9378`](https://github.com/apache/spark/commit/e9f93784175dd0906a648ca23e86cf6d026c4ece).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15398: [SPARK-17647][SQL] Fix backslash escaping in 'LIK...

2016-10-10 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/15398#discussion_r82722525
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/StringUtils.scala
 ---
@@ -25,26 +25,25 @@ object StringUtils {
 
  // replace the _ with .{1}, matching exactly 1 occurrence of any character
  // replace the % with .*, matching 0 or more occurrences of any character
-  def escapeLikeRegex(v: String): String = {
-if (!v.isEmpty) {
-  "(?s)" + (' ' +: v.init).zip(v).flatMap {
-case (prev, '\\') => ""
-case ('\\', c) =>
-  c match {
-case '_' => "_"
-case '%' => "%"
-case _ => Pattern.quote("\\" + c)
-  }
-case (prev, c) =>
-  c match {
-case '_' => "."
-case '%' => ".*"
-case _ => Pattern.quote(Character.toString(c))
-  }
-  }.mkString
-} else {
-  v
+  def escapeLikeRegex(str: String): String = {
+val builder = new StringBuilder()
+var escaping = false
+for (next <- str) {
+  if (escaping) {
+builder ++= Pattern.quote(Character.toString(next))
--- End diff --

How about `"\\a"`? Previously it produced `\Q\a\E`; now it seems to become `\Qa\E`.
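
For reference, a self-contained sketch that completes the truncated loop
above with assumed control flow (not the PR's exact code; the "(?s)" prefix
and any error handling are omitted), reproducing the behavior in question:

import java.util.regex.Pattern

def escapeLikeRegexSketch(str: String): String = {
  val builder = new StringBuilder()
  var escaping = false
  for (next <- str) {
    if (escaping) {
      // An escaped character is matched literally, whatever it is.
      builder ++= Pattern.quote(Character.toString(next))
      escaping = false
    } else if (next == '\\') {
      escaping = true
    } else {
      next match {
        case '_' => builder += '.'    // LIKE _ matches exactly one character
        case '%' => builder ++= ".*"  // LIKE % matches any sequence
        case c   => builder ++= Pattern.quote(Character.toString(c))
      }
    }
  }
  builder.toString()
}

// escapeLikeRegexSketch("\\a") yields "\Qa\E", whereas the old
// implementation produced "\Q\a\E" -- exactly the difference noted above.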


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15371: [SPARK-17816] [Core] Fix ConcurrentModificationException...

2016-10-10 Thread seyfe
Github user seyfe commented on the issue:

https://github.com/apache/spark/pull/15371
  
Thanks @zsxwing.

Here is the PR for branch-2.0
https://github.com/apache/spark/pull/15425


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15425: [SPARK-17816] [Core] [Branch-2.0] Fix ConcurrentM...

2016-10-10 Thread seyfe
GitHub user seyfe opened a pull request:

https://github.com/apache/spark/pull/15425

[SPARK-17816] [Core] [Branch-2.0] Fix ConcurrentModificationException issue 
in BlockStatusesAccumulator

## What changes were proposed in this pull request?
Replaced `BlockStatusesAccumulator` with `CollectionAccumulator`, which is 
thread safe, plus a few more cleanups.

## How was this patch tested?
Tested in master branch and cherry-picked.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/seyfe/spark race_cond_jsonprotocal_branch-2.0

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/15425.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #15425


commit 678ee6b1d6308a81a5c2d83a196144f29c80434b
Author: Ergin Seyfe 
Date:   2016-10-11T03:41:31Z

[SPARK-17816][CORE] Fix ConcurrentModificationException issue in 
BlockStatusesAccumulator

Change BlockStatusesAccumulator to return an immutable object when the value 
method is called.

Existing tests, plus I verified this change by running a pipeline which 
consistently reproduces this issue.

This is the stack trace for this exception:
`
java.util.ConcurrentModificationException
at 
java.util.ArrayList$Itr.checkForComodification(ArrayList.java:901)
at java.util.ArrayList$Itr.next(ArrayList.java:851)
at 
scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:43)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at 
scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
at 
scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:183)
at 
scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:45)
at 
scala.collection.TraversableLike$class.to(TraversableLike.scala:590)
at scala.collection.AbstractTraversable.to(Traversable.scala:104)
at 
scala.collection.TraversableOnce$class.toList(TraversableOnce.scala:294)
at 
scala.collection.AbstractTraversable.toList(Traversable.scala:104)
at 
org.apache.spark.util.JsonProtocol$.accumValueToJson(JsonProtocol.scala:314)
at 
org.apache.spark.util.JsonProtocol$$anonfun$accumulableInfoToJson$5.apply(JsonProtocol.scala:291)
at 
org.apache.spark.util.JsonProtocol$$anonfun$accumulableInfoToJson$5.apply(JsonProtocol.scala:291)
at scala.Option.map(Option.scala:146)
at 
org.apache.spark.util.JsonProtocol$.accumulableInfoToJson(JsonProtocol.scala:291)
at 
org.apache.spark.util.JsonProtocol$$anonfun$taskInfoToJson$12.apply(JsonProtocol.scala:283)
at 
org.apache.spark.util.JsonProtocol$$anonfun$taskInfoToJson$12.apply(JsonProtocol.scala:283)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.immutable.List.foreach(List.scala:381)
at 
scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
at scala.collection.mutable.ListBuffer.foreach(ListBuffer.scala:45)
at 
scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at 
org.apache.spark.util.JsonProtocol$.taskInfoToJson(JsonProtocol.scala:283)
at 
org.apache.spark.util.JsonProtocol$.taskEndToJson(JsonProtocol.scala:145)
at 
org.apache.spark.util.JsonProtocol$.sparkEventToJson(JsonProtocol.scala:76)
`

Author: Ergin Seyfe 

Closes #15371 from seyfe/race_cond_jsonprotocal.
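
For illustration, the pattern the fix relies on can be sketched as follows
(hypothetical class name; Spark's real CollectionAccumulator lives in
org.apache.spark.util and extends AccumulatorV2):

import java.util.{ArrayList, Collections, List => JList}

// Back the accumulator with a synchronized list and have value return an
// immutable snapshot, so a listener serializing task-end events to JSON
// never iterates a list that tasks are still appending to.
class SnapshotAccumulator[T] {
  private val list: JList[T] = Collections.synchronizedList(new ArrayList[T]())

  def add(v: T): Unit = list.add(v)

  // Copy while holding the list's lock, then wrap the copy read-only.
  def value: JList[T] = list.synchronized {
    Collections.unmodifiableList(new ArrayList[T](list))
  }
}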




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15424: [SPARK-17338][SQL][follow-up] add global temp vie...

2016-10-10 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/15424#discussion_r82722351
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/catalog/Catalog.scala ---
@@ -270,9 +270,10 @@ abstract class Catalog {
* tied to any databases, i.e. we can't use `db1.view1` to reference a 
local temporary view.
*
--- End diff --

Can you add a line saying the return type was Unit in Spark 2.0, but 
changed to Boolean in Spark 2.1?



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15424: [SPARK-17338][SQL][follow-up] add global temp view

2016-10-10 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/15424
  
LGTM other than the two minor comments.

We also need a Python API for this, don't we?



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15416: [SPARK-17849] [SQL] Fix NPE problem when using grouping ...

2016-10-10 Thread yangw1234
Github user yangw1234 commented on the issue:

https://github.com/apache/spark/pull/15416
  
@davies All the other places seem to be correct.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing

2016-10-10 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/15148#discussion_r82722244
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/LSHTest.scala ---
@@ -0,0 +1,135 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import org.apache.spark.ml.linalg.Vector
+import org.apache.spark.sql.Dataset
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.types.DataTypes
+
+private[ml] object LSHTest {
+  /**
+   * For any locality sensitive function h in a metric space, we need to 
verify whether
+   * the following property is satisfied.
+   *
+   * There exist dist1, dist2, p1, p2, so that for any two elements e1 and 
e2,
+   * If dist(e1, e2) <= dist1, then Pr{h(e1) == h(e2)} >= p1
+   * If dist(e1, e2) >= dist2, then Pr{h(e1) == h(e2)} <= p2
+   *
+   * This is called the locality sensitive property. This method checks the 
property on an
+   * existing dataset and calculates the probabilities.
+   * (https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Definition)
+   *
+   * This method hashes each element to hash buckets using LSH, and 
calculates the false positive
+   * and false negative rates:
+   * False positive: Of all (e1, e2) sharing any bucket, the probability of 
dist(e1, e2) > distFP
+   * False negative: Of all (e1, e2) not sharing buckets, the probability of 
dist(e1, e2) < distFN
--- End diff --

Fixed. Yes, these calculation methods are for unit tests only, and will not 
be open to users.
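
As a rough illustration, the two probabilities can be estimated along these
lines (hypothetical helper; the suite's real version operates on Datasets):

// Estimate Pr[h(e1) == h(e2)] separately over close pairs (dist <= dist1)
// and far pairs (dist >= dist2); locality sensitivity wants the first rate
// high (>= p1) and the second low (<= p2).
def collisionRates[T](
    pairs: Seq[(T, T)],
    dist: (T, T) => Double,
    hash: T => Long,
    dist1: Double,
    dist2: Double): (Double, Double) = {
  def rate(ps: Seq[(T, T)]): Double =
    if (ps.isEmpty) 0.0
    else ps.count { case (a, b) => hash(a) == hash(b) }.toDouble / ps.size
  val close = pairs.filter { case (a, b) => dist(a, b) <= dist1 }
  val far = pairs.filter { case (a, b) => dist(a, b) >= dist2 }
  (rate(close), rate(far))
}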


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing

2016-10-10 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/15148#discussion_r82722195
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/RandomProjection.scala ---
@@ -0,0 +1,159 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import scala.util.Random
+
+import breeze.linalg.normalize
+
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.linalg.{BLAS, Vector, Vectors, VectorUDT}
+import org.apache.spark.ml.param.{BooleanParam, DoubleParam, Params, 
ParamValidators}
+import org.apache.spark.ml.util.{Identifiable, SchemaUtils}
+import org.apache.spark.sql.types.StructType
+
+/**
+ * :: Experimental ::
--- End diff --

Removed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing

2016-10-10 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/15148#discussion_r82722184
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinHash.scala ---
@@ -0,0 +1,143 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import scala.util.Random
+
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.linalg.{Vector, Vectors, VectorUDT}
+import org.apache.spark.ml.param.{BooleanParam, Params}
+import org.apache.spark.ml.util.{Identifiable, SchemaUtils}
+import org.apache.spark.sql.types.StructType
+
+/**
+ * :: Experimental ::
--- End diff --

Removed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing

2016-10-10 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/15148#discussion_r82722187
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinHash.scala ---
@@ -0,0 +1,143 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import scala.util.Random
+
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.linalg.{Vector, Vectors, VectorUDT}
+import org.apache.spark.ml.param.{BooleanParam, Params}
+import org.apache.spark.ml.util.{Identifiable, SchemaUtils}
+import org.apache.spark.sql.types.StructType
+
+/**
+ * :: Experimental ::
+ * Params for [[MinHash]].
+ */
+@Since("2.1.0")
+private[ml] trait MinHashParams extends Params {
+
+  /**
+   * If true, set the random seed to 0. Otherwise, use the default setting in 
scala.util.Random
+   * @group param
+   */
+  @Since("2.1.0")
+  val hasSeed: BooleanParam = new BooleanParam(this, "hasSeed",
+"If true, set the random seed to 0.")
+
+  /** @group getParam */
+  @Since("2.1.0")
+  final def getHasSeed: Boolean = $(hasSeed)
+}
+
+/**
+ * :: Experimental ::
+ * Model produced by [[MinHash]]
+ * @param hashFunctions A seq of hash functions, mapping elements to their 
hash values.
+ */
+@Experimental
+@Since("2.1.0")
+class MinHashModel private[ml] (override val uid: String, hashFunctions: 
Seq[Int => Long])
+  extends LSHModel[MinHashModel] {
+
+  @Since("2.1.0")
+  override protected[this] val hashFunction: Vector => Vector = {
+elems: Vector =>
+  require(elems.numNonzeros > 0, "Must have at least 1 non-zero 
entry.")
+  val elemsList = elems.toSparse.indices.toList
+  Vectors.dense(hashFunctions.map(
+func => elemsList.map(func).min.toDouble
+  ).toArray)
+  }
+
+  @Since("2.1.0")
+  override protected[ml] def keyDistance(x: Vector, y: Vector): Double = {
+val xSet = x.toSparse.indices.toSet
+val ySet = y.toSparse.indices.toSet
+val intersectionSize = xSet.intersect(ySet).size.toDouble
+val unionSize = xSet.size + ySet.size - intersectionSize
+assert(unionSize > 0, "The union of two input sets must have at least 
1 element")
+1 - intersectionSize / unionSize
+  }
+
+  @Since("2.1.0")
+  override protected[ml] def hashDistance(x: Vector, y: Vector): Double = {
+// Since it's generated by hashing, it will be a pair of dense vectors.
+x.toDense.values.zip(y.toDense.values).map(x => math.abs(x._1 - 
x._2)).min
+  }
+}
+
+/**
+ * :: Experimental ::
+ * LSH class for Jaccard distance.
+ *
+ * The input can be dense or sparse vectors, but it is more efficient if 
it is sparse. For example,
+ *`Vectors.sparse(10, Array((2, 1.0), (3, 1.0), (5, 1.0)))`
+ * means there are 10 elements in the space. This set contains elem 2, 
elem 3 and elem 5.
+ * Also, any input vector must have at least 1 non-zero index, and all 
non-zero values are treated
+ * as binary "1" values.
+ */
+@Experimental
+@Since("2.1.0")
+class MinHash(override val uid: String) extends LSH[MinHashModel] with 
MinHashParams {
+
+  // A large prime smaller than sqrt(2^63 − 1)
+  private[this] val prime = 2038074743
+
+  @Since("2.1.0")
+  override def setInputCol(value: String): this.type = 
super.setInputCol(value)
+
+  @Since("2.1.0")
+  override def setOutputCol(value: String): this.type = 
super.setOutputCol(value)
+
+  @Since("2.1.0")
+  override def setOutputDim(value: Int): this.type = 
super.setOutputDim(value)
+
+  @Since("2.1.0")
+  def this() = {
+this(Identifiable.randomUID("min hash"))
+  }
+
+  setDefault(outputDim -> 1, outputCol -> "lshFeatures", hasSeed -> false)
--- End diff --

Done.
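
For a concrete feel for the keyDistance above, a tiny worked Jaccard example:

// x encodes {2, 3, 5} and y encodes {3, 5, 7} via their non-zero indices.
// |intersection| = 2 ({3, 5}) and |union| = 4 ({2, 3, 5, 7}),
// so the Jaccard distance is 1 - 2/4 = 0.5.
val xSet = Set(2, 3, 5)
val ySet = Set(3, 5, 7)
val intersectionSize = xSet.intersect(ySet).size.toDouble
val unionSize = xSet.size + ySet.size - intersectionSize
assert(1 - intersectionSize / unionSize == 0.5)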


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing

2016-10-10 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/15148#discussion_r82722189
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinHash.scala ---
@@ -0,0 +1,143 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import scala.util.Random
+
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.linalg.{Vector, Vectors, VectorUDT}
+import org.apache.spark.ml.param.{BooleanParam, Params}
+import org.apache.spark.ml.util.{Identifiable, SchemaUtils}
+import org.apache.spark.sql.types.StructType
+
+/**
+ * :: Experimental ::
+ * Params for [[MinHash]].
+ */
+@Since("2.1.0")
+private[ml] trait MinHashParams extends Params {
+
+  /**
+   * If true, set the random seed to 0. Otherwise, use the default setting in 
scala.util.Random
+   * @group param
+   */
+  @Since("2.1.0")
+  val hasSeed: BooleanParam = new BooleanParam(this, "hasSeed",
+"If true, set the random seed to 0.")
+
+  /** @group getParam */
+  @Since("2.1.0")
+  final def getHasSeed: Boolean = $(hasSeed)
+}
+
+/**
+ * :: Experimental ::
+ * Model produced by [[MinHash]]
+ * @param hashFunctions A seq of hash functions, mapping elements to their 
hash values.
+ */
+@Experimental
+@Since("2.1.0")
+class MinHashModel private[ml] (override val uid: String, hashFunctions: 
Seq[Int => Long])
+  extends LSHModel[MinHashModel] {
+
+  @Since("2.1.0")
+  override protected[this] val hashFunction: Vector => Vector = {
+elems: Vector =>
+  require(elems.numNonzeros > 0, "Must have at least 1 non-zero 
entry.")
+  val elemsList = elems.toSparse.indices.toList
+  Vectors.dense(hashFunctions.map(
+func => elemsList.map(func).min.toDouble
+  ).toArray)
+  }
+
+  @Since("2.1.0")
+  override protected[ml] def keyDistance(x: Vector, y: Vector): Double = {
+val xSet = x.toSparse.indices.toSet
+val ySet = y.toSparse.indices.toSet
+val intersectionSize = xSet.intersect(ySet).size.toDouble
+val unionSize = xSet.size + ySet.size - intersectionSize
+assert(unionSize > 0, "The union of two input sets must have at least 
1 element")
+1 - intersectionSize / unionSize
+  }
+
+  @Since("2.1.0")
+  override protected[ml] def hashDistance(x: Vector, y: Vector): Double = {
+// Since it's generated by hashing, it will be a pair of dense vectors.
+x.toDense.values.zip(y.toDense.values).map(x => math.abs(x._1 - 
x._2)).min
+  }
+}
+
+/**
+ * :: Experimental ::
+ * LSH class for Jaccard distance.
+ *
+ * The input can be dense or sparse vectors, but it is more efficient if 
it is sparse. For example,
+ *`Vectors.sparse(10, Array((2, 1.0), (3, 1.0), (5, 1.0)))`
+ * means there are 10 elements in the space. This set contains elem 2, 
elem 3 and elem 5.
+ * Also, any input vector must have at least 1 non-zero index, and all 
non-zero values are treated
+ * as binary "1" values.
+ */
+@Experimental
+@Since("2.1.0")
+class MinHash(override val uid: String) extends LSH[MinHashModel] with 
MinHashParams {
+
+  // A large prime smaller than sqrt(2^63 − 1)
+  private[this] val prime = 2038074743
+
+  @Since("2.1.0")
+  override def setInputCol(value: String): this.type = 
super.setInputCol(value)
+
+  @Since("2.1.0")
+  override def setOutputCol(value: String): this.type = 
super.setOutputCol(value)
+
+  @Since("2.1.0")
+  override def setOutputDim(value: Int): this.type = 
super.setOutputDim(value)
+
+  @Since("2.1.0")
+  def this() = {
+this(Identifiable.randomUID("min hash"))
+  }
+
+  setDefault(outputDim -> 1, outputCol -> "lshFeatures", hasSeed -> false)
+
+  @Since("2.1.0")
+  def setHasSeed(value: Boolean): this.type = set(hasSeed, value)
+
+  

[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing

2016-10-10 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/15148#discussion_r82722181
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/LSH.scala ---
@@ -0,0 +1,343 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import scala.util.Random
+
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.ml.linalg.{Vector, VectorUDT}
+import org.apache.spark.ml.param.{IntParam, ParamMap, ParamValidators}
+import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol}
+import org.apache.spark.ml.util.SchemaUtils
+import org.apache.spark.sql._
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.types._
+
+/**
+ * :: Experimental ::
+ * Params for [[LSH]].
+ */
+@Since("2.1.0")
+private[ml] trait LSHParams extends HasInputCol with HasOutputCol {
+  /**
+   * Param for the dimension of LSH OR-amplification.
+   *
+   * In this implementation, we use LSH OR-amplification to reduce the 
false negative rate. The
+   * higher the dimension is, the lower the false negative rate.
+   * @group param
+   */
+  @Since("2.1.0")
+  final val outputDim: IntParam = new IntParam(this, "outputDim", "output 
dimension, where" +
+"increasing dimensionality lowers the false negative rate, and 
decreasing dimensionality" +
+" improves the running performance", ParamValidators.gt(0))
+
+  /** @group getParam */
+  @Since("2.1.0")
+  final def getOutputDim: Int = $(outputDim)
+
+  /**
+   * Transform the Schema for LSH
+   * @param schema The schema of the input dataset without [[outputCol]]
+   * @return A derived schema with [[outputCol]] added
+   */
+  @Since("2.1.0")
+  protected[this] final def validateAndTransformSchema(schema: 
StructType): StructType = {
+SchemaUtils.appendColumn(schema, $(outputCol), new VectorUDT)
+  }
+}
+
+/**
+ * :: Experimental ::
+ * Model produced by [[LSH]].
+ */
+@Experimental
+@Since("2.1.0")
+private[ml] abstract class LSHModel[T <: LSHModel[T]] extends Model[T] 
with LSHParams {
+  self: T =>
+
+  @Since("2.1.0")
+  override def copy(extra: ParamMap): T = defaultCopy(extra)
+
+  /**
+   * The hash function of LSH, mapping a predefined KeyType to a Vector
+   * @return The mapping of LSH function.
+   */
+  @Since("2.1.0")
+  protected[this] val hashFunction: Vector => Vector
+
+  /**
+   * Calculate the distance between two different keys using the distance 
metric corresponding
+   * to the hashFunction
+   * @param x One input vector in the metric space
+   * @param y One input vector in the metric space
+   * @return The distance between x and y
+   */
+  @Since("2.1.0")
+  protected[ml] def keyDistance(x: Vector, y: Vector): Double
+
+  /**
+   * Calculate the distance between two different hash Vectors.
+   *
+   * @param x One of the hash vectors
+   * @param y Another hash vector
+   * @return The distance between hash vectors x and y
+   */
+  @Since("2.1.0")
+  protected[ml] def hashDistance(x: Vector, y: Vector): Double
+
+  @Since("2.1.0")
+  override def transform(dataset: Dataset[_]): DataFrame = {
+transformSchema(dataset.schema, logging = true)
+val transformUDF = udf(hashFunction, new VectorUDT)
+dataset.withColumn($(outputCol), transformUDF(dataset($(inputCol
+  }
+
+  @Since("2.1.0")
+  override def transformSchema(schema: StructType): StructType = {
+validateAndTransformSchema(schema)
+  }
+
+  /**
+   * Given a large dataset and an item, approximately find at most k items 
which have the closest
+   * distance to the item. If the [[outputCol]] is missing, the method 
will transform the data; if
+   * the [[outputCol]] exists, it 

[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing

2016-10-10 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/15148#discussion_r82722185
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinHash.scala ---
@@ -0,0 +1,143 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import scala.util.Random
+
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.linalg.{Vector, Vectors, VectorUDT}
+import org.apache.spark.ml.param.{BooleanParam, Params}
+import org.apache.spark.ml.util.{Identifiable, SchemaUtils}
+import org.apache.spark.sql.types.StructType
+
+/**
+ * :: Experimental ::
+ * Params for [[MinHash]].
+ */
+@Since("2.1.0")
+private[ml] trait MinHashParams extends Params {
+
+  /**
+   * If true, set the random seed to 0. Otherwise, use the default setting in 
scala.util.Random
+   * @group param
+   */
+  @Since("2.1.0")
+  val hasSeed: BooleanParam = new BooleanParam(this, "hasSeed",
--- End diff --

Done for both MinHash and RP
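
Incidentally, the reproducibility this param is after boils down to the
following (illustrative values only):

import scala.util.Random

// With a fixed seed, the randomly drawn hash coefficients come out the
// same on every run, which keeps unit tests deterministic.
val rand1 = new Random(0)
val run1 = Seq.fill(3)(rand1.nextInt())
val rand2 = new Random(0)
val run2 = Seq.fill(3)(rand2.nextInt())
assert(run1 == run2)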


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15408: [SPARK-17839][CORE] Use Nio's directbuffer instead of Bu...

2016-10-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15408
  
**[Test build #66714 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66714/consoleFull)**
 for PR 15408 at commit 
[`681ff62`](https://github.com/apache/spark/commit/681ff62409e1f6520057bdeafd991e2c12a0b232).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing

2016-10-10 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/15148#discussion_r82722177
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/LSH.scala ---
@@ -0,0 +1,343 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import scala.util.Random
+
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.ml.linalg.{Vector, VectorUDT}
+import org.apache.spark.ml.param.{IntParam, ParamMap, ParamValidators}
+import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol}
+import org.apache.spark.ml.util.SchemaUtils
+import org.apache.spark.sql._
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.types._
+
+/**
+ * :: Experimental ::
--- End diff --

Removed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15377: [SPARK-17802] Improved caller context logging.

2016-10-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15377
  
**[Test build #66715 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66715/consoleFull)**
 for PR 15377 at commit 
[`7485ffa`](https://github.com/apache/spark/commit/7485ffaa3df508f35df4b878ed715eb1ece0f4db).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15148: [SPARK-5992][ML] Locality Sensitive Hashing

2016-10-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15148
  
**[Test build #66717 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66717/consoleFull)**
 for PR 15148 at commit 
[`2c95e5c`](https://github.com/apache/spark/commit/2c95e5c1d89e2db0350b5d8667e2ae8d293df7a9).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15285: [SPARK-17711] Compress rolled executor log

2016-10-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15285
  
**[Test build #66716 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66716/consoleFull)**
 for PR 15285 at commit 
[`ef4f2b9`](https://github.com/apache/spark/commit/ef4f2b9dc1be33d56d7d4c93bddcfcc2a69a44e9).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15408: [SPARK-17839][CORE] Use Nio's directbuffer instead of Bu...

2016-10-10 Thread sitalkedia
Github user sitalkedia commented on the issue:

https://github.com/apache/spark/pull/15408
  
jenkins retest this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15416: [SPARK-17849] [SQL] Fix NPE problem when using gr...

2016-10-10 Thread yangw1234
Github user yangw1234 commented on a diff in the pull request:

https://github.com/apache/spark/pull/15416#discussion_r82721927
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
 ---
@@ -298,10 +298,14 @@ class Analyzer(
   case other => Alias(other, other.toString)()
 }
 
-val nonNullBitmask = x.bitmasks.reduce(_ & _)
+// The left most bit in the bitmasks corresponds to the last 
expression in groupByAliases
+// with 0 indicating this expression is in the grouping set. The 
following line of code
+// calculates the bit mask representing the expressions that exist 
in all the grouping sets.
+val nonNullBitmask = ~ x.bitmasks.reduce(_ | _)
--- End diff --

Do you mean `((nonNullBitmask >> (attrLength - idx - 1)) & 1) == 1`? If we 
left-shift the `1` instead, we can only test against `0`, right? @davies 
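
For reference, the two formulations of the bit test agree; shifting the mask
right lets you compare against 1, while shifting the 1 left only supports a
comparison against 0 (values below are hypothetical, for illustration):

// Three grouping expressions; nonNullBitmask 0b101 marks idx 0 and idx 2.
val nonNullBitmask = 5 // binary 101
val attrLength = 3

def bitSetShiftRight(idx: Int): Boolean =
  ((nonNullBitmask >> (attrLength - idx - 1)) & 1) == 1

def bitSetShiftLeft(idx: Int): Boolean =
  (nonNullBitmask & (1 << (attrLength - idx - 1))) != 0

// Both report the same bits set.
assert((0 until attrLength).forall(i => bitSetShiftRight(i) == bitSetShiftLeft(i)))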


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15416: [SPARK-17849] [SQL] Fix NPE problem when using grouping ...

2016-10-10 Thread davies
Github user davies commented on the issue:

https://github.com/apache/spark/pull/15416
  
@yangw1234 Thanks for working on this, could you also double check that all 
the places that use bitmasks are correct?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14788: [SPARK-17174][SQL] Add the support for TimestampType for...

2016-10-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14788
  
**[Test build #66713 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66713/consoleFull)**
 for PR 14788 at commit 
[`ef67829`](https://github.com/apache/spark/commit/ef678292d104f2d7a4b637cedc0a388aeb900323).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15423: [SPARK-17860][SQL] SHOW COLUMN's database conflic...

2016-10-10 Thread dilipbiswal
Github user dilipbiswal commented on a diff in the pull request:

https://github.com/apache/spark/pull/15423#discussion_r82721435
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala 
---
@@ -1713,4 +1713,19 @@ class DDLSuite extends QueryTest with 
SharedSQLContext with BeforeAndAfterEach {
   assert(sql("show user functions").count() === 1L)
 }
   }
+
+  test("show columns - negative test") {
+// When case sensitivity is true, the user-supplied database name in 
the table identifier
+// should match the supplied database name in a case-sensitive way.
+withSQLConf(SQLConf.CASE_SENSITIVE.key -> "true") {
+  val tabName = "showcolumn"
+  withTable(tabName) {
+sql(s"CREATE TABLE $tabName(col1 int, col2 string) USING parquet ")
--- End diff --

@viirya OK, I agree. I will make the change.
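
For context, a sketch of the negative case in the style of the surrounding
suite (hypothetical identifiers and database; not the final test code):

withSQLConf(SQLConf.CASE_SENSITIVE.key -> "true") {
  // Assumes a database "showdb" exists and holds the table "showcolumn".
  // Under case sensitivity the database part of the table identifier must
  // match the explicitly supplied database exactly, so "SHOWDB" should fail.
  intercept[AnalysisException] {
    sql("SHOW COLUMNS IN showdb.showcolumn FROM SHOWDB")
  }
}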


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14788: [SPARK-17174][SQL] Add the support for TimestampType for...

2016-10-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14788
  
**[Test build #66712 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66712/consoleFull)**
 for PR 14788 at commit 
[`537fe88`](https://github.com/apache/spark/commit/537fe8858fd78e11c47cb89e847bd355c2494529).
 * This patch **fails Scala style tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `case class DateSub(instant: Expression, days: Expression) extends 
AddDaysBase(instant, days) `
  * `case class TruncInstant(instant: Expression, format: Expression)`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14788: [SPARK-17174][SQL] Add the support for TimestampType for...

2016-10-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14788
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66712/
Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14788: [SPARK-17174][SQL] Add the support for TimestampType for...

2016-10-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14788
  
Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing

2016-10-10 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/15148#discussion_r82721024
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/LSH.scala ---
@@ -0,0 +1,339 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import scala.util.Random
+
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.ml.linalg.{Vector, VectorUDT}
+import org.apache.spark.ml.param.{IntParam, ParamMap, ParamValidators}
+import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol}
+import org.apache.spark.ml.util.SchemaUtils
+import org.apache.spark.sql._
+import org.apache.spark.sql.expressions.UserDefinedFunction
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.types._
+
+/**
+ * Params for [[LSH]].
+ */
+@Experimental
+@Since("2.1.0")
+private[ml] trait LSHParams extends HasInputCol with HasOutputCol {
+  /**
+   * Param for the dimension of LSH OR-amplification.
+   *
+   * In this implementation, we use LSH OR-amplification to reduce the 
false negative rate. The
+   * higher the dimension is, the lower the false negative rate.
+   * @group param
+   */
+  @Since("2.1.0")
+  final val outputDim: IntParam = new IntParam(this, "outputDim", "output 
dimension, where" +
+"increasing dimensionality lowers the false negative rate, and 
decreasing dimensionality" +
+" improves the running performance", ParamValidators.gt(0))
+
+  /** @group getParam */
+  @Since("2.1.0")
+  final def getOutputDim: Int = $(outputDim)
+
+  // TODO: Decide about this default. It should probably depend on the 
particular LSH algorithm.
+  setDefault(outputDim -> 1, outputCol -> "lshFeatures")
+
+  /**
+   * Transform the Schema for LSH
+   * @param schema The schema of the input dataset without [[outputCol]]
+   * @return A derived schema with [[outputCol]] added
+   */
+  @Since("2.1.0")
+  protected[this] final def validateAndTransformSchema(schema: StructType): StructType = {
+    SchemaUtils.appendColumn(schema, $(outputCol), new VectorUDT)
+  }
+}
+
+/**
+ * Model produced by [[LSH]].
+ */
+@Experimental
+@Since("2.1.0")
+private[ml] abstract class LSHModel[T <: LSHModel[T]] extends Model[T] with LSHParams {
+  self: T =>
+
+  @Since("2.1.0")
+  override def copy(extra: ParamMap): T = defaultCopy(extra)
+
+  /**
+   * The hash function of LSH, mapping an input key Vector to a hash Vector.
+   * @return The hash function of this LSH instance.
+   */
+  @Since("2.1.0")
+  protected[this] val hashFunction: Vector => Vector
+
+  /**
+   * Calculate the distance between two different keys using the distance metric corresponding
+   * to the hashFunction.
+   * @param x One of the points in the metric space
+   * @param y Another point in the metric space
+   * @return The distance between x and y
+   */
+  @Since("2.1.0")
+  protected[ml] def keyDistance(x: Vector, y: Vector): Double
+
+  /**
+   * Calculate the distance between two different hash Vectors.
+   *
+   * @param x One of the hash vectors
+   * @param y The other hash vector
+   * @return The distance between hash vectors x and y
+   */
+  @Since("2.1.0")
+  protected[ml] def hashDistance(x: Vector, y: Vector): Double
+
+  @Since("2.1.0")
+  override def transform(dataset: Dataset[_]): DataFrame = {
+    transformSchema(dataset.schema, logging = true)
+    val transformUDF = udf(hashFunction, new VectorUDT)
+    dataset.withColumn($(outputCol), transformUDF(dataset($(inputCol))))
+  }
+
+  @Since("2.1.0")
+  override def transformSchema(schema: StructType): StructType = {
+    validateAndTransformSchema(schema)
+  }
+
+  /**
+   * Given a large dataset and an item, 

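To make the OR-amplification trade-off in the snippet above concrete, here is a
minimal standalone sketch (illustrative only; `orCollision` and the toy hash
values are invented for this example and are not part of the PR):

```scala
// Toy illustration of LSH OR-amplification, independent of the PR's API.
// Two points are candidate neighbors if ANY of their d hash values agree,
// so increasing d (the outputDim param above) lowers the false negative rate.
def orCollision(hx: Seq[Int], hy: Seq[Int]): Boolean =
  hx.zip(hy).exists { case (a, b) => a == b }

val hx = Seq(1, 7, 4) // hashes of x under 3 independent LSH functions
val hy = Seq(2, 7, 9) // hashes of y under the same 3 functions
assert(!orCollision(hx.take(1), hy.take(1))) // outputDim = 1 misses the pair
assert(orCollision(hx, hy))                  // outputDim = 3 finds it
```
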
[GitHub] spark issue #14788: [SPARK-17174][SQL] Add the support for TimestampType for...

2016-10-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14788
  
**[Test build #66712 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66712/consoleFull)** for PR 14788 at commit [`537fe88`](https://github.com/apache/spark/commit/537fe8858fd78e11c47cb89e847bd355c2494529).





[GitHub] spark pull request #15423: [SPARK-17860][SQL] SHOW COLUMN's database conflic...

2016-10-10 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/15423#discussion_r82720911
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/SQLQueryTestSuite.scala ---
@@ -207,6 +208,7 @@ class SQLQueryTestSuite extends QueryTest with SharedSQLContext {
 // Returns true if the plan is supposed to be sorted.
 def isSorted(plan: LogicalPlan): Boolean = plan match {
   case _: Join | _: Aggregate | _: Generate | _: Sample | _: Distinct => false
+  case _: ShowColumnsCommand => true
--- End diff --

Personally I don't think it is odd, because we just want to compare the
results. Adding `ShowColumnsCommand` to the sorted ops looks more odd to me.
cc @cloud-fan





[GitHub] spark pull request #15416: [SPARK-17849] [SQL] Fix NPE problem when using gr...

2016-10-10 Thread davies
Github user davies commented on a diff in the pull request:

https://github.com/apache/spark/pull/15416#discussion_r82720505
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
@@ -298,10 +298,14 @@ class Analyzer(
   case other => Alias(other, other.toString)()
 }
 
-val nonNullBitmask = x.bitmasks.reduce(_ & _)
+// The leftmost bit in the bitmasks corresponds to the last expression in groupByAliases,
+// with 0 indicating that the expression is in the grouping set. The following line of code
+// calculates the bitmask representing the expressions that exist in all the grouping sets.
+val nonNullBitmask = ~ x.bitmasks.reduce(_ | _)
--- End diff --

Could you remove the '~' here, and use `(nonNullBitmask & (1 << (attrLength - idx - 1))) == 1`?
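
For reference, a tiny standalone sketch of the bitmask semantics described in
the diff comment above (illustrative only, not the actual Analyzer code; the
toy masks are invented):

```scala
// Group-by expressions: (a, b). Grouping sets: (a, b) -> 0b00 and (a) -> 0b01.
// A 0 bit means the expression is present in that grouping set.
val bitmasks = Seq(0x0, 0x1)
val attrLength = 2
// OR-reduce and negate: a 1 bit remains exactly where the expression occurs
// in every grouping set, i.e. where the attribute can never be null.
val nonNullBitmask = ~bitmasks.reduce(_ | _)
val nonNullable = (0 until attrLength).map { idx =>
  (nonNullBitmask & (1 << (attrLength - idx - 1))) != 0
}
println(nonNullable) // Vector(true, false): `a` is in every set, `b` is not.
```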





[GitHub] spark pull request #15423: [SPARK-17860][SQL] SHOW COLUMN's database conflic...

2016-10-10 Thread dilipbiswal
Github user dilipbiswal commented on a diff in the pull request:

https://github.com/apache/spark/pull/15423#discussion_r82720521
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/SQLQueryTestSuite.scala ---
@@ -207,6 +208,7 @@ class SQLQueryTestSuite extends QueryTest with SharedSQLContext {
 // Returns true if the plan is supposed to be sorted.
 def isSorted(plan: LogicalPlan): Boolean = plan match {
   case _: Join | _: Aggregate | _: Generate | _: Sample | _: Distinct => false
+  case _: ShowColumnsCommand => true
--- End diff --

@viirya It seemed odd to have the generated output files list the column
names in sorted order, since that doesn't reflect the table definition. In
the test case I created the table like the following:
```SQL
CREATE TABLE showcolumn2 (price int, qty int) PARTITIONED BY (year int, month int)
```
It seemed odd to me to have the generated output file report the columns as
month, price, qty, and year, as opposed to price, qty, year, and month.
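
For concreteness, a sketch of the ordering in question (the session setup is
only for illustration and assumes Hive support is available; it is not part
of the test suite):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("show-columns-ordering")
  .master("local[*]")
  .enableHiveSupport() // the Hive-style partitioned CREATE TABLE needs this
  .getOrCreate()

spark.sql(
  "CREATE TABLE showcolumn2 (price int, qty int) PARTITIONED BY (year int, month int)")
// Definition order is price, qty, year, month; sorting the expected output
// in SQLQueryTestSuite would record month, price, qty, year instead.
spark.sql("SHOW COLUMNS IN showcolumn2").show()
```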





[GitHub] spark pull request #15371: [SPARK-17816] [Core] Fix ConcurrentModificationEx...

2016-10-10 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/15371





[GitHub] spark issue #15371: [SPARK-17816] [Core] Fix ConcurrentModificationException...

2016-10-10 Thread zsxwing
Github user zsxwing commented on the issue:

https://github.com/apache/spark/pull/15371
  
There are some conflicts with 2.0. @seyfe could you submit a PR for 
branch-2.0, please? Thanks!





[GitHub] spark issue #15371: [SPARK-17816] [Core] Fix ConcurrentModificationException...

2016-10-10 Thread zsxwing
Github user zsxwing commented on the issue:

https://github.com/apache/spark/pull/15371
  
LGTM. Thanks! Merging to master and 2.0.





[GitHub] spark issue #15375: [SPARK-17790][SPARKR] Support for parallelizing R data.f...

2016-10-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15375
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66699/
Test FAILed.





[GitHub] spark issue #15375: [SPARK-17790][SPARKR] Support for parallelizing R data.f...

2016-10-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15375
  
Merged build finished. Test FAILed.





[GitHub] spark issue #15375: [SPARK-17790][SPARKR] Support for parallelizing R data.f...

2016-10-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15375
  
**[Test build #66699 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66699/consoleFull)** for PR 15375 at commit [`62ab47b`](https://github.com/apache/spark/commit/62ab47b016aeb42c0721b52c4c37d502db18c535).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #15285: [SPARK-17711] Compress rolled executor log

2016-10-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15285
  
**[Test build #66711 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66711/consoleFull)** for PR 15285 at commit [`ae08495`](https://github.com/apache/spark/commit/ae08495549fe8a2b6750c2b2e4dba8e37779a740).
 * This patch **fails Scala style tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #15285: [SPARK-17711] Compress rolled executor log

2016-10-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15285
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66711/
Test FAILed.





[GitHub] spark issue #15285: [SPARK-17711] Compress rolled executor log

2016-10-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15285
  
Merged build finished. Test FAILed.





[GitHub] spark issue #15285: [SPARK-17711] Compress rolled executor log

2016-10-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15285
  
**[Test build #66711 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66711/consoleFull)** for PR 15285 at commit [`ae08495`](https://github.com/apache/spark/commit/ae08495549fe8a2b6750c2b2e4dba8e37779a740).





[GitHub] spark issue #15414: [SPARK-17848][ML] Move LabelCol datatype cast into Predi...

2016-10-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15414
  
**[Test build #66710 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66710/consoleFull)** for PR 15414 at commit [`6c61e73`](https://github.com/apache/spark/commit/6c61e73c9b8d401f7ec9d48e9f74df7e134cec5f).




