[GitHub] spark issue #14702: [SPARK-15694] Implement ScriptTransformation in sql/core...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/14702 retest this please --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14788: [SPARK-17174][SQL] Add the support for TimestampType for...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14788 **[Test build #66713 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66713/consoleFull)** for PR 14788 at commit [`ef67829`](https://github.com/apache/spark/commit/ef678292d104f2d7a4b637cedc0a388aeb900323).
 * This patch **fails PySpark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark pull request #15072: [SPARK-17123][SQL] Use type-widened encoder for D...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/15072#discussion_r82728292
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -53,7 +53,15 @@ import org.apache.spark.util.Utils
 private[sql] object Dataset {
   def apply[T: Encoder](sparkSession: SparkSession, logicalPlan: LogicalPlan): Dataset[T] = {
-    new Dataset(sparkSession, logicalPlan, implicitly[Encoder[T]])
+    val encoder = implicitly[Encoder[T]]
+    if (encoder.clsTag.runtimeClass == classOf[Row]) {
+      // We should use the encoder generated from the executed plan rather than the existing
+      // encoder for DataFrame because the types of columns can be varied due to widening types.
+      // See SPARK-17123. This is a bit hacky. Maybe we should find a better way to do this.
+      ofRows(sparkSession, logicalPlan).asInstanceOf[Dataset[T]]
+    } else {
+      new Dataset(sparkSession, logicalPlan, encoder)
+    }
--- End diff --
Hm, I tested manually. It seems `except` fails too, while `intersect` seems fine.
[GitHub] spark pull request #13675: [SPARK-15957] [ML] RFormula supports forcing to i...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/13675
[GitHub] spark issue #13675: [SPARK-15957] [ML] RFormula supports forcing to index la...
Github user yanboliang commented on the issue: https://github.com/apache/spark/pull/13675 I'll merge this into master, thanks for the review! @jkbradley @felixcheung
[GitHub] spark issue #13675: [SPARK-15957] [ML] RFormula supports forcing to index la...
Github user yanboliang commented on the issue: https://github.com/apache/spark/pull/13675 @felixcheung This PR does not affect R code. I will send another PR to fix issues like [SPARK-15153](https://issues.apache.org/jira/browse/SPARK-15153), which will need some R tests added.
[GitHub] spark issue #15425: [SPARK-17816] [Core] [Branch-2.0] Fix ConcurrentModifica...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15425 **[Test build #3321 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3321/consoleFull)** for PR 15425 at commit [`678ee6b`](https://github.com/apache/spark/commit/678ee6b1d6308a81a5c2d83a196144f29c80434b).
[GitHub] spark issue #15425: [SPARK-17816] [Core] [Branch-2.0] Fix ConcurrentModifica...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/15425 Jenkins, test this please.
[GitHub] spark issue #15295: [SPARK-17720][SQL] introduce static SQL conf
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15295 Merged build finished. Test FAILed.
[GitHub] spark issue #15295: [SPARK-17720][SQL] introduce static SQL conf
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15295 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66706/ Test FAILed.
[GitHub] spark issue #15295: [SPARK-17720][SQL] introduce static SQL conf
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15295 **[Test build #66706 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66706/consoleFull)** for PR 15295 at commit [`8d93c4a`](https://github.com/apache/spark/commit/8d93c4aed4b32ef145f054571a6c8097d01ee5e8).
 * This patch **fails SparkR unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark issue #15424: [SPARK-17338][SQL][follow-up] add global temp view
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15424 Merged build finished. Test PASSed.
[GitHub] spark issue #15424: [SPARK-17338][SQL][follow-up] add global temp view
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15424 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66707/ Test PASSed.
[GitHub] spark issue #15424: [SPARK-17338][SQL][follow-up] add global temp view
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15424 **[Test build #66707 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66707/consoleFull)** for PR 15424 at commit [`15efca6`](https://github.com/apache/spark/commit/15efca65f3249675f7b137ffb42eb08a875c6269).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark issue #15292: [SPARK-17719][SPARK-17776][SQL] Unify and tie up options...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/15292 @gatorsmile @cloud-fan Thank you both for reviewing this!
[GitHub] spark issue #15388: [SPARK-17821][SQL] Support And and Or in Expression Cano...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15388 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66708/ Test PASSed.
[GitHub] spark issue #15388: [SPARK-17821][SQL] Support And and Or in Expression Cano...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15388 Merged build finished. Test PASSed.
[GitHub] spark pull request #15412: [SPARK-17844] Simplify DataFrame API for defining...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/15412
[GitHub] spark issue #15388: [SPARK-17821][SQL] Support And and Or in Expression Cano...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15388 **[Test build #66708 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66708/consoleFull)** for PR 15388 at commit [`21958d7`](https://github.com/apache/spark/commit/21958d7e7b2cb0de6a5b6353afc933359e490df2).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark issue #15412: [SPARK-17844] Simplify DataFrame API for defining frame ...
Github user hvanhovell commented on the issue: https://github.com/apache/spark/pull/15412 LGTM - merging to master.
[GitHub] spark pull request #15416: [SPARK-17849] [SQL] Fix NPE problem when using gr...
Github user yangw1234 commented on a diff in the pull request: https://github.com/apache/spark/pull/15416#discussion_r82726593
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
@@ -298,10 +298,14 @@ class Analyzer(
           case other => Alias(other, other.toString)()
         }

-        val nonNullBitmask = x.bitmasks.reduce(_ & _)
+        // The rightmost bit in the bitmasks corresponds to the last expression in groupByAliases with 0
+        // indicating this expression is in the grouping set. The following line of code calculates the
+        // bitmask representing the expressions that exist in all the grouping sets (also indicated by 0).
+        val nonNullBitmask = x.bitmasks.reduce(_ | _)
--- End diff --
done @davies
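The comment added in the diff above explains why the reduction is `|` rather than `&`: a 0 bit means "this expression is in the grouping set", so a bit stays 0 after OR-ing all bitmasks only if it is 0 in every one of them. A minimal Java sketch of that reasoning follows; the class and method names (`GroupingBitmaskSketch`, `combine`, `inAllSets`) and the sample expressions are hypothetical illustrations, not Spark's code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the bitmask convention described in the diff comment:
// the rightmost bit corresponds to the last grouping expression, and a 0 bit means
// the expression IS part of that grouping set.
public final class GroupingBitmaskSketch {

    // OR all bitmasks together: a bit is 0 in the result only if it is 0 in every mask.
    public static int combine(int[] bitmasks) {
        int acc = 0;
        for (int mask : bitmasks) {
            acc |= mask;
        }
        return acc;
    }

    // Returns the expressions whose bit is 0 in the combined mask,
    // i.e. those present in every grouping set.
    public static List<String> inAllSets(String[] exprs, int[] bitmasks) {
        int combined = combine(bitmasks);
        List<String> result = new ArrayList<>();
        for (int i = 0; i < exprs.length; i++) {
            int bit = exprs.length - 1 - i; // last expression maps to bit 0
            if (((combined >> bit) & 1) == 0) {
                result.add(exprs[i]);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumed example: GROUP BY a, b, c GROUPING SETS ((a, b), (a, c)).
        // (a, b) leaves out c -> 0b001; (a, c) leaves out b -> 0b010.
        String[] exprs = {"a", "b", "c"};
        int[] bitmasks = {0b001, 0b010};
        System.out.println(inAllSets(exprs, bitmasks));
    }
}
```

With `&` instead of `|`, the two sample masks would combine to 0, which would wrongly mark `b` and `c` as present in every grouping set.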
[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing
Github user Yunni commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r82726587
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/LSH.scala ---
@@ -0,0 +1,343 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import scala.util.Random
+
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.ml.linalg.{Vector, VectorUDT}
+import org.apache.spark.ml.param.{IntParam, ParamMap, ParamValidators}
+import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol}
+import org.apache.spark.ml.util.SchemaUtils
+import org.apache.spark.sql._
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.types._
+
+/**
+ * :: Experimental ::
+ * Params for [[LSH]].
+ */
+@Since("2.1.0")
+private[ml] trait LSHParams extends HasInputCol with HasOutputCol {
+  /**
+   * Param for the dimension of LSH OR-amplification.
+   *
+   * In this implementation, we use LSH OR-amplification to reduce the false negative rate. The
+   * higher the dimension is, the lower the false negative rate.
+   * @group param
+   */
+  @Since("2.1.0")
+  final val outputDim: IntParam = new IntParam(this, "outputDim", "output dimension, where" +
+    "increasing dimensionality lowers the false negative rate, and decreasing dimensionality" +
+    " improves the running performance", ParamValidators.gt(0))
+
+  /** @group getParam */
+  @Since("2.1.0")
+  final def getOutputDim: Int = $(outputDim)
+
+  /**
+   * Transform the Schema for LSH
+   * @param schema The schema of the input dataset without [[outputCol]]
+   * @return A derived schema with [[outputCol]] added
+   */
+  @Since("2.1.0")
+  protected[this] final def validateAndTransformSchema(schema: StructType): StructType = {
+    SchemaUtils.appendColumn(schema, $(outputCol), new VectorUDT)
+  }
+}
+
+/**
+ * :: Experimental ::
+ * Model produced by [[LSH]].
+ */
+@Experimental
+@Since("2.1.0")
+private[ml] abstract class LSHModel[T <: LSHModel[T]] extends Model[T] with LSHParams {
+  self: T =>
+
+  @Since("2.1.0")
+  override def copy(extra: ParamMap): T = defaultCopy(extra)
+
+  /**
+   * The hash function of LSH, mapping a predefined KeyType to a Vector
+   * @return The mapping of LSH function.
+   */
+  @Since("2.1.0")
+  protected[this] val hashFunction: Vector => Vector
+
+  /**
+   * Calculate the distance between two different keys using the distance metric corresponding
+   * to the hashFunction
+   * @param x One input vector in the metric space
+   * @param y One input vector in the metric space
+   * @return The distance between x and y
+   */
+  @Since("2.1.0")
+  protected[ml] def keyDistance(x: Vector, y: Vector): Double
+
+  /**
+   * Calculate the distance between two different hash Vectors.
+   *
+   * @param x One of the hash vector
+   * @param y Another hash vector
+   * @return The distance between hash vectors x and y
+   */
+  @Since("2.1.0")
+  protected[ml] def hashDistance(x: Vector, y: Vector): Double
+
+  @Since("2.1.0")
+  override def transform(dataset: Dataset[_]): DataFrame = {
+    transformSchema(dataset.schema, logging = true)
+    val transformUDF = udf(hashFunction, new VectorUDT)
+    dataset.withColumn($(outputCol), transformUDF(dataset($(inputCol))))
+  }
+
+  @Since("2.1.0")
+  override def transformSchema(schema: StructType): StructType = {
+    validateAndTransformSchema(schema)
+  }
+
+  /**
+   * Given a large dataset and an item, approximately find at most k items which have the closest
+   * distance to the item. If the [[outputCol]] is missing, the method will transform the data; if
+   * the [[outputCol]] exists, it
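The `outputDim` doc comment quoted above says that a higher dimension lowers the false negative rate. A minimal Java sketch of the usual OR-amplification arithmetic is below. It assumes that each of the `outputDim` hash functions collides independently with the same probability `p` for a given near pair, so the pair is missed only if every hash fails to collide, with probability (1 - p)^outputDim. The class and method names are hypothetical, and the independence assumption is a simplification, not something stated in the PR.

```java
// Hypothetical illustration of OR-amplification: a pair counts as a candidate
// if ANY of the outputDim hash values collide.
public final class OrAmplificationSketch {

    // Probability that a near pair is missed, assuming independent hashes that
    // each collide with probability p (an assumption made for this sketch).
    public static double falseNegativeRate(double p, int outputDim) {
        if (p < 0.0 || p > 1.0 || outputDim <= 0) {
            throw new IllegalArgumentException("need 0 <= p <= 1 and outputDim > 0");
        }
        return Math.pow(1.0 - p, outputDim);
    }

    public static void main(String[] args) {
        double p = 0.5; // assumed per-hash collision probability
        for (int d : new int[]{1, 2, 4, 8}) {
            System.out.println("outputDim=" + d + " falseNegativeRate=" + falseNegativeRate(p, d));
        }
    }
}
```

The same formula shows the trade-off named in the parameter description: each extra hash function reduces misses but adds hashing and comparison work.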
[GitHub] spark issue #15292: [SPARK-17719][SPARK-17776][SQL] Unify and tie up options...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/15292 Thanks! Merging to master!
[GitHub] spark pull request #15408: [SPARK-17839][CORE] Use Nio's directbuffer instea...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/15408#discussion_r82726120
--- Diff: core/src/main/java/org/apache/spark/io/NioBasedBufferedFileInputStream.java ---
@@ -0,0 +1,127 @@
+/*
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.io;
+
+import org.apache.spark.storage.StorageUtils;
+
+import java.io.File;
+import java.io.IOException;
+import java.io.InputStream;
+import java.nio.ByteBuffer;
+import java.nio.channels.FileChannel;
+import java.nio.file.StandardOpenOption;
+
+/**
+ * {@link InputStream} implementation which uses direct buffer
+ * to read a file to avoid extra copy of data between Java and
+ * native memory which happens when using {@link java.io.BufferedInputStream}.
+ * Unfortunately, this is not something already available in JDK,
+ * {@link sun.nio.ch.ChannelInputStream} supports reading a file using nio,
+ * but does not support buffering.
+ *
+ */
+public final class NioBasedBufferedFileInputStream extends InputStream {
+
+  private static int DEFAULT_BUFFER_SIZE_BYTES = 8192;
+
+  private final ByteBuffer byteBuffer;
+
+  private final FileChannel fileChannel;
+
+  public NioBasedBufferedFileInputStream(File file, int bufferSizeInBytes) throws IOException {
+    byteBuffer = ByteBuffer.allocateDirect(bufferSizeInBytes);
+    fileChannel = FileChannel.open(file.toPath(), StandardOpenOption.READ);
+    byteBuffer.flip();
+  }
+
+  public NioBasedBufferedFileInputStream(File file) throws IOException {
+    this(file, DEFAULT_BUFFER_SIZE_BYTES);
+  }
+
+  /**
+   * Checks weather data is left to be read from the input stream.
+   * @return true if data is left, false otherwise
+   * @throws IOException
+   */
+  private boolean refill() throws IOException {
+    if (!byteBuffer.hasRemaining()) {
+      byteBuffer.clear();
+      int nRead = fileChannel.read(byteBuffer);
+      if (nRead == -1) {
--- End diff --
Hm, https://docs.oracle.com/javase/7/docs/api/java/nio/channels/FileChannel.html#read(java.nio.ByteBuffer) suggests that 0 doesn't mean EOF, just that 0 bytes were read, but I'm also not sure what to do if the channel won't actually give any bytes at this point. I think that can only happen if the buffer is full, which won't happen here. `<= 0` seems reasonable AFAIK.
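The question raised above is how `refill()` should treat a `FileChannel.read` result of 0, which the documentation describes as "zero bytes read" rather than end-of-stream. One possible handling, sketched below as an assumption rather than as what the PR does, is to retry while the read returns 0 and to treat only a negative result as EOF. `RefillSketch`, `readAll` and `demo` are hypothetical names for this self-contained example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical, simplified reader illustrating one way to handle a 0-byte read.
public final class RefillSketch implements AutoCloseable {
    private final ByteBuffer buffer;
    private final FileChannel channel;

    // bufferSize must be > 0; with an empty buffer read() would keep returning 0.
    public RefillSketch(Path path, int bufferSize) throws IOException {
        buffer = ByteBuffer.allocateDirect(bufferSize);
        channel = FileChannel.open(path, StandardOpenOption.READ);
        buffer.flip(); // start with an empty (fully consumed) buffer
    }

    // Returns true if at least one byte is available, false at end of stream.
    private boolean refill() throws IOException {
        if (!buffer.hasRemaining()) {
            buffer.clear();
            int nRead = 0;
            while (nRead == 0) { // 0 means "nothing read this time", not EOF
                nRead = channel.read(buffer);
            }
            buffer.flip(); // flip even at EOF so the buffer stays empty
            if (nRead < 0) {
                return false;
            }
        }
        return true;
    }

    // Reads one byte as an unsigned value, or -1 at end of stream.
    public int read() throws IOException {
        if (!refill()) {
            return -1;
        }
        return buffer.get() & 0xFF;
    }

    @Override
    public void close() throws IOException {
        channel.close();
    }

    // Reads a whole (ASCII) file through the sketch reader.
    public static String readAll(Path path, int bufferSize) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (RefillSketch in = new RefillSketch(path, bufferSize)) {
            int b;
            while ((b = in.read()) != -1) {
                sb.append((char) b);
            }
        }
        return sb.toString();
    }

    // Round-trips a small temp file with a deliberately tiny buffer.
    public static String demo() {
        try {
            Path tmp = Files.createTempFile("refill-sketch", ".txt");
            try {
                Files.write(tmp, "hello nio".getBytes(StandardCharsets.US_ASCII));
                return readAll(tmp, 4);
            } finally {
                Files.delete(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Compared with treating `<= 0` as end of stream, the retry loop never reports EOF early, at the cost of spinning if a channel keeps returning 0.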
[GitHub] spark pull request #15072: [SPARK-17123][SQL] Use type-widened encoder for D...
Github user hvanhovell commented on a diff in the pull request: https://github.com/apache/spark/pull/15072#discussion_r82726273
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -53,7 +53,15 @@ import org.apache.spark.util.Utils
 private[sql] object Dataset {
   def apply[T: Encoder](sparkSession: SparkSession, logicalPlan: LogicalPlan): Dataset[T] = {
-    new Dataset(sparkSession, logicalPlan, implicitly[Encoder[T]])
+    val encoder = implicitly[Encoder[T]]
+    if (encoder.clsTag.runtimeClass == classOf[Row]) {
+      // We should use the encoder generated from the executed plan rather than the existing
+      // encoder for DataFrame because the types of columns can be varied due to widening types.
+      // See SPARK-17123. This is a bit hacky. Maybe we should find a better way to do this.
+      ofRows(sparkSession, logicalPlan).asInstanceOf[Dataset[T]]
+    } else {
+      new Dataset(sparkSession, logicalPlan, encoder)
+    }
--- End diff --
We only need this for Union, right? In all other cases we only return tuples from the first dataset.
[GitHub] spark pull request #15292: [SPARK-17719][SPARK-17776][SQL] Unify and tie up ...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/15292
[GitHub] spark issue #15412: [SPARK-17844] Simplify DataFrame API for defining frame ...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/15412 cc @hvanhovell ?
[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r82725489 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/LSH.scala --- @@ -0,0 +1,343 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.feature + +import scala.util.Random + +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.{Estimator, Model} +import org.apache.spark.ml.linalg.{Vector, VectorUDT} +import org.apache.spark.ml.param.{IntParam, ParamMap, ParamValidators} +import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol} +import org.apache.spark.ml.util.SchemaUtils +import org.apache.spark.sql._ +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.types._ + +/** + * :: Experimental :: + * Params for [[LSH]]. + */ +@Since("2.1.0") +private[ml] trait LSHParams extends HasInputCol with HasOutputCol { + /** + * Param for the dimension of LSH OR-amplification. + * + * In this implementation, we use LSH OR-amplification to reduce the false negative rate. The + * higher the dimension is, the lower the false negative rate. 
+ * @group param + */ + @Since("2.1.0") + final val outputDim: IntParam = new IntParam(this, "outputDim", "output dimension, where" + +"increasing dimensionality lowers the false negative rate, and decreasing dimensionality" + +" improves the running performance", ParamValidators.gt(0)) + + /** @group getParam */ + @Since("2.1.0") + final def getOutputDim: Int = $(outputDim) + + /** + * Transform the Schema for LSH + * @param schema The schema of the input dataset without [[outputCol]] + * @return A derived schema with [[outputCol]] added + */ + @Since("2.1.0") + protected[this] final def validateAndTransformSchema(schema: StructType): StructType = { +SchemaUtils.appendColumn(schema, $(outputCol), new VectorUDT) + } +} + +/** + * :: Experimental :: + * Model produced by [[LSH]]. + */ +@Experimental +@Since("2.1.0") +private[ml] abstract class LSHModel[T <: LSHModel[T]] extends Model[T] with LSHParams { + self: T => + + @Since("2.1.0") + override def copy(extra: ParamMap): T = defaultCopy(extra) + + /** + * The hash function of LSH, mapping a predefined KeyType to a Vector + * @return The mapping of LSH function. + */ + @Since("2.1.0") + protected[this] val hashFunction: Vector => Vector + + /** + * Calculate the distance between two different keys using the distance metric corresponding + * to the hashFunction + * @param x One input vector in the metric space + * @param y One input vector in the metric space + * @return The distance between x and y + */ + @Since("2.1.0") + protected[ml] def keyDistance(x: Vector, y: Vector): Double + + /** + * Calculate the distance between two different hash Vectors. 
+ * + * @param x One of the hash vector + * @param y Another hash vector + * @return The distance between hash vectors x and y + */ + @Since("2.1.0") + protected[ml] def hashDistance(x: Vector, y: Vector): Double + + @Since("2.1.0") + override def transform(dataset: Dataset[_]): DataFrame = { +transformSchema(dataset.schema, logging = true) +val transformUDF = udf(hashFunction, new VectorUDT) +dataset.withColumn($(outputCol), transformUDF(dataset($(inputCol)))) + } + + @Since("2.1.0") + override def transformSchema(schema: StructType): StructType = { +validateAndTransformSchema(schema) + } + + /** + * Given a large dataset and an item, approximately find at most k items which have the closest + * distance to the item. If the [[outputCol]] is missing, the method will transform the data; if + * the [[outputCol]] exists, it
[GitHub] spark issue #15148: [SPARK-5992][ML] Locality Sensitive Hashing
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15148 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66717/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15148: [SPARK-5992][ML] Locality Sensitive Hashing
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15148 Merged build finished. Test PASSed.
[GitHub] spark issue #15148: [SPARK-5992][ML] Locality Sensitive Hashing
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15148 **[Test build #66717 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66717/consoleFull)** for PR 15148 at commit [`2c95e5c`](https://github.com/apache/spark/commit/2c95e5c1d89e2db0350b5d8667e2ae8d293df7a9). * This patch passes all tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class MinHash(override val uid: String) extends LSH[MinHashModel] with HasSeed`
[GitHub] spark issue #13675: [SPARK-15957] [ML] RFormula supports forcing to index la...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/13675 Does this affect R code - could we add some R tests for this?
[GitHub] spark pull request #15072: [SPARK-17123][SQL] Use type-widened encoder for D...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/15072#discussion_r82725229
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -53,7 +53,15 @@ import org.apache.spark.util.Utils
 private[sql] object Dataset {
   def apply[T: Encoder](sparkSession: SparkSession, logicalPlan: LogicalPlan): Dataset[T] = {
-    new Dataset(sparkSession, logicalPlan, implicitly[Encoder[T]])
+    val encoder = implicitly[Encoder[T]]
+    if (encoder.clsTag.runtimeClass == classOf[Row]) {
+      // We should use the encoder generated from the executed plan rather than the existing
+      // encoder for DataFrame because the types of columns can be varied due to widening types.
+      // See SPARK-17123. This is a bit hacky. Maybe we should find a better way to do this.
+      ofRows(sparkSession, logicalPlan).asInstanceOf[Dataset[T]]
+    } else {
+      new Dataset(sparkSession, logicalPlan, encoder)
+    }
--- End diff --
In transformation methods of Dataset, normally we will call `withTypedPlan` to generate a new Dataset. However, for set operator methods, we should call a different method and put this special logic in it, so that the scope of this hack is narrowed down to only set operator methods.
[GitHub] spark issue #15424: [SPARK-17338][SQL][follow-up] add global temp view
Github user rxin commented on the issue: https://github.com/apache/spark/pull/15424 LGTM pending Jenkins.
[GitHub] spark issue #15408: [SPARK-17839][CORE] Use Nio's directbuffer instead of Bu...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/15408 Yea pooling can make sense, but we don't do it anywhere right now so it'd make more sense to defer until we have a plan to do it more broadly.
[GitHub] spark issue #15426: [SPARK-17864][SQL] Mark data type APIs as stable (not De...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/15426 cc @marmbrus
[GitHub] spark issue #15426: [SPARK-17864][SQL] Mark data type APIs as stable (not De...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15426 **[Test build #66721 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66721/consoleFull)** for PR 15426 at commit [`0cf7e72`](https://github.com/apache/spark/commit/0cf7e7211f4b8112c776f1ac6bc06d6d204e6fd8).
[GitHub] spark pull request #15426: [SPARK-17864][SQL] Mark data type APIs as stable ...
GitHub user rxin opened a pull request: https://github.com/apache/spark/pull/15426
[SPARK-17864][SQL] Mark data type APIs as stable (not DeveloperApi)
## What changes were proposed in this pull request?
The data type API has not been changed since Spark 1.3.0, and is ready for graduation. This patch marks them as stable APIs using the new InterfaceStability annotation. This patch also looks at the various files in the catalyst module (not the "package") and marks the remaining few classes appropriately as well.
## How was this patch tested?
This is an annotation change. No functional changes.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/rxin/spark SPARK-17864
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/15426.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:
This closes #15426
commit 0cf7e7211f4b8112c776f1ac6bc06d6d204e6fd8
Author: Reynold Xin
Date: 2016-10-11T04:53:35Z
[SPARK-17864][SQL] Mark data type APIs as stable (not DeveloperApi)
[GitHub] spark issue #15424: [SPARK-17338][SQL][follow-up] add global temp view
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15424 **[Test build #66720 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66720/consoleFull)** for PR 15424 at commit [`0ff26d0`](https://github.com/apache/spark/commit/0ff26d0050b12917f0c801ba61d43d0ae4970f81).
[GitHub] spark pull request #15416: [SPARK-17849] [SQL] Fix NPE problem when using gr...
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/15416#discussion_r82723882
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
@@ -298,10 +298,14 @@ class Analyzer(
       case other => Alias(other, other.toString)()
     }
-    val nonNullBitmask = x.bitmasks.reduce(_ & _)
+    // The rightmost bit in the bitmasks corresponds to the last expression in groupByAliases with 0
+    // indicating this expression is in the grouping set. The following line of code calculates the
+    // bitmask representing the expressions that exist in all the grouping sets (also indicated by 0).
+    val nonNullBitmask = x.bitmasks.reduce(_ | _)
--- End diff --
Should we call this `nullBitmask` now? (1 means it's nullable)
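For readers following the bitmask discussion, here is a minimal standalone sketch of the convention described in the diff comment (0 means the expression is in the grouping set, rightmost bit is the last expression). The object name, the helper `inAllGroupingSets`, and the three example grouping sets are hypothetical and only illustrate why OR, rather than AND, keeps a 0 bit solely for expressions present in every grouping set.

```scala
object GroupingBitmaskSketch {
  // Hypothetical example: group-by expressions (a, b, c); the rightmost bit maps to c.
  // A 0 bit means the expression IS part of that grouping set.
  val bitmasks: Seq[Int] = Seq(
    Integer.parseInt("000", 2), // grouping set (a, b, c)
    Integer.parseInt("001", 2), // grouping set (a, b)
    Integer.parseInt("011", 2)  // grouping set (a)
  )

  // OR: a bit stays 0 only if it is 0 in every grouping set.
  val orMask: Int = bitmasks.reduce(_ | _)

  // AND: a bit becomes 0 as soon as it is 0 in any single grouping set.
  val andMask: Int = bitmasks.reduce(_ & _)

  // index is the position in the group-by list (0 = a, numExprs - 1 = last expression).
  def inAllGroupingSets(mask: Int, numExprs: Int, index: Int): Boolean =
    ((mask >> (numExprs - 1 - index)) & 1) == 0
}
```

With these inputs the OR mask marks only `a` as present in all grouping sets, whereas the AND mask would mark all three expressions that way, so `b` and `c` would be treated as never null even though some grouping sets omit them. Under that reading, a 1 bit in the OR mask means "may be null", which seems to be the point of the renaming question.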
[GitHub] spark issue #15408: [SPARK-17839][CORE] Use Nio's directbuffer instead of Bu...
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/15408 Barring query to @rxin (regarding buffer pooling), I am fine with the change - pretty neat, thanks @sitalkedia ! Would be good if more eyeballs look at it though given how fundamental it is.
[GitHub] spark issue #15285: [SPARK-17711] Compress rolled executor log
Github user loneknightpy commented on the issue: https://github.com/apache/spark/pull/15285 @tdas Addressed your comments
[GitHub] spark issue #15292: [SPARK-17719][SPARK-17776][SQL] Unify and tie up options...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/15292 Ah, right. I just updated.
[GitHub] spark issue #15377: [SPARK-17802] Improved caller context logging.
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15377 **[Test build #66719 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66719/consoleFull)** for PR 15377 at commit [`df28bdd`](https://github.com/apache/spark/commit/df28bdddce5e4789a02cf7ef5dedab8b7c408630).
[GitHub] spark issue #15292: [SPARK-17719][SPARK-17776][SQL] Unify and tie up options...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/15292 Sorry, I did not explain it in details. In this PR, we had a bug fix. We need a separate bullet in the PR description. Previously, when attempting to make a database connection, we pass all the Spark-specific JDBC options as connection properties. After this fix, we exclude them from the connection properties.
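The behaviour change described above can be sketched roughly as follows. This is not the code from the PR: the object name, the helper `asConnectionProperties`, and the particular set of option names are assumptions chosen for illustration, and the lookup here is case-sensitive for simplicity.

```scala
import java.util.Properties

object JdbcConnectionPropsSketch {
  // Illustrative subset of Spark-side JDBC option names (an assumption; the PR defines the real list).
  val sparkOnlyOptions: Set[String] =
    Set("url", "dbtable", "partitionColumn", "lowerBound", "upperBound", "numPartitions")

  // Copy only the options that are not Spark-specific into the driver's connection properties.
  def asConnectionProperties(options: Map[String, String]): Properties = {
    val props = new Properties()
    options
      .filter { case (key, _) => !sparkOnlyOptions.contains(key) }
      .foreach { case (key, value) => props.setProperty(key, value) }
    props
  }
}
```

For example, passing `Map("url" -> "jdbc:h2:mem:test", "dbtable" -> "t", "user" -> "sa")` would leave only `user` in the resulting `Properties`, so the JDBC driver never sees options it does not understand.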
[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15421 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66702/ Test PASSed.
[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15421 Merged build finished. Test PASSed.
[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15421 **[Test build #66702 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66702/consoleFull)** for PR 15421 at commit [`9e621eb`](https://github.com/apache/spark/commit/9e621ebb1b4d9ac20fa294937ebe87e88730f3c9). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #15414: [SPARK-17848][ML] Move LabelCol datatype cast into Predi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15414 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66710/ Test PASSed.
[GitHub] spark issue #15414: [SPARK-17848][ML] Move LabelCol datatype cast into Predi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15414 Merged build finished. Test PASSed.
[GitHub] spark issue #15414: [SPARK-17848][ML] Move LabelCol datatype cast into Predi...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15414 **[Test build #66710 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66710/consoleFull)** for PR 15414 at commit [`6c61e73`](https://github.com/apache/spark/commit/6c61e73c9b8d401f7ec9d48e9f74df7e134cec5f). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #15285: [SPARK-17711] Compress rolled executor log
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15285 **[Test build #66718 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66718/consoleFull)** for PR 15285 at commit [`e5676a6`](https://github.com/apache/spark/commit/e5676a6d4e60e7b7446bf525fb7003cb26efc448).
[GitHub] spark issue #15398: [SPARK-17647][SQL] Fix backslash escaping in 'LIKE' patt...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/15398 Also cc @yhuai and @JoshRosen @mengxr Please check whether the changes here can satisfy what you want. Thanks!
[GitHub] spark issue #15425: [SPARK-17816] [Core] [Branch-2.0] Fix ConcurrentModifica...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15425 Can one of the admins verify this patch?
[GitHub] spark issue #15272: [SPARK-17698] [SQL] Join predicates should not contain f...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15272 Merged build finished. Test FAILed.
[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing
Github user Yunni commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r82722577 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/LSH.scala --- @@ -0,0 +1,343 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.feature + +import scala.util.Random + +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.{Estimator, Model} +import org.apache.spark.ml.linalg.{Vector, VectorUDT} +import org.apache.spark.ml.param.{IntParam, ParamMap, ParamValidators} +import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol} +import org.apache.spark.ml.util.SchemaUtils +import org.apache.spark.sql._ +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.types._ + +/** + * :: Experimental :: + * Params for [[LSH]]. + */ +@Since("2.1.0") +private[ml] trait LSHParams extends HasInputCol with HasOutputCol { + /** + * Param for the dimension of LSH OR-amplification. + * + * In this implementation, we use LSH OR-amplification to reduce the false negative rate. The + * higher the dimension is, the lower the false negative rate. 
+ * @group param + */ + @Since("2.1.0") + final val outputDim: IntParam = new IntParam(this, "outputDim", "output dimension, where" + +"increasing dimensionality lowers the false negative rate, and decreasing dimensionality" + +" improves the running performance", ParamValidators.gt(0)) + + /** @group getParam */ + @Since("2.1.0") + final def getOutputDim: Int = $(outputDim) + + /** + * Transform the Schema for LSH + * @param schema The schema of the input dataset without [[outputCol]] + * @return A derived schema with [[outputCol]] added + */ + @Since("2.1.0") + protected[this] final def validateAndTransformSchema(schema: StructType): StructType = { +SchemaUtils.appendColumn(schema, $(outputCol), new VectorUDT) + } +} + +/** + * :: Experimental :: + * Model produced by [[LSH]]. + */ +@Experimental +@Since("2.1.0") +private[ml] abstract class LSHModel[T <: LSHModel[T]] extends Model[T] with LSHParams { + self: T => + + @Since("2.1.0") + override def copy(extra: ParamMap): T = defaultCopy(extra) + + /** + * The hash function of LSH, mapping a predefined KeyType to a Vector + * @return The mapping of LSH function. + */ + @Since("2.1.0") + protected[this] val hashFunction: Vector => Vector + + /** + * Calculate the distance between two different keys using the distance metric corresponding + * to the hashFunction + * @param x One input vector in the metric space + * @param y One input vector in the metric space + * @return The distance between x and y + */ + @Since("2.1.0") + protected[ml] def keyDistance(x: Vector, y: Vector): Double + + /** + * Calculate the distance between two different hash Vectors. 
+ * + * @param x One of the hash vector + * @param y Another hash vector + * @return The distance between hash vectors x and y + */ + @Since("2.1.0") + protected[ml] def hashDistance(x: Vector, y: Vector): Double + + @Since("2.1.0") + override def transform(dataset: Dataset[_]): DataFrame = { +transformSchema(dataset.schema, logging = true) +val transformUDF = udf(hashFunction, new VectorUDT) +dataset.withColumn($(outputCol), transformUDF(dataset($(inputCol)))) + } + + @Since("2.1.0") + override def transformSchema(schema: StructType): StructType = { +validateAndTransformSchema(schema) + } + + /** + * Given a large dataset and an item, approximately find at most k items which have the closest + * distance to the item. If the [[outputCol]] is missing, the method will transform the data; if + * the [[outputCol]] exists, it
[GitHub] spark issue #15272: [SPARK-17698] [SQL] Join predicates should not contain f...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15272 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66705/ Test FAILed.
[GitHub] spark issue #15272: [SPARK-17698] [SQL] Join predicates should not contain f...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15272 **[Test build #66705 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66705/consoleFull)** for PR 15272 at commit [`e9f9378`](https://github.com/apache/spark/commit/e9f93784175dd0906a648ca23e86cf6d026c4ece). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #15398: [SPARK-17647][SQL] Fix backslash escaping in 'LIK...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/15398#discussion_r82722525
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/StringUtils.scala ---
@@ -25,26 +25,25 @@ object StringUtils {
   // replace the _ with .{1} exactly match 1 time of any character
   // replace the % with .*, match 0 or more times with any character
-  def escapeLikeRegex(v: String): String = {
-    if (!v.isEmpty) {
-      "(?s)" + (' ' +: v.init).zip(v).flatMap {
-        case (prev, '\\') => ""
-        case ('\\', c) =>
-          c match {
-            case '_' => "_"
-            case '%' => "%"
-            case _ => Pattern.quote("\\" + c)
-          }
-        case (prev, c) =>
-          c match {
-            case '_' => "."
-            case '%' => ".*"
-            case _ => Pattern.quote(Character.toString(c))
-          }
-      }.mkString
-    } else {
-      v
+  def escapeLikeRegex(str: String): String = {
+    val builder = new StringBuilder()
+    var escaping = false
+    for (next <- str) {
+      if (escaping) {
+        builder ++= Pattern.quote(Character.toString(next))
--- End diff --
How about `"\\a"`? Previously it is `\Q\a\E`, now it seems becoming `\Qa\E`.
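To make the question concrete, below is a standalone sketch of an escape-flag loop in the style of the new code. The diff above is cut off after the first branch, so everything past that branch is a hypothetical completion: the `(?s)` prefix is carried over from the old implementation, and the handling of a trailing backslash is an assumption, not taken from the PR.

```scala
import java.util.regex.Pattern

object LikeEscapeSketch {
  // Hypothetical completion of the loop shown in the diff; not the PR's actual code.
  def escapeLikeRegex(str: String): String = {
    val builder = new StringBuilder("(?s)") // prefix assumed, as in the old implementation
    var escaping = false
    for (next <- str) {
      if (escaping) {
        // The character after a backslash is quoted literally; the backslash itself is dropped.
        builder ++= Pattern.quote(Character.toString(next))
        escaping = false
      } else if (next == '\\') {
        escaping = true
      } else {
        builder ++= (next match {
          case '_' => "."   // exactly one character
          case '%' => ".*"  // zero or more characters
          case c   => Pattern.quote(Character.toString(c))
        })
      }
    }
    // Assumption: a dangling backslash at the end is kept as a literal backslash.
    if (escaping) builder ++= Pattern.quote("\\")
    builder.toString
  }
}
```

In this sketch the Scala literal `"\\a"` (a backslash followed by `a`) yields `(?s)\Qa\E`, which matches only `a`; the old code would have produced `\Q\a\E`, which matches the two-character string backslash-`a`. That difference is what the review comment is asking about.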
[GitHub] spark issue #15371: [SPARK-17816] [Core] Fix ConcurrentModificationException...
Github user seyfe commented on the issue: https://github.com/apache/spark/pull/15371 Thanks @zsxwing. Here is the PR for branch-2.0 https://github.com/apache/spark/pull/15425
[GitHub] spark pull request #15425: [SPARK-17816] [Core] [Branch-2.0] Fix ConcurrentM...
GitHub user seyfe opened a pull request: https://github.com/apache/spark/pull/15425 [SPARK-17816] [Core] [Branch-2.0] Fix ConcurrentModificationException issue in BlockStatusesAccumulator ## What changes were proposed in this pull request? Replaced `BlockStatusesAccumulator` with `CollectionAccumulator`, which is thread-safe, plus a few more cleanups. ## How was this patch tested? Tested in master branch and cherry-picked. You can merge this pull request into a Git repository by running: $ git pull https://github.com/seyfe/spark race_cond_jsonprotocal_branch-2.0 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/15425.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #15425 commit 678ee6b1d6308a81a5c2d83a196144f29c80434b Author: Ergin Seyfe Date: 2016-10-11T03:41:31Z [SPARK-17816][CORE] Fix ConcurrentModificationException issue in BlockStatusesAccumulator Change the BlockStatusesAccumulator to return an immutable object when the `value` method is called. Tested with the existing tests; I also verified this change by running a pipeline which consistently reproduces this issue. 
This is the stack trace for this exception: ` java.util.ConcurrentModificationException at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:901) at java.util.ArrayList$Itr.next(ArrayList.java:851) at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:43) at scala.collection.Iterator$class.foreach(Iterator.scala:893) at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) at scala.collection.AbstractIterable.foreach(Iterable.scala:54) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59) at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:183) at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:45) at scala.collection.TraversableLike$class.to(TraversableLike.scala:590) at scala.collection.AbstractTraversable.to(Traversable.scala:104) at scala.collection.TraversableOnce$class.toList(TraversableOnce.scala:294) at scala.collection.AbstractTraversable.toList(Traversable.scala:104) at org.apache.spark.util.JsonProtocol$.accumValueToJson(JsonProtocol.scala:314) at org.apache.spark.util.JsonProtocol$$anonfun$accumulableInfoToJson$5.apply(JsonProtocol.scala:291) at org.apache.spark.util.JsonProtocol$$anonfun$accumulableInfoToJson$5.apply(JsonProtocol.scala:291) at scala.Option.map(Option.scala:146) at org.apache.spark.util.JsonProtocol$.accumulableInfoToJson(JsonProtocol.scala:291) at org.apache.spark.util.JsonProtocol$$anonfun$taskInfoToJson$12.apply(JsonProtocol.scala:283) at org.apache.spark.util.JsonProtocol$$anonfun$taskInfoToJson$12.apply(JsonProtocol.scala:283) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.immutable.List.foreach(List.scala:381) at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35) at 
scala.collection.mutable.ListBuffer.foreach(ListBuffer.scala:45) at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at scala.collection.AbstractTraversable.map(Traversable.scala:104) at org.apache.spark.util.JsonProtocol$.taskInfoToJson(JsonProtocol.scala:283) at org.apache.spark.util.JsonProtocol$.taskEndToJson(JsonProtocol.scala:145) at org.apache.spark.util.JsonProtocol$.sparkEventToJson(JsonProtocol.scala:76) ` Author: Ergin Seyfe Closes #15371 from seyfe/race_cond_jsonprotocal.
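The stack trace above shows a reader iterating a `java.util.ArrayList` while another thread appends to it. A minimal Scala sketch of the general remedy described in the commit message (return an immutable snapshot when the value is read) might look like the following. The class name `ListAccumulatorSketch` is hypothetical; this illustrates the technique under that assumption and is not the Spark accumulator code itself.

```scala
import java.util.{ArrayList, Collections, List => JList}

// Hypothetical accumulator used only to illustrate the snapshot-on-read idea.
class ListAccumulatorSketch[T] {
  // Individual calls on a synchronized list are thread-safe, but iteration is
  // not unless the caller holds the list's own lock.
  private val buffer: JList[T] = Collections.synchronizedList(new ArrayList[T]())

  def add(v: T): Unit = buffer.add(v)

  // Copy while holding the list's lock, then hand out a read-only view of the
  // copy, so readers never iterate the live buffer.
  def value: JList[T] = buffer.synchronized {
    Collections.unmodifiableList(new ArrayList[T](buffer))
  }
}
```

A caller that serializes `value` (for example to JSON) then walks a private copy, so later `add` calls from other threads cannot trigger `ConcurrentModificationException` in that iteration.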
[GitHub] spark pull request #15424: [SPARK-17338][SQL][follow-up] add global temp vie...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/15424#discussion_r82722351 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/catalog/Catalog.scala --- @@ -270,9 +270,10 @@ abstract class Catalog { * tied to any databases, i.e. we can't use `db1.view1` to reference a local temporary view. * --- End diff -- can you add a line saying the return type was unit in Spark 2.0, but changed to boolean in Spark 2.1?
[GitHub] spark issue #15424: [SPARK-17338][SQL][follow-up] add global temp view
Github user rxin commented on the issue: https://github.com/apache/spark/pull/15424 LGTM other than the two minor comments. We also need a Python API for this, don't we?
[GitHub] spark issue #15416: [SPARK-17849] [SQL] Fix NPE problem when using grouping ...
Github user yangw1234 commented on the issue: https://github.com/apache/spark/pull/15416 @davies Other places all seem to be correct.
[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing
Github user Yunni commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r82722244 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/LSHTest.scala --- @@ -0,0 +1,135 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.feature + +import org.apache.spark.ml.linalg.Vector +import org.apache.spark.sql.Dataset +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.types.DataTypes + +private[ml] object LSHTest { + /** + * For any locality sensitive function h in a metric space, we meed to verify whether + * the following property is satisfied. + * + * There exist dist1, dist2, p1, p2, so that for any two elements e1 and e2, + * If dist(e1, e2) <= dist1, then Pr{h(x) == h(y)} >= p1 + * If dist(e1, e2) >= dist2, then Pr{h(x) == h(y)} <= p2 + * + * This is called locality sensitive property. This method checks the property on an + * existing dataset and calculate the probabilities. 
+ * (https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Definition) + * + * This method hashes each elements to hash buckets using LSH, and calculate the false positive + * and false negative: + * False positive: Of all (e1, e2) sharing any bucket, the probability of dist(e1, e2) > distFP + * False positive: Of all (e1, e2) not sharing buckets, the probability of dist(e1, e2) < distFN --- End diff -- Fixed. Yes, these calculation methods are for unit tests only, and will not be open to users.
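The two rates described in the quoted Scaladoc can be sketched in plain Scala as follows. This is an assumption-laden illustration, not the `LSHTest` code from the pull request: the object name `LshErrorRatesSketch` and the input shape (one `(distance, sharesBucket)` tuple per pair of elements) are hypothetical, and the second rate is treated as the false negative rate, which is what the surrounding text suggests was intended.

```scala
// Hypothetical helper: each pair is (distance between e1 and e2, whether they share a bucket).
object LshErrorRatesSketch {
  // Of all pairs sharing a bucket, the fraction whose distance exceeds distFP.
  def falsePositiveRate(pairs: Seq[(Double, Boolean)], distFP: Double): Double = {
    val sharing = pairs.filter(_._2)
    if (sharing.isEmpty) 0.0 else sharing.count(_._1 > distFP).toDouble / sharing.size
  }

  // Of all pairs not sharing a bucket, the fraction whose distance is below distFN.
  def falseNegativeRate(pairs: Seq[(Double, Boolean)], distFN: Double): Double = {
    val notSharing = pairs.filterNot(_._2)
    if (notSharing.isEmpty) 0.0 else notSharing.count(_._1 < distFN).toDouble / notSharing.size
  }
}
```

Returning `0.0` for an empty group is a design choice made here to avoid dividing by zero; a test helper could just as reasonably fail fast instead.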
[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing
Github user Yunni commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r82722195 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RandomProjection.scala --- @@ -0,0 +1,159 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.feature + +import scala.util.Random + +import breeze.linalg.normalize + +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.linalg.{BLAS, Vector, Vectors, VectorUDT} +import org.apache.spark.ml.param.{BooleanParam, DoubleParam, Params, ParamValidators} +import org.apache.spark.ml.util.{Identifiable, SchemaUtils} +import org.apache.spark.sql.types.StructType + +/** + * :: Experimental :: --- End diff -- Removed.
[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing
Github user Yunni commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r82722184 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinHash.scala --- @@ -0,0 +1,143 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.feature + +import scala.util.Random + +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.linalg.{Vector, Vectors, VectorUDT} +import org.apache.spark.ml.param.{BooleanParam, Params} +import org.apache.spark.ml.util.{Identifiable, SchemaUtils} +import org.apache.spark.sql.types.StructType + +/** + * :: Experimental :: --- End diff -- Removed.
[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing
Github user Yunni commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r82722187 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinHash.scala --- @@ -0,0 +1,143 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.feature + +import scala.util.Random + +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.linalg.{Vector, Vectors, VectorUDT} +import org.apache.spark.ml.param.{BooleanParam, Params} +import org.apache.spark.ml.util.{Identifiable, SchemaUtils} +import org.apache.spark.sql.types.StructType + +/** + * :: Experimental :: + * Params for [[MinHash]]. + */ +@Since("2.1.0") +private[ml] trait MinHashParams extends Params { + + /** + * If true, set the random seed to 0. 
Otherwise, use default setting in scala.util.Random + * @group param + */ + @Since("2.1.0") + val hasSeed: BooleanParam = new BooleanParam(this, "hasSeed", +"If true, set the random seed to 0.") + + /** @group getParam */ + @Since("2.1.0") + final def getHasSeed: Boolean = $(hasSeed) +} + +/** + * :: Experimental :: + * Model produced by [[MinHash]] + * @param hashFunctions A seq of hash functions, mapping elements to their hash values. + */ +@Experimental +@Since("2.1.0") +class MinHashModel private[ml] (override val uid: String, hashFunctions: Seq[Int => Long]) + extends LSHModel[MinHashModel] { + + @Since("2.1.0") + override protected[this] val hashFunction: Vector => Vector = { +elems: Vector => + require(elems.numNonzeros > 0, "Must have at least 1 non zero entry.") + val elemsList = elems.toSparse.indices.toList + Vectors.dense(hashFunctions.map( +func => elemsList.map(func).min.toDouble + ).toArray) + } + + @Since("2.1.0") + override protected[ml] def keyDistance(x: Vector, y: Vector): Double = { +val xSet = x.toSparse.indices.toSet +val ySet = y.toSparse.indices.toSet +val intersectionSize = xSet.intersect(ySet).size.toDouble +val unionSize = xSet.size + ySet.size - intersectionSize +assert(unionSize > 0, "The union of two input sets must have at least 1 elements") +1 - intersectionSize / unionSize + } + + @Since("2.1.0") + override protected[ml] def hashDistance(x: Vector, y: Vector): Double = { +// Since it's generated by hashing, it will be a pair of dense vectors. +x.toDense.values.zip(y.toDense.values).map(x => math.abs(x._1 - x._2)).min + } +} + +/** + * :: Experimental :: + * LSH class for Jaccard distance. + * + * The input can be dense or sparse vectors, but it is more efficient if it is sparse. For example, + *`Vectors.sparse(10, Array[(2, 1.0), (3, 1.0), (5, 1.0)])` + * means there are 10 elements in the space. This set contains elem 2, elem 3 and elem 5. 
+ * Also, any input vector must have at least 1 non-zero indices, and all non-zero values are treated + * as binary "1" values. + */ +@Experimental +@Since("2.1.0") +class MinHash(override val uid: String) extends LSH[MinHashModel] with MinHashParams { + + // A large prime smaller than sqrt(2^63 − 1) + private[this] val prime = 2038074743 + + @Since("2.1.0") + override def setInputCol(value: String): this.type = super.setInputCol(value) + + @Since("2.1.0") + override def setOutputCol(value: String): this.type = super.setOutputCol(value) + + @Since("2.1.0") + override def setOutputDim(value: Int): this.type = super.setOutputDim(value) + + @Since("2.1.0") + def this() = { +this(Identifiable.randomUID("min hash")) + } + + setDefault(outputDim -> 1, outputCol -> "lshFeatures", hasSeed -> false) --- End diff -- Done.
[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing
Github user Yunni commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r82722189 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinHash.scala --- @@ -0,0 +1,143 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.feature + +import scala.util.Random + +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.linalg.{Vector, Vectors, VectorUDT} +import org.apache.spark.ml.param.{BooleanParam, Params} +import org.apache.spark.ml.util.{Identifiable, SchemaUtils} +import org.apache.spark.sql.types.StructType + +/** + * :: Experimental :: + * Params for [[MinHash]]. + */ +@Since("2.1.0") +private[ml] trait MinHashParams extends Params { + + /** + * If true, set the random seed to 0. 
Otherwise, use default setting in scala.util.Random + * @group param + */ + @Since("2.1.0") + val hasSeed: BooleanParam = new BooleanParam(this, "hasSeed", +"If true, set the random seed to 0.") + + /** @group getParam */ + @Since("2.1.0") + final def getHasSeed: Boolean = $(hasSeed) +} + +/** + * :: Experimental :: + * Model produced by [[MinHash]] + * @param hashFunctions A seq of hash functions, mapping elements to their hash values. + */ +@Experimental +@Since("2.1.0") +class MinHashModel private[ml] (override val uid: String, hashFunctions: Seq[Int => Long]) + extends LSHModel[MinHashModel] { + + @Since("2.1.0") + override protected[this] val hashFunction: Vector => Vector = { +elems: Vector => + require(elems.numNonzeros > 0, "Must have at least 1 non zero entry.") + val elemsList = elems.toSparse.indices.toList + Vectors.dense(hashFunctions.map( +func => elemsList.map(func).min.toDouble + ).toArray) + } + + @Since("2.1.0") + override protected[ml] def keyDistance(x: Vector, y: Vector): Double = { +val xSet = x.toSparse.indices.toSet +val ySet = y.toSparse.indices.toSet +val intersectionSize = xSet.intersect(ySet).size.toDouble +val unionSize = xSet.size + ySet.size - intersectionSize +assert(unionSize > 0, "The union of two input sets must have at least 1 elements") +1 - intersectionSize / unionSize + } + + @Since("2.1.0") + override protected[ml] def hashDistance(x: Vector, y: Vector): Double = { +// Since it's generated by hashing, it will be a pair of dense vectors. +x.toDense.values.zip(y.toDense.values).map(x => math.abs(x._1 - x._2)).min + } +} + +/** + * :: Experimental :: + * LSH class for Jaccard distance. + * + * The input can be dense or sparse vectors, but it is more efficient if it is sparse. For example, + *`Vectors.sparse(10, Array[(2, 1.0), (3, 1.0), (5, 1.0)])` + * means there are 10 elements in the space. This set contains elem 2, elem 3 and elem 5. 
+ * Also, any input vector must have at least 1 non-zero indices, and all non-zero values are treated + * as binary "1" values. + */ +@Experimental +@Since("2.1.0") +class MinHash(override val uid: String) extends LSH[MinHashModel] with MinHashParams { + + // A large prime smaller than sqrt(2^63 − 1) + private[this] val prime = 2038074743 + + @Since("2.1.0") + override def setInputCol(value: String): this.type = super.setInputCol(value) + + @Since("2.1.0") + override def setOutputCol(value: String): this.type = super.setOutputCol(value) + + @Since("2.1.0") + override def setOutputDim(value: Int): this.type = super.setOutputDim(value) + + @Since("2.1.0") + def this() = { +this(Identifiable.randomUID("min hash")) + } + + setDefault(outputDim -> 1, outputCol -> "lshFeatures", hasSeed -> false) + + @Since("2.1.0") + def setHasSeed(value: Boolean): this.type = set(hasSeed, value) + +
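The `keyDistance` method in the quoted diff computes a Jaccard distance over the non-zero indices of two vectors. A standalone sketch of the same arithmetic on plain Scala sets, with no Spark dependency, is shown below. The object name `JaccardSketch` is hypothetical, and the sets stand in for the index sets that the quoted code obtains via `toSparse.indices.toSet`.

```scala
// Hypothetical standalone helper mirroring the arithmetic of the quoted keyDistance.
object JaccardSketch {
  def jaccardDistance(x: Set[Int], y: Set[Int]): Double = {
    val intersectionSize = x.intersect(y).size.toDouble
    // |x ∪ y| = |x| + |y| - |x ∩ y|
    val unionSize = x.size + y.size - intersectionSize
    require(unionSize > 0, "The union of the two input sets must have at least 1 element")
    1 - intersectionSize / unionSize
  }
}
```

For example, `Set(2, 3, 5)` and `Set(3, 5, 7)` share two of four distinct elements, so the distance is `1 - 2/4 = 0.5`; identical sets give `0.0` and disjoint sets give `1.0`.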
[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing
Github user Yunni commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r82722181 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/LSH.scala --- @@ -0,0 +1,343 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.feature + +import scala.util.Random + +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.{Estimator, Model} +import org.apache.spark.ml.linalg.{Vector, VectorUDT} +import org.apache.spark.ml.param.{IntParam, ParamMap, ParamValidators} +import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol} +import org.apache.spark.ml.util.SchemaUtils +import org.apache.spark.sql._ +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.types._ + +/** + * :: Experimental :: + * Params for [[LSH]]. + */ +@Since("2.1.0") +private[ml] trait LSHParams extends HasInputCol with HasOutputCol { + /** + * Param for the dimension of LSH OR-amplification. + * + * In this implementation, we use LSH OR-amplification to reduce the false negative rate. The + * higher the dimension is, the lower the false negative rate. 
+ * @group param + */ + @Since("2.1.0") + final val outputDim: IntParam = new IntParam(this, "outputDim", "output dimension, where" + +"increasing dimensionality lowers the false negative rate, and decreasing dimensionality" + +" improves the running performance", ParamValidators.gt(0)) + + /** @group getParam */ + @Since("2.1.0") + final def getOutputDim: Int = $(outputDim) + + /** + * Transform the Schema for LSH + * @param schema The schema of the input dataset without [[outputCol]] + * @return A derived schema with [[outputCol]] added + */ + @Since("2.1.0") + protected[this] final def validateAndTransformSchema(schema: StructType): StructType = { +SchemaUtils.appendColumn(schema, $(outputCol), new VectorUDT) + } +} + +/** + * :: Experimental :: + * Model produced by [[LSH]]. + */ +@Experimental +@Since("2.1.0") +private[ml] abstract class LSHModel[T <: LSHModel[T]] extends Model[T] with LSHParams { + self: T => + + @Since("2.1.0") + override def copy(extra: ParamMap): T = defaultCopy(extra) + + /** + * The hash function of LSH, mapping a predefined KeyType to a Vector + * @return The mapping of LSH function. + */ + @Since("2.1.0") + protected[this] val hashFunction: Vector => Vector + + /** + * Calculate the distance between two different keys using the distance metric corresponding + * to the hashFunction + * @param x One input vector in the metric space + * @param y One input vector in the metric space + * @return The distance between x and y + */ + @Since("2.1.0") + protected[ml] def keyDistance(x: Vector, y: Vector): Double + + /** + * Calculate the distance between two different hash Vectors. 
+ * + * @param x One of the hash vector + * @param y Another hash vector + * @return The distance between hash vectors x and y + */ + @Since("2.1.0") + protected[ml] def hashDistance(x: Vector, y: Vector): Double + + @Since("2.1.0") + override def transform(dataset: Dataset[_]): DataFrame = { +transformSchema(dataset.schema, logging = true) +val transformUDF = udf(hashFunction, new VectorUDT) +dataset.withColumn($(outputCol), transformUDF(dataset($(inputCol + } + + @Since("2.1.0") + override def transformSchema(schema: StructType): StructType = { +validateAndTransformSchema(schema) + } + + /** + * Given a large dataset and an item, approximately find at most k items which have the closest + * distance to the item. If the [[outputCol]] is missing, the method will transform the data; if + * the [[outputCol]] exists, it
[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing
Github user Yunni commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r82722185 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinHash.scala --- @@ -0,0 +1,143 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.feature + +import scala.util.Random + +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.linalg.{Vector, Vectors, VectorUDT} +import org.apache.spark.ml.param.{BooleanParam, Params} +import org.apache.spark.ml.util.{Identifiable, SchemaUtils} +import org.apache.spark.sql.types.StructType + +/** + * :: Experimental :: + * Params for [[MinHash]]. + */ +@Since("2.1.0") +private[ml] trait MinHashParams extends Params { + + /** + * If true, set the random seed to 0. Otherwise, use default setting in scala.util.Random + * @group param + */ + @Since("2.1.0") + val hasSeed: BooleanParam = new BooleanParam(this, "hasSeed", --- End diff -- Done for both MinHash and RP
[GitHub] spark issue #15408: [SPARK-17839][CORE] Use Nio's directbuffer instead of Bu...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15408 **[Test build #66714 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66714/consoleFull)** for PR 15408 at commit [`681ff62`](https://github.com/apache/spark/commit/681ff62409e1f6520057bdeafd991e2c12a0b232).
[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing
Github user Yunni commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r82722177 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/LSH.scala --- @@ -0,0 +1,343 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.feature + +import scala.util.Random + +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.{Estimator, Model} +import org.apache.spark.ml.linalg.{Vector, VectorUDT} +import org.apache.spark.ml.param.{IntParam, ParamMap, ParamValidators} +import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol} +import org.apache.spark.ml.util.SchemaUtils +import org.apache.spark.sql._ +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.types._ + +/** + * :: Experimental :: --- End diff -- Removed.
[GitHub] spark issue #15377: [SPARK-17802] Improved caller context logging.
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15377 **[Test build #66715 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66715/consoleFull)** for PR 15377 at commit [`7485ffa`](https://github.com/apache/spark/commit/7485ffaa3df508f35df4b878ed715eb1ece0f4db).
[GitHub] spark issue #15148: [SPARK-5992][ML] Locality Sensitive Hashing
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15148 **[Test build #66717 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66717/consoleFull)** for PR 15148 at commit [`2c95e5c`](https://github.com/apache/spark/commit/2c95e5c1d89e2db0350b5d8667e2ae8d293df7a9).
[GitHub] spark issue #15285: [SPARK-17711] Compress rolled executor log
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15285 **[Test build #66716 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66716/consoleFull)** for PR 15285 at commit [`ef4f2b9`](https://github.com/apache/spark/commit/ef4f2b9dc1be33d56d7d4c93bddcfcc2a69a44e9).
[GitHub] spark issue #15408: [SPARK-17839][CORE] Use Nio's directbuffer instead of Bu...
Github user sitalkedia commented on the issue: https://github.com/apache/spark/pull/15408 jenkins retest this please
[GitHub] spark pull request #15416: [SPARK-17849] [SQL] Fix NPE problem when using gr...
Github user yangw1234 commented on a diff in the pull request: https://github.com/apache/spark/pull/15416#discussion_r82721927

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
@@ -298,10 +298,14 @@ class Analyzer(
           case other => Alias(other, other.toString)()
         }
-        val nonNullBitmask = x.bitmasks.reduce(_ & _)
+        // The left most bit in the bitmasks corresponds to the last expression in groupByAliases
+        // with 0 indicating this expression is in the grouping set. The following line of code
+        // calculates the bit mask representing the expressions that exist in all the grouping sets.
+        val nonNullBitmask = ~ x.bitmasks.reduce(_ | _)
--- End diff --

Do you mean `((nonNullBitmask >> (attrLength - idx - 1)) & 1) == 1`? We can only test on `0` if we left shift `1`, right? @davies
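
The two bit tests discussed in this thread behave differently once the bit is not in position 0. The following is a minimal sketch, not code from the PR: `BitTestSketch` and its method names are hypothetical, and `shift` stands in for `attrLength - idx - 1`. It only illustrates why masking with a left-shifted `1` has to be compared against `0`, while right-shifting first makes `== 1` safe.

```scala
object BitTestSketch {
  // Masked form compared with 1: the bit stays in place, so the result is
  // either 0 or (1 << shift). It can equal 1 only when shift == 0.
  def maskedEqualsOne(bits: Int, shift: Int): Boolean =
    (bits & (1 << shift)) == 1

  // Masked form compared with 0: works for any shift in 0..31.
  def maskedNonZero(bits: Int, shift: Int): Boolean =
    (bits & (1 << shift)) != 0

  // Shifted form: the bit is moved down to position 0 first, so == 1 is safe.
  def shiftedEqualsOne(bits: Int, shift: Int): Boolean =
    ((bits >> shift) & 1) == 1

  def main(args: Array[String]): Unit = {
    val bits = 6 // binary 110: bits 1 and 2 are set, bit 0 is clear
    for (shift <- 0 until 3) {
      println(
        s"shift=$shift maskedEqualsOne=${maskedEqualsOne(bits, shift)} " +
          s"maskedNonZero=${maskedNonZero(bits, shift)} " +
          s"shiftedEqualsOne=${shiftedEqualsOne(bits, shift)}")
    }
  }
}
```

For the set bits at positions 1 and 2, `maskedEqualsOne` reports false while the other two forms report true, which is the discrepancy the question above points at.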
[GitHub] spark issue #15416: [SPARK-17849] [SQL] Fix NPE problem when using grouping ...
Github user davies commented on the issue: https://github.com/apache/spark/pull/15416 @yangw1234 Thanks for working on this. Could you also double-check that all the places that use bitmasks are correct?
[GitHub] spark issue #14788: [SPARK-17174][SQL] Add the support for TimestampType for...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14788 **[Test build #66713 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66713/consoleFull)** for PR 14788 at commit [`ef67829`](https://github.com/apache/spark/commit/ef678292d104f2d7a4b637cedc0a388aeb900323).
[GitHub] spark pull request #15423: [SPARK-17860][SQL] SHOW COLUMN's database conflic...
Github user dilipbiswal commented on a diff in the pull request: https://github.com/apache/spark/pull/15423#discussion_r82721435

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala ---
@@ -1713,4 +1713,19 @@ class DDLSuite extends QueryTest with SharedSQLContext with BeforeAndAfterEach {
       assert(sql("show user functions").count() === 1L)
     }
   }
+
+  test("show columns - negative test") {
+    // When case sensitivity is true, the user supplied database name in table identifier
+    // should match the supplied database name in case sensitive way.
+    withSQLConf(SQLConf.CASE_SENSITIVE.key -> "true") {
+      val tabName = "showcolumn"
+      withTable(tabName) {
+        sql(s"CREATE TABLE $tabName(col1 int, col2 string) USING parquet ")
--- End diff --

@viirya OK, I agree. I will make the change.
[GitHub] spark issue #14788: [SPARK-17174][SQL] Add the support for TimestampType for...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14788

**[Test build #66712 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66712/consoleFull)** for PR 14788 at commit [`537fe88`](https://github.com/apache/spark/commit/537fe8858fd78e11c47cb89e847bd355c2494529).
 * This patch **fails Scala style tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
   * `case class DateSub(instant: Expression, days: Expression) extends AddDaysBase(instant, days)`
   * `case class TruncInstant(instant: Expression, format: Expression)`
[GitHub] spark issue #14788: [SPARK-17174][SQL] Add the support for TimestampType for...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14788 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66712/ Test FAILed.
[GitHub] spark issue #14788: [SPARK-17174][SQL] Add the support for TimestampType for...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14788 Merged build finished. Test FAILed.
[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing
Github user Yunni commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r82721024

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/LSH.scala ---
@@ -0,0 +1,339 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import scala.util.Random
+
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.ml.linalg.{Vector, VectorUDT}
+import org.apache.spark.ml.param.{IntParam, ParamMap, ParamValidators}
+import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol}
+import org.apache.spark.ml.util.SchemaUtils
+import org.apache.spark.sql._
+import org.apache.spark.sql.expressions.UserDefinedFunction
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.types._
+
+/**
+ * Params for [[LSH]].
+ */
+@Experimental
+@Since("2.1.0")
+private[ml] trait LSHParams extends HasInputCol with HasOutputCol {
+  /**
+   * Param for the dimension of LSH OR-amplification.
+   *
+   * In this implementation, we use LSH OR-amplification to reduce the false negative rate. The
+   * higher the dimension is, the lower the false negative rate.
+   * @group param
+   */
+  @Since("2.1.0")
+  final val outputDim: IntParam = new IntParam(this, "outputDim", "output dimension, where " +
+    "increasing dimensionality lowers the false negative rate, and decreasing dimensionality" +
+    " improves the running performance", ParamValidators.gt(0))
+
+  /** @group getParam */
+  @Since("2.1.0")
+  final def getOutputDim: Int = $(outputDim)
+
+  // TODO: Decide about this default. It should probably depend on the particular LSH algorithm.
+  setDefault(outputDim -> 1, outputCol -> "lshFeatures")
+
+  /**
+   * Transform the Schema for LSH
+   * @param schema The schema of the input dataset without [[outputCol]]
+   * @return A derived schema with [[outputCol]] added
+   */
+  @Since("2.1.0")
+  protected[this] final def validateAndTransformSchema(schema: StructType): StructType = {
+    SchemaUtils.appendColumn(schema, $(outputCol), new VectorUDT)
+  }
+}
+
+/**
+ * Model produced by [[LSH]].
+ */
+@Experimental
+@Since("2.1.0")
+private[ml] abstract class LSHModel[T <: LSHModel[T]] extends Model[T] with LSHParams {
+  self: T =>
+
+  @Since("2.1.0")
+  override def copy(extra: ParamMap): T = defaultCopy(extra)
+
+  /**
+   * The hash function of LSH, mapping a predefined KeyType to a Vector
+   * @return The mapping of LSH function.
+   */
+  @Since("2.1.0")
+  protected[this] val hashFunction: Vector => Vector
+
+  /**
+   * Calculate the distance between two different keys using the distance metric corresponding
+   * to the hashFunction
+   * @param x One of the points in the metric space
+   * @param y Another point in the metric space
+   * @return The distance between x and y
+   */
+  @Since("2.1.0")
+  protected[ml] def keyDistance(x: Vector, y: Vector): Double
+
+  /**
+   * Calculate the distance between two different hash Vectors.
+   *
+   * @param x One of the hash vectors
+   * @param y Another hash vector
+   * @return The distance between hash vectors x and y
+   */
+  @Since("2.1.0")
+  protected[ml] def hashDistance(x: Vector, y: Vector): Double
+
+  @Since("2.1.0")
+  override def transform(dataset: Dataset[_]): DataFrame = {
+    transformSchema(dataset.schema, logging = true)
+    val transformUDF = udf(hashFunction, new VectorUDT)
+    dataset.withColumn($(outputCol), transformUDF(dataset($(inputCol))))
+  }
+
+  @Since("2.1.0")
+  override def transformSchema(schema: StructType): StructType = {
+    validateAndTransformSchema(schema)
+  }
+
+  /**
+   * Given a large dataset and an item,
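
The `outputDim` doc comment in the diff above says that a higher dimension lowers the false negative rate under OR-amplification. As a rough illustration only, and not part of the PR: if each of `k` hash functions independently collides on a truly similar pair with probability `p`, the pair is missed only when all `k` fail to collide, which happens with probability (1 - p)^k. Treating `k` as `outputDim` and assuming independence are both assumptions of this sketch, and `OrAmplificationSketch` is a hypothetical name.

```scala
object OrAmplificationSketch {
  // Probability that none of k independent hash functions collides on a
  // similar pair, when each one collides with probability p: (1 - p)^k.
  def falseNegativeRate(p: Double, k: Int): Double = {
    require(p >= 0.0 && p <= 1.0, "p must be a probability")
    require(k > 0, "k must be positive")
    math.pow(1.0 - p, k)
  }

  def main(args: Array[String]): Unit = {
    // Assumed per-function collision probability, chosen only for illustration.
    val p = 0.5
    Seq(1, 2, 4, 8).foreach { k =>
      println(f"k=$k%d falseNegativeRate=${falseNegativeRate(p, k)}%.4f")
    }
  }
}
```

Under these assumptions the rate shrinks geometrically with `k`, at the cost of computing more hash values per row, which matches the trade-off the parameter description mentions.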
[GitHub] spark issue #14788: [SPARK-17174][SQL] Add the support for TimestampType for...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14788 **[Test build #66712 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66712/consoleFull)** for PR 14788 at commit [`537fe88`](https://github.com/apache/spark/commit/537fe8858fd78e11c47cb89e847bd355c2494529).
[GitHub] spark pull request #15423: [SPARK-17860][SQL] SHOW COLUMN's database conflic...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/15423#discussion_r82720911

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/SQLQueryTestSuite.scala ---
@@ -207,6 +208,7 @@ class SQLQueryTestSuite extends QueryTest with SharedSQLContext {
     // Returns true if the plan is supposed to be sorted.
     def isSorted(plan: LogicalPlan): Boolean = plan match {
       case _: Join | _: Aggregate | _: Generate | _: Sample | _: Distinct => false
+      case _: ShowColumnsCommand => true
--- End diff --

Personally I don't think it is odd, because we just want to compare the results. Adding `ShowColumnsCommand` to the sorted operators looks odder to me. cc @cloud-fan
[GitHub] spark pull request #15416: [SPARK-17849] [SQL] Fix NPE problem when using gr...
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/15416#discussion_r82720505

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
@@ -298,10 +298,14 @@ class Analyzer(
           case other => Alias(other, other.toString)()
         }
-        val nonNullBitmask = x.bitmasks.reduce(_ & _)
+        // The left most bit in the bitmasks corresponds to the last expression in groupByAliases
+        // with 0 indicating this expression is in the grouping set. The following line of code
+        // calculates the bit mask representing the expressions that exist in all the grouping sets.
+        val nonNullBitmask = ~ x.bitmasks.reduce(_ | _)
--- End diff --

Could you remove the '~' here, and use `(nonNullBitmask & (1 << (attrLength - idx - 1))) == 1`?
[GitHub] spark pull request #15423: [SPARK-17860][SQL] SHOW COLUMN's database conflic...
Github user dilipbiswal commented on a diff in the pull request: https://github.com/apache/spark/pull/15423#discussion_r82720521

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/SQLQueryTestSuite.scala ---
@@ -207,6 +208,7 @@ class SQLQueryTestSuite extends QueryTest with SharedSQLContext {
     // Returns true if the plan is supposed to be sorted.
     def isSorted(plan: LogicalPlan): Boolean = plan match {
       case _: Join | _: Aggregate | _: Generate | _: Sample | _: Distinct => false
+      case _: ShowColumnsCommand => true
--- End diff --

@viirya It seemed odd for the generated output files to list the column names in sorted order, which doesn't reflect the actual column order. In the test case I created the table as follows.
```SQL
CREATE TABLE showcolumn2 (price int, qty int) partitioned by (year int, month int)
```
It seemed odd to me to have the generated output file report the columns as month, price, qty and year as opposed to price, qty, year and month.
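
The effect described above can be reproduced with a small stand-alone sketch. This is not the `SQLQueryTestSuite` code: `SortedOutputSketch` and `normalize` are hypothetical names, and the sketch assumes, as the thread implies, that output rows are sorted before comparison whenever the plan is not treated as sorted.

```scala
object SortedOutputSketch {
  // Hypothetical stand-in for the comparison step: when the plan does not
  // guarantee an order, sort the output lines so the comparison is
  // order-insensitive; otherwise keep them exactly as produced.
  def normalize(lines: Seq[String], planIsSorted: Boolean): Seq[String] =
    if (planIsSorted) lines else lines.sorted

  def main(args: Array[String]): Unit = {
    // Column order from the CREATE TABLE statement quoted in the thread.
    val showColumnsOutput = Seq("price", "qty", "year", "month")
    println(normalize(showColumnsOutput, planIsSorted = false).mkString(", "))
    println(normalize(showColumnsOutput, planIsSorted = true).mkString(", "))
  }
}
```

With `planIsSorted = false` the columns come out as month, price, qty, year, and with `planIsSorted = true` the declared order price, qty, year, month is preserved, which is the difference the two reviewers are weighing.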
[GitHub] spark pull request #15371: [SPARK-17816] [Core] Fix ConcurrentModificationEx...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/15371
[GitHub] spark issue #15371: [SPARK-17816] [Core] Fix ConcurrentModificationException...
Github user zsxwing commented on the issue: https://github.com/apache/spark/pull/15371 There are some conflicts with 2.0. @seyfe could you submit a PR for branch-2.0, please? Thanks!
[GitHub] spark issue #15371: [SPARK-17816] [Core] Fix ConcurrentModificationException...
Github user zsxwing commented on the issue: https://github.com/apache/spark/pull/15371 LGTM. Thanks! Merging to master and 2.0.
[GitHub] spark issue #15375: [SPARK-17790][SPARKR] Support for parallelizing R data.f...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15375 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66699/ Test FAILed.
[GitHub] spark issue #15375: [SPARK-17790][SPARKR] Support for parallelizing R data.f...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15375 Merged build finished. Test FAILed.
[GitHub] spark issue #15375: [SPARK-17790][SPARKR] Support for parallelizing R data.f...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15375

**[Test build #66699 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66699/consoleFull)** for PR 15375 at commit [`62ab47b`](https://github.com/apache/spark/commit/62ab47b016aeb42c0721b52c4c37d502db18c535).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark issue #15285: [SPARK-17711] Compress rolled executor log
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15285

**[Test build #66711 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66711/consoleFull)** for PR 15285 at commit [`ae08495`](https://github.com/apache/spark/commit/ae08495549fe8a2b6750c2b2e4dba8e37779a740).
 * This patch **fails Scala style tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark issue #15285: [SPARK-17711] Compress rolled executor log
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15285 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66711/ Test FAILed.
[GitHub] spark issue #15285: [SPARK-17711] Compress rolled executor log
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15285 Merged build finished. Test FAILed.
[GitHub] spark issue #15285: [SPARK-17711] Compress rolled executor log
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15285 **[Test build #66711 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66711/consoleFull)** for PR 15285 at commit [`ae08495`](https://github.com/apache/spark/commit/ae08495549fe8a2b6750c2b2e4dba8e37779a740).
[GitHub] spark issue #15414: [SPARK-17848][ML] Move LabelCol datatype cast into Predi...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15414 **[Test build #66710 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66710/consoleFull)** for PR 15414 at commit [`6c61e73`](https://github.com/apache/spark/commit/6c61e73c9b8d401f7ec9d48e9f74df7e134cec5f).