[GitHub] spark pull request #17330: [SPARK-19993][SQL] Caching logical plans containi...
Github user dilipbiswal commented on a diff in the pull request: https://github.com/apache/spark/pull/17330#discussion_r110693397 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/subquery.scala --- @@ -59,6 +58,13 @@ abstract class SubqueryExpression( children.zip(p.children).forall(p => p._1.semanticEquals(p._2)) case _ => false } + def canonicalize(attrs: AttributeSeq): SubqueryExpression = { +// Normalize the outer references in the subquery plan. +val subPlan = plan.transformAllExpressions { + case OuterReference(r) => QueryPlan.normalizeExprId(r, attrs) --- End diff -- @cloud-fan Actually, you are right. Preserving the OuterReference would be good. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for...
Github user zero323 commented on a diff in the pull request: https://github.com/apache/spark/pull/17077#discussion_r110692936 --- Diff: python/pyspark/sql/tests.py --- @@ -2167,6 +2167,61 @@ def test_BinaryType_serialization(self): df = self.spark.createDataFrame(data, schema=schema) df.collect() +def test_bucketed_write(self): +data = [ +(1, "foo", 3.0), (2, "foo", 5.0), +(3, "bar", -1.0), (4, "bar", 6.0), +] +df = self.spark.createDataFrame(data, ["x", "y", "z"]) + +# Test write with one bucketing column +df.write.bucketBy(3, "x").mode("overwrite").saveAsTable("pyspark_bucket") +self.assertEqual( +len([c for c in self.spark.catalog.listColumns("pyspark_bucket") + if c.name == "x" and c.isBucket]), +1 +) +self.assertSetEqual(set(data), set(self.spark.table("pyspark_bucket").collect())) + +# Test write two bucketing columns +df.write.bucketBy(3, "x", "y").mode("overwrite").saveAsTable("pyspark_bucket") +self.assertEqual( +len([c for c in self.spark.catalog.listColumns("pyspark_bucket") + if c.name in ("x", "y") and c.isBucket]), +2 +) +self.assertSetEqual(set(data), set(self.spark.table("pyspark_bucket").collect())) + +# Test write with bucket and sort +df.write.bucketBy(2, "x").sortBy("z").mode("overwrite").saveAsTable("pyspark_bucket") +self.assertEqual( +len([c for c in self.spark.catalog.listColumns("pyspark_bucket") + if c.name == "x" and c.isBucket]), +1 +) +self.assertSetEqual(set(data), set(self.spark.table("pyspark_bucket").collect())) + +# Test write with a list of columns +df.write.bucketBy(3, ["x", "y"]).mode("overwrite").saveAsTable("pyspark_bucket") +self.assertEqual( +len([c for c in self.spark.catalog.listColumns("pyspark_bucket") + if c.name in ("x", "y") and c.isBucket]), +2 +) +self.assertSetEqual(set(data), set(self.spark.table("pyspark_bucket").collect())) + +# Test write with bucket and sort with a list of columns +(df.write.bucketBy(2, "x") +.sortBy(["y", "z"]) +.mode("overwrite").saveAsTable("pyspark_bucket")) +self.assertSetEqual(set(data), 
set(self.spark.table("pyspark_bucket").collect())) + +# Test write with bucket and sort with multiple columns +(df.write.bucketBy(2, "x") +.sortBy("y", "z") +.mode("overwrite").saveAsTable("pyspark_bucket")) --- End diff -- I don't think dropping the table beforehand is necessary. We overwrite on each write, and name clashes are unlikely. We can drop it after the tests, but I am not sure how to do that properly. `SQLTests` is overgrown, and I am not sure we should add a `tearDown` only for this, but adding `DROP TABLE` in the test itself doesn't look right.
[GitHub] spark pull request #17591: [SPARK-20280][CORE] FileStatusCache Weigher integ...
Github user hvanhovell commented on a diff in the pull request: https://github.com/apache/spark/pull/17591#discussion_r110692833 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileStatusCache.scala --- @@ -94,27 +94,46 @@ private class SharedInMemoryCache(maxSizeInBytes: Long) extends Logging { // Opaque object that uniquely identifies a shared cache user private type ClientId = Object + private val warnedAboutEviction = new AtomicBoolean(false) // we use a composite cache key in order to distinguish entries inserted by different clients - private val cache: Cache[(ClientId, Path), Array[FileStatus]] = CacheBuilder.newBuilder() -.weigher(new Weigher[(ClientId, Path), Array[FileStatus]] { - override def weigh(key: (ClientId, Path), value: Array[FileStatus]): Int = { -(SizeEstimator.estimate(key) + SizeEstimator.estimate(value)).toInt - }}) -.removalListener(new RemovalListener[(ClientId, Path), Array[FileStatus]]() { - override def onRemoval(removed: RemovalNotification[(ClientId, Path), Array[FileStatus]]) + private val cache: Cache[(ClientId, Path), Array[FileStatus]] = { +/* [[Weigher]].weigh returns Int so we could only cache objects < 2GB + * instead, the weight is divided by this factor (which is smaller + * than the size of one [[FileStatus]]). + * so it will support objects up to 64GB in size. + */ +val weightScale = 32 +CacheBuilder.newBuilder() + .weigher(new Weigher[(ClientId, Path), Array[FileStatus]] { +override def weigh(key: (ClientId, Path), value: Array[FileStatus]): Int = { + val estimate = (SizeEstimator.estimate(key) + SizeEstimator.estimate(value)) / weightScale + if (estimate > Int.MaxValue) { +logWarning(s"Cached table partition metadata size is too big. Approximating to " + + s"${Int.MaxValue.toLong * weightScale}.") +Int.MaxValue + } else { +estimate.toInt + } +} + }) + .removalListener(new RemovalListener[(ClientId, Path), Array[FileStatus]]() { --- End diff -- This is kinda hard to read. 
Can we just initialize the weigher and the listener in separate variables?
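The scaling applied in the diff above can be sketched without Guava or Spark. This is a minimal illustration under stated assumptions, not the PR's code: `WeightSketch`, `naiveWeight` and `scaledWeight` are hypothetical names, and the byte estimate is passed in directly instead of being computed by `SizeEstimator`.

```scala
object WeightSketch {
  // Divisor taken from the diff; the comment there says it is smaller than one FileStatus.
  val weightScale = 32

  // Old behaviour: narrowing a Long estimate straight to Int wraps around above Int.MaxValue.
  def naiveWeight(estimatedBytes: Long): Int = estimatedBytes.toInt

  // New behaviour: scale down first, then clamp anything that still does not fit in an Int.
  def scaledWeight(estimatedBytes: Long): Int = {
    val scaled = estimatedBytes / weightScale
    if (scaled > Int.MaxValue) Int.MaxValue else scaled.toInt
  }
}
```

For an estimate of 3,000,000,000 bytes, `WeightSketch.naiveWeight` returns -1294967296, while `WeightSketch.scaledWeight` returns 93750000. The largest estimate that is not clamped is `Int.MaxValue.toLong * 32` = 68,719,476,704 bytes, which is where the "up to 64GB" figure in the diff comment comes from.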
[GitHub] spark pull request #17591: [SPARK-20280][CORE] FileStatusCache Weigher integ...
Github user hvanhovell commented on a diff in the pull request: https://github.com/apache/spark/pull/17591#discussion_r110692271 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileStatusCache.scala --- @@ -94,27 +94,46 @@ private class SharedInMemoryCache(maxSizeInBytes: Long) extends Logging { // Opaque object that uniquely identifies a shared cache user private type ClientId = Object + private val warnedAboutEviction = new AtomicBoolean(false) // we use a composite cache key in order to distinguish entries inserted by different clients - private val cache: Cache[(ClientId, Path), Array[FileStatus]] = CacheBuilder.newBuilder() -.weigher(new Weigher[(ClientId, Path), Array[FileStatus]] { - override def weigh(key: (ClientId, Path), value: Array[FileStatus]): Int = { -(SizeEstimator.estimate(key) + SizeEstimator.estimate(value)).toInt - }}) -.removalListener(new RemovalListener[(ClientId, Path), Array[FileStatus]]() { - override def onRemoval(removed: RemovalNotification[(ClientId, Path), Array[FileStatus]]) + private val cache: Cache[(ClientId, Path), Array[FileStatus]] = { +/* [[Weigher]].weigh returns Int so we could only cache objects < 2GB --- End diff -- NIT: Could you use java style comments `//` here?
[GitHub] spark pull request #17566: [SPARK-19518][SQL] IGNORE NULLS in first / last i...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/17566
[GitHub] spark pull request #17591: [SPARK-20280][CORE] FileStatusCache Weigher integ...
Github user hvanhovell commented on a diff in the pull request: https://github.com/apache/spark/pull/17591#discussion_r110692150 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileIndexSuite.scala --- @@ -220,6 +221,32 @@ class FileIndexSuite extends SharedSQLContext { assert(catalog.leafDirPaths.head == fs.makeQualified(dirPath)) } } + + test("SPARK-20280 - FileStatusCache with a partition with very many files") { +/* fake the size, otherwise we need to allocate 2GB of data to trigger this bug */ +class MyFileStatus extends FileStatus with KnownSizeEstimation { + override def estimatedSize: Long = 1000 * 1000 * 1000 +} +/* files * MyFileStatus.estimatedSize should overflow to negative integer + * so, make it between 2bn and 4bn + */ +val files = (1 to 3).map { i => + new MyFileStatus() +} +val fileStatusCache = FileStatusCache.getOrCreate(spark) +fileStatusCache.putLeafFiles(new Path("/tmp", "abc"), files.toArray) +// scalastyle:off --- End diff -- Let's remove this comment block; the JIRA should be used for tracking these things.
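The "between 2bn and 4bn" sizing in the test comment follows from 32-bit wrap-around, which a few lines of plain Scala can show. This is a simplified sketch: `OverflowWindow` and `truncatedTotal` are hypothetical names, and the total ignores the key and array overhead that a real size estimate would add.

```scala
object OverflowWindow {
  // Size each fake FileStatus reports in the test from the diff.
  val perFileBytes: Long = 1000L * 1000 * 1000

  // Total size narrowed to Int, as the pre-fix weigher did with its estimate.
  def truncatedTotal(fileCount: Int): Int = (fileCount * perFileBytes).toInt

  def main(args: Array[String]): Unit =
    (1 to 5).foreach(n => println(s"$n files -> ${truncatedTotal(n)}"))
}
```

Under these assumptions, one or two files still fit in an `Int`, three and four files wrap to a negative value (-1294967296 and -294967296), and five files wrap all the way back to a positive one. That is why the test builds exactly three fake entries.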
[GitHub] spark issue #17593: [SPARK-20279][WEB-UI]In web ui,'Only showing 200' should...
Github user ajbozarth commented on the issue: https://github.com/apache/spark/pull/17593 I agree with @srowen; we left it that way since the sorting can change.
[GitHub] spark issue #17566: [SPARK-19518][SQL] IGNORE NULLS in first / last in SQL
Github user hvanhovell commented on the issue: https://github.com/apache/spark/pull/17566 LGTM - merging to master. Thanks!
[GitHub] spark pull request #17592: [SPARK-20243][TESTS] DebugFilesystem.assertNoOpen...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/17592
[GitHub] spark issue #17592: [SPARK-20243][TESTS] DebugFilesystem.assertNoOpenStreams...
Github user hvanhovell commented on the issue: https://github.com/apache/spark/pull/17592 Merging to master. Thanks!
[GitHub] spark issue #17592: [SPARK-20243][TESTS] DebugFilesystem.assertNoOpenStreams...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17592 Merged build finished. Test PASSed.
[GitHub] spark issue #17592: [SPARK-20243][TESTS] DebugFilesystem.assertNoOpenStreams...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17592 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75662/ Test PASSed.
[GitHub] spark issue #17592: [SPARK-20243][TESTS] DebugFilesystem.assertNoOpenStreams...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17592 **[Test build #75662 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75662/testReport)** for PR 17592 at commit [`d486d60`](https://github.com/apache/spark/commit/d486d6015ae0129ce41e6683eae37243c843ba59). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #17308: [SPARK-19968][SS] Use a cached instance of `Kafka...
Github user BenFradet commented on a diff in the pull request: https://github.com/apache/spark/pull/17308#discussion_r110685760 --- Diff: external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/CachedKafkaProducer.scala --- @@ -0,0 +1,70 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.kafka010 + +import java.{util => ju} + +import scala.collection.mutable + +import org.apache.kafka.clients.producer.KafkaProducer + +import org.apache.spark.internal.Logging + +private[kafka010] object CachedKafkaProducer extends Logging { + + private val cacheMap = new mutable.HashMap[Int, KafkaProducer[Array[Byte], Array[Byte]]]() + + private def createKafkaProducer( +producerConfiguration: ju.HashMap[String, Object]): KafkaProducer[Array[Byte], Array[Byte]] = { +val kafkaProducer: KafkaProducer[Array[Byte], Array[Byte]] = + new KafkaProducer[Array[Byte], Array[Byte]](producerConfiguration) +cacheMap.put(producerConfiguration.hashCode(), kafkaProducer) --- End diff -- True, my bad; I thought `KafkaSink` was a public API.
[GitHub] spark issue #17491: [SPARK-20175][SQL] Exists should not be evaluated in Joi...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/17491 cc @cloud-fan
[GitHub] spark issue #17591: [SPARK-20280][CORE] FileStatusCache Weigher integer over...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17591 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75663/ Test PASSed.
[GitHub] spark issue #17591: [SPARK-20280][CORE] FileStatusCache Weigher integer over...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17591 Merged build finished. Test PASSed.
[GitHub] spark issue #17591: [SPARK-20280][CORE] FileStatusCache Weigher integer over...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17591 **[Test build #75663 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75663/testReport)** for PR 17591 at commit [`ef8e9e9`](https://github.com/apache/spark/commit/ef8e9e93f6883542847e2a136fb63a4565bf07b4). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #17592: [SPARK-20243][TESTS] DebugFilesystem.assertNoOpenStreams...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17592 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75661/ Test PASSed.
[GitHub] spark issue #17592: [SPARK-20243][TESTS] DebugFilesystem.assertNoOpenStreams...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17592 Merged build finished. Test PASSed.
[GitHub] spark issue #17592: [SPARK-20243][TESTS] DebugFilesystem.assertNoOpenStreams...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17592 **[Test build #75661 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75661/testReport)** for PR 17592 at commit [`19ca5a1`](https://github.com/apache/spark/commit/19ca5a1114dbe666ac40f4c966f3e600b58f92c2). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #17587: [SPARK-20274][SQL] support compatible array eleme...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/17587#discussion_r110676717 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala --- @@ -270,12 +270,13 @@ object ScalaReflection extends ScalaReflection { case t if t <:< localTypeOf[Array[_]] => val TypeRef(_, _, Seq(elementType)) = t -val Schema(_, elementNullable) = schemaFor(elementType) +val Schema(dataType, elementNullable) = schemaFor(elementType) val className = getClassNameFromType(elementType) val newTypePath = s"""- array element class: "$className"""" +: walkedTypePath -val mapFunction: Expression => Expression = p => { - val converter = deserializerFor(elementType, Some(p), newTypePath) +val mapFunction: Expression => Expression = element => { + val casted = upCastToExpectedType(element, dataType, newTypePath) --- End diff -- Shall we add a comment? E.g., "If the array element type is compatible, we cast the element to the element type the decoder expects."
[GitHub] spark issue #17587: [SPARK-20274][SQL] support compatible array element type...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/17587 LGTM
[GitHub] spark pull request #17575: [SPARK-20265][MLlib] Improve Prefix'span pre-proc...
Github user Syrux commented on a diff in the pull request: https://github.com/apache/spark/pull/17575#discussion_r110671916 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala --- @@ -232,6 +200,68 @@ class PrefixSpan private ( object PrefixSpan extends Logging { /** + * This methods finds all frequent items in a input dataset. + * + * @param data Sequences of itemsets. + * @param minCount The minimal number of sequence an item should be present in to be frequent + * + * @return An array of Item containing only frequent items. + */ + private[fpm] def findFrequentItems[Item: ClassTag](data : RDD[Array[Array[Item]]], + minCount : Long): Array[Item] = { + +data.flatMap { itemsets => + val uniqItems = mutable.Set.empty[Item] --- End diff -- `itemsets.foreach(set => uniqItems ++= set)` does work. I will change it in my next commit, and push it once I know what to do about the flag.
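The two ways of collecting unique items discussed here can be compared outside Spark. A minimal sketch, assuming plain arrays instead of an RDD; `UniqItemsSketch`, `nested` and `bulk` are hypothetical names used only for illustration.

```scala
import scala.collection.mutable

object UniqItemsSketch {
  // Nested-foreach form, as in the diff under review.
  def nested[Item](itemsets: Array[Array[Item]]): Set[Item] = {
    val uniqItems = mutable.Set.empty[Item]
    itemsets.foreach { _.foreach { item => uniqItems += item } }
    uniqItems.toSet
  }

  // Shorter form from the comment: add each itemset to the set in one call.
  def bulk[Item](itemsets: Array[Array[Item]]): Set[Item] = {
    val uniqItems = mutable.Set.empty[Item]
    itemsets.foreach(set => uniqItems ++= set)
    uniqItems.toSet
  }
}
```

Both functions return the same set for the same input, so the shorter form only changes readability, not the result.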
[GitHub] spark pull request #17330: [SPARK-19993][SQL] Caching logical plans containi...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/17330#discussion_r110671813 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/subquery.scala --- @@ -59,6 +58,13 @@ abstract class SubqueryExpression( children.zip(p.children).forall(p => p._1.semanticEquals(p._2)) case _ => false } + def canonicalize(attrs: AttributeSeq): SubqueryExpression = { +// Normalize the outer references in the subquery plan. +val subPlan = plan.transformAllExpressions { + case OuterReference(r) => QueryPlan.normalizeExprId(r, attrs) --- End diff -- All the `OuterReference`s will be removed; is that expected?
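The concern about losing the `OuterReference` wrapper can be illustrated with a toy expression tree. This is only a sketch under loud assumptions: the classes below are not Catalyst's, `transform` is a hand-written top-down rewrite, and `normalize` is a simplified stand-in for `QueryPlan.normalizeExprId` that replaces an expression id with the attribute's position in `attrs`.

```scala
object OuterRefSketch {
  sealed trait Expr
  case class Attr(name: String, exprId: Long) extends Expr
  case class OuterReference(child: Expr) extends Expr
  case class EqualTo(left: Expr, right: Expr) extends Expr

  // Top-down rewrite: apply the rule to a node, then recurse into the result's children.
  def transform(e: Expr)(rule: PartialFunction[Expr, Expr]): Expr =
    rule.applyOrElse(e, (x: Expr) => x) match {
      case OuterReference(c) => OuterReference(transform(c)(rule))
      case EqualTo(l, r)     => EqualTo(transform(l)(rule), transform(r)(rule))
      case leaf              => leaf
    }

  // Hypothetical normalization: swap the expression id for the attribute's ordinal in attrs.
  def normalize(e: Expr, attrs: Seq[Attr]): Expr = e match {
    case Attr(name, id) =>
      val ordinal = attrs.indexWhere(_.exprId == id)
      if (ordinal == -1) e else Attr(name, ordinal.toLong)
    case other => other
  }

  // Shape of the rule in the diff: the OuterReference wrapper is dropped.
  def stripping(e: Expr, attrs: Seq[Attr]): Expr =
    transform(e) { case OuterReference(r) => normalize(r, attrs) }

  // Alternative that normalizes the attribute but keeps the wrapper.
  def preserving(e: Expr, attrs: Seq[Attr]): Expr =
    transform(e) { case OuterReference(r) => OuterReference(normalize(r, attrs)) }
}
```

With `attrs = Seq(Attr("x", 42))` and the expression `EqualTo(OuterReference(Attr("x", 42)), Attr("y", 7))`, `stripping` yields `EqualTo(Attr("x", 0), Attr("y", 7))`, so the outer attribute can no longer be told apart from a local one, while `preserving` yields `EqualTo(OuterReference(Attr("x", 0)), Attr("y", 7))`.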
[GitHub] spark pull request #17330: [SPARK-19993][SQL] Caching logical plans containi...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/17330#discussion_r110671334 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/subquery.scala --- @@ -59,6 +58,13 @@ abstract class SubqueryExpression( children.zip(p.children).forall(p => p._1.semanticEquals(p._2)) case _ => false } + def canonicalize(attrs: AttributeSeq): SubqueryExpression = { +// Normalize the outer references in the subquery plan. +val subPlan = plan.transformAllExpressions { --- End diff -- `canonicalizedPlan`
[GitHub] spark pull request #17575: [SPARK-20265][MLlib] Improve Prefix'span pre-proc...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/17575#discussion_r110671100 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala --- @@ -232,6 +200,68 @@ class PrefixSpan private ( object PrefixSpan extends Logging { /** + * This methods finds all frequent items in a input dataset. + * + * @param data Sequences of itemsets. + * @param minCount The minimal number of sequence an item should be present in to be frequent + * + * @return An array of Item containing only frequent items. + */ + private[fpm] def findFrequentItems[Item: ClassTag](data : RDD[Array[Array[Item]]], + minCount : Long): Array[Item] = { + +data.flatMap { itemsets => + val uniqItems = mutable.Set.empty[Item] + itemsets.foreach { _.foreach { item => +uniqItems += item + }} + uniqItems.toIterator.map((_, 1L)) +}.reduceByKey(_ + _).filter { case (_, count) => +count >= minCount +}.sortBy(-_._2).map(_._1).collect() + } + + /** + * This methods cleans the input dataset from un-frequent items, and translate it's item + * to their corresponding Int identifier. + * + * @param data Sequences of itemsets. + * @param itemToInt A map allowing translation of frequent Items to their Int Identifier. + * The map should only contain frequent item. + * + * @return The internal repr of the inputted dataset. With properly placed zero delimiter. 
+ */ + private[fpm] def toDatabaseInternalRepr[Item: ClassTag](data : RDD[Array[Array[Item]]], +itemToInt : Map[Item, Int]): + RDD[Array[Int]] = { + +data.flatMap { itemsets => + val allItems = mutable.ArrayBuilder.make[Int] + var containsFreqItems = false + allItems += 0 + itemsets.foreach { itemsets => +val items = mutable.ArrayBuilder.make[Int] +itemsets.foreach { item => + if (itemToInt.contains(item)) { +items += itemToInt(item) + 1 // using 1-indexing in internal format + } +} +val result = items.result() +if (result.nonEmpty) { + containsFreqItems = true + allItems ++= result.sorted + allItems += 0 +} + } + if (containsFreqItems) { --- End diff -- OK no problem, leave it. Just riffing while we're editing the code.
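The zero-delimited internal format built in the diff above can be tried on a single sequence without Spark. This is a hedged sketch: `InternalReprSketch` and `encode` are hypothetical names, and returning an `Option` is an assumption, since the quoted diff is cut off before it shows how the result is emitted.

```scala
import scala.collection.mutable

object InternalReprSketch {
  // Encodes one sequence of itemsets; None when no frequent item survives (assumed behaviour).
  def encode[Item](itemsets: Array[Array[Item]], itemToInt: Map[Item, Int]): Option[Array[Int]] = {
    val allItems = mutable.ArrayBuilder.make[Int]
    var containsFreqItems = false
    allItems += 0 // leading delimiter
    itemsets.foreach { itemset =>
      // Keep only frequent items, shifted by one because 0 is reserved as the delimiter.
      val ids = itemset.collect { case item if itemToInt.contains(item) => itemToInt(item) + 1 }
      if (ids.nonEmpty) {
        containsFreqItems = true
        allItems ++= ids.sorted
        allItems += 0 // delimiter after each kept itemset
      }
    }
    if (containsFreqItems) Some(allItems.result()) else None
  }
}
```

For the sequence `[[a, b], [c], [b]]` with `itemToInt = Map("a" -> 0, "b" -> 1)`, the infrequent item `c` is dropped together with its now-empty itemset, and the encoding is `0, 1, 2, 0, 2, 0`.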
[GitHub] spark pull request #17330: [SPARK-19993][SQL] Caching logical plans containi...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/17330#discussion_r110671091 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/subquery.scala --- @@ -236,6 +244,12 @@ case class ScalarSubquery( override def nullable: Boolean = true override def withNewPlan(plan: LogicalPlan): ScalarSubquery = copy(plan = plan) override def toString: String = s"scalar-subquery#${exprId.id} $conditionString" + override lazy val canonicalized: Expression = { --- End diff -- oh sorry it's expression
[GitHub] spark issue #17587: [SPARK-20274][SQL] support compatible array element type...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/17587 @kiszk The plan may not be very helpful, as the only difference in the plan is `MapObjects.lambdaFunction`: before it was just `LambdaVariable`, but now it's `Cast(LambdaVariable, xxx)`.
[GitHub] spark pull request #17575: [SPARK-20265][MLlib] Improve Prefix'span pre-proc...
Github user Syrux commented on a diff in the pull request: https://github.com/apache/spark/pull/17575#discussion_r110669561 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala --- @@ -232,6 +200,68 @@ class PrefixSpan private ( object PrefixSpan extends Logging { /** + * This methods finds all frequent items in a input dataset. + * + * @param data Sequences of itemsets. + * @param minCount The minimal number of sequence an item should be present in to be frequent + * + * @return An array of Item containing only frequent items. + */ + private[fpm] def findFrequentItems[Item: ClassTag](data : RDD[Array[Array[Item]]], + minCount : Long): Array[Item] = { + +data.flatMap { itemsets => + val uniqItems = mutable.Set.empty[Item] + itemsets.foreach { _.foreach { item => +uniqItems += item + }} + uniqItems.toIterator.map((_, 1L)) +}.reduceByKey(_ + _).filter { case (_, count) => +count >= minCount +}.sortBy(-_._2).map(_._1).collect() + } + + /** + * This methods cleans the input dataset from un-frequent items, and translate it's item + * to their corresponding Int identifier. + * + * @param data Sequences of itemsets. + * @param itemToInt A map allowing translation of frequent Items to their Int Identifier. + * The map should only contain frequent item. + * + * @return The internal repr of the inputted dataset. With properly placed zero delimiter. 
+ */ + private[fpm] def toDatabaseInternalRepr[Item: ClassTag](data : RDD[Array[Array[Item]]], +itemToInt : Map[Item, Int]): + RDD[Array[Int]] = { + +data.flatMap { itemsets => + val allItems = mutable.ArrayBuilder.make[Int] + var containsFreqItems = false + allItems += 0 + itemsets.foreach { itemsets => +val items = mutable.ArrayBuilder.make[Int] +itemsets.foreach { item => + if (itemToInt.contains(item)) { +items += itemToInt(item) + 1 // using 1-indexing in internal format + } +} +val result = items.result() +if (result.nonEmpty) { + containsFreqItems = true + allItems ++= result.sorted + allItems += 0 +} + } + if (containsFreqItems) { --- End diff -- Apparently, prepending is impossible on an `ArrayBuilder`: the method doesn't exist (http://www.scala-lang.org/api/2.12.0/scala/collection/mutable/ArrayBuilder.html). I think the flag is our best bet for performance. Changing it to an `ArrayBuffer` would be far worse, since it would force boxing of the Ints it contains.
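For readers following the flag discussion, here is a minimal sketch of the same encoding on a single in-memory sequence, with no Spark involved. The object name `InternalReprSketch` and the method `encode` are hypothetical and not part of the PR; the body only mirrors the quoted diff under that assumption.

```scala
import scala.collection.mutable

// Hypothetical, Spark-free stand-in for the per-sequence body of
// toDatabaseInternalRepr shown in the diff above.
object InternalReprSketch {
  def encode[Item](itemsets: Array[Array[Item]], itemToInt: Map[Item, Int]): Option[Array[Int]] = {
    val allItems = mutable.ArrayBuilder.make[Int]
    // ArrayBuilder exposes no size or prepend, so a flag records
    // whether any frequent item was kept.
    var containsFreqItems = false
    allItems += 0 // leading delimiter
    itemsets.foreach { itemset =>
      val items = mutable.ArrayBuilder.make[Int]
      // Keep only frequent items, shifted to 1-based ids (0 is the delimiter).
      itemset.foreach { item => itemToInt.get(item).foreach(id => items += id + 1) }
      val result = items.result()
      if (result.nonEmpty) {
        containsFreqItems = true
        allItems ++= result.sorted
        allItems += 0 // delimiter closing this itemset
      }
    }
    // A sequence with no frequent item is dropped entirely.
    if (containsFreqItems) Some(allItems.result()) else None
  }
}
```

With `Map(4 -> 0, 5 -> 1)` as the frequent-item map, a sequence whose itemsets are `Array(4)`, `Array(1)`, `Array(5)` would encode to `0, 1, 0, 2, 0`, while a sequence containing only infrequent items yields `None`.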
[GitHub] spark pull request #17575: [SPARK-20265][MLlib] Improve Prefix'span pre-proc...
Github user Syrux commented on a diff in the pull request: https://github.com/apache/spark/pull/17575#discussion_r110667171 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala --- @@ -232,6 +200,68 @@ class PrefixSpan private ( object PrefixSpan extends Logging { /** + * This methods finds all frequent items in a input dataset. + * + * @param data Sequences of itemsets. + * @param minCount The minimal number of sequence an item should be present in to be frequent + * + * @return An array of Item containing only frequent items. + */ + private[fpm] def findFrequentItems[Item: ClassTag](data : RDD[Array[Array[Item]]], + minCount : Long): Array[Item] = { + +data.flatMap { itemsets => + val uniqItems = mutable.Set.empty[Item] + itemsets.foreach { _.foreach { item => +uniqItems += item + }} + uniqItems.toIterator.map((_, 1L)) +}.reduceByKey(_ + _).filter { case (_, count) => +count >= minCount +}.sortBy(-_._2).map(_._1).collect() + } + + /** + * This methods cleans the input dataset from un-frequent items, and translate it's item + * to their corresponding Int identifier. + * + * @param data Sequences of itemsets. + * @param itemToInt A map allowing translation of frequent Items to their Int Identifier. + * The map should only contain frequent item. + * + * @return The internal repr of the inputted dataset. With properly placed zero delimiter. 
+ */ + private[fpm] def toDatabaseInternalRepr[Item: ClassTag](data : RDD[Array[Array[Item]]], +itemToInt : Map[Item, Int]): + RDD[Array[Int]] = { + +data.flatMap { itemsets => + val allItems = mutable.ArrayBuilder.make[Int] + var containsFreqItems = false + allItems += 0 + itemsets.foreach { itemsets => +val items = mutable.ArrayBuilder.make[Int] +itemsets.foreach { item => + if (itemToInt.contains(item)) { +items += itemToInt(item) + 1 // using 1-indexing in internal format + } +} +val result = items.result() +if (result.nonEmpty) { + containsFreqItems = true + allItems ++= result.sorted + allItems += 0 +} + } + if (containsFreqItems) { --- End diff -- I am not sure about the performance of a prepend on an `ArrayBuilder`. I will check that first. Back in a few minutes.
[GitHub] spark issue #17591: [SPARK-20280][CORE] FileStatusCache Weigher integer over...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17591 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75660/ Test PASSed.
[GitHub] spark issue #17591: [SPARK-20280][CORE] FileStatusCache Weigher integer over...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17591 Merged build finished. Test PASSed.
[GitHub] spark issue #17591: [SPARK-20280][CORE] FileStatusCache Weigher integer over...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17591 **[Test build #75660 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75660/testReport)** for PR 17591 at commit [`f4ae52a`](https://github.com/apache/spark/commit/f4ae52a0c60b9d2a086e4f1ad0bb3675bebc47da). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #17575: [SPARK-20265][MLlib] Improve Prefix'span pre-proc...
Github user Syrux commented on a diff in the pull request: https://github.com/apache/spark/pull/17575#discussion_r110664282 --- Diff: mllib/src/test/scala/org/apache/spark/mllib/fpm/PrefixSpanSuite.scala --- @@ -360,6 +360,55 @@ class PrefixSpanSuite extends SparkFunSuite with MLlibTestSparkContext { compareResults(expected, model.freqSequences.collect()) } + test("PrefixSpan pre-processing's cleaning test") { + +// One item per itemSet +val itemToInt1 = (4 to 5).zipWithIndex.toMap +val sequences1 = Seq( + Array(Array(4), Array(1), Array(2), Array(5), Array(2), Array(4), Array(5)), + Array(Array(6), Array(7), Array(8))) +val rdd1 = sc.parallelize(sequences1, 2).cache() + +val cleanedSequence1 = PrefixSpan.toDatabaseInternalRepr(rdd1, itemToInt1).collect() + +val expected1 = Array(Array(0, 4, 0, 5, 0, 4, 0, 5, 0)) + .map(x => x.map(y => { +if (y == 0) 0 +else itemToInt1(y) + 1 + })) + +compareInternalSequences(expected1, cleanedSequence1) + +// Multi-item sequence +val itemToInt2 = (4 to 6).zipWithIndex.toMap +val sequences2 = Seq( + Array(Array(4, 5), Array(1, 6, 2), Array(2), Array(5), Array(2), Array(4), Array(5, 6, 7)), + Array(Array(8, 9), Array(1, 2))) +val rdd2 = sc.parallelize(sequences2, 2).cache() + +val cleanedSequence2 = PrefixSpan.toDatabaseInternalRepr(rdd2, itemToInt2).collect() + +val expected2 = Array(Array(0, 4, 5, 0, 6, 0, 5, 0, 4, 0, 5, 6, 0)) + .map(x => x.map(y => { +if (y == 0) 0 +else itemToInt2(y) + 1 + })) + +compareInternalSequences(expected2, cleanedSequence2) + +// Emptied sequence +val itemToInt3 = (10 to 10).zipWithIndex.toMap +val sequences3 = Seq( + Array(Array(4, 5), Array(1, 6, 2), Array(2), Array(5), Array(2), Array(4), Array(5, 6, 7)), + Array(Array(8, 9), Array(1, 2))) +val rdd3 = sc.parallelize(sequences3, 2).cache() + +val cleanedSequence3 = PrefixSpan.toDatabaseInternalRepr(rdd3, itemToInt3).collect() +val expected3: Array[Array[Int]] = Array() --- End diff -- Yep, it can. It even avoids a useless cast. 
I will push the new version asap
[GitHub] spark issue #17527: [SPARK-20156][CORE][SQL][STREAMING][MLLIB] Java String t...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17527 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75657/ Test PASSed.
[GitHub] spark issue #17527: [SPARK-20156][CORE][SQL][STREAMING][MLLIB] Java String t...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17527 Merged build finished. Test PASSed.
[GitHub] spark issue #17527: [SPARK-20156][CORE][SQL][STREAMING][MLLIB] Java String t...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17527 **[Test build #75657 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75657/testReport)** for PR 17527 at commit [`2ac5843`](https://github.com/apache/spark/commit/2ac5843a071847dbe6e8cea08b49a7ef36587101). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #17575: [SPARK-20265][MLlib] Improve Prefix'span pre-proc...
Github user Syrux commented on a diff in the pull request: https://github.com/apache/spark/pull/17575#discussion_r110662589 --- Diff: mllib/src/test/scala/org/apache/spark/mllib/fpm/PrefixSpanSuite.scala --- @@ -360,6 +360,55 @@ class PrefixSpanSuite extends SparkFunSuite with MLlibTestSparkContext { compareResults(expected, model.freqSequences.collect()) } + test("PrefixSpan pre-processing's cleaning test") { + +// One item per itemSet +val itemToInt1 = (4 to 5).zipWithIndex.toMap +val sequences1 = Seq( + Array(Array(4), Array(1), Array(2), Array(5), Array(2), Array(4), Array(5)), + Array(Array(6), Array(7), Array(8))) +val rdd1 = sc.parallelize(sequences1, 2).cache() + +val cleanedSequence1 = PrefixSpan.toDatabaseInternalRepr(rdd1, itemToInt1).collect() + +val expected1 = Array(Array(0, 4, 0, 5, 0, 4, 0, 5, 0)) + .map(x => x.map(y => { +if (y == 0) 0 --- End diff -- Ok, changing it --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17575: [SPARK-20265][MLlib] Improve Prefix'span pre-proc...
Github user Syrux commented on a diff in the pull request: https://github.com/apache/spark/pull/17575#discussion_r110662506 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala --- @@ -232,6 +200,68 @@ class PrefixSpan private ( object PrefixSpan extends Logging { /** + * This methods finds all frequent items in a input dataset. + * + * @param data Sequences of itemsets. + * @param minCount The minimal number of sequence an item should be present in to be frequent + * + * @return An array of Item containing only frequent items. + */ + private[fpm] def findFrequentItems[Item: ClassTag](data : RDD[Array[Array[Item]]], + minCount : Long): Array[Item] = { + +data.flatMap { itemsets => + val uniqItems = mutable.Set.empty[Item] --- End diff -- OK, changed the braces to parentheses. (I suppose that's what you meant; correct me if I'm wrong.) Also, just out of curiosity, do you know if that makes any difference in performance?
[GitHub] spark pull request #17575: [SPARK-20265][MLlib] Improve Prefix'span pre-proc...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/17575#discussion_r110662332 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala --- @@ -232,6 +200,68 @@ class PrefixSpan private ( object PrefixSpan extends Logging { /** + * This methods finds all frequent items in a input dataset. + * + * @param data Sequences of itemsets. + * @param minCount The minimal number of sequence an item should be present in to be frequent + * + * @return An array of Item containing only frequent items. + */ + private[fpm] def findFrequentItems[Item: ClassTag](data : RDD[Array[Array[Item]]], + minCount : Long): Array[Item] = { + +data.flatMap { itemsets => + val uniqItems = mutable.Set.empty[Item] + itemsets.foreach { _.foreach { item => +uniqItems += item + }} + uniqItems.toIterator.map((_, 1L)) +}.reduceByKey(_ + _).filter { case (_, count) => +count >= minCount +}.sortBy(-_._2).map(_._1).collect() + } + + /** + * This methods cleans the input dataset from un-frequent items, and translate it's item + * to their corresponding Int identifier. + * + * @param data Sequences of itemsets. + * @param itemToInt A map allowing translation of frequent Items to their Int Identifier. + * The map should only contain frequent item. + * + * @return The internal repr of the inputted dataset. With properly placed zero delimiter. 
+ */ + private[fpm] def toDatabaseInternalRepr[Item: ClassTag](data : RDD[Array[Array[Item]]], +itemToInt : Map[Item, Int]): + RDD[Array[Int]] = { + +data.flatMap { itemsets => + val allItems = mutable.ArrayBuilder.make[Int] + var containsFreqItems = false + allItems += 0 + itemsets.foreach { itemsets => +val items = mutable.ArrayBuilder.make[Int] +itemsets.foreach { item => + if (itemToInt.contains(item)) { +items += itemToInt(item) + 1 // using 1-indexing in internal format + } +} +val result = items.result() +if (result.nonEmpty) { + containsFreqItems = true + allItems ++= result.sorted + allItems += 0 +} + } + if (containsFreqItems) { --- End diff -- I see. What about waiting to pre-pend the initial 0 until the end, only if not empty?
[GitHub] spark pull request #17575: [SPARK-20265][MLlib] Improve Prefix'span pre-proc...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/17575#discussion_r110662125 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala --- @@ -232,6 +200,68 @@ class PrefixSpan private ( object PrefixSpan extends Logging { /** + * This methods finds all frequent items in a input dataset. + * + * @param data Sequences of itemsets. + * @param minCount The minimal number of sequence an item should be present in to be frequent + * + * @return An array of Item containing only frequent items. + */ + private[fpm] def findFrequentItems[Item: ClassTag](data : RDD[Array[Array[Item]]], + minCount : Long): Array[Item] = { + +data.flatMap { itemsets => + val uniqItems = mutable.Set.empty[Item] --- End diff -- or does `itemsets.foreach(set => uniqItems ++= set)` work?
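As a quick check of the suggestion above, the following sketch compares the nested `foreach` from the diff with the bulk `++=` form on plain arrays. The object and method names (`UniqueItemsSketch`, `nested`, `bulk`) are made up for illustration and do not appear in the PR.

```scala
import scala.collection.mutable

// Hypothetical comparison of the two ways to collect the distinct items
// of one sequence of itemsets.
object UniqueItemsSketch {
  // Form used in the diff: add items one by one.
  def nested[Item](itemsets: Array[Array[Item]]): Set[Item] = {
    val uniqItems = mutable.Set.empty[Item]
    itemsets.foreach { _.foreach { item => uniqItems += item } }
    uniqItems.toSet
  }

  // Form suggested in the review: add each itemset in bulk.
  def bulk[Item](itemsets: Array[Array[Item]]): Set[Item] = {
    val uniqItems = mutable.Set.empty[Item]
    itemsets.foreach(set => uniqItems ++= set)
    uniqItems.toSet
  }
}
```

Both forms should produce the same set; the bulk version is only shorter, and this sketch makes no claim about relative performance.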
[GitHub] spark pull request #17575: [SPARK-20265][MLlib] Improve Prefix'span pre-proc...
Github user Syrux commented on a diff in the pull request: https://github.com/apache/spark/pull/17575#discussion_r110661717 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala --- @@ -232,6 +200,68 @@ class PrefixSpan private ( object PrefixSpan extends Logging { /** + * This methods finds all frequent items in a input dataset. + * + * @param data Sequences of itemsets. + * @param minCount The minimal number of sequence an item should be present in to be frequent + * + * @return An array of Item containing only frequent items. + */ + private[fpm] def findFrequentItems[Item: ClassTag](data : RDD[Array[Array[Item]]], + minCount : Long): Array[Item] = { + +data.flatMap { itemsets => + val uniqItems = mutable.Set.empty[Item] + itemsets.foreach { _.foreach { item => +uniqItems += item + }} + uniqItems.toIterator.map((_, 1L)) +}.reduceByKey(_ + _).filter { case (_, count) => +count >= minCount +}.sortBy(-_._2).map(_._1).collect() + } + + /** + * This methods cleans the input dataset from un-frequent items, and translate it's item + * to their corresponding Int identifier. + * + * @param data Sequences of itemsets. + * @param itemToInt A map allowing translation of frequent Items to their Int Identifier. + * The map should only contain frequent item. + * + * @return The internal repr of the inputted dataset. With properly placed zero delimiter. + */ + private[fpm] def toDatabaseInternalRepr[Item: ClassTag](data : RDD[Array[Array[Item]]], --- End diff -- Ok, the new version will fix that and the colon space. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17575: [SPARK-20265][MLlib] Improve Prefix'span pre-proc...
Github user Syrux commented on a diff in the pull request: https://github.com/apache/spark/pull/17575#discussion_r110661386 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala --- @@ -232,6 +200,68 @@ class PrefixSpan private ( object PrefixSpan extends Logging { /** + * This methods finds all frequent items in a input dataset. + * + * @param data Sequences of itemsets. + * @param minCount The minimal number of sequence an item should be present in to be frequent + * + * @return An array of Item containing only frequent items. + */ + private[fpm] def findFrequentItems[Item: ClassTag](data : RDD[Array[Array[Item]]], + minCount : Long): Array[Item] = { + +data.flatMap { itemsets => + val uniqItems = mutable.Set.empty[Item] + itemsets.foreach { _.foreach { item => +uniqItems += item + }} + uniqItems.toIterator.map((_, 1L)) +}.reduceByKey(_ + _).filter { case (_, count) => +count >= minCount +}.sortBy(-_._2).map(_._1).collect() + } + + /** + * This methods cleans the input dataset from un-frequent items, and translate it's item + * to their corresponding Int identifier. + * + * @param data Sequences of itemsets. + * @param itemToInt A map allowing translation of frequent Items to their Int Identifier. + * The map should only contain frequent item. + * + * @return The internal repr of the inputted dataset. With properly placed zero delimiter. 
+ */ + private[fpm] def toDatabaseInternalRepr[Item: ClassTag](data : RDD[Array[Array[Item]]], +itemToInt : Map[Item, Int]): + RDD[Array[Int]] = { + +data.flatMap { itemsets => + val allItems = mutable.ArrayBuilder.make[Int] + var containsFreqItems = false + allItems += 0 + itemsets.foreach { itemsets => +val items = mutable.ArrayBuilder.make[Int] +itemsets.foreach { item => + if (itemToInt.contains(item)) { +items += itemToInt(item) + 1 // using 1-indexing in internal format + } +} +val result = items.result() +if (result.nonEmpty) { + containsFreqItems = true + allItems ++= result.sorted + allItems += 0 +} + } + if (containsFreqItems) { --- End diff -- Yes, but `allItems` is an `ArrayBuilder`, so there is no size method. I could do `allItems.result().size`, but I think the performance might be worse than a flag.
[GitHub] spark issue #17582: [SPARK-20239][Core] Improve HistoryServer's ACL mechanis...
Github user tgravescs commented on the issue: https://github.com/apache/spark/pull/17582 Sorry but I'm confused by the explanation in the description. I didn't completely follow what problems you are seeing that aren't intended and I don't understand how you are proposing to fix them. Can you please describe the design you are proposing in more detail? In the description, can you please clarify each of your bullets? For instance: 1. if base URL's ACL (spark.acls.enable) is enabled but user A has no view permission. User "A" cannot see the app list but could still access details of it's own app. Are you saying user A is in the list of ACLs or not? If they have no view permission then they shouldn't be able to see the app. I don't understand what you mean by "could still access details of it's own app"? Is this user A's application (meaning they started it), so that they would automatically be in the ACL list? Clarifying the other bullets would be helpful as well.
[GitHub] spark issue #17330: [SPARK-19993][SQL] Caching logical plans containing subq...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17330 Merged build finished. Test PASSed.
[GitHub] spark issue #17330: [SPARK-19993][SQL] Caching logical plans containing subq...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17330 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75654/ Test PASSed.
[GitHub] spark issue #17330: [SPARK-19993][SQL] Caching logical plans containing subq...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17330 **[Test build #75654 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75654/testReport)** for PR 17330 at commit [`22db44a`](https://github.com/apache/spark/commit/22db44addbe1892b52353fa22bd9b65952e8bdf5). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #17556: [SPARK-16957][MLlib] Use weighted midpoints for split va...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17556 **[Test build #3655 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3655/testReport)** for PR 17556 at commit [`9ca5750`](https://github.com/apache/spark/commit/9ca57505c8211954478a2d54ced48c2561cfb9f9). * This patch **fails PySpark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #17591: [SPARK-20280][CORE] FileStatusCache Weigher integ...
Github user bogdanrdc commented on a diff in the pull request: https://github.com/apache/spark/pull/17591#discussion_r110646607 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileStatusCache.scala --- @@ -94,13 +94,25 @@ private class SharedInMemoryCache(maxSizeInBytes: Long) extends Logging { // Opaque object that uniquely identifies a shared cache user private type ClientId = Object + /* [[Weigher]].weigh returns Int so we could only cache objects < 2GB + * instead, the weight is divided by this factor (which is smaller + * than the size of one [[FileStatus]]). + * so it will support objects up to 64GB in size. + */ + private val weightScale = 32 + private val warnedAboutEviction = new AtomicBoolean(false) // we use a composite cache key in order to distinguish entries inserted by different clients private val cache: Cache[(ClientId, Path), Array[FileStatus]] = CacheBuilder.newBuilder() .weigher(new Weigher[(ClientId, Path), Array[FileStatus]] { override def weigh(key: (ClientId, Path), value: Array[FileStatus]): Int = { -(SizeEstimator.estimate(key) + SizeEstimator.estimate(value)).toInt +val estimate = (SizeEstimator.estimate(key) + SizeEstimator.estimate(value)) / weightScale +if (estimate > Int.MaxValue) { + throw new IllegalStateException( --- End diff -- yes, I guess it's better to fail later than sooner. I made it a warning instead. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
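To illustrate why the weight is scaled before being narrowed to `Int`, here is a small sketch of the arithmetic only, without Guava or `SizeEstimator`. The object name `WeightScaleSketch` and the method `scaledWeight` are hypothetical, and clamping to `Int.MaxValue` is an assumption about what "a warning instead" might look like, not a quote of the final patch.

```scala
// Hypothetical sketch of the scaling arithmetic discussed above.
object WeightScaleSketch {
  // Same factor as in the diff: weights are expressed in units of 32 bytes.
  val weightScale = 32

  def scaledWeight(estimatedBytes: Long): Int = {
    val scaled = estimatedBytes / weightScale
    if (scaled > Int.MaxValue) {
      // Assumption: report the problem and clamp rather than throw.
      Console.err.println(s"Estimated size $estimatedBytes bytes exceeds the weigher range; clamping.")
      Int.MaxValue
    } else {
      scaled.toInt
    }
  }

  def main(args: Array[String]): Unit = {
    val threeGiB = 3L * 1024 * 1024 * 1024
    println(threeGiB.toInt)           // a direct narrowing wraps to a negative value
    println(scaledWeight(threeGiB))   // the scaled weight still fits in an Int
    println(scaledWeight(64L * 1024 * 1024 * 1024)) // at 64 GiB the scaled value no longer fits and is clamped
  }
}
```

Dividing by 32 raises the largest representable entry size from roughly 2 GB to roughly 64 GB, which matches the figure given in the diff's comment.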
[GitHub] spark issue #17591: [SPARK-20280][CORE] FileStatusCache Weigher integer over...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17591 **[Test build #75663 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75663/testReport)** for PR 17591 at commit [`ef8e9e9`](https://github.com/apache/spark/commit/ef8e9e93f6883542847e2a136fb63a4565bf07b4).
[GitHub] spark issue #17592: [SPARK-20243][TESTS] DebugFilesystem.assertNoOpenStreams...
Github user hvanhovell commented on the issue: https://github.com/apache/spark/pull/17592 LGTM
[GitHub] spark issue #17592: [SPARK-20243][TESTS] DebugFilesystem.assertNoOpenStreams...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17592 **[Test build #75662 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75662/testReport)** for PR 17592 at commit [`d486d60`](https://github.com/apache/spark/commit/d486d6015ae0129ce41e6683eae37243c843ba59).
[GitHub] spark issue #17556: [SPARK-16957][MLlib] Use weighted midpoints for split va...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17556 **[Test build #3655 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3655/testReport)** for PR 17556 at commit [`9ca5750`](https://github.com/apache/spark/commit/9ca57505c8211954478a2d54ced48c2561cfb9f9).
[GitHub] spark issue #17593: [SPARK-20279][WEB-UI]In web ui,'Only showing 200' should...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/17593 I'm not sure this is accurate, given how the sort order can be changed. Maybe not, but it's at least ambiguous, and I don't think it adds enough to justify changing this.
[GitHub] spark pull request #17591: [SPARK-20280][CORE] FileStatusCache Weigher integ...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/17591#discussion_r110642316 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileStatusCache.scala --- @@ -94,13 +94,25 @@ private class SharedInMemoryCache(maxSizeInBytes: Long) extends Logging { // Opaque object that uniquely identifies a shared cache user private type ClientId = Object + /* [[Weigher]].weigh returns Int so we could only cache objects < 2GB + * instead, the weight is divided by this factor (which is smaller + * than the size of one [[FileStatus]]). + * so it will support objects up to 64GB in size. + */ + private val weightScale = 32 + private val warnedAboutEviction = new AtomicBoolean(false) // we use a composite cache key in order to distinguish entries inserted by different clients private val cache: Cache[(ClientId, Path), Array[FileStatus]] = CacheBuilder.newBuilder() .weigher(new Weigher[(ClientId, Path), Array[FileStatus]] { override def weigh(key: (ClientId, Path), value: Array[FileStatus]): Int = { -(SizeEstimator.estimate(key) + SizeEstimator.estimate(value)).toInt +val estimate = (SizeEstimator.estimate(key) + SizeEstimator.estimate(value)) / weightScale +if (estimate > Int.MaxValue) { + throw new IllegalStateException( --- End diff -- Agreed, though this can only happen if the filesourcePartitionFileCacheSize is at least 64GB and some object is at least 64GB. The effect is to possibly cache things longer than they should be, which seems better than failing.
[GitHub] spark pull request #17591: [SPARK-20280][CORE] FileStatusCache Weigher integ...
Github user bogdanrdc commented on a diff in the pull request: https://github.com/apache/spark/pull/17591#discussion_r110639886 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileStatusCache.scala --- @@ -94,13 +94,25 @@ private class SharedInMemoryCache(maxSizeInBytes: Long) extends Logging { // Opaque object that uniquely identifies a shared cache user private type ClientId = Object + /* [[Weigher]].weigh returns Int so we could only cache objects < 2GB + * instead, the weight is divided by this factor (which is smaller + * than the size of one [[FileStatus]]). + * so it will support objects up to 64GB in size. + */ + private val weightScale = 32 + private val warnedAboutEviction = new AtomicBoolean(false) // we use a composite cache key in order to distinguish entries inserted by different clients private val cache: Cache[(ClientId, Path), Array[FileStatus]] = CacheBuilder.newBuilder() .weigher(new Weigher[(ClientId, Path), Array[FileStatus]] { override def weigh(key: (ClientId, Path), value: Array[FileStatus]): Int = { -(SizeEstimator.estimate(key) + SizeEstimator.estimate(value)).toInt +val estimate = (SizeEstimator.estimate(key) + SizeEstimator.estimate(value)) / weightScale +if (estimate > Int.MaxValue) { + throw new IllegalStateException( --- End diff -- If it's capped, the cache might use more memory than configured with spark.sql.hive.filesourcePartitionFileCacheSize.
[GitHub] spark pull request #17592: [SPARK-20243][TESTS] DebugFilesystem.assertNoOpen...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/17592#discussion_r110639779 --- Diff: core/src/test/scala/org/apache/spark/DebugFilesystem.scala --- @@ -31,21 +30,29 @@ import org.apache.spark.internal.Logging object DebugFilesystem extends Logging { // Stores the set of active streams and their creation sites. - private val openStreams = new ConcurrentHashMap[FSDataInputStream, Throwable]() + private val openStreams = mutable.Map.empty[FSDataInputStream, Throwable] - def clearOpenStreams(): Unit = { + def addOpenStream(stream: FSDataInputStream): Unit = synchronized { --- End diff -- It's a little safer and tidier to synchronize on `openStreams` rather than the containing `object`. Looks good to me though.
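The locking suggestion in this review comment can be sketched outside of Spark. The following is a rough Python analogue, not the Scala code under review: the class `OpenStreamRegistry` and all of its method names are hypothetical, and a private `threading.Lock` stands in for synchronizing on the `openStreams` map itself. The point it illustrates is that the lock is owned by the registry and not exposed, so unrelated code cannot contend on it or accidentally lock the enclosing object.

```python
import threading


class OpenStreamRegistry:
    """Hypothetical stand-in for a registry of open streams and their creation sites."""

    def __init__(self):
        self._open_streams = {}
        # Private lock: only this class can acquire it, unlike a lock on the
        # enclosing object, which any other code could also synchronize on.
        self._lock = threading.Lock()

    def add_open_stream(self, stream, creation_site):
        with self._lock:
            self._open_streams[stream] = creation_site

    def remove_open_stream(self, stream):
        with self._lock:
            self._open_streams.pop(stream, None)

    def clear_open_streams(self):
        with self._lock:
            self._open_streams.clear()

    def assert_no_open_streams(self):
        with self._lock:
            # Take a snapshot under the lock so the check sees a consistent view.
            leaked = dict(self._open_streams)
        if leaked:
            raise RuntimeError(
                "There are %d possibly leaked file streams." % len(leaked))


def open_and_close(registry, count):
    """Register and unregister `count` dummy streams from the calling thread."""
    for _ in range(count):
        stream = object()
        registry.add_open_stream(stream, "created in worker")
        registry.remove_open_stream(stream)
```

Every read and write of the map goes through the same lock, so a check that runs while other threads are still opening and closing streams never observes a half-updated map.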
[GitHub] spark pull request #17591: [SPARK-20280][CORE] FileStatusCache Weigher integ...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/17591#discussion_r110639538 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileStatusCache.scala --- @@ -94,13 +94,25 @@ private class SharedInMemoryCache(maxSizeInBytes: Long) extends Logging { // Opaque object that uniquely identifies a shared cache user private type ClientId = Object + /* [[Weigher]].weigh returns Int so we could only cache objects < 2GB + * instead, the weight is divided by this factor (which is smaller + * than the size of one [[FileStatus]]). + * so it will support objects up to 64GB in size. + */ + private val weightScale = 32 --- End diff -- Rather than make it a member variable, just a local variable in the initializer for cache?
[GitHub] spark issue #17593: [SPARK-20279][WEB-UI]In web ui,'Only showing 200' should...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17593 Can one of the admins verify this patch?
[GitHub] spark pull request #17591: [SPARK-20280][CORE] FileStatusCache Weigher integ...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/17591#discussion_r110638961 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileStatusCache.scala --- @@ -94,13 +94,25 @@ private class SharedInMemoryCache(maxSizeInBytes: Long) extends Logging { // Opaque object that uniquely identifies a shared cache user private type ClientId = Object + /* [[Weigher]].weigh returns Int so we could only cache objects < 2GB + * instead, the weight is divided by this factor (which is smaller + * than the size of one [[FileStatus]]). + * so it will support objects up to 64GB in size. + */ + private val weightScale = 32 + private val warnedAboutEviction = new AtomicBoolean(false) // we use a composite cache key in order to distinguish entries inserted by different clients private val cache: Cache[(ClientId, Path), Array[FileStatus]] = CacheBuilder.newBuilder() .weigher(new Weigher[(ClientId, Path), Array[FileStatus]] { override def weigh(key: (ClientId, Path), value: Array[FileStatus]): Int = { -(SizeEstimator.estimate(key) + SizeEstimator.estimate(value)).toInt +val estimate = (SizeEstimator.estimate(key) + SizeEstimator.estimate(value)) / weightScale +if (estimate > Int.MaxValue) { + throw new IllegalStateException( --- End diff -- This shouldn't be an error. It's just a weight. Capping at `Int.MaxValue` is no problem.
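As a rough illustration of the capping suggested in this comment, here is a minimal sketch in plain Python, not the Scala under review. The names `capped_weight`, `INT_MAX`, and `WEIGHT_SCALE` are hypothetical; only the factor 32 and the `Int.MaxValue` limit come from the diff.

```python
# Illustrative sketch only; not the Spark implementation.
INT_MAX = 2**31 - 1   # value of Int.MaxValue on the JVM
WEIGHT_SCALE = 32     # scale factor taken from the diff


def capped_weight(estimated_bytes, scale=WEIGHT_SCALE):
    """Scale the size estimate down, then clamp it instead of raising an error."""
    scaled = estimated_bytes // scale
    return min(scaled, INT_MAX)


gib = 1024**3
# A 1 GiB entry is well below the cap, so its weight is just the scaled size.
print(capped_weight(1 * gib) == gib // 32)    # True
# A 100 GiB entry exceeds 64 GiB after scaling and is clamped to the cap.
print(capped_weight(100 * gib) == INT_MAX)    # True
```

The trade-off raised elsewhere in this thread still applies: a clamped entry is recorded as lighter than it really is, so the cache may hold more memory than its configured budget suggests.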
[GitHub] spark pull request #17593: [SPARK-20279][WEB-UI]In web ui,'Only showing 200'...
GitHub user guoxiaolongzte opened a pull request: https://github.com/apache/spark/pull/17593 [SPARK-20279][WEB-UI]In web ui,'Only showing 200' should be changed to 'only showing last 200'. ## What changes were proposed in this pull request? In the web UI, 'Only showing 200' should be changed to 'only showing last 200' on the 'jobs' and 'stages' pages. Purpose: adding 'last' makes it clear to users whether the jobs or stages currently shown are the latest 200 or the first 200. ## How was this patch tested? Unit tests, manual tests. Please review http://spark.apache.org/contributing.html before opening a pull request. You can merge this pull request into a Git repository by running: $ git pull https://github.com/guoxiaolongzte/spark SPARK-20279 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/17593.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #17593 commit d383efba12c66addb17006dea107bb0421d50bc3 Author: 郭小龙 10207633 Date: 2017-03-31T13:57:09Z [SPARK-20177]Document about compression way has some little detail changes. commit 3059013e9d2aec76def14eb314b6761bea0e7ca0 Author: 郭小龙 10207633 Date: 2017-04-01T01:38:02Z [SPARK-20177] event log add a space commit 555cef88fe09134ac98fd0ad056121c7df2539aa Author: guoxiaolongzte Date: 2017-04-02T00:16:08Z '/applications/[app-id]/jobs' in rest api,status should be [running|succeeded|failed|unknown] commit 46bb1ad3ddd9fb55b5607ac4f20213a90186cfe9 Author: 郭小龙 10207633 Date: 2017-04-05T03:16:50Z Merge branch 'master' of https://github.com/apache/spark into SPARK-20177 commit 0efb0dd9e404229cce638fe3fb0c966276784df7 Author: 郭小龙 10207633 Date: 2017-04-05T03:47:53Z [SPARK-20218]'/applications/[app-id]/stages' in REST API,add description. 
commit 0e37fdeee28e31fc97436dabd001d3c85c5a7794 Author: 郭小龙 10207633 Date: 2017-04-05T05:22:54Z [SPARK-20218] '/applications/[app-id]/stages/[stage-id]' in REST API,remove redundant description. commit 52641bb01e55b48bd9e8579fea217439d14c7dc7 Author: 郭小龙 10207633 Date: 2017-04-07T06:24:58Z Merge branch 'SPARK-20218' commit d3977c9cab0722d279e3fae7aacbd4eb944c22f6 Author: 郭小龙 10207633 Date: 2017-04-08T07:13:02Z Merge branch 'master' of https://github.com/apache/spark commit 137b90e5a85cde7e9b904b3e5ea0bb52518c4716 Author: 郭小龙 10207633 Date: 2017-04-10T05:13:40Z Merge branch 'master' of https://github.com/apache/spark commit 0fe5865b8022aeacdb2d194699b990d8467f7a0a Author: 郭小龙 10207633 Date: 2017-04-10T10:25:22Z Merge branch 'SPARK-20190' of https://github.com/guoxiaolongzte/spark commit cf6f42ac84466960f2232c025b8faeb5d7378fe1 Author: 郭小龙 10207633 Date: 2017-04-10T10:26:27Z Merge branch 'master' of https://github.com/apache/spark commit 83c8f4f270f9d4a0f16b6be4915a48537b79d2db Author: 郭小龙 10207633 Date: 2017-04-10T12:02:46Z [SPARK-20279]In web ui,'Only showing 200' should be changed to 'only showing last 200' in the page of 'jobs' or'stages'.
[GitHub] spark issue #17592: [SPARK-20243][TESTS] DebugFilesystem.assertNoOpenStreams...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17592 **[Test build #75661 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75661/testReport)** for PR 17592 at commit [`19ca5a1`](https://github.com/apache/spark/commit/19ca5a1114dbe666ac40f4c966f3e600b58f92c2).
[GitHub] spark pull request #17592: [SPARK-20243][TESTS] DebugFilesystem.assertNoOpen...
GitHub user bogdanrdc opened a pull request: https://github.com/apache/spark/pull/17592 [SPARK-20243][TESTS] DebugFilesystem.assertNoOpenStreams thread race ## What changes were proposed in this pull request? Synchronize access to openStreams map. ## How was this patch tested? Existing tests. You can merge this pull request into a Git repository by running: $ git pull https://github.com/bogdanrdc/spark SPARK-20243 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/17592.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #17592 commit 19ca5a1114dbe666ac40f4c966f3e600b58f92c2 Author: Bogdan Raducanu Date: 2017-04-10T12:08:55Z fix by synchronizing access to the stream map
[GitHub] spark issue #17591: [SPARK-20280][CORE] FileStatusCache Weigher integer over...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17591 **[Test build #75660 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75660/testReport)** for PR 17591 at commit [`f4ae52a`](https://github.com/apache/spark/commit/f4ae52a0c60b9d2a086e4f1ad0bb3675bebc47da).
[GitHub] spark pull request #17591: [SPARK-20280][CORE] FileStatusCache Weigher integ...
GitHub user bogdanrdc opened a pull request: https://github.com/apache/spark/pull/17591 [SPARK-20280][CORE] FileStatusCache Weigher integer overflow ## What changes were proposed in this pull request? Weigher.weigh needs to return Int, but it is possible for an Array[FileStatus] to have an estimated size > Int.MaxValue. To avoid this, the size is scaled down by a factor of 32. The maximumWeight of the cache is also scaled down by the same factor. ## How was this patch tested? New test in FileIndexSuite You can merge this pull request into a Git repository by running: $ git pull https://github.com/bogdanrdc/spark SPARK-20280 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/17591.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #17591 commit f4ae52a0c60b9d2a086e4f1ad0bb3675bebc47da Author: Bogdan Raducanu Date: 2017-04-10T11:52:41Z fix + test
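The arithmetic behind the scaling in this pull request can be checked with a short sketch. It is plain Python, not the Scala in the patch, and the names `scaled_weight`, `scaled_max_weight`, `INT_MAX`, and `WEIGHT_SCALE` are hypothetical; only the factor 32 and the Int limit are taken from the description.

```python
# Illustrative sketch only; not the Spark implementation.
INT_MAX = 2**31 - 1   # largest value a JVM Int can hold
WEIGHT_SCALE = 32     # scale factor named in the pull request


def scaled_weight(estimated_bytes):
    """Scale an entry's estimated size down so it is more likely to fit in an Int."""
    return estimated_bytes // WEIGHT_SCALE


def scaled_max_weight(max_size_in_bytes):
    """Scale the cache's byte budget by the same factor so both stay comparable."""
    return max_size_in_bytes // WEIGHT_SCALE


gib = 1024**3
# A 3 GiB entry does not fit in an Int as a raw byte count...
print(3 * gib > INT_MAX)                        # True
# ...but its scaled weight does.
print(scaled_weight(3 * gib) <= INT_MAX)        # True
# The first size whose scaled weight overflows an Int is 32 * 2 GiB = 64 GiB.
print((INT_MAX + 1) * WEIGHT_SCALE == 64 * gib)  # True
```

Because the entry weights and the maximum weight are divided by the same factor, eviction decisions are based on the same ratio as before, with a granularity of 32 bytes.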
[GitHub] spark issue #17586: [SPARK-20249][ML][PYSPARK] Add summary for LinearSVCMode...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17586 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75655/ Test PASSed.
[GitHub] spark issue #17586: [SPARK-20249][ML][PYSPARK] Add summary for LinearSVCMode...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17586 Merged build finished. Test PASSed.
[GitHub] spark issue #17586: [SPARK-20249][ML][PYSPARK] Add summary for LinearSVCMode...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17586 **[Test build #75655 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75655/testReport)** for PR 17586 at commit [`0daa041`](https://github.com/apache/spark/commit/0daa0410af0b7a4968e852af7fbba6ebb2cb9064). * This patch passes all tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `case class LinearSVCTrainingSummary(` * `class LinearSVCTrainingSummary(JavaWrapper):`
[GitHub] spark pull request #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/17077#discussion_r110634385 --- Diff: python/pyspark/sql/tests.py --- @@ -2167,6 +2167,61 @@ def test_BinaryType_serialization(self): df = self.spark.createDataFrame(data, schema=schema) df.collect() +def test_bucketed_write(self): +data = [ +(1, "foo", 3.0), (2, "foo", 5.0), +(3, "bar", -1.0), (4, "bar", 6.0), +] +df = self.spark.createDataFrame(data, ["x", "y", "z"]) + +# Test write with one bucketing column +df.write.bucketBy(3, "x").mode("overwrite").saveAsTable("pyspark_bucket") +self.assertEqual( +len([c for c in self.spark.catalog.listColumns("pyspark_bucket") + if c.name == "x" and c.isBucket]), +1 +) +self.assertSetEqual(set(data), set(self.spark.table("pyspark_bucket").collect())) + +# Test write two bucketing columns +df.write.bucketBy(3, "x", "y").mode("overwrite").saveAsTable("pyspark_bucket") +self.assertEqual( +len([c for c in self.spark.catalog.listColumns("pyspark_bucket") + if c.name in ("x", "y") and c.isBucket]), +2 +) +self.assertSetEqual(set(data), set(self.spark.table("pyspark_bucket").collect())) + +# Test write with bucket and sort +df.write.bucketBy(2, "x").sortBy("z").mode("overwrite").saveAsTable("pyspark_bucket") +self.assertEqual( +len([c for c in self.spark.catalog.listColumns("pyspark_bucket") + if c.name == "x" and c.isBucket]), +1 +) +self.assertSetEqual(set(data), set(self.spark.table("pyspark_bucket").collect())) + +# Test write with a list of columns +df.write.bucketBy(3, ["x", "y"]).mode("overwrite").saveAsTable("pyspark_bucket") +self.assertEqual( +len([c for c in self.spark.catalog.listColumns("pyspark_bucket") + if c.name in ("x", "y") and c.isBucket]), +2 +) +self.assertSetEqual(set(data), set(self.spark.table("pyspark_bucket").collect())) + +# Test write with bucket and sort with a list of columns +(df.write.bucketBy(2, "x") +.sortBy(["y", "z"]) +.mode("overwrite").saveAsTable("pyspark_bucket")) 
+self.assertSetEqual(set(data), set(self.spark.table("pyspark_bucket").collect())) + +# Test write with bucket and sort with multiple columns +(df.write.bucketBy(2, "x") +.sortBy("y", "z") +.mode("overwrite").saveAsTable("pyspark_bucket")) --- End diff -- @zero323, should we drop the table before or after this test?
[GitHub] spark pull request #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/17077#discussion_r110632132 --- Diff: python/pyspark/sql/tests.py --- @@ -2167,6 +2167,61 @@ def test_BinaryType_serialization(self): df = self.spark.createDataFrame(data, schema=schema) df.collect() +def test_bucketed_write(self): +data = [ +(1, "foo", 3.0), (2, "foo", 5.0), +(3, "bar", -1.0), (4, "bar", 6.0), +] +df = self.spark.createDataFrame(data, ["x", "y", "z"]) + +# Test write with one bucketing column +df.write.bucketBy(3, "x").mode("overwrite").saveAsTable("pyspark_bucket") +self.assertEqual( +len([c for c in self.spark.catalog.listColumns("pyspark_bucket") + if c.name == "x" and c.isBucket]), +1 +) +self.assertSetEqual(set(data), set(self.spark.table("pyspark_bucket").collect())) + +# Test write two bucketing columns +df.write.bucketBy(3, "x", "y").mode("overwrite").saveAsTable("pyspark_bucket") +self.assertEqual( +len([c for c in self.spark.catalog.listColumns("pyspark_bucket") + if c.name in ("x", "y") and c.isBucket]), --- End diff -- @zero323, I am sorry. What do you think about something like this one below?: ```python cols = self.spark.catalog.listColumns("pyspark_bucket") num = len([c for c in cols if c.name in ("x", "y") and c.isBucket]) self.assertEqual(num, 2) ```
[GitHub] spark pull request #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/17077#discussion_r110630404 --- Diff: python/pyspark/sql/readwriter.py --- @@ -545,6 +545,57 @@ def partitionBy(self, *cols): self._jwrite = self._jwrite.partitionBy(_to_seq(self._spark._sc, cols)) return self +@since(2.2) +def bucketBy(self, numBuckets, *cols): +"""Buckets the output by the given columns on the file system. --- End diff -- I think just copying it from the Scala doc is good enough; it avoids the overhead of sweeping the documentation when we start to support other operations later.
[GitHub] spark issue #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for bucket...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17077 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75659/ Test FAILed.
[GitHub] spark issue #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for bucket...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17077 **[Test build #75659 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75659/testReport)** for PR 17077 at commit [`845ee87`](https://github.com/apache/spark/commit/845ee8783c54123c743e176def11af7455192d42). * This patch **fails Python style tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for bucket...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17077 Merged build finished. Test FAILed.
[GitHub] spark issue #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for bucket...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17077 **[Test build #75659 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75659/testReport)** for PR 17077 at commit [`845ee87`](https://github.com/apache/spark/commit/845ee8783c54123c743e176def11af7455192d42).
[GitHub] spark issue #17556: [SPARK-16957][MLlib] Use weighted midpoints for split va...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17556 **[Test build #3654 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3654/testReport)** for PR 17556 at commit [`9ca5750`](https://github.com/apache/spark/commit/9ca57505c8211954478a2d54ced48c2561cfb9f9). * This patch **fails PySpark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #17590: [SPARK-20278][R] Disable 'multiple_dots_linter' lint rul...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17590 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75656/ Test PASSed.
[GitHub] spark issue #17590: [SPARK-20278][R] Disable 'multiple_dots_linter' lint rul...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17590 Merged build finished. Test PASSed.
[GitHub] spark issue #17590: [SPARK-20278][R] Disable 'multiple_dots_linter' lint rul...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17590 **[Test build #75656 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75656/testReport)** for PR 17590 at commit [`ed0dd86`](https://github.com/apache/spark/commit/ed0dd869a24bc10cfce7dacc8b4b6d57c3ced6de). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #17580: [20269][Structured Streaming] add class 'JavaWordCountPr...
Github user guoxiaolongzte commented on the issue: https://github.com/apache/spark/pull/17580

@jerryshao
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;

The above three classes are in kafka-clients-0.10.0.1.jar. The example here will not become obsolete.
[GitHub] spark issue #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for bucket...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17077 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75658/ Test FAILed.
[GitHub] spark issue #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for bucket...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17077 **[Test build #75658 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75658/testReport)** for PR 17077 at commit [`7b93482`](https://github.com/apache/spark/commit/7b93482f31f2efb3d4d742eb3e385e6b4a2bc14e). * This patch **fails Python style tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for bucket...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17077 Merged build finished. Test FAILed.
[GitHub] spark issue #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for bucket...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17077 **[Test build #75658 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75658/testReport)** for PR 17077 at commit [`7b93482`](https://github.com/apache/spark/commit/7b93482f31f2efb3d4d742eb3e385e6b4a2bc14e).
[GitHub] spark issue #17556: [SPARK-16957][MLlib] Use weighted midpoints for split va...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17556 **[Test build #3654 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3654/testReport)** for PR 17556 at commit [`9ca5750`](https://github.com/apache/spark/commit/9ca57505c8211954478a2d54ced48c2561cfb9f9).
[GitHub] spark issue #17586: [SPARK-20249][ML][PYSPARK] Add summary for LinearSVCMode...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17586 **[Test build #75655 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75655/testReport)** for PR 17586 at commit [`0daa041`](https://github.com/apache/spark/commit/0daa0410af0b7a4968e852af7fbba6ebb2cb9064).
[GitHub] spark issue #17590: [SPARK-20278][R] Disable 'multiple_dots_linter' lint rul...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17590 **[Test build #75656 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75656/testReport)** for PR 17590 at commit [`ed0dd86`](https://github.com/apache/spark/commit/ed0dd869a24bc10cfce7dacc8b4b6d57c3ced6de).
[GitHub] spark issue #17527: [SPARK-20156][CORE][SQL][STREAMING][MLLIB] Java String t...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17527 **[Test build #75657 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75657/testReport)** for PR 17527 at commit [`2ac5843`](https://github.com/apache/spark/commit/2ac5843a071847dbe6e8cea08b49a7ef36587101).
[GitHub] spark issue #17330: [SPARK-19993][SQL] Caching logical plans containing subq...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17330 **[Test build #75654 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75654/testReport)** for PR 17330 at commit [`22db44a`](https://github.com/apache/spark/commit/22db44addbe1892b52353fa22bd9b65952e8bdf5).
[GitHub] spark issue #17588: [SPARK-20275][UI] Do not display "Completed" column for ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17588 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75651/ Test PASSed.
[GitHub] spark issue #17588: [SPARK-20275][UI] Do not display "Completed" column for ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17588 Merged build finished. Test PASSed.
[GitHub] spark issue #17588: [SPARK-20275][UI] Do not display "Completed" column for ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17588 **[Test build #75651 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75651/testReport)** for PR 17588 at commit [`6828315`](https://github.com/apache/spark/commit/682831509e5a80555842d400af5c4a909c735414). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/17077#discussion_r110626138

--- Diff: python/pyspark/sql/tests.py ---
@@ -2038,6 +2038,61 @@ def test_BinaryType_serialization(self):
         df = self.spark.createDataFrame(data, schema=schema)
         df.collect()
+    def test_bucketed_write(self):
+        data = [
+            (1, "foo", 3.0), (2, "foo", 5.0),
+            (3, "bar", -1.0), (4, "bar", 6.0),
+        ]
+        df = self.spark.createDataFrame(data, ["x", "y", "z"])
+
+        # Test write with one bucketing column
+        df.write.bucketBy(3, "x").mode("overwrite").saveAsTable("pyspark_bucket")
+        self.assertEqual(
+            len([c for c in self.spark.catalog.listColumns("pyspark_bucket")
+                 if c.name == "x" and c.isBucket]),
--- End diff --

Thanks for taking a look at the related ones and trying it out.
[GitHub] spark pull request #17308: [SPARK-19968][SS] Use a cached instance of `Kafka...
Github user ScrapCodes commented on a diff in the pull request: https://github.com/apache/spark/pull/17308#discussion_r110625538

--- Diff: external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/CachedKafkaProducer.scala ---
@@ -0,0 +1,70 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.kafka010
+
+import java.{util => ju}
+
+import scala.collection.mutable
+
+import org.apache.kafka.clients.producer.KafkaProducer
+
+import org.apache.spark.internal.Logging
+
+private[kafka010] object CachedKafkaProducer extends Logging {
+
+  private val cacheMap = new mutable.HashMap[Int, KafkaProducer[Array[Byte], Array[Byte]]]()
+
+  private def createKafkaProducer(
+      producerConfiguration: ju.HashMap[String, Object]): KafkaProducer[Array[Byte], Array[Byte]] = {
+    val kafkaProducer: KafkaProducer[Array[Byte], Array[Byte]] =
+      new KafkaProducer[Array[Byte], Array[Byte]](producerConfiguration)
+    cacheMap.put(producerConfiguration.hashCode(), kafkaProducer)
--- End diff --

It is not a good idea to do it like that. Please correct me if my understanding is wrong: as far as I can tell, in this particular case Spark does not let the user specify a key or value serializer/deserializer, so each `Object` value can only be a String, an int, or a Long, and for these hashCode would work correctly. I am also contemplating a better way to do it now.
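The concern about keying the cache by `hashCode()` can be illustrated outside of Spark. What follows is a minimal Python sketch, not the PR's Scala code: `ProducerStub`, `config_key`, `get_by_hash`, and `get_by_config` are hypothetical names, the `retries` setting is only a placeholder, and the collision relies on the CPython detail that `hash(-1) == hash(-2)`. It shows that a cache keyed by a hash value alone can return an object built for a different configuration, while a cache keyed by the configuration itself falls back to an equality check.

```python
class ProducerStub:
    """Hypothetical stand-in for a producer; it only remembers its configuration."""

    def __init__(self, config):
        self.config = dict(config)


def config_key(config):
    """Return a hashable, order-independent view of a configuration dict."""
    return frozenset(config.items())


def get_by_hash(cache, config):
    """Cache keyed by the hash value only: two configs with equal hashes share an entry."""
    key = hash(config_key(config))
    if key not in cache:
        cache[key] = ProducerStub(config)
    return cache[key]


def get_by_config(cache, config):
    """Cache keyed by the configuration itself: the dict lookup also compares for equality."""
    key = config_key(config)
    if key not in cache:
        cache[key] = ProducerStub(config)
    return cache[key]


if __name__ == "__main__":
    by_hash, by_config = {}, {}
    first, second = {"retries": -1}, {"retries": -2}  # equal hashes on CPython
    print(get_by_hash(by_hash, first) is get_by_hash(by_hash, second))
    print(get_by_config(by_config, first) is get_by_config(by_config, second))
```

Under the CPython assumption above, the hash-keyed cache hands back the stub created for the first configuration when asked for the second, which is the failure mode raised in the review. Whether such collisions are reachable with the String, int, and Long values mentioned in the comment is a separate question that this sketch does not answer.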
[GitHub] spark pull request #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/17077#discussion_r110624985

--- Diff: python/pyspark/sql/tests.py ---
@@ -2038,6 +2038,61 @@ def test_BinaryType_serialization(self):
         df = self.spark.createDataFrame(data, schema=schema)
         df.collect()
+    def test_bucketed_write(self):
+        data = [
+            (1, "foo", 3.0), (2, "foo", 5.0),
+            (3, "bar", -1.0), (4, "bar", 6.0),
+        ]
+        df = self.spark.createDataFrame(data, ["x", "y", "z"])
+
+        # Test write with one bucketing column
+        df.write.bucketBy(3, "x").mode("overwrite").saveAsTable("pyspark_bucket")
+        self.assertEqual(
+            len([c for c in self.spark.catalog.listColumns("pyspark_bucket")
+                 if c.name == "x" and c.isBucket]),
--- End diff --

Yup, I won't argue from personal preference. I am fine with it. I don't feel strongly about either.
[GitHub] spark issue #17580: [20269][Structured Streaming] add class 'JavaWordCountPr...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17580 Can one of the admins verify this patch?
[GitHub] spark pull request #17580: [20269][Structured Streaming] add class 'JavaWord...
GitHub user guoxiaolongzte reopened a pull request: https://github.com/apache/spark/pull/17580

[20269][Structured Streaming] add class 'JavaWordCountProducer' to 'provide java word count producer'.

## What changes were proposed in this pull request?

1. The Kafka streaming examples currently lack a Java word count producer, which makes it harder for Java developers to learn and test. This PR adds a class JavaKafkaWordCountProducer.
2. When running the JavaKafkaWordCount example, I found no Java word count producer, whereas the KafkaWordCount example does have a Scala word count producer. I think we should provide the corresponding example code to help Java developers learn and test.
3. My project team develops Spark applications mostly in Java, using the Java API.
4. Test cases:

1) Java Kafka word count producer, which generates data for the JavaKafkaWordCount example (the JavaKafkaWordCountProducer class is the one added by this PR):

spark-submit ... --class org.apache.spark.examples.streaming.JavaKafkaWordCountProducer --jars /home/gxl/spark/libext/kafka-clients-0.8.2.1.jar /home/gxl/spark/examples/jars/spark-examples_2.11-2.1.0.jar localhost:9092 topic1 3 4

2) Java Kafka streaming example:

spark-submit ... --class org.apache.spark.examples.streaming.JavaKafkaWordCount --jars /home/gxl/kafka/kafka_2.11-0.10.2.0/libs/kafka-clients-0.10.2.0.jar --jars /home/gxl/spark/libext/kafka_2.11-0.8.2.1.jar /home/gxl/spark/examples/jars/spark-examples_2.11-2.1.0.jar localhost:2181 topic1 topic1 2

## How was this patch tested?

Manual tests.

Please review http://spark.apache.org/contributing.html before opening a pull request.
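To make the producer test case more concrete, here is a minimal Python sketch of the message-generation step only, with no Kafka dependency. It assumes that the four trailing arguments of the first command (`localhost:9092 topic1 3 4`) mean broker list, topic, messages per second, and words per message, as in the existing Scala `KafkaWordCountProducer` example. The proposed Java class is not shown in this thread, so its actual behaviour is an assumption, and `make_message` and `make_batch` are hypothetical names.

```python
import random


def make_message(words_per_message, rng):
    """Build one message of space-separated random single-digit 'words'."""
    return " ".join(str(rng.randrange(10)) for _ in range(words_per_message))


def make_batch(messages_per_sec, words_per_message, rng=None):
    """Build the list of messages a producer would send during one second."""
    rng = rng or random.Random()
    return [make_message(words_per_message, rng) for _ in range(messages_per_sec)]


if __name__ == "__main__":
    # Mirrors the trailing "3 4" arguments: 3 messages per second, 4 words each.
    for message in make_batch(3, 4):
        print(message)
```

A real producer would send each generated string to the topic and then sleep until the next second; that part depends on the Kafka client in use and is left out here.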
You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/guoxiaolongzte/spark SPARK-20269

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17580.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #17580

commit d383efba12c66addb17006dea107bb0421d50bc3
Author: 郭小龙 10207633
Date: 2017-03-31T13:57:09Z

    [SPARK-20177]Document about compression way has some little detail changes.

commit 3059013e9d2aec76def14eb314b6761bea0e7ca0
Author: 郭小龙 10207633
Date: 2017-04-01T01:38:02Z

    [SPARK-20177] event log add a space

commit 46bb1ad3ddd9fb55b5607ac4f20213a90186cfe9
Author: 郭小龙 10207633
Date: 2017-04-05T03:16:50Z

    Merge branch 'master' of https://github.com/apache/spark into SPARK-20177

commit 0efb0dd9e404229cce638fe3fb0c966276784df7
Author: 郭小龙 10207633
Date: 2017-04-05T03:47:53Z

    [SPARK-20218]'/applications/[app-id]/stages' in REST API,add description.

commit 0e37fdeee28e31fc97436dabd001d3c85c5a7794
Author: 郭小龙 10207633
Date: 2017-04-05T05:22:54Z

    [SPARK-20218] '/applications/[app-id]/stages/[stage-id]' in REST API,remove redundant description.

commit 52641bb01e55b48bd9e8579fea217439d14c7dc7
Author: 郭小龙 10207633
Date: 2017-04-07T06:24:58Z

    Merge branch 'SPARK-20218'

commit d3977c9cab0722d279e3fae7aacbd4eb944c22f6
Author: 郭小龙 10207633
Date: 2017-04-08T07:13:02Z

    Merge branch 'master' of https://github.com/apache/spark

commit 1279041f17f1f5d02c9aada5fb5af9c7d58f2423
Author: asmith26
Date: 2017-04-09T06:47:23Z

    [MINOR] Issue: Change "slice" vs "partition" in exception messages (and code?)

    ## What changes were proposed in this pull request?

    Came across the term "slice" when running some spark scala code. Consequently, a Google search indicated that "slices" and "partitions" refer to the same things; indeed see:

    - [This issue](https://issues.apache.org/jira/browse/SPARK-1701)
    - [This pull request](https://github.com/apache/spark/pull/2305)
    - [This StackOverflow answer](http://stackoverflow.com/questions/23436640/what-is-the-difference-between-an-rdd-partition-and-a-slice) and [this one](http://stackoverflow.com/questions/24269495/what-are-the-differences-between-slices-and-partitions-of-rdds)

    Thus this pull request fixes the occurrence of slice I came across. Nonetheless, [it would appear](https://github.com/apache/spark/search?utf8=%E2%9C%93=slice=) there are still many references to "slice/slices" - thus I thought I'd raise this Pull Request to address the issue (sorry if this is the wrong place, I'm not too familiar with raising Apache issues).

    ## How was this patch tested?

    (Not tested locally - only a minor exception message change.)

    Please review http://spark.apache.org/contributing.html before opening
[GitHub] spark issue #17589: [SPARK-16544][SQL] Support for conversion from numeric c...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17589 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75653/ Test PASSed.
[GitHub] spark issue #17589: [SPARK-16544][SQL] Support for conversion from numeric c...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17589 Merged build finished. Test PASSed.