[GitHub] spark pull request #17330: [SPARK-19993][SQL] Caching logical plans containi...

2017-04-10 Thread dilipbiswal
Github user dilipbiswal commented on a diff in the pull request:

https://github.com/apache/spark/pull/17330#discussion_r110693397
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/subquery.scala
 ---
@@ -59,6 +58,13 @@ abstract class SubqueryExpression(
 children.zip(p.children).forall(p => p._1.semanticEquals(p._2))
 case _ => false
   }
+  def canonicalize(attrs: AttributeSeq): SubqueryExpression = {
+// Normalize the outer references in the subquery plan.
+val subPlan = plan.transformAllExpressions {
+  case OuterReference(r) => QueryPlan.normalizeExprId(r, attrs)
--- End diff --

@cloud-fan Actually you're right. Preserving the `OuterReference` would be 
good.
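
For reference, a minimal sketch of what preserving the wrapper might look like (names follow the quoted diff; the final shape of the method is still up to the patch):

```scala
def canonicalize(attrs: AttributeSeq): SubqueryExpression = {
  // Normalize the outer references in the subquery plan, but keep the
  // OuterReference wrapper so the reference is still recognizable afterwards.
  val canonicalizedPlan = plan.transformAllExpressions {
    case OuterReference(r) => OuterReference(QueryPlan.normalizeExprId(r, attrs))
  }
  // The real method presumably also normalizes this expression's own exprId.
  withNewPlan(canonicalizedPlan)
}
```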





[GitHub] spark pull request #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for...

2017-04-10 Thread zero323
Github user zero323 commented on a diff in the pull request:

https://github.com/apache/spark/pull/17077#discussion_r110692936
  
--- Diff: python/pyspark/sql/tests.py ---
@@ -2167,6 +2167,61 @@ def test_BinaryType_serialization(self):
 df = self.spark.createDataFrame(data, schema=schema)
 df.collect()
 
+    def test_bucketed_write(self):
+        data = [
+            (1, "foo", 3.0), (2, "foo", 5.0),
+            (3, "bar", -1.0), (4, "bar", 6.0),
+        ]
+        df = self.spark.createDataFrame(data, ["x", "y", "z"])
+
+        # Test write with one bucketing column
+        df.write.bucketBy(3, "x").mode("overwrite").saveAsTable("pyspark_bucket")
+        self.assertEqual(
+            len([c for c in self.spark.catalog.listColumns("pyspark_bucket")
+                 if c.name == "x" and c.isBucket]),
+            1
+        )
+        self.assertSetEqual(set(data), set(self.spark.table("pyspark_bucket").collect()))
+
+        # Test write two bucketing columns
+        df.write.bucketBy(3, "x", "y").mode("overwrite").saveAsTable("pyspark_bucket")
+        self.assertEqual(
+            len([c for c in self.spark.catalog.listColumns("pyspark_bucket")
+                 if c.name in ("x", "y") and c.isBucket]),
+            2
+        )
+        self.assertSetEqual(set(data), set(self.spark.table("pyspark_bucket").collect()))
+
+        # Test write with bucket and sort
+        df.write.bucketBy(2, "x").sortBy("z").mode("overwrite").saveAsTable("pyspark_bucket")
+        self.assertEqual(
+            len([c for c in self.spark.catalog.listColumns("pyspark_bucket")
+                 if c.name == "x" and c.isBucket]),
+            1
+        )
+        self.assertSetEqual(set(data), set(self.spark.table("pyspark_bucket").collect()))
+
+        # Test write with a list of columns
+        df.write.bucketBy(3, ["x", "y"]).mode("overwrite").saveAsTable("pyspark_bucket")
+        self.assertEqual(
+            len([c for c in self.spark.catalog.listColumns("pyspark_bucket")
+                 if c.name in ("x", "y") and c.isBucket]),
+            2
+        )
+        self.assertSetEqual(set(data), set(self.spark.table("pyspark_bucket").collect()))
+
+        # Test write with bucket and sort with a list of columns
+        (df.write.bucketBy(2, "x")
+            .sortBy(["y", "z"])
+            .mode("overwrite").saveAsTable("pyspark_bucket"))
+        self.assertSetEqual(set(data), set(self.spark.table("pyspark_bucket").collect()))
+
+        # Test write with bucket and sort with multiple columns
+        (df.write.bucketBy(2, "x")
+            .sortBy("y", "z")
+            .mode("overwrite").saveAsTable("pyspark_bucket"))
--- End diff --

I don't think that dropping the table beforehand is necessary. We overwrite on 
each write, and name clashes are unlikely.

We could drop the table after the tests, but I am not sure how to do that 
cleanly. `SQLTests` is already overgrown and I am not sure we should add a 
`tearDown` just for this, yet putting a `DROP TABLE` in the test itself doesn't 
look right either.





[GitHub] spark pull request #17591: [SPARK-20280][CORE] FileStatusCache Weigher integ...

2017-04-10 Thread hvanhovell
Github user hvanhovell commented on a diff in the pull request:

https://github.com/apache/spark/pull/17591#discussion_r110692833
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileStatusCache.scala
 ---
@@ -94,27 +94,46 @@ private class SharedInMemoryCache(maxSizeInBytes: Long) extends Logging {
   // Opaque object that uniquely identifies a shared cache user
   private type ClientId = Object
 
+
   private val warnedAboutEviction = new AtomicBoolean(false)
 
   // we use a composite cache key in order to distinguish entries inserted by different clients
-  private val cache: Cache[(ClientId, Path), Array[FileStatus]] = CacheBuilder.newBuilder()
-    .weigher(new Weigher[(ClientId, Path), Array[FileStatus]] {
-      override def weigh(key: (ClientId, Path), value: Array[FileStatus]): Int = {
-        (SizeEstimator.estimate(key) + SizeEstimator.estimate(value)).toInt
-      }})
-    .removalListener(new RemovalListener[(ClientId, Path), Array[FileStatus]]() {
-      override def onRemoval(removed: RemovalNotification[(ClientId, Path), Array[FileStatus]])
+  private val cache: Cache[(ClientId, Path), Array[FileStatus]] = {
+    /* [[Weigher]].weigh returns Int so we could only cache objects < 2GB
+     * instead, the weight is divided by this factor (which is smaller
+     * than the size of one [[FileStatus]]).
+     * so it will support objects up to 64GB in size.
+     */
+    val weightScale = 32
+    CacheBuilder.newBuilder()
+      .weigher(new Weigher[(ClientId, Path), Array[FileStatus]] {
+        override def weigh(key: (ClientId, Path), value: Array[FileStatus]): Int = {
+          val estimate = (SizeEstimator.estimate(key) + SizeEstimator.estimate(value)) / weightScale
+          if (estimate > Int.MaxValue) {
+            logWarning(s"Cached table partition metadata size is too big. Approximating to " +
+              s"${Int.MaxValue.toLong * weightScale}.")
+            Int.MaxValue
+          } else {
+            estimate.toInt
+          }
+        }
+      })
+      .removalListener(new RemovalListener[(ClientId, Path), Array[FileStatus]]() {
--- End diff --

This is kind of hard to read. Can we just initialize the weigher and the 
removal listener in separate variables?
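
Roughly, the suggestion amounts to something like the following (a simplified, self-contained sketch with stand-in key/value types; the real code uses `(ClientId, Path)` keys, `Array[FileStatus]` values and the `SizeEstimator`-based weight from the diff):

```scala
import com.google.common.cache.{Cache, CacheBuilder, RemovalListener, RemovalNotification, Weigher}

object SharedCacheSketch {
  // The weigher gets its own name instead of living inline in the builder chain.
  private val weigher = new Weigher[String, Array[Long]] {
    override def weigh(key: String, value: Array[Long]): Int =
      // stand-in for the scaled SizeEstimator-based estimate in the diff
      math.min(value.length.toLong * 8L, Int.MaxValue.toLong).toInt
  }

  // Likewise for the removal listener.
  private val removalListener = new RemovalListener[String, Array[Long]] {
    override def onRemoval(removed: RemovalNotification[String, Array[Long]]): Unit =
      println(s"evicted ${removed.getKey} (${removed.getCause})")
  }

  // With the two pieces named, the builder chain itself stays easy to scan.
  val cache: Cache[String, Array[Long]] = CacheBuilder.newBuilder()
    .maximumWeight(10L * 1024 * 1024)
    .weigher(weigher)
    .removalListener(removalListener)
    .build[String, Array[Long]]()
}
```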





[GitHub] spark pull request #17591: [SPARK-20280][CORE] FileStatusCache Weigher integ...

2017-04-10 Thread hvanhovell
Github user hvanhovell commented on a diff in the pull request:

https://github.com/apache/spark/pull/17591#discussion_r110692271
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileStatusCache.scala
 ---
@@ -94,27 +94,46 @@ private class SharedInMemoryCache(maxSizeInBytes: Long) extends Logging {
   // Opaque object that uniquely identifies a shared cache user
   private type ClientId = Object
 
+
   private val warnedAboutEviction = new AtomicBoolean(false)
 
   // we use a composite cache key in order to distinguish entries inserted by different clients
-  private val cache: Cache[(ClientId, Path), Array[FileStatus]] = CacheBuilder.newBuilder()
-    .weigher(new Weigher[(ClientId, Path), Array[FileStatus]] {
-      override def weigh(key: (ClientId, Path), value: Array[FileStatus]): Int = {
-        (SizeEstimator.estimate(key) + SizeEstimator.estimate(value)).toInt
-      }})
-    .removalListener(new RemovalListener[(ClientId, Path), Array[FileStatus]]() {
-      override def onRemoval(removed: RemovalNotification[(ClientId, Path), Array[FileStatus]])
+  private val cache: Cache[(ClientId, Path), Array[FileStatus]] = {
+    /* [[Weigher]].weigh returns Int so we could only cache objects < 2GB
--- End diff --

NIT: Could you use Java-style comments (`//`) here?
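
For example, the block could become:

```scala
// [[Weigher]].weigh returns an Int, so we could only cache objects smaller than 2GB.
// Instead, the weight is divided by this factor (which is smaller than the size of
// one [[FileStatus]]), so the cache supports objects of up to 64GB in size.
val weightScale = 32
```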





[GitHub] spark pull request #17566: [SPARK-19518][SQL] IGNORE NULLS in first / last i...

2017-04-10 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/17566





[GitHub] spark pull request #17591: [SPARK-20280][CORE] FileStatusCache Weigher integ...

2017-04-10 Thread hvanhovell
Github user hvanhovell commented on a diff in the pull request:

https://github.com/apache/spark/pull/17591#discussion_r110692150
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileIndexSuite.scala
 ---
@@ -220,6 +221,32 @@ class FileIndexSuite extends SharedSQLContext {
   assert(catalog.leafDirPaths.head == fs.makeQualified(dirPath))
 }
   }
+
+  test("SPARK-20280 - FileStatusCache with a partition with very many files") {
+    /* fake the size, otherwise we need to allocate 2GB of data to trigger this bug */
+    class MyFileStatus extends FileStatus with KnownSizeEstimation {
+      override def estimatedSize: Long = 1000 * 1000 * 1000
+    }
+    /* files * MyFileStatus.estimatedSize should overflow to negative integer
+     * so, make it between 2bn and 4bn
+     */
+    val files = (1 to 3).map { i =>
+      new MyFileStatus()
+    }
+    val fileStatusCache = FileStatusCache.getOrCreate(spark)
+    fileStatusCache.putLeafFiles(new Path("/tmp", "abc"), files.toArray)
+    // scalastyle:off
--- End diff --

Let's remove this comment block; the JIRA should be used for tracking these 
things.
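
For anyone reading along, the overflow the test guards against is easy to check numerically (a hedged back-of-the-envelope sketch, using the `weightScale = 32` from the main diff):

```scala
// Three fake FileStatus objects, each reporting ~1e9 bytes, give a combined
// estimate of about 3e9, which is larger than Int.MaxValue (~2.147e9).
val combined = 3L * 1000L * 1000L * 1000L       // 3,000,000,000
val weightScale = 32

assert(combined > Int.MaxValue)                 // an unscaled Int weight would overflow
assert(combined.toInt < 0)                      // the naive cast wraps to a negative value
assert(combined / weightScale <= Int.MaxValue)  // the scaled weight fits comfortably
```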





[GitHub] spark issue #17593: [SPARK-20279][WEB-UI]In web ui,'Only showing 200' should...

2017-04-10 Thread ajbozarth
Github user ajbozarth commented on the issue:

https://github.com/apache/spark/pull/17593
  
I agree with @srowen; we left it that way since sorting can change which rows are shown.





[GitHub] spark issue #17566: [SPARK-19518][SQL] IGNORE NULLS in first / last in SQL

2017-04-10 Thread hvanhovell
Github user hvanhovell commented on the issue:

https://github.com/apache/spark/pull/17566
  
LGTM - merging to master. Thanks!





[GitHub] spark pull request #17592: [SPARK-20243][TESTS] DebugFilesystem.assertNoOpen...

2017-04-10 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/17592





[GitHub] spark issue #17592: [SPARK-20243][TESTS] DebugFilesystem.assertNoOpenStreams...

2017-04-10 Thread hvanhovell
Github user hvanhovell commented on the issue:

https://github.com/apache/spark/pull/17592
  
Merging to master. Thanks!





[GitHub] spark issue #17592: [SPARK-20243][TESTS] DebugFilesystem.assertNoOpenStreams...

2017-04-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17592
  
Merged build finished. Test PASSed.





[GitHub] spark issue #17592: [SPARK-20243][TESTS] DebugFilesystem.assertNoOpenStreams...

2017-04-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17592
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75662/
Test PASSed.





[GitHub] spark issue #17592: [SPARK-20243][TESTS] DebugFilesystem.assertNoOpenStreams...

2017-04-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17592
  
**[Test build #75662 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75662/testReport)**
 for PR 17592 at commit 
[`d486d60`](https://github.com/apache/spark/commit/d486d6015ae0129ce41e6683eae37243c843ba59).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request #17308: [SPARK-19968][SS] Use a cached instance of `Kafka...

2017-04-10 Thread BenFradet
Github user BenFradet commented on a diff in the pull request:

https://github.com/apache/spark/pull/17308#discussion_r110685760
  
--- Diff: 
external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/CachedKafkaProducer.scala
 ---
@@ -0,0 +1,70 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.kafka010
+
+import java.{util => ju}
+
+import scala.collection.mutable
+
+import org.apache.kafka.clients.producer.KafkaProducer
+
+import org.apache.spark.internal.Logging
+
+private[kafka010] object CachedKafkaProducer extends Logging {
+
+  private val cacheMap = new mutable.HashMap[Int, KafkaProducer[Array[Byte], Array[Byte]]]()
+
+  private def createKafkaProducer(
+      producerConfiguration: ju.HashMap[String, Object]): KafkaProducer[Array[Byte], Array[Byte]] = {
+    val kafkaProducer: KafkaProducer[Array[Byte], Array[Byte]] =
+      new KafkaProducer[Array[Byte], Array[Byte]](producerConfiguration)
+    cacheMap.put(producerConfiguration.hashCode(), kafkaProducer)
--- End diff --

True, my bad; I thought `KafkaSink` was a public API.
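
For context, the lookup side of this cache presumably ends up looking something like the sketch below (the `getOrCreate` name is an assumption for illustration; only `cacheMap` and `createKafkaProducer` appear in the diff, and `createKafkaProducer` is assumed to return the producer it registers):

```scala
private[kafka010] def getOrCreate(
    producerConfiguration: ju.HashMap[String, Object]): KafkaProducer[Array[Byte], Array[Byte]] = {
  // Reuse a producer created earlier with the same configuration; the map is keyed
  // by the configuration's hashCode, matching createKafkaProducer above.
  cacheMap.getOrElse(producerConfiguration.hashCode(), createKafkaProducer(producerConfiguration))
}
```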





[GitHub] spark issue #17491: [SPARK-20175][SQL] Exists should not be evaluated in Joi...

2017-04-10 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/17491
  
cc @cloud-fan 





[GitHub] spark issue #17591: [SPARK-20280][CORE] FileStatusCache Weigher integer over...

2017-04-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17591
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75663/
Test PASSed.





[GitHub] spark issue #17591: [SPARK-20280][CORE] FileStatusCache Weigher integer over...

2017-04-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17591
  
Merged build finished. Test PASSed.





[GitHub] spark issue #17591: [SPARK-20280][CORE] FileStatusCache Weigher integer over...

2017-04-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17591
  
**[Test build #75663 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75663/testReport)**
 for PR 17591 at commit 
[`ef8e9e9`](https://github.com/apache/spark/commit/ef8e9e93f6883542847e2a136fb63a4565bf07b4).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #17592: [SPARK-20243][TESTS] DebugFilesystem.assertNoOpenStreams...

2017-04-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17592
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75661/
Test PASSed.





[GitHub] spark issue #17592: [SPARK-20243][TESTS] DebugFilesystem.assertNoOpenStreams...

2017-04-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17592
  
Merged build finished. Test PASSed.





[GitHub] spark issue #17592: [SPARK-20243][TESTS] DebugFilesystem.assertNoOpenStreams...

2017-04-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17592
  
**[Test build #75661 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75661/testReport)**
 for PR 17592 at commit 
[`19ca5a1`](https://github.com/apache/spark/commit/19ca5a1114dbe666ac40f4c966f3e600b58f92c2).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request #17587: [SPARK-20274][SQL] support compatible array eleme...

2017-04-10 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/17587#discussion_r110676717
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala 
---
@@ -270,12 +270,13 @@ object ScalaReflection extends ScalaReflection {
 
     case t if t <:< localTypeOf[Array[_]] =>
       val TypeRef(_, _, Seq(elementType)) = t
-      val Schema(_, elementNullable) = schemaFor(elementType)
+      val Schema(dataType, elementNullable) = schemaFor(elementType)
       val className = getClassNameFromType(elementType)
       val newTypePath = s"""- array element class: "$className"""" +: walkedTypePath
 
-      val mapFunction: Expression => Expression = p => {
-        val converter = deserializerFor(elementType, Some(p), newTypePath)
+      val mapFunction: Expression => Expression = element => {
+        val casted = upCastToExpectedType(element, dataType, newTypePath)
--- End diff --

Shall we add a comment?

E.g., if it is a compatible array element type, we will cast the element to 
the element type the decoder expects.
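
Applied to the diff, the suggested comment might read something like the following (the trailing `deserializerFor` call is assumed to take `casted` in place of the old `p`; the real body may do more, e.g. null handling):

```scala
val mapFunction: Expression => Expression = element => {
  // If the element type is compatible with (but not identical to) the type the
  // decoder expects, cast the element to the expected element type first.
  val casted = upCastToExpectedType(element, dataType, newTypePath)
  deserializerFor(elementType, Some(casted), newTypePath)
}
```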





[GitHub] spark issue #17587: [SPARK-20274][SQL] support compatible array element type...

2017-04-10 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/17587
  
LGTM





[GitHub] spark pull request #17575: [SPARK-20265][MLlib] Improve Prefix'span pre-proc...

2017-04-10 Thread Syrux
Github user Syrux commented on a diff in the pull request:

https://github.com/apache/spark/pull/17575#discussion_r110671916
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala 
---
@@ -232,6 +200,68 @@ class PrefixSpan private (
 object PrefixSpan extends Logging {
 
   /**
+   * This methods finds all frequent items in a input dataset.
+   *
+   * @param data Sequences of itemsets.
+   * @param minCount The minimal number of sequence an item should be present in to be frequent
+   *
+   * @return An array of Item containing only frequent items.
+   */
+  private[fpm] def findFrequentItems[Item: ClassTag](data : RDD[Array[Array[Item]]],
+                                                     minCount : Long): Array[Item] = {
+
+    data.flatMap { itemsets =>
+      val uniqItems = mutable.Set.empty[Item]
--- End diff --

`itemsets.foreach(set => uniqItems ++= set)` does work. I will change it in 
my next commit and push it once I know what to do about the flag.
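
For completeness, a compact sketch of the simplified body being discussed (same behaviour as the nested `foreach`, just flattened):

```scala
data.flatMap { itemsets =>
  val uniqItems = mutable.Set.empty[Item]
  itemsets.foreach(set => uniqItems ++= set)   // every distinct item of this sequence, once
  uniqItems.toIterator.map((_, 1L))
}.reduceByKey(_ + _)
  .filter { case (_, count) => count >= minCount }
  .sortBy(-_._2)
  .map(_._1)
  .collect()
```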





[GitHub] spark pull request #17330: [SPARK-19993][SQL] Caching logical plans containi...

2017-04-10 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/17330#discussion_r110671813
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/subquery.scala
 ---
@@ -59,6 +58,13 @@ abstract class SubqueryExpression(
 children.zip(p.children).forall(p => p._1.semanticEquals(p._2))
 case _ => false
   }
+  def canonicalize(attrs: AttributeSeq): SubqueryExpression = {
+// Normalize the outer references in the subquery plan.
+val subPlan = plan.transformAllExpressions {
+  case OuterReference(r) => QueryPlan.normalizeExprId(r, attrs)
--- End diff --

The `OuterReference`s will all be removed; is that expected?





[GitHub] spark pull request #17330: [SPARK-19993][SQL] Caching logical plans containi...

2017-04-10 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/17330#discussion_r110671334
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/subquery.scala
 ---
@@ -59,6 +58,13 @@ abstract class SubqueryExpression(
 children.zip(p.children).forall(p => p._1.semanticEquals(p._2))
 case _ => false
   }
+  def canonicalize(attrs: AttributeSeq): SubqueryExpression = {
+// Normalize the outer references in the subquery plan.
+val subPlan = plan.transformAllExpressions {
--- End diff --

`canonicalizedPlan`





[GitHub] spark pull request #17575: [SPARK-20265][MLlib] Improve Prefix'span pre-proc...

2017-04-10 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/17575#discussion_r110671100
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala 
---
@@ -232,6 +200,68 @@ class PrefixSpan private (
 object PrefixSpan extends Logging {
 
   /**
+   * This methods finds all frequent items in a input dataset.
+   *
+   * @param data Sequences of itemsets.
+   * @param minCount The minimal number of sequence an item should be present in to be frequent
+   *
+   * @return An array of Item containing only frequent items.
+   */
+  private[fpm] def findFrequentItems[Item: ClassTag](data : RDD[Array[Array[Item]]],
+                                                     minCount : Long): Array[Item] = {
+
+data.flatMap { itemsets =>
+  val uniqItems = mutable.Set.empty[Item]
+  itemsets.foreach { _.foreach { item =>
+uniqItems += item
+  }}
+  uniqItems.toIterator.map((_, 1L))
+}.reduceByKey(_ + _).filter { case (_, count) =>
+count >= minCount
+}.sortBy(-_._2).map(_._1).collect()
+  }
+
+  /**
+   * This methods cleans the input dataset from un-frequent items, and translate it's item
+   * to their corresponding Int identifier.
+   *
+   * @param data Sequences of itemsets.
+   * @param itemToInt A map allowing translation of frequent Items to their Int Identifier.
+   *                  The map should only contain frequent item.
+   *
+   * @return The internal repr of the inputted dataset. With properly placed zero delimiter.
+   */
+  private[fpm] def toDatabaseInternalRepr[Item: ClassTag](data : RDD[Array[Array[Item]]],
+                                                          itemToInt : Map[Item, Int]):
+  RDD[Array[Int]] = {
+
+data.flatMap { itemsets =>
+  val allItems = mutable.ArrayBuilder.make[Int]
+  var containsFreqItems = false
+  allItems += 0
+  itemsets.foreach { itemsets =>
+val items = mutable.ArrayBuilder.make[Int]
+itemsets.foreach { item =>
+  if (itemToInt.contains(item)) {
+items += itemToInt(item) + 1 // using 1-indexing in internal 
format
+  }
+}
+val result = items.result()
+if (result.nonEmpty) {
+  containsFreqItems = true
+  allItems ++= result.sorted
+  allItems += 0
+}
+  }
+  if (containsFreqItems) {
--- End diff --

OK no problem, leave it. Just riffing while we're editing the code.





[GitHub] spark pull request #17330: [SPARK-19993][SQL] Caching logical plans containi...

2017-04-10 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/17330#discussion_r110671091
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/subquery.scala
 ---
@@ -236,6 +244,12 @@ case class ScalarSubquery(
   override def nullable: Boolean = true
   override def withNewPlan(plan: LogicalPlan): ScalarSubquery = copy(plan 
= plan)
   override def toString: String = s"scalar-subquery#${exprId.id} 
$conditionString"
+  override lazy val canonicalized: Expression = {
--- End diff --

Oh sorry, it's an expression.





[GitHub] spark issue #17587: [SPARK-20274][SQL] support compatible array element type...

2017-04-10 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/17587
  
@kiszk The plan may not be very helpful, as the only difference in the plan 
is in `MapObjects.lambdaFunction`: before it was just `LambdaVariable`, but now 
it's `Cast(LambdaVariable, xxx)`.





[GitHub] spark pull request #17575: [SPARK-20265][MLlib] Improve Prefix'span pre-proc...

2017-04-10 Thread Syrux
Github user Syrux commented on a diff in the pull request:

https://github.com/apache/spark/pull/17575#discussion_r110669561
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala 
---
@@ -232,6 +200,68 @@ class PrefixSpan private (
 object PrefixSpan extends Logging {
 
   /**
+   * This methods finds all frequent items in a input dataset.
+   *
+   * @param data Sequences of itemsets.
+   * @param minCount The minimal number of sequence an item should be present in to be frequent
+   *
+   * @return An array of Item containing only frequent items.
+   */
+  private[fpm] def findFrequentItems[Item: ClassTag](data : RDD[Array[Array[Item]]],
+                                                     minCount : Long): Array[Item] = {
+
+data.flatMap { itemsets =>
+  val uniqItems = mutable.Set.empty[Item]
+  itemsets.foreach { _.foreach { item =>
+uniqItems += item
+  }}
+  uniqItems.toIterator.map((_, 1L))
+}.reduceByKey(_ + _).filter { case (_, count) =>
+count >= minCount
+}.sortBy(-_._2).map(_._1).collect()
+  }
+
+  /**
+   * This methods cleans the input dataset from un-frequent items, and translate it's item
+   * to their corresponding Int identifier.
+   *
+   * @param data Sequences of itemsets.
+   * @param itemToInt A map allowing translation of frequent Items to their Int Identifier.
+   *                  The map should only contain frequent item.
+   *
+   * @return The internal repr of the inputted dataset. With properly placed zero delimiter.
+   */
+  private[fpm] def toDatabaseInternalRepr[Item: ClassTag](data : RDD[Array[Array[Item]]],
+                                                          itemToInt : Map[Item, Int]):
+  RDD[Array[Int]] = {
+
+data.flatMap { itemsets =>
+  val allItems = mutable.ArrayBuilder.make[Int]
+  var containsFreqItems = false
+  allItems += 0
+  itemsets.foreach { itemsets =>
+val items = mutable.ArrayBuilder.make[Int]
+itemsets.foreach { item =>
+  if (itemToInt.contains(item)) {
+items += itemToInt(item) + 1 // using 1-indexing in internal 
format
+  }
+}
+val result = items.result()
+if (result.nonEmpty) {
+  containsFreqItems = true
+  allItems ++= result.sorted
+  allItems += 0
+}
+  }
+  if (containsFreqItems) {
--- End diff --

Apparently, prepending is impossible on an `ArrayBuilder`; the method doesn't 
exist 
(http://www.scala-lang.org/api/2.12.0/scala/collection/mutable/ArrayBuilder.html).
I think the flag is our best bet for performance. Changing it to an 
`ArrayBuffer` would be far worse, since that would force boxing on the ints it 
contains.
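
The "type encapsulation" here is boxing: `ArrayBuilder.make[Int]` builds a primitive `Array[Int]` directly, while an `ArrayBuffer[Int]` stores boxed `java.lang.Integer`s internally. A tiny illustration:

```scala
import scala.collection.mutable

val builder = mutable.ArrayBuilder.make[Int]   // specialised: backed by a primitive Array[Int]
builder += 1
builder += 2
val unboxed: Array[Int] = builder.result()     // no boxing at any point

val buffer = mutable.ArrayBuffer[Int](1, 2)    // backed by an Array[AnyRef]; each Int is boxed
val roundTripped: Array[Int] = buffer.toArray  // and unboxed again on the way out
```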





[GitHub] spark pull request #17575: [SPARK-20265][MLlib] Improve Prefix'span pre-proc...

2017-04-10 Thread Syrux
Github user Syrux commented on a diff in the pull request:

https://github.com/apache/spark/pull/17575#discussion_r110667171
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala 
---
@@ -232,6 +200,68 @@ class PrefixSpan private (
 object PrefixSpan extends Logging {
 
   /**
+   * This methods finds all frequent items in a input dataset.
+   *
+   * @param data Sequences of itemsets.
+   * @param minCount The minimal number of sequence an item should be present in to be frequent
+   *
+   * @return An array of Item containing only frequent items.
+   */
+  private[fpm] def findFrequentItems[Item: ClassTag](data : RDD[Array[Array[Item]]],
+                                                     minCount : Long): Array[Item] = {
+
+data.flatMap { itemsets =>
+  val uniqItems = mutable.Set.empty[Item]
+  itemsets.foreach { _.foreach { item =>
+uniqItems += item
+  }}
+  uniqItems.toIterator.map((_, 1L))
+}.reduceByKey(_ + _).filter { case (_, count) =>
+count >= minCount
+}.sortBy(-_._2).map(_._1).collect()
+  }
+
+  /**
+   * This methods cleans the input dataset from un-frequent items, and translate it's item
+   * to their corresponding Int identifier.
+   *
+   * @param data Sequences of itemsets.
+   * @param itemToInt A map allowing translation of frequent Items to their Int Identifier.
+   *                  The map should only contain frequent item.
+   *
+   * @return The internal repr of the inputted dataset. With properly placed zero delimiter.
+   */
+  private[fpm] def toDatabaseInternalRepr[Item: ClassTag](data : RDD[Array[Array[Item]]],
+                                                          itemToInt : Map[Item, Int]):
+  RDD[Array[Int]] = {
+
+data.flatMap { itemsets =>
+  val allItems = mutable.ArrayBuilder.make[Int]
+  var containsFreqItems = false
+  allItems += 0
+  itemsets.foreach { itemsets =>
+val items = mutable.ArrayBuilder.make[Int]
+itemsets.foreach { item =>
+  if (itemToInt.contains(item)) {
+items += itemToInt(item) + 1 // using 1-indexing in internal 
format
+  }
+}
+val result = items.result()
+if (result.nonEmpty) {
+  containsFreqItems = true
+  allItems ++= result.sorted
+  allItems += 0
+}
+  }
+  if (containsFreqItems) {
--- End diff --

I am not sure about the performance of prepending on an `ArrayBuilder`. I will 
check that first. Back in a few minutes.





[GitHub] spark issue #17591: [SPARK-20280][CORE] FileStatusCache Weigher integer over...

2017-04-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17591
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75660/
Test PASSed.





[GitHub] spark issue #17591: [SPARK-20280][CORE] FileStatusCache Weigher integer over...

2017-04-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17591
  
Merged build finished. Test PASSed.





[GitHub] spark issue #17591: [SPARK-20280][CORE] FileStatusCache Weigher integer over...

2017-04-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17591
  
**[Test build #75660 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75660/testReport)**
 for PR 17591 at commit 
[`f4ae52a`](https://github.com/apache/spark/commit/f4ae52a0c60b9d2a086e4f1ad0bb3675bebc47da).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request #17575: [SPARK-20265][MLlib] Improve Prefix'span pre-proc...

2017-04-10 Thread Syrux
Github user Syrux commented on a diff in the pull request:

https://github.com/apache/spark/pull/17575#discussion_r110664282
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/fpm/PrefixSpanSuite.scala ---
@@ -360,6 +360,55 @@ class PrefixSpanSuite extends SparkFunSuite with MLlibTestSparkContext {
     compareResults(expected, model.freqSequences.collect())
   }
 
+  test("PrefixSpan pre-processing's cleaning test") {
+
+    // One item per itemSet
+    val itemToInt1 = (4 to 5).zipWithIndex.toMap
+    val sequences1 = Seq(
+      Array(Array(4), Array(1), Array(2), Array(5), Array(2), Array(4), Array(5)),
+      Array(Array(6), Array(7), Array(8)))
+    val rdd1 = sc.parallelize(sequences1, 2).cache()
+
+    val cleanedSequence1 = PrefixSpan.toDatabaseInternalRepr(rdd1, itemToInt1).collect()
+
+    val expected1 = Array(Array(0, 4, 0, 5, 0, 4, 0, 5, 0))
+      .map(x => x.map(y => {
+        if (y == 0) 0
+        else itemToInt1(y) + 1
+      }))
+
+    compareInternalSequences(expected1, cleanedSequence1)
+
+    // Multi-item sequence
+    val itemToInt2 = (4 to 6).zipWithIndex.toMap
+    val sequences2 = Seq(
+      Array(Array(4, 5), Array(1, 6, 2), Array(2), Array(5), Array(2), Array(4), Array(5, 6, 7)),
+      Array(Array(8, 9), Array(1, 2)))
+    val rdd2 = sc.parallelize(sequences2, 2).cache()
+
+    val cleanedSequence2 = PrefixSpan.toDatabaseInternalRepr(rdd2, itemToInt2).collect()
+
+    val expected2 = Array(Array(0, 4, 5, 0, 6, 0, 5, 0, 4, 0, 5, 6, 0))
+      .map(x => x.map(y => {
+        if (y == 0) 0
+        else itemToInt2(y) + 1
+      }))
+
+    compareInternalSequences(expected2, cleanedSequence2)
+
+    // Emptied sequence
+    val itemToInt3 = (10 to 10).zipWithIndex.toMap
+    val sequences3 = Seq(
+      Array(Array(4, 5), Array(1, 6, 2), Array(2), Array(5), Array(2), Array(4), Array(5, 6, 7)),
+      Array(Array(8, 9), Array(1, 2)))
+    val rdd3 = sc.parallelize(sequences3, 2).cache()
+
+    val cleanedSequence3 = PrefixSpan.toDatabaseInternalRepr(rdd3, itemToInt3).collect()
+    val expected3: Array[Array[Int]] = Array()
--- End diff --

Yep, it can. It even avoids a useless cast.
I will push the new version ASAP.





[GitHub] spark issue #17527: [SPARK-20156][CORE][SQL][STREAMING][MLLIB] Java String t...

2017-04-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17527
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75657/
Test PASSed.





[GitHub] spark issue #17527: [SPARK-20156][CORE][SQL][STREAMING][MLLIB] Java String t...

2017-04-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17527
  
Merged build finished. Test PASSed.





[GitHub] spark issue #17527: [SPARK-20156][CORE][SQL][STREAMING][MLLIB] Java String t...

2017-04-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17527
  
**[Test build #75657 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75657/testReport)**
 for PR 17527 at commit 
[`2ac5843`](https://github.com/apache/spark/commit/2ac5843a071847dbe6e8cea08b49a7ef36587101).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request #17575: [SPARK-20265][MLlib] Improve Prefix'span pre-proc...

2017-04-10 Thread Syrux
Github user Syrux commented on a diff in the pull request:

https://github.com/apache/spark/pull/17575#discussion_r110662589
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/fpm/PrefixSpanSuite.scala ---
@@ -360,6 +360,55 @@ class PrefixSpanSuite extends SparkFunSuite with MLlibTestSparkContext {
     compareResults(expected, model.freqSequences.collect())
   }
 
+  test("PrefixSpan pre-processing's cleaning test") {
+
+    // One item per itemSet
+    val itemToInt1 = (4 to 5).zipWithIndex.toMap
+    val sequences1 = Seq(
+      Array(Array(4), Array(1), Array(2), Array(5), Array(2), Array(4), Array(5)),
+      Array(Array(6), Array(7), Array(8)))
+    val rdd1 = sc.parallelize(sequences1, 2).cache()
+
+    val cleanedSequence1 = PrefixSpan.toDatabaseInternalRepr(rdd1, itemToInt1).collect()
+
+    val expected1 = Array(Array(0, 4, 0, 5, 0, 4, 0, 5, 0))
+      .map(x => x.map(y => {
+        if (y == 0) 0
--- End diff --

Ok, changing it





[GitHub] spark pull request #17575: [SPARK-20265][MLlib] Improve Prefix'span pre-proc...

2017-04-10 Thread Syrux
Github user Syrux commented on a diff in the pull request:

https://github.com/apache/spark/pull/17575#discussion_r110662506
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala 
---
@@ -232,6 +200,68 @@ class PrefixSpan private (
 object PrefixSpan extends Logging {
 
   /**
+   * This methods finds all frequent items in a input dataset.
+   *
+   * @param data Sequences of itemsets.
+   * @param minCount The minimal number of sequence an item should be present in to be frequent
+   *
+   * @return An array of Item containing only frequent items.
+   */
+  private[fpm] def findFrequentItems[Item: ClassTag](data : RDD[Array[Array[Item]]],
+                                                     minCount : Long): Array[Item] = {
+
+data.flatMap { itemsets =>
+  val uniqItems = mutable.Set.empty[Item]
--- End diff --

OK, changed the braces to parentheses (I suppose that's what you meant; 
correct me if I'm wrong).
Also, just out of curiosity, do you know if that makes any difference in 
performance?





[GitHub] spark pull request #17575: [SPARK-20265][MLlib] Improve Prefix'span pre-proc...

2017-04-10 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/17575#discussion_r110662332
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala 
---
@@ -232,6 +200,68 @@ class PrefixSpan private (
 object PrefixSpan extends Logging {
 
   /**
+   * This methods finds all frequent items in a input dataset.
+   *
+   * @param data Sequences of itemsets.
+   * @param minCount The minimal number of sequence an item should be present in to be frequent
+   *
+   * @return An array of Item containing only frequent items.
+   */
+  private[fpm] def findFrequentItems[Item: ClassTag](data : RDD[Array[Array[Item]]],
+                                                     minCount : Long): Array[Item] = {
+
+data.flatMap { itemsets =>
+  val uniqItems = mutable.Set.empty[Item]
+  itemsets.foreach { _.foreach { item =>
+uniqItems += item
+  }}
+  uniqItems.toIterator.map((_, 1L))
+}.reduceByKey(_ + _).filter { case (_, count) =>
+count >= minCount
+}.sortBy(-_._2).map(_._1).collect()
+  }
+
+  /**
+   * This methods cleans the input dataset from un-frequent items, and translate it's item
+   * to their corresponding Int identifier.
+   *
+   * @param data Sequences of itemsets.
+   * @param itemToInt A map allowing translation of frequent Items to their Int Identifier.
+   *                  The map should only contain frequent item.
+   *
+   * @return The internal repr of the inputted dataset. With properly placed zero delimiter.
+   */
+  private[fpm] def toDatabaseInternalRepr[Item: ClassTag](data : RDD[Array[Array[Item]]],
+                                                          itemToInt : Map[Item, Int]):
+  RDD[Array[Int]] = {
+
+data.flatMap { itemsets =>
+  val allItems = mutable.ArrayBuilder.make[Int]
+  var containsFreqItems = false
+  allItems += 0
+  itemsets.foreach { itemsets =>
+val items = mutable.ArrayBuilder.make[Int]
+itemsets.foreach { item =>
+  if (itemToInt.contains(item)) {
+items += itemToInt(item) + 1 // using 1-indexing in internal 
format
+  }
+}
+val result = items.result()
+if (result.nonEmpty) {
+  containsFreqItems = true
+  allItems ++= result.sorted
+  allItems += 0
+}
+  }
+  if (containsFreqItems) {
--- End diff --

I see. What about waiting to prepend the initial 0 until the end, and only 
doing so if the result is not empty?
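
A hedged sketch of that alternative (no flag; the leading delimiter is prepended once at the end, at the cost of one extra array copy):

```scala
data.flatMap { itemsets =>
  val allItems = mutable.ArrayBuilder.make[Int]
  itemsets.foreach { itemset =>
    // keep only frequent items, translated to their 1-indexed internal ids
    val items = itemset.flatMap(itemToInt.get).map(_ + 1).sorted
    if (items.nonEmpty) {
      allItems ++= items
      allItems += 0
    }
  }
  val result = allItems.result()
  // prepend the leading 0 only if at least one frequent item survived
  if (result.nonEmpty) Iterator.single(0 +: result) else Iterator.empty
}
```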





[GitHub] spark pull request #17575: [SPARK-20265][MLlib] Improve Prefix'span pre-proc...

2017-04-10 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/17575#discussion_r110662125
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala 
---
@@ -232,6 +200,68 @@ class PrefixSpan private (
 object PrefixSpan extends Logging {
 
   /**
+   * This methods finds all frequent items in a input dataset.
+   *
+   * @param data Sequences of itemsets.
+   * @param minCount The minimal number of sequence an item should be present in to be frequent
+   *
+   * @return An array of Item containing only frequent items.
+   */
+  private[fpm] def findFrequentItems[Item: ClassTag](data : RDD[Array[Array[Item]]],
+                                                     minCount : Long): Array[Item] = {
+
+data.flatMap { itemsets =>
+  val uniqItems = mutable.Set.empty[Item]
--- End diff --

or does `itemsets.foreach(set => uniqItems ++= set)` work?





[GitHub] spark pull request #17575: [SPARK-20265][MLlib] Improve Prefix'span pre-proc...

2017-04-10 Thread Syrux
Github user Syrux commented on a diff in the pull request:

https://github.com/apache/spark/pull/17575#discussion_r110661717
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala 
---
@@ -232,6 +200,68 @@ class PrefixSpan private (
 object PrefixSpan extends Logging {
 
   /**
+   * This methods finds all frequent items in a input dataset.
+   *
+   * @param data Sequences of itemsets.
+   * @param minCount The minimal number of sequence an item should be 
present in to be frequent
+   *
+   * @return An array of Item containing only frequent items.
+   */
+  private[fpm] def findFrequentItems[Item: ClassTag](data : 
RDD[Array[Array[Item]]],
+ minCount : Long): 
Array[Item] = {
+
+data.flatMap { itemsets =>
+  val uniqItems = mutable.Set.empty[Item]
+  itemsets.foreach { _.foreach { item =>
+uniqItems += item
+  }}
+  uniqItems.toIterator.map((_, 1L))
+}.reduceByKey(_ + _).filter { case (_, count) =>
+count >= minCount
+}.sortBy(-_._2).map(_._1).collect()
+  }
+
+  /**
+   * This methods cleans the input dataset from un-frequent items, and 
translate it's item
+   * to their corresponding Int identifier.
+   *
+   * @param data Sequences of itemsets.
+   * @param itemToInt A map allowing translation of frequent Items to 
their Int Identifier.
+   *  The map should only contain frequent item.
+   *
+   * @return The internal repr of the inputted dataset. With properly 
placed zero delimiter.
+   */
+  private[fpm] def toDatabaseInternalRepr[Item: ClassTag](data : 
RDD[Array[Array[Item]]],
--- End diff --

Ok, the new version will fix that and the colon space.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17575: [SPARK-20265][MLlib] Improve Prefix'span pre-proc...

2017-04-10 Thread Syrux
Github user Syrux commented on a diff in the pull request:

https://github.com/apache/spark/pull/17575#discussion_r110661386
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala 
---
@@ -232,6 +200,68 @@ class PrefixSpan private (
 object PrefixSpan extends Logging {
 
   /**
+   * This methods finds all frequent items in a input dataset.
+   *
+   * @param data Sequences of itemsets.
+   * @param minCount The minimal number of sequence an item should be 
present in to be frequent
+   *
+   * @return An array of Item containing only frequent items.
+   */
+  private[fpm] def findFrequentItems[Item: ClassTag](data : 
RDD[Array[Array[Item]]],
+ minCount : Long): 
Array[Item] = {
+
+data.flatMap { itemsets =>
+  val uniqItems = mutable.Set.empty[Item]
+  itemsets.foreach { _.foreach { item =>
+uniqItems += item
+  }}
+  uniqItems.toIterator.map((_, 1L))
+}.reduceByKey(_ + _).filter { case (_, count) =>
+count >= minCount
+}.sortBy(-_._2).map(_._1).collect()
+  }
+
+  /**
+   * This methods cleans the input dataset from un-frequent items, and 
translate it's item
+   * to their corresponding Int identifier.
+   *
+   * @param data Sequences of itemsets.
+   * @param itemToInt A map allowing translation of frequent Items to 
their Int Identifier.
+   *  The map should only contain frequent item.
+   *
+   * @return The internal repr of the inputted dataset. With properly 
placed zero delimiter.
+   */
+  private[fpm] def toDatabaseInternalRepr[Item: ClassTag](data : 
RDD[Array[Array[Item]]],
+itemToInt : 
Map[Item, Int]):
+  RDD[Array[Int]] = {
+
+data.flatMap { itemsets =>
+  val allItems = mutable.ArrayBuilder.make[Int]
+  var containsFreqItems = false
+  allItems += 0
+  itemsets.foreach { itemsets =>
+val items = mutable.ArrayBuilder.make[Int]
+itemsets.foreach { item =>
+  if (itemToInt.contains(item)) {
+items += itemToInt(item) + 1 // using 1-indexing in internal 
format
+  }
+}
+val result = items.result()
+if (result.nonEmpty) {
+  containsFreqItems = true
+  allItems ++= result.sorted
+  allItems += 0
+}
+  }
+  if (containsFreqItems) {
--- End diff --

Yes, but allItems is an ArrayBuilder, so there is no size method.
I could do allItems.result().size, but I think the performance might be worse than a flag.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17582: [SPARK-20239][Core] Improve HistoryServer's ACL mechanis...

2017-04-10 Thread tgravescs
Github user tgravescs commented on the issue:

https://github.com/apache/spark/pull/17582
  
Sorry, but I'm confused by the explanation in the description. I didn't completely follow which of the behaviors you are seeing are unintended, and I don't understand how you are proposing to fix them. Can you please describe the design you are proposing in more detail?

On the description, can you please clarify each of your bullets? For instance:
1. if base URL's ACL (spark.acls.enable) is enabled but user A has no view permission. User "A" cannot see the app list but could still access details of its own app.

Are you saying user A is not in the list of ACLs, or is? If they have no view permission then they shouldn't be able to see the app. I don't understand what you mean by "could still access details of its own app". Is this user A's application (meaning they started it), and hence they would automatically be in the ACL list?

Clarifying the other bullets would be helpful as well.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17330: [SPARK-19993][SQL] Caching logical plans containing subq...

2017-04-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17330
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17330: [SPARK-19993][SQL] Caching logical plans containing subq...

2017-04-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17330
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75654/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17330: [SPARK-19993][SQL] Caching logical plans containing subq...

2017-04-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17330
  
**[Test build #75654 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75654/testReport)**
 for PR 17330 at commit 
[`22db44a`](https://github.com/apache/spark/commit/22db44addbe1892b52353fa22bd9b65952e8bdf5).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17556: [SPARK-16957][MLlib] Use weighted midpoints for split va...

2017-04-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17556
  
**[Test build #3655 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3655/testReport)**
 for PR 17556 at commit 
[`9ca5750`](https://github.com/apache/spark/commit/9ca57505c8211954478a2d54ced48c2561cfb9f9).
 * This patch **fails PySpark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17591: [SPARK-20280][CORE] FileStatusCache Weigher integ...

2017-04-10 Thread bogdanrdc
Github user bogdanrdc commented on a diff in the pull request:

https://github.com/apache/spark/pull/17591#discussion_r110646607
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileStatusCache.scala
 ---
@@ -94,13 +94,25 @@ private class SharedInMemoryCache(maxSizeInBytes: Long) 
extends Logging {
   // Opaque object that uniquely identifies a shared cache user
   private type ClientId = Object
 
+  /* [[Weigher]].weigh returns Int so we could only cache objects < 2GB
+   * instead, the weight is divided by this factor (which is smaller
+   * than the size of one [[FileStatus]]).
+   * so it will support objects up to 64GB in size.
+   */
+  private val weightScale = 32
+
   private val warnedAboutEviction = new AtomicBoolean(false)
 
   // we use a composite cache key in order to distinguish entries inserted 
by different clients
   private val cache: Cache[(ClientId, Path), Array[FileStatus]] = 
CacheBuilder.newBuilder()
 .weigher(new Weigher[(ClientId, Path), Array[FileStatus]] {
   override def weigh(key: (ClientId, Path), value: Array[FileStatus]): 
Int = {
-(SizeEstimator.estimate(key) + SizeEstimator.estimate(value)).toInt
+val estimate = (SizeEstimator.estimate(key) + 
SizeEstimator.estimate(value)) / weightScale
+if (estimate > Int.MaxValue) {
+  throw new IllegalStateException(
--- End diff --

yes, I guess it's better to fail later than sooner. I made it a warning 
instead.
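
For reference, a minimal sketch of the capped weigher being discussed, assuming Guava's `Weigher` and Spark's `SizeEstimator` as in the quoted diff (the warning call itself is elided here):

```scala
import com.google.common.cache.Weigher
import org.apache.hadoop.fs.{FileStatus, Path}
import org.apache.spark.util.SizeEstimator

val weightScale = 32
val weigher = new Weigher[(Object, Path), Array[FileStatus]] {
  override def weigh(key: (Object, Path), value: Array[FileStatus]): Int = {
    val estimate =
      (SizeEstimator.estimate(key) + SizeEstimator.estimate(value)) / weightScale
    // Capping only means an entry is under-weighed, so the cache may hold a
    // bit more than configured -- preferable to failing the query.
    math.min(estimate, Int.MaxValue.toLong).toInt
  }
}
```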


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17591: [SPARK-20280][CORE] FileStatusCache Weigher integer over...

2017-04-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17591
  
**[Test build #75663 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75663/testReport)**
 for PR 17591 at commit 
[`ef8e9e9`](https://github.com/apache/spark/commit/ef8e9e93f6883542847e2a136fb63a4565bf07b4).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17592: [SPARK-20243][TESTS] DebugFilesystem.assertNoOpenStreams...

2017-04-10 Thread hvanhovell
Github user hvanhovell commented on the issue:

https://github.com/apache/spark/pull/17592
  
LGTM


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17592: [SPARK-20243][TESTS] DebugFilesystem.assertNoOpenStreams...

2017-04-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17592
  
**[Test build #75662 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75662/testReport)**
 for PR 17592 at commit 
[`d486d60`](https://github.com/apache/spark/commit/d486d6015ae0129ce41e6683eae37243c843ba59).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17556: [SPARK-16957][MLlib] Use weighted midpoints for split va...

2017-04-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17556
  
**[Test build #3655 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3655/testReport)**
 for PR 17556 at commit 
[`9ca5750`](https://github.com/apache/spark/commit/9ca57505c8211954478a2d54ced48c2561cfb9f9).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17593: [SPARK-20279][WEB-UI]In web ui,'Only showing 200' should...

2017-04-10 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/17593
  
I'm not sure this is accurate, given how the sort order can be changed. Maybe it is, maybe not, but it's at least ambiguous, and I don't think it adds enough to justify the change.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17591: [SPARK-20280][CORE] FileStatusCache Weigher integ...

2017-04-10 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/17591#discussion_r110642316
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileStatusCache.scala
 ---
@@ -94,13 +94,25 @@ private class SharedInMemoryCache(maxSizeInBytes: Long) 
extends Logging {
   // Opaque object that uniquely identifies a shared cache user
   private type ClientId = Object
 
+  /* [[Weigher]].weigh returns Int so we could only cache objects < 2GB
+   * instead, the weight is divided by this factor (which is smaller
+   * than the size of one [[FileStatus]]).
+   * so it will support objects up to 64GB in size.
+   */
+  private val weightScale = 32
+
   private val warnedAboutEviction = new AtomicBoolean(false)
 
   // we use a composite cache key in order to distinguish entries inserted 
by different clients
   private val cache: Cache[(ClientId, Path), Array[FileStatus]] = 
CacheBuilder.newBuilder()
 .weigher(new Weigher[(ClientId, Path), Array[FileStatus]] {
   override def weigh(key: (ClientId, Path), value: Array[FileStatus]): 
Int = {
-(SizeEstimator.estimate(key) + SizeEstimator.estimate(value)).toInt
+val estimate = (SizeEstimator.estimate(key) + 
SizeEstimator.estimate(value)) / weightScale
+if (estimate > Int.MaxValue) {
+  throw new IllegalStateException(
--- End diff --

Agree, though this can only happen if filesourcePartitionFileCacheSize is at least 64GB and some object is at least 64GB. The effect is to possibly cache things for longer than they should be, which seems better than failing.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17591: [SPARK-20280][CORE] FileStatusCache Weigher integ...

2017-04-10 Thread bogdanrdc
Github user bogdanrdc commented on a diff in the pull request:

https://github.com/apache/spark/pull/17591#discussion_r110639886
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileStatusCache.scala
 ---
@@ -94,13 +94,25 @@ private class SharedInMemoryCache(maxSizeInBytes: Long) 
extends Logging {
   // Opaque object that uniquely identifies a shared cache user
   private type ClientId = Object
 
+  /* [[Weigher]].weigh returns Int so we could only cache objects < 2GB
+   * instead, the weight is divided by this factor (which is smaller
+   * than the size of one [[FileStatus]]).
+   * so it will support objects up to 64GB in size.
+   */
+  private val weightScale = 32
+
   private val warnedAboutEviction = new AtomicBoolean(false)
 
   // we use a composite cache key in order to distinguish entries inserted 
by different clients
   private val cache: Cache[(ClientId, Path), Array[FileStatus]] = 
CacheBuilder.newBuilder()
 .weigher(new Weigher[(ClientId, Path), Array[FileStatus]] {
   override def weigh(key: (ClientId, Path), value: Array[FileStatus]): 
Int = {
-(SizeEstimator.estimate(key) + SizeEstimator.estimate(value)).toInt
+val estimate = (SizeEstimator.estimate(key) + 
SizeEstimator.estimate(value)) / weightScale
+if (estimate > Int.MaxValue) {
+  throw new IllegalStateException(
--- End diff --

If it's capped, the cache might use more memory than configured with 
spark.sql.hive.filesourcePartitionFileCacheSize. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17592: [SPARK-20243][TESTS] DebugFilesystem.assertNoOpen...

2017-04-10 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/17592#discussion_r110639779
  
--- Diff: core/src/test/scala/org/apache/spark/DebugFilesystem.scala ---
@@ -31,21 +30,29 @@ import org.apache.spark.internal.Logging
 
 object DebugFilesystem extends Logging {
   // Stores the set of active streams and their creation sites.
-  private val openStreams = new ConcurrentHashMap[FSDataInputStream, 
Throwable]()
+  private val openStreams = mutable.Map.empty[FSDataInputStream, Throwable]
 
-  def clearOpenStreams(): Unit = {
+  def addOpenStream(stream: FSDataInputStream): Unit = synchronized {
--- End diff --

It's a little safer and tidier to synchronize on `openStreams` rather than on the containing `object`. Looks good to me though.
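
A minimal sketch of that alternative, using the same `mutable.Map` of open streams and method names as the quoted diff (the object name here is made up; the creation site is captured as a `Throwable`, matching the map's value type):

```scala
import scala.collection.mutable
import org.apache.hadoop.fs.FSDataInputStream

object DebugFilesystemSketch {
  // Stores the set of active streams and their creation sites.
  private val openStreams = mutable.Map.empty[FSDataInputStream, Throwable]

  def addOpenStream(stream: FSDataInputStream): Unit = openStreams.synchronized {
    openStreams.put(stream, new Throwable())   // lock on the map, not the enclosing object
  }

  def clearOpenStreams(): Unit = openStreams.synchronized {
    openStreams.clear()
  }
}
```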


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17591: [SPARK-20280][CORE] FileStatusCache Weigher integ...

2017-04-10 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/17591#discussion_r110639538
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileStatusCache.scala
 ---
@@ -94,13 +94,25 @@ private class SharedInMemoryCache(maxSizeInBytes: Long) 
extends Logging {
   // Opaque object that uniquely identifies a shared cache user
   private type ClientId = Object
 
+  /* [[Weigher]].weigh returns Int so we could only cache objects < 2GB
+   * instead, the weight is divided by this factor (which is smaller
+   * than the size of one [[FileStatus]]).
+   * so it will support objects up to 64GB in size.
+   */
+  private val weightScale = 32
--- End diff --

Rather than making it a member variable, could it just be a local variable in the initializer for the cache?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17593: [SPARK-20279][WEB-UI]In web ui,'Only showing 200' should...

2017-04-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17593
  
Can one of the admins verify this patch?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17591: [SPARK-20280][CORE] FileStatusCache Weigher integ...

2017-04-10 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/17591#discussion_r110638961
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileStatusCache.scala
 ---
@@ -94,13 +94,25 @@ private class SharedInMemoryCache(maxSizeInBytes: Long) 
extends Logging {
   // Opaque object that uniquely identifies a shared cache user
   private type ClientId = Object
 
+  /* [[Weigher]].weigh returns Int so we could only cache objects < 2GB
+   * instead, the weight is divided by this factor (which is smaller
+   * than the size of one [[FileStatus]]).
+   * so it will support objects up to 64GB in size.
+   */
+  private val weightScale = 32
+
   private val warnedAboutEviction = new AtomicBoolean(false)
 
   // we use a composite cache key in order to distinguish entries inserted 
by different clients
   private val cache: Cache[(ClientId, Path), Array[FileStatus]] = 
CacheBuilder.newBuilder()
 .weigher(new Weigher[(ClientId, Path), Array[FileStatus]] {
   override def weigh(key: (ClientId, Path), value: Array[FileStatus]): 
Int = {
-(SizeEstimator.estimate(key) + SizeEstimator.estimate(value)).toInt
+val estimate = (SizeEstimator.estimate(key) + 
SizeEstimator.estimate(value)) / weightScale
+if (estimate > Int.MaxValue) {
+  throw new IllegalStateException(
--- End diff --

This shouldn't be an error. It's just a weight. Capping at `Int.MaxValue` 
is no problem.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17593: [SPARK-20279][WEB-UI]In web ui,'Only showing 200'...

2017-04-10 Thread guoxiaolongzte
GitHub user guoxiaolongzte opened a pull request:

https://github.com/apache/spark/pull/17593

[SPARK-20279][WEB-UI]In web ui,'Only showing 200' should be changed to 
'only showing last 200'.

## What changes were proposed in this pull request?

In the web UI, 'Only showing 200' should be changed to 'only showing last 200' on the 'jobs' and 'stages' pages.

Purpose:
Adding 'last' makes it clearer to users that the 200 'jobs or stages' currently shown are the latest 200, rather than the first 200.


## How was this patch tested?
unit tests, manual tests

Please review http://spark.apache.org/contributing.html before opening a 
pull request.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/guoxiaolongzte/spark SPARK-20279

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17593.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17593


commit d383efba12c66addb17006dea107bb0421d50bc3
Author: 郭小龙 10207633 
Date:   2017-03-31T13:57:09Z

[SPARK-20177]Document about compression way has some little detail changes.

commit 3059013e9d2aec76def14eb314b6761bea0e7ca0
Author: 郭小龙 10207633 
Date:   2017-04-01T01:38:02Z

[SPARK-20177] event log add a space

commit 555cef88fe09134ac98fd0ad056121c7df2539aa
Author: guoxiaolongzte 
Date:   2017-04-02T00:16:08Z

'/applications/[app-id]/jobs' in rest api,status should be 
[running|succeeded|failed|unknown]

commit 46bb1ad3ddd9fb55b5607ac4f20213a90186cfe9
Author: 郭小龙 10207633 
Date:   2017-04-05T03:16:50Z

Merge branch 'master' of https://github.com/apache/spark into SPARK-20177

commit 0efb0dd9e404229cce638fe3fb0c966276784df7
Author: 郭小龙 10207633 
Date:   2017-04-05T03:47:53Z

[SPARK-20218]'/applications/[app-id]/stages' in REST API,add description.

commit 0e37fdeee28e31fc97436dabd001d3c85c5a7794
Author: 郭小龙 10207633 
Date:   2017-04-05T05:22:54Z

[SPARK-20218] '/applications/[app-id]/stages/[stage-id]' in REST API,remove 
redundant description.

commit 52641bb01e55b48bd9e8579fea217439d14c7dc7
Author: 郭小龙 10207633 
Date:   2017-04-07T06:24:58Z

Merge branch 'SPARK-20218'

commit d3977c9cab0722d279e3fae7aacbd4eb944c22f6
Author: 郭小龙 10207633 
Date:   2017-04-08T07:13:02Z

Merge branch 'master' of https://github.com/apache/spark

commit 137b90e5a85cde7e9b904b3e5ea0bb52518c4716
Author: 郭小龙 10207633 
Date:   2017-04-10T05:13:40Z

Merge branch 'master' of https://github.com/apache/spark

commit 0fe5865b8022aeacdb2d194699b990d8467f7a0a
Author: 郭小龙 10207633 
Date:   2017-04-10T10:25:22Z

Merge branch 'SPARK-20190' of https://github.com/guoxiaolongzte/spark

commit cf6f42ac84466960f2232c025b8faeb5d7378fe1
Author: 郭小龙 10207633 
Date:   2017-04-10T10:26:27Z

Merge branch 'master' of https://github.com/apache/spark

commit 83c8f4f270f9d4a0f16b6be4915a48537b79d2db
Author: 郭小龙 10207633 
Date:   2017-04-10T12:02:46Z

[SPARK-20279]In web ui,'Only showing 200' should be changed to 'only 
showing last 200' in the page of 'jobs' or'stages'.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17592: [SPARK-20243][TESTS] DebugFilesystem.assertNoOpenStreams...

2017-04-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17592
  
**[Test build #75661 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75661/testReport)**
 for PR 17592 at commit 
[`19ca5a1`](https://github.com/apache/spark/commit/19ca5a1114dbe666ac40f4c966f3e600b58f92c2).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17592: [SPARK-20243][TESTS] DebugFilesystem.assertNoOpen...

2017-04-10 Thread bogdanrdc
GitHub user bogdanrdc opened a pull request:

https://github.com/apache/spark/pull/17592

[SPARK-20243][TESTS] DebugFilesystem.assertNoOpenStreams thread race

## What changes were proposed in this pull request?

Synchronize access to openStreams map.

## How was this patch tested?

Existing tests.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/bogdanrdc/spark SPARK-20243

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17592.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17592


commit 19ca5a1114dbe666ac40f4c966f3e600b58f92c2
Author: Bogdan Raducanu 
Date:   2017-04-10T12:08:55Z

fix by synchronizing access to the stream map




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17591: [SPARK-20280][CORE] FileStatusCache Weigher integer over...

2017-04-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17591
  
**[Test build #75660 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75660/testReport)**
 for PR 17591 at commit 
[`f4ae52a`](https://github.com/apache/spark/commit/f4ae52a0c60b9d2a086e4f1ad0bb3675bebc47da).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17591: [SPARK-20280][CORE] FileStatusCache Weigher integ...

2017-04-10 Thread bogdanrdc
GitHub user bogdanrdc opened a pull request:

https://github.com/apache/spark/pull/17591

[SPARK-20280][CORE] FileStatusCache Weigher integer overflow

## What changes were proposed in this pull request?

Weigher.weigh needs to return Int, but it is possible for an Array[FileStatus] to have an estimated size > Int.MaxValue. To avoid this, the size is scaled down by a factor of 32, and the maximumWeight of the cache is scaled down by the same factor.

## How was this patch tested?
New test in FileIndexSuite


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/bogdanrdc/spark SPARK-20280

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17591.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17591


commit f4ae52a0c60b9d2a086e4f1ad0bb3675bebc47da
Author: Bogdan Raducanu 
Date:   2017-04-10T11:52:41Z

fix + test




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17586: [SPARK-20249][ML][PYSPARK] Add summary for LinearSVCMode...

2017-04-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17586
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75655/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17586: [SPARK-20249][ML][PYSPARK] Add summary for LinearSVCMode...

2017-04-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17586
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17586: [SPARK-20249][ML][PYSPARK] Add summary for LinearSVCMode...

2017-04-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17586
  
**[Test build #75655 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75655/testReport)**
 for PR 17586 at commit 
[`0daa041`](https://github.com/apache/spark/commit/0daa0410af0b7a4968e852af7fbba6ebb2cb9064).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `case class LinearSVCTrainingSummary(`
  * `class LinearSVCTrainingSummary(JavaWrapper):`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for...

2017-04-10 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17077#discussion_r110634385
  
--- Diff: python/pyspark/sql/tests.py ---
@@ -2167,6 +2167,61 @@ def test_BinaryType_serialization(self):
 df = self.spark.createDataFrame(data, schema=schema)
 df.collect()
 
+def test_bucketed_write(self):
+data = [
+(1, "foo", 3.0), (2, "foo", 5.0),
+(3, "bar", -1.0), (4, "bar", 6.0),
+]
+df = self.spark.createDataFrame(data, ["x", "y", "z"])
+
+# Test write with one bucketing column
+df.write.bucketBy(3, 
"x").mode("overwrite").saveAsTable("pyspark_bucket")
+self.assertEqual(
+len([c for c in 
self.spark.catalog.listColumns("pyspark_bucket")
+ if c.name == "x" and c.isBucket]),
+1
+)
+self.assertSetEqual(set(data), 
set(self.spark.table("pyspark_bucket").collect()))
+
+# Test write two bucketing columns
+df.write.bucketBy(3, "x", 
"y").mode("overwrite").saveAsTable("pyspark_bucket")
+self.assertEqual(
+len([c for c in 
self.spark.catalog.listColumns("pyspark_bucket")
+ if c.name in ("x", "y") and c.isBucket]),
+2
+)
+self.assertSetEqual(set(data), 
set(self.spark.table("pyspark_bucket").collect()))
+
+# Test write with bucket and sort
+df.write.bucketBy(2, 
"x").sortBy("z").mode("overwrite").saveAsTable("pyspark_bucket")
+self.assertEqual(
+len([c for c in 
self.spark.catalog.listColumns("pyspark_bucket")
+ if c.name == "x" and c.isBucket]),
+1
+)
+self.assertSetEqual(set(data), 
set(self.spark.table("pyspark_bucket").collect()))
+
+# Test write with a list of columns
+df.write.bucketBy(3, ["x", 
"y"]).mode("overwrite").saveAsTable("pyspark_bucket")
+self.assertEqual(
+len([c for c in 
self.spark.catalog.listColumns("pyspark_bucket")
+ if c.name in ("x", "y") and c.isBucket]),
+2
+)
+self.assertSetEqual(set(data), 
set(self.spark.table("pyspark_bucket").collect()))
+
+# Test write with bucket and sort with a list of columns
+(df.write.bucketBy(2, "x")
+.sortBy(["y", "z"])
+.mode("overwrite").saveAsTable("pyspark_bucket"))
+self.assertSetEqual(set(data), 
set(self.spark.table("pyspark_bucket").collect()))
+
+# Test write with bucket and sort with multiple columns
+(df.write.bucketBy(2, "x")
+.sortBy("y", "z")
+.mode("overwrite").saveAsTable("pyspark_bucket"))
--- End diff --

@zero323, should we drop the table before or after this test?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for...

2017-04-10 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17077#discussion_r110632132
  
--- Diff: python/pyspark/sql/tests.py ---
@@ -2167,6 +2167,61 @@ def test_BinaryType_serialization(self):
 df = self.spark.createDataFrame(data, schema=schema)
 df.collect()
 
+def test_bucketed_write(self):
+data = [
+(1, "foo", 3.0), (2, "foo", 5.0),
+(3, "bar", -1.0), (4, "bar", 6.0),
+]
+df = self.spark.createDataFrame(data, ["x", "y", "z"])
+
+# Test write with one bucketing column
+df.write.bucketBy(3, 
"x").mode("overwrite").saveAsTable("pyspark_bucket")
+self.assertEqual(
+len([c for c in 
self.spark.catalog.listColumns("pyspark_bucket")
+ if c.name == "x" and c.isBucket]),
+1
+)
+self.assertSetEqual(set(data), 
set(self.spark.table("pyspark_bucket").collect()))
+
+# Test write two bucketing columns
+df.write.bucketBy(3, "x", 
"y").mode("overwrite").saveAsTable("pyspark_bucket")
+self.assertEqual(
+len([c for c in 
self.spark.catalog.listColumns("pyspark_bucket")
+ if c.name in ("x", "y") and c.isBucket]),
--- End diff --

@zero323, I am sorry. What do you think about something like the one below?

```python
cols = self.spark.catalog.listColumns("pyspark_bucket")
num = len([c for c in cols if c.name in ("x", "y") and c.isBucket])
self.assertEqual(num, 2)
```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for...

2017-04-10 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17077#discussion_r110630404
  
--- Diff: python/pyspark/sql/readwriter.py ---
@@ -545,6 +545,57 @@ def partitionBy(self, *cols):
 self._jwrite = self._jwrite.partitionBy(_to_seq(self._spark._sc, 
cols))
 return self
 
+@since(2.2)
+def bucketBy(self, numBuckets, *cols):
+"""Buckets the output by the given columns on the file system.
--- End diff --

I think just copying it from the Scala doc is good enough to avoid the overhead of sweeping the documentation when we start to support other operations later.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for bucket...

2017-04-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17077
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75659/
Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for bucket...

2017-04-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17077
  
**[Test build #75659 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75659/testReport)**
 for PR 17077 at commit 
[`845ee87`](https://github.com/apache/spark/commit/845ee8783c54123c743e176def11af7455192d42).
 * This patch **fails Python style tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for bucket...

2017-04-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17077
  
Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for bucket...

2017-04-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17077
  
**[Test build #75659 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75659/testReport)**
 for PR 17077 at commit 
[`845ee87`](https://github.com/apache/spark/commit/845ee8783c54123c743e176def11af7455192d42).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17556: [SPARK-16957][MLlib] Use weighted midpoints for split va...

2017-04-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17556
  
**[Test build #3654 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3654/testReport)**
 for PR 17556 at commit 
[`9ca5750`](https://github.com/apache/spark/commit/9ca57505c8211954478a2d54ced48c2561cfb9f9).
 * This patch **fails PySpark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17590: [SPARK-20278][R] Disable 'multiple_dots_linter' lint rul...

2017-04-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17590
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75656/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17590: [SPARK-20278][R] Disable 'multiple_dots_linter' lint rul...

2017-04-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17590
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17590: [SPARK-20278][R] Disable 'multiple_dots_linter' lint rul...

2017-04-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17590
  
**[Test build #75656 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75656/testReport)**
 for PR 17590 at commit 
[`ed0dd86`](https://github.com/apache/spark/commit/ed0dd869a24bc10cfce7dacc8b4b6d57c3ced6de).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17580: [20269][Structured Streaming] add class 'JavaWordCountPr...

2017-04-10 Thread guoxiaolongzte
Github user guoxiaolongzte commented on the issue:

https://github.com/apache/spark/pull/17580
  
@jerryshao 

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
The above three classes are in kafka-clients-0.10.0.1.jar. The example here will not become obsolete.
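
A small sketch using those three kafka-clients classes (the broker address and topic name below are placeholders, not taken from the PR):

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}

val props = new Properties()
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
  "org.apache.kafka.common.serialization.StringSerializer")
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
  "org.apache.kafka.common.serialization.StringSerializer")

val producer = new KafkaProducer[String, String](props)
producer.send(new ProducerRecord[String, String]("wordcount-input", "hello world"))
producer.close()
```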





---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for bucket...

2017-04-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17077
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75658/
Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for bucket...

2017-04-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17077
  
**[Test build #75658 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75658/testReport)**
 for PR 17077 at commit 
[`7b93482`](https://github.com/apache/spark/commit/7b93482f31f2efb3d4d742eb3e385e6b4a2bc14e).
 * This patch **fails Python style tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for bucket...

2017-04-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17077
  
Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for bucket...

2017-04-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17077
  
**[Test build #75658 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75658/testReport)**
 for PR 17077 at commit 
[`7b93482`](https://github.com/apache/spark/commit/7b93482f31f2efb3d4d742eb3e385e6b4a2bc14e).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17556: [SPARK-16957][MLlib] Use weighted midpoints for split va...

2017-04-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17556
  
**[Test build #3654 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3654/testReport)**
 for PR 17556 at commit 
[`9ca5750`](https://github.com/apache/spark/commit/9ca57505c8211954478a2d54ced48c2561cfb9f9).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17586: [SPARK-20249][ML][PYSPARK] Add summary for LinearSVCMode...

2017-04-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17586
  
**[Test build #75655 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75655/testReport)**
 for PR 17586 at commit 
[`0daa041`](https://github.com/apache/spark/commit/0daa0410af0b7a4968e852af7fbba6ebb2cb9064).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17590: [SPARK-20278][R] Disable 'multiple_dots_linter' lint rul...

2017-04-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17590
  
**[Test build #75656 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75656/testReport)**
 for PR 17590 at commit 
[`ed0dd86`](https://github.com/apache/spark/commit/ed0dd869a24bc10cfce7dacc8b4b6d57c3ced6de).





[GitHub] spark issue #17527: [SPARK-20156][CORE][SQL][STREAMING][MLLIB] Java String t...

2017-04-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17527
  
**[Test build #75657 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75657/testReport)**
 for PR 17527 at commit 
[`2ac5843`](https://github.com/apache/spark/commit/2ac5843a071847dbe6e8cea08b49a7ef36587101).





[GitHub] spark issue #17330: [SPARK-19993][SQL] Caching logical plans containing subq...

2017-04-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17330
  
**[Test build #75654 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75654/testReport)**
 for PR 17330 at commit 
[`22db44a`](https://github.com/apache/spark/commit/22db44addbe1892b52353fa22bd9b65952e8bdf5).





[GitHub] spark issue #17588: [SPARK-20275][UI] Do not display "Completed" column for ...

2017-04-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17588
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75651/
Test PASSed.





[GitHub] spark issue #17588: [SPARK-20275][UI] Do not display "Completed" column for ...

2017-04-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17588
  
Merged build finished. Test PASSed.





[GitHub] spark issue #17588: [SPARK-20275][UI] Do not display "Completed" column for ...

2017-04-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17588
  
**[Test build #75651 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75651/testReport)**
 for PR 17588 at commit 
[`6828315`](https://github.com/apache/spark/commit/682831509e5a80555842d400af5c4a909c735414).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for...

2017-04-10 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17077#discussion_r110626138
  
--- Diff: python/pyspark/sql/tests.py ---
@@ -2038,6 +2038,61 @@ def test_BinaryType_serialization(self):
 df = self.spark.createDataFrame(data, schema=schema)
 df.collect()
 
+def test_bucketed_write(self):
+data = [
+(1, "foo", 3.0), (2, "foo", 5.0),
+(3, "bar", -1.0), (4, "bar", 6.0),
+]
+df = self.spark.createDataFrame(data, ["x", "y", "z"])
+
+# Test write with one bucketing column
+df.write.bucketBy(3, 
"x").mode("overwrite").saveAsTable("pyspark_bucket")
+self.assertEqual(
+len([c for c in 
self.spark.catalog.listColumns("pyspark_bucket")
+ if c.name == "x" and c.isBucket]),
--- End diff --

Thanks for taking a look at the related ones and trying it out.
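
For context, a minimal Scala sketch of the DataFrameWriter calls that the Python `bucketBy`/`sortBy` wrapper maps onto; the table name and sample data are only illustrative:

```scala
import org.apache.spark.sql.SparkSession

object BucketedWriteSketch {
  def main(args: Array[String]): Unit = {
    // Local session so the sketch is runnable as-is.
    val spark = SparkSession.builder()
      .appName("bucketed-write-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val df = Seq((1, "foo", 3.0), (2, "foo", 5.0), (3, "bar", -1.0), (4, "bar", 6.0))
      .toDF("x", "y", "z")

    // Bucket by column "x", sort rows within each bucket, and persist as a managed table.
    df.write
      .bucketBy(2, "x")
      .sortBy("y", "z")
      .mode("overwrite")
      .saveAsTable("bucketed_write_sketch")

    spark.stop()
  }
}
```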





[GitHub] spark pull request #17308: [SPARK-19968][SS] Use a cached instance of `Kafka...

2017-04-10 Thread ScrapCodes
Github user ScrapCodes commented on a diff in the pull request:

https://github.com/apache/spark/pull/17308#discussion_r110625538
  
--- Diff: 
external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/CachedKafkaProducer.scala
 ---
@@ -0,0 +1,70 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.kafka010
+
+import java.{util => ju}
+
+import scala.collection.mutable
+
+import org.apache.kafka.clients.producer.KafkaProducer
+
+import org.apache.spark.internal.Logging
+
+private[kafka010] object CachedKafkaProducer extends Logging {
+
+  private val cacheMap = new mutable.HashMap[Int, 
KafkaProducer[Array[Byte], Array[Byte]]]()
+
+  private def createKafkaProducer(
+producerConfiguration: ju.HashMap[String, Object]): 
KafkaProducer[Array[Byte], Array[Byte]] = {
+val kafkaProducer: KafkaProducer[Array[Byte], Array[Byte]] =
+  new KafkaProducer[Array[Byte], Array[Byte]](producerConfiguration)
+cacheMap.put(producerConfiguration.hashCode(), kafkaProducer)
--- End diff --

It is not a good idea to do it like that.

I would like my understanding to be corrected if it is wrong: as far as I understand, in this particular case Spark does not let the user specify a key or value serializer/deserializer, so `Object` can only be a String, Int, or Long, and for those types `hashCode` works correctly. I am also considering a better way to do it now.
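
To make the concern concrete, a minimal sketch (an assumption about one possible alternative, not the code in this PR) that keys the cache on the full, sorted configuration rather than on its `hashCode`, so two distinct configurations cannot collide:

```scala
import java.{util => ju}

import scala.collection.JavaConverters._
import scala.collection.mutable

import org.apache.kafka.clients.producer.KafkaProducer

object CachedKafkaProducerSketch {

  // The key is the full configuration, normalized to a sorted sequence of entries,
  // so lookup equality is structural instead of relying on hashCode alone.
  private type ConfigKey = Seq[(String, Object)]

  private val cacheMap =
    new mutable.HashMap[ConfigKey, KafkaProducer[Array[Byte], Array[Byte]]]()

  private def toKey(config: ju.HashMap[String, Object]): ConfigKey =
    config.asScala.toSeq.sortBy(_._1)

  // Reuse an existing producer for an identical configuration, otherwise create and cache one.
  def getOrCreate(config: ju.HashMap[String, Object]): KafkaProducer[Array[Byte], Array[Byte]] =
    cacheMap.getOrElseUpdate(toKey(config), new KafkaProducer[Array[Byte], Array[Byte]](config))
}
```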





[GitHub] spark pull request #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for...

2017-04-10 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17077#discussion_r110624985
  
--- Diff: python/pyspark/sql/tests.py ---
@@ -2038,6 +2038,61 @@ def test_BinaryType_serialization(self):
 df = self.spark.createDataFrame(data, schema=schema)
 df.collect()
 
+def test_bucketed_write(self):
+data = [
+(1, "foo", 3.0), (2, "foo", 5.0),
+(3, "bar", -1.0), (4, "bar", 6.0),
+]
+df = self.spark.createDataFrame(data, ["x", "y", "z"])
+
+# Test write with one bucketing column
+df.write.bucketBy(3, 
"x").mode("overwrite").saveAsTable("pyspark_bucket")
+self.assertEqual(
+len([c for c in 
self.spark.catalog.listColumns("pyspark_bucket")
+ if c.name == "x" and c.isBucket]),
--- End diff --

Yup, I won't argue for my personal preference. I am fine with it; I don't feel strongly either way.





[GitHub] spark issue #17580: [20269][Structured Streaming] add class 'JavaWordCountPr...

2017-04-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17580
  
Can one of the admins verify this patch?





[GitHub] spark pull request #17580: [20269][Structured Streaming] add class 'JavaWord...

2017-04-10 Thread guoxiaolongzte
GitHub user guoxiaolongzte reopened a pull request:

https://github.com/apache/spark/pull/17580

[20269][Structured Streaming] add class 'JavaWordCountProducer' to 'provide 
java word count producer'.

## What changes were proposed in this pull request?

1. When running the streaming Kafka examples, there is currently no Java word count producer, which is not conducive to Java developers learning and testing. This PR adds a class, JavaKafkaWordCountProducer.

2. When running the JavaKafkaWordCount example I found no Java word count producer, while the KafkaWordCount example does have a Scala word count producer. I think we should provide the corresponding example code to make it easier for Java developers to learn and test.

3. My project team develops Spark applications mostly with Java statements and the Java API.

4. Test cases:
  1) Java Kafka word count producer, generating data for the JavaKafkaWordCount example (the JavaKafkaWordCountProducer class is added by this PR; a minimal sketch of what such a producer does follows after the test cases):
  spark-submit ... --class 
org.apache.spark.examples.streaming.JavaKafkaWordCountProducer  --jars  
/home/gxl/spark/libext/kafka-clients-0.8.2.1.jar 
/home/gxl/spark/examples/jars/spark-examples_2.11-2.1.0.jar localhost:9092 
topic1 3 4

  2) Java Kafka streaming example:
spark-submit ... --class 
org.apache.spark.examples.streaming.JavaKafkaWordCount  --jars 
/home/gxl/kafka/kafka_2.11-0.10.2.0/libs/kafka-clients-0.10.2.0.jar --jars 
/home/gxl/spark/libext/kafka_2.11-0.8.2.1.jar 
/home/gxl/spark/examples/jars/spark-examples_2.11-2.1.0.jar localhost:2181 
topic1 topic1 2
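
For reference, a minimal Scala sketch of what such a word count producer does (the PR itself adds a Java class; the argument handling and random message format below are assumptions for illustration):

```scala
import java.util.Properties

import scala.util.Random

import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

// Continuously sends lines of random "words" to a Kafka topic so that the
// word count streaming example has data to consume.
object KafkaWordCountProducerSketch {
  def main(args: Array[String]): Unit = {
    // Expected arguments: <brokers> <topic> <messagesPerSec> <wordsPerMessage>
    val Array(brokers, topic, messagesPerSec, wordsPerMessage) = args

    val props = new Properties()
    props.put("bootstrap.servers", brokers)
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)
    while (true) {
      (1 to messagesPerSec.toInt).foreach { _ =>
        val line = Seq.fill(wordsPerMessage.toInt)(Random.nextInt(10).toString).mkString(" ")
        producer.send(new ProducerRecord[String, String](topic, line))
      }
      Thread.sleep(1000)
    }
  }
}
```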

## How was this patch tested?

manual tests



Please review http://spark.apache.org/contributing.html before opening a 
pull request.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/guoxiaolongzte/spark SPARK-20269

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17580.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17580


commit d383efba12c66addb17006dea107bb0421d50bc3
Author: 郭小龙 10207633 
Date:   2017-03-31T13:57:09Z

[SPARK-20177]Document about compression way has some little detail changes.

commit 3059013e9d2aec76def14eb314b6761bea0e7ca0
Author: 郭小龙 10207633 
Date:   2017-04-01T01:38:02Z

[SPARK-20177] event log add a space

commit 46bb1ad3ddd9fb55b5607ac4f20213a90186cfe9
Author: 郭小龙 10207633 
Date:   2017-04-05T03:16:50Z

Merge branch 'master' of https://github.com/apache/spark into SPARK-20177

commit 0efb0dd9e404229cce638fe3fb0c966276784df7
Author: 郭小龙 10207633 
Date:   2017-04-05T03:47:53Z

[SPARK-20218]'/applications/[app-id]/stages' in REST API,add description.

commit 0e37fdeee28e31fc97436dabd001d3c85c5a7794
Author: 郭小龙 10207633 
Date:   2017-04-05T05:22:54Z

[SPARK-20218] '/applications/[app-id]/stages/[stage-id]' in REST API,remove 
redundant description.

commit 52641bb01e55b48bd9e8579fea217439d14c7dc7
Author: 郭小龙 10207633 
Date:   2017-04-07T06:24:58Z

Merge branch 'SPARK-20218'

commit d3977c9cab0722d279e3fae7aacbd4eb944c22f6
Author: 郭小龙 10207633 
Date:   2017-04-08T07:13:02Z

Merge branch 'master' of https://github.com/apache/spark

commit 1279041f17f1f5d02c9aada5fb5af9c7d58f2423
Author: asmith26 
Date:   2017-04-09T06:47:23Z

[MINOR] Issue: Change "slice" vs "partition" in exception messages (and 
code?)

## What changes were proposed in this pull request?

I came across the term "slice" when running some Spark Scala code. A subsequent Google search indicated that "slices" and "partitions" refer to the same thing; see:

- [This issue](https://issues.apache.org/jira/browse/SPARK-1701)
- [This pull request](https://github.com/apache/spark/pull/2305)
- [This StackOverflow 
answer](http://stackoverflow.com/questions/23436640/what-is-the-difference-between-an-rdd-partition-and-a-slice)
 and [this 
one](http://stackoverflow.com/questions/24269495/what-are-the-differences-between-slices-and-partitions-of-rdds)

Thus this pull request fixes the occurrence of "slice" I came across. Nonetheless, [it would 
appear](https://github.com/apache/spark/search?utf8=%E2%9C%93=slice=) 
there are still many references to "slice/slices", so I thought I'd raise this pull request to 
address the issue (sorry if this is the wrong place, I'm not too familiar with raising Apache issues).
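
As a small illustration of where the old terminology still surfaces (a self-contained sketch; the app name is arbitrary):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("slices-sketch").setMaster("local[*]"))
// "numSlices" is the historical parameter name; each slice becomes one RDD partition.
val rdd = sc.parallelize(1 to 100, numSlices = 4)
assert(rdd.getNumPartitions == 4)
sc.stop()
```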

## How was this patch tested?

(Not tested locally - only a minor exception message change.)

Please review http://spark.apache.org/contributing.html before opening a pull request.

[GitHub] spark issue #17589: [SPARK-16544][SQL] Support for conversion from numeric c...

2017-04-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17589
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75653/
Test PASSed.





[GitHub] spark issue #17589: [SPARK-16544][SQL] Support for conversion from numeric c...

2017-04-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17589
  
Merged build finished. Test PASSed.




