[GitHub] spark issue #13762: [SPARK-14926] [ML] OneVsRest labelMetadata uses incorrec...

2016-09-12 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/13762
  
I think we should close this PR.





[GitHub] spark issue #15052: [SPARK-17500][PySpark]Make DiskBytesSpilled metric in Py...

2016-09-12 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/15052
  
I get that, but if it's always true, then there was no problem to begin 
with. That's what the code seems to think right now. I haven't looked at the 
code much but that's the question -- are you sure the files are non-empty in 
some scenario?





[GitHub] spark issue #14638: [SPARK-11374][SQL] Support `skip.header.line.count` opti...

2016-09-12 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14638
  
**[Test build #65251 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65251/consoleFull)**
 for PR 14638 at commit 
[`257708a`](https://github.com/apache/spark/commit/257708af5a5c449781f57fe43d44df656eb75ba9).





[GitHub] spark issue #14623: [SPARK-17044][SQL] Make test files for window functions ...

2016-09-12 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14623
  
**[Test build #65252 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65252/consoleFull)**
 for PR 14623 at commit 
[`04fe12d`](https://github.com/apache/spark/commit/04fe12dcc7749dc97a13683440a437ced5a71345).





[GitHub] spark issue #14527: [SPARK-16938][SQL] `drop/dropDuplicate` should handle th...

2016-09-12 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14527
  
**[Test build #65248 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65248/consoleFull)**
 for PR 14527 at commit 
[`67ea924`](https://github.com/apache/spark/commit/67ea92448463c559ef9650eead719f69b5f3b51b).





[GitHub] spark issue #14961: [SPARK-17379] [BUILD] Upgrade netty-all to 4.0.41 final ...

2016-09-12 Thread a-roberts
Github user a-roberts commented on the issue:

https://github.com/apache/spark/pull/14961
  
Sean, yep, I've had trouble reproducing it too. I kicked off a bunch of 
builds over the weekend, including one using Hadoop 2.3, which was my initial 
theory (the only difference between our testing environments, apart from the 
options I mention below).

I'll add
```
  static {
System.setProperty("io.netty.recycler.maxCapacity", "0");
  }
```
in TransportConf then build and test locally before updating this.

FWIW I use these Java options for testing as our boxes have limited memory:

-Xss2048k **-Dspark.buffer.pageSize=1048576** -Xmx4g





[GitHub] spark issue #14116: [SPARK-16452][SQL] Support basic INFORMATION_SCHEMA

2016-09-12 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14116
  
**[Test build #65250 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65250/consoleFull)**
 for PR 14116 at commit 
[`c531025`](https://github.com/apache/spark/commit/c5310252d891486c6380a015ae43050a43faae61).





[GitHub] spark issue #14426: [SPARK-16475][SQL] Broadcast Hint for SQL Queries

2016-09-12 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14426
  
**[Test build #65249 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65249/consoleFull)**
 for PR 14426 at commit 
[`47d98e7`](https://github.com/apache/spark/commit/47d98e7e4f6e8985fd3b543d5e44d30ee381f604).





[GitHub] spark issue #15046: [SPARK-17492] [SQL] Fix Reading Cataloged Data Sources w...

2016-09-12 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/15046
  
Ah, good catch! But adding a new flag looks a little tricky; let me think 
about whether there is a better way to fix it.





[GitHub] spark issue #15052: [SPARK-17500][PySpark]Make DiskBytesSpilled metric in Py...

2016-09-12 Thread djvulee
Github user djvulee commented on the issue:

https://github.com/apache/spark/pull/15052
  
@srowen No. It does not matter whether the file is empty or not: if the 
file is empty, `getsize()` just returns 0, and that should be OK.





[GitHub] spark issue #15023: Backport [SPARK-5847] Allow for configuring MetricsSyste...

2016-09-12 Thread AnthonyTruchet
Github user AnthonyTruchet commented on the issue:

https://github.com/apache/spark/pull/15023
  
I'm aware that features are not generally back-ported. The point is, for us 
this is a bug, preventing a deployment in production. We thus back-ported the 
fix internally and now propose to share it with the community as the work is 
already done.





[GitHub] spark issue #15056: [SPARK-17503][Core] Fix memory leak in Memory store when...

2016-09-12 Thread clockfly
Github user clockfly commented on the issue:

https://github.com/apache/spark/pull/15056
  
@yhuai 





[GitHub] spark issue #15040: [WIP] [SPARK-17487] [SQL] Configurable bucketing info ex...

2016-09-12 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/15040
  
`BucketingInfoExtractor` may be too flexible a concept; we only need a 
boolean flag to indicate whether it's Spark-native bucketing or Hive 
bucketing, and I'm not sure how soon we will need to support bucketed tables 
from other systems.
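
A minimal sketch of the boolean-flag alternative, assuming a simple bucket-spec case class; all names below are hypothetical, not from the PR:

```scala
object BucketingFlagSketch {
  // Hypothetical sketch only: a single flag on the bucket spec distinguishes
  // the two known bucketing schemes, instead of a pluggable extractor.
  case class BucketSpecSketch(
      numBuckets: Int,
      bucketColumnNames: Seq[String],
      isHiveBucketing: Boolean)  // false => Spark-native bucketing

  // Downstream code branches on the flag rather than consulting an extractor.
  def bucketHashDescription(spec: BucketSpecSketch): String =
    if (spec.isHiveBucketing) "hive-hash" else "murmur3-hash"
}
```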





[GitHub] spark issue #14995: [Test Only][SPARK-6235][CORE]Address various 2G limits

2016-09-12 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14995
  
**[Test build #65247 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65247/consoleFull)**
 for PR 14995 at commit 
[`b31fbcd`](https://github.com/apache/spark/commit/b31fbcdb7c93b1badb0e67a509dee17445659b16).





[GitHub] spark issue #15052: [SPARK-17500][PySpark]Make DiskBytesSpilled metric in Py...

2016-09-12 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/15052
  
Is the idea that the file may be non-empty when written?
There is at least one more instance of this call, but maybe the file is 
known to be empty beforehand.





[GitHub] spark pull request #15047: [SPARK-17495] [SQL] Add Hash capability semantica...

2016-09-12 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/15047#discussion_r78331887
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveHash.scala 
---
@@ -0,0 +1,145 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hive
+
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.util.{ArrayData, MapData}
+import org.apache.spark.sql.types._
+import org.apache.spark.unsafe.Platform
+import org.apache.spark.unsafe.types.UTF8String
+
+/**
+ * Simulates Hive's hashing function at
+ * 
org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorUtils#hashcode() 
in Hive
+ *
+ * We should use this hash function for both shuffle and bucket of Hive 
tables, so that
+ * we can guarantee shuffle and bucketing have same data distribution
+ *
+ * TODO: Support Decimal and date related types
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(a1, a2, ...) - Returns a hash value of the arguments.")
+case class HiveHash(children: Seq[Expression], seed: Int) extends 
HashExpression[Int] {
+  def this(arguments: Seq[Expression]) = this(arguments, 42)
+
+  override def dataType: DataType = IntegerType
+
+  override def prettyName: String = "hive-hash"
+
+  override protected def hasherClassName: String = 
classOf[HiveHash].getName
+
+  override protected def computeHash(value: Any, dataType: DataType, seed: 
Int): Int = {
+HiveHashFunction.hash(value, dataType, seed).toInt
+  }
+}
+
+object HiveHashFunction extends InterpretedHashFunction {
+  override protected def hashInt(i: Int, seed: Long): Long = {
+HiveHasher.hashInt(i, seed.toInt)
+  }
+
+  override protected def hashLong(l: Long, seed: Long): Long = {
+HiveHasher.hashLong(l, seed.toInt)
+  }
+
+  override protected def hashUnsafeBytes(base: AnyRef, offset: Long, len: 
Int, seed: Long): Long = {
+HiveHasher.hashUnsafeBytes(base, offset, len, seed.toInt)
+  }
+
+  override def hash(value: Any, dataType: DataType, seed: Long): Long = {
+value match {
+  case s: UTF8String =>
+val bytes = s.getBytes
+var result: Int = 0
+var i = 0
+while (i < bytes.length) {
+  result = (result * 31) + bytes(i).toInt
+  i += 1
+}
+result
+
+
+  case array: ArrayData =>
+val elementType = dataType match {
+  case udt: UserDefinedType[_] => 
udt.sqlType.asInstanceOf[ArrayType].elementType
--- End diff --

The caller of `hash` guarantees that the value matches the data type. So in 
this branch, if the value is `ArrayData`, the data type must be `ArrayType` 
or a UDT of `ArrayType`.





[GitHub] spark pull request #15048: [SPARK-17409] [SQL] Do Not Optimize Query in CTAS...

2016-09-12 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/15048#discussion_r78331097
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/rules.scala 
---
@@ -68,7 +68,7 @@ class ResolveDataSource(sparkSession: SparkSession) 
extends Rule[LogicalPlan] {
 /**
  * Preprocess some DDL plans, e.g. [[CreateTable]], to do some 
normalization and checking.
--- End diff --

We should update the comments to say that this rule will also analyze the 
query. (We may also want to update the rule name.)





[GitHub] spark issue #15052: [SPARK-17500][PySpark]Make DiskBytesSpilled metric in Py...

2016-09-12 Thread djvulee
Github user djvulee commented on the issue:

https://github.com/apache/spark/pull/15052
  
@srowen I updated the PR to use an incremental approach for updating the 
DiskBytesSpilled metric.





[GitHub] spark pull request #14988: [SPARK-17425][SQL] Override sameResult in HiveTab...

2016-09-12 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/14988#discussion_r78330099
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveTableScanExec.scala
 ---
@@ -164,4 +164,28 @@ case class HiveTableScanExec(
   }
 
   override def output: Seq[Attribute] = attributes
+
+  override def sameResult(plan: SparkPlan): Boolean = plan match {
+case other: HiveTableScanExec =>
+  val thisRequestedAttributes = 
requestedAttributes.map(cleanExpression)
+  val otherRequestedAttributes = 
other.requestedAttributes.map(cleanExpression)
+
+  val result = partitionPruningPred == other.partitionPruningPred &&
+relation.sameResult(other.relation) &&
+  thisRequestedAttributes.zip(otherRequestedAttributes)
+.forall(p => p._1.semanticEquals(p._2))
+  result
+case _ => false
+  }
+
+  private def cleanExpression(e: Attribute): Expression = e match {
+case a: AttributeReference =>
+  // As the root of the expression, Alias will always take an 
arbitrary exprId, we need
--- End diff --

this comment doesn't match the code. Can you explain more about why the 
default `cleanExpression` doesn't work?





[GitHub] spark issue #15052: [SPARK-17500][PySpark]Make DiskBytesSpilled metric in Py...

2016-09-12 Thread djvulee
Github user djvulee commented on the issue:

https://github.com/apache/spark/pull/15052
  
@srowen  you are right, I will correct it soon.





[GitHub] spark issue #13513: [SPARK-15698][SQL][Streaming] Add the ability to remove ...

2016-09-12 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/13513
  
**[Test build #65246 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65246/consoleFull)**
 for PR 13513 at commit 
[`31340b5`](https://github.com/apache/spark/commit/31340b58ffa7c46c2d9666569d5694bb23cc6144).





[GitHub] spark issue #13513: [SPARK-15698][SQL][Streaming] Add the ability to remove ...

2016-09-12 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/13513
  
**[Test build #65245 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65245/consoleFull)**
 for PR 13513 at commit 
[`f179349`](https://github.com/apache/spark/commit/f1793498a9625dc8d31039cd8e9a684611dddf23).





[GitHub] spark issue #15055: [SPARK-17462][MLLIB]use VersionUtils to parse Spark vers...

2016-09-12 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15055
  
Merged build finished. Test PASSed.





[GitHub] spark issue #15055: [SPARK-17462][MLLIB]use VersionUtils to parse Spark vers...

2016-09-12 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15055
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/65241/
Test PASSed.





[GitHub] spark issue #15055: [SPARK-17462][MLLIB]use VersionUtils to parse Spark vers...

2016-09-12 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15055
  
**[Test build #65241 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65241/consoleFull)**
 for PR 15055 at commit 
[`72a87b0`](https://github.com/apache/spark/commit/72a87b0f8a386ff53903620c3cddee6744c73a96).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #13513: [SPARK-15698][SQL][Streaming] Add the ability to remove ...

2016-09-12 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/13513
  
@zsxwing , thanks a lot for your comments. I did several refactorings:

1. Abstracted and consolidated `FileStreamSinkLog` and `FileStreamSourceLog`; 
they now share the same code path for compaction.
2. Changed `FileStreamSourceLog` to use a JSON format instead of binary 
encoding, for compatibility and flexibility for future extension (see the 
sketch after this message).
3. Improved the logic for fetching all metadata logs: if a compact log 
exists, only the compact log is scanned.

Please help to review again, thanks a lot.
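
A rough, hypothetical sketch of what a JSON-encoded source-log entry might look like. The real `FileEntry` fields appear in the build output further down this thread; the field types and the json4s serialization details here are illustrative assumptions only:

```scala
import org.json4s.NoTypeHints
import org.json4s.jackson.Serialization

// Hypothetical shape of a source-log entry; field types are assumptions.
case class FileEntrySketch(path: String, timestamp: Long, action: String = "add")

object FileEntryJsonSketch {
  implicit val formats = Serialization.formats(NoTypeHints)

  def main(args: Array[String]): Unit = {
    val entry = FileEntrySketch("hdfs://ns/input/part-00000.txt", 1473660000000L)
    val json = Serialization.write(entry)
    // e.g. {"path":"hdfs://ns/input/part-00000.txt","timestamp":1473660000000,"action":"add"}
    val roundTripped = Serialization.read[FileEntrySketch](json)
    assert(roundTripped == entry)  // JSON is human-readable and extensible, unlike binary
  }
}
```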





[GitHub] spark issue #13513: [SPARK-15698][SQL][Streaming] Add the ability to remove ...

2016-09-12 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/13513
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/65244/
Test FAILed.





[GitHub] spark issue #13513: [SPARK-15698][SQL][Streaming] Add the ability to remove ...

2016-09-12 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/13513
  
**[Test build #65244 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65244/consoleFull)**
 for PR 13513 at commit 
[`c2aad87`](https://github.com/apache/spark/commit/c2aad87ba012c41a0f4ef6290401e6789f2c9ed6).
 * This patch **fails to build**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `  case class FileEntry(path: String, timestamp: Timestamp, action: 
String = ADD_ACTION)`
  * `  class FileStreamSourceLog(sparkSession: SparkSession, path: String)`





[GitHub] spark issue #13513: [SPARK-15698][SQL][Streaming] Add the ability to remove ...

2016-09-12 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/13513
  
Merged build finished. Test FAILed.





[GitHub] spark issue #15053: [Doc] improve python API docstrings

2016-09-12 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15053
  
Merged build finished. Test FAILed.





[GitHub] spark issue #15053: [Doc] improve python API docstrings

2016-09-12 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15053
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/65242/
Test FAILed.





[GitHub] spark issue #15053: [Doc] improve python API docstrings

2016-09-12 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15053
  
**[Test build #65242 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65242/consoleFull)**
 for PR 15053 at commit 
[`52240bc`](https://github.com/apache/spark/commit/52240bcf8df42dd454e874ce7640d7040c5cdad9).
 * This patch **fails PySpark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #13513: [SPARK-15698][SQL][Streaming] Add the ability to remove ...

2016-09-12 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/13513
  
**[Test build #65244 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65244/consoleFull)**
 for PR 13513 at commit 
[`c2aad87`](https://github.com/apache/spark/commit/c2aad87ba012c41a0f4ef6290401e6789f2c9ed6).





[GitHub] spark issue #15056: [SPARK-17503][Core] Fix memory leak in Memory store when...

2016-09-12 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15056
  
**[Test build #65243 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65243/consoleFull)**
 for PR 15056 at commit 
[`a9a4a8b`](https://github.com/apache/spark/commit/a9a4a8b23afc64d7e2d7426b92013442308a8ea3).





[GitHub] spark issue #12819: [SPARK-14077][ML] Refactor NaiveBayes to support weighte...

2016-09-12 Thread yanboliang
Github user yanboliang commented on the issue:

https://github.com/apache/spark/pull/12819
  
@zhengruifeng I saw your implementation switches the training process from 
RDD operations to Dataset operations with a UDAF. I think we should do some 
performance tests to verify there is no performance degradation. Otherwise, we 
can still use the RDD operations in this PR and update it to use Datasets later. 
Thanks!





[GitHub] spark pull request #15056: [SPARK-17503][Core] Fix memory leak in Memory sto...

2016-09-12 Thread clockfly
GitHub user clockfly opened a pull request:

https://github.com/apache/spark/pull/15056

[SPARK-17503][Core] Fix memory leak in Memory store when unable to cache 
the whole RDD

## What changes were proposed in this pull request?

   Memory store may throw an OutOfMemoryError when trying to cache a very big 
RDD that cannot fit in memory.
   ```
   scala> sc.parallelize(1 to 1000, 5).map(_ => new Array[Long](1000)).cache().count

   java.lang.OutOfMemoryError: Java heap space
     at $line14.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(<console>:24)
     at $line14.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(<console>:23)
     at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
     at scala.collection.Iterator$JoinIterator.next(Iterator.scala:232)
     at org.apache.spark.storage.memory.PartiallyUnrolledIterator.next(MemoryStore.scala:683)
     at org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:43)
     at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1684)
     at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1134)
     at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1134)
     at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1915)
     at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1915)
     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
     at org.apache.spark.scheduler.Task.run(Task.scala:86)
     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
     at java.lang.Thread.run(Thread.java:745)
   ```

Spark's MemoryStore uses a SizeTrackingVector as a temporary unrolling buffer 
to store all the input values it has read so far, before transferring the values 
to the cache. The problem is that when the input RDD is too big to cache, this 
temporary unrolling buffer is not garbage collected in time. Since the 
SizeTrackingVector can occupy all available storage memory, it may cause the 
executor JVM to run out of memory quickly.
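
A minimal sketch of the idea behind a fix, assuming the partially unrolled values are handed off through an iterator; the names and structure below are illustrative, not the actual patch:

```scala
// Illustrative only: once the partially-unrolled values have been consumed,
// drop the reference to the buffer-backed iterator so the underlying
// SizeTrackingVector becomes eligible for garbage collection.
class ReleasingIterator[T](unrolled: Iterator[T], rest: Iterator[T]) extends Iterator[T] {
  private[this] var unrolledIter: Iterator[T] = unrolled

  override def hasNext: Boolean = {
    if (unrolledIter != null && unrolledIter.hasNext) {
      true
    } else {
      // The unroll buffer is exhausted: release the reference before
      // draining the rest of the input.
      unrolledIter = null
      rest.hasNext
    }
  }

  override def next(): T =
    if (unrolledIter != null && unrolledIter.hasNext) unrolledIter.next() else rest.next()
}
```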

## How was this patch tested?

Unit test.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/clockfly/spark memory_store_leak

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/15056.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #15056


commit a9a4a8b23afc64d7e2d7426b92013442308a8ea3
Author: Sean Zhong 
Date:   2016-09-12T07:12:48Z

SPARK-17503: Fix memory leak in Memory store when unable to cache the whole 
RDD







[GitHub] spark issue #15054: [SPARK-17502] [SQL] Fix Multiple Bugs in DDL Statements ...

2016-09-12 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15054
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/65240/
Test FAILed.





[GitHub] spark issue #15054: [SPARK-17502] [SQL] Fix Multiple Bugs in DDL Statements ...

2016-09-12 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15054
  
Merged build finished. Test FAILed.





[GitHub] spark issue #15054: [SPARK-17502] [SQL] Fix Multiple Bugs in DDL Statements ...

2016-09-12 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15054
  
**[Test build #65240 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65240/consoleFull)**
 for PR 15054 at commit 
[`cc47c3e`](https://github.com/apache/spark/commit/cc47c3eb07a7628faa4277dd87c2837d01c4f175).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request #12819: [SPARK-14077][ML] Refactor NaiveBayes to support ...

2016-09-12 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/12819#discussion_r78325688
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/NaiveBayes.scala ---
@@ -109,10 +120,51 @@ class NaiveBayes @Since("1.5.0") (
 s" numClasses=$numClasses, but thresholds has length 
${$(thresholds).length}")
 }
 
-val oldDataset: RDD[OldLabeledPoint] =
-  extractLabeledPoints(dataset).map(OldLabeledPoint.fromML)
-val oldModel = OldNaiveBayes.train(oldDataset, $(smoothing), 
$(modelType))
-NaiveBayesModel.fromOld(oldModel, this)
+val numFeatures = 
dataset.select(col($(featuresCol))).head().getAs[Vector](0).size
+
+val wvsum = new WeightedVectorSum($(modelType), numFeatures)
+
+val w = if ($(weightCol).isEmpty) lit(1.0) else col($(weightCol))
+
+val aggregated =
+  dataset.select(col($(labelCol)).cast(DoubleType).as("label"), 
w.as("weight"),
+col($(featuresCol)).as("features"))
+.groupBy(col($(labelCol)))
+.agg(sum(col("weight")), wvsum(col("weight"), col("features")))
+.collect().map { row =>
+(row.getDouble(0), (row.getDouble(1), 
row.getAs[Vector](2).toDense))
+  }.sortBy(_._1)
--- End diff --

Do you have some performance test for switching to Dataset based operation?





[GitHub] spark pull request #12819: [SPARK-14077][ML] Refactor NaiveBayes to support ...

2016-09-12 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/12819#discussion_r78325579
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/NaiveBayes.scala ---
@@ -98,7 +99,17 @@ class NaiveBayes @Since("1.5.0") (
*/
   @Since("1.5.0")
   def setModelType(value: String): this.type = set(modelType, value)
-  setDefault(modelType -> OldNaiveBayes.Multinomial)
+  setDefault(modelType -> NaiveBayes.Multinomial)
+
+  /**
+   * Whether to over-/under-sample training instances according to the 
given weights in weightCol.
+   * If empty, all instances are treated equally (weight 1.0).
+   * Default is empty, so all instances have weight one.
+   * @group setParam
+   */
+  @Since("2.1.0")
+  def setWeightCol(value: String): this.type = set(weightCol, value)
+  setDefault(weightCol -> "")
--- End diff --

It's not necessary to set a default for ```weightCol```. You can refer to 
other places where ```weightCol``` is used.





[GitHub] spark issue #15000: [SPARK-17437] Add uiWebUrl to JavaSparkContext and pyspa...

2016-09-12 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/15000
  
My only hesitation about this is that this property really only exists to 
print it in the shell. Is there a good use case for it otherwise? I know it's 
minor but want to make sure we're not just doing this for parity.





[GitHub] spark issue #15053: [Doc] improve python API docstrings

2016-09-12 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15053
  
**[Test build #65242 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65242/consoleFull)**
 for PR 15053 at commit 
[`52240bc`](https://github.com/apache/spark/commit/52240bcf8df42dd454e874ce7640d7040c5cdad9).





[GitHub] spark issue #15053: [Doc] improve python API docstrings

2016-09-12 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/15053
  
Jenkins test this please





[GitHub] spark issue #15052: [SPARK-17500][PySpark]Make DiskBytesSpilled metric in Py...

2016-09-12 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/15052
  
Given how DiskBytesSpilled is used here, and is still used in other parts of 
the code, this doesn't look correct. It seems to be a global that is only ever 
incremented; here you effectively reset the value in certain cases.
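
A minimal illustration of the distinction, in Scala rather than the PySpark code under review (this is not Spark's implementation):

```scala
object SpillMetricSketch {
  // A global running total must be incremented, never assigned, or the
  // contributions from earlier spills are silently discarded.
  var diskBytesSpilled: Long = 0L

  def onSpillBuggy(newSpillFileSize: Long): Unit = {
    diskBytesSpilled = newSpillFileSize   // overwrites: earlier spills are lost
  }

  def onSpillFixed(newSpillFileSize: Long): Unit = {
    diskBytesSpilled += newSpillFileSize  // accumulates: the metric stays a running total
  }
}
```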





[GitHub] spark issue #15011: [SPARK-17122][SQL]support drop current database

2016-09-12 Thread adrian-wang
Github user adrian-wang commented on the issue:

https://github.com/apache/spark/pull/15011
  
@hvanhovell I have checked with Hive and MySQL; they both support dropping 
the current database. Asking the user to switch to another database before 
dropping the current one is not enough, though: if there are multiple users 
connected to the same metadata store, then even if you are not using a certain 
database, someone else may be using it. What's more, if you want to drop a 
database but lack the privilege to access databases created by other users, you 
will always leave one empty database behind.
In Spark's implementation, we ensure the database exists before we do 
anything, so dropping the current database is OK. This is also the approach 
other systems adopt.
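
For illustration, the behavior under discussion as it might look from a spark-shell session (the database name is hypothetical):

```scala
// Assuming a SparkSession named `spark`, as in spark-shell.
spark.sql("CREATE DATABASE tmp_db")
spark.sql("USE tmp_db")
// With this change, dropping the current database succeeds, as it does in
// Hive and MySQL; later commands that resolve against the dropped current
// database will then fail with a "database not found" style error.
spark.sql("DROP DATABASE tmp_db")
```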





[GitHub] spark issue #15055: [SPARK-17462][MLLIB]use VersionUtils to parse Spark vers...

2016-09-12 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15055
  
**[Test build #65241 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65241/consoleFull)**
 for PR 15055 at commit 
[`72a87b0`](https://github.com/apache/spark/commit/72a87b0f8a386ff53903620c3cddee6744c73a96).





[GitHub] spark pull request #15055: [SPARK-17462][MLLIB]use VersionUtils to parse Spa...

2016-09-12 Thread VinceShieh
GitHub user VinceShieh opened a pull request:

https://github.com/apache/spark/pull/15055

[SPARK-17462][MLLIB]use VersionUtils to parse Spark version strings

## What changes were proposed in this pull request?

Several places in MLlib use custom regexes or other ad hoc approaches to 
parse Spark version strings. Those should be fixed to use VersionUtils. This PR 
replaces the custom regexes with calls to VersionUtils to get Spark version 
numbers.
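
For example, a sketch of the replacement pattern, assuming `org.apache.spark.util.VersionUtils` as adopted by this PR:

```scala
import org.apache.spark.util.VersionUtils

// Instead of a hand-rolled regex, use the shared utility to extract
// the major/minor components from a version string.
val (major, minor) = VersionUtils.majorMinorVersion("2.1.0-SNAPSHOT")  // (2, 1)
```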

## How was this patch tested?

Existing tests.

Signed-off-by: VinceShieh 

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/VinceShieh/spark SPARK-17462

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/15055.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #15055


commit 72a87b0f8a386ff53903620c3cddee6744c73a96
Author: VinceShieh 
Date:   2016-09-12T06:37:42Z

[SPARK-17462][MLLIB]use VersionUtils to parse Spark version strings

Several places in MLlib use custom regexes or other approaches. Those 
should be
fixed to use the VersionUtils

Signed-off-by: VinceShieh 







[GitHub] spark pull request #14834: [SPARK-17163][ML] Unified LogisticRegression inte...

2016-09-12 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/14834#discussion_r78321887
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -460,33 +577,74 @@ class LogisticRegression @Since("1.2.0") (
as a result, no scaling is needed.
  */
 val rawCoefficients = state.x.toArray.clone()
-var i = 0
-while (i < numFeatures) {
-  rawCoefficients(i) *= { if (featuresStd(i) != 0.0) 1.0 / 
featuresStd(i) else 0.0 }
-  i += 1
+val coefficientArray = Array.tabulate(numCoefficientSets * 
numFeatures) { i =>
+  // flatIndex will loop though rawCoefficients, and skip the 
intercept terms.
+  val flatIndex = if ($(fitIntercept)) i + i / numFeatures else i
+  val featureIndex = i % numFeatures
+  if (featuresStd(featureIndex) != 0.0) {
+rawCoefficients(flatIndex) / featuresStd(featureIndex)
+  } else {
+0.0
+  }
+}
+val coefficientMatrix =
+  new DenseMatrix(numCoefficientSets, numFeatures, 
coefficientArray, isTransposed = true)
+
+if ($(regParam) == 0.0 && isMultinomial) {
+  /*
+When no regularization is applied, the coefficients lack 
identifiability because
+we do not use a pivot class. We can add any constant value to 
the coefficients and
+get the same likelihood. So here, we choose the mean centered 
coefficients for
+reproducibility. This method follows the approach in glmnet, 
described here:
+
+Friedman, et al. "Regularization Paths for Generalized Linear 
Models via
+  Coordinate Descent," 
https://core.ac.uk/download/files/153/6287975.pdf
+   */
+  val coefficientMean = coefficientMatrix.values.sum / 
coefficientMatrix.values.length
+  coefficientMatrix.update(_ - coefficientMean)
 }
-bcFeaturesStd.destroy(blocking = false)
 
-if ($(fitIntercept)) {
-  (Vectors.dense(rawCoefficients.dropRight(1)).compressed, 
rawCoefficients.last,
-arrayBuilder.result())
+val interceptsArray: Array[Double] = if ($(fitIntercept)) {
+  Array.tabulate(numCoefficientSets) { i =>
+val coefIndex = (i + 1) * numFeaturesPlusIntercept - 1
+rawCoefficients(coefIndex)
+  }
+} else {
+  Array[Double]()
+}
+/*
+  The intercepts are never regularized, so we always center the 
mean.
+ */
+val interceptVector = if (interceptsArray.nonEmpty && 
isMultinomial) {
+  val interceptMean = interceptsArray.sum / numClasses
+  interceptsArray.indices.foreach { i => interceptsArray(i) -= 
interceptMean }
+  Vectors.dense(interceptsArray)
+} else if (interceptsArray.length == 1) {
+  Vectors.dense(interceptsArray)
 } else {
-  (Vectors.dense(rawCoefficients).compressed, 0.0, 
arrayBuilder.result())
+  Vectors.sparse(numCoefficientSets, Seq())
 }
+(coefficientMatrix, interceptVector, arrayBuilder.result())
--- End diff --

Should we implement `coefficientMatrix.compressed` before merging this? 
Otherwise, this can be a regression.





[GitHub] spark issue #13758: [SPARK-16043][SQL] Prepare GenericArrayData implementati...

2016-09-12 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/13758
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/65239/
Test FAILed.





[GitHub] spark pull request #14834: [SPARK-17163][ML] Unified LogisticRegression inte...

2016-09-12 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/14834#discussion_r78321247
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -323,32 +382,33 @@ class LogisticRegression @Since("1.2.0") (
 instr.logNumClasses(numClasses)
 instr.logNumFeatures(numFeatures)
 
-val (coefficients, intercept, objectiveHistory) = {
+val (coefficientMatrix, interceptVector, objectiveHistory) = {
   if (numInvalid != 0) {
     val msg = s"Classification labels should be in [0 to ${numClasses - 1}]. " +
       s"Found $numInvalid invalid labels."
     logError(msg)
     throw new SparkException(msg)
   }
 
-  val isConstantLabel = histogram.count(_ != 0) == 1
+  val isConstantLabel = histogram.count(_ != 0.0) == 1
 
-  if (numClasses > 2) {
-    val msg = s"LogisticRegression with ElasticNet in ML package only supports " +
-      s"binary classification. Found $numClasses in the input dataset. Consider using " +
-      s"MultinomialLogisticRegression instead."
-    logError(msg)
-    throw new SparkException(msg)
-  } else if ($(fitIntercept) && numClasses == 2 && isConstantLabel) {
-    logWarning(s"All labels are one and fitIntercept=true, so the coefficients will be " +
-      s"zeros and the intercept will be positive infinity; as a result, " +
-      s"training is not needed.")
-    (Vectors.sparse(numFeatures, Seq()), Double.PositiveInfinity, Array.empty[Double])
-  } else if ($(fitIntercept) && numClasses == 1) {
-    logWarning(s"All labels are zero and fitIntercept=true, so the coefficients will be " +
-      s"zeros and the intercept will be negative infinity; as a result, " +
-      s"training is not needed.")
-    (Vectors.sparse(numFeatures, Seq()), Double.NegativeInfinity, Array.empty[Double])
+  if ($(fitIntercept) && isConstantLabel) {
+    logWarning(s"All labels are the same value and fitIntercept=true, so the coefficients " +
+      s"will be zeros. Training is not needed.")
+    val constantLabelIndex = Vectors.dense(histogram).argmax
+    val coefMatrix = if (numFeatures < numCoefficientSets) {
--- End diff --

Add a comment here, with a TODO to consolidate the sparse matrix logic.
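
One possible shape for the consolidated logic, sketched with a hypothetical
`zeroMatrix` helper (not part of the PR): an all-zero coefficient matrix is
expressible as a CSC `SparseMatrix` with empty value and index arrays,
regardless of which dimension is larger.

```scala
import org.apache.spark.ml.linalg.SparseMatrix

// Hypothetical helper: an all-zero numRows x numCols matrix in CSC form.
// colPtrs needs numCols + 1 entries, all zero, since no column stores values.
def zeroMatrix(numRows: Int, numCols: Int): SparseMatrix =
  new SparseMatrix(numRows, numCols, new Array[Int](numCols + 1),
    Array.empty[Int], Array.empty[Double])
```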


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #13758: [SPARK-16043][SQL] Prepare GenericArrayData implementati...

2016-09-12 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/13758
  
Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #13758: [SPARK-16043][SQL] Prepare GenericArrayData implementati...

2016-09-12 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/13758
  
**[Test build #65239 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65239/consoleFull)**
 for PR 13758 at commit 
[`deb363a`](https://github.com/apache/spark/commit/deb363afba6b8b3d2bd82b230ec132eb637c43c6).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `class GenericArrayData(val array: Array[Any],`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14834: [SPARK-17163][ML] Unified LogisticRegression inte...

2016-09-12 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/14834#discussion_r78321146
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
@@ -311,8 +350,28 @@ class LogisticRegression @Since("1.2.0") (
 
 val histogram = labelSummarizer.histogram
 val numInvalid = labelSummarizer.countInvalid
-val numClasses = histogram.length
 val numFeatures = summarizer.mean.size
+val numFeaturesPlusIntercept = if (getFitIntercept) numFeatures + 1 else numFeatures
+
+val numClasses = MetadataUtils.getNumClasses(dataset.schema($(labelCol))) match {
+  case Some(n: Int) =>
+    require(n >= histogram.length, s"Specified number of classes $n was " +
+      s"less than the number of unique labels ${histogram.length}.")
+    n
+  case None => histogram.length
+}
+
+val isBinaryClassification = numClasses == 1 || numClasses == 2
+val isMultinomial = $(family) match {
+  case "binomial" =>
+    require(isBinaryClassification, s"Binomial family only supports 1 or 2 " +
+      s"outcome classes but found $numClasses.")
+    false
+  case "multinomial" => true
+  case "auto" => !isBinaryClassification
+  case other => throw new IllegalArgumentException(s"Unsupported family: $other")
+}
--- End diff --

BTW, I think `isBinaryClassification` is not needed since it's not being 
used at all.
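
For context, the resolved family is what users control through the new
`family` Param; a minimal usage sketch, assuming the setter this PR
introduces:

```scala
import org.apache.spark.ml.classification.LogisticRegression

// "auto" (the default) resolves to binomial for at most two label classes
// and to multinomial otherwise; forcing "binomial" on more than two
// classes trips the require(...) above.
val lr = new LogisticRegression()
  .setFamily("multinomial") // or "binomial" / "auto"
```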


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #11729: [SPARK-13073] [MLib] [WIP] creating R like summary for l...

2016-09-12 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/11729
  
gentle ping @mbaddar1 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #11079: [SPARK-13197][SQL] When trying to select from the data f...

2016-09-12 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/11079
  
+1 for not a problem.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15054: [SPARK-17502] [SQL] Fix Multiple Bugs in DDL Statements ...

2016-09-12 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15054
  
**[Test build #65240 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65240/consoleFull)**
 for PR 15054 at commit 
[`cc47c3e`](https://github.com/apache/spark/commit/cc47c3eb07a7628faa4277dd87c2837d01c4f175).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15054: [SPARK-17502] [SQL] Fix Multiple Bugs in DDL Stat...

2016-09-12 Thread gatorsmile
GitHub user gatorsmile opened a pull request:

https://github.com/apache/spark/pull/15054

[SPARK-17502] [SQL] Fix Multiple Bugs in DDL Statements on Temporary Views 
[WIP]

### What changes were proposed in this pull request?
- When the permanent tables/views do not exist but the temporary view 
exists, the expected error for partition-related `ALTER TABLE` commands should 
be `NoSuchTableException`. However, they always report a confusing error 
message. For example, 
```
Partition spec is invalid. The spec (a, b) must match the partition spec () 
defined in table '`testview`';
```
- When the permanent tables/views do not exist but the temporary view 
exists, the expected error for `ALTER TABLE ... UNSET TBLPROPERTIES` should be 
`NoSuchTableException`. However, it instead reports a non-existent table 
property. For example, 
```
Attempted to unset non-existent property 'p' in table '`testView`';
```
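
A hypothetical repro of the second bug from the Scala shell (the view name
is illustrative):

```scala
// Only a temporary view named testView exists; no permanent table does.
spark.sql("CREATE TEMPORARY VIEW testView AS SELECT 1 AS a, 2 AS b")

// Expected after this fix: NoSuchTableException.
// Observed before: "Attempted to unset non-existent property 'p' ..."
spark.sql("ALTER TABLE testView UNSET TBLPROPERTIES ('p')")
```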

TODO: Add more test cases for DDL processing over temporary views.

### How was this patch tested?
Added multiple test cases

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/gatorsmile/spark tempViewDDL

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/15054.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #15054


commit cc47c3eb07a7628faa4277dd87c2837d01c4f175
Author: gatorsmile 
Date:   2016-09-12T06:08:17Z

fix




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15020: Spark 2.0 error in Intellij

2016-09-12 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/15020
  
ping @bigdatatraining 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


