[GitHub] spark pull request: [SPARK-12968][SQL] Implement command to set cu...

2016-01-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10916#issuecomment-174881926
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/50079/
Test FAILed.



[GitHub] spark pull request: [SPARK-12968][SQL] Implement command to set cu...

2016-01-25 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/10916#issuecomment-174881888
  
**[Test build #50079 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50079/consoleFull)**
 for PR 10916 at commit 
[`43beb4b`](https://github.com/apache/spark/commit/43beb4ba499814c698df7537018ab6fafefa738e).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-12968][SQL] Implement command to set cu...

2016-01-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10916#issuecomment-174881924
  
Merged build finished. Test FAILed.



[GitHub] spark pull request: [SPARK-12828][SQL]add natural join support

2016-01-25 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/10762#discussion_r50803540
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala ---
@@ -474,6 +474,7 @@ class DataFrame private[sql](
   val rightCol = 
withPlan(joined.right).resolve(col).toAttribute.withNullability(true)
   Alias(Coalesce(Seq(leftCol, rightCol)), col)()
 }
+  case NaturalJoin(_) => sys.error("NaturalJoin with using clause is 
not supported.")
--- End diff --

yup - although we should still throw an exception here, just in case we 
refactor the code in the future and this becomes reachable.
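
For concreteness, a minimal, self-contained Scala sketch of the defensive pattern being suggested; the `JoinType` hierarchy is stubbed out and all names are illustrative (the real types live in `org.apache.spark.sql.catalyst.plans`):

```scala
// Stubs standing in for the Catalyst join-type hierarchy.
sealed trait JoinType
case object Inner extends JoinType
case class NaturalJoin(tpe: JoinType) extends JoinType

def resolveUsingJoin(joinType: JoinType): String = joinType match {
  case Inner => "resolve the using-clause columns as usual"
  case NaturalJoin(_) =>
    // Unreachable today, but fail loudly if future refactoring changes that.
    throw new UnsupportedOperationException(
      "NaturalJoin with using clause is not supported.")
}
```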




[GitHub] spark pull request: [SPARK-12828][SQL]add natural join support

2016-01-25 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/10762#discussion_r50803483
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
 ---
@@ -1159,6 +1161,25 @@ class Analyzer(
   }
 }
   }
+
+  /**
+   * Removes natural joins.
--- End diff --

I think we need more comments here explaining how we resolve a natural join 
into a normal join.
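
For readers following along, a rough, self-contained sketch of the resolution such a comment would describe, under the assumption that natural join resolution means joining on the shared column names; `Attr` is a stand-in for Catalyst's `Attribute`, and the real rule builds `EqualTo`/`And` expressions rather than the pairs returned here:

```scala
case class Attr(name: String)

def resolveNaturalJoin(
    left: Seq[Attr], right: Seq[Attr]): (Seq[(Attr, Attr)], Seq[Attr]) = {
  // 1. Find the column names the two sides share.
  val common = left.map(_.name).toSet intersect right.map(_.name).toSet
  // 2. Build one equality condition per shared column (ANDed together).
  val conditions = common.toSeq.map { n =>
    (left.find(_.name == n).get, right.find(_.name == n).get)
  }
  // 3. Output each shared column once, then the remaining columns of each side.
  val output = common.toSeq.map(Attr(_)) ++
    left.filterNot(a => common(a.name)) ++
    right.filterNot(a => common(a.name))
  (conditions, output)
}
```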



[GitHub] spark pull request: [SPARK-12937][SQL] bloom filter serialization

2016-01-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10920#issuecomment-174881466
  
Merged build finished. Test PASSed.



[GitHub] spark pull request: [SPARK-12937][SQL] bloom filter serialization

2016-01-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10920#issuecomment-174881467
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/50081/
Test PASSed.



[GitHub] spark pull request: [SPARK-12828][SQL]add natural join support

2016-01-25 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/10762#discussion_r50803405
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala ---
@@ -474,6 +474,7 @@ class DataFrame private[sql](
   val rightCol = 
withPlan(joined.right).resolve(col).toAttribute.withNullability(true)
   Alias(Coalesce(Seq(leftCol, rightCol)), col)()
 }
+  case NaturalJoin(_) => sys.error("NaturalJoin with using clause is 
not supported.")
--- End diff --

Then this case is unreachable, as `JoinType.apply` won't produce a natural 
join.
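
For context, a paraphrased and abbreviated sketch of what `JoinType.apply` does: it dispatches on the user-facing join type string, and no string spells a natural join. Stubs are added so the sketch compiles standalone; treat the exact case list as approximate:

```scala
sealed trait JoinType
case object Inner extends JoinType
case object FullOuter extends JoinType
case object LeftOuter extends JoinType
case object RightOuter extends JoinType
case object LeftSemi extends JoinType

object JoinType {
  // No input string maps to a natural join, so DataFrame.join can never
  // produce one - which is what makes the NaturalJoin case unreachable.
  def apply(typ: String): JoinType = typ.toLowerCase.replace("_", "") match {
    case "inner" => Inner
    case "outer" | "full" | "fullouter" => FullOuter
    case "leftouter" | "left" => LeftOuter
    case "rightouter" | "right" => RightOuter
    case "leftsemi" => LeftSemi
    case other => throw new IllegalArgumentException(
      s"Unsupported join type '$other'")
  }
}
```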



[GitHub] spark pull request: [SPARK-12937][SQL] bloom filter serialization

2016-01-25 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/10920#issuecomment-174881345
  
**[Test build #50081 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50081/consoleFull)**
 for PR 10920 at commit 
[`4b05a35`](https://github.com/apache/spark/commit/4b05a35d58cdabccd915582894d303ba437bee0f).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-12895][SPARK-12896] Migrate TaskMetrics...

2016-01-25 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/10835#discussion_r50803319
  
--- Diff: 
core/src/test/scala/org/apache/spark/InternalAccumulatorSuite.scala ---
@@ -0,0 +1,329 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark
+
+import scala.collection.mutable.ArrayBuffer
+
+import org.apache.spark.scheduler.AccumulableInfo
+import org.apache.spark.storage.{BlockId, BlockStatus}
+
+
+class InternalAccumulatorSuite extends SparkFunSuite with 
LocalSparkContext {
+  import InternalAccumulator._
+  import AccumulatorParam._
+
+  test("get param") {
+assert(getParam(EXECUTOR_DESERIALIZE_TIME) === LongAccumulatorParam)
+assert(getParam(EXECUTOR_RUN_TIME) === LongAccumulatorParam)
+assert(getParam(RESULT_SIZE) === LongAccumulatorParam)
+assert(getParam(JVM_GC_TIME) === LongAccumulatorParam)
+assert(getParam(RESULT_SERIALIZATION_TIME) === LongAccumulatorParam)
+assert(getParam(MEMORY_BYTES_SPILLED) === LongAccumulatorParam)
+assert(getParam(DISK_BYTES_SPILLED) === LongAccumulatorParam)
+assert(getParam(PEAK_EXECUTION_MEMORY) === LongAccumulatorParam)
+assert(getParam(UPDATED_BLOCK_STATUSES) === 
UpdatedBlockStatusesAccumulatorParam)
+assert(getParam(TEST_ACCUM) === LongAccumulatorParam)
+// shuffle read
+assert(getParam(shuffleRead.REMOTE_BLOCKS_FETCHED) === 
IntAccumulatorParam)
+assert(getParam(shuffleRead.LOCAL_BLOCKS_FETCHED) === 
IntAccumulatorParam)
+assert(getParam(shuffleRead.REMOTE_BYTES_READ) === 
LongAccumulatorParam)
+assert(getParam(shuffleRead.LOCAL_BYTES_READ) === LongAccumulatorParam)
+assert(getParam(shuffleRead.FETCH_WAIT_TIME) === LongAccumulatorParam)
+assert(getParam(shuffleRead.RECORDS_READ) === LongAccumulatorParam)
+// shuffle write
+assert(getParam(shuffleWrite.BYTES_WRITTEN) === LongAccumulatorParam)
+assert(getParam(shuffleWrite.RECORDS_WRITTEN) === LongAccumulatorParam)
+assert(getParam(shuffleWrite.WRITE_TIME) === LongAccumulatorParam)
+// input
+assert(getParam(input.READ_METHOD) === StringAccumulatorParam)
+assert(getParam(input.RECORDS_READ) === LongAccumulatorParam)
+assert(getParam(input.BYTES_READ) === LongAccumulatorParam)
+// output
+assert(getParam(output.WRITE_METHOD) === StringAccumulatorParam)
+assert(getParam(output.RECORDS_WRITTEN) === LongAccumulatorParam)
+assert(getParam(output.BYTES_WRITTEN) === LongAccumulatorParam)
+// default to Long
+assert(getParam(METRICS_PREFIX + "anything") === LongAccumulatorParam)
+intercept[IllegalArgumentException] {
+  getParam("something that does not start with the right prefix")
+}
+  }
+
+  test("create by name") {
+val executorRunTime = create(EXECUTOR_RUN_TIME)
+val updatedBlockStatuses = create(UPDATED_BLOCK_STATUSES)
+val shuffleRemoteBlocksRead = create(shuffleRead.REMOTE_BLOCKS_FETCHED)
+val inputReadMethod = create(input.READ_METHOD)
+assert(executorRunTime.name === Some(EXECUTOR_RUN_TIME))
+assert(updatedBlockStatuses.name === Some(UPDATED_BLOCK_STATUSES))
+assert(shuffleRemoteBlocksRead.name === 
Some(shuffleRead.REMOTE_BLOCKS_FETCHED))
+assert(inputReadMethod.name === Some(input.READ_METHOD))
+assert(executorRunTime.value.isInstanceOf[Long])
+assert(updatedBlockStatuses.value.isInstanceOf[Seq[_]])
+// We cannot assert the type of the value directly since the type 
parameter is erased.
+// Instead, try casting a `Seq` of expected type and see if it fails 
in run time.
+updatedBlockStatuses.setValueAny(Seq.empty[(BlockId, BlockStatus)])
+assert(shuffleRemoteBlocksRead.value.isInstanceOf[Int])
+assert(inputReadMethod.value.isInstanceOf[String])
+// default to Long
+val anything = create(METRICS_PREFIX + "anything")
+assert(anything.value.isIns

[GitHub] spark pull request: [SPARK-12828][SQL]add natural join support

2016-01-25 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/10762#discussion_r50803298
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala ---
@@ -474,6 +474,7 @@ class DataFrame private[sql](
   val rightCol = 
withPlan(joined.right).resolve(col).toAttribute.withNullability(true)
   Alias(Coalesce(Seq(leftCol, rightCol)), col)()
 }
+  case NaturalJoin(_) => sys.error("NaturalJoin with using clause is 
not supported.")
--- End diff --

No, I don't think we need to.




[GitHub] spark pull request: [SPARK-11518] [Deploy, Windows] Handle spaces ...

2016-01-25 Thread tsudukim
Github user tsudukim commented on the pull request:

https://github.com/apache/spark/pull/10789#issuecomment-174880482
  
I think just adding the quotation marks is enough to solve this problem.



[GitHub] spark pull request: [SPARK-12994][CORE] It is not necessary to cre...

2016-01-25 Thread jerryshao
Github user jerryshao commented on the pull request:

https://github.com/apache/spark/pull/10914#issuecomment-174880398
  
All the DEA-related unit tests run in local mode; they will fail with this 
change, so we should fix it.



[GitHub] spark pull request: [SPARK-12895][SPARK-12896] Migrate TaskMetrics...

2016-01-25 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/10835#discussion_r50803281
  
--- Diff: 
core/src/test/scala/org/apache/spark/InternalAccumulatorSuite.scala ---
@@ -0,0 +1,329 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark
+
+import scala.collection.mutable.ArrayBuffer
+
+import org.apache.spark.scheduler.AccumulableInfo
+import org.apache.spark.storage.{BlockId, BlockStatus}
+
+
+class InternalAccumulatorSuite extends SparkFunSuite with 
LocalSparkContext {
+  import InternalAccumulator._
+  import AccumulatorParam._
+
+  test("get param") {
+assert(getParam(EXECUTOR_DESERIALIZE_TIME) === LongAccumulatorParam)
+assert(getParam(EXECUTOR_RUN_TIME) === LongAccumulatorParam)
+assert(getParam(RESULT_SIZE) === LongAccumulatorParam)
+assert(getParam(JVM_GC_TIME) === LongAccumulatorParam)
+assert(getParam(RESULT_SERIALIZATION_TIME) === LongAccumulatorParam)
+assert(getParam(MEMORY_BYTES_SPILLED) === LongAccumulatorParam)
+assert(getParam(DISK_BYTES_SPILLED) === LongAccumulatorParam)
+assert(getParam(PEAK_EXECUTION_MEMORY) === LongAccumulatorParam)
+assert(getParam(UPDATED_BLOCK_STATUSES) === 
UpdatedBlockStatusesAccumulatorParam)
+assert(getParam(TEST_ACCUM) === LongAccumulatorParam)
+// shuffle read
+assert(getParam(shuffleRead.REMOTE_BLOCKS_FETCHED) === 
IntAccumulatorParam)
+assert(getParam(shuffleRead.LOCAL_BLOCKS_FETCHED) === 
IntAccumulatorParam)
+assert(getParam(shuffleRead.REMOTE_BYTES_READ) === 
LongAccumulatorParam)
+assert(getParam(shuffleRead.LOCAL_BYTES_READ) === LongAccumulatorParam)
+assert(getParam(shuffleRead.FETCH_WAIT_TIME) === LongAccumulatorParam)
+assert(getParam(shuffleRead.RECORDS_READ) === LongAccumulatorParam)
+// shuffle write
+assert(getParam(shuffleWrite.BYTES_WRITTEN) === LongAccumulatorParam)
+assert(getParam(shuffleWrite.RECORDS_WRITTEN) === LongAccumulatorParam)
+assert(getParam(shuffleWrite.WRITE_TIME) === LongAccumulatorParam)
+// input
+assert(getParam(input.READ_METHOD) === StringAccumulatorParam)
+assert(getParam(input.RECORDS_READ) === LongAccumulatorParam)
+assert(getParam(input.BYTES_READ) === LongAccumulatorParam)
+// output
+assert(getParam(output.WRITE_METHOD) === StringAccumulatorParam)
+assert(getParam(output.RECORDS_WRITTEN) === LongAccumulatorParam)
+assert(getParam(output.BYTES_WRITTEN) === LongAccumulatorParam)
+// default to Long
+assert(getParam(METRICS_PREFIX + "anything") === LongAccumulatorParam)
+intercept[IllegalArgumentException] {
+  getParam("something that does not start with the right prefix")
+}
+  }
+
+  test("create by name") {
+val executorRunTime = create(EXECUTOR_RUN_TIME)
+val updatedBlockStatuses = create(UPDATED_BLOCK_STATUSES)
+val shuffleRemoteBlocksRead = create(shuffleRead.REMOTE_BLOCKS_FETCHED)
+val inputReadMethod = create(input.READ_METHOD)
+assert(executorRunTime.name === Some(EXECUTOR_RUN_TIME))
+assert(updatedBlockStatuses.name === Some(UPDATED_BLOCK_STATUSES))
+assert(shuffleRemoteBlocksRead.name === 
Some(shuffleRead.REMOTE_BLOCKS_FETCHED))
+assert(inputReadMethod.name === Some(input.READ_METHOD))
+assert(executorRunTime.value.isInstanceOf[Long])
+assert(updatedBlockStatuses.value.isInstanceOf[Seq[_]])
+// We cannot assert the type of the value directly since the type 
parameter is erased.
+// Instead, try casting a `Seq` of expected type and see if it fails 
in run time.
+updatedBlockStatuses.setValueAny(Seq.empty[(BlockId, BlockStatus)])
+assert(shuffleRemoteBlocksRead.value.isInstanceOf[Int])
+assert(inputReadMethod.value.isInstanceOf[String])
+// default to Long
+val anything = create(METRICS_PREFIX + "anything")
+assert(anything.value.isIns

[GitHub] spark pull request: [SPARK-12828][SQL]add natural join support

2016-01-25 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/10762#discussion_r50803191
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala ---
@@ -474,6 +474,7 @@ class DataFrame private[sql](
   val rightCol = 
withPlan(joined.right).resolve(col).toAttribute.withNullability(true)
   Alias(Coalesce(Seq(leftCol, rightCol)), col)()
 }
+  case NaturalJoin(_) => sys.error("NaturalJoin with using clause is 
not supported.")
--- End diff --

Are we going to support natural join in `DataFrame`? If so, I think we 
should also change `JoinType.apply`.



[GitHub] spark pull request: [SPARK-12994][CORE] It is not necessary to cre...

2016-01-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10914#issuecomment-174878692
  
Merged build finished. Test FAILed.



[GitHub] spark pull request: [SPARK-12828][SQL]add natural join support

2016-01-25 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/10762#discussion_r50803015
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 ---
@@ -919,6 +919,7 @@ object PushPredicateThroughJoin extends 
Rule[LogicalPlan] with PredicateHelper {
   (rightFilterConditions ++ commonFilterCondition).
 reduceLeftOption(And).map(Filter(_, 
newJoin)).getOrElse(newJoin)
 case FullOuter => f // DO Nothing for Full Outer Join
+case NaturalJoin(_) => sys.error("Untransformed NaturalJoin node")
--- End diff --

Do we need to catch it? I think we can guarantee there is no `NaturalJoin` 
after `CheckAnalysis`.
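
A self-contained sketch of the kind of guarantee being discussed, assuming a `CheckAnalysis`-style traversal that rejects any surviving natural join; every type here is a stand-in for the corresponding Catalyst `LogicalPlan` machinery:

```scala
sealed trait Plan {
  def children: Seq[Plan]
  def foreachNode(f: Plan => Unit): Unit = {
    f(this); children.foreach(_.foreachNode(f))
  }
}
case class Leaf(name: String) extends Plan { val children = Seq.empty[Plan] }
case class NaturalJoinNode(left: Plan, right: Plan) extends Plan {
  val children = Seq(left, right)
}

// Fail fast at analysis time, so optimizer rules such as
// PushPredicateThroughJoin never see an unresolved natural join.
def checkNoNaturalJoin(plan: Plan): Unit = plan.foreachNode {
  case n: NaturalJoinNode =>
    throw new IllegalStateException(s"Unresolved natural join: $n")
  case _ => ()
}
```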



[GitHub] spark pull request: [SPARK-12994][CORE] It is not necessary to cre...

2016-01-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10914#issuecomment-174878693
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/50073/
Test FAILed.



[GitHub] spark pull request: [SPARK-12994][CORE] It is not necessary to cre...

2016-01-25 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/10914#issuecomment-174878554
  
**[Test build #50073 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50073/consoleFull)**
 for PR 10914 at commit 
[`0467617`](https://github.com/apache/spark/commit/0467617746590b3083deafaa763ee4cae50d4dc0).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-12895][SPARK-12896] Migrate TaskMetrics...

2016-01-25 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/10835#discussion_r50802781
  
--- Diff: core/src/main/scala/org/apache/spark/InternalAccumulator.scala ---
@@ -17,42 +17,193 @@
 
 package org.apache.spark
 
+import org.apache.spark.storage.{BlockId, BlockStatus}
 
-// This is moved to its own file because many more things will be added to 
it in SPARK-10620.
+
+/**
+ * A collection of fields and methods concerned with internal accumulators 
that represent
+ * task level metrics.
+ */
 private[spark] object InternalAccumulator {
-  val PEAK_EXECUTION_MEMORY = "peakExecutionMemory"
-  val TEST_ACCUMULATOR = "testAccumulator"
-
-  // For testing only.
-  // This needs to be a def since we don't want to reuse the same 
accumulator across stages.
-  private def maybeTestAccumulator: Option[Accumulator[Long]] = {
-if (sys.props.contains("spark.testing")) {
-  Some(new Accumulator(
-0L, AccumulatorParam.LongAccumulatorParam, Some(TEST_ACCUMULATOR), 
internal = true))
-} else {
-  None
+
+  import AccumulatorParam._
+
+  // Prefixes used in names of internal task level metrics
+  val METRICS_PREFIX = "internal.metrics."
+  val SHUFFLE_READ_METRICS_PREFIX = METRICS_PREFIX + "shuffle.read."
+  val SHUFFLE_WRITE_METRICS_PREFIX = METRICS_PREFIX + "shuffle.write."
+  val OUTPUT_METRICS_PREFIX = METRICS_PREFIX + "output."
+  val INPUT_METRICS_PREFIX = METRICS_PREFIX + "input."
+
+  // Names of internal task level metrics
+  val EXECUTOR_DESERIALIZE_TIME = METRICS_PREFIX + 
"executorDeserializeTime"
+  val EXECUTOR_RUN_TIME = METRICS_PREFIX + "executorRunTime"
+  val RESULT_SIZE = METRICS_PREFIX + "resultSize"
+  val JVM_GC_TIME = METRICS_PREFIX + "jvmGCTime"
+  val RESULT_SERIALIZATION_TIME = METRICS_PREFIX + 
"resultSerializationTime"
+  val MEMORY_BYTES_SPILLED = METRICS_PREFIX + "memoryBytesSpilled"
+  val DISK_BYTES_SPILLED = METRICS_PREFIX + "diskBytesSpilled"
+  val PEAK_EXECUTION_MEMORY = METRICS_PREFIX + "peakExecutionMemory"
+  val UPDATED_BLOCK_STATUSES = METRICS_PREFIX + "updatedBlockStatuses"
+  val TEST_ACCUM = METRICS_PREFIX + "testAccumulator"
+
+  // scalastyle:off
+
+  // Names of shuffle read metrics
+  object shuffleRead {
+val REMOTE_BLOCKS_FETCHED = SHUFFLE_READ_METRICS_PREFIX + 
"remoteBlocksFetched"
+val LOCAL_BLOCKS_FETCHED = SHUFFLE_READ_METRICS_PREFIX + 
"localBlocksFetched"
+val REMOTE_BYTES_READ = SHUFFLE_READ_METRICS_PREFIX + "remoteBytesRead"
+val LOCAL_BYTES_READ = SHUFFLE_READ_METRICS_PREFIX + "localBytesRead"
+val FETCH_WAIT_TIME = SHUFFLE_READ_METRICS_PREFIX + "fetchWaitTime"
+val RECORDS_READ = SHUFFLE_READ_METRICS_PREFIX + "recordsRead"
+  }
+
+  // Names of shuffle write metrics
+  object shuffleWrite {
+val BYTES_WRITTEN = SHUFFLE_WRITE_METRICS_PREFIX + "bytesWritten"
+val RECORDS_WRITTEN = SHUFFLE_WRITE_METRICS_PREFIX + "recordsWritten"
+val WRITE_TIME = SHUFFLE_WRITE_METRICS_PREFIX + "writeTime"
+  }
+
+  // Names of output metrics
+  object output {
+val WRITE_METHOD = OUTPUT_METRICS_PREFIX + "writeMethod"
+val BYTES_WRITTEN = OUTPUT_METRICS_PREFIX + "bytesWritten"
+val RECORDS_WRITTEN = OUTPUT_METRICS_PREFIX + "recordsWritten"
+  }
+
+  // Names of input metrics
+  object input {
+val READ_METHOD = INPUT_METRICS_PREFIX + "readMethod"
+val BYTES_READ = INPUT_METRICS_PREFIX + "bytesRead"
+val RECORDS_READ = INPUT_METRICS_PREFIX + "recordsRead"
+  }
+
+  // scalastyle:on
+
+  /**
+   * Create an internal [[Accumulator]] by name, which must begin with 
[[METRICS_PREFIX]].
+   */
+  def create(name: String): Accumulator[_] = {
+assert(name.startsWith(METRICS_PREFIX),
+  s"internal accumulator name must start with '$METRICS_PREFIX': 
$name")
+getParam(name) match {
+  case p @ LongAccumulatorParam => newMetric[Long](0L, name, p)
+  case p @ IntAccumulatorParam => newMetric[Int](0, name, p)
+  case p @ StringAccumulatorParam => newMetric[String]("", name, p)
+  case p @ UpdatedBlockStatusesAccumulatorParam =>
+newMetric[Seq[(BlockId, BlockStatus)]](Seq(), name, p)
+  case p => throw new IllegalArgumentException(
+s"unsupported accumulator param '${p.getClass.getSimpleName}' for 
metric '$name'.")
+}
+  }
+
+  /**
+   * Get the [[AccumulatorParam]] associated with the internal metric name,
+   * which must begin with [[METRICS_PREFIX]].
+   */
+  def getParam(name: String): AccumulatorParam[_] = {
+assert(name.startsWith(METRICS_PREFIX),
+  s"internal accumulator name must s

[GitHub] spark pull request: [SPARK-12895][SPARK-12896] Migrate TaskMetrics...

2016-01-25 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/10835#discussion_r50802605
  
--- Diff: 
core/src/test/scala/org/apache/spark/executor/TaskMetricsSuite.scala ---
@@ -17,12 +17,543 @@
 
 package org.apache.spark.executor
 
-import org.apache.spark.SparkFunSuite
+import org.scalatest.Assertions
+
+import org.apache.spark._
+import org.apache.spark.scheduler.AccumulableInfo
+import org.apache.spark.storage.{BlockId, BlockStatus, StorageLevel, 
TestBlockId}
+
 
 class TaskMetricsSuite extends SparkFunSuite {
-  test("[SPARK-5701] updateShuffleReadMetrics: ShuffleReadMetrics not 
added when no shuffle deps") {
-val taskMetrics = new TaskMetrics()
-taskMetrics.mergeShuffleReadMetrics()
-assert(taskMetrics.shuffleReadMetrics.isEmpty)
+  import AccumulatorParam._
+  import InternalAccumulator._
+  import StorageLevel._
+  import TaskMetricsSuite._
+
+  test("create") {
+val internalAccums = InternalAccumulator.create()
+val tm1 = new TaskMetrics
+val tm2 = new TaskMetrics(internalAccums)
+assert(tm1.accumulatorUpdates().size === internalAccums.size)
+assert(tm1.shuffleReadMetrics.isEmpty)
+assert(tm1.shuffleWriteMetrics.isEmpty)
+assert(tm1.inputMetrics.isEmpty)
+assert(tm1.outputMetrics.isEmpty)
+assert(tm2.accumulatorUpdates().size === internalAccums.size)
+assert(tm2.shuffleReadMetrics.isEmpty)
+assert(tm2.shuffleWriteMetrics.isEmpty)
+assert(tm2.inputMetrics.isEmpty)
+assert(tm2.outputMetrics.isEmpty)
+// TaskMetrics constructor expects minimal set of initial accumulators
+intercept[IllegalArgumentException] { new 
TaskMetrics(Seq.empty[Accumulator[_]]) }
+  }
+
+  test("create with unnamed accum") {
+intercept[IllegalArgumentException] {
+  new TaskMetrics(
+InternalAccumulator.create() ++ Seq(
+  new Accumulator(0, IntAccumulatorParam, None, internal = true)))
+}
+  }
+
+  test("create with duplicate name accum") {
+intercept[IllegalArgumentException] {
+  new TaskMetrics(
+InternalAccumulator.create() ++ Seq(
+  new Accumulator(0, IntAccumulatorParam, Some(RESULT_SIZE), 
internal = true)))
+}
+  }
+
+  test("create with external accum") {
+intercept[IllegalArgumentException] {
+  new TaskMetrics(
+InternalAccumulator.create() ++ Seq(
+  new Accumulator(0, IntAccumulatorParam, Some("x"
+}
+  }
+
+  test("create shuffle read metrics") {
+import shuffleRead._
+val accums = InternalAccumulator.createShuffleReadAccums()
+  .map { a => (a.name.get, a) }.toMap[String, Accumulator[_]]
+accums(REMOTE_BLOCKS_FETCHED).setValueAny(1)
+accums(LOCAL_BLOCKS_FETCHED).setValueAny(2)
+accums(REMOTE_BYTES_READ).setValueAny(3L)
+accums(LOCAL_BYTES_READ).setValueAny(4L)
+accums(FETCH_WAIT_TIME).setValueAny(5L)
+accums(RECORDS_READ).setValueAny(6L)
+val sr = new ShuffleReadMetrics(accums)
+assert(sr.remoteBlocksFetched === 1)
+assert(sr.localBlocksFetched === 2)
+assert(sr.remoteBytesRead === 3L)
+assert(sr.localBytesRead === 4L)
+assert(sr.fetchWaitTime === 5L)
+assert(sr.recordsRead === 6L)
+  }
+
+  test("create shuffle write metrics") {
+import shuffleWrite._
+val accums = InternalAccumulator.createShuffleWriteAccums()
+  .map { a => (a.name.get, a) }.toMap[String, Accumulator[_]]
+accums(BYTES_WRITTEN).setValueAny(1L)
+accums(RECORDS_WRITTEN).setValueAny(2L)
+accums(WRITE_TIME).setValueAny(3L)
+val sw = new ShuffleWriteMetrics(accums)
+assert(sw.bytesWritten === 1L)
+assert(sw.recordsWritten === 2L)
+assert(sw.writeTime === 3L)
+  }
+
+  test("create input metrics") {
+import input._
+val accums = InternalAccumulator.createInputAccums()
+  .map { a => (a.name.get, a) }.toMap[String, Accumulator[_]]
+accums(BYTES_READ).setValueAny(1L)
+accums(RECORDS_READ).setValueAny(2L)
+accums(READ_METHOD).setValueAny(DataReadMethod.Hadoop.toString)
+val im = new InputMetrics(accums)
+assert(im.bytesRead === 1L)
+assert(im.recordsRead === 2L)
+assert(im.readMethod === DataReadMethod.Hadoop)
+  }
+
+  test("create output metrics") {
+import output._
+val accums = InternalAccumulator.createOutputAccums()
+  .map { a => (a.name.get, a) }.toMap[String, Accumulator[_]]
+accums(BYTES_WRITTEN).setValueAny(1L)
+accums(RECORDS_WRITTEN).setValueAny(2L)
+accums(WRITE_METHOD).setValueAny(DataWriteMethod.Hadoop.toS

[GitHub] spark pull request: [SPARK-12895][SPARK-12896] Migrate TaskMetrics...

2016-01-25 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/10835#discussion_r50802571
  
--- Diff: core/src/main/scala/org/apache/spark/executor/TaskMetrics.scala 
---
@@ -230,86 +297,119 @@ class TaskMetrics extends Serializable {
*/
   def shuffleWriteMetrics: Option[ShuffleWriteMetrics] = 
_shuffleWriteMetrics
 
-  @deprecated("setting ShuffleWriteMetrics is for internal use only", 
"2.0.0")
-  def shuffleWriteMetrics_=(swm: Option[ShuffleWriteMetrics]): Unit = {
-_shuffleWriteMetrics = swm
-  }
-
   /**
* Get or create a new [[ShuffleWriteMetrics]] associated with this task.
*/
   private[spark] def registerShuffleWriteMetrics(): ShuffleWriteMetrics = 
synchronized {
 _shuffleWriteMetrics.getOrElse {
-  val metrics = new ShuffleWriteMetrics
+  val metrics = new ShuffleWriteMetrics(initialAccumsMap)
   _shuffleWriteMetrics = Some(metrics)
   metrics
 }
   }
 
-  private var _updatedBlockStatuses: Seq[(BlockId, BlockStatus)] =
-Seq.empty[(BlockId, BlockStatus)]
-
-  /**
-   * Storage statuses of any blocks that have been updated as a result of 
this task.
-   */
-  def updatedBlockStatuses: Seq[(BlockId, BlockStatus)] = 
_updatedBlockStatuses
 
-  @deprecated("setting updated blocks is for internal use only", "2.0.0")
-  def updatedBlocks_=(ub: Option[Seq[(BlockId, BlockStatus)]]): Unit = {
-_updatedBlockStatuses = ub.getOrElse(Seq.empty[(BlockId, BlockStatus)])
-  }
+  /* == *
+   |OTHER THINGS|
+   * == */
 
-  private[spark] def incUpdatedBlockStatuses(v: Seq[(BlockId, 
BlockStatus)]): Unit = {
-_updatedBlockStatuses ++= v
+  private[spark] def registerAccumulator(a: Accumulable[_, _]): Unit = {
+accums += a
   }
 
-  private[spark] def setUpdatedBlockStatuses(v: Seq[(BlockId, 
BlockStatus)]): Unit = {
-_updatedBlockStatuses = v
+  /**
+   * Return the latest updates of accumulators in this task.
+   */
+  def accumulatorUpdates(): Seq[AccumulableInfo] = accums.map { a =>
+new AccumulableInfo(
+  a.id, a.name.orNull, Some(a.localValue), None, a.isInternal, 
a.countFailedValues)
--- End diff --

If you decide to update this, please also update the other `.name.orNull` 
calls in this patch.



[GitHub] spark pull request: [SPARK-12937][SQL] bloom filter serialization

2016-01-25 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/10920#discussion_r50802512
  
--- Diff: 
common/sketch/src/main/java/org/apache/spark/util/sketch/Version.java ---
@@ -0,0 +1,35 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.util.sketch;
+
+/**
+ * Version number of the serialized binary format for bloom filter or 
count-min sketch.
+ */
+public enum Version {
--- End diff --

cc @liancheng on point 1 - the best place to document the binary protocol 
is in Version!




[GitHub] spark pull request: [SPARK-12895][SPARK-12896] Migrate TaskMetrics...

2016-01-25 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/10835#discussion_r50802530
  
--- Diff: core/src/main/scala/org/apache/spark/executor/TaskMetrics.scala 
---
@@ -230,86 +297,119 @@ class TaskMetrics extends Serializable {
*/
   def shuffleWriteMetrics: Option[ShuffleWriteMetrics] = 
_shuffleWriteMetrics
 
-  @deprecated("setting ShuffleWriteMetrics is for internal use only", 
"2.0.0")
-  def shuffleWriteMetrics_=(swm: Option[ShuffleWriteMetrics]): Unit = {
-_shuffleWriteMetrics = swm
-  }
-
   /**
* Get or create a new [[ShuffleWriteMetrics]] associated with this task.
*/
   private[spark] def registerShuffleWriteMetrics(): ShuffleWriteMetrics = 
synchronized {
 _shuffleWriteMetrics.getOrElse {
-  val metrics = new ShuffleWriteMetrics
+  val metrics = new ShuffleWriteMetrics(initialAccumsMap)
   _shuffleWriteMetrics = Some(metrics)
   metrics
 }
   }
 
-  private var _updatedBlockStatuses: Seq[(BlockId, BlockStatus)] =
-Seq.empty[(BlockId, BlockStatus)]
-
-  /**
-   * Storage statuses of any blocks that have been updated as a result of 
this task.
-   */
-  def updatedBlockStatuses: Seq[(BlockId, BlockStatus)] = 
_updatedBlockStatuses
 
-  @deprecated("setting updated blocks is for internal use only", "2.0.0")
-  def updatedBlocks_=(ub: Option[Seq[(BlockId, BlockStatus)]]): Unit = {
-_updatedBlockStatuses = ub.getOrElse(Seq.empty[(BlockId, BlockStatus)])
-  }
+  /* == *
+   |OTHER THINGS|
+   * == */
 
-  private[spark] def incUpdatedBlockStatuses(v: Seq[(BlockId, 
BlockStatus)]): Unit = {
-_updatedBlockStatuses ++= v
+  private[spark] def registerAccumulator(a: Accumulable[_, _]): Unit = {
+accums += a
   }
 
-  private[spark] def setUpdatedBlockStatuses(v: Seq[(BlockId, 
BlockStatus)]): Unit = {
-_updatedBlockStatuses = v
+  /**
+   * Return the latest updates of accumulators in this task.
+   */
+  def accumulatorUpdates(): Seq[AccumulableInfo] = accums.map { a =>
+new AccumulableInfo(
+  a.id, a.name.orNull, Some(a.localValue), None, a.isInternal, 
a.countFailedValues)
--- End diff --

This `.orNull` call looks a little suspect to me; would you mind either 
marking this field of `AccumulableInfo` with the `@Nullable` annotation, if 
it's going to be legal to do this, or converting it to an option? Either of 
these options is fine by me; I just want to make it clear to readers of 
`AccumulableInfo` that `name` might not be defined.
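
The two shapes being weighed, in a tiny illustrative sketch; the field sets are trimmed down, and `AccumInfoNullable`/`AccumInfoOption` are hypothetical names, not the real class:

```scala
// (1) keep a nullable String and annotate it, or (2) make absence explicit.
case class AccumInfoNullable(id: Long, /* effectively @Nullable */ name: String)
case class AccumInfoOption(id: Long, name: Option[String])

val name: Option[String] = None
val viaOrNull = AccumInfoNullable(1L, name.orNull) // readers must remember name may be null
val viaOption = AccumInfoOption(1L, name)          // absence is visible in the type
```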



[GitHub] spark pull request: [SPARK-12937][SQL] bloom filter serialization

2016-01-25 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/10920#issuecomment-174875525
  
**[Test build #50081 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50081/consoleFull)**
 for PR 10920 at commit 
[`4b05a35`](https://github.com/apache/spark/commit/4b05a35d58cdabccd915582894d303ba437bee0f).



[GitHub] spark pull request: [SPARK-12961][Core] Prevent snappy-java memory...

2016-01-25 Thread viirya
Github user viirya commented on the pull request:

https://github.com/apache/spark/pull/10875#issuecomment-174875415
  
@JoshRosen @srowen Is this ready to merge?



[GitHub] spark pull request: [SPARK-12937][SQL] bloom filter serialization

2016-01-25 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/10920#discussion_r50802492
  
--- Diff: 
common/sketch/src/main/java/org/apache/spark/util/sketch/Version.java ---
@@ -0,0 +1,35 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.util.sketch;
+
+/**
+ * Version number of the serialized binary format for bloom filter or 
count-min sketch.
+ */
+public enum Version {
--- End diff --

I think we should move it back, because:

1. The version enum is actually the best place to document the binary 
protocol.

2. It will be really confusing when the Bloom filter format has a v2 and yet 
the Count-min sketch format has only a v1.

3. The amount of code duplication you save is tiny (you probably added more 
lines of code by adding an Apache license header).
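
A Scala-flavored sketch of the kind of documentation being asked for (the real file is Java); the V1 layout below is read off the `writeTo` implementation quoted elsewhere in this digest, and the `BitArray` payload format is deliberately left opaque since its internals are not shown there:

```scala
object BloomFilterBinaryFormat {
  /**
   * V1 layout, written via java.io.DataOutputStream (big-endian):
   *   1. version number          - 4-byte int
   *   2. bit array payload       - as written by BitArray.writeTo
   *   3. number of hash functions - 4-byte int
   */
  val V1: Int = 1
}
```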



[GitHub] spark pull request: [SPARK-12888][SQL][follow-up] benchmark the ne...

2016-01-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10917#issuecomment-174875252
  
Merged build finished. Test PASSed.



[GitHub] spark pull request: [SPARK-12888][SQL][follow-up] benchmark the ne...

2016-01-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10917#issuecomment-174875254
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/50072/
Test PASSed.



[GitHub] spark pull request: [SPARK-12888][SQL][follow-up] benchmark the ne...

2016-01-25 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/10917#issuecomment-174874774
  
**[Test build #50072 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50072/consoleFull)**
 for PR 10917 at commit 
[`8207dc1`](https://github.com/apache/spark/commit/8207dc109f21527438cbd80894e9b49d63159f12).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-12888][SQL][follow-up] benchmark the ne...

2016-01-25 Thread cloud-fan
Github user cloud-fan commented on the pull request:

https://github.com/apache/spark/pull/10917#issuecomment-174874622
  
@nongli It's not doing anything special to get the hash code of an int field, 
but it does a [simple multiplication and 
addition](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/rows.scala#L153)
 to get the hash code of the row.
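
A sketch of the multiply-and-add scheme the linked `rows.scala` code uses; this is not the actual Catalyst implementation, and the per-type cases shown are representative only:

```scala
def rowHash(fields: Seq[Any]): Int = {
  var result = 37
  for (f <- fields) {
    val update = f match {
      case null    => 0
      case i: Int  => i                       // an int field contributes itself
      case l: Long => (l ^ (l >>> 32)).toInt
      case other   => other.hashCode
    }
    result = 37 * result + update             // fold each field into the row hash
  }
  result
}
```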



[GitHub] spark pull request: [SPARK-12937][SQL] bloom filter serialization

2016-01-25 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/10920#discussion_r50802349
  
--- Diff: 
common/sketch/src/main/java/org/apache/spark/util/sketch/BloomFilterImpl.java 
---
@@ -161,4 +194,24 @@ public BloomFilter mergeInPlace(BloomFilter other) 
throws IncompatibleMergeExcep
 this.bits.putAll(that.bits);
 return this;
   }
+
+  @Override
+  public void writeTo(OutputStream out) throws IOException {
+DataOutputStream dos = new DataOutputStream(out);
+
+dos.writeInt(Version.V1.getVersionNumber());
+bits.writeTo(dos);
+dos.writeInt(numHashFunctions);
+  }
+
+  public static BloomFilterImpl readFrom(InputStream in) throws 
IOException {
+DataInputStream dis = new DataInputStream(in);
+
+int version = dis.readInt();
+if (version != Version.V1.getVersionNumber()) {
+  throw new IOException("Unexpected Bloom Filter version number (" + 
version + ")");
--- End diff --

Use "BloomFilter" or "Bloom filter" here, not "Bloom Filter".



[GitHub] spark pull request: [SPARK-12926][SQL] SQLContext to disallow user...

2016-01-25 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/10849#issuecomment-174874233
  
**[Test build #50082 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50082/consoleFull)**
 for PR 10849 at commit 
[`f982d54`](https://github.com/apache/spark/commit/f982d5449fc52ef9b844761f92306fb7d238b542).



[GitHub] spark pull request: [SPARK-12895][SPARK-12896] Migrate TaskMetrics...

2016-01-25 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/10835#discussion_r50802278
  
--- Diff: 
core/src/test/scala/org/apache/spark/executor/TaskMetricsSuite.scala ---
@@ -17,12 +17,345 @@
 
 package org.apache.spark.executor
 
-import org.apache.spark.SparkFunSuite
+import org.apache.spark._
+import org.apache.spark.storage.{BlockId, BlockStatus, StorageLevel, 
TestBlockId}
+
 
 class TaskMetricsSuite extends SparkFunSuite {
-  test("[SPARK-5701] updateShuffleReadMetrics: ShuffleReadMetrics not 
added when no shuffle deps") {
-val taskMetrics = new TaskMetrics()
-taskMetrics.mergeShuffleReadMetrics()
-assert(taskMetrics.shuffleReadMetrics.isEmpty)
+  import AccumulatorParam._
+  import InternalAccumulator._
+  import StorageLevel._
+  import TaskMetricsSuite._
+
+  test("create") {
--- End diff --

Cool, thanks!



[GitHub] spark pull request: [SPARK-12935][SQL] DataFrame API for Count-Min...

2016-01-25 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/10911#discussion_r50802122
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala ---
@@ -309,4 +311,84 @@ final class DataFrameStatFunctions private[sql](df: 
DataFrame) {
   def sampleBy[T](col: String, fractions: ju.Map[T, jl.Double], seed: 
Long): DataFrame = {
 sampleBy(col, fractions.asScala.toMap.asInstanceOf[Map[T, Double]], 
seed)
   }
+
+  /**
+   * Builds a Count-min Sketch over a specified column.
+   *
+   * @param colName name of the column over which the sketch is built
+   * @param depth depth of the sketch
+   * @param width width of the sketch
+   * @param seed random seed
+   * @return a [[CountMinSketch]] over column `colName`
+   * @since 2.0.0
+   */
+  def countMinSketch(colName: String, depth: Int, width: Int, seed: Int): 
CountMinSketch = {
+countMinSketch(Column(colName), depth, width, seed)
+  }
+
+  /**
+   * Builds a Count-min Sketch over a specified column.
+   *
+   * @param colName name of the column over which the sketch is built
+   * @param eps relative error of the sketch
+   * @param confidence confidence of the sketch
+   * @param seed random seed
+   * @return a [[CountMinSketch]] over column `colName`
+   * @since 2.0.0
+   */
+  def countMinSketch(
+  colName: String, eps: Double, confidence: Double, seed: Int): 
CountMinSketch = {
+countMinSketch(Column(colName), eps, confidence, seed)
+  }
+
+  /**
+   * Builds a Count-min Sketch over a specified column.
+   *
+   * @param col the column over which the sketch is built
+   * @param depth depth of the sketch
+   * @param width width of the sketch
+   * @param seed random seed
+   * @return a [[CountMinSketch]] over column `colName`
+   * @since 2.0.0
+   */
+  def countMinSketch(col: Column, depth: Int, width: Int, seed: Int): 
CountMinSketch = {
+countMinSketch(col, CountMinSketch.create(depth, width, seed))
+  }
+
+  /**
+   * Builds a Count-min Sketch over a specified column.
+   *
+   * @param col the column over which the sketch is built
+   * @param eps relative error of the sketch
+   * @param confidence confidence of the sketch
+   * @param seed random seed
+   * @return a [[CountMinSketch]] over column `colName`
+   * @since 2.0.0
+   */
+  def countMinSketch(col: Column, eps: Double, confidence: Double, seed: 
Int): CountMinSketch = {
+countMinSketch(col, CountMinSketch.create(eps, confidence, seed))
+  }
+
+  private def countMinSketch(col: Column, zero: CountMinSketch): 
CountMinSketch = {
+val singleCol = df.select(col)
+val colType = singleCol.schema.head.dataType
+val supportedTypes: Set[DataType] = Set(ByteType, ShortType, 
IntegerType, LongType, StringType)
+
+require(
+  supportedTypes.contains(colType),
+  s"Count-min Sketch only supports string type and integral types, " +
+s"and does not support type $colType."
+)
+
+singleCol.rdd.aggregate(zero)(
--- End diff --

Maybe we can improve it with a UDAF in the future.
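
Roughly what that `aggregate` call does, as a sketch reusing `singleCol` (the single-column DataFrame) and `zero` (the empty sketch) from the quoted diff, so it is not standalone; it assumes the `CountMinSketch.add`/`mergeInPlace` API used elsewhere in this PR:

```scala
val sketch = singleCol.rdd.aggregate(zero)(
  (cms, row) => { cms.add(row.get(0)); cms },  // seqOp: fold one row into a partition's sketch
  (cms1, cms2) => cms1.mergeInPlace(cms2)      // combOp: merge partition sketches
)
```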



[GitHub] spark pull request: [SPARK-12926][SQL] SQLContext to disallow user...

2016-01-25 Thread tejasapatil
Github user tejasapatil commented on the pull request:

https://github.com/apache/spark/pull/10849#issuecomment-174873141
  
Fixed the Scala style test.



[GitHub] spark pull request: [SPARK-12895][SPARK-12896] Migrate TaskMetrics...

2016-01-25 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/10835#discussion_r50802114
  
--- Diff: 
core/src/main/scala/org/apache/spark/status/api/v1/AllStagesResource.scala ---
@@ -237,7 +237,8 @@ private[v1] object AllStagesResource {
   }
 
   def convertAccumulableInfo(acc: InternalAccumulableInfo): AccumulableInfo = {
-    new AccumulableInfo(acc.id, acc.name, acc.update, acc.value)
+    new AccumulableInfo(
+      acc.id, acc.name, acc.update.map(_.toString), acc.value.map(_.toString).orNull)
--- End diff --

This was kind of confusing at first glance until I remembered that we have 
the weird UI AccumulableInfo and the other version, which is used elsewhere and 
which has been renamed to `InternalAccumulableInfo` here.





[GitHub] spark pull request: [SPARK-12937][SQL] bloom filter serialization

2016-01-25 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/10920#discussion_r50802030
  
--- Diff: 
common/sketch/src/main/java/org/apache/spark/util/sketch/BloomFilter.java ---
@@ -83,7 +87,7 @@
* bloom filters are appropriately sized to avoid saturating them.
*
   * @param other The bloom filter to combine this bloom filter with. It is not mutated.
-   * @throws IllegalArgumentException if {@code isCompatible(that) == false}
+   * @throws IncompatibleMergeException if {@code isCompatible(that) == false}
--- End diff --

you are using "other" instead of "that" here. make them consistent





[GitHub] spark pull request: [SPARK-12937][SQL] bloom filter serialization

2016-01-25 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/10920#discussion_r50801991
  
--- Diff: 
common/sketch/src/main/java/org/apache/spark/util/sketch/BitArray.java ---
@@ -32,13 +38,14 @@ static int numWords(long numBits) {
   }
 
   BitArray(long numBits) {
-    if (numBits <= 0) {
-      throw new IllegalArgumentException("numBits must be positive");
-    }
-    this.data = new long[numWords(numBits)];
+    this(new long[numWords(numBits)]);
+  }
+
+  private BitArray(long[] data) {
+    this.data = data;
     long bitCount = 0;
-    for (long value : data) {
-      bitCount += Long.bitCount(value);
+    for (long datum : data) {
--- End diff --

it is a little bit weird to say "datum" here, since each long actually packs 
64 bits at once. maybe "word"?





[GitHub] spark pull request: [SPARK-12935][SQL] DataFrame API for Count-Min...

2016-01-25 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/10911#discussion_r50801910
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala ---
@@ -309,4 +311,84 @@ final class DataFrameStatFunctions private[sql](df: DataFrame) {
   def sampleBy[T](col: String, fractions: ju.Map[T, jl.Double], seed: Long): DataFrame = {
     sampleBy(col, fractions.asScala.toMap.asInstanceOf[Map[T, Double]], seed)
   }
+
+  /**
+   * Builds a Count-min Sketch over a specified column.
+   *
+   * @param colName name of the column over which the sketch is built
+   * @param depth depth of the sketch
+   * @param width width of the sketch
+   * @param seed random seed
+   * @return a [[CountMinSketch]] over column `colName`
+   * @since 2.0.0
+   */
+  def countMinSketch(colName: String, depth: Int, width: Int, seed: Int): CountMinSketch = {
+    countMinSketch(Column(colName), depth, width, seed)
+  }
+
+  /**
+   * Builds a Count-min Sketch over a specified column.
+   *
+   * @param colName name of the column over which the sketch is built
+   * @param eps relative error of the sketch
+   * @param confidence confidence of the sketch
+   * @param seed random seed
+   * @return a [[CountMinSketch]] over column `colName`
+   * @since 2.0.0
+   */
+  def countMinSketch(
+      colName: String, eps: Double, confidence: Double, seed: Int): CountMinSketch = {
+    countMinSketch(Column(colName), eps, confidence, seed)
+  }
+
+  /**
+   * Builds a Count-min Sketch over a specified column.
+   *
+   * @param col the column over which the sketch is built
+   * @param depth depth of the sketch
+   * @param width width of the sketch
+   * @param seed random seed
+   * @return a [[CountMinSketch]] over column `col`
+   * @since 2.0.0
+   */
+  def countMinSketch(col: Column, depth: Int, width: Int, seed: Int): CountMinSketch = {
+    countMinSketch(col, CountMinSketch.create(depth, width, seed))
+  }
+
+  /**
+   * Builds a Count-min Sketch over a specified column.
+   *
+   * @param col the column over which the sketch is built
+   * @param eps relative error of the sketch
+   * @param confidence confidence of the sketch
+   * @param seed random seed
+   * @return a [[CountMinSketch]] over column `col`
+   * @since 2.0.0
+   */
+  def countMinSketch(col: Column, eps: Double, confidence: Double, seed: Int): CountMinSketch = {
+    countMinSketch(col, CountMinSketch.create(eps, confidence, seed))
+  }
+
+  private def countMinSketch(col: Column, zero: CountMinSketch): CountMinSketch = {
+    val singleCol = df.select(col)
+    val colType = singleCol.schema.head.dataType
+    val supportedTypes: Set[DataType] = Set(ByteType, ShortType, IntegerType, LongType, StringType)
+
+    require(
+      supportedTypes.contains(colType),
--- End diff --

how about `colType == StringType || colType.isInstanceOf[IntegralType]`?
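
As a sketch, the suggested check would then read (assuming `IntegralType` from `org.apache.spark.sql.types` is visible here):

~~~scala
require(
  colType == StringType || colType.isInstanceOf[IntegralType],
  s"Count-min Sketch only supports string type and integral types, " +
    s"and does not support type $colType."
)
~~~

This drops the explicit `supportedTypes` set and picks up any future integral types automatically.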





[GitHub] spark pull request: [SPARK-12937][SQL] bloom filter serialization

2016-01-25 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/10920#discussion_r50801901
  
--- Diff: 
common/sketch/src/main/java/org/apache/spark/util/sketch/BitArray.java ---
@@ -24,6 +27,9 @@
   private long bitCount;
 
   static int numWords(long numBits) {
+    if (numBits <= 0) {
+      throw new IllegalArgumentException("numBits must be positive");
--- End diff --

also include the current value
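
For example, the message could read `"numBits must be positive, but got " + numBits` so the offending value shows up in the exception.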





[GitHub] spark pull request: [SPARK-12937][SQL] bloom filter serialization

2016-01-25 Thread cloud-fan
Github user cloud-fan commented on the pull request:

https://github.com/apache/spark/pull/10920#issuecomment-174869956
  
cc @rxin @liancheng 





[GitHub] spark pull request: [SPARK-12937][SQL] bloom filter serialization

2016-01-25 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/10920#discussion_r50801787
  
--- Diff: 
common/sketch/src/main/java/org/apache/spark/util/sketch/Version.java ---
@@ -0,0 +1,35 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.util.sketch;
+
+/**
+ * Version number of the serialized binary format for bloom filter or 
count-min sketch.
+ */
+public enum Version {
--- End diff --

bloom filter and count-min sketch can have different version values, but we 
can share the same version class.





[GitHub] spark pull request: [SPARK-12937][SQL] bloom filter serialization

2016-01-25 Thread cloud-fan
GitHub user cloud-fan opened a pull request:

https://github.com/apache/spark/pull/10920

[SPARK-12937][SQL] bloom filter serialization

This PR adds serialization support for BloomFilter.

A version number is added to version the serialized binary format.
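
For illustration, a hypothetical sketch of the scheme (in Scala; the PR's actual code is Java, and these names are invented): the writer emits the format version before the payload, and the reader validates it before decoding anything else.

~~~scala
import java.io.{DataInputStream, DataOutputStream, InputStream, OutputStream}

// Hypothetical payload: the number of hash functions plus the bit array words.
def writeTo(out: OutputStream, numHashFunctions: Int, words: Array[Long]): Unit = {
  val dos = new DataOutputStream(out)
  dos.writeInt(1)                          // binary format version, always first
  dos.writeInt(numHashFunctions)
  dos.writeInt(words.length)
  words.foreach(w => dos.writeLong(w))
}

def readFrom(in: InputStream): (Int, Array[Long]) = {
  val dis = new DataInputStream(in)
  val version = dis.readInt()
  require(version == 1, s"Unsupported BloomFilter format version: $version")
  val numHashFunctions = dis.readInt()
  val words = Array.fill(dis.readInt())(dis.readLong())
  (numHashFunctions, words)
}
~~~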

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/cloud-fan/spark bloom-filter

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/10920.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #10920


commit 4b05a35d58cdabccd915582894d303ba437bee0f
Author: Wenchen Fan 
Date:   2016-01-26T07:23:51Z

bloom filter serialization







[GitHub] spark pull request: [SPARK-12828][SQL]add natural join support

2016-01-25 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/10762#discussion_r50801765
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicOperators.scala
 ---
@@ -164,11 +164,42 @@ case class Join(
         left.output.map(_.withNullability(true)) ++ right.output
       case FullOuter =>
         left.output.map(_.withNullability(true)) ++ right.output.map(_.withNullability(true))
+      case NaturalJoin(jt) =>
+        outerProjectList(jt).map(_.toAttribute)
       case _ =>
         left.output ++ right.output
     }
   }
 
+  def outerProjectList(jt: JoinType): Seq[NamedExpression] = {
--- End diff --

also i'm not 100% sure, but i suspect if we make `resolved` false when the 
plan is a natural join type, then we can just move this function into the 
analyzer. it really doesn't belong in the join node itself.






[GitHub] spark pull request: [SPARK-12828][SQL]add natural join support

2016-01-25 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/10762#discussion_r50801713
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicOperators.scala
 ---
@@ -164,11 +164,42 @@ case class Join(
         left.output.map(_.withNullability(true)) ++ right.output
       case FullOuter =>
         left.output.map(_.withNullability(true)) ++ right.output.map(_.withNullability(true))
+      case NaturalJoin(jt) =>
+        outerProjectList(jt).map(_.toAttribute)
       case _ =>
         left.output ++ right.output
     }
   }
 
+  def outerProjectList(jt: JoinType): Seq[NamedExpression] = {
+    val leftNames = left.output.map(_.name)
+    val rightNames = right.output.map(_.name)
+    val commonNames = leftNames.intersect(rightNames)
+    val commonOutputFromLeft = left.output.filter(att => commonNames.contains(att.name))
+    val lUniqueOutput = left.output.filterNot(att => commonNames.contains(att.name))
+    val rUniqueOutput = right.output.filterNot(att => commonNames.contains(att.name))
+    jt match {
+      case LeftOuter =>
+        commonOutputFromLeft ++ lUniqueOutput ++ rUniqueOutput.map(_.withNullability(true))
+      case RightOuter =>
+        val commonOutputFromRight =
+          commonNames.map(cn => right.output.find(att => att.name == cn).get)
+        commonOutputFromRight ++ lUniqueOutput.map(_.withNullability(true)) ++ rUniqueOutput
+      case FullOuter =>
+        val commonOutputFromRight =
+          commonNames.map(cn => right.output.find(att => att.name == cn).get)
+        val commonPairs = commonOutputFromLeft.zip(commonOutputFromRight)
+        val commonOutputExp = commonPairs.map {
+          case (l: Attribute, r: Attribute) =>
+            Alias(Coalesce(Seq(l, r)), l.name)()
+        }
+        commonOutputExp ++
+          lUniqueOutput.map(_.withNullability(true)) ++ rUniqueOutput.map(_.withNullability(true))
+      case _ =>
+        commonOutputFromLeft ++ lUniqueOutput ++ rUniqueOutput
+    }
+  }
+
   def selfJoinResolved: Boolean = left.outputSet.intersect(right.outputSet).isEmpty
 
   // Joins are only resolved if they don't introduce ambiguous expression ids.
--- End diff --

we should make resolved false if the type is natural join
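
A minimal sketch of that suggestion, reusing the `selfJoinResolved` check from the quoted diff (the exact wiring is an assumption):

~~~scala
override lazy val resolved: Boolean = joinType match {
  case NaturalJoin(_) => false  // stay unresolved so the analyzer must rewrite it
  case _ => childrenResolved && expressions.forall(_.resolved) && selfJoinResolved
}
~~~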





[GitHub] spark pull request: [SPARK-12854][SQL] Implement complex types sup...

2016-01-25 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/10820#issuecomment-174867931
  
**[Test build #50080 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50080/consoleFull)**
 for PR 10820 at commit 
[`f378335`](https://github.com/apache/spark/commit/f378335858c1c10400936f046430f8e7f4c70c3c).





[GitHub] spark pull request: [SPARK-9835] [ML] Implement IterativelyReweigh...

2016-01-25 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/10639#discussion_r50801341
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/optim/IterativelyReweightedLeastSquares.scala
 ---
@@ -0,0 +1,100 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.optim
+
+import org.apache.spark.Logging
+import org.apache.spark.ml.feature.Instance
+import org.apache.spark.mllib.linalg._
+import org.apache.spark.rdd.RDD
+
+/**
+ * Model fitted by [[IterativelyReweightedLeastSquares]].
+ * @param coefficients model coefficients
+ * @param intercept model intercept
+ */
+private[ml] class IterativelyReweightedLeastSquaresModel(
+val coefficients: DenseVector,
+val intercept: Double) extends Serializable
+
+/**
+ * Implements the method of iteratively reweighted least squares (IRLS) which is used to solve
+ * certain optimization problems by an iterative method. In each step of the iterations, it
+ * involves solving a weighted least squares (WLS) problem by [[WeightedLeastSquares]].
+ * It can be used to find maximum likelihood estimates of a generalized linear model (GLM),
+ * find M-estimator in robust regression and other optimization problems.
--- End diff --

It would be good to provide a reference about IRLS. The IRLS page on 
Wikipedia is specialized for Lp regression. I would recommend Green's paper as 
a reference: http://www.jstor.org/stable/2345503





[GitHub] spark pull request: [SPARK-12968][SQL] Implement command to set cu...

2016-01-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10916#issuecomment-174866893
  
Merged build finished. Test FAILed.





[GitHub] spark pull request: [SQL][Minor] A few minor tweaks to CSV reader.

2016-01-25 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/10919#issuecomment-174866845
  
**[Test build #2459 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2459/consoleFull)**
 for PR 10919 at commit 
[`3b3c1b7`](https://github.com/apache/spark/commit/3b3c1b73fe8dda6190d10ac567d33aead5beb337).





[GitHub] spark pull request: [SPARK-12968][SQL] Implement command to set cu...

2016-01-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10916#issuecomment-174866895
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/50077/
Test FAILed.





[GitHub] spark pull request: [SPARK-9835] [ML] Implement IterativelyReweigh...

2016-01-25 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/10639#discussion_r50801103
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/optim/IterativelyReweightedLeastSquares.scala
 ---
@@ -0,0 +1,100 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.optim
+
+import org.apache.spark.Logging
+import org.apache.spark.ml.feature.Instance
+import org.apache.spark.mllib.linalg._
+import org.apache.spark.rdd.RDD
+
+/**
+ * Model fitted by [[IterativelyReweightedLeastSquares]].
+ * @param coefficients model coefficients
+ * @param intercept model intercept
+ */
+private[ml] class IterativelyReweightedLeastSquaresModel(
+val coefficients: DenseVector,
+val intercept: Double) extends Serializable
+
+/**
+ * Implements the method of iteratively reweighted least squares (IRLS) which is used to solve
+ * certain optimization problems by an iterative method. In each step of the iterations, it
+ * involves solving a weighted least squares (WLS) problem by [[WeightedLeastSquares]].
+ * It can be used to find maximum likelihood estimates of a generalized linear model (GLM),
+ * find M-estimator in robust regression and other optimization problems.
+ *
+ * @param initialModel the initial guess model.
+ * @param reweightFunc the reweight function which is used to update 
offsets and weights
+ * at each iteration.
+ * @param fitIntercept whether to fit intercept.
+ * @param regParam L2 regularization parameter used by WLS.
+ * @param maxIter maximum number of iterations.
+ * @param tol the convergence tolerance.
+ */
+private[ml] class IterativelyReweightedLeastSquares(
+val initialModel: WeightedLeastSquaresModel,
+val reweightFunc: (Instance, WeightedLeastSquaresModel) => (Double, 
Double),
+val fitIntercept: Boolean,
+val regParam: Double,
+val maxIter: Int,
+val tol: Double) extends Logging with Serializable {
+
+  def fit(instances: RDD[Instance]): 
IterativelyReweightedLeastSquaresModel = {
+
+var converged = false
+var iter = 0
+
+var offsetsAndWeights: RDD[(Double, Double)] = null
+var model: WeightedLeastSquaresModel = initialModel
+var oldModel: WeightedLeastSquaresModel = initialModel
+
+while (iter < maxIter && !converged) {
+
+  oldModel = model
+
+  // Update offsets and weights using reweightFunc
+  offsetsAndWeights = instances.map { instance => 
reweightFunc(instance, oldModel) }
+
+  // Estimate new model
+  val newInstances = instances.zip(offsetsAndWeights).map {
+case (instance, (offset, weight)) => Instance(offset, weight, 
instance.features)
+  }
+  model = new WeightedLeastSquares(fitIntercept, regParam, false, 
false).fit(newInstances)
+
+      val oldParameters = Array.concat(Array(oldModel.intercept), oldModel.coefficients.toArray)
+      val parameters = Array.concat(Array(model.intercept), model.coefficients.toArray)
+      val deltaArray = oldParameters.zip(parameters).map { case (x: Double, y: Double) =>
+        math.abs(x - y)
+      }
+      if (!deltaArray.exists(_ > tol)) {
--- End diff --

This is the infinity norm. Any reference?
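
For reference, the check in the quoted code implements the criterion

~~~latex
\|\theta_{\mathrm{new}} - \theta_{\mathrm{old}}\|_\infty
  = \max_i |\theta_{\mathrm{new},i} - \theta_{\mathrm{old},i}| \le \mathrm{tol}
~~~

where theta stacks the intercept and the coefficients.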





[GitHub] spark pull request: [SPARK-12968][SQL] Implement command to set cu...

2016-01-25 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/10916#issuecomment-174866819
  
**[Test build #50079 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50079/consoleFull)**
 for PR 10916 at commit 
[`43beb4b`](https://github.com/apache/spark/commit/43beb4ba499814c698df7537018ab6fafefa738e).





[GitHub] spark pull request: [SPARK-9835] [ML] Implement IterativelyReweigh...

2016-01-25 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/10639#discussion_r50801090
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/optim/IterativelyReweightedLeastSquares.scala
 ---
@@ -0,0 +1,101 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.optim
+
+import org.apache.spark.Logging
+import org.apache.spark.ml.feature.Instance
+import org.apache.spark.mllib.linalg._
+import org.apache.spark.rdd.RDD
+
+/**
+ * Model fitted by [[IterativelyReweightedLeastSquares]].
+ * @param coefficients model coefficients
+ * @param intercept model intercept
+ */
+private[ml] class IterativelyReweightedLeastSquaresModel(
+val coefficients: DenseVector,
+val intercept: Double) extends Serializable
+
+/**
+ * Implements the method of iteratively reweighted least squares (IRLS) which is used to solve
+ * certain optimization problems by an iterative method. In each step of the iterations, it
+ * involves solving a weighted least squares (WLS) problem by [[WeightedLeastSquares]].
+ * It can be used to find maximum likelihood estimates of a generalized linear model (GLM),
+ * find M-estimator in robust regression and some other optimization problems.
+ *
+ * @param initialModel the initial guess model.
+ * @param reweightFunc the reweight function which is used to update 
offsets and weights
+ * at each iteration.
+ * @param fitIntercept whether to fit intercept.
+ * @param regParam L2 regularization parameter used by WLS.
+ * @param maxIter maximum number of iterations.
+ * @param tol the convergence tolerance.
+ */
+private[ml] class IterativelyReweightedLeastSquares(
+val initialModel: WeightedLeastSquaresModel,
+val reweightFunc: (Instance, WeightedLeastSquaresModel) => (Double, 
Double),
+val fitIntercept: Boolean,
+val regParam: Double,
+val maxIter: Int,
+val tol: Double) extends Logging with Serializable {
+
+  def fit(instances: RDD[Instance]): 
IterativelyReweightedLeastSquaresModel = {
+
+var converged = false
+var iter = 0
+
+var offsetsAndWeights: RDD[(Double, Double)] = null
--- End diff --

It is fine for internal use.





[GitHub] spark pull request: [SPARK-9835] [ML] Implement IterativelyReweigh...

2016-01-25 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/10639#discussion_r50801097
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/optim/IterativelyReweightedLeastSquares.scala
 ---
@@ -0,0 +1,100 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.optim
+
+import org.apache.spark.Logging
+import org.apache.spark.ml.feature.Instance
+import org.apache.spark.mllib.linalg._
+import org.apache.spark.rdd.RDD
+
+/**
+ * Model fitted by [[IterativelyReweightedLeastSquares]].
+ * @param coefficients model coefficients
+ * @param intercept model intercept
+ */
+private[ml] class IterativelyReweightedLeastSquaresModel(
+val coefficients: DenseVector,
+val intercept: Double) extends Serializable
+
+/**
+ * Implements the method of iteratively reweighted least squares (IRLS) which is used to solve
+ * certain optimization problems by an iterative method. In each step of the iterations, it
+ * involves solving a weighted least squares (WLS) problem by [[WeightedLeastSquares]].
+ * It can be used to find maximum likelihood estimates of a generalized linear model (GLM),
+ * find M-estimator in robust regression and other optimization problems.
+ *
+ * @param initialModel the initial guess model.
+ * @param reweightFunc the reweight function which is used to update 
offsets and weights
+ * at each iteration.
+ * @param fitIntercept whether to fit intercept.
+ * @param regParam L2 regularization parameter used by WLS.
+ * @param maxIter maximum number of iterations.
+ * @param tol the convergence tolerance.
+ */
+private[ml] class IterativelyReweightedLeastSquares(
+val initialModel: WeightedLeastSquaresModel,
+val reweightFunc: (Instance, WeightedLeastSquaresModel) => (Double, 
Double),
+val fitIntercept: Boolean,
+val regParam: Double,
+val maxIter: Int,
+val tol: Double) extends Logging with Serializable {
+
+  def fit(instances: RDD[Instance]): 
IterativelyReweightedLeastSquaresModel = {
+
+var converged = false
+var iter = 0
+
+var offsetsAndWeights: RDD[(Double, Double)] = null
+var model: WeightedLeastSquaresModel = initialModel
+var oldModel: WeightedLeastSquaresModel = initialModel
+
+while (iter < maxIter && !converged) {
+
+  oldModel = model
+
+  // Update offsets and weights using reweightFunc
+  offsetsAndWeights = instances.map { instance => 
reweightFunc(instance, oldModel) }
+
+  // Estimate new model
+  val newInstances = instances.zip(offsetsAndWeights).map {
+case (instance, (offset, weight)) => Instance(offset, weight, 
instance.features)
+  }
+  model = new WeightedLeastSquares(fitIntercept, regParam, false, 
false).fit(newInstances)
--- End diff --

use named arguments for booleans
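
i.e., something like the following sketch (assuming the two trailing booleans are named `standardizeFeatures` and `standardizeLabel`; the real parameter names should be checked):

~~~scala
model = new WeightedLeastSquares(fitIntercept, regParam,
  standardizeFeatures = false, standardizeLabel = false).fit(newInstances)
~~~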





[GitHub] spark pull request: [SPARK-9835] [ML] Implement IterativelyReweigh...

2016-01-25 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/10639#discussion_r50801095
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/optim/IterativelyReweightedLeastSquares.scala
 ---
@@ -0,0 +1,100 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.optim
+
+import org.apache.spark.Logging
+import org.apache.spark.ml.feature.Instance
+import org.apache.spark.mllib.linalg._
+import org.apache.spark.rdd.RDD
+
+/**
+ * Model fitted by [[IterativelyReweightedLeastSquares]].
+ * @param coefficients model coefficients
+ * @param intercept model intercept
+ */
+private[ml] class IterativelyReweightedLeastSquaresModel(
+val coefficients: DenseVector,
+val intercept: Double) extends Serializable
+
+/**
+ * Implements the method of iteratively reweighted least squares (IRLS) which is used to solve
+ * certain optimization problems by an iterative method. In each step of the iterations, it
+ * involves solving a weighted least squares (WLS) problem by [[WeightedLeastSquares]].
+ * It can be used to find maximum likelihood estimates of a generalized linear model (GLM),
+ * find M-estimator in robust regression and other optimization problems.
+ *
+ * @param initialModel the initial guess model.
+ * @param reweightFunc the reweight function which is used to update 
offsets and weights
+ * at each iteration.
+ * @param fitIntercept whether to fit intercept.
+ * @param regParam L2 regularization parameter used by WLS.
+ * @param maxIter maximum number of iterations.
+ * @param tol the convergence tolerance.
+ */
+private[ml] class IterativelyReweightedLeastSquares(
+val initialModel: WeightedLeastSquaresModel,
+val reweightFunc: (Instance, WeightedLeastSquaresModel) => (Double, 
Double),
+val fitIntercept: Boolean,
+val regParam: Double,
+val maxIter: Int,
+val tol: Double) extends Logging with Serializable {
+
+  def fit(instances: RDD[Instance]): 
IterativelyReweightedLeastSquaresModel = {
+
+var converged = false
+var iter = 0
+
+var offsetsAndWeights: RDD[(Double, Double)] = null
+var model: WeightedLeastSquaresModel = initialModel
+var oldModel: WeightedLeastSquaresModel = initialModel
+
+while (iter < maxIter && !converged) {
+
+  oldModel = model
+
+  // Update offsets and weights using reweightFunc
+  offsetsAndWeights = instances.map { instance => 
reweightFunc(instance, oldModel) }
+
+  // Estimate new model
+  val newInstances = instances.zip(offsetsAndWeights).map {
--- End diff --

`zip` is not efficient. Generate `newInstances` directly:

~~~scala
val newInstances = instances.map { instance =>
  val (newOffset, newWeight) = reweightFunc(instance, oldModel)
  Instance(newOffset, newWeight, instance.features)
}
~~~





[GitHub] spark pull request: [SPARK-9835] [ML] Implement IterativelyReweigh...

2016-01-25 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/10639#discussion_r50801093
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/optim/IterativelyReweightedLeastSquares.scala
 ---
@@ -0,0 +1,100 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.optim
+
+import org.apache.spark.Logging
+import org.apache.spark.ml.feature.Instance
+import org.apache.spark.mllib.linalg._
+import org.apache.spark.rdd.RDD
+
+/**
+ * Model fitted by [[IterativelyReweightedLeastSquares]].
+ * @param coefficients model coefficients
+ * @param intercept model intercept
+ */
+private[ml] class IterativelyReweightedLeastSquaresModel(
+val coefficients: DenseVector,
+val intercept: Double) extends Serializable
+
+/**
+ * Implements the method of iteratively reweighted least squares (IRLS) which is used to solve
+ * certain optimization problems by an iterative method. In each step of the iterations, it
+ * involves solving a weighted least squares (WLS) problem by [[WeightedLeastSquares]].
+ * It can be used to find maximum likelihood estimates of a generalized linear model (GLM),
+ * find M-estimator in robust regression and other optimization problems.
+ *
+ * @param initialModel the initial guess model.
+ * @param reweightFunc the reweight function which is used to update 
offsets and weights
+ * at each iteration.
+ * @param fitIntercept whether to fit intercept.
+ * @param regParam L2 regularization parameter used by WLS.
+ * @param maxIter maximum number of iterations.
+ * @param tol the convergence tolerance.
+ */
+private[ml] class IterativelyReweightedLeastSquares(
+val initialModel: WeightedLeastSquaresModel,
+val reweightFunc: (Instance, WeightedLeastSquaresModel) => (Double, 
Double),
+val fitIntercept: Boolean,
+val regParam: Double,
+val maxIter: Int,
+val tol: Double) extends Logging with Serializable {
+
+  def fit(instances: RDD[Instance]): 
IterativelyReweightedLeastSquaresModel = {
+
+var converged = false
+var iter = 0
+
+var offsetsAndWeights: RDD[(Double, Double)] = null
+var model: WeightedLeastSquaresModel = initialModel
+var oldModel: WeightedLeastSquaresModel = initialModel
--- End diff --

`= null`





[GitHub] spark pull request: [SPARK-9835] [ML] Implement IterativelyReweigh...

2016-01-25 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/10639#discussion_r50801100
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/optim/IterativelyReweightedLeastSquares.scala
 ---
@@ -0,0 +1,100 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.optim
+
+import org.apache.spark.Logging
+import org.apache.spark.ml.feature.Instance
+import org.apache.spark.mllib.linalg._
+import org.apache.spark.rdd.RDD
+
+/**
+ * Model fitted by [[IterativelyReweightedLeastSquares]].
+ * @param coefficients model coefficients
+ * @param intercept model intercept
+ */
+private[ml] class IterativelyReweightedLeastSquaresModel(
+val coefficients: DenseVector,
+val intercept: Double) extends Serializable
+
+/**
+ * Implements the method of iteratively reweighted least squares (IRLS) which is used to solve
+ * certain optimization problems by an iterative method. In each step of the iterations, it
+ * involves solving a weighted least squares (WLS) problem by [[WeightedLeastSquares]].
+ * It can be used to find maximum likelihood estimates of a generalized linear model (GLM),
+ * find M-estimator in robust regression and other optimization problems.
+ *
+ * @param initialModel the initial guess model.
+ * @param reweightFunc the reweight function which is used to update 
offsets and weights
+ * at each iteration.
+ * @param fitIntercept whether to fit intercept.
+ * @param regParam L2 regularization parameter used by WLS.
+ * @param maxIter maximum number of iterations.
+ * @param tol the convergence tolerance.
+ */
+private[ml] class IterativelyReweightedLeastSquares(
+val initialModel: WeightedLeastSquaresModel,
+val reweightFunc: (Instance, WeightedLeastSquaresModel) => (Double, 
Double),
+val fitIntercept: Boolean,
+val regParam: Double,
+val maxIter: Int,
+val tol: Double) extends Logging with Serializable {
+
+  def fit(instances: RDD[Instance]): 
IterativelyReweightedLeastSquaresModel = {
+
+var converged = false
+var iter = 0
+
+var offsetsAndWeights: RDD[(Double, Double)] = null
+var model: WeightedLeastSquaresModel = initialModel
+var oldModel: WeightedLeastSquaresModel = initialModel
+
+while (iter < maxIter && !converged) {
+
+  oldModel = model
+
+  // Update offsets and weights using reweightFunc
+  offsetsAndWeights = instances.map { instance => 
reweightFunc(instance, oldModel) }
+
+  // Estimate new model
+  val newInstances = instances.zip(offsetsAndWeights).map {
+case (instance, (offset, weight)) => Instance(offset, weight, 
instance.features)
+  }
+  model = new WeightedLeastSquares(fitIntercept, regParam, false, 
false).fit(newInstances)
+
+      val oldParameters = Array.concat(Array(oldModel.intercept), oldModel.coefficients.toArray)
+      val parameters = Array.concat(Array(model.intercept), model.coefficients.toArray)
+      val deltaArray = oldParameters.zip(parameters).map { case (x: Double, y: Double) =>
+        math.abs(x - y)
+      }
--- End diff --

This is inefficient because it allocates several temporary vectors. We can 
compute `(intercept - oldIntercept)^2 + ||coefficients - oldCoefficients||_2^2` 
and then take the square root, without allocating new vectors.
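
A minimal sketch of that computation over the quoted code's variables, with no temporary arrays:

~~~scala
val interceptDelta = model.intercept - oldModel.intercept
var deltaSq = interceptDelta * interceptDelta
var i = 0
while (i < model.coefficients.size) {
  val d = model.coefficients(i) - oldModel.coefficients(i)
  deltaSq += d * d
  i += 1
}
val delta = math.sqrt(deltaSq)  // compare against tol to decide convergence
~~~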





[GitHub] spark pull request: [SPARK-11780][SQL] Add catalyst type aliases b...

2016-01-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10915#issuecomment-174866518
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/50070/
Test FAILed.





[GitHub] spark pull request: [SPARK-11780][SQL] Add catalyst type aliases b...

2016-01-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10915#issuecomment-174866516
  
Merged build finished. Test FAILed.





[GitHub] spark pull request: [SPARK-11780][SQL] Add catalyst type aliases b...

2016-01-25 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/10915#issuecomment-174866357
  
**[Test build #50070 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50070/consoleFull)**
 for PR 10915 at commit 
[`9ef7185`](https://github.com/apache/spark/commit/9ef7185f5a9ce1f672559e00a34854c5afa4).
 * This patch **fails PySpark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-12888][SQL][follow-up] benchmark the ne...

2016-01-25 Thread nongli
Github user nongli commented on the pull request:

https://github.com/apache/spark/pull/10917#issuecomment-174866182
  
@cloud-fan Simple is just a single int right? It's not even doing anything 
in the previous case?





[GitHub] spark pull request: [SQL][Minor] A few minor tweaks to CSV reader.

2016-01-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10919#issuecomment-174865380
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/50076/
Test FAILed.





[GitHub] spark pull request: [SQL][Minor] A few minor tweaks to CSV reader.

2016-01-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10919#issuecomment-174865375
  
Merged build finished. Test FAILed.





[GitHub] spark pull request: [SPARK-11622][MLLIB] Make LibSVMRelation exten...

2016-01-25 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/9595#issuecomment-174862732
  
**[Test build #50078 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50078/consoleFull)**
 for PR 9595 at commit 
[`5bdf224`](https://github.com/apache/spark/commit/5bdf2249a970e443796ab6f88f1680646109e570).





[GitHub] spark pull request: [SPARK-12865][SPARK-12866][SQL] Migrate SparkS...

2016-01-25 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/10905#discussion_r50800538
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/ASTNode.scala 
---
@@ -60,6 +60,12 @@ case class ASTNode(
   /** Source text. */
   lazy val source = stream.toString(startIndex, stopIndex)
 
+  /** Get the source text that remains after this token. */
+  lazy val remainder = {
--- End diff --

if you are updating the pr, can you add explicit types for all the public 
vals?
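
For example, the `source` val quoted above would become `lazy val source: String = stream.toString(startIndex, stopIndex)`, and `remainder` would get an explicit `String` annotation the same way.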






[GitHub] spark pull request: [SPARK-12968][SQL] Implement command to set cu...

2016-01-25 Thread viirya
Github user viirya commented on the pull request:

https://github.com/apache/spark/pull/10916#issuecomment-174861508
  
retest this please.





[GitHub] spark pull request: [SPARK-10086] [MLlib] [Streaming] [PySpark] ig...

2016-01-25 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/10909





[GitHub] spark pull request: [SPARK-11622][MLLIB] Make LibSVMRelation exten...

2016-01-25 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/9595#issuecomment-174861333
  
test this please





[GitHub] spark pull request: [SQL][Minor] A few minor tweaks to CSV reader.

2016-01-25 Thread rxin
GitHub user rxin opened a pull request:

https://github.com/apache/spark/pull/10919

[SQL][Minor] A few minor tweaks to CSV reader.

This pull request fixes a few minor coding style issues in the CSV reader 
that I noticed while reviewing the change post-hoc.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rxin/spark csv-minor

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/10919.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #10919


commit 3b3c1b73fe8dda6190d10ac567d33aead5beb337
Author: Reynold Xin 
Date:   2016-01-26T06:50:54Z

A few minor tweaks to CSV reader.







[GitHub] spark pull request: [SPARK-12983] [CORE] [DOC] Correct metrics.pro...

2016-01-25 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/10902#issuecomment-174860964
  
**[Test build #50075 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50075/consoleFull)**
 for PR 10902 at commit 
[`9c45b8a`](https://github.com/apache/spark/commit/9c45b8a59e7f837a13549839149e59894af19a27).





[GitHub] spark pull request: [SPARK-10086] [MLlib] [Streaming] [PySpark] ig...

2016-01-25 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/10909#issuecomment-174860765
  
Recent failures in the last 4 days:

* 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50016/testReport/
* 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49996/testReport/
* 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49989/testReport/
* 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49870/testReport/

Merged into master.





[GitHub] spark pull request: [SPARK-12994][CORE] It is not necessary to cre...

2016-01-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10914#issuecomment-174860449
  
Merged build finished. Test FAILed.





[GitHub] spark pull request: [SPARK-12995][GraphX] Remove deprecate APIs fr...

2016-01-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10918#issuecomment-174860265
  
Merged build finished. Test FAILed.





[GitHub] spark pull request: [SPARK-12994][CORE] It is not necessary to cre...

2016-01-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10914#issuecomment-174860451
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/50069/
Test FAILed.





[GitHub] spark pull request: [SPARK-12994][CORE] It is not necessary to cre...

2016-01-25 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/10914#issuecomment-174860248
  
**[Test build #50069 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50069/consoleFull)**
 for PR 10914 at commit 
[`90118ca`](https://github.com/apache/spark/commit/90118ca76c2cbe381bc06614c02cd3b089951c10).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-12995][GraphX] Remove deprecate APIs fr...

2016-01-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10918#issuecomment-174860277
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/50074/
Test FAILed.





[GitHub] spark pull request: [SPARK-12995][GraphX] Remove deprecate APIs fr...

2016-01-25 Thread maropu
Github user maropu commented on the pull request:

https://github.com/apache/spark/pull/10918#issuecomment-174859672
  
@srowen This follows up on the discussion in #4402.
I checked that GraphX has deprecated APIs that are used only in Pregel, and 
this PR removes them.
If there aren't any problems, I'll also remove the deprecated ones from the 
GraphX test code.
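
For context, the deprecation markers being removed look roughly like this (simplified, hypothetical signatures; `mapReduceTriplets` is the deprecated GraphX API superseded by `aggregateMessages`, but the real methods are far more involved):

```scala
// Simplified illustration of a Scala deprecation marker; callers of the
// old method get a compiler warning pointing at the replacement.
class Graph[VD, ED] {
  @deprecated("use aggregateMessages instead", "1.2.0")
  def mapReduceTriplets(): Unit = ()

  def aggregateMessages(): Unit = ()
}
```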





[GitHub] spark pull request: [SPARK-12834] Change ser/de of JavaArray and J...

2016-01-25 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/10772





[GitHub] spark pull request: [SPARK-12983] [CORE] [DOC] Correct metrics.pro...

2016-01-25 Thread BenFradet
Github user BenFradet commented on the pull request:

https://github.com/apache/spark/pull/10902#issuecomment-174859378
  
Jenkins, retest this please.





[GitHub] spark pull request: [SPARK-12983] [CORE] [DOC] Correct metrics.pro...

2016-01-25 Thread BenFradet
Github user BenFradet commented on the pull request:

https://github.com/apache/spark/pull/10902#issuecomment-174859335
  
The failure is unrelated to this PR; triggering a new build.





[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...

2016-01-25 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/10085





[GitHub] spark pull request: [SPARK-12834] Change ser/de of JavaArray and J...

2016-01-25 Thread jkbradley
Github user jkbradley commented on the pull request:

https://github.com/apache/spark/pull/10772#issuecomment-174858967
  
LGTM
Merging with master
Thanks!





[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...

2016-01-25 Thread jkbradley
Github user jkbradley commented on the pull request:

https://github.com/apache/spark/pull/10085#issuecomment-174857126
  
LGTM
Merging with master
Thanks for the PR!





[GitHub] spark pull request: [SPARK-12968][SQL] Implement command to set cu...

2016-01-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10916#issuecomment-174855432
  
Merged build finished. Test FAILed.





[GitHub] spark pull request: [SPARK-12968][SQL] Implement command to set cu...

2016-01-25 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/10916#issuecomment-174855251
  
**[Test build #50071 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50071/consoleFull)**
 for PR 10916 at commit 
[`46737b5`](https://github.com/apache/spark/commit/46737b5c9fecbc68b1e4e830b2a1b189a2e72158).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `case class SetDatabaseCommand(databaseName: String) extends 
RunnableCommand `
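
The build bot flags the new public class above. For illustration only, a self-contained sketch (stub catalog type, assumed semantics -- not the PR's actual code, which delegates to SQLContext's catalog) of what a "set current database" command does:

```scala
// Stub standing in for Spark's catalog; purely for illustration.
class Catalog {
  private var current: String = "default"
  def setCurrentDatabase(db: String): Unit = { current = db }
  def currentDatabase: String = current
}

// Mirrors the reported signature; `run` here takes the stub catalog
// directly, whereas the real RunnableCommand takes a SQLContext.
case class SetDatabaseCommand(databaseName: String) {
  def run(catalog: Catalog): Unit = catalog.setCurrentDatabase(databaseName)
}

// e.g. SetDatabaseCommand("sales").run(new Catalog)
```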





[GitHub] spark pull request: [SPARK-12968][SQL] Implement command to set cu...

2016-01-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10916#issuecomment-174855435
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/50071/
Test FAILed.





[GitHub] spark pull request: [SPARK-12995][GraphX] Remove deprecate APIs fr...

2016-01-25 Thread maropu
GitHub user maropu opened a pull request:

https://github.com/apache/spark/pull/10918

[SPARK-12995][GraphX] Remove deprecate APIs from Pregel



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/maropu/spark RemoveDeprecateInPregel

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/10918.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #10918


commit fea631129df389b97f5695c11a6bb0c1fef0fb0c
Author: Takeshi YAMAMURO 
Date:   2016-01-26T06:21:17Z

Remove deprecate APIs from Pregel







[GitHub] spark pull request: [SPARK-12994][CORE] It is not necessary to cre...

2016-01-25 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/10914#issuecomment-174852807
  
**[Test build #50073 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50073/consoleFull)**
 for PR 10914 at commit 
[`0467617`](https://github.com/apache/spark/commit/0467617746590b3083deafaa763ee4cae50d4dc0).





[GitHub] spark pull request: [SPARK-12994][CORE] It is not necessary to cre...

2016-01-25 Thread zjffdu
Github user zjffdu commented on the pull request:

https://github.com/apache/spark/pull/10914#issuecomment-174852141
  
Thanks @jerryshao 





[GitHub] spark pull request: [SPARK-12993][PYSPARK] Remove usage of ADD_FIL...

2016-01-25 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/10913#issuecomment-174849947
  
Can you update the pull request description to describe why we are removing 
this? 





[GitHub] spark pull request: [SPARK-12968][SQL] Implement command to set cu...

2016-01-25 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/10916#issuecomment-174849740
  
cc @hvanhovell for review.






[GitHub] spark pull request: [SPARK-12888][SQL][follow-up] benchmark the ne...

2016-01-25 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/10917#issuecomment-174849282
  
**[Test build #50072 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50072/consoleFull)**
 for PR 10917 at commit 
[`8207dc1`](https://github.com/apache/spark/commit/8207dc109f21527438cbd80894e9b49d63159f12).





[GitHub] spark pull request: [SPARK-12935][SQL] DataFrame API for Count-Min...

2016-01-25 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/10911#issuecomment-174848875
  
cc @JoshRosen are the Python tests broken?

```
Running PySpark tests. Output is in 
/home/jenkins/workspace/SparkPullRequestBuilder/python/unit-tests.log
Error: unrecognized module 'root'. Supported modules: pyspark-mllib, 
pyspark-core, pyspark-ml, pyspark-sql, pyspark-streaming
[error] running 
/home/jenkins/workspace/SparkPullRequestBuilder/python/run-tests 
--modules=pyspark-mllib,pyspark-ml,pyspark-sql,root --parallelism=4 ; received 
return code 255
```





[GitHub] spark pull request: [SPARK-12968][SQL] Implement command to set cu...

2016-01-25 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/10916#issuecomment-174848191
  
**[Test build #50071 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50071/consoleFull)**
 for PR 10916 at commit 
[`46737b5`](https://github.com/apache/spark/commit/46737b5c9fecbc68b1e4e830b2a1b189a2e72158).





[GitHub] spark pull request: [SPARK-12935][SQL] DataFrame API for Count-Min...

2016-01-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10911#issuecomment-174847369
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/50061/
Test FAILed.





[GitHub] spark pull request: [SPARK-12935][SQL] DataFrame API for Count-Min...

2016-01-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10911#issuecomment-174847368
  
Merged build finished. Test FAILed.





[GitHub] spark pull request: [SPARK-12935][SQL] DataFrame API for Count-Min...

2016-01-25 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/10911#issuecomment-174847279
  
**[Test build #50061 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50061/consoleFull)**
 for PR 10911 at commit 
[`32a9860`](https://github.com/apache/spark/commit/32a98600705c951b202b0a060eaca536b1477713).
 * This patch **fails PySpark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-12888][SQL][follow-up] benchmark the ne...

2016-01-25 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/10917#issuecomment-174847221
  
@nongli maybe we should just use the simpler multiplication and addition?
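
As a hedged illustration of what a "simpler multiplication and addition" scheme could look like (assumed here to mean a Java-hashCode-style polynomial hash; the PR may settle on something else entirely):

```scala
// Polynomial multiply-add hash: h = h * 31 + x, folded over the input.
def multAddHash(values: Array[Int]): Int = {
  var h = 17
  var i = 0
  while (i < values.length) {
    h = h * 31 + values(i) // one multiply and one add per element
    i += 1
  }
  h
}

// e.g. multAddHash(Array(1, 2, 3))
```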






[GitHub] spark pull request: [SPARK-12935][SQL] DataFrame API for Count-Min...

2016-01-25 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/10911#discussion_r50797694
  
--- Diff: 
common/sketch/src/main/java/org/apache/spark/util/sketch/CountMinSketchImpl.java
 ---
@@ -368,4 +379,30 @@ public static CountMinSketchImpl readFrom(InputStream 
in) throws IOException {
 
 return new CountMinSketchImpl(depth, width, totalCount, hashA, table);
   }
+
+  @Override
+  public void writeExternal(ObjectOutput out) throws IOException {
+try (ByteArrayOutputStream bos = new ByteArrayOutputStream()) {
+  this.writeTo(bos);
+  byte[] bytes = bos.toByteArray();
+  out.writeObject(bytes);
+}
+  }
+
+  @Override
+  public void readExternal(ObjectInput in) throws IOException, 
ClassNotFoundException {
+byte[] bytes = (byte[]) in.readObject();
+
+try (ByteArrayInputStream bis = new ByteArrayInputStream(bytes)) {
+  CountMinSketchImpl sketch = CountMinSketchImpl.readFrom(bis);
+
+  this.depth = sketch.depth;
--- End diff --

It'd be good to refactor this so we don't need to assign the variables one 
by one. One way is to take the serialization/deserialization code out of 
readFrom into a separate function.
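
A sketch of the shape of that refactor (written in Scala for brevity, though the real class is Java, and using a made-up two-field wire format rather than the sketch's actual one): pull the stream-to-fields decoding into a single helper so readExternal populates the instance directly instead of building a temporary sketch and copying its fields.

```scala
import java.io._

class Sketch extends Externalizable {
  private var depth: Int = _
  private var width: Int = _

  // Single place that decodes the wire format into this instance's
  // fields; readExternal reuses it, so no field-by-field copying from a
  // temporary object is needed.
  private def readFields(in: InputStream): Unit = {
    val dis = new DataInputStream(in)
    depth = dis.readInt()
    width = dis.readInt()
  }

  override def writeExternal(out: ObjectOutput): Unit = {
    val bos = new ByteArrayOutputStream()
    val dos = new DataOutputStream(bos)
    dos.writeInt(depth)
    dos.writeInt(width)
    out.writeObject(bos.toByteArray)
  }

  override def readExternal(in: ObjectInput): Unit = {
    val bytes = in.readObject().asInstanceOf[Array[Byte]]
    readFields(new ByteArrayInputStream(bytes))
  }
}
```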





