[GitHub] spark pull request: [SPARK-11425] [SPARK-11486] Improve Hybrid agg...

2015-11-03 Thread davies
Github user davies commented on a diff in the pull request:

https://github.com/apache/spark/pull/9383#discussion_r43827568
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/TungstenAggregationIterator.scala ---
@@ -762,15 +679,7 @@ class TungstenAggregationIterator(
   /**
* Start processing input rows.
*/
-  testFallbackStartsAt match {
-case None =>
-  processInputs()
-case Some(fallbackStartsAt) =>
-  // This is the testing path. processInputsWithControlledFallback is same as processInputs
-  // except that it switches to sort-based aggregation after `fallbackStartsAt` input rows
-  // have been processed.
-  processInputsWithControlledFallback(fallbackStartsAt)
-  }
+  processInputs(testFallbackStartsAt.getOrElse(Int.MaxValue))
--- End diff --

Each record needs 30+ bytes, so a single task would need more than 60G of memory to trigger this spilling; I think that's fine.
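The refactor under discussion collapses the Option match into a single call, using `Int.MaxValue` as a "never fall back" sentinel on the normal path. A minimal Scala sketch of the equivalence (`processInputs` here is a hypothetical stand-in that just echoes the fallback threshold, not the real iterator method):

```scala
object FallbackSketch {
  // Hypothetical stand-in for the real processInputs(fallbackStartsAt):
  // it just echoes the row count at which sort-based fallback would start.
  def processInputs(fallbackStartsAt: Int): Int = fallbackStartsAt

  // Before the refactor: an explicit match on the testing hook.
  def before(testFallbackStartsAt: Option[Int]): Int =
    testFallbackStartsAt match {
      case None => processInputs(Int.MaxValue)
      case Some(fallbackStartsAt) => processInputs(fallbackStartsAt)
    }

  // After the refactor: Int.MaxValue encodes "never fall back" for the
  // normal path, collapsing both branches into one call.
  def after(testFallbackStartsAt: Option[Int]): Int =
    processInputs(testFallbackStartsAt.getOrElse(Int.MaxValue))
}
```

Both versions behave identically for `None` and `Some(n)`; the sentinel only matters if a task could ever process close to `Int.MaxValue` rows before spilling, which is the case the comment above argues is unrealistic.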


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [WIP][SPARK-11217][ML] save/load for non-meta ...

2015-11-03 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/9454#discussion_r43827580
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/param/params.scala ---
@@ -592,7 +592,7 @@ trait Params extends Identifiable with Serializable {
   /**
* Sets a parameter in the embedded param map.
*/
-  protected final def set[T](param: Param[T], value: T): this.type = {
+  final def set[T](param: Param[T], value: T): this.type = {
--- End diff --

It is not feasible to set an arbitrary param outside the instance.





[GitHub] spark pull request: [WIP][SPARK-11217][ML] save/load for non-meta ...

2015-11-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9454#issuecomment-153532152
  
 Merged build triggered.





[GitHub] spark pull request: [WIP][SPARK-11217][ML] save/load for non-meta ...

2015-11-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9454#issuecomment-153532178
  
Merged build started.





[GitHub] spark pull request: [WIP][SPARK-11217][ML] save/load for non-meta ...

2015-11-03 Thread mengxr
GitHub user mengxr opened a pull request:

https://github.com/apache/spark/pull/9454

[WIP][SPARK-11217][ML] save/load for non-meta estimators and transformers

This PR implements the default save/load for non-meta estimators and 
transformers using the JSON serialization of param values. The saved metadata 
includes:

* class name
* uid
* timestamp
* paramMap
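As a purely illustrative sketch (field names and values assumed here, not taken from the PR), the saved metadata for a transformer might serialize to something like:

```json
{
  "class": "org.apache.spark.ml.feature.Binarizer",
  "uid": "binarizer_abc123",
  "timestamp": 1446595282000,
  "paramMap": {
    "threshold": 0.5
  }
}
```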

The save/load interface is similar to DataFrames. We use the current active 
context by default, which should be sufficient for most use cases.

~~~scala
instance.save.to("path")
instance.save.options("overwrite" -> "true").to("path")
instance.save.context(sqlContext).to("path")

Instance.load.from("path")
~~~

The param handling is different from the design doc: we didn't save default and user-set params separately, so when we load an instance back, all params are user-set. This does cause issues, but saving them separately would also cause other issues if we modified the default params.

TODOs:

* [ ] Java test
* [ ] a follow-up PR to implement default save/load for all non-meta 
estimators and transformers

cc @jkbradley 

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/mengxr/spark SPARK-11217

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/9454.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #9454


commit cd1c7eae3246f93b6ee4e03adfe57fdf1386
Author: Xiangrui Meng 
Date:   2015-11-03T18:56:22Z

initial implementation

commit df81d61f73c6a854913df638770f0b0409f046a3
Author: Xiangrui Meng 
Date:   2015-11-03T23:41:58Z

update doc and test







[GitHub] spark pull request: [SPARK-11198][STREAMING][KINESIS] Support de-a...

2015-11-03 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/9403#issuecomment-153532104
  
**[Test build #44975 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44975/consoleFull)** for PR 9403 at commit [`7c1edd7`](https://github.com/apache/spark/commit/7c1edd74a93a8336a714e0b3e143a15daadaafc0).





[GitHub] spark pull request: [SPARK-10863][SPARKR] Method coltypes() to get...

2015-11-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8984#issuecomment-153531457
  
 Merged build triggered.





[GitHub] spark pull request: [SPARK-11389][CORE] Add support for off-heap m...

2015-11-03 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/9344#discussion_r43827121
  
--- Diff: core/src/main/scala/org/apache/spark/memory/StorageMemoryPool.scala ---
@@ -0,0 +1,138 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.memory
+
+import javax.annotation.concurrent.GuardedBy
+
+import scala.collection.mutable
+import scala.collection.mutable.ArrayBuffer
+
+import org.apache.spark.{TaskContext, Logging}
+import org.apache.spark.storage.{MemoryStore, BlockStatus, BlockId}
+
+/**
+ * Performs bookkeeping for managing an adjustable-size pool of memory that is used for storage
+ * (caching).
+ *
+ * @param memoryManager a [[MemoryManager]] instance to synchronize on
+ */
+class StorageMemoryPool(memoryManager: Object) extends MemoryPool(memoryManager) with Logging {
+
+  @GuardedBy("memoryManager")
+  private[this] var _memoryUsed: Long = 0L
+
+  override def memoryUsed: Long = memoryManager.synchronized {
+_memoryUsed
+  }
+
+  private var _memoryStore: MemoryStore = _
+  def memoryStore: MemoryStore = {
+if (_memoryStore == null) {
+  throw new IllegalArgumentException("memory store not initialized yet")
--- End diff --

illegal state?
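The review question is whether `IllegalStateException` fits better than `IllegalArgumentException` here, since the accessor takes no argument and the failure is about being called before initialization. A hedged sketch of that change (`MemoryStoreHolder` is a hypothetical stand-in for `StorageMemoryPool`, not the real class):

```scala
// Hypothetical stand-in for StorageMemoryPool's lazily-wired memory store.
class MemoryStoreHolder {
  private var _store: AnyRef = _

  // IllegalStateException signals "called at the wrong time", which fits
  // an accessor invoked before initialization; IllegalArgumentException
  // would imply a bad argument, but this method takes none.
  def memoryStore: AnyRef = {
    if (_store == null) {
      throw new IllegalStateException("memory store not initialized yet")
    }
    _store
  }

  def setMemoryStore(store: AnyRef): Unit = { _store = store }
}
```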





[GitHub] spark pull request: [SPARK-10863][SPARKR] Method coltypes() to get...

2015-11-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8984#issuecomment-153531477
  
Merged build started.





[GitHub] spark pull request: [DOC] Missing link to R DataFrame API doc

2015-11-03 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/9394#issuecomment-153531026
  
**[Test build #44976 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44976/consoleFull)** for PR 9394 at commit [`12f3a74`](https://github.com/apache/spark/commit/12f3a742db8f9dde54b7f8e1dfa9e880da1894a4).





[GitHub] spark pull request: WIP: Optimize Inner joins with skewed null val...

2015-11-03 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/9451#issuecomment-153530818
  
**[Test build #44974 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44974/consoleFull)** for PR 9451 at commit [`9a6d9dc`](https://github.com/apache/spark/commit/9a6d9dc1fa097dd015b4bf83d9002a3f3d19d8ec).
 * This patch **fails Scala style tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: WIP: Optimize Inner joins with skewed null val...

2015-11-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9451#issuecomment-153530822
  
Merged build finished. Test FAILed.





[GitHub] spark pull request: WIP: Optimize Inner joins with skewed null val...

2015-11-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9451#issuecomment-153530824
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44974/
Test FAILed.





[GitHub] spark pull request: [SPARK-11198][STREAMING][KINESIS] Support de-a...

2015-11-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9403#issuecomment-153530716
  
 Merged build triggered.





[GitHub] spark pull request: [DOC] Missing link to R DataFrame API doc

2015-11-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9394#issuecomment-153530740
  
Merged build started.





[GitHub] spark pull request: [DOC] Missing link to R DataFrame API doc

2015-11-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9394#issuecomment-153530721
  
 Merged build triggered.





[GitHub] spark pull request: [SPARK-11198][STREAMING][KINESIS] Support de-a...

2015-11-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9403#issuecomment-153530739
  
Merged build started.





[GitHub] spark pull request: [SPARK-11425] [SPARK-11486] Improve Hybrid agg...

2015-11-03 Thread yhuai
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/9383#discussion_r43826631
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/TungstenAggregationIterator.scala ---
@@ -762,15 +679,7 @@ class TungstenAggregationIterator(
   /**
* Start processing input rows.
*/
-  testFallbackStartsAt match {
-case None =>
-  processInputs()
-case Some(fallbackStartsAt) =>
-  // This is the testing path. processInputsWithControlledFallback is same as processInputs
-  // except that it switches to sort-based aggregation after `fallbackStartsAt` input rows
-  // have been processed.
-  processInputsWithControlledFallback(fallbackStartsAt)
-  }
+  processInputs(testFallbackStartsAt.getOrElse(Int.MaxValue))
--- End diff --

It's still possible that we basically have a lot of memory space to use, right?





[GitHub] spark pull request: WIP: Optimize Inner joins with skewed null val...

2015-11-03 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/9451#issuecomment-153530486
  
**[Test build #44974 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44974/consoleFull)** for PR 9451 at commit [`9a6d9dc`](https://github.com/apache/spark/commit/9a6d9dc1fa097dd015b4bf83d9002a3f3d19d8ec).





[GitHub] spark pull request: [SPARK-11389][CORE] Add support for off-heap m...

2015-11-03 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/9344#discussion_r43826411
  
--- Diff: core/src/main/java/org/apache/spark/memory/TaskMemoryManager.java ---
@@ -127,23 +127,30 @@ public TaskMemoryManager(MemoryManager memoryManager, long taskAttemptId) {
*
* @return number of bytes successfully granted (<= N).
*/
-  public long acquireExecutionMemory(long required, MemoryConsumer consumer) {
+  public long acquireExecutionMemory(
+  long required,
+  MemoryMode mode,
+  MemoryConsumer consumer) {
 assert(required >= 0);
+// If we are allocating Tungsten pages off-heap and receive a request to allocate on-heap
+// memory here, then it may not make sense to spill since that would only end up freeing
+// off-heap memory. This is subject to change, though, so it may be risky to make this
+// optimization now in case we forget to undo it late when making changes.
 synchronized (this) {
-  long got = memoryManager.acquireExecutionMemory(required, taskAttemptId);
+  long got = memoryManager.acquireExecutionMemory(required, taskAttemptId, mode);
 
-  // try to release memory from other consumers first, then we can reduce the frequency of
+  // Try to release memory from other consumers first, then we can reduce the frequency of
   // spilling, avoid to have too many spilled files.
   if (got < required) {
 // Call spill() on other consumers to release memory
 for (MemoryConsumer c: consumers) {
-  if (c != null && c != consumer && c.getUsed() > 0) {
+  if (c != consumer && c.getMemoryUsed(mode) > 0) {
--- End diff --

Because we'd end up storing `null` into the consumers map when allocating memory that was not requested by a particular consumer.





[GitHub] spark pull request: [SPARK-11389][CORE] Add support for off-heap m...

2015-11-03 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/9344#discussion_r43826383
  
--- Diff: core/src/main/java/org/apache/spark/memory/TaskMemoryManager.java ---
@@ -161,10 +168,10 @@ public long acquireExecutionMemory(long required, MemoryConsumer consumer) {
   if (got < required && consumer != null) {
 try {
   long released = consumer.spill(required - got, consumer);
-  if (released > 0) {
+  if (released > 0 && mode == tungstenMemoryMode) {
 logger.info("Task {} released {} from itself ({})", taskAttemptId,
--- End diff --

debug level
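The suggestion is to demote the per-spill release message from info to debug, since it can fire once per consumer on every allocation that comes up short. A sketch with a stub logger (the real code uses SLF4J; the message text and names here are assumptions):

```scala
object SpillLogging {
  // Stub logger standing in for the SLF4J logger in TaskMemoryManager;
  // it records the last debug message so the effect is observable.
  object logger {
    var lastDebug: String = ""
    def debug(msg: String): Unit = { lastDebug = msg }
  }

  // Demoted from info to debug: this can fire once per consumer on every
  // under-satisfied allocation, which is noisy at info level.
  def onSpill(taskAttemptId: Long, released: Long): Unit = {
    if (released > 0) {
      logger.debug(s"Task $taskAttemptId released $released bytes")
    }
  }
}
```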





[GitHub] spark pull request: [SPARK-11389][CORE] Add support for off-heap m...

2015-11-03 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/9344#discussion_r43826374
  
--- Diff: core/src/main/java/org/apache/spark/memory/TaskMemoryManager.java ---
@@ -127,23 +127,30 @@ public TaskMemoryManager(MemoryManager memoryManager, long taskAttemptId) {
*
* @return number of bytes successfully granted (<= N).
*/
-  public long acquireExecutionMemory(long required, MemoryConsumer consumer) {
+  public long acquireExecutionMemory(
+  long required,
+  MemoryMode mode,
+  MemoryConsumer consumer) {
 assert(required >= 0);
+// If we are allocating Tungsten pages off-heap and receive a request to allocate on-heap
+// memory here, then it may not make sense to spill since that would only end up freeing
+// off-heap memory. This is subject to change, though, so it may be risky to make this
+// optimization now in case we forget to undo it late when making changes.
 synchronized (this) {
-  long got = memoryManager.acquireExecutionMemory(required, taskAttemptId);
+  long got = memoryManager.acquireExecutionMemory(required, taskAttemptId, mode);
 
-  // try to release memory from other consumers first, then we can reduce the frequency of
+  // Try to release memory from other consumers first, then we can reduce the frequency of
   // spilling, avoid to have too many spilled files.
   if (got < required) {
 // Call spill() on other consumers to release memory
 for (MemoryConsumer c: consumers) {
-  if (c != null && c != consumer && c.getUsed() > 0) {
+  if (c != consumer && c.getMemoryUsed(mode) > 0) {
 try {
   long released = c.spill(required - got, consumer);
-  if (released > 0) {
+  if (released > 0 && mode == tungstenMemoryMode) {
 logger.info("Task {} released {} from {} for {}", taskAttemptId,
--- End diff --

debug level?





[GitHub] spark pull request: [SPARK-11433] [SQL] Cleanup the subquery name ...

2015-11-03 Thread gatorsmile
Github user gatorsmile commented on the pull request:

https://github.com/apache/spark/pull/9385#issuecomment-153529973
  
@cloud-fan @dbtsai, Jenkins did not start the tests. Could you trigger Jenkins to test it?

Thank you!





[GitHub] spark pull request: [SPARK-11389][CORE] Add support for off-heap m...

2015-11-03 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/9344#discussion_r43826353
  
--- Diff: core/src/main/java/org/apache/spark/memory/TaskMemoryManager.java 
---
@@ -127,23 +127,30 @@ public TaskMemoryManager(MemoryManager memoryManager, 
long taskAttemptId) {
*
* @return number of bytes successfully granted (<= N).
*/
-  public long acquireExecutionMemory(long required, MemoryConsumer 
consumer) {
+  public long acquireExecutionMemory(
+  long required,
+  MemoryMode mode,
+  MemoryConsumer consumer) {
 assert(required >= 0);
+// If we are allocating Tungsten pages off-heap and receive a request 
to allocate on-heap
+// memory here, then it may not make sense to spill since that would 
only end up freeing
+// off-heap memory. This is subject to change, though, so it may be 
risky to make this
+// optimization now in case we forget to undo it late when making 
changes.
 synchronized (this) {
-  long got = memoryManager.acquireExecutionMemory(required, 
taskAttemptId);
+  long got = memoryManager.acquireExecutionMemory(required, 
taskAttemptId, mode);
 
-  // try to release memory from other consumers first, then we can 
reduce the frequency of
+  // Try to release memory from other consumers first, then we can 
reduce the frequency of
   // spilling, avoid to have too many spilled files.
   if (got < required) {
 // Call spill() on other consumers to release memory
 for (MemoryConsumer c: consumers) {
-  if (c != null && c != consumer && c.getUsed() > 0) {
+  if (c != consumer && c.getMemoryUsed(mode) > 0) {
--- End diff --

how come we needed the null check before?





[GitHub] spark pull request: WIP: Optimize Inner joins with skewed null val...

2015-11-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9451#issuecomment-153529807
  
 Merged build triggered.





[GitHub] spark pull request: [SPARK-11389][CORE] Add support for off-heap m...

2015-11-03 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/9344#discussion_r43826213
  
--- Diff: core/src/main/java/org/apache/spark/memory/MemoryConsumer.java ---
@@ -74,26 +78,26 @@ public void spill() throws IOException {
   public abstract long spill(long size, MemoryConsumer trigger) throws IOException;
 
   /**
-   * Acquire `size` bytes memory.
+   * Acquire `size` bytes of on-heap execution memory.
*
* If there is not enough memory, throws OutOfMemoryError.
*/
   protected void acquireMemory(long size) {
-long got = taskMemoryManager.acquireExecutionMemory(size, this);
+long got = taskMemoryManager.acquireExecutionMemory(size, MemoryMode.ON_HEAP, this);
--- End diff --

It's a little confusing that this method only supports ON_HEAP. Maybe rename the method?
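A hedged sketch of the suggested rename, making the hard-coded mode visible in the method name (all names here are hypothetical stand-ins, not the actual Spark API):

```scala
// Hypothetical stand-in for MemoryMode; the real one is a Java enum.
object MemoryMode extends Enumeration { val ON_HEAP, OFF_HEAP = Value }

class ConsumerSketch {
  private var used: Long = 0L

  // Stand-in for TaskMemoryManager.acquireExecutionMemory: grants the
  // full request and tracks usage.
  private def acquireExecutionMemory(size: Long, mode: MemoryMode.Value): Long = {
    used += size
    size
  }

  // Renamed from acquireMemory so the hard-coded ON_HEAP mode is visible
  // at the call site rather than hidden inside the body.
  def acquireOnHeapMemory(size: Long): Long =
    acquireExecutionMemory(size, MemoryMode.ON_HEAP)

  def memoryUsed: Long = used
}
```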





[GitHub] spark pull request: WIP: Optimize Inner joins with skewed null val...

2015-11-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9451#issuecomment-153529828
  
Merged build started.





[GitHub] spark pull request: WIP: Optimize Inner joins with skewed null val...

2015-11-03 Thread yhuai
Github user yhuai commented on the pull request:

https://github.com/apache/spark/pull/9451#issuecomment-153529631
  
Regarding the format of the title, we can do `[SPARK-x] [SQL] ...`





[GitHub] spark pull request: WIP: Optimize Inner joins with skewed null val...

2015-11-03 Thread yhuai
Github user yhuai commented on the pull request:

https://github.com/apache/spark/pull/9451#issuecomment-153529582
  
How about we update the title to include the jira? Is 
https://issues.apache.org/jira/browse/SPARK-9372 the right one?





[GitHub] spark pull request: [SPARK-11433] [SQL] Cleanup the subquery name ...

2015-11-03 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/9385#discussion_r43826123
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
 ---
@@ -1019,7 +1019,16 @@ class Analyzer(
  * scoping information for attributes and can be removed once analysis is 
complete.
  */
 object EliminateSubQueries extends Rule[LogicalPlan] {
-  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
+  def apply(plan: LogicalPlan): LogicalPlan = plan transformDown {
+case Project(projectList, child: Subquery) => {
+  Project(
+projectList.flatMap {
--- End diff --

Thank you! I made the change based on your suggestion. : )





[GitHub] spark pull request: WIP: Optimize Inner joins with skewed null val...

2015-11-03 Thread yhuai
Github user yhuai commented on the pull request:

https://github.com/apache/spark/pull/9451#issuecomment-153529478
  
ok to test





[GitHub] spark pull request: [SPARK-11493] remove bitset from BytesToBytesM...

2015-11-03 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/9452#issuecomment-153529252
  
**[Test build #44973 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44973/consoleFull)**
 for PR 9452 at commit 
[`8ce9696`](https://github.com/apache/spark/commit/8ce96964b90cea68eb9c6237a445284730bc2802).





[GitHub] spark pull request: [SPARK-11493] remove bitset from BytesToBytesM...

2015-11-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9452#issuecomment-153529071
  
Merged build started.





[GitHub] spark pull request: [SPARK-11493] remove bitset from BytesToBytesM...

2015-11-03 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/9452#issuecomment-153529161
  
**[Test build #1974 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/1974/consoleFull)**
 for PR 9452 at commit 
[`8ce9696`](https://github.com/apache/spark/commit/8ce96964b90cea68eb9c6237a445284730bc2802).





[GitHub] spark pull request: [SPARK-11493] remove bitset from BytesToBytesM...

2015-11-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9452#issuecomment-153529047
  
 Merged build triggered.





[GitHub] spark pull request: [SPARK-9858] [SQL] Add an ExchangeCoordinator ...

2015-11-03 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/9453#issuecomment-153528846
  
**[Test build #44970 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44970/consoleFull)**
 for PR 9453 at commit 
[`d77c921`](https://github.com/apache/spark/commit/d77c9216ee4e2dae1b07633341dbc9f90d32af8d).





[GitHub] spark pull request: [SPARK-11425] [SPARK-11486] Improve Hybrid agg...

2015-11-03 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/9383#issuecomment-153528665
  
**[Test build #44972 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44972/consoleFull)**
 for PR 9383 at commit 
[`6f3bb15`](https://github.com/apache/spark/commit/6f3bb15b19cd326f677f15860cf215f57fd3671a).





[GitHub] spark pull request: [SPARK-11493] remove bitset from BytesToBytesM...

2015-11-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9452#issuecomment-153528562
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44971/
Test FAILed.





[GitHub] spark pull request: [SPARK-11493] remove bitset from BytesToBytesM...

2015-11-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9452#issuecomment-153528560
  
Build finished. Test FAILed.





[GitHub] spark pull request: [SPARK-11493] remove bitset from BytesToBytesM...

2015-11-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9452#issuecomment-153528326
  
Build started.





[GitHub] spark pull request: [SPARK-11425] [SPARK-11486] Improve Hybrid agg...

2015-11-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9383#issuecomment-153528306
  
 Merged build triggered.





[GitHub] spark pull request: [SPARK-11425] [SPARK-11486] Improve Hybrid agg...

2015-11-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9383#issuecomment-153528327
  
Merged build started.





[GitHub] spark pull request: [SPARK-9858] [SQL] Add an ExchangeCoordinator ...

2015-11-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9453#issuecomment-153528320
  
Merged build started.





[GitHub] spark pull request: [SPARK-11493] remove bitset from BytesToBytesM...

2015-11-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9452#issuecomment-153528303
  
 Build triggered.





[GitHub] spark pull request: [SPARK-9858] [SQL] Add an ExchangeCoordinator ...

2015-11-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9453#issuecomment-153528291
  
 Merged build triggered.





[GitHub] spark pull request: [SPARK-10863][SPARKR] Method coltypes() to get...

2015-11-03 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/8984#discussion_r43825376
  
--- Diff: R/pkg/R/DataFrame.R ---
@@ -1914,3 +1914,34 @@ setMethod("attach",
 }
 attach(newEnv, pos = pos, name = name, warn.conflicts = 
warn.conflicts)
   })
+
+#' Returns the column types of a DataFrame.
+#' 
+#' @name coltypes
+#' @title Get column types of a DataFrame
+#' @param x (DataFrame)
+#' @return value (character) A character vector with the column types of 
the given DataFrame
+#' @rdname coltypes
+setMethod("coltypes",
+  signature(x = "DataFrame"),
+  function(x) {
+# Get the data types of the DataFrame by invoking dtypes() 
function
+types <- lapply(dtypes(x), function(x) {x[[2]]})
+
+# Map Spark data types into R's data types using DATA_TYPES 
environment
+rTypes <- lapply(types, function(x) {
+  if (exists(x, envir=DATA_TYPES)) {
--- End diff --

nit: you could probably do
```
type <- DATA_TYPES[[x]]
if (is.null(type)) {
  stop("error")
}
```
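The suggested snippet is a fail-fast dictionary lookup. A minimal Python sketch of the same pattern, using a hypothetical `DATA_TYPES` mapping from Spark SQL type names to R type names (the names are illustrative, not SparkR's actual table):

```python
# Hypothetical mapping from Spark SQL type names to R type names;
# the real table lives in SparkR's DATA_TYPES environment.
DATA_TYPES = {
    "integer": "integer",
    "double": "numeric",
    "string": "character",
    "boolean": "logical",
}

def col_type(spark_type):
    # Fail fast on an unmapped type instead of silently returning None.
    r_type = DATA_TYPES.get(spark_type)
    if r_type is None:
        raise ValueError("Unsupported data type: %s" % spark_type)
    return r_type
```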






[GitHub] spark pull request: [SPARK-9858] [SQL] Add an ExchangeCoordinator ...

2015-11-03 Thread yhuai
Github user yhuai commented on the pull request:

https://github.com/apache/spark/pull/9453#issuecomment-153527809
  
cc @JoshRosen 





[GitHub] spark pull request: [SPARK-11425] [SPARK-11486] Improve Hybrid agg...

2015-11-03 Thread yhuai
Github user yhuai commented on the pull request:

https://github.com/apache/spark/pull/9383#issuecomment-153527858
  
test this please





[GitHub] spark pull request: [SPARK-9858] [SQL] Add an ExchangeCoordinator ...

2015-11-03 Thread yhuai
GitHub user yhuai opened a pull request:

https://github.com/apache/spark/pull/9453

[SPARK-9858] [SQL] Add an ExchangeCoordinator to estimate the number of 
post-shuffle partitions for aggregates and joins (follow-up)

https://issues.apache.org/jira/browse/SPARK-9858

This PR is the follow-up work of https://github.com/apache/spark/pull/9276. 
It addresses @JoshRosen's comments.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/yhuai/spark numReducer-followUp

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/9453.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #9453


commit d77c9216ee4e2dae1b07633341dbc9f90d32af8d
Author: Yin Huai 
Date:   2015-11-03T23:53:03Z

Address Josh's comments.







[GitHub] spark pull request: [SPARK-11493] remove bitset from BytesToBytesM...

2015-11-03 Thread davies
GitHub user davies opened a pull request:

https://github.com/apache/spark/pull/9452

[SPARK-11493] remove bitset from BytesToBytesMap

Since we store 4 bytes as the number of records at the beginning of a page, an address can never be zero, so we do not need the bitset.

Performance-wise, the bitset could speed up a false lookup when the slot is empty (the bitset is smaller than the longArray, so its cache hit rate is higher). In practice, the map is filled to 35%-70% (50% on average), so only half of the false lookups can benefit from it; all the others pay the cost of loading the bitset and still need to access the longArray anyway.

For aggregation, we always need to access the longArray anyway (to insert a new key after a false lookup); this was also confirmed by a benchmark.

For broadcast hash join, there could be a regression, but a simple benchmark suggested there may not be one (most of the lookups are false):

```
import time

sqlContext.range(1 << 20).write.parquet("small")
df = sqlContext.read.parquet("small")
for i in range(3):
    t = time.time()
    df2 = sqlContext.range(1 << 26)
    df2.join(df, df.id == df2.id).count()
    print time.time() - t
```

With the bitset (time in seconds):
```
15.9311430454
11.19282794
9.98717904091
```
After removing the bitset (time in seconds):
```
14.6036689281
9.53816604614
8.95434498787
``` 
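The reasoning above can be illustrated with a toy open-addressing map that uses a zero "address" as the empty-slot sentinel instead of a separate bitset. This is a minimal Python sketch, not Spark's `BytesToBytesMap`; the names and structure are illustrative only:

```python
class SentinelMap:
    """Toy open-addressing hash map where address 0 means 'empty',
    so no bitset is needed to track slot occupancy (mirroring the
    observation that a page never hands out address zero)."""

    def __init__(self, capacity=64):
        self.addresses = [0] * capacity   # 0 == empty slot
        self.entries = [None] * capacity  # (key, value) pairs
        self.capacity = capacity

    def _probe(self, key):
        # Linear probing: stop at an empty slot or a matching key.
        i = hash(key) % self.capacity
        while self.addresses[i] != 0 and self.entries[i][0] != key:
            i = (i + 1) % self.capacity
        return i

    def put(self, key, value):
        i = self._probe(key)
        self.addresses[i] = 1  # any non-zero value marks the slot used
        self.entries[i] = (key, value)

    def get(self, key, default=None):
        i = self._probe(key)
        if self.addresses[i] == 0:
            return default  # probe hit an empty slot: key is absent
        return self.entries[i][1]
```

Note that both `put` and `get` must touch `entries` (the analogue of the longArray) whenever the slot is occupied, which is why the bitset only helps the empty-slot case.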

cc @rxin @nongli 

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/davies/spark remove_bitset

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/9452.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #9452


commit 6ea56eaf35762df5c04440a6005ccf3bf88dabde
Author: Davies Liu 
Date:   2015-11-03T22:49:38Z

remove bitset







[GitHub] spark pull request: [SPARK-11425] [SPARK-11486] Improve Hybrid agg...

2015-11-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9383#issuecomment-153527297
  
Merged build finished. Test FAILed.





[GitHub] spark pull request: [SPARK-11425] [SPARK-11486] Improve Hybrid agg...

2015-11-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9383#issuecomment-153527299
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44959/
Test FAILed.





[GitHub] spark pull request: [SPARK-11425] [SPARK-11486] Improve Hybrid agg...

2015-11-03 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/9383#issuecomment-153527227
  
**[Test build #44959 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44959/consoleFull)**
 for PR 9383 at commit 
[`6f3bb15`](https://github.com/apache/spark/commit/6f3bb15b19cd326f677f15860cf215f57fd3671a).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-11410] [SQL] Add APIs to provide functi...

2015-11-03 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/9364#discussion_r43823876
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala 
---
@@ -997,4 +1001,116 @@ class DataFrameSuite extends QueryTest with 
SharedSQLContext {
   }
 }
   }
+
+  /**
+   * Verifies that there is no Exchange between the Aggregations for `df`
+   */
+  private def verifyNonExchangingAgg(df: DataFrame) = {
+var atFirstAgg: Boolean = false
+df.queryExecution.executedPlan.foreach {
+  case agg: TungstenAggregate => {
+atFirstAgg = !atFirstAgg
+  }
+  case _ => {
+if (atFirstAgg) {
+  fail("Should not have operators between the two aggregations")
+}
+  }
+}
+  }
+
+  /**
+   * Verifies that there is an Exchange between the Aggregations for `df`
+   */
+  private def verifyExchangingAgg(df: DataFrame) = {
+var atFirstAgg: Boolean = false
+df.queryExecution.executedPlan.foreach {
+  case agg: TungstenAggregate => {
+if (atFirstAgg) {
+  fail("Should not have back to back Aggregates")
+}
+atFirstAgg = true
+  }
+  case e: Exchange => atFirstAgg = false
+  case _ =>
+}
+  }
+
+  test("distributeBy and localSort") {
+val original = testData.repartition(1)
+assert(original.rdd.partitions.length == 1)
+val df = original.distributeBy(Column("key") :: Nil, 5)
+assert(df.rdd.partitions.length  == 5)
+checkAnswer(original.select(), df.select())
+
+val df2 = original.distributeBy(Column("key") :: Nil, 10)
+assert(df2.rdd.partitions.length  == 10)
+checkAnswer(original.select(), df2.select())
+
+// Group by the column we are distributed by. This should generate a 
plan with no exchange
+// between the aggregates
+val df3 = testData.distributeBy(Column("key") :: 
Nil).groupBy("key").count()
+verifyNonExchangingAgg(df3)
+verifyNonExchangingAgg(testData.distributeBy(Column("key") :: 
Column("value") :: Nil)
+  .groupBy("key", "value").count())
+
+// Grouping by just the first distributeBy expr, need to exchange.
+verifyExchangingAgg(testData.distributeBy(Column("key") :: 
Column("value") :: Nil)
+  .groupBy("key").count())
+
+val data = sqlContext.sparkContext.parallelize(
+  (1 to 100).map(i => TestData2(i % 10, i))).toDF()
+
+// Distribute and order by.
+val df4 = data.distributeBy(Column("a") :: Nil).localSort($"b".desc)
+// Walk each partition and verify that it is sorted descending and 
does not contain all
+// the values.
+df4.rdd.foreachPartition(p => {
--- End diff --

for future reference, we usually do

```scala
df4.rdd.foreachPartition { p =>
  ...
}
```
rather than `(p => {`






[GitHub] spark pull request: [SPARK-8467][MLlib][PySpark] Add LDAModel.desc...

2015-11-03 Thread yu-iskw
Github user yu-iskw commented on a diff in the pull request:

https://github.com/apache/spark/pull/8643#discussion_r43823836
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/api/python/LDAModelWrapper.scala ---
@@ -0,0 +1,45 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.mllib.api.python
+
+import scala.collection.JavaConverters
+
+import org.apache.spark.SparkContext
+import org.apache.spark.mllib.clustering.LDAModel
+import org.apache.spark.mllib.linalg.Matrix
+
+/**
+ * Wrapper around LDAModel to provide helper methods in Python
+ */
+private[python] class LDAModelWrapper(model: LDAModel) {
+
+  def topicsMatrix(): Matrix = model.topicsMatrix
+
+  def vocabSize(): Int = model.vocabSize
+
+  def describeTopics(): java.util.List[Array[Any]] = 
describeTopics(this.model.vocabSize)
+
+  def describeTopics(maxTermsPerTopic: Int): java.util.List[Array[Any]] = {
+
+val seq = model.describeTopics(maxTermsPerTopic).map { case (terms, 
termWeights) =>
+Array.empty[Any] ++ terms ++ termWeights
+  }.toSeq
+JavaConverters.seqAsJavaListConverter(seq).asJava
--- End diff --

@davies could you give me a little help? I tried to serialize the entire list after converting it into `Array[Any](...)`. Then, when deserializing it in Python, pickle failed with `TypeError: must be a unicode character, not bytes`.

- My patch

https://github.com/yu-iskw/spark/compare/SPARK-8467-2...yu-iskw:SPARK-8467-2.trial
- unit testing errors
https://gist.github.com/yu-iskw/60e0db67b1e222fc7fd4





[GitHub] spark pull request: Fix typo in WebUI

2015-11-03 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/9444





[GitHub] spark pull request: Fix typo in WebUI

2015-11-03 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/9444#issuecomment-153522488
  
@jaceklaskowski I've merged this -- but might be better in the future to 
batch a bunch of typo fixes, since you are so great at finding them.






[GitHub] spark pull request: [SPARK-7316] [MLlib] RDD sliding window with s...

2015-11-03 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/5855#discussion_r43822533
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/rdd/SlidingRDD.scala 
---
@@ -66,36 +69,54 @@ class SlidingRDD[T: ClassTag](@transient val parent: 
RDD[T], val windowSize: Int
 if (n == 0) {
   Array.empty
 } else if (n == 1) {
-  Array(new SlidingRDDPartition[T](0, parentPartitions(0), Seq.empty))
+  Array(new SlidingRDDPartition[T](0, parentPartitions(0), Seq.empty, 
0))
 } else {
   val n1 = n - 1
-  val w1 = windowSize - 1
-  // Get the first w1 items of each partition, starting from the 
second partition.
-  val nextHeads =
-parent.context.runJob(parent, (iter: Iterator[T]) => 
iter.take(w1).toArray, 1 until n)
+  // Get partitions sizes
+  val sizes =
+parent.context.runJob(parent, (iter: Iterator[T]) => iter.length, 
0 until n)
--- End diff --

If `step > 1`, in a single Spark job, we can collect the following:

1. the first `w1` elements from each partition (except the first one)
2. the size of each partition

Then we can compute the offset for each partition.
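A plain-Python sketch of that approach (lists stand in for partitions; `sliding_windows` and its helpers are illustrative, not Spark API): collect the first `w1` elements and the size of each partition, turn the sizes into prefix-sum offsets, and let each partition emit the window starts that fall inside it.

```python
from itertools import accumulate

def sliding_windows(partitions, window, step):
    # In Spark, one runJob would collect both the heads and the sizes;
    # here we simply read the lists directly.
    w1 = window - 1
    sizes = [len(p) for p in partitions]
    offsets = [0] + list(accumulate(sizes))[:-1]  # start index of each partition
    next_heads = [p[:w1] for p in partitions[1:]] + [[]]
    total = sum(sizes)
    out = []
    for i, part in enumerate(partitions):
        data = part + next_heads[i]  # local data plus borrowed heads
        for local in range(len(part)):
            g = offsets[i] + local   # global index of this element
            if g % step == 0 and g + window <= total:
                out.append(data[local:local + window])
    return out
```

The offsets are exactly what lets each partition decide, independently, which global window starts it owns when `step > 1`.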







[GitHub] spark pull request: [SPARK-11490][SQL] variance should alias var_s...

2015-11-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9449#issuecomment-153521166
  
Merged build finished. Test FAILed.





[GitHub] spark pull request: [SPARK-11490][SQL] variance should alias var_s...

2015-11-03 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/9449#issuecomment-153521119
  
**[Test build #44968 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44968/consoleFull)**
 for PR 9449 at commit 
[`1f64c07`](https://github.com/apache/spark/commit/1f64c07b7303389cd21b47c51fecc7bb2e31f7e9).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-11490][SQL] variance should alias var_s...

2015-11-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9449#issuecomment-153521169
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44968/
Test FAILed.





[GitHub] spark pull request: [SPARK-7316] [MLlib] RDD sliding window with s...

2015-11-03 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/5855#discussion_r43822024
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/rdd/SlidingRDD.scala 
---
@@ -66,36 +69,54 @@ class SlidingRDD[T: ClassTag](@transient val parent: 
RDD[T], val windowSize: Int
 if (n == 0) {
   Array.empty
 } else if (n == 1) {
-  Array(new SlidingRDDPartition[T](0, parentPartitions(0), Seq.empty))
+  Array(new SlidingRDDPartition[T](0, parentPartitions(0), Seq.empty, 
0))
 } else {
   val n1 = n - 1
-  val w1 = windowSize - 1
-  // Get the first w1 items of each partition, starting from the 
second partition.
-  val nextHeads =
-parent.context.runJob(parent, (iter: Iterator[T]) => 
iter.take(w1).toArray, 1 until n)
+  // Get partition sizes
--- End diff --

Keep the original implementation if `step == 1` to save one job.





[GitHub] spark pull request: [SPARK-11477][SQL] support create Dataset from...

2015-11-03 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/9434





[GitHub] spark pull request: WIP: Optimize Inner joins with skewed null val...

2015-11-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9451#issuecomment-153519356
  
Can one of the admins verify this patch?





[GitHub] spark pull request: [SPARK-11477][SQL] support create Dataset from...

2015-11-03 Thread marmbrus
Github user marmbrus commented on the pull request:

https://github.com/apache/spark/pull/9434#issuecomment-153518772
  
Thanks, merging to master.





[GitHub] spark pull request: WIP: Optimize Inner joins with skewed null val...

2015-11-03 Thread vidma
GitHub user vidma opened a pull request:

https://github.com/apache/spark/pull/9451

WIP: Optimize Inner joins with skewed null values

Draft of a first step in optimizing skew in joins. Skew in data is quite 
common, and so are lots of `null`s on either side of a join (at least for us), 
especially with left joins, e.g. when joining `dimension` tables to `fact` 
tables.

Feel free to propose a better approach / add commits.

Any ideas for an easy way to check whether the rule was already applied? After 
adding an `isNotNull` filter, `someAttribute.nullable` still returns `true`. I 
couldn't come up with anything better than simply running a separate batch of 
one iteration.

@marmbrus  (as discussed at Spark Summit EU)

---

More seriously: a draft for fighting skew in left joins is [rather simple 
with DataFrames](https://gist.github.com/vidma/98332db0f82e7e5b09e5); it 
solves the null skew and doesn't seem to add much overhead (though it was 
tried only on a subset of our joins, through another abstraction of ours).

However, so far this seems harder to express in optimizer rules:
- we need to add "fake" columns, and I have no idea yet how to do this so that 
the added column can be referenced in join conditions
```scala
val leftNullsSprayValue = CaseWhen(
  Seq(
nullableJoinKeys(left).map(IsNull).reduceLeft(Or), // if any join 
keys are null
Cast(Multiply(new Rand(), Literal(10)), IntegerType),
Literal(0) // otherwise
  ))
// but how to add this column to left & right relations?
// e.g. this fails, saying it's not `resolved`
Alias(leftNullsSprayValue)("leftNullsSprayKey")()
```
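For reference, the core of the null-spray idea can be sketched outside Spark. Everything below (names, salt count, data) is illustrative plain Python, not the gist's or the PR's actual code:

```python
import random

# Null-spray sketch: rows whose join key is null get a random salt in
# [0, num_salts), so they spread across num_salts buckets instead of all
# hashing to one partition; rows with a non-null key keep salt 0, so real
# key matches are unaffected.
num_salts = 10
rng = random.Random(42)

def spray_key(key):
    return rng.randrange(num_salts) if key is None else 0

keys = [1, None, None, 2, None, 3]
salted = [(k, spray_key(k)) for k in keys]

non_null_salts = [s for k, s in salted if k is not None]
null_salts = [s for k, s in salted if k is None]
```

On the other side of the join the salt column would have to be exploded over all `num_salts` values for null rows, which is why this is easier to write by hand with DataFrames than as an optimizer rule.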


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/vidma/spark feature/fight-skew-in-inner-join

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/9451.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #9451


commit 214deeae2d4c634536df0d9bd6c2ffcfc573ce7b
Author: vidmantas zemleris 
Date:   2015-11-03T22:38:08Z

Optimize Inner joins with skewed null values







[GitHub] spark pull request: [SPARK-11491] Update build to use Scala 2.10.5

2015-11-03 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/9450#issuecomment-153518606
  
**[Test build #44969 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44969/consoleFull)**
 for PR 9450 at commit 
[`6ffe0b0`](https://github.com/apache/spark/commit/6ffe0b0540007a96a75d48a75303aad0b45fc9b0).





[GitHub] spark pull request: [SPARK-11491] Update build to use Scala 2.10.5

2015-11-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9450#issuecomment-153518453
  
Merged build started.





[GitHub] spark pull request: [SPARK-11491] Update build to use Scala 2.10.5

2015-11-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9450#issuecomment-153518428
  
 Merged build triggered.





[GitHub] spark pull request: [SPARK-11269][SQL] Java API support & test cas...

2015-11-03 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/9358#discussion_r43821203
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/Encoder.scala
 ---
@@ -37,3 +35,39 @@ trait Encoder[T] extends Serializable {
   /** A ClassTag that can be used to construct an Array to contain a 
collection of `T`. */
   def clsTag: ClassTag[T]
 }
+
+object Encoder {
+  import scala.reflect.runtime.universe._
+
+  def forBoolean: Encoder[java.lang.Boolean] = ExpressionEncoder(flat = 
true)
--- End diff --

@rxin @mateiz thoughts on naming here?

`forInt` `int` `INT`?

`forTuple` `tuple`, `tuple2`?





[GitHub] spark pull request: [SPARK-11491] Update build to use Scala 2.10.5

2015-11-03 Thread JoshRosen
Github user JoshRosen commented on the pull request:

https://github.com/apache/spark/pull/9450#issuecomment-153518050
  
/cc @srowen @pwendell 





[GitHub] spark pull request: [SPARK-11491] Update build to use Scala 2.10.5

2015-11-03 Thread JoshRosen
GitHub user JoshRosen opened a pull request:

https://github.com/apache/spark/pull/9450

[SPARK-11491] Update build to use Scala 2.10.5

Spark should build against Scala 2.10.5, since that includes a fix for 
Scaladoc that will fix doc snapshot publishing: 
https://issues.scala-lang.org/browse/SI-8479

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/JoshRosen/spark upgrade-to-scala-2.10.5

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/9450.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #9450


commit 6ffe0b0540007a96a75d48a75303aad0b45fc9b0
Author: Josh Rosen 
Date:   2015-11-03T23:08:57Z

Update build to use Scala 2.10.5.







[GitHub] spark pull request: [SPARK-11269][SQL] Java API support & test cas...

2015-11-03 Thread marmbrus
Github user marmbrus commented on the pull request:

https://github.com/apache/spark/pull/9358#issuecomment-153517822
  
It would be really great to also try to create a test suite that uses Java 8 
lambdas (though we may need to pull this into a separate PR, as I'm not sure 
how many build changes we will need).





[GitHub] spark pull request: [SPARK-11269][SQL] Java API support & test cas...

2015-11-03 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/9358#discussion_r43820879
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/Encoder.scala
 ---
@@ -37,3 +35,39 @@ trait Encoder[T] extends Serializable {
   /** A ClassTag that can be used to construct an Array to contain a 
collection of `T`. */
   def clsTag: ClassTag[T]
 }
+
+object Encoder {
+  import scala.reflect.runtime.universe._
+
+  def forBoolean: Encoder[java.lang.Boolean] = ExpressionEncoder(flat = 
true)
+  def forByte: Encoder[java.lang.Byte] = ExpressionEncoder(flat = true)
+  def forShort: Encoder[java.lang.Short] = ExpressionEncoder(flat = true)
+  def forInt: Encoder[java.lang.Integer] = ExpressionEncoder(flat = true)
+  def forLong: Encoder[java.lang.Long] = ExpressionEncoder(flat = true)
+  def forFloat: Encoder[java.lang.Float] = ExpressionEncoder(flat = true)
+  def forDouble: Encoder[java.lang.Double] = ExpressionEncoder(flat = true)
+  def forString: Encoder[java.lang.String] = ExpressionEncoder(flat = true)
+
+  def typeTagOfTuple2[T1 : TypeTag, T2 : TypeTag]: TypeTag[(T1, T2)] = 
typeTag[(T1, T2)]
--- End diff --

`private`





[GitHub] spark pull request: [SPARK-11269][SQL] Java API support & test cas...

2015-11-03 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/9358#discussion_r43820849
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/Encoder.scala
 ---
@@ -37,3 +35,39 @@ trait Encoder[T] extends Serializable {
   /** A ClassTag that can be used to construct an Array to contain a 
collection of `T`. */
   def clsTag: ClassTag[T]
 }
+
+object Encoder {
+  import scala.reflect.runtime.universe._
+
+  def forBoolean: Encoder[java.lang.Boolean] = ExpressionEncoder(flat = 
true)
+  def forByte: Encoder[java.lang.Byte] = ExpressionEncoder(flat = true)
+  def forShort: Encoder[java.lang.Short] = ExpressionEncoder(flat = true)
+  def forInt: Encoder[java.lang.Integer] = ExpressionEncoder(flat = true)
+  def forLong: Encoder[java.lang.Long] = ExpressionEncoder(flat = true)
+  def forFloat: Encoder[java.lang.Float] = ExpressionEncoder(flat = true)
+  def forDouble: Encoder[java.lang.Double] = ExpressionEncoder(flat = true)
+  def forString: Encoder[java.lang.String] = ExpressionEncoder(flat = true)
+
+  def typeTagOfTuple2[T1 : TypeTag, T2 : TypeTag]: TypeTag[(T1, T2)] = 
typeTag[(T1, T2)]
+
+  private def getTypeTag[T](c: Class[T]): TypeTag[T] = {
+import scala.reflect.api
+
+// val mirror = runtimeMirror(c.getClassLoader)
+val mirror = rootMirror
+val sym = mirror.staticClass(c.getName)
+val tpe = sym.selfType
+TypeTag(mirror, new api.TypeCreator {
+  def apply[U <: api.Universe with Singleton](m: api.Mirror[U]) =
+if (m eq mirror) tpe.asInstanceOf[U # Type]
+else throw new IllegalArgumentException(
+  s"Type tag defined in $mirror cannot be migrated to other 
mirrors.")
+})
+  }
+
+  def forTuple2[T1, T2](c1: Class[T1], c2: Class[T2]): Encoder[(T1, T2)] = 
{
--- End diff --

How about just `forTuple`? The type of tuple returned is obvious from the 
number of arguments. We should also add at least up to `Tuple5`.





[GitHub] spark pull request: [SPARK-11269][SQL] Java API support & test cas...

2015-11-03 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/9358#discussion_r43820524
  
--- Diff: 
sql/core/src/test/java/test/org/apache/spark/sql/JavaDatasetSuite.java ---
@@ -0,0 +1,111 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package test.org.apache.spark.sql;
+
+import java.io.Serializable;
+import java.util.Arrays;
+import java.util.Iterator;
+import java.util.LinkedList;
+import java.util.List;
+
+import org.apache.spark.api.java.function.FlatMapFunction;
+import org.apache.spark.sql.catalyst.encoders.Encoder;
+import scala.Tuple2;
+
+import org.apache.spark.api.java.function.Function;
+import org.junit.*;
+
+import org.apache.spark.SparkContext;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.sql.catalyst.encoders.Encoder$;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.test.TestSQLContext;
+
+public class JavaDatasetSuite implements Serializable {
+  private transient JavaSparkContext jsc;
+  private transient TestSQLContext context;
+
+  @Before
+  public void setUp() {
+// Trigger static initializer of TestData
+SparkContext sc = new SparkContext("local[*]", "testing");
+jsc = new JavaSparkContext(sc);
+context = new TestSQLContext(sc);
+context.loadTestData();
+  }
+
+  @After
+  public void tearDown() {
+context.sparkContext().stop();
+context = null;
+jsc = null;
+  }
+
+  @Test
+  public void testCommonOperation() {
+List data = Arrays.asList("hello", "world");
+Dataset ds = context.createDataset(data, 
Encoder$.MODULE$.forString());
--- End diff --

Yeah, scala should create static methods so that you can just call 
`Encoder.forString()`





[GitHub] spark pull request: [SPARK-11269][SQL] Java API support & test cas...

2015-11-03 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/9358#discussion_r43820397
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -441,6 +537,17 @@ class Dataset[T] private(
   /** Collects the elements to an Array. */
   def collect(): Array[T] = rdd.collect()
 
+  /**
+   * (Java-specific)
+   * Collects the elements to a Java list.
+   *
+   * Due to the incompatibility problem between Scala and Java, the return 
type of [[collect()]] at
--- End diff --

This just means that the RDD has the wrong classtag.  We need to find a way 
to pass the classtag from the encoder before calling collect.
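As a loose analogy (plain Python with hypothetical names — not the Spark API), "passing the classtag from the encoder" amounts to letting the encoder carry the element type that `collect` uses to build a properly typed array instead of a generic container:

```python
from array import array

# FakeEncoder stands in for an encoder that carries its element type
# (the role a Scala ClassTag plays); collect_typed uses that type to
# build a correctly typed array.
class FakeEncoder:
    def __init__(self, typecode):
        self.typecode = typecode  # e.g. "i" for int, "d" for double

def collect_typed(values, enc):
    return array(enc.typecode, values)

out = collect_typed([1, 2, 3], FakeEncoder("i"))
print(out.typecode, list(out))  # i [1, 2, 3]
```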





[GitHub] spark pull request: [SPARK-11269][SQL] Java API support & test cas...

2015-11-03 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/9358#discussion_r43820335
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala ---
@@ -499,6 +499,10 @@ class SQLContext private[sql](
 new Dataset[T](this, plan)
   }
 
+  def createDataset[T : Encoder](data: java.util.List[T]): Dataset[T] = {
--- End diff --

Oh, you already did it :)





[GitHub] spark pull request: [SPARK-11477][SQL] support create Dataset from...

2015-11-03 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/9434#discussion_r43820137
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala ---
@@ -499,6 +499,15 @@ class SQLContext private[sql](
 new Dataset[T](this, plan)
   }
 
+  def createDataset[T : Encoder](data: RDD[T]): Dataset[T] = {
+val enc = encoderFor[T]
+val attributes = enc.schema.toAttributes
+val encoded = data.map(d => enc.toRow(d))
--- End diff --

No, its the responsibility of downstream operators to copy as needed.





[GitHub] spark pull request: [SPARK-11269][SQL] Java API support & test cas...

2015-11-03 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/9358#discussion_r43820035
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala ---
@@ -499,6 +499,10 @@ class SQLContext private[sql](
 new Dataset[T](this, plan)
   }
 
+  def createDataset[T : Encoder](data: java.util.List[T]): Dataset[T] = {
--- End diff --

Yeah, lets do that in another PR please.





[GitHub] spark pull request: [SPARK-11490][SQL] variance should alias var_s...

2015-11-03 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/9449#issuecomment-153515606
  
**[Test build #44968 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44968/consoleFull)**
 for PR 9449 at commit 
[`1f64c07`](https://github.com/apache/spark/commit/1f64c07b7303389cd21b47c51fecc7bb2e31f7e9).





[GitHub] spark pull request: [SPARK-11490][SQL] variance should alias var_s...

2015-11-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9449#issuecomment-153515293
  
 Merged build triggered.





[GitHub] spark pull request: [SPARK-11490][SQL] variance should alias var_s...

2015-11-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9449#issuecomment-153515313
  
Merged build started.





[GitHub] spark pull request: [SPARK-11490][SQL] variance should alias var_s...

2015-11-03 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/9449#issuecomment-153514954
  
Note that this also includes changes from #9446. Look at the last commit 
for diff.







[GitHub] spark pull request: [SPARK-11490][SQL] variance should alias var_s...

2015-11-03 Thread rxin
GitHub user rxin opened a pull request:

https://github.com/apache/spark/pull/9449

[SPARK-11490][SQL] variance should alias var_samp instead of var_pop.
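For context, the semantic difference behind the alias change, illustrated in plain Python rather than Spark code: `var_pop` divides the sum of squared deviations by n, while `var_samp` divides by n - 1 (Bessel's correction), matching the SQL-standard `VAR_POP` / `VAR_SAMP` functions.

```python
# Plain-Python sketch of the two variance estimators the PR distinguishes.
def var_pop(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)        # divide by n

def var_samp(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)  # divide by n - 1

xs = [1.0, 2.0, 3.0, 4.0]
print(var_pop(xs))   # 1.25
print(var_samp(xs))  # ~1.6667
```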



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rxin/spark SPARK-11490

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/9449.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #9449


commit 856942f9d0677d381dfe18d8777fa6f0a2e858c8
Author: Reynold Xin 
Date:   2015-11-03T21:53:18Z

[SPARK-11489][SQL] Only include common first order statistics in GroupedData

commit 132ea9a1769bbaf1d0e0c662d26a947ef34dd73f
Author: Reynold Xin 
Date:   2015-11-03T22:05:06Z

Fix test

commit 1f64c07b7303389cd21b47c51fecc7bb2e31f7e9
Author: Reynold Xin 
Date:   2015-11-03T22:54:23Z

[SPARK-11490][SQL] variance should alias var_samp instead of var_pop.







[GitHub] spark pull request: SPARK-4921. TaskSetManager.dequeueTask returns...

2015-11-03 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3816#issuecomment-153514540
  
**[Test build #44967 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44967/consoleFull)**
 for PR 3816 at commit 
[`247ce55`](https://github.com/apache/spark/commit/247ce5587b8a06fe586a2730f1ad2df4ab7f79dc).





[GitHub] spark pull request: SPARK-4921. TaskSetManager.dequeueTask returns...

2015-11-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3816#issuecomment-153513337
  
Merged build started.





[GitHub] spark pull request: SPARK-4921. TaskSetManager.dequeueTask returns...

2015-11-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3816#issuecomment-153513322
  
 Merged build triggered.





[GitHub] spark pull request: [SPARK-10429] [SQL] make mutableProjection ato...

2015-11-03 Thread yhuai
Github user yhuai commented on the pull request:

https://github.com/apache/spark/pull/9422#issuecomment-153513069
  
Ah, sorry, I just saw this message. Since the changed behavior is easier to 
reason about, let's keep it.





[GitHub] spark pull request: SPARK-4921. TaskSetManager.dequeueTask returns...

2015-11-03 Thread zsxwing
Github user zsxwing commented on the pull request:

https://github.com/apache/spark/pull/3816#issuecomment-153513026
  
I think this is the right fix. However, more pairs of eyes on this change 
would be better.





[GitHub] spark pull request: [Spark-11478] [ML] ML StringIndexer return inc...

2015-11-03 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/9440#issuecomment-153512622
  
cc @yanboliang 





[GitHub] spark pull request: SPARK-4921. TaskSetManager.dequeueTask returns...

2015-11-03 Thread zsxwing
Github user zsxwing commented on the pull request:

https://github.com/apache/spark/pull/3816#issuecomment-153512508
  
retest this please





[GitHub] spark pull request: [SPARK-11378][STREAMING] make StreamingContext...

2015-11-03 Thread zsxwing
Github user zsxwing commented on the pull request:

https://github.com/apache/spark/pull/9336#issuecomment-153512190
  
LGTM





[GitHub] spark pull request: [SPARK-11433] [SQL] Cleanup the subquery name ...

2015-11-03 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/9385#discussion_r43817697
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
 ---
@@ -1019,7 +1019,16 @@ class Analyzer(
  * scoping information for attributes and can be removed once analysis is 
complete.
  */
 object EliminateSubQueries extends Rule[LogicalPlan] {
-  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
+  def apply(plan: LogicalPlan): LogicalPlan = plan transformDown {
--- End diff --

Thank you for your comments! If we do `transformUp`, the subquery will be 
removed first. Then `Project(projectList, child: Subquery)` will not be 
applicable in this case. 
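The traversal-order point above can be sketched with a hypothetical mini plan tree (not Catalyst itself; `Plan`, `Leaf`, and the two transforms below are stand-ins for illustration). With bottom-up rewriting, the inner `Subquery` is removed before its parent is visited, so a rule case matching `Project(Subquery(...))` never fires:

```scala
// Toy plan tree and top-down vs bottom-up transforms (illustrative only).
sealed trait Plan
case object Leaf extends Plan
case class Subquery(child: Plan) extends Plan
case class Project(child: Plan) extends Plan

def mapChildren(p: Plan)(f: Plan => Plan): Plan = p match {
  case Subquery(c) => Subquery(f(c))
  case Project(c)  => Project(f(c))
  case Leaf        => Leaf
}

def transformDown(p: Plan)(rule: PartialFunction[Plan, Plan]): Plan = {
  val applied = rule.applyOrElse(p, (x: Plan) => x)    // rule first (top-down)
  mapChildren(applied)(transformDown(_)(rule))
}

def transformUp(p: Plan)(rule: PartialFunction[Plan, Plan]): Plan = {
  val rewritten = mapChildren(p)(transformUp(_)(rule)) // children first (bottom-up)
  rule.applyOrElse(rewritten, (x: Plan) => x)
}

// A rule with a special case for a Subquery directly under a Project.
val rule: PartialFunction[Plan, Plan] = {
  case Project(Subquery(c)) => c   // only applicable while the Subquery still exists
  case Subquery(c)          => c
}

val plan = Project(Subquery(Leaf))
// transformDown(plan)(rule) == Leaf          (the Project case fires)
// transformUp(plan)(rule)   == Project(Leaf) (the Subquery is gone before
//                                             the parent is inspected)
```

The two results differ, which is exactly why `transformDown` is needed when a rule must observe a `Subquery` in its parent's position.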





[GitHub] spark pull request: [SPARK-11359][STREAMING][KINESIS] Checkpoint t...

2015-11-03 Thread zsxwing
Github user zsxwing commented on a diff in the pull request:

https://github.com/apache/spark/pull/9421#discussion_r43817494
  
--- Diff: 
extras/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/KinesisCheckpointState.scala
 ---
@@ -16,39 +16,77 @@
  */
 package org.apache.spark.streaming.kinesis
 
+import java.util.concurrent._
+
+import scala.util.control.NonFatal
+
+import 
com.amazonaws.services.kinesis.clientlibrary.interfaces.IRecordProcessorCheckpointer
+import com.amazonaws.services.kinesis.clientlibrary.types.ShutdownReason
+
 import org.apache.spark.Logging
 import org.apache.spark.streaming.Duration
-import org.apache.spark.util.{Clock, ManualClock, SystemClock}
 
 /**
- * This is a helper class for managing checkpoint clocks.
+ * This is a helper class for managing Kinesis checkpointing.
  *
- * @param checkpointInterval
- * @param currentClock.  Default to current SystemClock if none is passed 
in (mocking purposes)
+ * @param receiver The receiver that keeps track of which sequence numbers 
we can checkpoint
+ * @param checkpointInterval How frequently we will checkpoint to DynamoDB
+ * @param workerId Worker Id of KCL worker for logging purposes
+ * @param shardId The shard this worker was consuming data from
  */
-private[kinesis] class KinesisCheckpointState(
+private[kinesis] class KinesisCheckpointState[T](
+receiver: KinesisReceiver[T],
 checkpointInterval: Duration,
-currentClock: Clock = new SystemClock())
-  extends Logging {
+workerId: String,
+shardId: String) extends Logging {
 
-  /* Initialize the checkpoint clock using the given currentClock + 
checkpointInterval millis */
-  val checkpointClock = new ManualClock()
-  checkpointClock.setTime(currentClock.getTimeMillis() + 
checkpointInterval.milliseconds)
+  private var _checkpointer: Option[IRecordProcessorCheckpointer] = None
--- End diff --

This should be `volatile`
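A minimal sketch of the concern (the field type here is a `String` stand-in for `Option[IRecordProcessorCheckpointer]`, and the class is hypothetical, not the Spark code): the KCL record-processor thread calls the setter while the background checkpointer thread reads the field, so it needs `@volatile` to guarantee the reader sees the latest write:

```scala
// Cross-thread visibility sketch: one thread writes, another reads.
class CheckpointerHolder {
  @volatile private var checkpointer: Option[String] = None

  /** Called from the record-processor thread. */
  def setCheckpointer(cp: String): Unit = { checkpointer = Option(cp) }

  /** Read from the background checkpointer thread. */
  def latest: Option[String] = checkpointer
}

val holder = new CheckpointerHolder
holder.setCheckpointer("seq-100")
// holder.latest == Some("seq-100")
```

Without `@volatile`, the JVM memory model permits the reading thread to keep observing a stale value indefinitely.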





[GitHub] spark pull request: [SPARK-8658] [SQL] AttributeReference's equals...

2015-11-03 Thread gatorsmile
Github user gatorsmile commented on the pull request:

https://github.com/apache/spark/pull/9216#issuecomment-153511090
  
@JoshRosen @cloud-fan 
I submitted a pull request for SPARK-11275: 
https://github.com/apache/spark/pull/9419 
Hopefully, once that issue is fixed, this one can pass all the tests. 
Thanks! 






[GitHub] spark pull request: [SPARK-11359][STREAMING][KINESIS] Checkpoint t...

2015-11-03 Thread zsxwing
Github user zsxwing commented on a diff in the pull request:

https://github.com/apache/spark/pull/9421#discussion_r43817473
  
--- Diff: 
extras/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/KinesisCheckpointState.scala
 ---
@@ -16,39 +16,77 @@
  */
 package org.apache.spark.streaming.kinesis
 
+import java.util.concurrent._
+
+import scala.util.control.NonFatal
+
+import 
com.amazonaws.services.kinesis.clientlibrary.interfaces.IRecordProcessorCheckpointer
+import com.amazonaws.services.kinesis.clientlibrary.types.ShutdownReason
+
 import org.apache.spark.Logging
 import org.apache.spark.streaming.Duration
-import org.apache.spark.util.{Clock, ManualClock, SystemClock}
 
 /**
- * This is a helper class for managing checkpoint clocks.
+ * This is a helper class for managing Kinesis checkpointing.
  *
- * @param checkpointInterval
- * @param currentClock.  Default to current SystemClock if none is passed 
in (mocking purposes)
+ * @param receiver The receiver that keeps track of which sequence numbers 
we can checkpoint
+ * @param checkpointInterval How frequently we will checkpoint to DynamoDB
+ * @param workerId Worker Id of KCL worker for logging purposes
+ * @param shardId The shard this worker was consuming data from
  */
-private[kinesis] class KinesisCheckpointState(
+private[kinesis] class KinesisCheckpointState[T](
+receiver: KinesisReceiver[T],
 checkpointInterval: Duration,
-currentClock: Clock = new SystemClock())
-  extends Logging {
+workerId: String,
+shardId: String) extends Logging {
 
-  /* Initialize the checkpoint clock using the given currentClock + 
checkpointInterval millis */
-  val checkpointClock = new ManualClock()
-  checkpointClock.setTime(currentClock.getTimeMillis() + 
checkpointInterval.milliseconds)
+  private var _checkpointer: Option[IRecordProcessorCheckpointer] = None
 
-  /**
-   * Check if it's time to checkpoint based on the current time and the 
derived time
-   *   for the next checkpoint
-   *
-   * @return true if it's time to checkpoint
-   */
-  def shouldCheckpoint(): Boolean = {
-new SystemClock().getTimeMillis() > checkpointClock.getTimeMillis()
+  private val checkpointerThread = startCheckpointerThread()
+
+  /** Update the checkpointer instance to the most recent one. */
+  def setCheckpointer(checkpointer: IRecordProcessorCheckpointer): Unit = {
+_checkpointer = Option(checkpointer)
+  }
+
+  /** Perform the checkpoint */
+  private def checkpoint(checkpointer: 
Option[IRecordProcessorCheckpointer]): Unit = {
+// if this method throws an exception, then the scheduled task will 
not run again
+try {
+  checkpointer.foreach { cp =>
--- End diff --

What will happen if we use an old `IRecordProcessorCheckpointer` to 
checkpoint?





[GitHub] spark pull request: [SPARK-11389][CORE] Add support for off-heap m...

2015-11-03 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/9344#issuecomment-153509965
  
**[Test build #44965 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44965/consoleFull)**
 for PR 9344 at commit 
[`96705b8`](https://github.com/apache/spark/commit/96705b881caa82b87874d5145c16ce232c4a64a4).





[GitHub] spark pull request: [SPARK-11198][STREAMING][KINESIS] Support de-a...

2015-11-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9403#issuecomment-153509530
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44956/
Test FAILed.





[GitHub] spark pull request: [SPARK-11198][STREAMING][KINESIS] Support de-a...

2015-11-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9403#issuecomment-153509526
  
Merged build finished. Test FAILed.




