[GitHub] spark issue #19293: [SPARK-22079][SQL] Serializer in HiveOutputWriter miss l...

2017-09-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19293
  
Merged build finished. Test FAILed.





[GitHub] spark issue #19293: [SPARK-22079][SQL] Serializer in HiveOutputWriter miss l...

2017-09-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19293
  
**[Test build #82126 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82126/testReport)**
 for PR 19293 at commit 
[`45477fb`](https://github.com/apache/spark/commit/45477fbf00558066e3733a34e1d59ce22c192ee2).
 * This patch **fails due to an unknown error code, -9**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #19293: [SPARK-22079][SQL] Serializer in HiveOutputWriter miss l...

2017-09-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19293
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82126/
Test FAILed.





[GitHub] spark issue #19294: [SPARK-21549][CORE] Respect OutputFormats with no output...

2017-09-24 Thread mridulm
Github user mridulm commented on the issue:

https://github.com/apache/spark/pull/19294
  
@szhem You are correct, currently it fails in the driver itself.
So failures in the executor are not seen - since job submission fails.

With this PR, the job submission should succeed - but the subsequent 
execution in SQL could fail (since SQL uses some of the methods which have not 
been patched in this PR, if I am not wrong - newTaskTempFileAbsPath, 
newTaskTempFile, etc.).

A testcase to validate that would clarify things.
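
For reference, a minimal sketch (with made-up directory and extension values) of the two `FileCommitProtocol` calls the SQL write path goes through:

```scala
import org.apache.hadoop.mapreduce.TaskAttemptContext
import org.apache.spark.internal.io.FileCommitProtocol

// Sketch only: the relative and absolute-path variants a committer has to handle.
def stageTaskFiles(committer: FileCommitProtocol, ctx: TaskAttemptContext): (String, String) = {
  // Regular output file, staged under the job's output directory.
  val relativePath = committer.newTaskTempFile(ctx, Some("part=0"), ".parquet")
  // Output file for a partition that has a custom absolute location.
  val absolutePath = committer.newTaskTempFileAbsPath(ctx, "/custom/location", ".parquet")
  (relativePath, absolutePath)
}
```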






[GitHub] spark issue #19332: [SPARK-22093][TESTS] Fixes `assume` in `UtilsSuite` and ...

2017-09-24 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/19332
  
Merged to master.

Thank you @srowen and @dongjoon-hyun.





[GitHub] spark pull request #19332: [SPARK-22093][TESTS] Fixes `assume` in `UtilsSuit...

2017-09-24 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/19332





[GitHub] spark issue #19326: [SPARK-22107] Change as to alias in python quickstart

2017-09-24 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/19326
  
That's fine. I believe we don't usually need a JIRA for a trivial change 
though. Would you mind double checking if there are similar instances in the 
PySpark doc?

Also, it'd be great if we add `Closes #19283` in the PR description so that 
`#19283` can be automatically closed when this PR is merged. It looks like `#19283` 
has gone inactive.





[GitHub] spark pull request #19277: [SPARK-22058][CORE]the BufferedInputStream will n...

2017-09-24 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/19277





[GitHub] spark issue #19307: [SPARK-22087][SPARK-14650][WIP][BUILD][REPL][CORE] Compi...

2017-09-24 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/19307
  
Merged to master





[GitHub] spark pull request #19307: [SPARK-22087][SPARK-14650][WIP][BUILD][REPL][CORE...

2017-09-24 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/19307





[GitHub] spark issue #19184: [SPARK-21971][CORE] Too many open files in Spark due to ...

2017-09-24 Thread mridulm
Github user mridulm commented on the issue:

https://github.com/apache/spark/pull/19184
  
@viirya @jerryshao  To take a step back here.

This specific issue is applicable to window operations and not to shuffle.

In shuffle, you have a much larger volume of data written per file, versus 4k records 
per file for a window operation.

To get to 9k files with shuffle, you are typically processing a TB or more of 
data per shuffle task (unless the executor is starved for memory and spilled a large 
number of files).

On the other hand, with the 4k window size (the default in Spark), getting to 9k 
files is possible within a single task.

From what I see, there is actually no functional/performance reason to keep 
all the files open, unlike in shuffle.
Having said that, there is an additional cost we pay with this approach: 
with N files, we incur an extra `N * cost(open + read + close)`.
Practically though, the impact is much lower than with the current code 
(since re-open + read will be disk-read friendly).
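
As a rough sketch of the access pattern being described (hypothetical helper, not the PR's code), only one spill file is open at any time, at the price of an open/read/close cycle per file:

```scala
import java.io.{BufferedInputStream, DataInputStream, File, FileInputStream}

// Hypothetical sketch: visit N spill files sequentially, paying open + read + close
// per file instead of holding N descriptors open for the whole window operation.
def readSpillsSequentially(spillFiles: Seq[File])(readOne: DataInputStream => Unit): Unit = {
  spillFiles.foreach { file =>
    val in = new DataInputStream(new BufferedInputStream(new FileInputStream(file)))
    try {
      readOne(in)   // consume this spill's records
    } finally {
      in.close()    // release the file descriptor before moving to the next file
    }
  }
}
```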

While getting it fixed for all cases would be ideal, the solution for 
window operation does not transfer to shuffle (and vice versa) due to the 
difference in the nature of how files are used in both.

In case I missed something here, please let me know.





[GitHub] spark issue #18747: [WIP][SPARK-20822][SQL] Generate code to directly get va...

2017-09-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18747
  
**[Test build #82127 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82127/testReport)**
 for PR 18747 at commit 
[`c476e87`](https://github.com/apache/spark/commit/c476e87c5216db8e0dbaa2d0382b1cc1d44cab07).





[GitHub] spark issue #19290: [WIP][SPARK-22063][R] Upgrades lintr to latest commit sh...

2017-09-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19290
  
**[Test build #82128 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82128/testReport)**
 for PR 19290 at commit 
[`387228d`](https://github.com/apache/spark/commit/387228d9fd5290df4a99026f9446821bce60e779).





[GitHub] spark pull request #19290: [WIP][SPARK-22063][R] Upgrades lintr to latest co...

2017-09-24 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/19290#discussion_r140651399
  
--- Diff: R/pkg/R/column.R ---
@@ -238,8 +238,10 @@ setMethod("between", signature(x = "Column"),
 #' @param x a Column.
 #' @param dataType a character object describing the target data type.
 #'See
+# nolint start
--- End diff --

I just double checked the links.





[GitHub] spark pull request #19290: [WIP][SPARK-22063][R] Upgrades lintr to latest co...

2017-09-24 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/19290#discussion_r140651387
  
--- Diff: dev/lint-r.R ---
@@ -24,10 +24,16 @@ if (! library(SparkR, lib.loc = LOCAL_LIB_LOC, 
logical.return = TRUE)) {
   stop("You should install SparkR in a local directory with 
`R/install-dev.sh`.")
 }
 
-# Installs lintr from Github in a local directory.
-# NOTE: The CRAN's version is too old to adapt to our rules.
-if ("lintr" %in% row.names(installed.packages())  == FALSE) {
-  devtools::install_github("jimhester/lintr@a769c0b")
+# Installs lintr from Github in a local directory if lintr is not 
installed already or
+# the existing lintr is not the specified version.
+#
+# Note that, the CRAN's version is too old to adapt to our rules. 
Therefore, we try to
+# install lintr from the latest commit in Github, rather than a specific 
tag or release.
+# The current latest is jimhester/lintr@5431140 (see SPARK-22063), the dev 
version,
+# '1.0.1.9000'.
+if ("lintr" %in% row.names(installed.packages()) == FALSE ||
+packageVersion("lintr") != "1.0.1.9000") {
--- End diff --

I checked this in three cases - downgrading, upgrading, and not installed.





[GitHub] spark issue #19290: [SPARK-22063][R] Upgrades lintr to latest commit sha1 ID

2017-09-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19290
  
**[Test build #82129 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82129/testReport)**
 for PR 19290 at commit 
[`ee7eb9d`](https://github.com/apache/spark/commit/ee7eb9dd925cf3892ba02ee5c116e3fe52274ebe).





[GitHub] spark issue #19290: [SPARK-22063][R] Upgrades lintr to latest commit sha1 ID

2017-09-24 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/19290
  
Ugh.. it failed to install due to a permission issue ... 

```
Downloading GitHub repo jimhester/lintr@5431140
from URL https://api.github.com/repos/jimhester/lintr/zipball/5431140
Installing lintr
Downloading GitHub repo hadley/xml2@master
from URL https://api.github.com/repos/hadley/xml2/zipball/master
Installing xml2
'/usr/lib64/R/bin/R' --no-site-file --no-environ --no-save --no-restore  \
  --quiet CMD INSTALL  \
  '/tmp/RtmprFnaBu/devtools438c516b7717/r-lib-xml2-5799cd9'  \
  --library='/usr/lib64/R/library' --install-tests 

Error: ERROR: no permission to install to directory '/usr/lib64/R/library'
Installation failed: Command failed (1)
'/usr/lib64/R/bin/R' --no-site-file --no-environ --no-save --no-restore  \
  --quiet CMD INSTALL  \
  '/tmp/RtmprFnaBu/devtools438c286c3792/jimhester-lintr-5431140'  \
  --library='/usr/lib64/R/library' --install-tests 

Error: ERROR: no permission to install to directory '/usr/lib64/R/library'
Installation failed: Command failed (1)
```





[GitHub] spark pull request #19311: [SPARK-22083][CORE] Release locks in MemoryStore....

2017-09-24 Thread mridulm
Github user mridulm commented on a diff in the pull request:

https://github.com/apache/spark/pull/19311#discussion_r140651513
  
--- Diff: 
core/src/test/scala/org/apache/spark/storage/MemoryStoreSuite.scala ---
@@ -407,4 +407,119 @@ class MemoryStoreSuite
 })
 assert(memoryStore.getSize(blockId) === 1)
   }
+
+  test("SPARK-22083: Release all locks in evictBlocksToFreeSpace") {
+// Setup a memory store with many blocks cached, and then one request 
which leads to multiple
+// blocks getting evicted.  We'll make the eviction throw an 
exception, and make sure that
+// all locks are released.
+val ct = implicitly[ClassTag[Array[Byte]]]
+def testFailureOnNthDrop(failAfterDroppingNBlocks: Int, 
readLockAfterDrop: Boolean): Unit = {
--- End diff --

Nit: failAfterDroppingNBlocks -> numValidBlocks, readLockAfterDrop -> 
validBlock ?






[GitHub] spark pull request #19311: [SPARK-22083][CORE] Release locks in MemoryStore....

2017-09-24 Thread mridulm
Github user mridulm commented on a diff in the pull request:

https://github.com/apache/spark/pull/19311#discussion_r140651741
  
--- Diff: 
core/src/test/scala/org/apache/spark/storage/MemoryStoreSuite.scala ---
@@ -407,4 +407,119 @@ class MemoryStoreSuite
 })
 assert(memoryStore.getSize(blockId) === 1)
   }
+
+  test("SPARK-22083: Release all locks in evictBlocksToFreeSpace") {
+// Setup a memory store with many blocks cached, and then one request 
which leads to multiple
+// blocks getting evicted.  We'll make the eviction throw an 
exception, and make sure that
+// all locks are released.
+val ct = implicitly[ClassTag[Array[Byte]]]
+def testFailureOnNthDrop(failAfterDroppingNBlocks: Int, 
readLockAfterDrop: Boolean): Unit = {
+  val tc = TaskContext.empty()
+  val memManager = new StaticMemoryManager(conf, Long.MaxValue, 100, 
numCores = 1)
+  val blockInfoManager = new BlockInfoManager
+  blockInfoManager.registerTask(tc.taskAttemptId)
+  var droppedSoFar = 0
+  val blockEvictionHandler = new BlockEvictionHandler {
+var memoryStore: MemoryStore = _
+
+override private[storage] def dropFromMemory[T: ClassTag](
+blockId: BlockId,
+data: () => Either[Array[T], ChunkedByteBuffer]): StorageLevel 
= {
+  if (droppedSoFar < failAfterDroppingNBlocks) {
+droppedSoFar += 1
+memoryStore.remove(blockId)
+if (readLockAfterDrop) {
+  // for testing purposes, we act like another thread gets the 
read lock on the new
+  // block
+  StorageLevel.DISK_ONLY
+} else {
+  StorageLevel.NONE
+}
+  } else {
+throw new RuntimeException(s"Mock error dropping block 
$droppedSoFar")
+  }
+}
+  }
+  val memoryStore = new MemoryStore(conf, blockInfoManager, 
serializerManager, memManager,
+  blockEvictionHandler) {
+override def afterDropAction(blockId: BlockId): Unit = {
+  if (readLockAfterDrop) {
+// pretend that we get a read lock on the block (now on disk) 
in another thread
+TaskContext.setTaskContext(tc)
+blockInfoManager.lockForReading(blockId)
+TaskContext.unset()
+  }
+}
+  }
+
+  blockEvictionHandler.memoryStore = memoryStore
+  memManager.setMemoryStore(memoryStore)
+
+  // Put in some small blocks to fill up the memory store
+  val initialBlocks = (1 to 10).map { id =>
+val blockId = BlockId(s"rdd_1_$id")
+val blockInfo = new BlockInfo(StorageLevel.MEMORY_ONLY, ct, 
tellMaster = false)
+val initialWriteLock = 
blockInfoManager.lockNewBlockForWriting(blockId, blockInfo)
+assert(initialWriteLock)
+val success = memoryStore.putBytes(blockId, 10, 
MemoryMode.ON_HEAP, () => {
+  new ChunkedByteBuffer(ByteBuffer.allocate(10))
+})
+assert(success)
+blockInfoManager.unlock(blockId, None)
+  }
+  assert(blockInfoManager.size === 10)
+
+
+  // Add one big block, which will require evicting everything in the 
memorystore.  However our
+  // mock BlockEvictionHandler will throw an exception -- make sure 
all locks are cleared.
+  val largeBlockId = BlockId(s"rdd_2_1")
+  val largeBlockInfo = new BlockInfo(StorageLevel.MEMORY_ONLY, ct, 
tellMaster = false)
+  val initialWriteLock = 
blockInfoManager.lockNewBlockForWriting(largeBlockId, largeBlockInfo)
+  assert(initialWriteLock)
+  if (failAfterDroppingNBlocks < 10) {
+val exc = intercept[RuntimeException] {
+  memoryStore.putBytes(largeBlockId, 100, MemoryMode.ON_HEAP, () 
=> {
+new ChunkedByteBuffer(ByteBuffer.allocate(100))
+  })
+}
+assert(exc.getMessage().startsWith("Mock error dropping block"), 
exc)
+// BlockManager.doPut takes care of releasing the lock for the 
newly written block -- not
+// testing that here, so do it manually
+blockInfoManager.removeBlock(largeBlockId)
+  } else {
+memoryStore.putBytes(largeBlockId, 100, MemoryMode.ON_HEAP, () => {
+  new ChunkedByteBuffer(ByteBuffer.allocate(100))
+})
+// BlockManager.doPut takes care of releasing the lock for the 
newly written block -- not
+// testing that here, so do it manually
+blockInfoManager.unlock(largeBlockId)
+  }
+
+  val largeBlockInMemory = if (failAfterDroppingNBlocks == 10) 1 else 0
+  val expBlocks = 10 +
+(if (re

[GitHub] spark pull request #19311: [SPARK-22083][CORE] Release locks in MemoryStore....

2017-09-24 Thread mridulm
Github user mridulm commented on a diff in the pull request:

https://github.com/apache/spark/pull/19311#discussion_r140651608
  
--- Diff: 
core/src/test/scala/org/apache/spark/storage/MemoryStoreSuite.scala ---
@@ -407,4 +407,119 @@ class MemoryStoreSuite
 })
 assert(memoryStore.getSize(blockId) === 1)
   }
+
+  test("SPARK-22083: Release all locks in evictBlocksToFreeSpace") {
+// Setup a memory store with many blocks cached, and then one request 
which leads to multiple
+// blocks getting evicted.  We'll make the eviction throw an 
exception, and make sure that
+// all locks are released.
+val ct = implicitly[ClassTag[Array[Byte]]]
+def testFailureOnNthDrop(failAfterDroppingNBlocks: Int, 
readLockAfterDrop: Boolean): Unit = {
+  val tc = TaskContext.empty()
+  val memManager = new StaticMemoryManager(conf, Long.MaxValue, 100, 
numCores = 1)
+  val blockInfoManager = new BlockInfoManager
+  blockInfoManager.registerTask(tc.taskAttemptId)
+  var droppedSoFar = 0
+  val blockEvictionHandler = new BlockEvictionHandler {
+var memoryStore: MemoryStore = _
+
+override private[storage] def dropFromMemory[T: ClassTag](
+blockId: BlockId,
+data: () => Either[Array[T], ChunkedByteBuffer]): StorageLevel 
= {
+  if (droppedSoFar < failAfterDroppingNBlocks) {
+droppedSoFar += 1
+memoryStore.remove(blockId)
+if (readLockAfterDrop) {
+  // for testing purposes, we act like another thread gets the 
read lock on the new
+  // block
+  StorageLevel.DISK_ONLY
+} else {
+  StorageLevel.NONE
+}
+  } else {
+throw new RuntimeException(s"Mock error dropping block 
$droppedSoFar")
+  }
+}
+  }
+  val memoryStore = new MemoryStore(conf, blockInfoManager, 
serializerManager, memManager,
+  blockEvictionHandler) {
+override def afterDropAction(blockId: BlockId): Unit = {
+  if (readLockAfterDrop) {
+// pretend that we get a read lock on the block (now on disk) 
in another thread
+TaskContext.setTaskContext(tc)
+blockInfoManager.lockForReading(blockId)
+TaskContext.unset()
+  }
+}
+  }
+
+  blockEvictionHandler.memoryStore = memoryStore
+  memManager.setMemoryStore(memoryStore)
+
+  // Put in some small blocks to fill up the memory store
+  val initialBlocks = (1 to 10).map { id =>
--- End diff --

To piggyback on @vanzin's comment, `sizePerBlock` also, please (so that the 
hard-coded 100 goes away)? Thx





[GitHub] spark issue #19290: [SPARK-22063][R] Upgrades lintr to latest commit sha1 ID

2017-09-24 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/19290
  
Hi @shaneknapp. I am sorry but it's me again ...

Here, this PR tries to upgrade an R package, 
[lintr](https://github.com/jimhester/lintr), used for static code analysis of R, 
which is run via `./dev/lint-r` -> `dev/lint-r.R` as part of the Jenkins PR build.

This upgrade catches many meaningful instances that should have been caught 
before in the SparkR code and, if I understood correctly, we decided to upgrade 
it if it does not cause a problem.

I tried to fix the `dev/lint-r.R` script so that `lintr` is upgraded if 
the package is not installed or is not the specific version (1.0.1.9000). I tested 
this locally, but it looks like it failed in Jenkins due to a permission issue.

I believe the command below installs the package correctly:

```R
Rscript -e "devtools::install_github('jimhester/lintr@5431140')"
```

Maybe I should have cc'ed you here first before trying this one. I 
apologise and I will cc you in such cases in the future.





[GitHub] spark pull request #19294: [SPARK-21549][CORE] Respect OutputFormats with no...

2017-09-24 Thread szhem
Github user szhem commented on a diff in the pull request:

https://github.com/apache/spark/pull/19294#discussion_r140652214
  
--- Diff: 
core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala
 ---
@@ -130,17 +135,21 @@ class HadoopMapReduceCommitProtocol(jobId: String, 
path: String)
 val filesToMove = taskCommits.map(_.obj.asInstanceOf[Map[String, 
String]])
   .foldLeft(Map[String, String]())(_ ++ _)
 logDebug(s"Committing files staged for absolute locations 
$filesToMove")
-val fs = absPathStagingDir.getFileSystem(jobContext.getConfiguration)
-for ((src, dst) <- filesToMove) {
-  fs.rename(new Path(src), new Path(dst))
+if (hasAbsPathFiles) {
+  val fs = absPathStagingDir.getFileSystem(jobContext.getConfiguration)
+  for ((src, dst) <- filesToMove) {
+fs.rename(new Path(src), new Path(dst))
+  }
+  fs.delete(absPathStagingDir, true)
 }
-fs.delete(absPathStagingDir, true)
--- End diff --

Wouldn't it be better to fix it in a separate PR?





[GitHub] spark issue #19295: [SPARK-22080][SQL] Adds support for allowing user to add...

2017-09-24 Thread sathiyapk
Github user sathiyapk commented on the issue:

https://github.com/apache/spark/pull/19295
  
@gatorsmile thanks for your comments. Here are my thoughts; thanks for 
correcting me if I'm wrong. (Sorry for the big comment though :))
1. This PR doesn't change any existing API; it adds a new one.
2. In the usual cases, for people who don't use `ExperimentalMethods`, it doesn't affect or break anything.
3. For people who use `ExperimentalMethods`, irrespective of whether it is a pre-optimizer or post-optimizer rule, it will break anyway if they do it wrong. 
4. One of the advantages of this PR's `sparkSession.experimental.extraPreOptimizations` is that the user-provided rule can get further optimized by the native rules of Spark, which is not possible with `sparkSession.experimental.extraOptimizations` (see the sketch after this list). I'm writing a blog post about this with an example; I will post the link soon.
5. Last but not least, one of the main intentions of the Spark Catalyst optimizer, as mentioned in its SIGMOD paper, is its simplicity in defining new optimization rules and plugging them into the query optimizer at runtime, so we should consider not limiting it even if it only concerns a rare case.
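
A minimal sketch of how the two hooks would be used, assuming the `extraPreOptimizations` field proposed in this PR (the no-op rule and local session are only for illustration):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// A do-nothing rule standing in for a real user-defined optimization.
object MyRule extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan
}

val spark = SparkSession.builder().master("local[*]").appName("rules").getOrCreate()

// Existing hook: user rules run after the built-in optimizer batches.
spark.experimental.extraOptimizations = Seq(MyRule)

// Hook proposed in this PR: user rules would run before the built-in batches,
// so their output can still be optimized further by Spark's native rules.
spark.experimental.extraPreOptimizations = Seq(MyRule)
```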





[GitHub] spark pull request #19295: [SPARK-22080][SQL] Adds support for allowing user...

2017-09-24 Thread sathiyapk
Github user sathiyapk commented on a diff in the pull request:

https://github.com/apache/spark/pull/19295#discussion_r140652720
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/ExperimentalMethods.scala ---
@@ -44,11 +44,14 @@ class ExperimentalMethods private[sql]() {
*/
   @volatile var extraStrategies: Seq[Strategy] = Nil
 
+  @volatile var extraPreOptimizations: Seq[Rule[LogicalPlan]] = Nil
+
   @volatile var extraOptimizations: Seq[Rule[LogicalPlan]] = Nil
--- End diff --

Yes, I agree with @gatorsmile: renaming `extraOptimizations` to 
`extraPostOptimizations` would be symmetric with `extraPreOptimizations`, but 
doing so may affect existing API calls.





[GitHub] spark issue #19324: [SPARK-22103] Move HashAggregateExec parent consume to a...

2017-09-24 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/19324
  
@juliuszsompolski Thanks for pinging me.

#18931 is an attempt to split out the consume function as much as 
possible. With a long chain of operators, you can end up with a consume 
function that is too long to be JIT-compiled; that is one reason it tries to split 
into functions at the root of codegen support, instead of in individual operators. 
I'd avoid duplicating the splitting logic in all operators, IMO.

For explicitly delaying evaluation of the projection, the current strategy 
I take is not to split it. I guess you mean the evaluation that can 
be delayed by the compiler; I personally think it should not have a big impact 
under the whole-stage codegen framework. The reason is that those evaluations are 
actually needed and can't be avoided in most (if not all) cases. From the 
benchmark we can see there is no negative impact even in the cases where no 
long consume function exists.

Yeah, I think the simplification from using `continue` is a good thing, if 
it is possible.





[GitHub] spark issue #18747: [WIP][SPARK-20822][SQL] Generate code to directly get va...

2017-09-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18747
  
**[Test build #82127 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82127/testReport)**
 for PR 18747 at commit 
[`c476e87`](https://github.com/apache/spark/commit/c476e87c5216db8e0dbaa2d0382b1cc1d44cab07).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #18747: [WIP][SPARK-20822][SQL] Generate code to directly get va...

2017-09-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18747
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82127/
Test PASSed.





[GitHub] spark issue #18747: [WIP][SPARK-20822][SQL] Generate code to directly get va...

2017-09-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18747
  
Merged build finished. Test PASSed.





[GitHub] spark issue #19317: [SPARK-22098][CORE] Add new method aggregateByKeyLocally...

2017-09-24 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/19317
  
@ConeyLiu Yes, tree aggregation introduces an extra shuffle, but it can 
improve perf when the total amount of data the driver collects from executors is 
large and there are many partitions.
For now, though, I think we can keep this consistent with `reduceByKeyLocally`. 
This is a possible optimization that can be done in the future.
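
For context, a minimal sketch of the flat-versus-tree trade-off mentioned above (the local-mode session and toy data are only for illustration):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("agg-demo").getOrCreate()
val rdd = spark.sparkContext.parallelize(1 to 1000000, numSlices = 200)

// Flat aggregate: all 200 partition results are merged directly on the driver.
val flatSum = rdd.aggregate(0L)((acc, v) => acc + v, _ + _)

// Tree aggregate: extra intermediate combine stages (more shuffle work), but the
// driver merges far fewer, pre-combined results - which pays off when the partial
// results are large and there are many partitions.
val treeSum = rdd.treeAggregate(0L)((acc, v) => acc + v, _ + _, depth = 2)
```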





[GitHub] spark issue #19294: [SPARK-21549][CORE] Respect OutputFormats with no output...

2017-09-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19294
  
**[Test build #82130 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82130/testReport)**
 for PR 19294 at commit 
[`ae0ba0a`](https://github.com/apache/spark/commit/ae0ba0a49454e67497aff4b2dd59e3986e5c6a03).





[GitHub] spark pull request #19294: [SPARK-21549][CORE] Respect OutputFormats with no...

2017-09-24 Thread szhem
Github user szhem commented on a diff in the pull request:

https://github.com/apache/spark/pull/19294#discussion_r140654204
  
--- Diff: 
core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala
 ---
@@ -57,6 +57,11 @@ class HadoopMapReduceCommitProtocol(jobId: String, path: 
String)
*/
   private def absPathStagingDir: Path = new Path(path, "_temporary-" + 
jobId)
 
+  /**
+   * Checks whether there are files to be committed to an absolute output 
location.
+   */
+  private def hasAbsPathFiles: Boolean = addedAbsPathFiles != null && 
addedAbsPathFiles.nonEmpty
--- End diff --

Good catch, thank you!
According to `FileCommitProtocol`, `addedAbsPathFiles` is always null 
on the driver, so we will not be able to commit or remove these files.

Replaced it with

 private def hasAbsPathFiles: Boolean = path != null






[GitHub] spark issue #19295: [SPARK-22080][SQL] Adds support for allowing user to add...

2017-09-24 Thread sathiyapk
Github user sathiyapk commented on the issue:

https://github.com/apache/spark/pull/19295
  
I pushed a new commit that addresses @wzhfy's review comments.





[GitHub] spark issue #19290: [SPARK-22063][R] Upgrades lintr to latest commit sha1 ID

2017-09-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19290
  
**[Test build #82128 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82128/testReport)**
 for PR 19290 at commit 
[`387228d`](https://github.com/apache/spark/commit/387228d9fd5290df4a99026f9446821bce60e779).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #19290: [SPARK-22063][R] Upgrades lintr to latest commit sha1 ID

2017-09-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19290
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82128/
Test PASSed.





[GitHub] spark issue #19290: [SPARK-22063][R] Upgrades lintr to latest commit sha1 ID

2017-09-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19290
  
Merged build finished. Test PASSed.





[GitHub] spark issue #19290: [SPARK-22063][R] Upgrades lintr to latest commit sha1 ID

2017-09-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19290
  
**[Test build #82129 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82129/testReport)**
 for PR 19290 at commit 
[`ee7eb9d`](https://github.com/apache/spark/commit/ee7eb9dd925cf3892ba02ee5c116e3fe52274ebe).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #19290: [SPARK-22063][R] Upgrades lintr to latest commit sha1 ID

2017-09-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19290
  
Merged build finished. Test PASSed.





[GitHub] spark issue #19290: [SPARK-22063][R] Upgrades lintr to latest commit sha1 ID

2017-09-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19290
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82129/
Test PASSed.





[GitHub] spark issue #19294: [SPARK-21549][CORE] Respect OutputFormats with no output...

2017-09-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19294
  
**[Test build #82131 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82131/testReport)**
 for PR 19294 at commit 
[`3429de5`](https://github.com/apache/spark/commit/3429de551b96ab60fa6c9b211a2c534ce71df398).





[GitHub] spark issue #19294: [SPARK-21549][CORE] Respect OutputFormats with no output...

2017-09-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19294
  
**[Test build #82131 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82131/testReport)**
 for PR 19294 at commit 
[`3429de5`](https://github.com/apache/spark/commit/3429de551b96ab60fa6c9b211a2c534ce71df398).
 * This patch **fails Scala style tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #19294: [SPARK-21549][CORE] Respect OutputFormats with no output...

2017-09-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19294
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82131/
Test FAILed.





[GitHub] spark issue #19294: [SPARK-21549][CORE] Respect OutputFormats with no output...

2017-09-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19294
  
Merged build finished. Test FAILed.





[GitHub] spark issue #19294: [SPARK-21549][CORE] Respect OutputFormats with no output...

2017-09-24 Thread szhem
Github user szhem commented on the issue:

https://github.com/apache/spark/pull/19294
  
@mridulm Updated `FileFormatWriterSuite` [to 
cover](https://github.com/apache/spark/pull/19294/files#diff-bc98a3d91cf4f95f4f473146400044aa)
 both branches of the [committer 
calling](https://github.com/apache/spark/blob/3f958a99921d149fb9fdf7ba7e78957afdad1405/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala#L498)
 - for `newTaskTempFile` as well as for `newTaskTempFileAbsPath`.





[GitHub] spark issue #19317: [SPARK-22098][CORE] Add new method aggregateByKeyLocally...

2017-09-24 Thread ConeyLiu
Github user ConeyLiu commented on the issue:

https://github.com/apache/spark/pull/19317
  
OK, just keep it. Does this need more tests or further improvements?





[GitHub] spark issue #19222: [SPARK-10399][CORE][SQL] Introduce multiple MemoryBlocks...

2017-09-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19222
  
**[Test build #82132 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82132/testReport)**
 for PR 19222 at commit 
[`7ec26f6`](https://github.com/apache/spark/commit/7ec26f678ecb08779c69ac10155d180ec8e58864).





[GitHub] spark issue #19317: [SPARK-22098][CORE] Add new method aggregateByKeyLocally...

2017-09-24 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/19317
  
It would be better to add more perf tests for the `OpenHashSet` replacement, to avoid 
a perf regression. I also found that `reduceByKeyLocally` uses `JHashSet`; I am not 
sure whether there is some special reason for that. Ping @cloud-fan, can you help confirm 
this? I cannot find the original author for that.





[GitHub] spark issue #19294: [SPARK-21549][CORE] Respect OutputFormats with no output...

2017-09-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19294
  
**[Test build #82133 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82133/testReport)**
 for PR 19294 at commit 
[`7963b58`](https://github.com/apache/spark/commit/7963b58e1a3e6b1687268e4b663ce4d5ffd821be).





[GitHub] spark issue #19294: [SPARK-21549][CORE] Respect OutputFormats with no output...

2017-09-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19294
  
**[Test build #82134 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82134/testReport)**
 for PR 19294 at commit 
[`1e42e83`](https://github.com/apache/spark/commit/1e42e83f62c9eb7819662b6363d77e9eefb916ad).





[GitHub] spark issue #19229: [SPARK-22001][ML][SQL] ImputerModel can do withColumn fo...

2017-09-24 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/19229
  
@WeichenXu123 Do you have any more comments on this? Thanks. I think the ML part 
is straightforward.





[GitHub] spark issue #19294: [SPARK-21549][CORE] Respect OutputFormats with no output...

2017-09-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19294
  
**[Test build #82135 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82135/testReport)**
 for PR 19294 at commit 
[`0cf4724`](https://github.com/apache/spark/commit/0cf47248b84916092d85743d679969c26b0be54b).





[GitHub] spark issue #19294: [SPARK-21549][CORE] Respect OutputFormats with no output...

2017-09-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19294
  
**[Test build #82130 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82130/testReport)**
 for PR 19294 at commit 
[`ae0ba0a`](https://github.com/apache/spark/commit/ae0ba0a49454e67497aff4b2dd59e3986e5c6a03).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #19294: [SPARK-21549][CORE] Respect OutputFormats with no output...

2017-09-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19294
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82130/
Test PASSed.





[GitHub] spark issue #19294: [SPARK-21549][CORE] Respect OutputFormats with no output...

2017-09-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19294
  
Merged build finished. Test PASSed.





[GitHub] spark pull request #19294: [SPARK-21549][CORE] Respect OutputFormats with no...

2017-09-24 Thread steveloughran
Github user steveloughran commented on a diff in the pull request:

https://github.com/apache/spark/pull/19294#discussion_r140658582
  
--- Diff: 
core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala
 ---
@@ -130,17 +135,21 @@ class HadoopMapReduceCommitProtocol(jobId: String, 
path: String)
 val filesToMove = taskCommits.map(_.obj.asInstanceOf[Map[String, 
String]])
   .foldLeft(Map[String, String]())(_ ++ _)
 logDebug(s"Committing files staged for absolute locations 
$filesToMove")
-val fs = absPathStagingDir.getFileSystem(jobContext.getConfiguration)
-for ((src, dst) <- filesToMove) {
-  fs.rename(new Path(src), new Path(dst))
+if (hasAbsPathFiles) {
+  val fs = absPathStagingDir.getFileSystem(jobContext.getConfiguration)
+  for ((src, dst) <- filesToMove) {
+fs.rename(new Path(src), new Path(dst))
+  }
+  fs.delete(absPathStagingDir, true)
 }
-fs.delete(absPathStagingDir, true)
--- End diff --

Can do. Now that you've got a little mock committer in, someone can just extend 
it to optionally throw an IOE in abort(). 
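
A hypothetical sketch in that direction (not the PR's actual test helper): a do-nothing Hadoop committer that can be told to fail in `abortJob()`, which is the error path under discussion.

```scala
import java.io.IOException
import org.apache.hadoop.mapreduce.{JobContext, JobStatus, OutputCommitter, TaskAttemptContext}

// Sketch only: every hook is a no-op except abortJob, which can simulate a failure.
class FailingAbortCommitter(failOnAbort: Boolean) extends OutputCommitter {
  override def setupJob(jobContext: JobContext): Unit = {}
  override def setupTask(taskContext: TaskAttemptContext): Unit = {}
  override def needsTaskCommit(taskContext: TaskAttemptContext): Boolean = false
  override def commitTask(taskContext: TaskAttemptContext): Unit = {}
  override def abortTask(taskContext: TaskAttemptContext): Unit = {}
  override def abortJob(jobContext: JobContext, state: JobStatus.State): Unit = {
    if (failOnAbort) throw new IOException("simulated abort() failure")
  }
}
```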





[GitHub] spark issue #19294: [SPARK-21549][CORE] Respect OutputFormats with no output...

2017-09-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19294
  
**[Test build #82136 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82136/testReport)**
 for PR 19294 at commit 
[`7c58abb`](https://github.com/apache/spark/commit/7c58abb5d01a0ff30e001eb87051b3aa091ae805).





[GitHub] spark issue #19294: [SPARK-21549][CORE] Respect OutputFormats with no output...

2017-09-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19294
  
**[Test build #82133 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82133/testReport)**
 for PR 19294 at commit 
[`7963b58`](https://github.com/apache/spark/commit/7963b58e1a3e6b1687268e4b663ce4d5ffd821be).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #19294: [SPARK-21549][CORE] Respect OutputFormats with no output...

2017-09-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19294
  
Merged build finished. Test FAILed.





[GitHub] spark issue #19294: [SPARK-21549][CORE] Respect OutputFormats with no output...

2017-09-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19294
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82133/
Test FAILed.





[GitHub] spark issue #19294: [SPARK-21549][CORE] Respect OutputFormats with no output...

2017-09-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19294
  
**[Test build #82137 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82137/testReport)**
 for PR 19294 at commit 
[`11445b4`](https://github.com/apache/spark/commit/11445b4a0741257b5ac546513b288f98d4d6cdac).





[GitHub] spark issue #19294: [SPARK-21549][CORE] Respect OutputFormats with no output...

2017-09-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19294
  
**[Test build #82134 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82134/testReport)**
 for PR 19294 at commit 
[`1e42e83`](https://github.com/apache/spark/commit/1e42e83f62c9eb7819662b6363d77e9eefb916ad).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #19294: [SPARK-21549][CORE] Respect OutputFormats with no output...

2017-09-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19294
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82134/
Test FAILed.





[GitHub] spark issue #19294: [SPARK-21549][CORE] Respect OutputFormats with no output...

2017-09-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19294
  
Merged build finished. Test FAILed.





[GitHub] spark issue #19294: [SPARK-21549][CORE] Respect OutputFormats with no output...

2017-09-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19294
  
**[Test build #82135 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82135/testReport)**
 for PR 19294 at commit 
[`0cf4724`](https://github.com/apache/spark/commit/0cf47248b84916092d85743d679969c26b0be54b).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #19222: [SPARK-10399][CORE][SQL] Introduce multiple MemoryBlocks...

2017-09-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19222
  
**[Test build #82132 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82132/testReport)**
 for PR 19222 at commit 
[`7ec26f6`](https://github.com/apache/spark/commit/7ec26f678ecb08779c69ac10155d180ec8e58864).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `public class MemoryBlockSuite `





[GitHub] spark issue #19222: [SPARK-10399][CORE][SQL] Introduce multiple MemoryBlocks...

2017-09-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19222
  
Merged build finished. Test PASSed.





[GitHub] spark issue #19294: [SPARK-21549][CORE] Respect OutputFormats with no output...

2017-09-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19294
  
Merged build finished. Test FAILed.





[GitHub] spark issue #19294: [SPARK-21549][CORE] Respect OutputFormats with no output...

2017-09-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19294
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82135/
Test FAILed.





[GitHub] spark issue #19222: [SPARK-10399][CORE][SQL] Introduce multiple MemoryBlocks...

2017-09-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19222
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82132/
Test PASSed.





[GitHub] spark issue #19294: [SPARK-21549][CORE] Respect OutputFormats with no output...

2017-09-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19294
  
**[Test build #82136 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82136/testReport)**
 for PR 19294 at commit 
[`7c58abb`](https://github.com/apache/spark/commit/7c58abb5d01a0ff30e001eb87051b3aa091ae805).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #19294: [SPARK-21549][CORE] Respect OutputFormats with no output...

2017-09-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19294
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82136/
Test FAILed.





[GitHub] spark issue #19294: [SPARK-21549][CORE] Respect OutputFormats with no output...

2017-09-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19294
  
Merged build finished. Test FAILed.





[GitHub] spark issue #19326: [SPARK-22107] Change as to alias in python quickstart

2017-09-24 Thread jgoleary
Github user jgoleary commented on the issue:

https://github.com/apache/spark/pull/19326
  
Updated description. The only other mentions of `as()` I can find in the 
docs are in Java examples, and the method appears to exist on the Java side.





[GitHub] spark issue #19222: [SPARK-10399][CORE][SQL] Introduce multiple MemoryBlocks...

2017-09-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19222
  
**[Test build #82138 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82138/testReport)**
 for PR 19222 at commit 
[`0714ddc`](https://github.com/apache/spark/commit/0714ddcab6d83a489e791536775630e75e8fe5c6).





[GitHub] spark issue #19222: [SPARK-10399][CORE][SQL] Introduce multiple MemoryBlocks...

2017-09-24 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/19222
  
@hvanhovell @rednaxelafx
After running a benchmark program, I took a polymorphic approach (i.e. each 
subclass has its own `getInt()`/`putInt()` methods) and got better performance than 
with the monomorphic approach (i.e. only the `MemoryBlock` class has `final` 
`getInt()`/`putInt()` methods).
**The root cause of the better performance is passing a concrete type as the 
first argument of `Platform.getInt()/putInt()` instead of making a virtual call.**

I ran [this benchmark 
program](https://gist.github.com/kiszk/94f75b506c93a663bbbc372ffe8f05de) against 
[this 
commit](https://github.com/apache/spark/commit/0714ddcab6d83a489e791536775630e75e8fe5c6) 
and got the following results:

```
OpenJDK 64-Bit Server VM 1.8.0_121-8u121-b13-0ubuntu1.16.04.2-b13 on Linux 4.4.0-22-generic
Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz

Memory access benchmarks:      Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
IntArrayMemoryBlock                  423 /  445        634.1           1.6       1.0X
ByteArrayMemoryBlock                 433 /  443        620.3           1.6       1.0X
Platform                             431 /  436        622.7           1.6       1.0X
Platform Object                     1004 / 1055        267.4           3.7       0.4X
Platform copyMemory                   45 /   48       5903.9           0.2       9.3X
Platform copyMemory Object            45 /   47       6004.0           0.2       9.5X
```

This result shows three facts:
1. According to the first three results, having `getInt()`/`putInt()` in subclasses of `MemoryBlock` achieves performance comparable to the current implementation (`Platform` in the table).
2. According to the third and fourth results, even if we use `Platform.getInt()/putInt()`, we get more than 2x worse performance (`Platform Object` in the table) when we pass `Object` as the first argument instead of a concrete type (i.e. `byte[]`). For example, `byte[] b; Platform.getInt(b, 0);` achieves better performance than `Object o; Platform.getInt(o, 0);`.
3. According to the fifth and sixth results, for `Platform.copyMemory()`, passing `Object` achieves the same performance as passing `byte[]`.

Based on fact 2, I used the polymorphic approach so that each subclass of `MemoryBlock` passes its concrete type. As a result, we achieve the same performance wherever the current Spark code already uses a concrete type as the first argument of `Platform.getInt()/putInt()`. Where the current Spark code uses `Object` (e.g. 
[here](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeArrayData.java#L61)), 
this PR can achieve better performance.
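
For clarity, a minimal sketch (hypothetical class names, not the PR's exact hierarchy) of the monomorphic versus polymorphic dispatch being compared:

```scala
import org.apache.spark.unsafe.Platform

// Monomorphic: the base class performs the access, so Platform only ever sees
// Object as the static type of its first argument.
abstract class MonoBlock(obj: AnyRef, offset: Long) {
  final def getInt(i: Long): Int = Platform.getInt(obj, offset + i)
}

// Polymorphic: each subclass passes its concrete backing array, so the unsafe
// access sees byte[] (and likewise int[], long[], ...) as the argument type.
abstract class PolyBlock {
  def getInt(i: Long): Int
}

class ByteArrayBlock(bytes: Array[Byte]) extends PolyBlock {
  override def getInt(i: Long): Int =
    Platform.getInt(bytes, Platform.BYTE_ARRAY_OFFSET + i)
}
```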

Probably, @rednaxelafx can explain this very well :)






[GitHub] spark pull request #19324: [SPARK-22103] Move HashAggregateExec parent consu...

2017-09-24 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/19324#discussion_r140664499
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastHashJoinExec.scala
 ---
@@ -328,10 +325,11 @@ case class BroadcastHashJoinExec(
  |  UnsafeRow $matched = $matches != null && $matches.hasNext() ?
  |(UnsafeRow) $matches.next() : null;
  |  ${checkCondition.trim}
--- End diff --

?





[GitHub] spark pull request #19324: [SPARK-22103] Move HashAggregateExec parent consu...

2017-09-24 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/19324#discussion_r140664550
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastHashJoinExec.scala
 ---
@@ -328,10 +325,11 @@ case class BroadcastHashJoinExec(
  |  UnsafeRow $matched = $matches != null && $matches.hasNext() ?
  |(UnsafeRow) $matches.next() : null;
  |  ${checkCondition.trim}
--- End diff --

nvm. This is for outer join. The same name but different value.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19324: [SPARK-22103] Move HashAggregateExec parent consu...

2017-09-24 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/19324#discussion_r140664581
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastHashJoinExec.scala
 ---
@@ -186,8 +186,7 @@ case class BroadcastHashJoinExec(
*/
   private def getJoinCondition(
   ctx: CodegenContext,
-  input: Seq[ExprCode],
-  anti: Boolean = false): (String, String, Seq[ExprCode]) = {
--- End diff --

I like this change.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19324: [SPARK-22103] Move HashAggregateExec parent consume to a...

2017-09-24 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/19324
  
LGTM


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19294: [SPARK-21549][CORE] Respect OutputFormats with no output...

2017-09-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19294
  
**[Test build #82137 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82137/testReport)**
 for PR 19294 at commit 
[`11445b4`](https://github.com/apache/spark/commit/11445b4a0741257b5ac546513b288f98d4d6cdac).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19294: [SPARK-21549][CORE] Respect OutputFormats with no output...

2017-09-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19294
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19294: [SPARK-21549][CORE] Respect OutputFormats with no output...

2017-09-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19294
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82137/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19326: [SPARK-22107] Change as to alias in python quickstart

2017-09-24 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/19326
  
ok to test


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19326: [SPARK-22107] Change as to alias in python quickstart

2017-09-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19326
  
**[Test build #82139 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82139/testReport)**
 for PR 19326 at commit 
[`c6bf156`](https://github.com/apache/spark/commit/c6bf156116e347e07a92cf75b07443cb327ce4ab).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19326: [SPARK-22107] Change as to alias in python quickstart

2017-09-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19326
  
**[Test build #82139 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82139/testReport)**
 for PR 19326 at commit 
[`c6bf156`](https://github.com/apache/spark/commit/c6bf156116e347e07a92cf75b07443cb327ce4ab).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19326: [SPARK-22107] Change as to alias in python quickstart

2017-09-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19326
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19326: [SPARK-22107] Change as to alias in python quickstart

2017-09-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19326
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82139/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19222: [SPARK-10399][CORE][SQL] Introduce multiple MemoryBlocks...

2017-09-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19222
  
**[Test build #82138 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82138/testReport)**
 for PR 19222 at commit 
[`0714ddc`](https://github.com/apache/spark/commit/0714ddcab6d83a489e791536775630e75e8fe5c6).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `public final class ByteArrayMemoryBlock extends MemoryBlock `
  * `public final class IntArrayMemoryBlock extends MemoryBlock `
  * `public final class LongArrayMemoryBlock extends MemoryBlock `
  * `public class OffHeapMemoryBlock extends MemoryBlock `


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19222: [SPARK-10399][CORE][SQL] Introduce multiple MemoryBlocks...

2017-09-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19222
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82138/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19222: [SPARK-10399][CORE][SQL] Introduce multiple MemoryBlocks...

2017-09-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19222
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16548: [SPARK-19158][SPARKR][EXAMPLES] Fix ml.R example fails d...

2017-09-24 Thread holdenk
Github user holdenk commented on the issue:

https://github.com/apache/spark/pull/16548
  
So there is something similar in the fulltests for R, 
`./R/pkg/tests/fulltests/test_mllib.R` (found while working on packaging).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19326: [SPARK-22107] Change as to alias in python quickstart

2017-09-24 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/19326
  
Merged to master and branch-2.2.

@jgoleary, I merged this considering that it is your first contribution, but let's do 
such changes in a batch if possible in the future.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19326: [SPARK-22107] Change as to alias in python quicks...

2017-09-24 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/19326


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19283: Update quickstart python dataset example

2017-09-24 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/19283


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19290: [SPARK-22063][R] Upgrades lintr to latest commit sha1 ID

2017-09-24 Thread shaneknapp
Github user shaneknapp commented on the issue:

https://github.com/apache/spark/pull/19290
  
@HyukjinKwon -- you will absolutely not have builds install packages on the 
build system. This is a really bad idea.

Is this absolutely required, or just to fix a warning in the build output?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19290: [SPARK-22063][R] Upgrades lintr to latest commit sha1 ID

2017-09-24 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/19290
  
@shaneknapp Sure, it was my bad. I will be careful next time.

It is required to fix an actual issue, so that R code that does 
not follow the project's R style is detected.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19290: [SPARK-22063][R] Upgrades lintr to latest commit sha1 ID

2017-09-24 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/19290
  
Do you maybe have some concerns about this? If the concern is quite crucial, 
I think we could also consider another option: not upgrading this, leaving the 
`lint-r.R` script as it was, and only fixing the instances detected by the upgrade 
here.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19229: [SPARK-22001][ML][SQL] ImputerModel can do withCo...

2017-09-24 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19229#discussion_r140675967
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Imputer.scala ---
@@ -223,20 +223,18 @@ class ImputerModel private[ml] (
 
   override def transform(dataset: Dataset[_]): DataFrame = {
 transformSchema(dataset.schema, logging = true)
-var outputDF = dataset
 val surrogates = surrogateDF.select($(inputCols).map(col): 
_*).head().toSeq
 
-$(inputCols).zip($(outputCols)).zip(surrogates).foreach {
+val newCols = $(inputCols).zip($(outputCols)).zip(surrogates).map {
   case ((inputCol, outputCol), surrogate) =>
 val inputType = dataset.schema(inputCol).dataType
 val ic = col(inputCol)
-outputDF = outputDF.withColumn(outputCol,
-  when(ic.isNull, surrogate)
+when(ic.isNull, surrogate)
--- End diff --

style: indent


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19321: [SPARK-22100] [SQL] Make percentile_approx support numer...

2017-09-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19321
  
**[Test build #82140 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82140/testReport)**
 for PR 19321 at commit 
[`1d26f50`](https://github.com/apache/spark/commit/1d26f501c6e9870969f5312b2a91b05c7e97cef3).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19229: [SPARK-22001][ML][SQL] ImputerModel can do withCo...

2017-09-24 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/19229#discussion_r140678574
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Imputer.scala ---
@@ -223,20 +223,18 @@ class ImputerModel private[ml] (
 
   override def transform(dataset: Dataset[_]): DataFrame = {
 transformSchema(dataset.schema, logging = true)
-var outputDF = dataset
 val surrogates = surrogateDF.select($(inputCols).map(col): 
_*).head().toSeq
 
-$(inputCols).zip($(outputCols)).zip(surrogates).foreach {
+val newCols = $(inputCols).zip($(outputCols)).zip(surrogates).map {
   case ((inputCol, outputCol), surrogate) =>
 val inputType = dataset.schema(inputCol).dataType
 val ic = col(inputCol)
-outputDF = outputDF.withColumn(outputCol,
-  when(ic.isNull, surrogate)
+when(ic.isNull, surrogate)
--- End diff --

This `when` is not a call on the previous line's expression. I think it doesn't need to 
be indented?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19229: [SPARK-22001][ML][SQL] ImputerModel can do withCo...

2017-09-24 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/19229#discussion_r140678630
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -2102,6 +2102,55 @@ class Dataset[T] private[sql](
   }
 
   /**
+   * Returns a new Dataset by adding columns or replacing the existing 
columns that has
+   * the same names.
+   */
+  private[spark] def withColumns(colNames: Seq[String], cols: 
Seq[Column]): DataFrame = {
--- End diff --

ping @cloud-fan or @gatorsmile Can you check the SQL part? Thanks.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


