[GitHub] spark issue #21129: [SPARK-7132][ML] Add fit with validation set to spark.ml...

2018-05-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21129
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3030/
Test PASSed.


---




[GitHub] spark issue #21129: [SPARK-7132][ML] Add fit with validation set to spark.ml...

2018-05-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21129
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #21266: [SPARK-24206][SQL] Improve DataSource read benchmark cod...

2018-05-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21266
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #21266: [SPARK-24206][SQL] Improve DataSource read benchmark cod...

2018-05-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21266
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3029/
Test PASSed.


---




[GitHub] spark issue #21193: [SPARK-24121][SQL] Add API for handling expression code ...

2018-05-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21193
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3028/
Test PASSed.


---




[GitHub] spark issue #21193: [SPARK-24121][SQL] Add API for handling expression code ...

2018-05-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21193
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #21266: [SPARK-24206][SQL] Improve DataSource read benchmark cod...

2018-05-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21266
  
**[Test build #90359 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90359/testReport)**
 for PR 21266 at commit 
[`09b3920`](https://github.com/apache/spark/commit/09b3920281c53d0853dd05bb064c2fc5c110564b).
 * This patch **fails Scala style tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #21266: [SPARK-24206][SQL] Improve DataSource read benchmark cod...

2018-05-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21266
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/90359/
Test FAILed.


---




[GitHub] spark issue #21266: [SPARK-24206][SQL] Improve DataSource read benchmark cod...

2018-05-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21266
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #21266: [SPARK-24206][SQL] Improve DataSource read benchmark cod...

2018-05-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21266
  
**[Test build #90359 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90359/testReport)**
 for PR 21266 at commit 
[`09b3920`](https://github.com/apache/spark/commit/09b3920281c53d0853dd05bb064c2fc5c110564b).


---




[GitHub] spark issue #21266: [SPARK-24206][SQL] Improve DataSource read benchmark cod...

2018-05-07 Thread maropu
Github user maropu commented on the issue:

https://github.com/apache/spark/pull/21266
  
@gatorsmile @dongjoon-hyun 


---




[GitHub] spark issue #21265: [SPARK-24146][PySpark][ML] spark.ml parity for sequentia...

2018-05-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21265
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3027/
Test PASSed.


---




[GitHub] spark issue #21265: [SPARK-24146][PySpark][ML] spark.ml parity for sequentia...

2018-05-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21265
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #21266: [SPARK-24206][SQL] Improve DataSource read benchmark cod...

2018-05-07 Thread maropu
Github user maropu commented on the issue:

https://github.com/apache/spark/pull/21266
  
Also, I'll make a follow-up PR for pushdown benchmarks:

https://github.com/apache/spark/compare/master...maropu:UpdateParquetBenchmark


---




[GitHub] spark pull request #21266: [SPARK-24206][SQL] Improve DataSource read benchm...

2018-05-07 Thread maropu
GitHub user maropu opened a pull request:

https://github.com/apache/spark/pull/21266

[SPARK-24206][SQL] Improve DataSource read benchmark code

## What changes were proposed in this pull request?
This pr adds benchmark code `DataSourceReadBenchmark` for `orc`, `parquet`, `csv`, and `json`, based on the existing `ParquetReadBenchmark` and `OrcReadBenchmark`.

## How was this patch tested?
N/A
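
For illustration, a minimal standalone sketch of how a multi-format read benchmark can be structured (hypothetical object name, data size, and paths; this is not the code in this PR):

    import org.apache.spark.sql.SparkSession

    object ReadBenchmarkSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().master("local[1]").appName("read-bench").getOrCreate()

        // Write the same data set once per format, then time a full scan of each.
        val df = spark.range(0, 1000000).selectExpr("id", "cast(id AS string) AS s")
        val formats = Seq("orc", "parquet", "csv", "json")

        formats.foreach { fmt =>
          val path = s"/tmp/read-bench/$fmt"
          df.write.mode("overwrite").format(fmt).save(path)
          val start = System.nanoTime()
          val cnt = spark.read.format(fmt).load(path).count()
          val elapsedMs = (System.nanoTime() - start) / 1e6
          println(f"$fmt%-8s rows=$cnt%-8d $elapsedMs%.1f ms")
        }
        spark.stop()
      }
    }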


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/maropu/spark DataSourceReadBenchmark

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21266.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21266


commit 09b3920281c53d0853dd05bb064c2fc5c110564b
Author: Takeshi Yamamuro 
Date:   2018-05-03T07:12:56Z

Fix




---




[GitHub] spark issue #21266: [SPARK-24206][SQL] Improve DataSource read benchmark cod...

2018-05-07 Thread maropu
Github user maropu commented on the issue:

https://github.com/apache/spark/pull/21266
  
I'll add benchmark results just after #21070 is merged.


---




[GitHub] spark issue #21193: [SPARK-24121][SQL] Add API for handling expression code ...

2018-05-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21193
  
**[Test build #90357 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90357/testReport)**
 for PR 21193 at commit 
[`53b329a`](https://github.com/apache/spark/commit/53b329a3b4e72d423cf503af2982d545848bece8).


---




[GitHub] spark issue #21129: [SPARK-7132][ML] Add fit with validation set to spark.ml...

2018-05-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21129
  
**[Test build #90358 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90358/testReport)**
 for PR 21129 at commit 
[`54f73af`](https://github.com/apache/spark/commit/54f73af6df40eb1b3591e02a906b0fb72bc2).


---




[GitHub] spark pull request #21193: [SPARK-24121][SQL] Add API for handling expressio...

2018-05-07 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/21193#discussion_r186621034
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/javaCode.scala
 ---
@@ -114,19 +114,128 @@ object JavaCode {
   }
 }
 
+/**
+ * A trait representing a block of java code.
+ */
+trait Block extends JavaCode {
+
+  // The expressions to be evaluated inside this block.
+  def exprValues: Seq[ExprValue]
+
+  // Returns java code string for this code block.
+  override def toString: String = _marginChar match {
+case Some(c) => code.stripMargin(c).trim
+case _ => code.trim
+  }
+
+  // The leading prefix that should be stripped from each line.
+  // By default we strip blanks or control characters followed by '|' from 
the line.
+  var _marginChar: Option[Char] = Some('|')
+
+  def stripMargin(c: Char): this.type = {
+_marginChar = Some(c)
+this
+  }
+
+  def stripMargin: this.type = {
+_marginChar = Some('|')
+this
+  }
+
+  // Concatenates this block with other block.
+  def + (other: Block): Block
+}
+
+object Block {
+
+  val CODE_BLOCK_BUFFER_LENGTH: Int = 512
+
+  implicit def blocksToBlock(blocks: Seq[Block]): Block = Blocks(blocks)
+
+  implicit class BlockHelper(val sc: StringContext) extends AnyVal {
+def code(args: Any*): Block = {
+  sc.checkLengths(args)
+  if (sc.parts.length == 0) {
+EmptyBlock
+  } else {
+args.foreach {
+  case _: ExprValue =>
+  case _: Int | _: Long | _: Float | _: Double | _: String =>
+  case _: Block =>
+  case other => throw new IllegalArgumentException(
+s"Can not interpolate ${other.getClass.getName} into code 
block.")
+}
+CodeBlock(sc.parts, args)
+  }
+}
+  }
+}
+
+/**
+ * A block of java code. Including a sequence of code parts and some 
inputs to this block.
+ * The actual java code is generated by embedding the inputs into the code 
parts.
+ */
+case class CodeBlock(codeParts: Seq[String], blockInputs: Seq[Any]) 
extends Block {
+  override def exprValues: Seq[ExprValue] = {
+blockInputs.flatMap {
+  case b: Block => b.exprValues
+  case e: ExprValue => Seq(e)
--- End diff --

Note that we should not do `distinct` on `blockInputs` because the lengths of `codeParts` and `blockInputs` must match.
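
For illustration, a minimal sketch of why the two sequences must stay aligned (plain Scala only; `parts`/`args` mimic what `StringContext` hands to the `code` interpolator quoted above, not the PR's actual implementation):

    // Interpolation interleaves parts and args: parts(0), args(0), parts(1), args(1), ...
    // so parts.length must equal args.length + 1 and the order must be preserved.
    val parts = Seq("int ", " = ", " + 1;")   // e.g. from code"int $a = $b + 1;"
    val args  = Seq("a", "b")                 // the interpolated inputs
    val buf = new StringBuilder
    val it = args.iterator
    parts.foreach { p =>
      buf.append(p)
      if (it.hasNext) buf.append(it.next())
    }
    assert(buf.toString == "int a = b + 1;")
    // Calling distinct on args would drop a repeated input and shift every
    // later input into the wrong slot of the generated code.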


---




[GitHub] spark issue #21265: [SPARK-24146][PySpark][ML] spark.ml parity for sequentia...

2018-05-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21265
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/90356/
Test FAILed.


---




[GitHub] spark issue #21265: [SPARK-24146][PySpark][ML] spark.ml parity for sequentia...

2018-05-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21265
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #21265: [SPARK-24146][PySpark][ML] spark.ml parity for sequentia...

2018-05-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21265
  
**[Test build #90356 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90356/testReport)**
 for PR 21265 at commit 
[`83eeea1`](https://github.com/apache/spark/commit/83eeea1c539d59a4d8496437dcf06d82b43b0ca2).
 * This patch **fails Python style tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `class PrefixSpan(object):`


---




[GitHub] spark issue #21265: [SPARK-24146][PySpark][ML] spark.ml parity for sequentia...

2018-05-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21265
  
**[Test build #90356 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90356/testReport)**
 for PR 21265 at commit 
[`83eeea1`](https://github.com/apache/spark/commit/83eeea1c539d59a4d8496437dcf06d82b43b0ca2).


---




[GitHub] spark pull request #21265: [SPARK-24146][PySpark][ML] spark.ml parity for se...

2018-05-07 Thread WeichenXu123
GitHub user WeichenXu123 opened a pull request:

https://github.com/apache/spark/pull/21265

[SPARK-24146][PySpark][ML] spark.ml parity for sequential pattern mining - 
PrefixSpan: Python API

## What changes were proposed in this pull request?

spark.ml parity for sequential pattern mining - PrefixSpan: Python API

## How was this patch tested?

doctests


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/WeichenXu123/spark prefix_span_py

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21265.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21265


commit 83eeea1c539d59a4d8496437dcf06d82b43b0ca2
Author: WeichenXu 
Date:   2018-05-08T05:29:24Z

init pr




---




[GitHub] spark pull request #21193: [SPARK-24121][SQL] Add API for handling expressio...

2018-05-07 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/21193#discussion_r186618899
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/javaCode.scala
 ---
@@ -114,19 +114,128 @@ object JavaCode {
   }
 }
 
+/**
+ * A trait representing a block of java code.
+ */
+trait Block extends JavaCode {
+
+  // The expressions to be evaluated inside this block.
+  def exprValues: Seq[ExprValue]
+
+  // Returns java code string for this code block.
+  override def toString: String = _marginChar match {
+case Some(c) => code.stripMargin(c).trim
+case _ => code.trim
+  }
+
+  // The leading prefix that should be stripped from each line.
+  // By default we strip blanks or control characters followed by '|' from 
the line.
+  var _marginChar: Option[Char] = Some('|')
+
+  def stripMargin(c: Char): this.type = {
+_marginChar = Some(c)
+this
+  }
+
+  def stripMargin: this.type = {
+_marginChar = Some('|')
+this
+  }
+
+  // Concatenates this block with other block.
+  def + (other: Block): Block
+}
+
+object Block {
+
+  val CODE_BLOCK_BUFFER_LENGTH: Int = 512
+
+  implicit def blocksToBlock(blocks: Seq[Block]): Block = Blocks(blocks)
+
+  implicit class BlockHelper(val sc: StringContext) extends AnyVal {
+def code(args: Any*): Block = {
+  sc.checkLengths(args)
+  if (sc.parts.length == 0) {
+EmptyBlock
+  } else {
+args.foreach {
+  case _: ExprValue =>
+  case _: Int | _: Long | _: Float | _: Double | _: String =>
+  case _: Block =>
+  case other => throw new IllegalArgumentException(
+s"Can not interpolate ${other.getClass.getName} into code 
block.")
+}
+CodeBlock(sc.parts, args)
+  }
+}
+  }
+}
+
+/**
+ * A block of java code. Including a sequence of code parts and some 
inputs to this block.
+ * The actual java code is generated by embedding the inputs into the code 
parts.
+ */
+case class CodeBlock(codeParts: Seq[String], blockInputs: Seq[Any]) 
extends Block {
+  override def exprValues: Seq[ExprValue] = {
--- End diff --

Ok.


---




[GitHub] spark pull request #21193: [SPARK-24121][SQL] Add API for handling expressio...

2018-05-07 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/21193#discussion_r186618877
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/javaCode.scala
 ---
@@ -114,19 +114,128 @@ object JavaCode {
   }
 }
 
+/**
+ * A trait representing a block of java code.
+ */
+trait Block extends JavaCode {
+
+  // The expressions to be evaluated inside this block.
+  def exprValues: Seq[ExprValue]
+
+  // Returns java code string for this code block.
+  override def toString: String = _marginChar match {
+case Some(c) => code.stripMargin(c).trim
+case _ => code.trim
+  }
+
+  // The leading prefix that should be stripped from each line.
+  // By default we strip blanks or control characters followed by '|' from 
the line.
+  var _marginChar: Option[Char] = Some('|')
+
+  def stripMargin(c: Char): this.type = {
+_marginChar = Some(c)
+this
+  }
+
+  def stripMargin: this.type = {
+_marginChar = Some('|')
+this
+  }
+
+  // Concatenates this block with other block.
+  def + (other: Block): Block
+}
+
+object Block {
+
+  val CODE_BLOCK_BUFFER_LENGTH: Int = 512
+
+  implicit def blocksToBlock(blocks: Seq[Block]): Block = Blocks(blocks)
+
+  implicit class BlockHelper(val sc: StringContext) extends AnyVal {
+def code(args: Any*): Block = {
+  sc.checkLengths(args)
+  if (sc.parts.length == 0) {
+EmptyBlock
+  } else {
+args.foreach {
+  case _: ExprValue =>
+  case _: Int | _: Long | _: Float | _: Double | _: String =>
+  case _: Block =>
+  case other => throw new IllegalArgumentException(
+s"Can not interpolate ${other.getClass.getName} into code 
block.")
+}
+CodeBlock(sc.parts, args)
+  }
+}
+  }
+}
+
+/**
+ * A block of java code. Including a sequence of code parts and some 
inputs to this block.
+ * The actual java code is generated by embedding the inputs into the code 
parts.
+ */
+case class CodeBlock(codeParts: Seq[String], blockInputs: Seq[Any]) 
extends Block {
+  override def exprValues: Seq[ExprValue] = {
+blockInputs.flatMap {
+  case b: Block => b.exprValues
+  case e: ExprValue => Seq(e)
+  case _ => Seq.empty
+}
+  }
+
+  override def code: String = {
+val strings = codeParts.iterator
+val inputs = blockInputs.iterator
+val buf = new StringBuilder(Block.CODE_BLOCK_BUFFER_LENGTH)
+buf append StringContext.treatEscapes(strings.next)
+while (strings.hasNext) {
+  buf append inputs.next
+  buf append StringContext.treatEscapes(strings.next)
+}
+buf.toString
+  }
+
+  override def + (other: Block): Block = other match {
+case c: CodeBlock => Blocks(Seq(this, c))
+case b: Blocks => Blocks(Seq(this) ++ b.blocks)
+case EmptyBlock => this
+  }
+}
+
+case class Blocks(blocks: Seq[Block]) extends Block {
+  override def exprValues: Seq[ExprValue] = blocks.flatMap(_.exprValues)
+  override def code: String = blocks.map(_.toString).mkString("\n")
+
+  override def + (other: Block): Block = other match {
+case c: CodeBlock => Blocks(blocks :+ c)
+case b: Blocks => Blocks(blocks ++ b.blocks)
+case EmptyBlock => this
+  }
+}
+
+object EmptyBlock extends Block with Serializable {
+  override def code: String = ""
+  override def exprValues: Seq[ExprValue] = Seq.empty
+
+  override def + (other: Block): Block = other
+}
+
 /**
  * A typed java fragment that must be a valid java expression.
  */
 trait ExprValue extends JavaCode {
   def javaType: Class[_]
   def isPrimitive: Boolean = javaType.isPrimitive
+
+  // This will be called during string interpolation.
+  override def toString: String = ExprValue.exprValueToString(this)
--- End diff --

I forgot to revert this change. In a previous commit, I needed `ExprValue.exprValueToString` as the only entry point for tracking `ExprValue`. It should be reverted now.


---




[GitHub] spark pull request #21193: [SPARK-24121][SQL] Add API for handling expressio...

2018-05-07 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/21193#discussion_r186618580
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/javaCode.scala
 ---
@@ -114,19 +114,128 @@ object JavaCode {
   }
 }
 
+/**
+ * A trait representing a block of java code.
+ */
+trait Block extends JavaCode {
+
+  // The expressions to be evaluated inside this block.
+  def exprValues: Seq[ExprValue]
+
+  // Returns java code string for this code block.
+  override def toString: String = _marginChar match {
+case Some(c) => code.stripMargin(c).trim
+case _ => code.trim
+  }
+
+  // The leading prefix that should be stripped from each line.
+  // By default we strip blanks or control characters followed by '|' from 
the line.
+  var _marginChar: Option[Char] = Some('|')
+
+  def stripMargin(c: Char): this.type = {
+_marginChar = Some(c)
+this
+  }
+
+  def stripMargin: this.type = {
+_marginChar = Some('|')
+this
+  }
+
+  // Concatenates this block with other block.
+  def + (other: Block): Block
+}
+
+object Block {
+
+  val CODE_BLOCK_BUFFER_LENGTH: Int = 512
+
+  implicit def blocksToBlock(blocks: Seq[Block]): Block = Blocks(blocks)
+
+  implicit class BlockHelper(val sc: StringContext) extends AnyVal {
+def code(args: Any*): Block = {
+  sc.checkLengths(args)
+  if (sc.parts.length == 0) {
+EmptyBlock
+  } else {
+args.foreach {
+  case _: ExprValue =>
+  case _: Int | _: Long | _: Float | _: Double | _: String =>
+  case _: Block =>
+  case other => throw new IllegalArgumentException(
+s"Can not interpolate ${other.getClass.getName} into code 
block.")
+}
+CodeBlock(sc.parts, args)
+  }
+}
+  }
+}
+
+/**
+ * A block of java code. Including a sequence of code parts and some 
inputs to this block.
+ * The actual java code is generated by embedding the inputs into the code 
parts.
+ */
+case class CodeBlock(codeParts: Seq[String], blockInputs: Seq[Any]) 
extends Block {
+  override def exprValues: Seq[ExprValue] = {
+blockInputs.flatMap {
+  case b: Block => b.exprValues
+  case e: ExprValue => Seq(e)
--- End diff --

It's possible. Would a `distinct` help here?


---




[GitHub] spark pull request #21193: [SPARK-24121][SQL] Add API for handling expressio...

2018-05-07 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/21193#discussion_r186618412
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala
 ---
@@ -623,8 +624,14 @@ case class Cast(child: Expression, dataType: DataType, 
timeZoneId: Option[String
   override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
 val eval = child.genCode(ctx)
 val nullSafeCast = nullSafeCastFunction(child.dataType, dataType, ctx)
+
+// Below the code comment including `eval.value` and `eval.isNull` is 
a trick. It makes the two
+// expr values are referred by this code block.
 ev.copy(code = eval.code +
-  castCode(ctx, eval.value, eval.isNull, ev.value, ev.isNull, 
dataType, nullSafeCast))
+  code"""
+// Cast from ${eval.value}, ${eval.isNull}
--- End diff --

I'd like to clean up this part further. The generated code in `Cast` is tangled now, IMHO.


---




[GitHub] spark pull request #16478: [SPARK-7768][SQL] Revise user defined types (UDT)

2018-05-07 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/16478#discussion_r186618254
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/types/UserDefinedType.scala ---
@@ -50,13 +54,46 @@ abstract class UserDefinedType[UserType >: Null] 
extends DataType with Serializa
   /** Serialized Python UDT class, if exists. */
   def serializedPyClass: String = null
 
-  /**
-   * Convert the user type to a SQL datum
-   */
-  def serialize(obj: UserType): Any
+  /** The RowEncoder used to serialize/deserialize extenal row to internal 
row format. */
+  private lazy val rowEncoder = UserDefinedType.getRowEncoder(sqlType)
--- End diff --

Due to the recently added constraint that SQLConf can't be accessed during task execution, this causes a test failure. I'm wondering if we have other options here...


---




[GitHub] spark pull request #21193: [SPARK-24121][SQL] Add API for handling expressio...

2018-05-07 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/21193#discussion_r186617171
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/javaCode.scala
 ---
@@ -114,19 +114,128 @@ object JavaCode {
   }
 }
 
+/**
+ * A trait representing a block of java code.
+ */
+trait Block extends JavaCode {
+
+  // The expressions to be evaluated inside this block.
+  def exprValues: Seq[ExprValue]
+
+  // Returns java code string for this code block.
+  override def toString: String = _marginChar match {
+case Some(c) => code.stripMargin(c).trim
+case _ => code.trim
+  }
+
+  // The leading prefix that should be stripped from each line.
+  // By default we strip blanks or control characters followed by '|' from 
the line.
+  var _marginChar: Option[Char] = Some('|')
+
+  def stripMargin(c: Char): this.type = {
+_marginChar = Some(c)
+this
+  }
+
+  def stripMargin: this.type = {
+_marginChar = Some('|')
+this
+  }
+
+  // Concatenates this block with other block.
+  def + (other: Block): Block
+}
+
+object Block {
+
+  val CODE_BLOCK_BUFFER_LENGTH: Int = 512
+
+  implicit def blocksToBlock(blocks: Seq[Block]): Block = Blocks(blocks)
+
+  implicit class BlockHelper(val sc: StringContext) extends AnyVal {
+def code(args: Any*): Block = {
+  sc.checkLengths(args)
+  if (sc.parts.length == 0) {
+EmptyBlock
+  } else {
+args.foreach {
+  case _: ExprValue =>
+  case _: Int | _: Long | _: Float | _: Double | _: String =>
+  case _: Block =>
+  case other => throw new IllegalArgumentException(
+s"Can not interpolate ${other.getClass.getName} into code 
block.")
+}
+CodeBlock(sc.parts, args)
+  }
+}
+  }
+}
+
+/**
+ * A block of java code. Including a sequence of code parts and some 
inputs to this block.
+ * The actual java code is generated by embedding the inputs into the code 
parts.
+ */
+case class CodeBlock(codeParts: Seq[String], blockInputs: Seq[Any]) 
extends Block {
+  override def exprValues: Seq[ExprValue] = {
+blockInputs.flatMap {
+  case b: Block => b.exprValues
+  case e: ExprValue => Seq(e)
+  case _ => Seq.empty
+}
+  }
+
+  override def code: String = {
+val strings = codeParts.iterator
+val inputs = blockInputs.iterator
+val buf = new StringBuilder(Block.CODE_BLOCK_BUFFER_LENGTH)
+buf append StringContext.treatEscapes(strings.next)
+while (strings.hasNext) {
+  buf append inputs.next
+  buf append StringContext.treatEscapes(strings.next)
+}
+buf.toString
+  }
+
+  override def + (other: Block): Block = other match {
+case c: CodeBlock => Blocks(Seq(this, c))
+case b: Blocks => Blocks(Seq(this) ++ b.blocks)
+case EmptyBlock => this
+  }
+}
+
+case class Blocks(blocks: Seq[Block]) extends Block {
+  override def exprValues: Seq[ExprValue] = blocks.flatMap(_.exprValues)
+  override def code: String = blocks.map(_.toString).mkString("\n")
+
+  override def + (other: Block): Block = other match {
+case c: CodeBlock => Blocks(blocks :+ c)
+case b: Blocks => Blocks(blocks ++ b.blocks)
+case EmptyBlock => this
+  }
+}
+
+object EmptyBlock extends Block with Serializable {
+  override def code: String = ""
+  override def exprValues: Seq[ExprValue] = Seq.empty
+
+  override def + (other: Block): Block = other
+}
+
 /**
  * A typed java fragment that must be a valid java expression.
  */
 trait ExprValue extends JavaCode {
   def javaType: Class[_]
   def isPrimitive: Boolean = javaType.isPrimitive
+
+  // This will be called during string interpolation.
+  override def toString: String = ExprValue.exprValueToString(this)
--- End diff --

Why is it needed? `JavaCode` already defines `override def toString: String = code`.


---




[GitHub] spark pull request #21193: [SPARK-24121][SQL] Add API for handling expressio...

2018-05-07 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/21193#discussion_r186615731
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala
 ---
@@ -623,8 +624,14 @@ case class Cast(child: Expression, dataType: DataType, 
timeZoneId: Option[String
   override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
 val eval = child.genCode(ctx)
 val nullSafeCast = nullSafeCastFunction(child.dataType, dataType, ctx)
+
+// Below the code comment including `eval.value` and `eval.isNull` is 
a trick. It makes the two
+// expr values are referred by this code block.
 ev.copy(code = eval.code +
-  castCode(ctx, eval.value, eval.isNull, ev.value, ev.isNull, 
dataType, nullSafeCast))
+  code"""
+// Cast from ${eval.value}, ${eval.isNull}
--- End diff --

How do we guarantee this is the only place we need to take care of?


---




[GitHub] spark pull request #21193: [SPARK-24121][SQL] Add API for handling expressio...

2018-05-07 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/21193#discussion_r186617013
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/javaCode.scala
 ---
@@ -114,19 +114,128 @@ object JavaCode {
   }
 }
 
+/**
+ * A trait representing a block of java code.
+ */
+trait Block extends JavaCode {
+
+  // The expressions to be evaluated inside this block.
+  def exprValues: Seq[ExprValue]
+
+  // Returns java code string for this code block.
+  override def toString: String = _marginChar match {
+case Some(c) => code.stripMargin(c).trim
+case _ => code.trim
+  }
+
+  // The leading prefix that should be stripped from each line.
+  // By default we strip blanks or control characters followed by '|' from 
the line.
+  var _marginChar: Option[Char] = Some('|')
+
+  def stripMargin(c: Char): this.type = {
+_marginChar = Some(c)
+this
+  }
+
+  def stripMargin: this.type = {
+_marginChar = Some('|')
+this
+  }
+
+  // Concatenates this block with other block.
+  def + (other: Block): Block
+}
+
+object Block {
+
+  val CODE_BLOCK_BUFFER_LENGTH: Int = 512
+
+  implicit def blocksToBlock(blocks: Seq[Block]): Block = Blocks(blocks)
+
+  implicit class BlockHelper(val sc: StringContext) extends AnyVal {
+def code(args: Any*): Block = {
+  sc.checkLengths(args)
+  if (sc.parts.length == 0) {
+EmptyBlock
+  } else {
+args.foreach {
+  case _: ExprValue =>
+  case _: Int | _: Long | _: Float | _: Double | _: String =>
+  case _: Block =>
+  case other => throw new IllegalArgumentException(
+s"Can not interpolate ${other.getClass.getName} into code 
block.")
+}
+CodeBlock(sc.parts, args)
+  }
+}
+  }
+}
+
+/**
+ * A block of java code. Including a sequence of code parts and some 
inputs to this block.
+ * The actual java code is generated by embedding the inputs into the code 
parts.
+ */
+case class CodeBlock(codeParts: Seq[String], blockInputs: Seq[Any]) 
extends Block {
+  override def exprValues: Seq[ExprValue] = {
+blockInputs.flatMap {
+  case b: Block => b.exprValues
+  case e: ExprValue => Seq(e)
--- End diff --

Is it possible that an `ExprValue` is referenced twice in the code string?


---




[GitHub] spark pull request #21193: [SPARK-24121][SQL] Add API for handling expressio...

2018-05-07 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/21193#discussion_r186616773
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/javaCode.scala
 ---
@@ -114,19 +114,128 @@ object JavaCode {
   }
 }
 
+/**
+ * A trait representing a block of java code.
+ */
+trait Block extends JavaCode {
+
+  // The expressions to be evaluated inside this block.
+  def exprValues: Seq[ExprValue]
+
+  // Returns java code string for this code block.
+  override def toString: String = _marginChar match {
+case Some(c) => code.stripMargin(c).trim
+case _ => code.trim
+  }
+
+  // The leading prefix that should be stripped from each line.
+  // By default we strip blanks or control characters followed by '|' from 
the line.
+  var _marginChar: Option[Char] = Some('|')
+
+  def stripMargin(c: Char): this.type = {
+_marginChar = Some(c)
+this
+  }
+
+  def stripMargin: this.type = {
+_marginChar = Some('|')
+this
+  }
+
+  // Concatenates this block with other block.
+  def + (other: Block): Block
+}
+
+object Block {
+
+  val CODE_BLOCK_BUFFER_LENGTH: Int = 512
+
+  implicit def blocksToBlock(blocks: Seq[Block]): Block = Blocks(blocks)
+
+  implicit class BlockHelper(val sc: StringContext) extends AnyVal {
+def code(args: Any*): Block = {
+  sc.checkLengths(args)
+  if (sc.parts.length == 0) {
+EmptyBlock
+  } else {
+args.foreach {
+  case _: ExprValue =>
+  case _: Int | _: Long | _: Float | _: Double | _: String =>
+  case _: Block =>
+  case other => throw new IllegalArgumentException(
+s"Can not interpolate ${other.getClass.getName} into code 
block.")
+}
+CodeBlock(sc.parts, args)
+  }
+}
+  }
+}
+
+/**
+ * A block of java code. Including a sequence of code parts and some 
inputs to this block.
+ * The actual java code is generated by embedding the inputs into the code 
parts.
+ */
+case class CodeBlock(codeParts: Seq[String], blockInputs: Seq[Any]) 
extends Block {
+  override def exprValues: Seq[ExprValue] = {
--- End diff --

`lazy val`? This might be expensive.
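
A minimal sketch of the suggestion (hypothetical, simplified types; `String` stands in for `ExprValue`, not the PR's final code): with a `lazy val`, the traversal of `blockInputs` runs once and the result is cached, while a `def` re-walks it on every call.

    case class CodeBlockSketch(codeParts: Seq[String], blockInputs: Seq[Any]) {
      // Computed on first access, then cached.
      lazy val exprValues: Seq[String] = blockInputs.flatMap {
        case s: String => Seq(s)   // stand-in for the ExprValue case
        case _ => Seq.empty
      }
    }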


---




[GitHub] spark pull request #21054: [SPARK-23907][SQL] Add regr_* functions

2018-05-07 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/21054#discussion_r186616594
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/regression.scala
 ---
@@ -0,0 +1,190 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.expressions.aggregate
+
+import org.apache.spark.sql.catalyst.dsl.expressions._
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.types.{AbstractDataType, DoubleType}
+
+/**
+ * Base trait for all regression functions.
+ */
+trait RegrLike extends AggregateFunction with ImplicitCastInputTypes {
+  def y: Expression
+  def x: Expression
+
+  override def children: Seq[Expression] = Seq(y, x)
+  override def inputTypes: Seq[AbstractDataType] = Seq(DoubleType, 
DoubleType)
+
+  protected def updateIfNotNull(exprs: Seq[Expression]): Seq[Expression] = 
{
+assert(aggBufferAttributes.length == exprs.length)
+val nullableChildren = children.filter(_.nullable)
+if (nullableChildren.isEmpty) {
+  exprs
+} else {
+  exprs.zip(aggBufferAttributes).map { case (e, a) =>
+If(nullableChildren.map(IsNull).reduce(Or), a, e)
+  }
+}
+  }
+}
+
+
+@ExpressionDescription(
+  usage = "_FUNC_(y, x) - Returns the number of non-null pairs.",
+  since = "2.4.0")
+case class RegrCount(y: Expression, x: Expression)
+  extends CountLike with RegrLike {
+
+  override lazy val updateExpressions: Seq[Expression] = 
updateIfNotNull(Seq(count + 1L))
+
+  override def prettyName: String = "regr_count"
+}
+
+
+@ExpressionDescription(
+  usage = "_FUNC_(y, x) - Returns SUM(x*x)-SUM(x)*SUM(x)/N. Any pair with 
a NULL is ignored.",
--- End diff --

It is reasonable to follow Hive. Personally, I like DB2 or Oracle, because these commercial DBMSs are normally more professional. : )


---




[GitHub] spark pull request #21201: [SPARK-24128][SQL] Mention configuration option i...

2018-05-07 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/21201


---




[GitHub] spark issue #21201: [SPARK-24128][SQL] Mention configuration option in impli...

2018-05-07 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/21201
  
Merged to master and branch-2.3.


---




[GitHub] spark pull request #21255: [SPARK-24186][SparR][SQL]change reverse and conca...

2018-05-07 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/21255#discussion_r186612820
  
--- Diff: R/pkg/tests/fulltests/test_sparkSQL.R ---
@@ -1739,6 +1748,13 @@ test_that("string operators", {
 collect(select(df5, repeat_string(df5$a, -1)))[1, 1],
 ""
   )
+
+  l6 <- list(list(a = "abc"))
+  df6 <- createDataFrame(l6)
+  expect_equal(
+collect(select(df6, reverse(df6$a)))[1, 1],
+"cba"
+  )
--- End diff --

Let's inline this while we are here. I would also put this test around 1505L above.


---




[GitHub] spark pull request #21255: [SPARK-24186][SparR][SQL]change reverse and conca...

2018-05-07 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/21255#discussion_r186612845
  
--- Diff: R/pkg/tests/fulltests/test_sparkSQL.R ---
@@ -1739,6 +1748,13 @@ test_that("string operators", {
 collect(select(df5, repeat_string(df5$a, -1)))[1, 1],
 ""
   )
+
+  l6 <- list(list(a = "abc"))
+  df6 <- createDataFrame(l6)
--- End diff --

I would combine those two lines too.


---




[GitHub] spark issue #21255: [SPARK-24186][SparR][SQL]change reverse and concat to co...

2018-05-07 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/21255
  
PR title `[SparR]` -> `[SparkR]` or `[R]`. FWIW, I use `[R]` or `[PYTHON]` 
since that's what you see if you read the contribution guide closely.


---




[GitHub] spark issue #21254: [SPARK-23094][SPARK-23723][SPARK-23724][SQL][FOLLOW-UP] ...

2018-05-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21254
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #21254: [SPARK-23094][SPARK-23723][SPARK-23724][SQL][FOLLOW-UP] ...

2018-05-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21254
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/90347/
Test PASSed.


---




[GitHub] spark issue #21254: [SPARK-23094][SPARK-23723][SPARK-23724][SQL][FOLLOW-UP] ...

2018-05-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21254
  
**[Test build #90347 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90347/testReport)**
 for PR 21254 at commit 
[`d4c290e`](https://github.com/apache/spark/commit/d4c290e85eab07706a8a612dbbf58d5c14588b43).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark pull request #21258: [SPARK-23933][SQL] Add map_fromarray function

2018-05-07 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/21258#discussion_r186611074
  
--- Diff: python/pyspark/sql/functions.py ---
@@ -1798,6 +1798,22 @@ def create_map(*cols):
 return Column(jc)
 
 
+@ignore_unicode_prefix
+@since(2.4)
+def create_map_fromarray(col1, col2):
--- End diff --

cc: @gatorsmile @ueshin


---




[GitHub] spark pull request #21255: [SPARK-24186][SparR][SQL]change reverse and conca...

2018-05-07 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/21255#discussion_r186610290
  
--- Diff: R/pkg/R/functions.R ---
@@ -218,6 +219,8 @@ NULL
 #' head(select(tmp3, map_keys(tmp3$v3)))
 #' head(select(tmp3, map_values(tmp3$v3)))
 #' head(select(tmp3, element_at(tmp3$v3, "Valiant")))}
--- End diff --

The closing `}` of this `dontrun` block is wrongly placed here.


---




[GitHub] spark issue #21255: [SPARK-24186][SparR][SQL]change reverse and concat to co...

2018-05-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21255
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #21255: [SPARK-24186][SparR][SQL]change reverse and concat to co...

2018-05-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21255
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/90353/
Test FAILed.


---




[GitHub] spark issue #21255: [SPARK-24186][SparR][SQL]change reverse and concat to co...

2018-05-07 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/21255
  
also cc @felixcheung.



---




[GitHub] spark issue #21255: [SPARK-24186][SparR][SQL]change reverse and concat to co...

2018-05-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21255
  
**[Test build #90353 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90353/testReport)**
 for PR 21255 at commit 
[`6b36d79`](https://github.com/apache/spark/commit/6b36d7999feb002655666700061f48448b97501b).
 * This patch **fails SparkR unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #21097: [SPARK-14682][ML] Provide evaluateEachIteration method o...

2018-05-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21097
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/90351/
Test PASSed.


---




[GitHub] spark issue #21097: [SPARK-14682][ML] Provide evaluateEachIteration method o...

2018-05-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21097
  
**[Test build #90351 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90351/testReport)**
 for PR 21097 at commit 
[`c32b5a8`](https://github.com/apache/spark/commit/c32b5a81de06d318aee1b18c82d82568d2e8).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #21097: [SPARK-14682][ML] Provide evaluateEachIteration method o...

2018-05-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21097
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #21262: [SPARK-24172][SQL]: Push projection and filters once whe...

2018-05-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21262
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/90346/
Test PASSed.


---




[GitHub] spark issue #21262: [SPARK-24172][SQL]: Push projection and filters once whe...

2018-05-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21262
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #21262: [SPARK-24172][SQL]: Push projection and filters once whe...

2018-05-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21262
  
**[Test build #90346 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90346/testReport)**
 for PR 21262 at commit 
[`7497cc2`](https://github.com/apache/spark/commit/7497cc2308136ded913b3745b9232487a949804a).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark pull request #21257: [SPARK-24194] [SQL]HadoopFsRelation cannot overwr...

2018-05-07 Thread zheh12
Github user zheh12 commented on a diff in the pull request:

https://github.com/apache/spark/pull/21257#discussion_r186609102
  
--- Diff: 
core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala
 ---
@@ -235,4 +247,20 @@ class HadoopMapReduceCommitProtocol(
   tmp.getFileSystem(taskContext.getConfiguration).delete(tmp, false)
 }
   }
+
+  /**
+   * now just record the file to be delete
+   */
+  override def deleteWithJob(fs: FileSystem,
--- End diff --

I have changed this code.


---




[GitHub] spark issue #21255: [SPARK-24186][SparR][SQL]change reverse and concat to co...

2018-05-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21255
  
**[Test build #90352 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90352/testReport)**
 for PR 21255 at commit 
[`64ba432`](https://github.com/apache/spark/commit/64ba432f7d854a96e70675ded4fdaa705ad01a06).
 * This patch **fails SparkR unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #21255: [SPARK-24186][SparR][SQL]change reverse and concat to co...

2018-05-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21255
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #21255: [SPARK-24186][SparR][SQL]change reverse and concat to co...

2018-05-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21255
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/90352/
Test FAILed.


---




[GitHub] spark issue #21193: [SPARK-24121][SQL] Add API for handling expression code ...

2018-05-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21193
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #21193: [SPARK-24121][SQL] Add API for handling expression code ...

2018-05-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21193
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3026/
Test PASSed.


---




[GitHub] spark issue #21193: [SPARK-24121][SQL] Add API for handling expression code ...

2018-05-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21193
  
**[Test build #90355 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90355/testReport)**
 for PR 21193 at commit 
[`aff411b`](https://github.com/apache/spark/commit/aff411bbf21f484c20588997d98682f2ec77191a).


---




[GitHub] spark pull request #21257: [SPARK-24194] [SQL]HadoopFsRelation cannot overwr...

2018-05-07 Thread zheh12
Github user zheh12 commented on a diff in the pull request:

https://github.com/apache/spark/pull/21257#discussion_r186608143
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala
 ---
@@ -207,9 +207,25 @@ case class InsertIntoHadoopFsRelationCommand(
 }
 // first clear the path determined by the static partition keys (e.g. 
/table/foo=1)
 val staticPrefixPath = 
qualifiedOutputPath.suffix(staticPartitionPrefix)
-if (fs.exists(staticPrefixPath) && !committer.deleteWithJob(fs, 
staticPrefixPath, true)) {
-  throw new IOException(s"Unable to clear output " +
-s"directory $staticPrefixPath prior to writing to it")
+
+// check if delete the dir or just sub files
+if (fs.exists(staticPrefixPath)) {
+  // check if is he table root, and record the file to delete
+  if (staticPartitionPrefix.isEmpty) {
+val files = fs.listFiles(staticPrefixPath, false)
+while (files.hasNext) {
+  val file = files.next()
+  if (!committer.deleteWithJob(fs, file.getPath, true)) {
--- End diff --

We chose to postpone the deletion. Whether or not `output` is the same as `input`, the `_temporary` directory is now created in the `output` directory before the deletion, so it is not possible to delete the root directory directly.

The original implementation could delete the root directory directly because the deletion happened before the job was created; the root directory was then rebuilt and the `_temporary` directory created inside it. With the original implementation, the failure of any `task` in the `job` would result in the loss of the `output` data.

I can't figure out how to separate the two situations. Do you have any good ideas?
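
To make the discussion concrete, a rough sketch of the "record now, delete at job commit" idea described above (hypothetical class and method shapes, not the PR's actual code):

    import scala.collection.mutable
    import org.apache.hadoop.fs.{FileSystem, Path}

    class DeferredDeleteSketch {
      private val pathsToDelete = mutable.ArrayBuffer[Path]()

      // Instead of deleting eagerly, only remember the path.
      def deleteWithJob(fs: FileSystem, path: Path): Boolean = {
        pathsToDelete += path
        true
      }

      // Perform the recorded deletions once the job commits, so a failed
      // task leaves the original output data intact.
      def commitJob(fs: FileSystem): Unit = {
        pathsToDelete.foreach(p => fs.delete(p, true))
      }
    }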


---




[GitHub] spark pull request #21010: [SPARK-23900][SQL] format_number support user spe...

2018-05-07 Thread ueshin
Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/21010#discussion_r186606198
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala
 ---
@@ -2108,35 +2133,57 @@ case class FormatNumber(x: Expression, d: 
Expression)
   // SPARK-13515: US Locale configures the DecimalFormat object to use 
a dot ('.')
   // as a decimal separator.
   val usLocale = "US"
-  val i = ctx.freshName("i")
-  val dFormat = ctx.freshName("dFormat")
-  val lastDValue =
-ctx.addMutableState(CodeGenerator.JAVA_INT, "lastDValue", v => 
s"$v = -100;")
-  val pattern = ctx.addMutableState(sb, "pattern", v => s"$v = new 
$sb();")
   val numberFormat = ctx.addMutableState(df, "numberFormat",
 v => s"""$v = new $df("", new $dfs($l.$usLocale));""")
 
-  s"""
-if ($d >= 0) {
-  $pattern.delete(0, $pattern.length());
-  if ($d != $lastDValue) {
-$pattern.append("#,###,###,###,###,###,##0");
-
-if ($d > 0) {
-  $pattern.append(".");
-  for (int $i = 0; $i < $d; $i++) {
-$pattern.append("0");
+  right.dataType match {
+case IntegerType =>
+  val pattern = ctx.addMutableState(sb, "pattern", v => s"$v = new 
$sb();")
+  val i = ctx.freshName("i")
+  val lastDValue =
+ctx.addMutableState(CodeGenerator.JAVA_INT, "lastDValue", v => 
s"$v = -100;")
+  s"""
+if ($d >= 0) {
+  $pattern.delete(0, $pattern.length());
+  if ($d != $lastDValue) {
+$pattern.append("$defaultFormat");
+
+if ($d > 0) {
+  $pattern.append(".");
+  for (int $i = 0; $i < $d; $i++) {
+$pattern.append("0");
+  }
+}
+$lastDValue = $d;
+$numberFormat.applyLocalizedPattern($pattern.toString());
   }
+  ${ev.value} = 
UTF8String.fromString($numberFormat.format(${typeHelper(num)}));
+} else {
+  ${ev.value} = null;
+  ${ev.isNull} = true;
 }
-$lastDValue = $d;
-$numberFormat.applyLocalizedPattern($pattern.toString());
-  }
-  ${ev.value} = 
UTF8String.fromString($numberFormat.format(${typeHelper(num)}));
-} else {
-  ${ev.value} = null;
-  ${ev.isNull} = true;
-}
-   """
+   """
+case StringType =>
+  val lastDValue = ctx.addMutableState("String", "lastDValue", v 
=> s"""$v = null;""")
+  val dValue = ctx.addMutableState("String", "dValue")
+  s"""
+$dValue = $d.toString();
+if (!$dValue.equals($lastDValue)) {
+  $lastDValue = $dValue;
+  if ($dValue.isEmpty()) {
+$numberFormat.applyLocalizedPattern("$defaultFormat");
+  } else {
+$numberFormat.applyLocalizedPattern($dValue);
+  }
+}
+${ev.value} = 
UTF8String.fromString($numberFormat.format(${typeHelper(num)}));
+   """
+case NullType =>
--- End diff --

Do we need this case at all?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21010: [SPARK-23900][SQL] format_number support user spe...

2018-05-07 Thread ueshin
Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/21010#discussion_r186606172
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala
 ---
@@ -2108,35 +2133,57 @@ case class FormatNumber(x: Expression, d: 
Expression)
   // SPARK-13515: US Locale configures the DecimalFormat object to use 
a dot ('.')
   // as a decimal separator.
   val usLocale = "US"
-  val i = ctx.freshName("i")
-  val dFormat = ctx.freshName("dFormat")
-  val lastDValue =
-ctx.addMutableState(CodeGenerator.JAVA_INT, "lastDValue", v => 
s"$v = -100;")
-  val pattern = ctx.addMutableState(sb, "pattern", v => s"$v = new 
$sb();")
   val numberFormat = ctx.addMutableState(df, "numberFormat",
 v => s"""$v = new $df("", new $dfs($l.$usLocale));""")
 
-  s"""
-if ($d >= 0) {
-  $pattern.delete(0, $pattern.length());
-  if ($d != $lastDValue) {
-$pattern.append("#,###,###,###,###,###,##0");
-
-if ($d > 0) {
-  $pattern.append(".");
-  for (int $i = 0; $i < $d; $i++) {
-$pattern.append("0");
+  right.dataType match {
+case IntegerType =>
+  val pattern = ctx.addMutableState(sb, "pattern", v => s"$v = new 
$sb();")
+  val i = ctx.freshName("i")
+  val lastDValue =
+ctx.addMutableState(CodeGenerator.JAVA_INT, "lastDValue", v => 
s"$v = -100;")
+  s"""
+if ($d >= 0) {
+  $pattern.delete(0, $pattern.length());
+  if ($d != $lastDValue) {
+$pattern.append("$defaultFormat");
+
+if ($d > 0) {
+  $pattern.append(".");
+  for (int $i = 0; $i < $d; $i++) {
+$pattern.append("0");
+  }
+}
+$lastDValue = $d;
+$numberFormat.applyLocalizedPattern($pattern.toString());
   }
+  ${ev.value} = 
UTF8String.fromString($numberFormat.format(${typeHelper(num)}));
+} else {
+  ${ev.value} = null;
+  ${ev.isNull} = true;
 }
-$lastDValue = $d;
-$numberFormat.applyLocalizedPattern($pattern.toString());
-  }
-  ${ev.value} = 
UTF8String.fromString($numberFormat.format(${typeHelper(num)}));
-} else {
-  ${ev.value} = null;
-  ${ev.isNull} = true;
-}
-   """
+   """
+case StringType =>
+  val lastDValue = ctx.addMutableState("String", "lastDValue", v 
=> s"""$v = null;""")
+  val dValue = ctx.addMutableState("String", "dValue")
--- End diff --

Do we need to make this mutable state?
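
For what it's worth, a minimal sketch of the alternative this question points at (the helper name and its string parameters are stand-ins for the real codegen variables, not the PR's actual code): emit `dValue` as a local declaration in the generated Java instead of registering it with `ctx.addMutableState`.

```
// Sketch only. d, lastDValue and numberFormat are the names of variables that
// already exist in the generated class; the returned string is generated Java.
def genApplyPattern(d: String, lastDValue: String, numberFormat: String): String = {
  val dValue = "dValue" // stand-in for ctx.freshName("dValue")
  s"""
     |String $dValue = $d.toString();
     |if (!$dValue.equals($lastDValue)) {
     |  $lastDValue = $dValue;
     |  if ($dValue.isEmpty()) {
     |    $numberFormat.applyLocalizedPattern("#,###,###,###,###,###,##0");
     |  } else {
     |    $numberFormat.applyLocalizedPattern($dValue);
     |  }
     |}
   """.stripMargin
}
```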


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21010: [SPARK-23900][SQL] format_number support user spe...

2018-05-07 Thread ueshin
Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/21010#discussion_r186607122
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/StringExpressionsSuite.scala
 ---
@@ -706,6 +706,30 @@ class StringExpressionsSuite extends SparkFunSuite 
with ExpressionEvalHelper {
   "15,159,339,180,002,773.2778")
 checkEvaluation(FormatNumber(Literal.create(null, IntegerType), 
Literal(3)), null)
 assert(FormatNumber(Literal.create(null, NullType), 
Literal(3)).resolved === false)
+
+checkEvaluation(FormatNumber(Literal(12332.123456), 
Literal("##.###")), "12332.123")
+checkEvaluation(FormatNumber(Literal(12332.123456), 
Literal("##.###")), "12332.123")
+checkEvaluation(FormatNumber(Literal(4.asInstanceOf[Byte]), 
Literal("##.")), "4")
+checkEvaluation(FormatNumber(Literal(4.asInstanceOf[Short]), 
Literal("##.")), "4")
+checkEvaluation(FormatNumber(Literal(4.0f), Literal("##.###")), "4")
+checkEvaluation(FormatNumber(Literal(4), Literal("##.###")), "4")
+checkEvaluation(FormatNumber(Literal(12831273.23481d),
+  Literal("###,###,###,###,###.###")), "12,831,273.235")
+checkEvaluation(FormatNumber(Literal(12831273.83421d), Literal("")), 
"12,831,274")
+checkEvaluation(FormatNumber(Literal(123123324123L), 
Literal("###,###,###,###,###.###")),
+  "123,123,324,123")
+checkEvaluation(
+  FormatNumber(Literal(Decimal(123123324123L) * 
Decimal(123123.21234d)),
+Literal("###,###,###,###,###.")), 
"15,159,339,180,002,773.2778")
+checkEvaluation(FormatNumber(Literal.create(null, IntegerType), 
Literal("##.###")), null)
+assert(FormatNumber(Literal.create(null, NullType), 
Literal("##.###")).resolved === false)
+
+checkEvaluation(FormatNumber(Literal(12332.123456), 
Literal("#,###,###,###,###,###,##0")),
+  "12,332")
+checkEvaluation(FormatNumber(
+  Literal.create(null, IntegerType), Literal.create(null, NullType)), 
null)
+checkEvaluation(FormatNumber(
+  Literal.create(null, NullType), Literal.create(null, NullType)), 
null)
--- End diff --

What are these two cases for?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21195: [Spark-23975][ML] Add support of array input for ...

2018-05-07 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/21195


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21129: [SPARK-7132][ML] Add fit with validation set to spark.ml...

2018-05-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21129
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21129: [SPARK-7132][ML] Add fit with validation set to spark.ml...

2018-05-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21129
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3025/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21195: [Spark-23975][ML] Add support of array input for all clu...

2018-05-07 Thread mengxr
Github user mengxr commented on the issue:

https://github.com/apache/spark/pull/21195
  
LGTM. Merged into master. Thanks!


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21259: [SPARK-24112][SQL] Add `convertMetastoreTableProp...

2018-05-07 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/21259#discussion_r186606656
  
--- Diff: docs/sql-programming-guide.md ---
@@ -1812,6 +1812,8 @@ working with timestamps in `pandas_udf`s to get the 
best performance, see
   - Since Spark 2.4, creating a managed table with nonempty location is 
not allowed. An exception is thrown when attempting to create a managed table 
with nonempty location. To set `true` to 
`spark.sql.allowCreatingManagedTableUsingNonemptyLocation` restores the 
previous behavior. This option will be removed in Spark 3.0.
   - Since Spark 2.4, the type coercion rules can automatically promote the 
argument types of the variadic SQL functions (e.g., IN/COALESCE) to the widest 
common type, no matter how the input arguments order. In prior Spark versions, 
the promotion could fail in some specific orders (e.g., TimestampType, 
IntegerType and StringType) and throw an exception.
   - In version 2.3 and earlier, `to_utc_timestamp` and 
`from_utc_timestamp` respect the timezone in the input timestamp string, which 
breaks the assumption that the input timestamp is in a specific timezone. 
Therefore, these 2 functions can return unexpected results. In version 2.4 and 
later, this problem has been fixed. `to_utc_timestamp` and `from_utc_timestamp` 
will return null if the input timestamp string contains timezone. As an 
example, `from_utc_timestamp('2000-10-10 00:00:00', 'GMT+1')` will return 
`2000-10-10 01:00:00` in both Spark 2.3 and 2.4. However, 
`from_utc_timestamp('2000-10-10 00:00:00+00:00', 'GMT+1')`, assuming a local 
timezone of GMT+8, will return `2000-10-10 09:00:00` in Spark 2.3 but `null` in 
2.4. For people who don't care about this problem and want to retain the 
previous behaivor to keep their query unchanged, you can set 
`spark.sql.function.rejectTimezoneInString` to false. This option will be 
removed in Spark 3.0 and should only be used as a tempora
 ry workaround.
+  - In version 2.3 and earlier, Spark converts Parquet Hive tables by 
default but ignores table properties like `TBLPROPERTIES (parquet.compression 
'NONE')`. This happens for ORC Hive table properties like `TBLPROPERTIES 
(orc.compress 'NONE')` in case of `spark.sql.hive.convertMetastoreOrc=true`, 
too. Since Spark 2.4, Spark supports Parquet/ORC specific table properties 
while converting Parquet/ORC Hive tables. As an example, `CREATE TABLE t(id 
int) STORED AS PARQUET TBLPROPERTIES (parquet.compression 'NONE')` would 
generate Snappy parquet files during insertion in Spark 2.3, and in Spark 2.4, 
the result would be uncompressed parquet files. To set `false` to 
`spark.sql.hive.convertMetastoreTableProperty` restores the previous behavior.
--- End diff --

Thank you for the reviews, @cloud-fan, @mridulm, @HyukjinKwon! I'll update it as suggested.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21129: [SPARK-7132][ML] Add fit with validation set to spark.ml...

2018-05-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21129
  
**[Test build #90354 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90354/testReport)**
 for PR 21129 at commit 
[`fd99f51`](https://github.com/apache/spark/commit/fd99f51388ba84a968cda665f73a9a4ad8eebd05).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21255: [SPARK-24186][SparR][SQL]change reverse and concat to co...

2018-05-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21255
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21255: [SPARK-24186][SparR][SQL]change reverse and concat to co...

2018-05-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21255
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3024/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21028: [SPARK-23922][SQL] Add arrays_overlap function

2018-05-07 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/21028#discussion_r186606368
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
 ---
@@ -28,6 +30,34 @@ import org.apache.spark.unsafe.Platform
 import org.apache.spark.unsafe.array.ByteArrayMethods
 import org.apache.spark.unsafe.types.{ByteArray, UTF8String}
 
+/**
+ * Base trait for [[BinaryExpression]]s with two arrays of the same 
element type and implicit
+ * casting.
+ */
+trait BinaryArrayExpressionWithImplicitCast extends BinaryExpression
+  with ImplicitCastInputTypes {
+
+  protected lazy val elementType: DataType = 
inputTypes.head.asInstanceOf[ArrayType].elementType
+
+  override def inputTypes: Seq[AbstractDataType] = {
+TypeCoercion.findWiderTypeForTwo(left.dataType, right.dataType) match {
--- End diff --

Now the question is: shall we allow precision loss for array functions?
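
For illustration only (plain Scala, not the coercion code itself): if the widened element type ends up being `DoubleType`, values that a wide decimal represents exactly are no longer representable, which is the precision loss in question.

```
// A decimal with more significant digits than a double can hold.
val d = BigDecimal("12345678901234567890.123456789")
val widened = d.toDouble           // low-order digits are silently dropped
println(BigDecimal(widened) == d)  // false
```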


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21235: [SPARK-24181][SQL] Better error message for writi...

2018-05-07 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/21235#discussion_r186606160
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala ---
@@ -339,9 +339,16 @@ final class DataFrameWriter[T] private[sql](ds: 
Dataset[T]) {
 }
   }
 
-  private def assertNotBucketed(operation: String): Unit = {
-if (numBuckets.isDefined || sortColumnNames.isDefined) {
-  throw new AnalysisException(s"'$operation' does not support 
bucketing right now")
--- End diff --

How about minimizing the code changes?
```
private def assertNotBucketed(operation: String): Unit = {
  if (getBucketSpec.isDefined) {
    throw new AnalysisException(s"'$operation' does not support bucketing right now")
  }
}
```


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21255: [SPARK-24186][SparR][SQL]change reverse and concat to co...

2018-05-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21255
  
**[Test build #90353 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90353/testReport)**
 for PR 21255 at commit 
[`6b36d79`](https://github.com/apache/spark/commit/6b36d7999feb002655666700061f48448b97501b).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21257: [SPARK-24194] [SQL]HadoopFsRelation cannot overwrite a p...

2018-05-07 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/21257
  
cc @ericl 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21257: [SPARK-24194] [SQL]HadoopFsRelation cannot overwr...

2018-05-07 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/21257#discussion_r186605552
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala
 ---
@@ -207,9 +207,25 @@ case class InsertIntoHadoopFsRelationCommand(
 }
 // first clear the path determined by the static partition keys (e.g. 
/table/foo=1)
 val staticPrefixPath = 
qualifiedOutputPath.suffix(staticPartitionPrefix)
-if (fs.exists(staticPrefixPath) && !committer.deleteWithJob(fs, 
staticPrefixPath, true)) {
-  throw new IOException(s"Unable to clear output " +
-s"directory $staticPrefixPath prior to writing to it")
+
+// check if delete the dir or just sub files
+if (fs.exists(staticPrefixPath)) {
+  // check if is he table root, and record the file to delete
+  if (staticPartitionPrefix.isEmpty) {
+val files = fs.listFiles(staticPrefixPath, false)
+while (files.hasNext) {
+  val file = files.next()
+  if (!committer.deleteWithJob(fs, file.getPath, true)) {
--- End diff --

> deleteMatchingPartitions happened only if overwrite is specified.

I mean, we should only do it if overwriting the same table.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21255: [SPARK-24186][SparR][SQL]change reverse and concat to co...

2018-05-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21255
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3023/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21255: [SPARK-24186][SparR][SQL]change reverse and concat to co...

2018-05-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21255
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21257: [SPARK-24194] [SQL]HadoopFsRelation cannot overwr...

2018-05-07 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/21257#discussion_r186605405
  
--- Diff: 
core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala
 ---
@@ -235,4 +247,20 @@ class HadoopMapReduceCommitProtocol(
   tmp.getFileSystem(taskContext.getConfiguration).delete(tmp, false)
 }
   }
+
+  /**
+   * now just record the file to be delete
+   */
+  override def deleteWithJob(fs: FileSystem,
--- End diff --

nit: code style. see 
https://github.com/apache/spark/pull/21257/files#diff-d97cfb576287a7655f32cd5675cbR132


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-05-07 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/21070
  
(While I am here) seems fine to me otherwise.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21255: [SPARK-24186][SparR][SQL]change reverse and concat to co...

2018-05-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21255
  
**[Test build #90352 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90352/testReport)**
 for PR 21255 at commit 
[`64ba432`](https://github.com/apache/spark/commit/64ba432f7d854a96e70675ded4fdaa705ad01a06).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21255: [SPARK-24186][SparR][SQL]change reverse and conca...

2018-05-07 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/21255#discussion_r186604950
  
--- Diff: R/pkg/tests/fulltests/test_sparkSQL.R ---
@@ -1748,6 +1748,14 @@ test_that("string operators", {
 collect(select(df5, repeat_string(df5$a, -1)))[1, 1],
 ""
   )
+
+  l6 <- list(list(a = "abc"))
+  df6 <- createDataFrame(l6)
+  df7 <- select(df6, reverse(df6$a))
--- End diff --

`df7` is not used below?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21199: [SPARK-24127][SS] Continuous text socket source

2018-05-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21199
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/90349/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21199: [SPARK-24127][SS] Continuous text socket source

2018-05-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21199
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21199: [SPARK-24127][SS] Continuous text socket source

2018-05-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21199
  
**[Test build #90349 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90349/testReport)**
 for PR 21199 at commit 
[`f010943`](https://github.com/apache/spark/commit/f010943699b184cc9572bda8651cb40d6231bfa3).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21255: [SPARK-24186][SparR][SQL]change reverse and conca...

2018-05-07 Thread huaxingao
Github user huaxingao commented on a diff in the pull request:

https://github.com/apache/spark/pull/21255#discussion_r186604721
  
--- Diff: R/pkg/tests/fulltests/test_sparkSQL.R ---
@@ -1502,12 +1502,21 @@ test_that("column functions", {
   result <- collect(select(df, sort_array(df[[1]])))[[1]]
   expect_equal(result, list(list(1L, 2L, 3L), list(4L, 5L, 6L)))
 
-  # Test flattern
+  result <- collect(select(df, reverse(df[[1]])))[[1]]
--- End diff --

@viirya Thanks for your comments. I will make the changes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21264: Branch 2.2

2018-05-07 Thread joy-m
Github user joy-m closed the pull request at:

https://github.com/apache/spark/pull/21264


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-05-07 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/21070
  
LGTM except 2 comments


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10....

2018-05-07 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/21070#discussion_r186604021
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedPlainValuesReader.java
 ---
@@ -63,115 +59,157 @@ public final void readBooleans(int total, 
WritableColumnVector c, int rowId) {
 }
   }
 
+  private ByteBuffer getBuffer(int length) {
+try {
+  return in.slice(length).order(ByteOrder.LITTLE_ENDIAN);
+} catch (IOException e) {
+  throw new ParquetDecodingException("Failed to read " + length + " 
bytes", e);
+}
+  }
+
   @Override
   public final void readIntegers(int total, WritableColumnVector c, int 
rowId) {
-c.putIntsLittleEndian(rowId, total, buffer, offset - 
Platform.BYTE_ARRAY_OFFSET);
-offset += 4 * total;
+int requiredBytes = total * 4;
+ByteBuffer buffer = getBuffer(requiredBytes);
+
+if (buffer.hasArray()) {
+  int offset = buffer.arrayOffset() + buffer.position();
+  c.putIntsLittleEndian(rowId, total, buffer.array(), offset);
+} else {
+  for (int i = 0; i < total; i += 1) {
+c.putInt(rowId + i, buffer.getInt());
+  }
+}
   }
 
   @Override
   public final void readLongs(int total, WritableColumnVector c, int 
rowId) {
-c.putLongsLittleEndian(rowId, total, buffer, offset - 
Platform.BYTE_ARRAY_OFFSET);
-offset += 8 * total;
+int requiredBytes = total * 8;
+ByteBuffer buffer = getBuffer(requiredBytes);
+
+if (buffer.hasArray()) {
+  int offset = buffer.arrayOffset() + buffer.position();
+  c.putLongsLittleEndian(rowId, total, buffer.array(), offset);
+} else {
+  for (int i = 0; i < total; i += 1) {
+c.putLong(rowId + i, buffer.getLong());
+  }
+}
   }
 
   @Override
   public final void readFloats(int total, WritableColumnVector c, int 
rowId) {
-c.putFloats(rowId, total, buffer, offset - Platform.BYTE_ARRAY_OFFSET);
-offset += 4 * total;
+int requiredBytes = total * 4;
+ByteBuffer buffer = getBuffer(requiredBytes);
+
+if (buffer.hasArray()) {
+  int offset = buffer.arrayOffset() + buffer.position();
+  c.putFloats(rowId, total, buffer.array(), offset);
+} else {
+  for (int i = 0; i < total; i += 1) {
+c.putFloat(rowId + i, buffer.getFloat());
+  }
+}
   }
 
   @Override
   public final void readDoubles(int total, WritableColumnVector c, int 
rowId) {
-c.putDoubles(rowId, total, buffer, offset - 
Platform.BYTE_ARRAY_OFFSET);
-offset += 8 * total;
+int requiredBytes = total * 8;
+ByteBuffer buffer = getBuffer(requiredBytes);
+
+if (buffer.hasArray()) {
+  int offset = buffer.arrayOffset() + buffer.position();
+  c.putDoubles(rowId, total, buffer.array(), offset);
+} else {
+  for (int i = 0; i < total; i += 1) {
+c.putDouble(rowId + i, buffer.getDouble());
+  }
+}
   }
 
   @Override
   public final void readBytes(int total, WritableColumnVector c, int 
rowId) {
-for (int i = 0; i < total; i++) {
-  // Bytes are stored as a 4-byte little endian int. Just read the 
first byte.
-  // TODO: consider pushing this in ColumnVector by adding a readBytes 
with a stride.
-  c.putByte(rowId + i, Platform.getByte(buffer, offset));
-  offset += 4;
+// Bytes are stored as a 4-byte little endian int. Just read the first 
byte.
+// TODO: consider pushing this in ColumnVector by adding a readBytes 
with a stride.
+int requiredBytes = total * 4;
+ByteBuffer buffer = getBuffer(requiredBytes);
+
+for (int i = 0; i < total; i += 1) {
+  c.putByte(rowId + i, buffer.get());
+  // skip the next 3 bytes
+  buffer.position(buffer.position() + 3);
 }
   }
 
   @Override
   public final boolean readBoolean() {
-byte b = Platform.getByte(buffer, offset);
-boolean v = (b & (1 << bitOffset)) != 0;
+// TODO: vectorize decoding and keep boolean[] instead of currentByte
+if (bitOffset == 0) {
+  try {
+currentByte = (byte) in.read();
+  } catch (IOException e) {
+throw new ParquetDecodingException("Failed to read a byte", e);
+  }
+}
+
+boolean v = (currentByte & (1 << bitOffset)) != 0;
 bitOffset += 1;
 if (bitOffset == 8) {
   bitOffset = 0;
-  offset++;
 }
 return v;
   }
 
   @Override
   public final int readInteger() {
-int v = Platform.getInt(buffer, offset);
-if 

[GitHub] spark pull request #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10....

2018-05-07 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/21070#discussion_r186603708
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedRleValuesReader.java
 ---
@@ -619,32 +608,37 @@ private int ceil8(int value) {
   /**
* Reads the next group.
*/
-  private void readNextGroup()  {
-int header = readUnsignedVarInt();
-this.mode = (header & 1) == 0 ? MODE.RLE : MODE.PACKED;
-switch (mode) {
-  case RLE:
-this.currentCount = header >>> 1;
-this.currentValue = readIntLittleEndianPaddedOnBitWidth();
-return;
-  case PACKED:
-int numGroups = header >>> 1;
-this.currentCount = numGroups * 8;
-int bytesToRead = ceil8(this.currentCount * this.bitWidth);
-
-if (this.currentBuffer.length < this.currentCount) {
-  this.currentBuffer = new int[this.currentCount];
-}
-currentBufferIdx = 0;
-int valueIndex = 0;
-for (int byteIndex = offset; valueIndex < this.currentCount; 
byteIndex += this.bitWidth) {
-  this.packer.unpack8Values(in, byteIndex, this.currentBuffer, 
valueIndex);
-  valueIndex += 8;
-}
-offset += bytesToRead;
-return;
-  default:
-throw new ParquetDecodingException("not a valid mode " + 
this.mode);
+  private void readNextGroup() {
+try {
+  int header = readUnsignedVarInt();
+  this.mode = (header & 1) == 0 ? MODE.RLE : MODE.PACKED;
+  switch (mode) {
+case RLE:
+  this.currentCount = header >>> 1;
+  this.currentValue = readIntLittleEndianPaddedOnBitWidth();
+  return;
+case PACKED:
+  int numGroups = header >>> 1;
+  this.currentCount = numGroups * 8;
+
+  if (this.currentBuffer.length < this.currentCount) {
+this.currentBuffer = new int[this.currentCount];
+  }
+  currentBufferIdx = 0;
+  int valueIndex = 0;
+  while (valueIndex < this.currentCount) {
+// values are bit packed 8 at a time, so reading bitWidth will 
always work
+ByteBuffer buffer = in.slice(bitWidth);
+this.packer.unpack8Values(
+buffer, buffer.arrayOffset() + buffer.position(), 
this.currentBuffer, valueIndex);
--- End diff --

shall we assume the `ByteBuffer` may not be on-heap?
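
A small sketch of the conservative answer (assuming the unpacker wants a `byte[]` plus offset, as in the hunk above; not the PR's actual code): copy a slice that is not heap-backed into a temporary array instead of calling `array()`/`arrayOffset()` on it.

```
import java.nio.ByteBuffer

// Returns a heap array and the offset at which the slice's bytes start.
def toHeapBytes(slice: ByteBuffer, length: Int): (Array[Byte], Int) = {
  if (slice.hasArray) {
    // Heap buffer: reuse its backing array directly.
    (slice.array(), slice.arrayOffset() + slice.position())
  } else {
    // Direct (off-heap) buffer: copy the bytes onto the heap first.
    val copy = new Array[Byte](length)
    slice.get(copy)
    (copy, 0)
  }
}
```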


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21259: [SPARK-24112][SQL] Add `convertMetastoreTableProp...

2018-05-07 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/21259#discussion_r186603749
  
--- Diff: docs/sql-programming-guide.md ---
@@ -1812,6 +1812,8 @@ working with timestamps in `pandas_udf`s to get the 
best performance, see
   - Since Spark 2.4, creating a managed table with nonempty location is 
not allowed. An exception is thrown when attempting to create a managed table 
with nonempty location. To set `true` to 
`spark.sql.allowCreatingManagedTableUsingNonemptyLocation` restores the 
previous behavior. This option will be removed in Spark 3.0.
   - Since Spark 2.4, the type coercion rules can automatically promote the 
argument types of the variadic SQL functions (e.g., IN/COALESCE) to the widest 
common type, no matter how the input arguments order. In prior Spark versions, 
the promotion could fail in some specific orders (e.g., TimestampType, 
IntegerType and StringType) and throw an exception.
   - In version 2.3 and earlier, `to_utc_timestamp` and 
`from_utc_timestamp` respect the timezone in the input timestamp string, which 
breaks the assumption that the input timestamp is in a specific timezone. 
Therefore, these 2 functions can return unexpected results. In version 2.4 and 
later, this problem has been fixed. `to_utc_timestamp` and `from_utc_timestamp` 
will return null if the input timestamp string contains timezone. As an 
example, `from_utc_timestamp('2000-10-10 00:00:00', 'GMT+1')` will return 
`2000-10-10 01:00:00` in both Spark 2.3 and 2.4. However, 
`from_utc_timestamp('2000-10-10 00:00:00+00:00', 'GMT+1')`, assuming a local 
timezone of GMT+8, will return `2000-10-10 09:00:00` in Spark 2.3 but `null` in 
2.4. For people who don't care about this problem and want to retain the 
previous behaivor to keep their query unchanged, you can set 
`spark.sql.function.rejectTimezoneInString` to false. This option will be 
removed in Spark 3.0 and should only be used as a tempora
 ry workaround.
+  - In version 2.3 and earlier, Spark converts Parquet Hive tables by 
default but ignores table properties like `TBLPROPERTIES (parquet.compression 
'NONE')`. This happens for ORC Hive table properties like `TBLPROPERTIES 
(orc.compress 'NONE')` in case of `spark.sql.hive.convertMetastoreOrc=true`, 
too. Since Spark 2.4, Spark supports Parquet/ORC specific table properties 
while converting Parquet/ORC Hive tables. As an example, `CREATE TABLE t(id 
int) STORED AS PARQUET TBLPROPERTIES (parquet.compression 'NONE')` would 
generate Snappy parquet files during insertion in Spark 2.3, and in Spark 2.4, 
the result would be uncompressed parquet files. To set `false` to 
`spark.sql.hive.convertMetastoreTableProperty` restores the previous behavior.
--- End diff --

+1 on ^ and @cloud-fan's.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21097: [SPARK-14682][ML] Provide evaluateEachIteration method o...

2018-05-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21097
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21097: [SPARK-14682][ML] Provide evaluateEachIteration method o...

2018-05-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21097
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3022/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21259: [SPARK-24112][SQL] Add `convertMetastoreTableProp...

2018-05-07 Thread mridulm
Github user mridulm commented on a diff in the pull request:

https://github.com/apache/spark/pull/21259#discussion_r186603250
  
--- Diff: docs/sql-programming-guide.md ---
@@ -1812,6 +1812,8 @@ working with timestamps in `pandas_udf`s to get the 
best performance, see
   - Since Spark 2.4, creating a managed table with nonempty location is 
not allowed. An exception is thrown when attempting to create a managed table 
with nonempty location. To set `true` to 
`spark.sql.allowCreatingManagedTableUsingNonemptyLocation` restores the 
previous behavior. This option will be removed in Spark 3.0.
   - Since Spark 2.4, the type coercion rules can automatically promote the 
argument types of the variadic SQL functions (e.g., IN/COALESCE) to the widest 
common type, no matter how the input arguments order. In prior Spark versions, 
the promotion could fail in some specific orders (e.g., TimestampType, 
IntegerType and StringType) and throw an exception.
   - In version 2.3 and earlier, `to_utc_timestamp` and 
`from_utc_timestamp` respect the timezone in the input timestamp string, which 
breaks the assumption that the input timestamp is in a specific timezone. 
Therefore, these 2 functions can return unexpected results. In version 2.4 and 
later, this problem has been fixed. `to_utc_timestamp` and `from_utc_timestamp` 
will return null if the input timestamp string contains timezone. As an 
example, `from_utc_timestamp('2000-10-10 00:00:00', 'GMT+1')` will return 
`2000-10-10 01:00:00` in both Spark 2.3 and 2.4. However, 
`from_utc_timestamp('2000-10-10 00:00:00+00:00', 'GMT+1')`, assuming a local 
timezone of GMT+8, will return `2000-10-10 09:00:00` in Spark 2.3 but `null` in 
2.4. For people who don't care about this problem and want to retain the 
previous behaivor to keep their query unchanged, you can set 
`spark.sql.function.rejectTimezoneInString` to false. This option will be 
removed in Spark 3.0 and should only be used as a tempora
 ry workaround.
+  - In version 2.3 and earlier, Spark converts Parquet Hive tables by 
default but ignores table properties like `TBLPROPERTIES (parquet.compression 
'NONE')`. This happens for ORC Hive table properties like `TBLPROPERTIES 
(orc.compress 'NONE')` in case of `spark.sql.hive.convertMetastoreOrc=true`, 
too. Since Spark 2.4, Spark supports Parquet/ORC specific table properties 
while converting Parquet/ORC Hive tables. As an example, `CREATE TABLE t(id 
int) STORED AS PARQUET TBLPROPERTIES (parquet.compression 'NONE')` would 
generate Snappy parquet files during insertion in Spark 2.3, and in Spark 2.4, 
the result would be uncompressed parquet files. To set `false` to 
`spark.sql.hive.convertMetastoreTableProperty` restores the previous behavior.
--- End diff --

Setting a property and expecting Spark to ignore it does not sound logical (Spark not honoring a property is a bug IMO, which, thankfully, has been fixed in 2.4).
Having said that, I agree with you that mentioning this in the migration guide might be sufficient; we have behavior changes between versions all the time, and a conf is not necessary when the change is in the right direction.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21264: Branch 2.2

2018-05-07 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/21264
  
@yotingting, mind closing this and opening an issue in JIRA or asking on the mailing list instead? I think you will get a better answer there. Please check out https://spark.apache.org/community.html too.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21257: [SPARK-24194] [SQL]HadoopFsRelation cannot overwr...

2018-05-07 Thread zheh12
Github user zheh12 commented on a diff in the pull request:

https://github.com/apache/spark/pull/21257#discussion_r186603145
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala
 ---
@@ -207,9 +207,25 @@ case class InsertIntoHadoopFsRelationCommand(
 }
 // first clear the path determined by the static partition keys (e.g. 
/table/foo=1)
 val staticPrefixPath = 
qualifiedOutputPath.suffix(staticPartitionPrefix)
-if (fs.exists(staticPrefixPath) && !committer.deleteWithJob(fs, 
staticPrefixPath, true)) {
-  throw new IOException(s"Unable to clear output " +
-s"directory $staticPrefixPath prior to writing to it")
+
+// check if delete the dir or just sub files
+if (fs.exists(staticPrefixPath)) {
+  // check if is he table root, and record the file to delete
+  if (staticPartitionPrefix.isEmpty) {
+val files = fs.listFiles(staticPrefixPath, false)
+while (files.hasNext) {
+  val file = files.next()
+  if (!committer.deleteWithJob(fs, file.getPath, true)) {
--- End diff --

1. From the code point of view, in the current implementation `deleteMatchingPartitions` happens only if `overwrite` is specified.
2. Using `dynamicPartitionOverwrite` will not solve this problem, because it also generates a `.stage` directory under the table root directory. We still need to record all the files we want to delete, and we cannot delete the root directory directly.
Dynamic partition overwrite effectively records all the partitions that need to be deleted and then deletes them one by one. A whole-table `overwrite` deletes all the data in the directory, so it also has to record every partition directory and file it removes; in that sense the implementation ends up similar to `dynamicPartitionOverwrite`. A rough sketch of this commit-time flow is shown below.
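
As a rough sketch of that commit-time flow (all names here are hypothetical, not the PR's code): each touched partition directory is replaced from a staging directory only when the job commits, rather than clearing the table root before the write starts.

```
import java.io.IOException
import org.apache.hadoop.fs.{FileSystem, Path}

// Replace each written partition with its staged counterpart at commit time.
def commitPartitions(fs: FileSystem, staging: Path, table: Path, partitions: Seq[String]): Unit = {
  partitions.foreach { partition =>
    val src = new Path(staging, partition)
    val dst = new Path(table, partition)
    // The old partition data is removed only now, after all tasks succeeded.
    if (fs.exists(dst) && !fs.delete(dst, true)) {
      throw new IOException(s"Unable to clear $dst before committing")
    }
    // Move the freshly written partition into place.
    if (!fs.rename(src, dst)) {
      throw new IOException(s"Failed to commit $src to $dst")
    }
  }
}
```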


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21097: [SPARK-14682][ML] Provide evaluateEachIteration method o...

2018-05-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21097
  
**[Test build #90351 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90351/testReport)**
 for PR 21097 at commit 
[`c32b5a8`](https://github.com/apache/spark/commit/c32b5a81de06d318aee1b18c82d82568d2e8).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10....

2018-05-07 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/21070#discussion_r186603010
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedPlainValuesReader.java
 ---
@@ -63,115 +59,157 @@ public final void readBooleans(int total, 
WritableColumnVector c, int rowId) {
 }
   }
 
+  private ByteBuffer getBuffer(int length) {
+try {
+  return in.slice(length).order(ByteOrder.LITTLE_ENDIAN);
+} catch (IOException e) {
+  throw new ParquetDecodingException("Failed to read " + length + " 
bytes", e);
+}
+  }
+
   @Override
   public final void readIntegers(int total, WritableColumnVector c, int 
rowId) {
-c.putIntsLittleEndian(rowId, total, buffer, offset - 
Platform.BYTE_ARRAY_OFFSET);
-offset += 4 * total;
+int requiredBytes = total * 4;
+ByteBuffer buffer = getBuffer(requiredBytes);
+
+if (buffer.hasArray()) {
+  int offset = buffer.arrayOffset() + buffer.position();
+  c.putIntsLittleEndian(rowId, total, buffer.array(), offset);
+} else {
+  for (int i = 0; i < total; i += 1) {
+c.putInt(rowId + i, buffer.getInt());
+  }
+}
   }
 
   @Override
   public final void readLongs(int total, WritableColumnVector c, int 
rowId) {
-c.putLongsLittleEndian(rowId, total, buffer, offset - 
Platform.BYTE_ARRAY_OFFSET);
-offset += 8 * total;
+int requiredBytes = total * 8;
+ByteBuffer buffer = getBuffer(requiredBytes);
+
+if (buffer.hasArray()) {
+  int offset = buffer.arrayOffset() + buffer.position();
+  c.putLongsLittleEndian(rowId, total, buffer.array(), offset);
+} else {
+  for (int i = 0; i < total; i += 1) {
+c.putLong(rowId + i, buffer.getLong());
+  }
+}
   }
 
   @Override
   public final void readFloats(int total, WritableColumnVector c, int 
rowId) {
-c.putFloats(rowId, total, buffer, offset - Platform.BYTE_ARRAY_OFFSET);
-offset += 4 * total;
+int requiredBytes = total * 4;
+ByteBuffer buffer = getBuffer(requiredBytes);
+
+if (buffer.hasArray()) {
+  int offset = buffer.arrayOffset() + buffer.position();
+  c.putFloats(rowId, total, buffer.array(), offset);
+} else {
+  for (int i = 0; i < total; i += 1) {
+c.putFloat(rowId + i, buffer.getFloat());
+  }
+}
   }
 
   @Override
   public final void readDoubles(int total, WritableColumnVector c, int 
rowId) {
-c.putDoubles(rowId, total, buffer, offset - 
Platform.BYTE_ARRAY_OFFSET);
-offset += 8 * total;
+int requiredBytes = total * 8;
+ByteBuffer buffer = getBuffer(requiredBytes);
+
+if (buffer.hasArray()) {
+  int offset = buffer.arrayOffset() + buffer.position();
+  c.putDoubles(rowId, total, buffer.array(), offset);
+} else {
+  for (int i = 0; i < total; i += 1) {
+c.putDouble(rowId + i, buffer.getDouble());
+  }
+}
   }
 
   @Override
   public final void readBytes(int total, WritableColumnVector c, int 
rowId) {
-for (int i = 0; i < total; i++) {
-  // Bytes are stored as a 4-byte little endian int. Just read the 
first byte.
-  // TODO: consider pushing this in ColumnVector by adding a readBytes 
with a stride.
-  c.putByte(rowId + i, Platform.getByte(buffer, offset));
-  offset += 4;
+// Bytes are stored as a 4-byte little endian int. Just read the first 
byte.
+// TODO: consider pushing this in ColumnVector by adding a readBytes 
with a stride.
+int requiredBytes = total * 4;
+ByteBuffer buffer = getBuffer(requiredBytes);
+
+for (int i = 0; i < total; i += 1) {
+  c.putByte(rowId + i, buffer.get());
+  // skip the next 3 bytes
+  buffer.position(buffer.position() + 3);
 }
   }
 
   @Override
   public final boolean readBoolean() {
-byte b = Platform.getByte(buffer, offset);
-boolean v = (b & (1 << bitOffset)) != 0;
+// TODO: vectorize decoding and keep boolean[] instead of currentByte
+if (bitOffset == 0) {
+  try {
+currentByte = (byte) in.read();
+  } catch (IOException e) {
+throw new ParquetDecodingException("Failed to read a byte", e);
+  }
+}
+
+boolean v = (currentByte & (1 << bitOffset)) != 0;
 bitOffset += 1;
 if (bitOffset == 8) {
   bitOffset = 0;
-  offset++;
 }
 return v;
   }
 
   @Override
   public final int readInteger() {
-int v = Platform.getInt(buffer, offset);
-if 
