[GitHub] spark issue #21129: [SPARK-7132][ML] Add fit with validation set to spark.ml...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21129 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3030/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21129: [SPARK-7132][ML] Add fit with validation set to spark.ml...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21129 Merged build finished. Test PASSed.
[GitHub] spark issue #21266: [SPARK-24206][SQL] Improve DataSource read benchmark cod...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21266 Merged build finished. Test PASSed.
[GitHub] spark issue #21266: [SPARK-24206][SQL] Improve DataSource read benchmark cod...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21266 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3029/ Test PASSed.
[GitHub] spark issue #21193: [SPARK-24121][SQL] Add API for handling expression code ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21193 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3028/ Test PASSed.
[GitHub] spark issue #21193: [SPARK-24121][SQL] Add API for handling expression code ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21193 Merged build finished. Test PASSed.
[GitHub] spark issue #21266: [SPARK-24206][SQL] Improve DataSource read benchmark cod...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21266 **[Test build #90359 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90359/testReport)** for PR 21266 at commit [`09b3920`](https://github.com/apache/spark/commit/09b3920281c53d0853dd05bb064c2fc5c110564b). * This patch **fails Scala style tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #21266: [SPARK-24206][SQL] Improve DataSource read benchmark cod...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21266 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/90359/ Test FAILed.
[GitHub] spark issue #21266: [SPARK-24206][SQL] Improve DataSource read benchmark cod...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21266 Merged build finished. Test FAILed.
[GitHub] spark issue #21266: [SPARK-24206][SQL] Improve DataSource read benchmark cod...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21266 **[Test build #90359 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90359/testReport)** for PR 21266 at commit [`09b3920`](https://github.com/apache/spark/commit/09b3920281c53d0853dd05bb064c2fc5c110564b).
[GitHub] spark issue #21266: [SPARK-24206][SQL] Improve DataSource read benchmark cod...
Github user maropu commented on the issue: https://github.com/apache/spark/pull/21266 @gatorsmile @dongjoon-hyun
[GitHub] spark issue #21265: [SPARK-24146][PySpark][ML] spark.ml parity for sequentia...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21265 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3027/ Test PASSed.
[GitHub] spark issue #21265: [SPARK-24146][PySpark][ML] spark.ml parity for sequentia...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21265 Merged build finished. Test PASSed.
[GitHub] spark issue #21266: [SPARK-24206][SQL] Improve DataSource read benchmark cod...
Github user maropu commented on the issue: https://github.com/apache/spark/pull/21266 Also, I'll make a follow-up PR for pushdown benchmarks: https://github.com/apache/spark/compare/master...maropu:UpdateParquetBenchmark
[GitHub] spark pull request #21266: [SPARK-24206][SQL] Improve DataSource read benchm...
GitHub user maropu opened a pull request: https://github.com/apache/spark/pull/21266 [SPARK-24206][SQL] Improve DataSource read benchmark code ## What changes were proposed in this pull request? This PR adds benchmark code `DataSourceReadBenchmark` for `orc`, `parquet`, `csv`, and `json`, based on the existing `ParquetReadBenchmark` and `OrcReadBenchmark`. ## How was this patch tested? N/A You can merge this pull request into a Git repository by running: $ git pull https://github.com/maropu/spark DataSourceReadBenchmark Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21266.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #21266 commit 09b3920281c53d0853dd05bb064c2fc5c110564b Author: Takeshi Yamamuro Date: 2018-05-03T07:12:56Z Fix
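[Editor's note] As a rough sketch of the shape such a read benchmark takes (a hypothetical harness in plain Scala; the actual `DataSourceReadBenchmark` builds on Spark's own benchmark utilities, so all names here are illustrative only):

```scala
// Hypothetical sketch: a tiny timing harness run once per data source format.
// This is NOT Spark's benchmark framework; names and structure are illustrative.
object ReadBenchmarkSketch extends App {
  def time[A](label: String)(body: => A): A = {
    val start = System.nanoTime()
    val result = body
    println(f"$label%-8s ${(System.nanoTime() - start) / 1e6}%.1f ms")
    result
  }

  val formats = Seq("parquet", "orc", "csv", "json")
  // With a SparkSession in scope, each iteration would read the same dataset, e.g.:
  //   formats.foreach { f => time(f) { spark.read.format(f).load(path).count() } }
  // Here we only exercise the harness itself with a trivial workload.
  formats.foreach { f => time(f) { f.length } }
}
```

The point of benchmarking all four formats against the same generated data is that the vectorized readers (parquet, orc) and the row-based parsers (csv, json) become directly comparable.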
[GitHub] spark issue #21266: [SPARK-24206][SQL] Improve DataSource read benchmark cod...
Github user maropu commented on the issue: https://github.com/apache/spark/pull/21266 I'll add benchmark results just after #21070 is merged.
[GitHub] spark issue #21193: [SPARK-24121][SQL] Add API for handling expression code ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21193 **[Test build #90357 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90357/testReport)** for PR 21193 at commit [`53b329a`](https://github.com/apache/spark/commit/53b329a3b4e72d423cf503af2982d545848bece8).
[GitHub] spark issue #21129: [SPARK-7132][ML] Add fit with validation set to spark.ml...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21129 **[Test build #90358 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90358/testReport)** for PR 21129 at commit [`54f73af`](https://github.com/apache/spark/commit/54f73af6df40eb1b3591e02a906b0fb72bc2).
[GitHub] spark pull request #21193: [SPARK-24121][SQL] Add API for handling expressio...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/21193#discussion_r186621034 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/javaCode.scala --- @@ -114,19 +114,128 @@ object JavaCode { } } +/** + * A trait representing a block of java code. + */ +trait Block extends JavaCode { + + // The expressions to be evaluated inside this block. + def exprValues: Seq[ExprValue] + + // Returns java code string for this code block. + override def toString: String = _marginChar match { +case Some(c) => code.stripMargin(c).trim +case _ => code.trim + } + + // The leading prefix that should be stripped from each line. + // By default we strip blanks or control characters followed by '|' from the line. + var _marginChar: Option[Char] = Some('|') + + def stripMargin(c: Char): this.type = { +_marginChar = Some(c) +this + } + + def stripMargin: this.type = { +_marginChar = Some('|') +this + } + + // Concatenates this block with other block. + def + (other: Block): Block +} + +object Block { + + val CODE_BLOCK_BUFFER_LENGTH: Int = 512 + + implicit def blocksToBlock(blocks: Seq[Block]): Block = Blocks(blocks) + + implicit class BlockHelper(val sc: StringContext) extends AnyVal { +def code(args: Any*): Block = { + sc.checkLengths(args) + if (sc.parts.length == 0) { +EmptyBlock + } else { +args.foreach { + case _: ExprValue => + case _: Int | _: Long | _: Float | _: Double | _: String => + case _: Block => + case other => throw new IllegalArgumentException( +s"Can not interpolate ${other.getClass.getName} into code block.") +} +CodeBlock(sc.parts, args) + } +} + } +} + +/** + * A block of java code. Including a sequence of code parts and some inputs to this block. + * The actual java code is generated by embedding the inputs into the code parts. 
+ */ +case class CodeBlock(codeParts: Seq[String], blockInputs: Seq[Any]) extends Block { + override def exprValues: Seq[ExprValue] = { +blockInputs.flatMap { + case b: Block => b.exprValues + case e: ExprValue => Seq(e) --- End diff -- Note that we should not do `distinct` on `blockInputs` because the lengths of `codeParts` and `blockInputs` must stay in sync.
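[Editor's note] The invariant behind this comment can be demonstrated with plain Scala's `StringContext` (a standalone sketch, no Spark dependencies): an interpolated string always holds one more literal part than it has arguments, and parts and arguments are interleaved positionally, so deduplicating the argument list would break the pairing.

```scala
// Standalone sketch of the invariant the comment above relies on:
// StringContext always holds args.length + 1 parts, and interpolation
// interleaves parts and arguments positionally.
object StringContextInvariant extends App {
  val value = "inputadapter_value"   // stand-ins for two ExprValue names
  val isNull = "inputadapter_isNull"
  val sc = StringContext("if (!", ") { use(", "); }")
  // Two arguments require exactly three parts.
  assert(sc.parts.length == 2 + 1)
  // Positional interleaving: part(0) + arg(0) + part(1) + arg(1) + part(2).
  assert(sc.s(isNull, value) == "if (!inputadapter_isNull) { use(inputadapter_value); }")
  // If the two arguments happened to be equal, calling distinct on them would
  // leave three parts but only one argument, and interpolation could no longer
  // line the two sequences up.
}
```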
[GitHub] spark issue #21265: [SPARK-24146][PySpark][ML] spark.ml parity for sequentia...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21265 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/90356/ Test FAILed.
[GitHub] spark issue #21265: [SPARK-24146][PySpark][ML] spark.ml parity for sequentia...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21265 Merged build finished. Test FAILed.
[GitHub] spark issue #21265: [SPARK-24146][PySpark][ML] spark.ml parity for sequentia...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21265 **[Test build #90356 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90356/testReport)** for PR 21265 at commit [`83eeea1`](https://github.com/apache/spark/commit/83eeea1c539d59a4d8496437dcf06d82b43b0ca2). * This patch **fails Python style tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class PrefixSpan(object):`
[GitHub] spark issue #21265: [SPARK-24146][PySpark][ML] spark.ml parity for sequentia...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21265 **[Test build #90356 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90356/testReport)** for PR 21265 at commit [`83eeea1`](https://github.com/apache/spark/commit/83eeea1c539d59a4d8496437dcf06d82b43b0ca2).
[GitHub] spark pull request #21265: [SPARK-24146][PySpark][ML] spark.ml parity for se...
GitHub user WeichenXu123 opened a pull request: https://github.com/apache/spark/pull/21265 [SPARK-24146][PySpark][ML] spark.ml parity for sequential pattern mining - PrefixSpan: Python API ## What changes were proposed in this pull request? spark.ml parity for sequential pattern mining - PrefixSpan: Python API ## How was this patch tested? doctests You can merge this pull request into a Git repository by running: $ git pull https://github.com/WeichenXu123/spark prefix_span_py Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21265.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #21265 commit 83eeea1c539d59a4d8496437dcf06d82b43b0ca2 Author: WeichenXu Date: 2018-05-08T05:29:24Z init pr
[GitHub] spark pull request #21193: [SPARK-24121][SQL] Add API for handling expressio...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/21193#discussion_r186618899 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/javaCode.scala --- @@ -114,19 +114,128 @@ [...] +case class CodeBlock(codeParts: Seq[String], blockInputs: Seq[Any]) extends Block { + override def exprValues: Seq[ExprValue] = { --- End diff -- Ok.
[GitHub] spark pull request #21193: [SPARK-24121][SQL] Add API for handling expressio...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/21193#discussion_r186618877 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/javaCode.scala --- @@ -114,19 +114,128 @@ [...] /** * A typed java fragment that must be a valid java expression. */ trait ExprValue extends JavaCode { def javaType: Class[_] def isPrimitive: Boolean = javaType.isPrimitive + + // This will be called during string interpolation. + override def toString: String = ExprValue.exprValueToString(this) --- End diff -- Forgot to revert this change. In the previous commit, I needed `ExprValue.exprValueToString` as the only entry point for tracking `ExprValue`. This should be reverted now.
[GitHub] spark pull request #21193: [SPARK-24121][SQL] Add API for handling expressio...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/21193#discussion_r186618580 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/javaCode.scala --- @@ -114,19 +114,128 @@ [...] +case class CodeBlock(codeParts: Seq[String], blockInputs: Seq[Any]) extends Block { + override def exprValues: Seq[ExprValue] = { +blockInputs.flatMap { + case b: Block => b.exprValues + case e: ExprValue => Seq(e) --- End diff -- It's possible. Would a `distinct` help here?
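[Editor's note] One way to reconcile the two constraints raised in this thread (no `distinct` on `blockInputs`, yet deduplicated `exprValues`) is to apply `distinct` only to the derived value list. A hypothetical, heavily simplified mirror of the classes under review:

```scala
// Hypothetical, simplified stand-ins for the ExprValue/CodeBlock pair under
// review (nested Blocks omitted): distinct is safe on the derived exprValues
// view, while blockInputs keeps its original length and pairing with parts.
case class ExprValue(name: String)

case class CodeBlock(codeParts: Seq[String], blockInputs: Seq[Any]) {
  def exprValues: Seq[ExprValue] =
    blockInputs.collect { case e: ExprValue => e }.distinct
}

object DistinctSketch extends App {
  val v = ExprValue("value_0")
  // The same value interpolated twice: three parts, two inputs.
  val block = CodeBlock(Seq("use(", ") and (", ")"), Seq(v, v))
  assert(block.exprValues == Seq(v))      // duplicates collapsed in the view
  assert(block.blockInputs.length == 2)   // inputs untouched, still pair with parts
}
```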
[GitHub] spark pull request #21193: [SPARK-24121][SQL] Add API for handling expressio...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/21193#discussion_r186618412 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala --- @@ -623,8 +624,14 @@ case class Cast(child: Expression, dataType: DataType, timeZoneId: Option[String override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = { val eval = child.genCode(ctx) val nullSafeCast = nullSafeCastFunction(child.dataType, dataType, ctx) + +// Below the code comment including `eval.value` and `eval.isNull` is a trick. It makes the two +// expr values are referred by this code block. ev.copy(code = eval.code + - castCode(ctx, eval.value, eval.isNull, ev.value, ev.isNull, dataType, nullSafeCast)) + code""" +// Cast from ${eval.value}, ${eval.isNull} --- End diff -- I'd like to clean this part further. The generated code in `Cast` is tangled now, IMHO.
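[Editor's note] The "trick" under discussion can be sketched outside Spark with a toy interpolator (hypothetical and heavily simplified; the real one lives in javaCode.scala): because a block tracks exactly the values interpolated into it, a value the cast code depends on must appear as an interpolation argument somewhere, even if only inside a generated comment line.

```scala
// Toy mirror (NOT Spark's implementation) of the `code` interpolator: the
// block tracks exactly the values interpolated into it, which is why
// referencing eval.value / eval.isNull in a comment registers them.
object CodeInterp {
  case class ExprValue(name: String) { override def toString: String = name }

  case class CodeBlock(parts: Seq[String], args: Seq[Any]) {
    def exprValues: Seq[ExprValue] = args.collect { case e: ExprValue => e }
    // Interleave the literal parts with the stringified arguments.
    def code: String =
      parts.zipAll(args.map(_.toString), "", "").map { case (p, a) => p + a }.mkString
  }

  implicit class CodeHelper(val sc: StringContext) extends AnyVal {
    def code(args: Any*): CodeBlock = CodeBlock(sc.parts, args)
  }
}

object CastTrickSketch extends App {
  import CodeInterp._
  val value = ExprValue("inputadapter_value")
  val isNull = ExprValue("inputadapter_isNull")
  val block = code"// Cast from $value, $isNull"
  assert(block.code == "// Cast from inputadapter_value, inputadapter_isNull")
  assert(block.exprValues == Seq(value, isNull))  // both values now tracked
}
```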
[GitHub] spark pull request #16478: [SPARK-7768][SQL] Revise user defined types (UDT)
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/16478#discussion_r186618254 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/types/UserDefinedType.scala --- @@ -50,13 +54,46 @@ abstract class UserDefinedType[UserType >: Null] extends DataType with Serializa /** Serialized Python UDT class, if exists. */ def serializedPyClass: String = null - /** - * Convert the user type to a SQL datum - */ - def serialize(obj: UserType): Any + /** The RowEncoder used to serialize/deserialize extenal row to internal row format. */ + private lazy val rowEncoder = UserDefinedType.getRowEncoder(sqlType) --- End diff -- Due to the recently added constraint that SQLConf can't be accessed during task execution, this causes the test failure. I'm wondering if we have other options here...
[GitHub] spark pull request #21193: [SPARK-24121][SQL] Add API for handling expressio...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/21193#discussion_r186617171 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/javaCode.scala --- @@ -114,19 +114,128 @@ [...] trait ExprValue extends JavaCode { def javaType: Class[_] def isPrimitive: Boolean = javaType.isPrimitive + + // This will be called during string interpolation. + override def toString: String = ExprValue.exprValueToString(this) --- End diff -- Why is it needed? `JavaCode` already defines `override def toString: String = code`.
[GitHub] spark pull request #21193: [SPARK-24121][SQL] Add API for handling expressio...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/21193#discussion_r186615731 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala --- @@ -623,8 +624,14 @@ [...] + code""" +// Cast from ${eval.value}, ${eval.isNull} --- End diff -- How do we guarantee this is the only one we need to take care of?
[GitHub] spark pull request #21193: [SPARK-24121][SQL] Add API for handling expressio...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/21193#discussion_r186617013 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/javaCode.scala --- @@ -114,19 +114,128 @@ [...] +case class CodeBlock(codeParts: Seq[String], blockInputs: Seq[Any]) extends Block { + override def exprValues: Seq[ExprValue] = { +blockInputs.flatMap { + case b: Block => b.exprValues + case e: ExprValue => Seq(e) --- End diff -- Is it possible that an `ExprValue` is referenced twice in the code string?
[GitHub] spark pull request #21193: [SPARK-24121][SQL] Add API for handling expressio...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/21193#discussion_r186616773 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/javaCode.scala --- @@ -114,19 +114,128 @@ object JavaCode { } } +/** + * A trait representing a block of java code. + */ +trait Block extends JavaCode { + + // The expressions to be evaluated inside this block. + def exprValues: Seq[ExprValue] + + // Returns java code string for this code block. + override def toString: String = _marginChar match { +case Some(c) => code.stripMargin(c).trim +case _ => code.trim + } + + // The leading prefix that should be stripped from each line. + // By default we strip blanks or control characters followed by '|' from the line. + var _marginChar: Option[Char] = Some('|') + + def stripMargin(c: Char): this.type = { +_marginChar = Some(c) +this + } + + def stripMargin: this.type = { +_marginChar = Some('|') +this + } + + // Concatenates this block with other block. + def + (other: Block): Block +} + +object Block { + + val CODE_BLOCK_BUFFER_LENGTH: Int = 512 + + implicit def blocksToBlock(blocks: Seq[Block]): Block = Blocks(blocks) + + implicit class BlockHelper(val sc: StringContext) extends AnyVal { +def code(args: Any*): Block = { + sc.checkLengths(args) + if (sc.parts.length == 0) { +EmptyBlock + } else { +args.foreach { + case _: ExprValue => + case _: Int | _: Long | _: Float | _: Double | _: String => + case _: Block => + case other => throw new IllegalArgumentException( +s"Can not interpolate ${other.getClass.getName} into code block.") +} +CodeBlock(sc.parts, args) + } +} + } +} + +/** + * A block of java code. Including a sequence of code parts and some inputs to this block. + * The actual java code is generated by embedding the inputs into the code parts. 
+ */ +case class CodeBlock(codeParts: Seq[String], blockInputs: Seq[Any]) extends Block { + override def exprValues: Seq[ExprValue] = { --- End diff -- `lazy val`? this might be expensive.
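The suggestion to use `lazy val` is about caching an expensive computation so it runs at most once instead of on every access. A rough Python analogue of that trade-off (illustrative class, not Spark's), using `functools.cached_property` in place of Scala's `lazy val`:

```python
from functools import cached_property

class CodeBlock:
    """Illustrative model (not Spark's class) of why `lazy val` helps:
    the expensive computation runs once and is cached, instead of being
    re-run on every access as a plain method would be."""

    def __init__(self, block_inputs):
        self.block_inputs = block_inputs
        self.times_computed = 0

    @cached_property
    def expr_values(self):
        self.times_computed += 1        # track how often the body runs
        return [i for i in self.block_inputs if isinstance(i, str)]

block = CodeBlock(["value_1", 42, "value_2"])
first = block.expr_values
second = block.expr_values              # served from the cache
```

As with `lazy val`, this is only safe when the result does not depend on mutable state that changes between accesses.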
[GitHub] spark pull request #21054: [SPARK-23907][SQL] Add regr_* functions
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/21054#discussion_r186616594 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/regression.scala --- @@ -0,0 +1,190 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.expressions.aggregate + +import org.apache.spark.sql.catalyst.dsl.expressions._ +import org.apache.spark.sql.catalyst.expressions._ +import org.apache.spark.sql.types.{AbstractDataType, DoubleType} + +/** + * Base trait for all regression functions. 
+ */ +trait RegrLike extends AggregateFunction with ImplicitCastInputTypes { + def y: Expression + def x: Expression + + override def children: Seq[Expression] = Seq(y, x) + override def inputTypes: Seq[AbstractDataType] = Seq(DoubleType, DoubleType) + + protected def updateIfNotNull(exprs: Seq[Expression]): Seq[Expression] = { +assert(aggBufferAttributes.length == exprs.length) +val nullableChildren = children.filter(_.nullable) +if (nullableChildren.isEmpty) { + exprs +} else { + exprs.zip(aggBufferAttributes).map { case (e, a) => +If(nullableChildren.map(IsNull).reduce(Or), a, e) + } +} + } +} + + +@ExpressionDescription( + usage = "_FUNC_(y, x) - Returns the number of non-null pairs.", + since = "2.4.0") +case class RegrCount(y: Expression, x: Expression) + extends CountLike with RegrLike { + + override lazy val updateExpressions: Seq[Expression] = updateIfNotNull(Seq(count + 1L)) + + override def prettyName: String = "regr_count" +} + + +@ExpressionDescription( + usage = "_FUNC_(y, x) - Returns SUM(x*x)-SUM(x)*SUM(x)/N. Any pair with a NULL is ignored.", --- End diff -- It is reasonable to follow Hive. Personally, I prefer DB2 or Oracle, because these commercial DBMSs are usually more rigorous. : )
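The semantics under review are simple to state outside Spark: `regr_count` counts the pairs where neither side is NULL, and `regr_sxx` computes `SUM(x*x) - SUM(x)*SUM(x)/N` over those same pairs. A small reference model in Python (a sketch of the documented formulas, not Spark's implementation):

```python
def regr_count(pairs):
    """Number of (y, x) pairs in which neither value is NULL."""
    return sum(1 for y, x in pairs if y is not None and x is not None)

def regr_sxx(pairs):
    """SUM(x*x) - SUM(x)*SUM(x)/N over the non-NULL pairs."""
    xs = [x for y, x in pairs if y is not None and x is not None]
    n = len(xs)
    if n == 0:
        return None
    return sum(x * x for x in xs) - sum(xs) ** 2 / n

pairs = [(1.0, 2.0), (None, 3.0), (2.0, 4.0)]
count = regr_count(pairs)   # 2 -- the (None, 3.0) pair is ignored
sxx = regr_sxx(pairs)       # (4 + 16) - 6*6/2 = 2.0
```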
[GitHub] spark pull request #21201: [SPARK-24128][SQL] Mention configuration option i...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/21201
[GitHub] spark issue #21201: [SPARK-24128][SQL] Mention configuration option in impli...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21201 Merged to master and branch-2.3.
[GitHub] spark pull request #21255: [SPARK-24186][SparR][SQL]change reverse and conca...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/21255#discussion_r186612820 --- Diff: R/pkg/tests/fulltests/test_sparkSQL.R --- @@ -1739,6 +1748,13 @@ test_that("string operators", { collect(select(df5, repeat_string(df5$a, -1)))[1, 1], "" ) + + l6 <- list(list(a = "abc")) + df6 <- createDataFrame(l6) + expect_equal( +collect(select(df6, reverse(df6$a)))[1, 1], +"cba" + ) --- End diff -- Let's inline this while we are here. I would also put this test around 1505L above.
[GitHub] spark pull request #21255: [SPARK-24186][SparR][SQL]change reverse and conca...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/21255#discussion_r186612845 --- Diff: R/pkg/tests/fulltests/test_sparkSQL.R --- @@ -1739,6 +1748,13 @@ test_that("string operators", { collect(select(df5, repeat_string(df5$a, -1)))[1, 1], "" ) + + l6 <- list(list(a = "abc")) + df6 <- createDataFrame(l6) --- End diff -- I would combine those two lines too.
[GitHub] spark issue #21255: [SPARK-24186][SparR][SQL]change reverse and concat to co...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21255 PR title `[SparR]` -> `[SparkR]` or `[R]`. FWIW, I use `[R]` or `[PYTHON]` since that's what you see if you read the contribution guide closely.
[GitHub] spark issue #21254: [SPARK-23094][SPARK-23723][SPARK-23724][SQL][FOLLOW-UP] ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21254 Merged build finished. Test PASSed.
[GitHub] spark issue #21254: [SPARK-23094][SPARK-23723][SPARK-23724][SQL][FOLLOW-UP] ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21254 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/90347/ Test PASSed.
[GitHub] spark issue #21254: [SPARK-23094][SPARK-23723][SPARK-23724][SQL][FOLLOW-UP] ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21254 **[Test build #90347 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90347/testReport)** for PR 21254 at commit [`d4c290e`](https://github.com/apache/spark/commit/d4c290e85eab07706a8a612dbbf58d5c14588b43). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #21258: [SPARK-23933][SQL] Add map_fromarray function
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/21258#discussion_r186611074 --- Diff: python/pyspark/sql/functions.py --- @@ -1798,6 +1798,22 @@ def create_map(*cols): return Column(jc) +@ignore_unicode_prefix +@since(2.4) +def create_map_fromarray(col1, col2): --- End diff -- cc: @gatorsmile @ueshin
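The function being added builds a map column from two array columns, pairing keys with values position by position. A hypothetical reference model of that behavior in plain Python (the function name and NULL/length handling here are illustrative assumptions, not pyspark's API):

```python
def map_from_arrays(keys, values):
    """Hypothetical reference model (not pyspark's API) of the behavior the
    new function targets: zip a keys array and a values array into a map."""
    if keys is None or values is None:
        return None                      # NULL input -> NULL map (assumed)
    if len(keys) != len(values):
        raise ValueError("keys and values must have the same length")
    return dict(zip(keys, values))

m = map_from_arrays(["a", "b"], [1, 2])  # {'a': 1, 'b': 2}
```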
[GitHub] spark pull request #21255: [SPARK-24186][SparR][SQL]change reverse and conca...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/21255#discussion_r186610290 --- Diff: R/pkg/R/functions.R --- @@ -218,6 +219,8 @@ NULL #' head(select(tmp3, map_keys(tmp3$v3))) #' head(select(tmp3, map_values(tmp3$v3))) #' head(select(tmp3, element_at(tmp3$v3, "Valiant")))} --- End diff -- The closing `}` for this `dontrun` block is misplaced here.
[GitHub] spark issue #21255: [SPARK-24186][SparR][SQL]change reverse and concat to co...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21255 Merged build finished. Test FAILed.
[GitHub] spark issue #21255: [SPARK-24186][SparR][SQL]change reverse and concat to co...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21255 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/90353/ Test FAILed.
[GitHub] spark issue #21255: [SPARK-24186][SparR][SQL]change reverse and concat to co...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/21255 also cc @felixcheung.
[GitHub] spark issue #21255: [SPARK-24186][SparR][SQL]change reverse and concat to co...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21255 **[Test build #90353 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90353/testReport)** for PR 21255 at commit [`6b36d79`](https://github.com/apache/spark/commit/6b36d7999feb002655666700061f48448b97501b). * This patch **fails SparkR unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #21097: [SPARK-14682][ML] Provide evaluateEachIteration method o...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21097 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/90351/ Test PASSed.
[GitHub] spark issue #21097: [SPARK-14682][ML] Provide evaluateEachIteration method o...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21097 **[Test build #90351 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90351/testReport)** for PR 21097 at commit [`c32b5a8`](https://github.com/apache/spark/commit/c32b5a81de06d318aee1b18c82d82568d2e8). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #21097: [SPARK-14682][ML] Provide evaluateEachIteration method o...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21097 Merged build finished. Test PASSed.
[GitHub] spark issue #21262: [SPARK-24172][SQL]: Push projection and filters once whe...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21262 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/90346/ Test PASSed.
[GitHub] spark issue #21262: [SPARK-24172][SQL]: Push projection and filters once whe...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21262 Merged build finished. Test PASSed.
[GitHub] spark issue #21262: [SPARK-24172][SQL]: Push projection and filters once whe...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21262 **[Test build #90346 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90346/testReport)** for PR 21262 at commit [`7497cc2`](https://github.com/apache/spark/commit/7497cc2308136ded913b3745b9232487a949804a). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #21257: [SPARK-24194] [SQL]HadoopFsRelation cannot overwr...
Github user zheh12 commented on a diff in the pull request: https://github.com/apache/spark/pull/21257#discussion_r186609102 --- Diff: core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala --- @@ -235,4 +247,20 @@ class HadoopMapReduceCommitProtocol( tmp.getFileSystem(taskContext.getConfiguration).delete(tmp, false) } } + + /** + * now just record the file to be delete + */ + override def deleteWithJob(fs: FileSystem, --- End diff -- I have changed this code.
[GitHub] spark issue #21255: [SPARK-24186][SparR][SQL]change reverse and concat to co...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21255 **[Test build #90352 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90352/testReport)** for PR 21255 at commit [`64ba432`](https://github.com/apache/spark/commit/64ba432f7d854a96e70675ded4fdaa705ad01a06). * This patch **fails SparkR unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #21255: [SPARK-24186][SparR][SQL]change reverse and concat to co...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21255 Merged build finished. Test FAILed.
[GitHub] spark issue #21255: [SPARK-24186][SparR][SQL]change reverse and concat to co...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21255 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/90352/ Test FAILed.
[GitHub] spark issue #21193: [SPARK-24121][SQL] Add API for handling expression code ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21193 Merged build finished. Test PASSed.
[GitHub] spark issue #21193: [SPARK-24121][SQL] Add API for handling expression code ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21193 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3026/ Test PASSed.
[GitHub] spark issue #21193: [SPARK-24121][SQL] Add API for handling expression code ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21193 **[Test build #90355 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90355/testReport)** for PR 21193 at commit [`aff411b`](https://github.com/apache/spark/commit/aff411bbf21f484c20588997d98682f2ec77191a).
[GitHub] spark pull request #21257: [SPARK-24194] [SQL]HadoopFsRelation cannot overwr...
Github user zheh12 commented on a diff in the pull request: https://github.com/apache/spark/pull/21257#discussion_r186608143 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala --- @@ -207,9 +207,25 @@ case class InsertIntoHadoopFsRelationCommand( } // first clear the path determined by the static partition keys (e.g. /table/foo=1) val staticPrefixPath = qualifiedOutputPath.suffix(staticPartitionPrefix) -if (fs.exists(staticPrefixPath) && !committer.deleteWithJob(fs, staticPrefixPath, true)) { - throw new IOException(s"Unable to clear output " + -s"directory $staticPrefixPath prior to writing to it") + +// check if delete the dir or just sub files +if (fs.exists(staticPrefixPath)) { + // check if is he table root, and record the file to delete + if (staticPartitionPrefix.isEmpty) { +val files = fs.listFiles(staticPrefixPath, false) +while (files.hasNext) { + val file = files.next() + if (!committer.deleteWithJob(fs, file.getPath, true)) { --- End diff -- We chose to postpone the deletion. Whether or not `output` is the same as `input`, the `_temporary` directory is now created inside the `output` directory before deletion, so the root directory can no longer be deleted directly. The original implementation could delete the root directory directly because the deletion happened before the job was created; the root directory was then rebuilt and the `_temporary` directory created inside it. In that original implementation, the failure of any `task` in the `job` would result in the loss of the `output` data. I can't figure out how to separate the two situations. Do you have any good ideas?
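The workaround being discussed amounts to this: since the job's `_temporary` staging directory already lives under the output root, the root itself can no longer be removed wholesale; instead its children are deleted one by one, skipping the staging directory. A local-filesystem sketch of that idea (names and layout are illustrative, not Spark's code):

```python
import shutil
from pathlib import Path

def clear_output_root(root):
    """Sketch of the postponed-deletion idea under discussion: delete the
    output root's children individually, keeping the job's `_temporary`
    staging directory, instead of deleting the root itself."""
    for child in Path(root).iterdir():
        if child.name == "_temporary":
            continue                    # keep the job's staging directory
        if child.is_dir():
            shutil.rmtree(child)
        else:
            child.unlink()
```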
[GitHub] spark pull request #21010: [SPARK-23900][SQL] format_number support user spe...
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/21010#discussion_r186606198 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala --- @@ -2108,35 +2133,57 @@ case class FormatNumber(x: Expression, d: Expression) // SPARK-13515: US Locale configures the DecimalFormat object to use a dot ('.') // as a decimal separator. val usLocale = "US" - val i = ctx.freshName("i") - val dFormat = ctx.freshName("dFormat") - val lastDValue = -ctx.addMutableState(CodeGenerator.JAVA_INT, "lastDValue", v => s"$v = -100;") - val pattern = ctx.addMutableState(sb, "pattern", v => s"$v = new $sb();") val numberFormat = ctx.addMutableState(df, "numberFormat", v => s"""$v = new $df("", new $dfs($l.$usLocale));""") - s""" -if ($d >= 0) { - $pattern.delete(0, $pattern.length()); - if ($d != $lastDValue) { -$pattern.append("#,###,###,###,###,###,##0"); - -if ($d > 0) { - $pattern.append("."); - for (int $i = 0; $i < $d; $i++) { -$pattern.append("0"); + right.dataType match { +case IntegerType => + val pattern = ctx.addMutableState(sb, "pattern", v => s"$v = new $sb();") + val i = ctx.freshName("i") + val lastDValue = +ctx.addMutableState(CodeGenerator.JAVA_INT, "lastDValue", v => s"$v = -100;") + s""" +if ($d >= 0) { + $pattern.delete(0, $pattern.length()); + if ($d != $lastDValue) { +$pattern.append("$defaultFormat"); + +if ($d > 0) { + $pattern.append("."); + for (int $i = 0; $i < $d; $i++) { +$pattern.append("0"); + } +} +$lastDValue = $d; +$numberFormat.applyLocalizedPattern($pattern.toString()); } + ${ev.value} = UTF8String.fromString($numberFormat.format(${typeHelper(num)})); +} else { + ${ev.value} = null; + ${ev.isNull} = true; } -$lastDValue = $d; -$numberFormat.applyLocalizedPattern($pattern.toString()); - } - ${ev.value} = UTF8String.fromString($numberFormat.format(${typeHelper(num)})); -} else { - ${ev.value} = null; - ${ev.isNull} = true; -} - """ + """ +case StringType => + val 
lastDValue = ctx.addMutableState("String", "lastDValue", v => s"""$v = null;""") + val dValue = ctx.addMutableState("String", "dValue") + s""" +$dValue = $d.toString(); +if (!$dValue.equals($lastDValue)) { + $lastDValue = $dValue; + if ($dValue.isEmpty()) { +$numberFormat.applyLocalizedPattern("$defaultFormat"); + } else { +$numberFormat.applyLocalizedPattern($dValue); + } +} +${ev.value} = UTF8String.fromString($numberFormat.format(${typeHelper(num)})); + """ +case NullType => --- End diff -- We don't need this case? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21010: [SPARK-23900][SQL] format_number support user spe...
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/21010#discussion_r186606172 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala --- @@ -2108,35 +2133,57 @@ case class FormatNumber(x: Expression, d: Expression) // SPARK-13515: US Locale configures the DecimalFormat object to use a dot ('.') // as a decimal separator. val usLocale = "US" - val i = ctx.freshName("i") - val dFormat = ctx.freshName("dFormat") - val lastDValue = -ctx.addMutableState(CodeGenerator.JAVA_INT, "lastDValue", v => s"$v = -100;") - val pattern = ctx.addMutableState(sb, "pattern", v => s"$v = new $sb();") val numberFormat = ctx.addMutableState(df, "numberFormat", v => s"""$v = new $df("", new $dfs($l.$usLocale));""") - s""" -if ($d >= 0) { - $pattern.delete(0, $pattern.length()); - if ($d != $lastDValue) { -$pattern.append("#,###,###,###,###,###,##0"); - -if ($d > 0) { - $pattern.append("."); - for (int $i = 0; $i < $d; $i++) { -$pattern.append("0"); + right.dataType match { +case IntegerType => + val pattern = ctx.addMutableState(sb, "pattern", v => s"$v = new $sb();") + val i = ctx.freshName("i") + val lastDValue = +ctx.addMutableState(CodeGenerator.JAVA_INT, "lastDValue", v => s"$v = -100;") + s""" +if ($d >= 0) { + $pattern.delete(0, $pattern.length()); + if ($d != $lastDValue) { +$pattern.append("$defaultFormat"); + +if ($d > 0) { + $pattern.append("."); + for (int $i = 0; $i < $d; $i++) { +$pattern.append("0"); + } +} +$lastDValue = $d; +$numberFormat.applyLocalizedPattern($pattern.toString()); } + ${ev.value} = UTF8String.fromString($numberFormat.format(${typeHelper(num)})); +} else { + ${ev.value} = null; + ${ev.isNull} = true; } -$lastDValue = $d; -$numberFormat.applyLocalizedPattern($pattern.toString()); - } - ${ev.value} = UTF8String.fromString($numberFormat.format(${typeHelper(num)})); -} else { - ${ev.value} = null; - ${ev.isNull} = true; -} - """ + """ +case StringType => + val 
lastDValue = ctx.addMutableState("String", "lastDValue", v => s"""$v = null;""") + val dValue = ctx.addMutableState("String", "dValue") --- End diff -- Do we need to make this mutable state? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21010: [SPARK-23900][SQL] format_number support user spe...
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/21010#discussion_r186607122 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/StringExpressionsSuite.scala --- @@ -706,6 +706,30 @@ class StringExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper { "15,159,339,180,002,773.2778") checkEvaluation(FormatNumber(Literal.create(null, IntegerType), Literal(3)), null) assert(FormatNumber(Literal.create(null, NullType), Literal(3)).resolved === false) + +checkEvaluation(FormatNumber(Literal(12332.123456), Literal("##.###")), "12332.123") +checkEvaluation(FormatNumber(Literal(12332.123456), Literal("##.###")), "12332.123") +checkEvaluation(FormatNumber(Literal(4.asInstanceOf[Byte]), Literal("##.")), "4") +checkEvaluation(FormatNumber(Literal(4.asInstanceOf[Short]), Literal("##.")), "4") +checkEvaluation(FormatNumber(Literal(4.0f), Literal("##.###")), "4") +checkEvaluation(FormatNumber(Literal(4), Literal("##.###")), "4") +checkEvaluation(FormatNumber(Literal(12831273.23481d), + Literal("###,###,###,###,###.###")), "12,831,273.235") +checkEvaluation(FormatNumber(Literal(12831273.83421d), Literal("")), "12,831,274") +checkEvaluation(FormatNumber(Literal(123123324123L), Literal("###,###,###,###,###.###")), + "123,123,324,123") +checkEvaluation( + FormatNumber(Literal(Decimal(123123324123L) * Decimal(123123.21234d)), +Literal("###,###,###,###,###.")), "15,159,339,180,002,773.2778") +checkEvaluation(FormatNumber(Literal.create(null, IntegerType), Literal("##.###")), null) +assert(FormatNumber(Literal.create(null, NullType), Literal("##.###")).resolved === false) + +checkEvaluation(FormatNumber(Literal(12332.123456), Literal("#,###,###,###,###,###,##0")), + "12,332") +checkEvaluation(FormatNumber( + Literal.create(null, IntegerType), Literal.create(null, NullType)), null) +checkEvaluation(FormatNumber( + Literal.create(null, NullType), Literal.create(null, NullType)), null) --- End diff -- 
What are these two cases for?
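For the integer-`d` case exercised by these tests, the behavior can be modeled compactly outside Spark: round to `d` decimal places and insert thousands separators, as the default `#,###,###,###,###,##0` pattern does. A rough Python sketch of only that case (string patterns, which go through Java's `DecimalFormat`, are not modeled here):

```python
def format_number(x, d):
    """Rough model of the integer-`d` case only: round x to d decimal
    places and add thousands separators, like the default
    '#,###,###,###,###,##0' pattern. String patterns are not modeled."""
    if x is None or d is None or d < 0:
        return None
    return f"{x:,.{d}f}"

format_number(12332.123456, 3)   # '12,332.123'
format_number(4, 0)              # '4'
```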
[GitHub] spark pull request #21195: [Spark-23975][ML] Add support of array input for ...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/21195
[GitHub] spark issue #21129: [SPARK-7132][ML] Add fit with validation set to spark.ml...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21129 Merged build finished. Test FAILed.
[GitHub] spark issue #21129: [SPARK-7132][ML] Add fit with validation set to spark.ml...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21129 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3025/ Test FAILed.
[GitHub] spark issue #21195: [Spark-23975][ML] Add support of array input for all clu...
Github user mengxr commented on the issue: https://github.com/apache/spark/pull/21195 LGTM. Merged into master. Thanks!
[GitHub] spark pull request #21259: [SPARK-24112][SQL] Add `convertMetastoreTableProp...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/21259#discussion_r186606656 --- Diff: docs/sql-programming-guide.md --- @@ -1812,6 +1812,8 @@ working with timestamps in `pandas_udf`s to get the best performance, see - Since Spark 2.4, creating a managed table with nonempty location is not allowed. An exception is thrown when attempting to create a managed table with nonempty location. To set `true` to `spark.sql.allowCreatingManagedTableUsingNonemptyLocation` restores the previous behavior. This option will be removed in Spark 3.0. - Since Spark 2.4, the type coercion rules can automatically promote the argument types of the variadic SQL functions (e.g., IN/COALESCE) to the widest common type, no matter how the input arguments order. In prior Spark versions, the promotion could fail in some specific orders (e.g., TimestampType, IntegerType and StringType) and throw an exception. - In version 2.3 and earlier, `to_utc_timestamp` and `from_utc_timestamp` respect the timezone in the input timestamp string, which breaks the assumption that the input timestamp is in a specific timezone. Therefore, these 2 functions can return unexpected results. In version 2.4 and later, this problem has been fixed. `to_utc_timestamp` and `from_utc_timestamp` will return null if the input timestamp string contains timezone. As an example, `from_utc_timestamp('2000-10-10 00:00:00', 'GMT+1')` will return `2000-10-10 01:00:00` in both Spark 2.3 and 2.4. However, `from_utc_timestamp('2000-10-10 00:00:00+00:00', 'GMT+1')`, assuming a local timezone of GMT+8, will return `2000-10-10 09:00:00` in Spark 2.3 but `null` in 2.4. For people who don't care about this problem and want to retain the previous behaivor to keep their query unchanged, you can set `spark.sql.function.rejectTimezoneInString` to false. This option will be removed in Spark 3.0 and should only be used as a tempora ry workaround. 
+ - In version 2.3 and earlier, Spark converts Parquet Hive tables by default but ignores table properties like `TBLPROPERTIES (parquet.compression 'NONE')`. This happens for ORC Hive table properties like `TBLPROPERTIES (orc.compress 'NONE')` in case of `spark.sql.hive.convertMetastoreOrc=true`, too. Since Spark 2.4, Spark supports Parquet/ORC specific table properties while converting Parquet/ORC Hive tables. As an example, `CREATE TABLE t(id int) STORED AS PARQUET TBLPROPERTIES (parquet.compression 'NONE')` would generate Snappy parquet files during insertion in Spark 2.3, and in Spark 2.4, the result would be uncompressed parquet files. To set `false` to `spark.sql.hive.convertMetastoreTableProperty` restores the previous behavior. --- End diff -- Thank you for reviews, @cloud-fan , @mridulm , @HyukjinKwon ! I'll update like that. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21129: [SPARK-7132][ML] Add fit with validation set to spark.ml...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21129 **[Test build #90354 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90354/testReport)** for PR 21129 at commit [`fd99f51`](https://github.com/apache/spark/commit/fd99f51388ba84a968cda665f73a9a4ad8eebd05).
[GitHub] spark issue #21255: [SPARK-24186][SparR][SQL]change reverse and concat to co...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21255 Merged build finished. Test PASSed.
[GitHub] spark issue #21255: [SPARK-24186][SparR][SQL]change reverse and concat to co...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21255 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3024/ Test PASSed.
[GitHub] spark pull request #21028: [SPARK-23922][SQL] Add arrays_overlap function
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/21028#discussion_r186606368

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala ---
```
@@ -28,6 +30,34 @@ import org.apache.spark.unsafe.Platform
 import org.apache.spark.unsafe.array.ByteArrayMethods
 import org.apache.spark.unsafe.types.{ByteArray, UTF8String}
+/**
+ * Base trait for [[BinaryExpression]]s with two arrays of the same element type and implicit
+ * casting.
+ */
+trait BinaryArrayExpressionWithImplicitCast extends BinaryExpression
+  with ImplicitCastInputTypes {
+
+  protected lazy val elementType: DataType = inputTypes.head.asInstanceOf[ArrayType].elementType
+
+  override def inputTypes: Seq[AbstractDataType] = {
+    TypeCoercion.findWiderTypeForTwo(left.dataType, right.dataType) match {
```
--- End diff --

now the question is, shall we allow precision loss for array functions?
[GitHub] spark pull request #21235: [SPARK-24181][SQL] Better error message for writi...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/21235#discussion_r186606160

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala ---
```
@@ -339,9 +339,16 @@ final class DataFrameWriter[T] private[sql](ds: Dataset[T]) {
     }
   }

-  private def assertNotBucketed(operation: String): Unit = {
-    if (numBuckets.isDefined || sortColumnNames.isDefined) {
-      throw new AnalysisException(s"'$operation' does not support bucketing right now")
```
--- End diff --

how about minimizing the code changes:
```
private def assertNotBucketed(operation: String): Unit = {
  if (getBucketSpec.isDefined) {
    throw new AnalysisException(s"'$operation' does not support bucketing right now")
  }
}
```
[GitHub] spark issue #21255: [SPARK-24186][SparR][SQL]change reverse and concat to co...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21255 **[Test build #90353 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90353/testReport)** for PR 21255 at commit [`6b36d79`](https://github.com/apache/spark/commit/6b36d7999feb002655666700061f48448b97501b).
[GitHub] spark issue #21257: [SPARK-24194] [SQL]HadoopFsRelation cannot overwrite a p...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/21257 cc @ericl
[GitHub] spark pull request #21257: [SPARK-24194] [SQL]HadoopFsRelation cannot overwr...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/21257#discussion_r186605552

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala ---
```
@@ -207,9 +207,25 @@ case class InsertIntoHadoopFsRelationCommand(
     }
     // first clear the path determined by the static partition keys (e.g. /table/foo=1)
     val staticPrefixPath = qualifiedOutputPath.suffix(staticPartitionPrefix)
-    if (fs.exists(staticPrefixPath) && !committer.deleteWithJob(fs, staticPrefixPath, true)) {
-      throw new IOException(s"Unable to clear output " +
-        s"directory $staticPrefixPath prior to writing to it")
+
+    // check if delete the dir or just sub files
+    if (fs.exists(staticPrefixPath)) {
+      // check if it is the table root, and record the file to delete
+      if (staticPartitionPrefix.isEmpty) {
+        val files = fs.listFiles(staticPrefixPath, false)
+        while (files.hasNext) {
+          val file = files.next()
+          if (!committer.deleteWithJob(fs, file.getPath, true)) {
```
--- End diff --

> deleteMatchingPartitions happens only if overwrite is specified.

I mean, we should only do it if overwriting the same table.
[GitHub] spark issue #21255: [SPARK-24186][SparR][SQL]change reverse and concat to co...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21255 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3023/ Test PASSed.
[GitHub] spark issue #21255: [SPARK-24186][SparR][SQL]change reverse and concat to co...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21255 Merged build finished. Test PASSed.
[GitHub] spark pull request #21257: [SPARK-24194] [SQL]HadoopFsRelation cannot overwr...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/21257#discussion_r186605405

--- Diff: core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala ---
```
@@ -235,4 +247,20 @@ class HadoopMapReduceCommitProtocol(
       tmp.getFileSystem(taskContext.getConfiguration).delete(tmp, false)
     }
   }
+
+  /**
+   * now just record the file to be deleted
+   */
+  override def deleteWithJob(fs: FileSystem,
```
--- End diff --

nit: code style. see https://github.com/apache/spark/pull/21257/files#diff-d97cfb576287a7655f32cd5675cbR132
[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21070 (While I am here) seems fine to me otherwise.
[GitHub] spark issue #21255: [SPARK-24186][SparR][SQL]change reverse and concat to co...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21255 **[Test build #90352 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90352/testReport)** for PR 21255 at commit [`64ba432`](https://github.com/apache/spark/commit/64ba432f7d854a96e70675ded4fdaa705ad01a06).
[GitHub] spark pull request #21255: [SPARK-24186][SparR][SQL]change reverse and conca...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/21255#discussion_r186604950

--- Diff: R/pkg/tests/fulltests/test_sparkSQL.R ---
```
@@ -1748,6 +1748,14 @@ test_that("string operators", {
     collect(select(df5, repeat_string(df5$a, -1)))[1, 1],
     ""
   )
+
+  l6 <- list(list(a = "abc"))
+  df6 <- createDataFrame(l6)
+  df7 <- select(df6, reverse(df6$a))
```
--- End diff --

`df7` is not used below?
[GitHub] spark issue #21199: [SPARK-24127][SS] Continuous text socket source
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21199 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/90349/ Test FAILed.
[GitHub] spark issue #21199: [SPARK-24127][SS] Continuous text socket source
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21199 Merged build finished. Test FAILed.
[GitHub] spark issue #21199: [SPARK-24127][SS] Continuous text socket source
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21199 **[Test build #90349 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90349/testReport)** for PR 21199 at commit [`f010943`](https://github.com/apache/spark/commit/f010943699b184cc9572bda8651cb40d6231bfa3).

* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request #21255: [SPARK-24186][SparR][SQL]change reverse and conca...
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/21255#discussion_r186604721

--- Diff: R/pkg/tests/fulltests/test_sparkSQL.R ---
```
@@ -1502,12 +1502,21 @@ test_that("column functions", {
   result <- collect(select(df, sort_array(df[[1]])))[[1]]
   expect_equal(result, list(list(1L, 2L, 3L), list(4L, 5L, 6L)))

-  # Test flattern
+  result <- collect(select(df, reverse(df[[1]])))[[1]]
```
--- End diff --

@viirya Thanks for your comments. I will make the changes.
[GitHub] spark pull request #21264: Branch 2.2
Github user joy-m closed the pull request at: https://github.com/apache/spark/pull/21264
[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/21070 LGTM except 2 comments
[GitHub] spark pull request #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10....
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/21070#discussion_r186604021 --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedPlainValuesReader.java --- @@ -63,115 +59,157 @@ public final void readBooleans(int total, WritableColumnVector c, int rowId) { } } + private ByteBuffer getBuffer(int length) { +try { + return in.slice(length).order(ByteOrder.LITTLE_ENDIAN); +} catch (IOException e) { + throw new ParquetDecodingException("Failed to read " + length + " bytes", e); +} + } + @Override public final void readIntegers(int total, WritableColumnVector c, int rowId) { -c.putIntsLittleEndian(rowId, total, buffer, offset - Platform.BYTE_ARRAY_OFFSET); -offset += 4 * total; +int requiredBytes = total * 4; +ByteBuffer buffer = getBuffer(requiredBytes); + +if (buffer.hasArray()) { + int offset = buffer.arrayOffset() + buffer.position(); + c.putIntsLittleEndian(rowId, total, buffer.array(), offset); +} else { + for (int i = 0; i < total; i += 1) { +c.putInt(rowId + i, buffer.getInt()); + } +} } @Override public final void readLongs(int total, WritableColumnVector c, int rowId) { -c.putLongsLittleEndian(rowId, total, buffer, offset - Platform.BYTE_ARRAY_OFFSET); -offset += 8 * total; +int requiredBytes = total * 8; +ByteBuffer buffer = getBuffer(requiredBytes); + +if (buffer.hasArray()) { + int offset = buffer.arrayOffset() + buffer.position(); + c.putLongsLittleEndian(rowId, total, buffer.array(), offset); +} else { + for (int i = 0; i < total; i += 1) { +c.putLong(rowId + i, buffer.getLong()); + } +} } @Override public final void readFloats(int total, WritableColumnVector c, int rowId) { -c.putFloats(rowId, total, buffer, offset - Platform.BYTE_ARRAY_OFFSET); -offset += 4 * total; +int requiredBytes = total * 4; +ByteBuffer buffer = getBuffer(requiredBytes); + +if (buffer.hasArray()) { + int offset = buffer.arrayOffset() + buffer.position(); + c.putFloats(rowId, total, 
buffer.array(), offset); +} else { + for (int i = 0; i < total; i += 1) { +c.putFloat(rowId + i, buffer.getFloat()); + } +} } @Override public final void readDoubles(int total, WritableColumnVector c, int rowId) { -c.putDoubles(rowId, total, buffer, offset - Platform.BYTE_ARRAY_OFFSET); -offset += 8 * total; +int requiredBytes = total * 8; +ByteBuffer buffer = getBuffer(requiredBytes); + +if (buffer.hasArray()) { + int offset = buffer.arrayOffset() + buffer.position(); + c.putDoubles(rowId, total, buffer.array(), offset); +} else { + for (int i = 0; i < total; i += 1) { +c.putDouble(rowId + i, buffer.getDouble()); + } +} } @Override public final void readBytes(int total, WritableColumnVector c, int rowId) { -for (int i = 0; i < total; i++) { - // Bytes are stored as a 4-byte little endian int. Just read the first byte. - // TODO: consider pushing this in ColumnVector by adding a readBytes with a stride. - c.putByte(rowId + i, Platform.getByte(buffer, offset)); - offset += 4; +// Bytes are stored as a 4-byte little endian int. Just read the first byte. +// TODO: consider pushing this in ColumnVector by adding a readBytes with a stride. +int requiredBytes = total * 4; +ByteBuffer buffer = getBuffer(requiredBytes); + +for (int i = 0; i < total; i += 1) { + c.putByte(rowId + i, buffer.get()); + // skip the next 3 bytes + buffer.position(buffer.position() + 3); } } @Override public final boolean readBoolean() { -byte b = Platform.getByte(buffer, offset); -boolean v = (b & (1 << bitOffset)) != 0; +// TODO: vectorize decoding and keep boolean[] instead of currentByte +if (bitOffset == 0) { + try { +currentByte = (byte) in.read(); + } catch (IOException e) { +throw new ParquetDecodingException("Failed to read a byte", e); + } +} + +boolean v = (currentByte & (1 << bitOffset)) != 0; bitOffset += 1; if (bitOffset == 8) { bitOffset = 0; - offset++; } return v; } @Override public final int readInteger() { -int v = Platform.getInt(buffer, offset); -if
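The recurring pattern in the `VectorizedPlainValuesReader` hunk above is: slice the required bytes, force little-endian order, and decode fixed-width values in bulk. A hedged Python analogue of the `readIntegers` case (a model using the `struct` module, not the Spark code) looks like this:

```python
import struct

def read_ints_little_endian(buf: bytes, total: int, offset: int = 0):
    """Decode `total` 4-byte little-endian ints starting at `offset`,
    analogous to the vectorized reader's readIntegers for plain-encoded INT32."""
    required = total * 4
    if len(buf) - offset < required:
        # mirrors the ParquetDecodingException("Failed to read N bytes") path
        raise ValueError(f"Failed to read {required} bytes")
    # struct's "<" prefix selects little-endian, matching ByteOrder.LITTLE_ENDIAN
    return list(struct.unpack_from(f"<{total}i", buf, offset))

data = struct.pack("<3i", 1, -2, 300)
print(read_ints_little_endian(data, 3))  # [1, -2, 300]
```

The Java code's `hasArray()` branch exists because a `ByteBuffer` slice may be off-heap (direct), in which case there is no backing array to bulk-copy from and the reader falls back to element-wise `getInt()` calls; Python's `unpack_from` hides that distinction.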
[GitHub] spark pull request #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10....
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/21070#discussion_r186603708 --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedRleValuesReader.java --- @@ -619,32 +608,37 @@ private int ceil8(int value) { /** * Reads the next group. */ - private void readNextGroup() { -int header = readUnsignedVarInt(); -this.mode = (header & 1) == 0 ? MODE.RLE : MODE.PACKED; -switch (mode) { - case RLE: -this.currentCount = header >>> 1; -this.currentValue = readIntLittleEndianPaddedOnBitWidth(); -return; - case PACKED: -int numGroups = header >>> 1; -this.currentCount = numGroups * 8; -int bytesToRead = ceil8(this.currentCount * this.bitWidth); - -if (this.currentBuffer.length < this.currentCount) { - this.currentBuffer = new int[this.currentCount]; -} -currentBufferIdx = 0; -int valueIndex = 0; -for (int byteIndex = offset; valueIndex < this.currentCount; byteIndex += this.bitWidth) { - this.packer.unpack8Values(in, byteIndex, this.currentBuffer, valueIndex); - valueIndex += 8; -} -offset += bytesToRead; -return; - default: -throw new ParquetDecodingException("not a valid mode " + this.mode); + private void readNextGroup() { +try { + int header = readUnsignedVarInt(); + this.mode = (header & 1) == 0 ? 
MODE.RLE : MODE.PACKED; + switch (mode) { +case RLE: + this.currentCount = header >>> 1; + this.currentValue = readIntLittleEndianPaddedOnBitWidth(); + return; +case PACKED: + int numGroups = header >>> 1; + this.currentCount = numGroups * 8; + + if (this.currentBuffer.length < this.currentCount) { +this.currentBuffer = new int[this.currentCount]; + } + currentBufferIdx = 0; + int valueIndex = 0; + while (valueIndex < this.currentCount) { +// values are bit packed 8 at a time, so reading bitWidth will always work +ByteBuffer buffer = in.slice(bitWidth); +this.packer.unpack8Values( +buffer, buffer.arrayOffset() + buffer.position(), this.currentBuffer, valueIndex); --- End diff -- shall we assume the `ByteBuffer` may not be on-heap? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
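The `readNextGroup` hunk above decodes Parquet's RLE/bit-packing hybrid run header: the low bit of an unsigned varint selects the mode, and the remaining bits carry either the run length (RLE) or the number of 8-value groups (PACKED). A small Python sketch of just that header arithmetic, mirroring the quoted Java:

```python
def decode_run_header(header: int):
    """Split a Parquet RLE/bit-packing hybrid run header.
    Low bit 0 -> RLE run of (header >> 1) repeated values;
    low bit 1 -> bit-packed run of (header >> 1) groups, 8 values each."""
    if header & 1 == 0:
        return ("RLE", header >> 1)           # currentCount = header >>> 1
    else:
        return ("PACKED", (header >> 1) * 8)  # currentCount = numGroups * 8

print(decode_run_header(0b1000))  # ('RLE', 4)
print(decode_run_header(0b101))   # ('PACKED', 16)
```

Because packed values always come 8 at a time, reading `bitWidth` bytes per group always yields whole groups — which is why the patched loop can slice `bitWidth` bytes and call `unpack8Values` without a partial-group case.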
[GitHub] spark pull request #21259: [SPARK-24112][SQL] Add `convertMetastoreTableProp...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/21259#discussion_r186603749 --- Diff: docs/sql-programming-guide.md --- @@ -1812,6 +1812,8 @@ working with timestamps in `pandas_udf`s to get the best performance, see - Since Spark 2.4, creating a managed table with nonempty location is not allowed. An exception is thrown when attempting to create a managed table with nonempty location. To set `true` to `spark.sql.allowCreatingManagedTableUsingNonemptyLocation` restores the previous behavior. This option will be removed in Spark 3.0. - Since Spark 2.4, the type coercion rules can automatically promote the argument types of the variadic SQL functions (e.g., IN/COALESCE) to the widest common type, no matter how the input arguments order. In prior Spark versions, the promotion could fail in some specific orders (e.g., TimestampType, IntegerType and StringType) and throw an exception. - In version 2.3 and earlier, `to_utc_timestamp` and `from_utc_timestamp` respect the timezone in the input timestamp string, which breaks the assumption that the input timestamp is in a specific timezone. Therefore, these 2 functions can return unexpected results. In version 2.4 and later, this problem has been fixed. `to_utc_timestamp` and `from_utc_timestamp` will return null if the input timestamp string contains timezone. As an example, `from_utc_timestamp('2000-10-10 00:00:00', 'GMT+1')` will return `2000-10-10 01:00:00` in both Spark 2.3 and 2.4. However, `from_utc_timestamp('2000-10-10 00:00:00+00:00', 'GMT+1')`, assuming a local timezone of GMT+8, will return `2000-10-10 09:00:00` in Spark 2.3 but `null` in 2.4. For people who don't care about this problem and want to retain the previous behaivor to keep their query unchanged, you can set `spark.sql.function.rejectTimezoneInString` to false. This option will be removed in Spark 3.0 and should only be used as a tempora ry workaround. 
+ - In version 2.3 and earlier, Spark converts Parquet Hive tables by default but ignores table properties like `TBLPROPERTIES (parquet.compression 'NONE')`. This happens for ORC Hive table properties like `TBLPROPERTIES (orc.compress 'NONE')` in case of `spark.sql.hive.convertMetastoreOrc=true`, too. Since Spark 2.4, Spark supports Parquet/ORC specific table properties while converting Parquet/ORC Hive tables. As an example, `CREATE TABLE t(id int) STORED AS PARQUET TBLPROPERTIES (parquet.compression 'NONE')` would generate Snappy parquet files during insertion in Spark 2.3, and in Spark 2.4, the result would be uncompressed parquet files. To set `false` to `spark.sql.hive.convertMetastoreTableProperty` restores the previous behavior. --- End diff -- +1 on ^ and @cloud-fan's. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21097: [SPARK-14682][ML] Provide evaluateEachIteration method o...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21097 Merged build finished. Test PASSed.
[GitHub] spark issue #21097: [SPARK-14682][ML] Provide evaluateEachIteration method o...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21097 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3022/ Test PASSed.
[GitHub] spark pull request #21259: [SPARK-24112][SQL] Add `convertMetastoreTableProp...
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/21259#discussion_r186603250 --- Diff: docs/sql-programming-guide.md --- @@ -1812,6 +1812,8 @@ working with timestamps in `pandas_udf`s to get the best performance, see - Since Spark 2.4, creating a managed table with nonempty location is not allowed. An exception is thrown when attempting to create a managed table with nonempty location. To set `true` to `spark.sql.allowCreatingManagedTableUsingNonemptyLocation` restores the previous behavior. This option will be removed in Spark 3.0. - Since Spark 2.4, the type coercion rules can automatically promote the argument types of the variadic SQL functions (e.g., IN/COALESCE) to the widest common type, no matter how the input arguments order. In prior Spark versions, the promotion could fail in some specific orders (e.g., TimestampType, IntegerType and StringType) and throw an exception. - In version 2.3 and earlier, `to_utc_timestamp` and `from_utc_timestamp` respect the timezone in the input timestamp string, which breaks the assumption that the input timestamp is in a specific timezone. Therefore, these 2 functions can return unexpected results. In version 2.4 and later, this problem has been fixed. `to_utc_timestamp` and `from_utc_timestamp` will return null if the input timestamp string contains timezone. As an example, `from_utc_timestamp('2000-10-10 00:00:00', 'GMT+1')` will return `2000-10-10 01:00:00` in both Spark 2.3 and 2.4. However, `from_utc_timestamp('2000-10-10 00:00:00+00:00', 'GMT+1')`, assuming a local timezone of GMT+8, will return `2000-10-10 09:00:00` in Spark 2.3 but `null` in 2.4. For people who don't care about this problem and want to retain the previous behaivor to keep their query unchanged, you can set `spark.sql.function.rejectTimezoneInString` to false. This option will be removed in Spark 3.0 and should only be used as a tempora ry workaround. 
+ - In version 2.3 and earlier, Spark converts Parquet Hive tables by default but ignores table properties like `TBLPROPERTIES (parquet.compression 'NONE')`. This happens for ORC Hive table properties like `TBLPROPERTIES (orc.compress 'NONE')` in case of `spark.sql.hive.convertMetastoreOrc=true`, too. Since Spark 2.4, Spark supports Parquet/ORC specific table properties while converting Parquet/ORC Hive tables. As an example, `CREATE TABLE t(id int) STORED AS PARQUET TBLPROPERTIES (parquet.compression 'NONE')` would generate Snappy parquet files during insertion in Spark 2.3, and in Spark 2.4, the result would be uncompressed parquet files. To set `false` to `spark.sql.hive.convertMetastoreTableProperty` restores the previous behavior. --- End diff -- Setting a property and expecting spark to ignore it does not sound logical (spark not honoring a property is a bug IMO - which, thankfully, has been fixed in 2.4). Having said that, I agree with you that mentioning this in migration guide might be sufficient; we have behavior changes between versions all the time and a conf is not necessary when the change is in the right direction. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21264: Branch 2.2
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21264 @yotingting, mind closing this and opening an issue on JIRA, or asking on the mailing list, please? I think you will get a better answer there. Please check out https://spark.apache.org/community.html too.
[GitHub] spark pull request #21257: [SPARK-24194] [SQL]HadoopFsRelation cannot overwr...
Github user zheh12 commented on a diff in the pull request: https://github.com/apache/spark/pull/21257#discussion_r186603145

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala ---
```
@@ -207,9 +207,25 @@ case class InsertIntoHadoopFsRelationCommand(
     }
     // first clear the path determined by the static partition keys (e.g. /table/foo=1)
     val staticPrefixPath = qualifiedOutputPath.suffix(staticPartitionPrefix)
-    if (fs.exists(staticPrefixPath) && !committer.deleteWithJob(fs, staticPrefixPath, true)) {
-      throw new IOException(s"Unable to clear output " +
-        s"directory $staticPrefixPath prior to writing to it")
+
+    // check if delete the dir or just sub files
+    if (fs.exists(staticPrefixPath)) {
+      // check if it is the table root, and record the file to delete
+      if (staticPartitionPrefix.isEmpty) {
+        val files = fs.listFiles(staticPrefixPath, false)
+        while (files.hasNext) {
+          val file = files.next()
+          if (!committer.deleteWithJob(fs, file.getPath, true)) {
```
--- End diff --

1. From the code point of view, in the current implementation `deleteMatchingPartitions` happens only if `overwrite` is specified.
2. Using `dynamicPartitionOverwrite` will not solve this problem, because it also generates a `.stage` directory under the table root directory. We still need to record all the files we want to delete, but we cannot directly delete the root directory. Dynamic partition overwrite actually records all the partitions that need to be deleted and then deletes them one by one. A whole-table `overwrite` deletes all the data in the table directory, so it needs to record all the partition directory files to delete; the implementation is therefore similar to `dynamicPartitionOverwrite`.
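The scheme discussed in this thread — `deleteWithJob` records paths instead of deleting them, so the table root directory survives and the real deletion happens at job commit — can be sketched as a toy model. This is a hedged illustration with hypothetical names (`DeferredDeleteProtocol`, `commit_job`), not the actual `HadoopMapReduceCommitProtocol` API:

```python
class DeferredDeleteProtocol:
    """Minimal model of a commit protocol that defers deletes.
    Paths passed to delete_with_job are only recorded; removal happens in
    commit_job, so the table root directory itself is never deleted."""

    def __init__(self):
        self.paths_to_delete = []

    def delete_with_job(self, path: str) -> bool:
        self.paths_to_delete.append(path)  # record only; delete later
        return True

    def commit_job(self, fs_delete):
        # real deletion happens here, only after the job has succeeded
        for p in self.paths_to_delete:
            fs_delete(p)

deleted = []
proto = DeferredDeleteProtocol()
proto.delete_with_job("/table/part-0001")
proto.delete_with_job("/table/part-0002")
proto.commit_job(deleted.append)
print(deleted)  # ['/table/part-0001', '/table/part-0002']
```

Deferring the deletes this way is what lets an `INSERT OVERWRITE` read from and write to the same table: the source files are still readable while the job runs, and disappear only at commit.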
[GitHub] spark issue #21097: [SPARK-14682][ML] Provide evaluateEachIteration method o...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21097 **[Test build #90351 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90351/testReport)** for PR 21097 at commit [`c32b5a8`](https://github.com/apache/spark/commit/c32b5a81de06d318aee1b18c82d82568d2e8).
[GitHub] spark pull request #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10....
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/21070#discussion_r186603010 --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedPlainValuesReader.java --- @@ -63,115 +59,157 @@ public final void readBooleans(int total, WritableColumnVector c, int rowId) { } } + private ByteBuffer getBuffer(int length) { +try { + return in.slice(length).order(ByteOrder.LITTLE_ENDIAN); +} catch (IOException e) { + throw new ParquetDecodingException("Failed to read " + length + " bytes", e); +} + } + @Override public final void readIntegers(int total, WritableColumnVector c, int rowId) { -c.putIntsLittleEndian(rowId, total, buffer, offset - Platform.BYTE_ARRAY_OFFSET); -offset += 4 * total; +int requiredBytes = total * 4; +ByteBuffer buffer = getBuffer(requiredBytes); + +if (buffer.hasArray()) { + int offset = buffer.arrayOffset() + buffer.position(); + c.putIntsLittleEndian(rowId, total, buffer.array(), offset); +} else { + for (int i = 0; i < total; i += 1) { +c.putInt(rowId + i, buffer.getInt()); + } +} } @Override public final void readLongs(int total, WritableColumnVector c, int rowId) { -c.putLongsLittleEndian(rowId, total, buffer, offset - Platform.BYTE_ARRAY_OFFSET); -offset += 8 * total; +int requiredBytes = total * 8; +ByteBuffer buffer = getBuffer(requiredBytes); + +if (buffer.hasArray()) { + int offset = buffer.arrayOffset() + buffer.position(); + c.putLongsLittleEndian(rowId, total, buffer.array(), offset); +} else { + for (int i = 0; i < total; i += 1) { +c.putLong(rowId + i, buffer.getLong()); + } +} } @Override public final void readFloats(int total, WritableColumnVector c, int rowId) { -c.putFloats(rowId, total, buffer, offset - Platform.BYTE_ARRAY_OFFSET); -offset += 4 * total; +int requiredBytes = total * 4; +ByteBuffer buffer = getBuffer(requiredBytes); + +if (buffer.hasArray()) { + int offset = buffer.arrayOffset() + buffer.position(); + c.putFloats(rowId, total, 
buffer.array(), offset); +} else { + for (int i = 0; i < total; i += 1) { +c.putFloat(rowId + i, buffer.getFloat()); + } +} } @Override public final void readDoubles(int total, WritableColumnVector c, int rowId) { -c.putDoubles(rowId, total, buffer, offset - Platform.BYTE_ARRAY_OFFSET); -offset += 8 * total; +int requiredBytes = total * 8; +ByteBuffer buffer = getBuffer(requiredBytes); + +if (buffer.hasArray()) { + int offset = buffer.arrayOffset() + buffer.position(); + c.putDoubles(rowId, total, buffer.array(), offset); +} else { + for (int i = 0; i < total; i += 1) { +c.putDouble(rowId + i, buffer.getDouble()); + } +} } @Override public final void readBytes(int total, WritableColumnVector c, int rowId) { -for (int i = 0; i < total; i++) { - // Bytes are stored as a 4-byte little endian int. Just read the first byte. - // TODO: consider pushing this in ColumnVector by adding a readBytes with a stride. - c.putByte(rowId + i, Platform.getByte(buffer, offset)); - offset += 4; +// Bytes are stored as a 4-byte little endian int. Just read the first byte. +// TODO: consider pushing this in ColumnVector by adding a readBytes with a stride. +int requiredBytes = total * 4; +ByteBuffer buffer = getBuffer(requiredBytes); + +for (int i = 0; i < total; i += 1) { + c.putByte(rowId + i, buffer.get()); + // skip the next 3 bytes + buffer.position(buffer.position() + 3); } } @Override public final boolean readBoolean() { -byte b = Platform.getByte(buffer, offset); -boolean v = (b & (1 << bitOffset)) != 0; +// TODO: vectorize decoding and keep boolean[] instead of currentByte +if (bitOffset == 0) { + try { +currentByte = (byte) in.read(); + } catch (IOException e) { +throw new ParquetDecodingException("Failed to read a byte", e); + } +} + +boolean v = (currentByte & (1 << bitOffset)) != 0; bitOffset += 1; if (bitOffset == 8) { bitOffset = 0; - offset++; } return v; } @Override public final int readInteger() { -int v = Platform.getInt(buffer, offset); -if