[spark] branch branch-2.4 updated: [SPARK-29758][SQL][2.4] Fix truncation of requested string fields in `json_tuple`

2019-11-19 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-2.4
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-2.4 by this push:
 new a936522  [SPARK-29758][SQL][2.4] Fix truncation of requested string fields in `json_tuple`
a936522 is described below

commit a9365221133caadffce1aae1ace799a588a3
Author: Maxim Gekk 
AuthorDate: Wed Nov 20 15:32:28 2019 +0800

[SPARK-29758][SQL][2.4] Fix truncation of requested string fields in `json_tuple`

### What changes were proposed in this pull request?
In the PR, I propose to remove an optimization in `json_tuple` that causes truncation of results for large requested string fields.

### Why are the changes needed?
Spark 2.4 uses Jackson Core 2.6.7, which has a bug in copying strings. This bug may lead to truncation of results in some cases. The bug has already been fixed by the commit https://github.com/FasterXML/jackson-core/commit/554f8db0f940b2a53f974852a2af194739d65200, which has been part of Jackson Core since version 2.7.7. Upgrading Jackson Core to 2.7.7 or a later version is risky, so this PR proposes to avoid using the buggy methods of Jackson Core 2.6.7.
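
For illustration, a minimal standalone reproduction sketch (assuming a local `SparkSession`; adapted from the new test added by this PR, not part of the patch itself):
```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.json_tuple

val spark = SparkSession.builder().master("local[*]").appName("json-tuple-repro").getOrCreate()
import spark.implicits._

val len = 2800                       // large enough to hit Jackson's buffer-copy path
val str = "a" * len
val result = Seq(s"""{"test":"$str"}""").toDF("json")
  .withColumn("result", json_tuple($"json", "test"))
  .select($"result")
  .as[String]
  .head()
// With the buggy optimization removed, the requested field comes back intact.
assert(result.length == len)
```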

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
By a new test added to `JsonFunctionsSuite`

Closes #26563 from MaxGekk/fix-truncation-by-json_tuple-2.4.

Authored-by: Maxim Gekk 
Signed-off-by: Wenchen Fan 
---
 .../spark/sql/catalyst/expressions/jsonExpressions.scala     |  5 -----
 .../test/scala/org/apache/spark/sql/JsonFunctionsSuite.scala | 10 ++++++++++
 2 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala
index 6650e45..4cd1a091 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala
@@ -472,11 +472,6 @@ case class JsonTuple(children: Seq[Expression])
         parser.getCurrentToken match {
           // if the user requests a string field it needs to be returned without enclosing
           // quotes which is accomplished via JsonGenerator.writeRaw instead of JsonGenerator.write
-          case JsonToken.VALUE_STRING if parser.hasTextCharacters =>
-            // slight optimization to avoid allocating a String instance, though the characters
-            // still have to be decoded... Jackson doesn't have a way to access the raw bytes
-            generator.writeRaw(parser.getTextCharacters, parser.getTextOffset, parser.getTextLength)
-
           case JsonToken.VALUE_STRING =>
             // the normal String case, pass it through to the output without enclosing quotes
             generator.writeRaw(parser.getText)
diff --git a/sql/core/src/test/scala/org/apache/spark/sql/JsonFunctionsSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/JsonFunctionsSuite.scala
index b1f7446..18335ef 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/JsonFunctionsSuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/JsonFunctionsSuite.scala
@@ -535,4 +535,14 @@ class JsonFunctionsSuite extends QueryTest with SharedSQLContext {
       to_json(struct($"t"), Map("timestampFormat" -> "yyyy-MM-dd HH:mm:ss.SSSSSS")))
     checkAnswer(df, Row(s"""{"t":"$s"}"""))
   }
+
+  test("json_tuple - do not truncate results") {
+    val len = 2800
+    val str = Array.tabulate(len)(_ => "a").mkString
+    val json_tuple_result = Seq(s"""{"test":"$str"}""").toDF("json")
+      .withColumn("result", json_tuple('json, "test"))
+      .select('result)
+      .as[String].head.length
+    assert(json_tuple_result === len)
+  }
 }





[spark] branch master updated (9e58b10 -> 5a70af7)

2019-11-19 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from 9e58b10  [SPARK-29945][SQL] do not handle negative sign specially in the parser
 add 5a70af7  [SPARK-29029][SQL] Use AttributeMap in PhysicalOperation.collectProjectsAndFilters

No new revisions were added by this update.

Summary of changes:
 .../org/apache/spark/sql/catalyst/planning/patterns.scala   | 13 +++--
 1 file changed, 7 insertions(+), 6 deletions(-)





[spark] branch master updated (40b8a08 -> 9e58b10)

2019-11-19 Thread yamamuro
This is an automated email from the ASF dual-hosted git repository.

yamamuro pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from 40b8a08  [SPARK-29963][SQL][TESTS] Check formatting timestamps up to microsecond precision by JSON/CSV datasource
 add 9e58b10  [SPARK-29945][SQL] do not handle negative sign specially in the parser

No new revisions were added by this update.

Summary of changes:
 .../apache/spark/sql/catalyst/parser/SqlBase.g4|  4 +-
 .../spark/sql/catalyst/parser/AstBuilder.scala | 31 +++
 .../catalyst/parser/ExpressionParserSuite.scala|  5 +-
 .../test/resources/sql-tests/inputs/literals.sql   |  5 +-
 .../sql-tests/results/ansi/interval.sql.out| 12 ++--
 .../sql-tests/results/ansi/literals.sql.out| 65 +++---
 .../results/interval-display-iso_8601.sql.out  |  2 +-
 .../results/interval-display-sql_standard.sql.out  |  2 +-
 .../sql-tests/results/interval-display.sql.out |  2 +-
 .../resources/sql-tests/results/interval.sql.out   | 12 ++--
 .../resources/sql-tests/results/literals.sql.out   | 65 +++---
 .../sql-tests/results/postgreSQL/interval.sql.out  |  2 +-
 12 files changed, 98 insertions(+), 109 deletions(-)





[spark] branch master updated (e753aa3 -> 40b8a08)

2019-11-19 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from e753aa3  [SPARK-29964][BUILD] lintr github workflows failed due to buggy GnuPG
 add 40b8a08  [SPARK-29963][SQL][TESTS] Check formatting timestamps up to microsecond precision by JSON/CSV datasource

No new revisions were added by this update.

Summary of changes:
 .../spark/sql/util/TimestampFormatterSuite.scala   | 40 ++
 .../org/apache/spark/sql/JsonFunctionsSuite.scala  |  7 
 .../sql/execution/datasources/csv/CSVSuite.scala   | 15 
 3 files changed, 62 insertions(+)





[spark] branch master updated (e804ed5 -> e753aa3)

2019-11-19 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from e804ed5  [SPARK-29691][ML][PYTHON] ensure Param objects are valid in fit, transform
 add e753aa3  [SPARK-29964][BUILD] lintr github workflows failed due to buggy GnuPG

No new revisions were added by this update.

Summary of changes:
 .github/workflows/master.yml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)





[spark] branch branch-2.4 updated: [SPARK-29964][BUILD] lintr github workflows failed due to buggy GnuPG

2019-11-19 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-2.4
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-2.4 by this push:
 new 1a26c8e  [SPARK-29964][BUILD] lintr github workflows failed due to buggy GnuPG
1a26c8e is described below

commit 1a26c8edf15d2647c1462fa9971eae746bbe0b17
Author: Liang-Chi Hsieh 
AuthorDate: Tue Nov 19 15:56:50 2019 -0800

[SPARK-29964][BUILD] lintr github workflows failed due to buggy GnuPG

### What changes were proposed in this pull request?

Linter (R) github workflows sometimes failed like:

https://github.com/apache/spark/pull/26509/checks?check_run_id=310718016

Failed message:
```
Executing: /tmp/apt-key-gpghome.8r74rQNEjj/gpg.1.sh --keyserver keyserver.ubuntu.com --recv-keys E298A3A825C0D65DFD57CBB651716619E084DAB9
gpg: connecting dirmngr at '/tmp/apt-key-gpghome.8r74rQNEjj/S.dirmngr' failed: IPC connect call failed
gpg: keyserver receive failed: No dirmngr
##[error]Process completed with exit code 2.
```

It is due to a buggy GnuPG. Context:
https://github.com/sbt/website/pull/825
https://github.com/sbt/sbt/issues/4261
https://github.com/microsoft/WSL/issues/3286

### Why are the changes needed?

Make lint-r github workflows work.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Pass github workflows.

Closes #26602 from viirya/SPARK-29964.

Authored-by: Liang-Chi Hsieh 
Signed-off-by: Dongjoon Hyun 
(cherry picked from commit e753aa30e659706c3fa3414bf38566a79e0af8d6)
Signed-off-by: Dongjoon Hyun 
---
 .github/workflows/branch-2.4.yml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/.github/workflows/branch-2.4.yml b/.github/workflows/branch-2.4.yml
index b466995..2aeffc5 100644
--- a/.github/workflows/branch-2.4.yml
+++ b/.github/workflows/branch-2.4.yml
@@ -84,7 +84,7 @@ jobs:
     - name: install R
       run: |
         echo 'deb https://cloud.r-project.org/bin/linux/ubuntu bionic-cran35/' | sudo tee -a /etc/apt/sources.list
-        sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E298A3A825C0D65DFD57CBB651716619E084DAB9
+        curl -sL "https://keyserver.ubuntu.com/pks/lookup?op=get&search=0xE298A3A825C0D65DFD57CBB651716619E084DAB9" | sudo apt-key add
         sudo apt-get update
         sudo apt-get install -y r-base r-base-dev libcurl4-openssl-dev
     - name: install R packages





[spark] branch master updated (3d2a6f4 -> e804ed5)

2019-11-19 Thread cutlerb
This is an automated email from the ASF dual-hosted git repository.

cutlerb pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from 3d2a6f4  [SPARK-29906][SQL] AQE should not introduce extra shuffle for outermost limit
 add e804ed5  [SPARK-29691][ML][PYTHON] ensure Param objects are valid in fit, transform

No new revisions were added by this update.

Summary of changes:
 python/pyspark/ml/param/__init__.py| 12 ++--
 python/pyspark/ml/tests/test_param.py  |  4 
 python/pyspark/ml/tests/test_tuning.py |  9 +
 python/pyspark/ml/tuning.py|  8 +++-
 4 files changed, 30 insertions(+), 3 deletions(-)





[spark] branch master updated (6fb8b86 -> 3d2a6f4)

2019-11-19 Thread lixiao
This is an automated email from the ASF dual-hosted git repository.

lixiao pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from 6fb8b86  [SPARK-29913][SQL] Improve Exception in postgreCastToBoolean
 add 3d2a6f4  [SPARK-29906][SQL] AQE should not introduce extra shuffle for outermost limit

No new revisions were added by this update.

Summary of changes:
 .../execution/adaptive/AdaptiveSparkPlanExec.scala | 23 ++
 .../adaptive/AdaptiveQueryExecSuite.scala  | 21 
 2 files changed, 36 insertions(+), 8 deletions(-)





[spark] branch master updated (79ed4ae -> 6fb8b86)

2019-11-19 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from 79ed4ae  [SPARK-29926][SQL] Fix weird interval string whose value is only a dangling decimal point
 add 6fb8b86  [SPARK-29913][SQL] Improve Exception in postgreCastToBoolean

No new revisions were added by this update.

Summary of changes:
 .../src/main/scala/org/apache/spark/sql/execution/QueryExecution.scala   | 1 +
 1 file changed, 1 insertion(+)





[spark] branch master updated (a8d9883 -> 79ed4ae)

2019-11-19 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from a8d9883  [SPARK-29893] improve the local shuffle reader performance by changing the reading task number from 1 to multi
 add 79ed4ae  [SPARK-29926][SQL] Fix weird interval string whose value is only a dangling decimal point

No new revisions were added by this update.

Summary of changes:
 .../org/apache/spark/sql/catalyst/util/IntervalUtils.scala | 10 +++---
 .../apache/spark/sql/catalyst/util/IntervalUtilsSuite.scala|  2 +-
 2 files changed, 8 insertions(+), 4 deletions(-)





[spark] branch master updated (ffc9753 -> a8d9883)

2019-11-19 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from ffc9753  [SPARK-29918][SQL] RecordBinaryComparator should check endianness when compared by long
 add a8d9883  [SPARK-29893] improve the local shuffle reader performance by changing the reading task number from 1 to multi

No new revisions were added by this update.

Summary of changes:
 .../scala/org/apache/spark/MapOutputTracker.scala  |   3 +-
 .../execution/adaptive/AdaptiveSparkPlanExec.scala |  13 +--
 .../execution/adaptive/LocalShuffledRowRDD.scala   |  52 +++---
 .../adaptive/OptimizeLocalShuffleReader.scala  | 114 -
 .../execution/exchange/ShuffleExchangeExec.scala   |   5 +-
 .../adaptive/AdaptiveQueryExecSuite.scala  |  61 +++
 6 files changed, 170 insertions(+), 78 deletions(-)





[spark] branch branch-2.4 updated: [SPARK-29949][SQL][2.4] Fix formatting of timestamps by JSON/CSV datasources

2019-11-19 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-2.4
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-2.4 by this push:
 new 47cb1f3  [SPARK-29949][SQL][2.4] Fix formatting of timestamps by JSON/CSV datasources
47cb1f3 is described below

commit 47cb1f359af62383e24198dbbaa0b4503348cd04
Author: Maxim Gekk 
AuthorDate: Tue Nov 19 17:10:16 2019 +0800

[SPARK-29949][SQL][2.4] Fix formatting of timestamps by JSON/CSV datasources

### What changes were proposed in this pull request?
In the PR, I propose to use the `format()` method of `FastDateFormat` which accepts an instance of the `Calendar` type. This allows adjusting the `MILLISECOND` field of the calendar directly before formatting. I added a new method `format()` to `DateTimeUtils.TimestampParser`. This method splits the input timestamp into a part truncated to seconds and the seconds fractional part. The calendar is initialized by the first part in the normal way, and the last one is converted to a form appropria [...]

I refactored `MicrosCalendar` by passing the number of digits from the fraction pattern as a parameter to the default constructor, because it is used by the existing `getMicros()` and the new `setMicros()`. `setMicros()` is used to set the seconds fraction in the calendar's `MILLISECOND` field directly before formatting.

This PR supports various patterns for seconds fractions, from `S` up to `SSSSSS`. If the pattern has more than 6 `S`, the first 6 digits reflect the milliseconds and microseconds of the input timestamp, and the rest of the digits are set to `0`.
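
As a rough illustration of the split described above, here is a minimal Scala sketch (values and names are illustrative, not the patch itself):
```scala
// Illustrative only: split a microsecond timestamp into the part truncated
// to seconds and the seconds fraction, as described above.
val micros = 1574049360123456L                            // a timestamp in microseconds
val truncatedToSeconds = Math.floorDiv(micros, 1000000L)  // whole seconds since the epoch
val fraction = Math.floorMod(micros, 1000000L)            // 123456 micros within the second
// The Calendar is initialized from the seconds part as usual; the fraction is
// then rescaled to the pattern's number of 'S' digits and written into the
// MILLISECOND field (the setMicros() mentioned above) right before calling
// FastDateFormat.format(calendar).
```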

### Why are the changes needed?
This fixes a bug where timestamps with microsecond precision were formatted incorrectly. For example:
```scala
Seq(java.sql.Timestamp.valueOf("2019-11-18 11:56:00.123456")).toDF("t")
  .select(to_json(struct($"t"), Map("timestampFormat" -> "yyyy-MM-dd HH:mm:ss.SSSSSS")).as("json"))
  .show(false)
+----------------------------------+
|json                              |
+----------------------------------+
|{"t":"2019-11-18 11:56:00.000123"}|
+----------------------------------+
```

### Does this PR introduce any user-facing change?
Yes. The example above outputs:
```scala
+----------------------------------+
|json                              |
+----------------------------------+
|{"t":"2019-11-18 11:56:00.123456"}|
+----------------------------------+
```

### How was this patch tested?
- By new tests for formatting by different patterns from `S` to `SSSSSS` in `DateTimeUtilsSuite`
- By a test for `to_json()` in `JsonFunctionsSuite`
- By a round-trip test that writes a timestamp to a CSV file and reads it back

Closes #26582 from MaxGekk/micros-format-2.4.

Authored-by: Maxim Gekk 
Signed-off-by: Wenchen Fan 
---
 .../spark/sql/catalyst/json/JacksonGenerator.scala |  6 ++--
 .../spark/sql/catalyst/util/DateTimeUtils.scala| 35 ++-
 .../sql/catalyst/util/DateTimeUtilsSuite.scala | 40 ++
 .../datasources/csv/UnivocityGenerator.scala   |  6 ++--
 .../org/apache/spark/sql/JsonFunctionsSuite.scala  |  7 
 .../sql/execution/datasources/csv/CSVSuite.scala   | 15 
 6 files changed, 97 insertions(+), 12 deletions(-)

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonGenerator.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonGenerator.scala
index 9b86d86..a379f86 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonGenerator.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonGenerator.scala
@@ -24,6 +24,7 @@ import com.fasterxml.jackson.core._
 import org.apache.spark.sql.catalyst.InternalRow
 import org.apache.spark.sql.catalyst.expressions.SpecializedGetters
 import org.apache.spark.sql.catalyst.util.{ArrayData, DateTimeUtils, MapData}
+import org.apache.spark.sql.catalyst.util.DateTimeUtils.TimestampParser
 import org.apache.spark.sql.types._
 
 /**
@@ -74,6 +75,8 @@ private[sql] class JacksonGenerator(
 
   private val lineSeparator: String = options.lineSeparatorInWrite
 
+  @transient private lazy val timestampParser = new TimestampParser(options.timestampFormat)
+
   private def makeWriter(dataType: DataType): ValueWriter = dataType match {
     case NullType =>
       (row: SpecializedGetters, ordinal: Int) =>
@@ -113,8 +116,7 @@ private[sql] class JacksonGenerator(
 
     case TimestampType =>
       (row: SpecializedGetters, ordinal: Int) =>
-        val timestampString =
-          options.timestampFormat.format(DateTimeUtils.toJavaTimestamp(row.getLong(ordinal)))
+        val timestampString = timestampParser.format(row.getLong(ordinal))
         gen.writeString(timestampString)
 
 

[spark] branch branch-2.4 updated: [SPARK-29918][SQL] RecordBinaryComparator should check endianness when compared by long

2019-11-19 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-2.4
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-2.4 by this push:
 new dc2abe51 [SPARK-29918][SQL] RecordBinaryComparator should check endianness when compared by long
dc2abe51 is described below

commit dc2abe51ca2d3d702d6b6457301c3ca9c7244212
Author: wangguangxin.cn 
AuthorDate: Tue Nov 19 16:10:22 2019 +0800

[SPARK-29918][SQL] RecordBinaryComparator should check endianness when compared by long

### What changes were proposed in this pull request?
This PR tries to make sure that the results of *comparing 8 bytes at a time* and *comparing byte by byte* in RecordBinaryComparator are consistent, by reversing the long's bytes if the platform is little-endian and using Long.compareUnsigned.

### Why are the changes needed?
If the architecture supports unaligned access or the offset is 8-byte aligned, `RecordBinaryComparator` compares 8 bytes at a time by reading them as a long. The related code is:
```
if (Platform.unaligned() || (((leftOff + i) % 8 == 0) && ((rightOff + i) % 8 == 0))) {
  while (i <= leftLen - 8) {
    final long v1 = Platform.getLong(leftObj, leftOff + i);
    final long v2 = Platform.getLong(rightObj, rightOff + i);
    if (v1 != v2) {
      return v1 > v2 ? 1 : -1;
    }
    i += 8;
  }
}
```

Otherwise, it compares byte by byte. The related code is:
```
while (i < leftLen) {
  final int v1 = Platform.getByte(leftObj, leftOff + i) & 0xff;
  final int v2 = Platform.getByte(rightObj, rightOff + i) & 0xff;
  if (v1 != v2) {
    return v1 > v2 ? 1 : -1;
  }
  i += 1;
}
```

However, on a little-endian machine, the result of comparing by long values and the result of comparing byte by byte may differ. For the same pair of records, the offsets may vary between the first run and the second run, so the records may be compared with the long-based path in one run and the byte-by-byte path in the other, and the results may differ.
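
To make the inconsistency concrete, here is a rough standalone Scala sketch of the idea behind the fix (the helper is ours, not Spark's API):
```scala
import java.nio.ByteOrder

// On little-endian hardware, the first byte in memory is the long's lowest byte,
// so a plain signed long comparison disagrees with lexicographic byte order.
val littleEndian = ByteOrder.nativeOrder() == ByteOrder.LITTLE_ENDIAN

def compareLongsAsBytes(l: Long, r: Long): Int = {
  // Reverse the bytes on little-endian machines to recover memory order...
  val v1 = if (littleEndian) java.lang.Long.reverseBytes(l) else l
  val v2 = if (littleEndian) java.lang.Long.reverseBytes(r) else r
  // ...and compare unsigned, matching the `& 0xff` in the byte-wise loop.
  java.lang.Long.compareUnsigned(v1, v2)
}

// Memory bytes [0x01, 0x00, ...] vs [0x00, ..., 0x01] read as little-endian
// longs are 1L and 0x0100000000000000L: a signed compare says left < right,
// but the byte-wise loop says left > right. The helper agrees with the loop.
assert(compareLongsAsBytes(1L, 0x0100000000000000L) > 0)
```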

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Added new test cases in `RecordBinaryComparatorSuite`.

Closes #26548 from WangGuangxin/binary_comparator.

Authored-by: wangguangxin.cn 
Signed-off-by: Wenchen Fan 
(cherry picked from commit ffc97530371433bc0221e06d8c1d11af8d92bd94)
Signed-off-by: Wenchen Fan 
---
 .../sql/execution/RecordBinaryComparator.java  | 30 +-
 .../sort/RecordBinaryComparatorSuite.java  | 47 +-
 2 files changed, 67 insertions(+), 10 deletions(-)

diff --git a/sql/catalyst/src/main/java/org/apache/spark/sql/execution/RecordBinaryComparator.java b/sql/catalyst/src/main/java/org/apache/spark/sql/execution/RecordBinaryComparator.java
index 40c2cc8..1f24340 100644
--- a/sql/catalyst/src/main/java/org/apache/spark/sql/execution/RecordBinaryComparator.java
+++ b/sql/catalyst/src/main/java/org/apache/spark/sql/execution/RecordBinaryComparator.java
@@ -20,8 +20,13 @@ package org.apache.spark.sql.execution;
 import org.apache.spark.unsafe.Platform;
 import org.apache.spark.util.collection.unsafe.sort.RecordComparator;
 
+import java.nio.ByteOrder;
+
 public final class RecordBinaryComparator extends RecordComparator {
 
+  private static final boolean LITTLE_ENDIAN =
+      ByteOrder.nativeOrder().equals(ByteOrder.LITTLE_ENDIAN);
+
   @Override
   public int compare(
       Object leftObj, long leftOff, int leftLen, Object rightObj, long rightOff, int rightLen) {
@@ -38,10 +43,10 @@ public final class RecordBinaryComparator extends RecordComparator {
     // check if stars align and we can get both offsets to be aligned
     if ((leftOff % 8) == (rightOff % 8)) {
       while ((leftOff + i) % 8 != 0 && i < leftLen) {
-        final int v1 = Platform.getByte(leftObj, leftOff + i) & 0xff;
-        final int v2 = Platform.getByte(rightObj, rightOff + i) & 0xff;
+        final int v1 = Platform.getByte(leftObj, leftOff + i);
+        final int v2 = Platform.getByte(rightObj, rightOff + i);
         if (v1 != v2) {
-          return v1 > v2 ? 1 : -1;
+          return (v1 & 0xff) > (v2 & 0xff) ? 1 : -1;
         }
         i += 1;
       }
@@ -49,10 +54,17 @@ public final class RecordBinaryComparator extends RecordComparator {
     // for architectures that support unaligned accesses, chew it up 8 bytes at a time
     if (Platform.unaligned() || (((leftOff + i) % 8 == 0) && ((rightOff + i) % 8 == 0))) {
       while (i <= leftLen - 8) {
-        final long v1 = Platform.getLong(leftObj, leftOff + i);
-        final long v2 = Platform.getLong(rightObj, rightOff + i);
+        long v1 = Platform.getLong(leftObj, leftOff + i);
+        long v2 = Platform.getLong(rightObj, rightOff + i)

[spark] branch master updated (16134d6 -> ffc9753)

2019-11-19 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from 16134d6  [SPARK-29948][SQL] make the default alias consistent between date, timestamp and interval
 add ffc9753  [SPARK-29918][SQL] RecordBinaryComparator should check endianness when compared by long

No new revisions were added by this update.

Summary of changes:
 .../sql/execution/RecordBinaryComparator.java  | 30 +-
 .../sort/RecordBinaryComparatorSuite.java  | 47 +-
 2 files changed, 67 insertions(+), 10 deletions(-)




