[GitHub] spark issue #22887: [SPARK-25880][CORE] user set's hadoop conf should not ov...

2018-11-09 Thread cxzl25
Github user cxzl25 commented on the issue:

https://github.com/apache/spark/pull/22887
  
A Hadoop conf set by the user at runtime can't overwrite the value from spark-defaults.conf.

**SparkHadoopUtil.get.appendS3AndSparkHadoopConfigurations** overwrites the user-set value with the spark.hadoop.* default taken from sparkSession.sparkContext.conf (i.e. spark-defaults.conf).

@gengliangwang @cloud-fan @gatorsmile  
Could you please give some comments when you have time?
Thanks so much.


https://github.com/apache/spark/blob/80813e198033cd63cc6100ee6ffe7d1eb1dff27b/sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala#L85-L89

## test:

### spark-defaults.conf
```
spark.hadoop.mapreduce.input.fileinputformat.split.maxsize  2
```

### spark-shell

```scala

val hadoopConfKey = "mapreduce.input.fileinputformat.split.maxsize"
spark.conf.get("spark.hadoop." + hadoopConfKey) // 2

var hadoopConf = spark.sessionState.newHadoopConf()
hadoopConf.get(hadoopConfKey) // 2

spark.conf.set(hadoopConfKey, 1) // set 1
hadoopConf = spark.sessionState.newHadoopConf()
hadoopConf.get(hadoopConfKey) // 1

// org.apache.spark.sql.hive.HadoopTableReader appends the conf
org.apache.spark.deploy.SparkHadoopUtil.get.appendS3AndSparkHadoopConfigurations(
  spark.sparkContext.getConf, hadoopConf)

// org.apache.spark.sql.hive.HadoopTableReader _broadcastedHadoopConf
hadoopConf.get("mapreduce.input.fileinputformat.split.maxsize") // 2

```
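
One possible direction (a hedged sketch, not the actual patch; the helper name is made up for illustration) is to copy `spark.hadoop.*` entries from the SparkConf only when the user has not already set the corresponding key on the Hadoop conf:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.spark.SparkConf

// Hypothetical helper, for illustration only: append spark.hadoop.* defaults
// without clobbering keys the user has already set on the Hadoop conf.
def appendSparkHadoopConfigsIfUnset(conf: SparkConf, hadoopConf: Configuration): Unit = {
  for ((key, value) <- conf.getAll if key.startsWith("spark.hadoop.")) {
    val hadoopKey = key.stripPrefix("spark.hadoop.")
    if (hadoopConf.get(hadoopKey) == null) { // keep the user-set value
      hadoopConf.set(hadoopKey, value)
    }
  }
}
```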









---




[GitHub] spark pull request #21364: [SPARK-24317][SQL]Float-point numbers are display...

2018-10-11 Thread cxzl25
Github user cxzl25 closed the pull request at:

https://github.com/apache/spark/pull/21364


---




[GitHub] spark issue #21656: [SPARK-24677][Core]Avoid NoSuchElementException from Med...

2018-07-10 Thread cxzl25
Github user cxzl25 commented on the issue:

https://github.com/apache/spark/pull/21656
  
@tgravescs 
This is really not difficult.
I'm just not sure whether we want to ignore it or pass down the real time.
I have now submitted a change that uses the actual time of the successful task.


---




[GitHub] spark pull request #21656: [SPARK-24677][Core]MedianHeap is empty when specu...

2018-07-04 Thread cxzl25
Github user cxzl25 commented on a diff in the pull request:

https://github.com/apache/spark/pull/21656#discussion_r200236204
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala ---
@@ -772,6 +772,12 @@ private[spark] class TaskSetManager(
   private[scheduler] def markPartitionCompleted(partitionId: Int): Unit = {
     partitionToIndex.get(partitionId).foreach { index =>
       if (!successful(index)) {
+        if (speculationEnabled) {
+          taskAttempts(index).headOption.map { info =>
+            info.markFinished(TaskState.FINISHED, clock.getTimeMillis())
+            successfulTaskDurations.insert(info.duration)
--- End diff --

TaskSetManager#handleSuccessfulTask updates the successful task durations and writes to successfulTaskDurations.

When there are multiple task sets for this stage, markPartitionCompletedInAllTaskSets only accumulates the value of tasksSuccessful (without recording a duration).

In this case, when checkSpeculatableTasks is called, the value of tasksSuccessful matches the condition, but successfulTaskDurations is empty.


https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L723
```scala
  def handleSuccessfulTask(tid: Long, result: DirectTaskResult[_]): Unit = {
    val info = taskInfos(tid)
    val index = info.index
    info.markFinished(TaskState.FINISHED, clock.getTimeMillis())
    if (speculationEnabled) {
      successfulTaskDurations.insert(info.duration)
    }
    // ...
    // There may be multiple tasksets for this stage -- we let all of them know that the partition
    // was completed.  This may result in some of the tasksets getting completed.
    sched.markPartitionCompletedInAllTaskSets(stageId, tasks(index).partitionId)
```

https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L987
```scala
  override def checkSpeculatableTasks(minTimeToSpeculation: Int): Boolean = {
    // ...
    if (tasksSuccessful >= minFinishedForSpeculation && tasksSuccessful > 0) {
      val time = clock.getTimeMillis()
      val medianDuration = successfulTaskDurations.median
```



---




[GitHub] spark issue #21656: [SPARK-24677][Core]MedianHeap is empty when speculation ...

2018-07-04 Thread cxzl25
Github user cxzl25 commented on the issue:

https://github.com/apache/spark/pull/21656
  
@maropu @cloud-fan @squito 
Can you trigger a test for this?
This is the exception stack in the log:
```
ERROR Utils: uncaught error in thread task-scheduler-speculation, stopping SparkContext
java.util.NoSuchElementException: MedianHeap is empty.
at org.apache.spark.util.collection.MedianHeap.median(MedianHeap.scala:83)
at org.apache.spark.scheduler.TaskSetManager.checkSpeculatableTasks(TaskSetManager.scala:968)
at org.apache.spark.scheduler.Pool$$anonfun$checkSpeculatableTasks$1.apply(Pool.scala:94)
at org.apache.spark.scheduler.Pool$$anonfun$checkSpeculatableTasks$1.apply(Pool.scala:93)
at scala.collection.Iterator$class.foreach(Iterator.scala:742)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at org.apache.spark.scheduler.Pool.checkSpeculatableTasks(Pool.scala:93)
at org.apache.spark.scheduler.Pool$$anonfun$checkSpeculatableTasks$1.apply(Pool.scala:94)
at org.apache.spark.scheduler.Pool$$anonfun$checkSpeculatableTasks$1.apply(Pool.scala:93)
```


---




[GitHub] spark issue #21656: [SPARK-24677][Core]MedianHeap is empty when speculation ...

2018-07-02 Thread cxzl25
Github user cxzl25 commented on the issue:

https://github.com/apache/spark/pull/21656
  
@maropu 
I have added a unit test.
Can you trigger a test for this?


---




[GitHub] spark pull request #21656: [SPARK-24677][Core]MedianHeap is empty when specu...

2018-06-28 Thread cxzl25
GitHub user cxzl25 opened a pull request:

https://github.com/apache/spark/pull/21656

[SPARK-24677][Core]MedianHeap is empty when speculation is enabled, causing 
the SparkContext to stop

## What changes were proposed in this pull request?
When speculation is enabled, TaskSetManager#markPartitionCompleted should write the successful task duration to the MedianHeap, not just increase tasksSuccessful.

Otherwise, when TaskSetManager#checkSpeculatableTasks runs, tasksSuccessful is non-zero but the MedianHeap is empty, so successfulTaskDurations.median throws java.util.NoSuchElementException: MedianHeap is empty, which finally stops the SparkContext.

## How was this patch tested?
manual tests
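
For reference, the change described above amounts to roughly the following sketch (simplified; field and method names follow the TaskSetManager diff quoted earlier in this thread, so treat it as an outline rather than the exact patch):

```scala
private[scheduler] def markPartitionCompleted(partitionId: Int): Unit = {
  partitionToIndex.get(partitionId).foreach { index =>
    if (!successful(index)) {
      if (speculationEnabled) {
        // Record a duration for this partition so that checkSpeculatableTasks
        // never sees tasksSuccessful > 0 while the MedianHeap is still empty.
        taskAttempts(index).headOption.foreach { info =>
          info.markFinished(TaskState.FINISHED, clock.getTimeMillis())
          successfulTaskDurations.insert(info.duration)
        }
      }
      tasksSuccessful += 1
      successful(index) = true
      // ...
    }
  }
}
```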


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/cxzl25/spark fix_MedianHeap_empty

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21656.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21656


commit 467f0bccb7d1940bed4f1b2e633c9374b0e654f2
Author: sychen 
Date:   2018-06-28T07:34:38Z

MedianHeap is empty when speculation is enabled, causing the SparkContext 
to stop.




---




[GitHub] spark pull request #20739: [SPARK-23603][SQL]When the length of the json is ...

2018-06-27 Thread cxzl25
Github user cxzl25 closed the pull request at:

https://github.com/apache/spark/pull/20739


---




[GitHub] spark pull request #20738: [SPARK-23603][SQL]When the length of the json is ...

2018-06-27 Thread cxzl25
Github user cxzl25 closed the pull request at:

https://github.com/apache/spark/pull/20738


---




[GitHub] spark issue #21596: [SPARK-24601] Bump Jackson version

2018-06-22 Thread cxzl25
Github user cxzl25 commented on the issue:

https://github.com/apache/spark/pull/21596
  
https://github.com/apache/spark/pull/20738

Bump Jackson from 2.6.7 & 2.6.7.1 to 2.7.7.
Jackson (>= 2.7.7) fixes the possibility of missing tail data when the length of the value falls in a certain range.

[https://github.com/FasterXML/jackson/wiki/Jackson-Release-2.7.7](https://github.com/FasterXML/jackson/wiki/Jackson-Release-2.7.7)

[https://github.com/FasterXML/jackson-core/issues/307](https://github.com/FasterXML/jackson-core/issues/307)

spark-shell:
```
val value = "x" * 3000
val json = s"""{"big": "$value"}"""
spark.sql("select length(get_json_object(\'"+json+"\','$.big'))" ).collect
res0: Array[org.apache.spark.sql.Row] = Array([2991])
```
expected result: 3000
actual result: 2991


---




[GitHub] spark pull request #18900: [SPARK-21687][SQL] Spark SQL should set createTim...

2018-06-07 Thread cxzl25
Github user cxzl25 commented on a diff in the pull request:

https://github.com/apache/spark/pull/18900#discussion_r193685282
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala ---
@@ -1019,6 +1021,8 @@ private[hive] object HiveClientImpl {
 compressed = apiPartition.getSd.isCompressed,
 properties = Option(apiPartition.getSd.getSerdeInfo.getParameters)
   .map(_.asScala.toMap).orNull),
+  createTime = apiPartition.getCreateTime.toLong * 1000,
+  lastAccessTime = apiPartition.getLastAccessTime.toLong * 1000)
--- End diff --

Add a comma to the end?


---




[GitHub] spark issue #18900: [SPARK-21687][SQL] Spark SQL should set createTime for H...

2018-06-06 Thread cxzl25
Github user cxzl25 commented on the issue:

https://github.com/apache/spark/pull/18900
  
**Modifying a partition will lose createTime.**
Reading Hive partitions drops createTime when converting to CatalogTablePartition, so it is also lost when modifying partitions.
Calling SessionCatalog#alterPartitions will therefore lose createTime.
So can you reopen this PR? @debugger87

```sql
CREATE TABLE `tmp_test_partition`(
  `c1` string
)
PARTITIONED BY (
  `d` string
);
ALTER TABLE `tmp_test_partition` ADD PARTITION (d='1');
ALTER TABLE `tmp_test_partition` PARTITION (d='1') SET LOCATION 'xxx';
```
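
For context, the conversion needs to carry Hive's second-resolution partition times over as milliseconds, as in the HiveClientImpl diff quoted in the previous message. A minimal sketch (the helper name is hypothetical):

```scala
import org.apache.hadoop.hive.metastore.api.{Partition => HivePartition}

// Hypothetical helper: Hive stores createTime/lastAccessTime in seconds,
// Spark's catalog keeps milliseconds; these values are what must not be
// dropped when building a CatalogTablePartition.
def partitionTimesMs(apiPartition: HivePartition): (Long, Long) =
  (apiPartition.getCreateTime.toLong * 1000,
    apiPartition.getLastAccessTime.toLong * 1000)
```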


---




[GitHub] spark issue #21164: [SPARK-24098][SQL] ScriptTransformationExec should wait ...

2018-06-05 Thread cxzl25
Github user cxzl25 commented on the issue:

https://github.com/apache/spark/pull/21164
  
@liutang123  cc @cloud-fan @gatorsmile 

I also encountered this problem.


![image](https://user-images.githubusercontent.com/3898450/40981493-9b0b0d46-690d-11e8-8607-c14756610d59.png)
python:
```python
import sys
for line in sys.stdin:
    print 1/0
```
sql:
```sql
ADD FILE test.py;
SELECT TRANSFORM(1) USING 'python test.py'
AS (c1)
```

I solved it this way: call *writerThread.join()* before checking for failure:
```scala
if (scriptOutputReader.next(scriptOutputWritable) <= 0) {
  writerThread.join()
  checkFailureAndPropagate()
  return false
}
```



---




[GitHub] spark issue #21364: [SPARK-24317][SQL]Float-point numbers are displayed with...

2018-05-28 Thread cxzl25
Github user cxzl25 commented on the issue:

https://github.com/apache/spark/pull/21364
  
ping @gatorsmile @liufengdb 


---




[GitHub] spark issue #21311: [SPARK-24257][SQL]LongToUnsafeRowMap calculate the new s...

2018-05-23 Thread cxzl25
Github user cxzl25 commented on the issue:

https://github.com/apache/spark/pull/21311
  
@cloud-fan Thank you very much for your help.


---




[GitHub] spark pull request #21311: [SPARK-24257][SQL]LongToUnsafeRowMap calculate th...

2018-05-23 Thread cxzl25
Github user cxzl25 commented on a diff in the pull request:

https://github.com/apache/spark/pull/21311#discussion_r190345942
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/joins/HashedRelationSuite.scala ---
@@ -254,6 +254,30 @@ class HashedRelationSuite extends SparkFunSuite with 
SharedSQLContext {
 map.free()
   }
 
+  test("LongToUnsafeRowMap with big values") {
+val taskMemoryManager = new TaskMemoryManager(
+  new StaticMemoryManager(
+new SparkConf().set(MEMORY_OFFHEAP_ENABLED.key, "false"),
+Long.MaxValue,
+Long.MaxValue,
+1),
+  0)
+val unsafeProj = UnsafeProjection.create(Array[DataType](StringType))
+val map = new LongToUnsafeRowMap(taskMemoryManager, 1)
+
+val key = 0L
+// the page array is initialized with length 1 << 17 (1M bytes),
+// so here we need a value larger than 1 << 18 (2M bytes),to trigger 
the bug
+val bigStr = UTF8String.fromString("x" * (1 << 22))
--- End diff --

Yes. I think so.


---




[GitHub] spark pull request #21311: [SPARK-24257][SQL]LongToUnsafeRowMap calculate th...

2018-05-23 Thread cxzl25
Github user cxzl25 commented on a diff in the pull request:

https://github.com/apache/spark/pull/21311#discussion_r190146533
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/joins/HashedRelationSuite.scala ---
@@ -254,6 +254,30 @@ class HashedRelationSuite extends SparkFunSuite with 
SharedSQLContext {
 map.free()
   }
 
+  test("LongToUnsafeRowMap with big values") {
+val taskMemoryManager = new TaskMemoryManager(
+  new StaticMemoryManager(
+new SparkConf().set(MEMORY_OFFHEAP_ENABLED.key, "false"),
+Long.MaxValue,
+Long.MaxValue,
+1),
+  0)
+val unsafeProj = UnsafeProjection.create(Array[DataType](StringType))
+val map = new LongToUnsafeRowMap(taskMemoryManager, 1)
+
+val key = 0L
+// the page array is initialized with length 1 << 17 (1M bytes),
+// so here we need a value larger than 1 << 18 (2M bytes),to trigger 
the bug
+val bigStr = UTF8String.fromString("x" * (1 << 22))
--- End diff --

LongToUnsafeRowMap#getRow:
resultRow = UnsafeRow#pointTo(page(1 << 18), baseOffset(16), sizeInBytes(1 << 21 + 16))

UTF8String#getBytes:
copyMemory(base(page), offset, bytes, BYTE_ARRAY_OFFSET, numBytes(1 << 21 + 16));

When the sizes are similar, the original value can sometimes still be read back.

Since SPARK-10399 was introduced, UnsafeRow#getUTF8String checks the size, so picking 1 << 18 + 1 reproduces this bug 100% of the time.
But without that patch, differences that are too small sometimes do not trigger it, so I chose a larger value.

My understanding may be wrong. Please advise. Thank you.

```java
sun.misc.Unsafe unsafe;
try {
Field unsafeField = Unsafe.class.getDeclaredField("theUnsafe");
unsafeField.setAccessible(true);
unsafe = (sun.misc.Unsafe) unsafeField.get(null);
} catch (Throwable cause) {
unsafe = null;
}

String value = "x";
byte[] src = value.getBytes();

byte[] dst = new byte[3];
byte[] newDst = new byte[5];

unsafe.copyMemory(src, 16, dst, 16, src.length);
unsafe.copyMemory(dst, 16, newDst, 16, src.length);

System.out.println("dst:" + new String(dst));
System.out.println("newDst:" + new String(newDst));
```
output:
>dst:xxx
>newDst:x





---




[GitHub] spark pull request #21311: [SPARK-24257][SQL]LongToUnsafeRowMap calculate th...

2018-05-22 Thread cxzl25
Github user cxzl25 commented on a diff in the pull request:

https://github.com/apache/spark/pull/21311#discussion_r189997697
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/joins/HashedRelationSuite.scala ---
@@ -254,6 +254,30 @@ class HashedRelationSuite extends SparkFunSuite with 
SharedSQLContext {
 map.free()
   }
 
+  test("LongToUnsafeRowMap with big values") {
+val taskMemoryManager = new TaskMemoryManager(
+  new StaticMemoryManager(
+new SparkConf().set(MEMORY_OFFHEAP_ENABLED.key, "false"),
+Long.MaxValue,
+Long.MaxValue,
+1),
+  0)
+val unsafeProj = UnsafeProjection.create(Array[DataType](StringType))
+val map = new LongToUnsafeRowMap(taskMemoryManager, 1)
+
+val key = 0L
+// the page array is initialized with length 1 << 17 (1M bytes),
+// so here we need a value larger than 1 << 18 (2M bytes),to trigger 
the bug
+val bigStr = UTF8String.fromString("x" * (1 << 22))
--- End diff --

Not necessary.
I just chose a larger value to make it easier to trigger the data loss.


---




[GitHub] spark pull request #21311: [SPARK-24257][SQL]LongToUnsafeRowMap calculate th...

2018-05-22 Thread cxzl25
Github user cxzl25 commented on a diff in the pull request:

https://github.com/apache/spark/pull/21311#discussion_r189905873
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala ---
@@ -626,6 +618,32 @@ private[execution] final class LongToUnsafeRowMap(val 
mm: TaskMemoryManager, cap
 }
   }
 
+  private def grow(neededSize: Int): Unit = {
+// There is 8 bytes for the pointer to next value
+val totalNeededSize = cursor + 8 + neededSize
--- End diff --

@cloud-fan Thank you for your suggestion and code.


---




[GitHub] spark issue #21311: [SPARK-24257][SQL]LongToUnsafeRowMap calculate the new s...

2018-05-22 Thread cxzl25
Github user cxzl25 commented on the issue:

https://github.com/apache/spark/pull/21311
  
@cloud-fan 
In LongToUnsafeRowMap#append(key: Long, row: UnsafeRow), when row.getSizeInBytes > newPageSize (oldPage.length * 8L * 2), the newPageSize value is still used.
When the new page is too small to hold the entire row, Platform.copyMemory is still called and no error is raised, so the tail of the row is silently dropped.
When the data is read back according to the stored offset and length, the tail of the value is unpredictable.
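
To make the sizing problem concrete, here is a small self-contained sketch (illustrative numbers only, not from the PR): doubling the old page is not enough when a single row is larger than the doubled page, so the new size has to be at least the needed size.

```scala
object GrowSizeSketch {
  def main(args: Array[String]): Unit = {
    val oldPageBytes = 1L << 20                            // current page: 1 MB
    val cursor       = 16L                                 // bytes already written
    val rowSize      = 4L << 20                            // one 4 MB row to append
    val neededSize   = cursor + 8 + rowSize                // 8 bytes for the next-value pointer

    val doubledOnly  = oldPageBytes * 2                    // old sizing logic
    val grownSize    = math.max(doubledOnly, neededSize)   // sizing that always fits the row

    // With the old logic the row does not fit and its tail is silently dropped.
    println(s"needed=$neededSize doubled=$doubledOnly grown=$grownSize")
  }
}
```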


---




[GitHub] spark issue #21311: [SPARK-24257][SQL]LongToUnsafeRowMap calculate the new s...

2018-05-22 Thread cxzl25
Github user cxzl25 commented on the issue:

https://github.com/apache/spark/pull/21311
  
@JoshRosen  @cloud-fan @gatorsmile 
Since [SPARK-10399](https://issues.apache.org/jira/browse/SPARK-10399) was introduced, UnsafeRow#getUTF8String checks the size and fails:

[UnsafeRow#getUTF8String](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeRow.java#L420)

[OnHeapMemoryBlock](https://github.com/apache/spark/blob/master/common/unsafe/src/main/java/org/apache/spark/unsafe/memory/OnHeapMemoryBlock.java#L34)
> The sum of size 2097152 and offset 32 should not be larger than the size of the given memory space 2097168

![image](https://user-images.githubusercontent.com/3898450/40353741-cbd02616-5de4-11e8-9406-4fcb65a69294.png)

But when this patch is not introduced, there is no error and a wrong value is read back.

![spark-2 2 0](https://user-images.githubusercontent.com/3898450/40353815-febe5494-5de4-11e8-9c43-e75a95a3e74f.png)






---




[GitHub] spark pull request #21364: [SPARK-24317][SQL]Float-point numbers are display...

2018-05-18 Thread cxzl25
GitHub user cxzl25 reopened a pull request:

https://github.com/apache/spark/pull/21364

[SPARK-24317][SQL]Float-point numbers are displayed with different 
precision in ThriftServer2

## What changes were proposed in this pull request?
When querying floating-point numbers, the values displayed in Beeline or over JDBC have a different precision.
```
SELECT CAST(1.23 AS FLOAT)
Result:
1.2300000190734863
```
According to these two jira:
[HIVE-11802](https://issues.apache.org/jira/browse/HIVE-11802)
[HIVE-11832](https://issues.apache.org/jira/browse/HIVE-11832)
Make a slight modification to the spark hive thrift server.
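
A quick spark-shell check (an illustration of where the extra digits come from, not part of the patch): widening the FLOAT to a DOUBLE before formatting exposes the binary representation of 1.23f, while formatting it as a float keeps the expected "1.23".

```
scala> 1.23f.toDouble
res0: Double = 1.2300000190734863

scala> 1.23f.toString
res1: String = 1.23
```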

## How was this patch tested?
HiveThriftBinaryServerSuite
test("JDBC query float-point number")


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/cxzl25/spark fix_float_number_display

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21364.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21364


commit 137abed62a75ea453a0d3c8436d9aa08b9abee36
Author: sychen <sychen@...>
Date:   2018-05-18T15:39:35Z

Float-point numbers are displayed with different precision in Beeline/JDBC




---




[GitHub] spark pull request #21364: [SPARK-24317][SQL]Float-point numbers are display...

2018-05-18 Thread cxzl25
Github user cxzl25 closed the pull request at:

https://github.com/apache/spark/pull/21364


---




[GitHub] spark pull request #21364: [SPARK-20173][SQL]Float-point numbers are display...

2018-05-18 Thread cxzl25
GitHub user cxzl25 opened a pull request:

https://github.com/apache/spark/pull/21364

[SPARK-20173][SQL]Float-point numbers are displayed with different 
precision in ThriftServer2

## What changes were proposed in this pull request?
When querying floating-point numbers, the values displayed in Beeline or over JDBC have a different precision.
```
SELECT CAST(1.23 AS FLOAT)
Result:
1.2300000190734863
```
According to these two jira:
[HIVE-11802](https://issues.apache.org/jira/browse/HIVE-11802)
[HIVE-11832](https://issues.apache.org/jira/browse/HIVE-11832)
Make a slight modification to the spark hive thrift server.

## How was this patch tested?
HiveThriftBinaryServerSuite
test("JDBC query float-point number")


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/cxzl25/spark fix_float_number_display

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21364.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21364


commit 137abed62a75ea453a0d3c8436d9aa08b9abee36
Author: sychen <sychen@...>
Date:   2018-05-18T15:39:35Z

Float-point numbers are displayed with different precision in Beeline/JDBC




---




[GitHub] spark issue #21311: [SPARK-24257][SQL]LongToUnsafeRowMap calculate the new s...

2018-05-14 Thread cxzl25
Github user cxzl25 commented on the issue:

https://github.com/apache/spark/pull/21311
  
Thanks for your review. @maropu @kiszk @cloud-fan 

I submitted a modification including the following:
1. splitting the append func into two parts: grow/append
2. doubling the size when growing
3. using sys.error instead of UnsupportedOperationException



---




[GitHub] spark pull request #21311: [SPARK-24257][SQL]LongToUnsafeRowMap calculate th...

2018-05-14 Thread cxzl25
Github user cxzl25 commented on a diff in the pull request:

https://github.com/apache/spark/pull/21311#discussion_r187907410
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala ---
@@ -568,13 +568,16 @@ private[execution] final class LongToUnsafeRowMap(val 
mm: TaskMemoryManager, cap
 }
 
 // There is 8 bytes for the pointer to next value
-if (cursor + 8 + row.getSizeInBytes > page.length * 8L + 
Platform.LONG_ARRAY_OFFSET) {
+val needSize = cursor + 8 + row.getSizeInBytes
+val nowSize = page.length * 8L + Platform.LONG_ARRAY_OFFSET
+if (needSize > nowSize) {
   val used = page.length
   if (used >= (1 << 30)) {
 sys.error("Can not build a HashedRelation that is larger than 8G")
--- End diff --

ok. sys.error instead of UnsupportedOperationException


---




[GitHub] spark pull request #21311: [SPARK-24257][SQL]LongToUnsafeRowMap calculate th...

2018-05-14 Thread cxzl25
Github user cxzl25 commented on a diff in the pull request:

https://github.com/apache/spark/pull/21311#discussion_r187907559
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala ---
@@ -568,13 +568,16 @@ private[execution] final class LongToUnsafeRowMap(val 
mm: TaskMemoryManager, cap
 }
 
 // There is 8 bytes for the pointer to next value
-if (cursor + 8 + row.getSizeInBytes > page.length * 8L + 
Platform.LONG_ARRAY_OFFSET) {
+val needSize = cursor + 8 + row.getSizeInBytes
+val nowSize = page.length * 8L + Platform.LONG_ARRAY_OFFSET
+if (needSize > nowSize) {
   val used = page.length
   if (used >= (1 << 30)) {
 sys.error("Can not build a HashedRelation that is larger than 8G")
   }
-  ensureAcquireMemory(used * 8L * 2)
-  val newPage = new Array[Long](used * 2)
+  val multiples = math.max(math.ceil(needSize.toDouble / (used * 
8L)).toInt, 2)
+  ensureAcquireMemory(used * 8L * multiples)
--- End diff --

OK. Splitting the append func into two parts: grow/append.


---




[GitHub] spark pull request #21311: [SPARK-24257][SQL]LongToUnsafeRowMap calculate th...

2018-05-14 Thread cxzl25
Github user cxzl25 commented on a diff in the pull request:

https://github.com/apache/spark/pull/21311#discussion_r187907473
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala ---
@@ -568,13 +568,16 @@ private[execution] final class LongToUnsafeRowMap(val 
mm: TaskMemoryManager, cap
 }
 
 // There is 8 bytes for the pointer to next value
-if (cursor + 8 + row.getSizeInBytes > page.length * 8L + 
Platform.LONG_ARRAY_OFFSET) {
+val needSize = cursor + 8 + row.getSizeInBytes
+val nowSize = page.length * 8L + Platform.LONG_ARRAY_OFFSET
+if (needSize > nowSize) {
   val used = page.length
   if (used >= (1 << 30)) {
 sys.error("Can not build a HashedRelation that is larger than 8G")
   }
-  ensureAcquireMemory(used * 8L * 2)
--- End diff --

OK. Doubling the size when growing.


---




[GitHub] spark pull request #21311: [SPARK-24257][SQL]LongToUnsafeRowMap calculate th...

2018-05-12 Thread cxzl25
GitHub user cxzl25 opened a pull request:

https://github.com/apache/spark/pull/21311

[SPARK-24257][SQL]LongToUnsafeRowMap calculate the new size may be wrong

## What changes were proposed in this pull request?

LongToUnsafeRowMap calculates the new page size simply by multiplying the old size by 2.
The requested size may then still not be enough to hold the row being appended, so some data is lost and the data read back is dirty.

## How was this patch tested?
HashedRelationSuite
test("LongToUnsafeRowMap with big values")

Please review http://spark.apache.org/contributing.html before opening a 
pull request.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/cxzl25/spark fix_LongToUnsafeRowMap_page_size

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21311.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21311


commit d9d8e62c2de7d9d04534396ab3bbf984ab16c7f5
Author: sychen <sychen@...>
Date:   2018-05-12T11:14:17Z

LongToUnsafeRowMap Calculate the new correct size




---




[GitHub] spark issue #20739: [SPARK-23603][SQL]When the length of the json is in a ra...

2018-03-12 Thread cxzl25
Github user cxzl25 commented on the issue:

https://github.com/apache/spark/pull/20739
  
Another solution:
#20738 


---




[GitHub] spark issue #20738: [SPARK-23603][SQL]When the length of the json is in a ra...

2018-03-12 Thread cxzl25
Github user cxzl25 commented on the issue:

https://github.com/apache/spark/pull/20738
  
Another solution:
https://github.com/apache/spark/pull/20739


---




[GitHub] spark pull request #20739: [SPARK-23603][SQL]When the length of the json is ...

2018-03-05 Thread cxzl25
GitHub user cxzl25 opened a pull request:

https://github.com/apache/spark/pull/20739

[SPARK-23603][SQL]When the length of the json is in a range,get_json_object 
will result in missing tail data

## What changes were proposed in this pull request?
Replace writeRaw(char[] text, int offset, int len) with writeRaw(String text).
Jackson (>= 2.7.7) fixes the possibility of missing tail data when the length of the value falls in a certain range.

[https://github.com/FasterXML/jackson/wiki/Jackson-Release-2.7.7](https://github.com/FasterXML/jackson/wiki/Jackson-Release-2.7.7)

[https://github.com/FasterXML/jackson-core/issues/307](https://github.com/FasterXML/jackson-core/issues/307)
## How was this patch tested?
org.apache.spark.sql.catalyst.expressions.JsonExpressionsSuite
test("some big value")
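
To make the API swap concrete, here is a small hedged illustration of the two JsonGenerator overloads involved (plain Jackson, independent of Spark); the char[] variant is the one affected by jackson-core#307 before 2.7.7, and the String variant is the replacement:

```scala
import java.io.StringWriter
import com.fasterxml.jackson.core.JsonFactory

object WriteRawDemo {
  def main(args: Array[String]): Unit = {
    val out = new StringWriter()
    val gen = new JsonFactory().createGenerator(out)
    val chars = "\"abc\"".toCharArray
    gen.writeRaw(chars, 0, chars.length) // writeRaw(char[], offset, len): the affected overload
    gen.writeRaw("\"abc\"")              // writeRaw(String): the replacement used in this PR
    gen.close()
    println(out)                         // prints: "abc""abc"
  }
}
```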


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/cxzl25/spark fix_udf_get_json_object

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20739.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20739


commit 5a83def86eb5a75eae0ff8f43e892e3daa52d50c
Author: sychen <sychen@...>
Date:   2018-03-05T15:30:44Z

Replace writeRaw(char[] text, int offset, int len) with writeRaw(String 
text)




---




[GitHub] spark pull request #20738: [SPARK-23603][SQL]When the length of the json is ...

2018-03-05 Thread cxzl25
GitHub user cxzl25 opened a pull request:

https://github.com/apache/spark/pull/20738

[SPARK-23603][SQL]When the length of the json is in a range,get_json_object 
will result in missing tail data

## What changes were proposed in this pull request?

Bump Jackson from 2.6.7 & 2.6.7.1 to 2.7.7.
Jackson (>= 2.7.7) fixes the possibility of missing tail data when the length of the value falls in a certain range.

[https://github.com/FasterXML/jackson/wiki/Jackson-Release-2.7.7](https://github.com/FasterXML/jackson/wiki/Jackson-Release-2.7.7)

[https://github.com/FasterXML/jackson-core/issues/307](https://github.com/FasterXML/jackson-core/issues/307)
## How was this patch tested?
org.apache.spark.sql.catalyst.expressions.JsonExpressionsSuite
test("some big value")


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/cxzl25/spark update_jackson_version

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20738.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20738


commit 0da293c1e8a56fd50d9284ab57eb0f24881f08ef
Author: sychen <sychen@...>
Date:   2018-03-05T15:34:51Z

Bump jackson from 2.6.7&2.6.7.1 to 2.7.7




---




[GitHub] spark pull request #20593: [SPARK-23230][SQL][BRANCH-2.2]When hive.default.f...

2018-02-13 Thread cxzl25
Github user cxzl25 closed the pull request at:

https://github.com/apache/spark/pull/20593


---




[GitHub] spark issue #20406: [SPARK-23230][SQL]When hive.default.fileformat is other ...

2018-02-12 Thread cxzl25
Github user cxzl25 commented on the issue:

https://github.com/apache/spark/pull/20406
  
Thanks for your help, @dongjoon-hyun @gasparms.
I have submitted a separate PR against branch-2.2:
https://github.com/apache/spark/pull/20593


---




[GitHub] spark pull request #20593: [SPARK-23230][SQL][BRANCH-2.2]When hive.default.f...

2018-02-12 Thread cxzl25
GitHub user cxzl25 opened a pull request:

https://github.com/apache/spark/pull/20593

[SPARK-23230][SQL][BRANCH-2.2]When hive.default.fileformat is other kinds 
of file types, create textfile table cause a serde error

When hive.default.fileformat is set to another file type, creating a textfile table causes a SerDe error.
We should use org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe as the default SerDe for both textfile and sequencefile tables.

```
set hive.default.fileformat=orc;
create table tbl( i string ) stored as textfile;
desc formatted tbl;

Serde Library org.apache.hadoop.hive.ql.io.orc.OrcSerde
InputFormat  org.apache.hadoop.mapred.TextInputFormat
OutputFormat  org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
```



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/cxzl25/spark default_serde_2.2

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20593.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20593


commit 979323a4e05cfdd5473369f5063967d69c40046c
Author: sychen <sychen@...>
Date:   2018-02-13T01:37:08Z

When hive.default.fileformat is other kinds of file types,
create textfile table cause a serde error




---




[GitHub] spark pull request #20406: [SPARK-23230][SQL]When hive.default.fileformat is...

2018-02-11 Thread cxzl25
Github user cxzl25 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20406#discussion_r167482913
  
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveSerDeSuite.scala ---
@@ -100,6 +100,25 @@ class HiveSerDeSuite extends HiveComparisonTest with 
PlanTest with BeforeAndAfte
   assert(output == 
Some("org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat"))
   assert(serde == 
Some("org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"))
 }
+
+withSQLConf("hive.default.fileformat" -> "orc") {
--- End diff --


https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/parse/StorageFormat.java#L102

hive.default.serde
Default Value: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Description: The default SerDe Hive will use for storage formats that do not specify a SerDe.


https://cwiki.apache.org/confluence/display/Hive/DeveloperGuide#DeveloperGuide-RegistrationofNativeSerDes

hive cli
```
set hive.default.fileformat=orc;
create table tbl( i string ) stored as textfile;
desc formatted tbl;
```
```
SerDe Library:  org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat:    org.apache.hadoop.mapred.TextInputFormat
OutputFormat:   org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
```








---




[GitHub] spark issue #20406: [SPARK-23230][SQL]Error by creating a data table when us...

2018-02-11 Thread cxzl25
Github user cxzl25 commented on the issue:

https://github.com/apache/spark/pull/20406
  
ping @gatorsmile @dongjoon-hyun 

```
set hive.default.fileformat=orc;
create table tbl stored as textfile
as
select  1
```
It failed because it used the wrong SerDe:
```
Caused by: java.lang.ClassCastException: org.apache.hadoop.hive.ql.io.orc.OrcSerde$OrcSerdeRow cannot be cast to org.apache.hadoop.io.BytesWritable
at org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat$1.write(HiveIgnoreKeyTextOutputFormat.java:91)
at org.apache.spark.sql.hive.execution.HiveOutputWriter.write(HiveFileFormat.scala:149)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:327)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:258)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:256)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1375)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:261)
... 16 more
```



---




[GitHub] spark pull request #20406: [SPARK-23230][SQL]Error by creating a data table ...

2018-01-26 Thread cxzl25
GitHub user cxzl25 opened a pull request:

https://github.com/apache/spark/pull/20406

[SPARK-23230][SQL]Error by creating a data table when using 
hive.default.fileformat=orc

When hive.default.fileformat is set to another file type, creating a textfile table causes a SerDe error.
We should use org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe as the default SerDe for both textfile and sequencefile tables.

```
set hive.default.fileformat=orc;
create table tbl( i string ) stored as textfile;
desc formatted tbl;

Serde Library org.apache.hadoop.hive.ql.io.orc.OrcSerde
InputFormat  org.apache.hadoop.mapred.TextInputFormat
OutputFormat  org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
```
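
A hedged sketch of the intended default (not the exact patch; it assumes the HiveSerDe case class in org.apache.spark.sql.internal with inputFormat/outputFormat/serde Option fields): the textfile mapping should carry LazySimpleSerDe explicitly instead of inheriting whatever hive.default.fileformat implies.

```scala
import org.apache.spark.sql.internal.HiveSerDe

// Hedged sketch: what a "stored as textfile" table should resolve to,
// regardless of the hive.default.fileformat setting.
val textFileSerDe = HiveSerDe(
  inputFormat = Some("org.apache.hadoop.mapred.TextInputFormat"),
  outputFormat = Some("org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"),
  serde = Some("org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"))
```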

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/cxzl25/spark default_serde

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20406.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20406


commit f370dd6217cf8a590ef52ecc970e4dc33c235631
Author: sychen <sychen@...>
Date:   2018-01-26T12:40:48Z

take the default type of textfile and sequencefile both as 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe




---
