[GitHub] spark issue #23228: [MINOR][DOC] Update the condition description of seriali...

2018-12-10 Thread 10110346
Github user 10110346 commented on the issue:

https://github.com/apache/spark/pull/23228
  
retest this please


---




[GitHub] spark issue #23251: [SPARK-26300][SS] Remove a redundant `checkForStreaming`...

2018-12-10 Thread 10110346
Github user 10110346 commented on the issue:

https://github.com/apache/spark/pull/23251
  
cc @cloud-fan 


---




[GitHub] spark issue #23228: [MINOR][DOC] Update the condition description of seriali...

2018-12-09 Thread 10110346
Github user 10110346 commented on the issue:

https://github.com/apache/spark/pull/23228
  
I have updated it, thanks all.


---




[GitHub] spark issue #23228: [MINOR][DOC]The condition description of serialized shuf...

2018-12-06 Thread 10110346
Github user 10110346 commented on the issue:

https://github.com/apache/spark/pull/23228
  
cc @JoshRosen  @cloud-fan 


---




[GitHub] spark pull request #23251: [SPARK-26300][SS] The `checkForStreaming` mothod ...

2018-12-06 Thread 10110346
GitHub user 10110346 opened a pull request:

https://github.com/apache/spark/pull/23251

[SPARK-26300][SS] The `checkForStreaming` method may be called twice in `createQuery`

## What changes were proposed in this pull request?
If `checkForContinuous` is called (`checkForStreaming` is already called inside `checkForContinuous`), the `checkForStreaming` method ends up being called twice in `createQuery`. This is unnecessary, and since the `checkForStreaming` method does a lot of work, it is better to remove the redundant call.
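
A minimal, self-contained sketch of the redundancy being removed (hypothetical method bodies, not the real `StreamingQueryManager` code):

```scala
// Hypothetical sketch, not the actual StreamingQueryManager source:
// checkForContinuous already invokes checkForStreaming, so an unconditional
// second call in createQuery repeats the streaming checks.
object CreateQuerySketch {
  def checkForStreaming(plan: String): Unit = println(s"streaming checks on $plan")

  def checkForContinuous(plan: String): Unit = {
    checkForStreaming(plan) // streaming checks already run here
    println(s"continuous-only checks on $plan")
  }

  def createQuery(plan: String, continuous: Boolean): Unit = {
    if (continuous) checkForContinuous(plan) else checkForStreaming(plan)
    // Before this change, the equivalent of the next line ran unconditionally,
    // so the streaming checks executed twice for continuous queries:
    // checkForStreaming(plan)
  }

  def main(args: Array[String]): Unit = createQuery("myPlan", continuous = true)
}
```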

## How was this patch tested?

Existing unit tests in `StreamingQueryManagerSuite`

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/10110346/spark 
isUnsupportedOperationCheckEnabled

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/23251.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #23251


commit b1e71ee7a723d63f1cf3c0754f2372eb185439d3
Author: liuxian 
Date:   2018-12-07T03:08:26Z

fix




---




[GitHub] spark pull request #23228: [MINOR][DOC]The condition description of serializ...

2018-12-05 Thread 10110346
GitHub user 10110346 opened a pull request:

https://github.com/apache/spark/pull/23228

[MINOR][DOC]The condition description of serialized shuffle is not very 
accurate

## What changes were proposed in this pull request?
`1. The shuffle dependency specifies no aggregation or output ordering.`
If the shuffle dependency specifies aggregation but only aggregates on the reducer side, serialized shuffle can still be used.
`3. The shuffle produces fewer than 16777216 output partitions.`
If the number of output partitions is exactly 16777216, serialized shuffle can still be used.
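
A condensed sketch of the check these conditions paraphrase (simplified from `SortShuffleManager.canUseSerializedShuffle`; constants and structure abbreviated, not the exact source):

```scala
// Condensed sketch of the serialized-shuffle eligibility check, not the exact source.
def canUseSerializedShuffle(
    supportsRelocation: Boolean, // serializer supports relocation of serialized objects
    mapSideCombine: Boolean,     // map-side aggregation requested by the dependency
    numPartitions: Int): Boolean = {
  val maxPartitions = 1 << 24    // 16777216
  supportsRelocation &&
    !mapSideCombine &&             // reducer-side-only aggregation is still allowed
    numPartitions <= maxPartitions // exactly 16777216 partitions is still allowed
}
```
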
## How was this patch tested?
N/A


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/10110346/spark SerializedShuffle_doc

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/23228.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #23228


commit d5dadbf30d5429c36ec3d5c2845a71c2717fd6f3
Author: liuxian 
Date:   2018-12-05T08:55:20Z

fix




---




[GitHub] spark pull request #23216: [SPARK-26264][CORE]It is better to add @transient...

2018-12-04 Thread 10110346
Github user 10110346 closed the pull request at:

https://github.com/apache/spark/pull/23216


---




[GitHub] spark issue #23216: [SPARK-26264][CORE]It is better to add @transient to fie...

2018-12-04 Thread 10110346
Github user 10110346 commented on the issue:

https://github.com/apache/spark/pull/23216
  
Ok, I will close this PR, thank you very much


---




[GitHub] spark issue #23216: [SPARK-26264][CORE]It is better to add @transient to fie...

2018-12-04 Thread 10110346
Github user 10110346 commented on the issue:

https://github.com/apache/spark/pull/23216
  
> 
> 
> Are you sure it's even a field in the class? it looks like it's only used 
to define this:
> 
> ```
>   @transient private[this] val preferredLocs: Seq[TaskLocation] = {
> if (locs == null) Nil else locs.toSet.toSeq
>   }
> ```
> 
> I'd expect Scala would not generate a field. Indeed the thing it is used 
to make is transient.

Yeah, it would not generate a field, thanks @srowen.
By the way, would it be better to remove `@transient` for `ShuffleMapTask`?
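
For reference, a small self-contained illustration of the point above (generic example, not the actual `ResultTask` source): a constructor parameter used only to initialize another value does not become a field of the class, so there is nothing extra to serialize for it.

```scala
// Generic illustration, not the ResultTask source: `locs` is consumed only in the
// initializer of `preferredLocs`, so Scala keeps no `locs` field on the class and
// nothing for it is written when an instance is serialized.
class TaskSketch(locs: Seq[String]) extends Serializable {
  @transient private[this] val preferredLocs: Seq[String] =
    if (locs == null) Nil else locs.distinct

  // preferredLocs is transient, so guard against it being null after deserialization.
  def preferred: Seq[String] = if (preferredLocs == null) Nil else preferredLocs
}
```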


---




[GitHub] spark pull request #23216: [SPARK-26264][CORE]It is better to add @transient...

2018-12-04 Thread 10110346
GitHub user 10110346 opened a pull request:

https://github.com/apache/spark/pull/23216

[SPARK-26264][CORE]It is better to add @transient to field 'locs' for class 
`ResultTask`.

## What changes were proposed in this pull request?
The field 'locs' is only used on the driver side in class `ResultTask`, so it does not need to be serialized when sending a `ResultTask` to an executor.
Although it is not very large, it is serialized very frequently, so we can mark it `@transient`, as `ShuffleMapTask` does.


## How was this patch tested?
Existing unit tests


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/10110346/spark locs_transient

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/23216.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #23216


commit b3ede8be1a9073f057cc46fb82eacd7fa3ec36c6
Author: liuxian 
Date:   2018-12-04T08:55:40Z

fix




---




[GitHub] spark issue #23162: [MINOR][DOC] Correct some document description errors

2018-11-30 Thread 10110346
Github user 10110346 commented on the issue:

https://github.com/apache/spark/pull/23162
  
retest this please


---




[GitHub] spark pull request #23162: [MINOR][DOC] Correct some document description er...

2018-11-29 Thread 10110346
Github user 10110346 commented on a diff in the pull request:

https://github.com/apache/spark/pull/23162#discussion_r237713245
  
--- Diff: 
core/src/main/scala/org/apache/spark/internal/config/package.scala ---
@@ -430,8 +430,8 @@ package object config {
      .doc("The chunk size in bytes during writing out the bytes of ChunkedByteBuffer.")
      .bytesConf(ByteUnit.BYTE)
      .checkValue(_ <= ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH,
-        "The chunk size during writing out the bytes of" +
-        " ChunkedByteBuffer should not larger than Int.MaxValue - 15.")
+        "The chunk size during writing out the bytes of ChunkedByteBuffer should" +
+          s" not larger than ${ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH}.")
--- End diff --

ok, thanks


---




[GitHub] spark issue #23162: [MINOR][DOC] Correct some document description errors

2018-11-28 Thread 10110346
Github user 10110346 commented on the issue:

https://github.com/apache/spark/pull/23162
  
retest this please


---




[GitHub] spark pull request #21957: [SPARK-24994][SQL] When the data type of the fiel...

2018-11-27 Thread 10110346
Github user 10110346 commented on a diff in the pull request:

https://github.com/apache/spark/pull/21957#discussion_r236965962
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala 
---
@@ -269,7 +269,8 @@ case class FileSourceScanExec(
   }
 
   @transient
-  private val pushedDownFilters = dataFilters.flatMap(DataSourceStrategy.translateFilter)
+  private val pushedDownFilters = dataFilters.flatMap(DataSourceStrategy.
+    translateFilter(_, !relation.fileFormat.isInstanceOf[ParquetSource]))
--- End diff --

Thanks. Yeah, this is not a good solution; I can't find a better way to solve this problem right now.


---




[GitHub] spark issue #22590: [SPARK-25574][SQL]Add an option `keepQuotes` for parsing...

2018-11-27 Thread 10110346
Github user 10110346 commented on the issue:

https://github.com/apache/spark/pull/22590
  
@HyukjinKwon I think it is not critical, but our customers need this feature.
Yeah, it would be better to find a way to set arbitrary parser settings options.


---




[GitHub] spark pull request #23162: [MINOR][DOC] Correct some document description er...

2018-11-27 Thread 10110346
GitHub user 10110346 opened a pull request:

https://github.com/apache/spark/pull/23162

[MINOR][DOC] Correct some document description errors

## What changes were proposed in this pull request?

Correct some document description errors.

## How was this patch tested?
N/A


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/10110346/spark docerror

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/23162.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #23162


commit e9aba19b526610f3f31fa6a5b56140f6be8dc1c1
Author: liuxian 
Date:   2018-11-28T06:06:51Z

fix




---




[GitHub] spark issue #22163: [SPARK-25166][CORE]Reduce the number of write operations...

2018-11-27 Thread 10110346
Github user 10110346 commented on the issue:

https://github.com/apache/spark/pull/22163
  
cc @kiszk  @maropu 


---




[GitHub] spark issue #23154: [SPARK-26195][SQL] Correct exception messages in some cl...

2018-11-27 Thread 10110346
Github user 10110346 commented on the issue:

https://github.com/apache/spark/pull/23154
  
LGTM, thanks


---




[GitHub] spark pull request #23154: [SPARK-26195][SQL] Correct exception messages in ...

2018-11-27 Thread 10110346
Github user 10110346 commented on a diff in the pull request:

https://github.com/apache/spark/pull/23154#discussion_r236920634
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedRleValuesReader.java
 ---
@@ -510,42 +510,42 @@ public void readIntegers(int total, 
WritableColumnVector c, int rowId) {
 
   @Override
   public byte readByte() {
-throw new UnsupportedOperationException("only readInts is valid.");
+throw new UnsupportedOperationException("only readByte is valid.");
--- End diff --

These exception messages seem to be correct?


---




[GitHub] spark issue #22163: [SPARK-25166][CORE]Reduce the number of write operations...

2018-11-26 Thread 10110346
Github user 10110346 commented on the issue:

https://github.com/apache/spark/pull/22163
  
retest this please


---




[GitHub] spark issue #22779: [SPARK-25786][CORE]If the ByteBuffer.hasArray is false ,...

2018-11-17 Thread 10110346
Github user 10110346 commented on the issue:

https://github.com/apache/spark/pull/22779
  
@srowen Thanks, I am sorry, I am on holiday; I will update it next week. I am replying on my phone.


---




[GitHub] spark pull request #22723: [SPARK-25729][CORE]It is better to replace `minPa...

2018-11-04 Thread 10110346
Github user 10110346 commented on a diff in the pull request:

https://github.com/apache/spark/pull/22723#discussion_r230579427
  
--- Diff: 
core/src/main/scala/org/apache/spark/input/WholeTextFileInputFormat.scala ---
@@ -48,11 +50,11 @@ private[spark] class WholeTextFileInputFormat
* Allow minPartitions set by end-user in order to keep compatibility 
with old Hadoop API,
* which is set through setMaxSplitSize
*/
-  def setMinPartitions(context: JobContext, minPartitions: Int) {
+  def setMinPartitions(sc: SparkContext, context: JobContext, minPartitions: Int) {
--- End diff --

Ok, thanks


---




[GitHub] spark pull request #22723: [SPARK-25729][CORE]It is better to replace `minPa...

2018-11-04 Thread 10110346
Github user 10110346 commented on a diff in the pull request:

https://github.com/apache/spark/pull/22723#discussion_r230579423
  
--- Diff: 
core/src/main/scala/org/apache/spark/input/WholeTextFileInputFormat.scala ---
@@ -48,11 +50,11 @@ private[spark] class WholeTextFileInputFormat
* Allow minPartitions set by end-user in order to keep compatibility 
with old Hadoop API,
* which is set through setMaxSplitSize
*/
-  def setMinPartitions(context: JobContext, minPartitions: Int) {
+  def setMinPartitions(sc: SparkContext, context: JobContext, minPartitions: Int) {
     val files = listStatus(context).asScala
     val totalLen = files.map(file => if (file.isDirectory) 0L else file.getLen).sum
-    val maxSplitSize = Math.ceil(totalLen * 1.0 /
-      (if (minPartitions == 0) 1 else minPartitions)).toLong
+    val minPartNum = Math.max(sc.defaultParallelism, minPartitions)
+    val maxSplitSize = Math.ceil(totalLen * 1.0 / minPartNum).toLong
--- End diff --

Thanks, I have changed the description.
I think the number of partitions belongs to the input format, and it can also be used by other RDDs.


---




[GitHub] spark pull request #22723: [SPARK-25729][CORE]It is better to replace `minPa...

2018-11-04 Thread 10110346
Github user 10110346 commented on a diff in the pull request:

https://github.com/apache/spark/pull/22723#discussion_r230579084
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/WholeTextFileRDD.scala 
---
@@ -51,7 +51,7 @@ private[spark] class WholeTextFileRDD(
   case _ =>
 }
 val jobContext = new JobContextImpl(conf, jobId)
-inputFormat.setMinPartitions(jobContext, minPartitions)
+inputFormat.setMinPartitions(sc, jobContext, minPartitions)
--- End diff --

Yeah, thanks


---




[GitHub] spark pull request #22855: [SPARK-25839] [Core] Implement use of KryoPool in...

2018-10-29 Thread 10110346
Github user 10110346 commented on a diff in the pull request:

https://github.com/apache/spark/pull/22855#discussion_r228844982
  
--- Diff: core/src/main/scala/org/apache/spark/serializer/KryoSerializer.scala ---
@@ -298,30 +312,40 @@ class KryoDeserializationStream(
   }
 }
 
-private[spark] class KryoSerializerInstance(ks: KryoSerializer, useUnsafe: Boolean)
+private[spark] class KryoSerializerInstance(
+   ks: KryoSerializer, useUnsafe: Boolean, usePool: Boolean)
   extends SerializerInstance {
   /**
    * A re-used [[Kryo]] instance. Methods will borrow this instance by calling `borrowKryo()`, do
    * their work, then release the instance by calling `releaseKryo()`. Logically, this is a caching
    * pool of size one. SerializerInstances are not thread-safe, hence accesses to this field are
    * not synchronized.
    */
-  @Nullable private[this] var cachedKryo: Kryo = borrowKryo()
+  @Nullable private[this] var cachedKryo: Kryo = if (usePool) null else borrowKryo()
 
   /**
    * Borrows a [[Kryo]] instance. If possible, this tries to re-use a cached Kryo instance;
    * otherwise, it allocates a new instance.
    */
   private[serializer] def borrowKryo(): Kryo = {
-    if (cachedKryo != null) {
-      val kryo = cachedKryo
-      // As a defensive measure, call reset() to clear any Kryo state that might have been modified
-      // by the last operation to borrow this instance (see SPARK-7766 for discussion of this issue)
+    if (usePool) {
+      val kryo = ks.pool.borrow()
       kryo.reset()
-      cachedKryo = null
      kryo
     } else {
-      ks.newKryo()
+      if (cachedKryo != null) {
+        val kryo = cachedKryo
+        /**
+        * As a defensive measure, call reset() to clear any Kryo state that might have
--- End diff --

The `*` after the first line must be aligned with the first `*` of the first line.


---




[GitHub] spark issue #22723: [SPARK-25729][CORE]It is better to replace `minPartition...

2018-10-28 Thread 10110346
Github user 10110346 commented on the issue:

https://github.com/apache/spark/pull/22723
  
retest this please 


---




[GitHub] spark issue #22723: [SPARK-25729][CORE]It is better to replace `minPartition...

2018-10-28 Thread 10110346
Github user 10110346 commented on the issue:

https://github.com/apache/spark/pull/22723
  
Thanks, yes, you are right.
After you reminded me, I realized there are other places, such as `HadoopRDD`.
But I wonder if it would be better to just modify `WholeTextFileInputFormat`, like `StreamFileInputFormat`?



---




[GitHub] spark issue #22802: [SPARK-25806][SQL]The instance of FileSplit is redundant

2018-10-28 Thread 10110346
Github user 10110346 commented on the issue:

https://github.com/apache/spark/pull/22802
  
@srowen Thanks, I have checked them all and updated the PR.


---




[GitHub] spark issue #22723: [SPARK-25729][CORE]It is better to replace `minPartition...

2018-10-25 Thread 10110346
Github user 10110346 commented on the issue:

https://github.com/apache/spark/pull/22723
  
In fact, `BinaryFileRDD` uses `max(defaultParallelism, minPartitions)`:
`BinaryFileRDD --->setMinPartitions--->Math.max(sc.defaultParallelism, 
minPartitions)`.

In addition, for this unit test:

https://github.com/apache/spark/blob/b80bf66a8109faa7f58d45b92417a981666866a0/core/src/test/scala/org/apache/spark/FileSuite.scala#L304
if we set `spark.default.parallelism=3`, the result will not be what we 
expected.



---




[GitHub] spark issue #22723: [SPARK-25729][CORE]It is better to replace `minPartition...

2018-10-24 Thread 10110346
Github user 10110346 commented on the issue:

https://github.com/apache/spark/pull/22723
  
What you say is reasonable.
But from the perspective of resource utilization, I think it is better to replace `minPartitions` with `defaultParallelism`; see `BinaryFileRDD` and `createNonBucketedReadRDD`, both of which do it this way.


---




[GitHub] spark pull request #22754: [SPARK-25776][CORE]The disk write buffer size mus...

2018-10-24 Thread 10110346
Github user 10110346 commented on a diff in the pull request:

https://github.com/apache/spark/pull/22754#discussion_r228009755
  
--- Diff: 
core/src/main/scala/org/apache/spark/internal/config/package.scala ---
@@ -495,8 +495,8 @@ package object config {
     ConfigBuilder("spark.shuffle.spill.diskWriteBufferSize")
       .doc("The buffer size, in bytes, to use when writing the sorted records to an on-disk file.")
       .bytesConf(ByteUnit.BYTE)
-      .checkValue(v => v > 0 && v <= Int.MaxValue,
-        s"The buffer size must be greater than 0 and less than ${Int.MaxValue}.")
+      .checkValue(v => v > 12 && v <= Int.MaxValue,
+        s"The buffer size must be greater than 12 and less than ${Int.MaxValue}.")
--- End diff --

Yea, I think this value is better


---




[GitHub] spark issue #22754: [SPARK-25776][CORE]The disk write buffer size must be gr...

2018-10-23 Thread 10110346
Github user 10110346 commented on the issue:

https://github.com/apache/spark/pull/22754
  
retest this please


---




[GitHub] spark issue #22754: [SPARK-25776][CORE]The disk write buffer size must be gr...

2018-10-23 Thread 10110346
Github user 10110346 commented on the issue:

https://github.com/apache/spark/pull/22754
  
retest this please


---




[GitHub] spark pull request #22754: [SPARK-25776][CORE]The disk write buffer size mus...

2018-10-23 Thread 10110346
Github user 10110346 commented on a diff in the pull request:

https://github.com/apache/spark/pull/22754#discussion_r227392626
  
--- Diff: 
core/src/main/scala/org/apache/spark/internal/config/package.scala ---
@@ -495,8 +495,8 @@ package object config {
     ConfigBuilder("spark.shuffle.spill.diskWriteBufferSize")
       .doc("The buffer size, in bytes, to use when writing the sorted records to an on-disk file.")
       .bytesConf(ByteUnit.BYTE)
-      .checkValue(v => v > 0 && v <= Int.MaxValue,
-        s"The buffer size must be greater than 0 and less than ${Int.MaxValue}.")
+      .checkValue(v => v > 12 && v <= Int.MaxValue,
+        s"The buffer size must be greater than 12 and less than ${Int.MaxValue}.")
--- End diff --

So strange, I have handled it, but we cannot see the change here.


---




[GitHub] spark pull request #22802: [SPARK-25806][SQL][MINOR]The instanceof FileSplit...

2018-10-23 Thread 10110346
GitHub user 10110346 opened a pull request:

https://github.com/apache/spark/pull/22802

[SPARK-25806][SQL][MINOR] The instance of FileSplit is redundant for ParquetFileFormat

## What changes were proposed in this pull request?

 The instance of `FileSplit` is redundant for 
`buildReaderWithPartitionValues`
in the `ParquetFileFormat` class.


## How was this patch tested?
Existing unit tests in `ParquetQuerySuite.scala`

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/10110346/spark FileSplitnotneed

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22802.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22802


commit 52140fa64adb47047b2cb10377799a6b9fd3ab73
Author: liuxian 
Date:   2018-10-23T06:43:51Z

fix




---




[GitHub] spark issue #22754: [SPARK-25776][CORE][MINOR]The disk write buffer size mus...

2018-10-22 Thread 10110346
Github user 10110346 commented on the issue:

https://github.com/apache/spark/pull/22754
  
retest this please


---




[GitHub] spark pull request #22754: [SPARK-25776][CORE][MINOR]The disk write buffer s...

2018-10-22 Thread 10110346
Github user 10110346 commented on a diff in the pull request:

https://github.com/apache/spark/pull/22754#discussion_r227200492
  
--- Diff: 
core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeSorterSpillWriter.java
 ---
@@ -62,6 +62,8 @@ public UnsafeSorterSpillWriter(
   int fileBufferSize,
   ShuffleWriteMetrics writeMetrics,
   int numRecordsToWrite) throws IOException {
+// Space used by prefix + len + recordLength is more than 4 + 8 bytes
+assert (diskWriteBufferSize > 12);
--- End diff --

Yes, it can guarantee this.
This explains why it must be greater than 12.



---




[GitHub] spark issue #22723: [SPARK-25729][CORE]It is better to replace `minPartition...

2018-10-22 Thread 10110346
Github user 10110346 commented on the issue:

https://github.com/apache/spark/pull/22723
  
retest this please


---




[GitHub] spark pull request #22754: [SPARK-25776][CORE][MINOR]The disk write buffer s...

2018-10-22 Thread 10110346
Github user 10110346 commented on a diff in the pull request:

https://github.com/apache/spark/pull/22754#discussion_r227191411
  
--- Diff: 
core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeSorterSpillWriter.java
 ---
@@ -62,6 +62,8 @@ public UnsafeSorterSpillWriter(
   int fileBufferSize,
   ShuffleWriteMetrics writeMetrics,
   int numRecordsToWrite) throws IOException {
+// Space used by prefix + len + recordLength is more than 4 + 8 bytes
+assert (diskWriteBufferSize > 12);
--- End diff --

I am not sure either, but I see many places (`BitSetMethods.java`, `HeapMemoryAllocator.java`, `LongArray.java`) that use it like this.


---




[GitHub] spark issue #22754: [SPARK-25776][CORE][MINOR]The disk write buffer size mus...

2018-10-21 Thread 10110346
Github user 10110346 commented on the issue:

https://github.com/apache/spark/pull/22754
  
Thank you for your review, I will update it @kiszk 


---




[GitHub] spark issue #22779: [SPARK-25786][CORE]If the ByteBuffer.hasArray is false ,...

2018-10-20 Thread 10110346
Github user 10110346 commented on the issue:

https://github.com/apache/spark/pull/22779
  
retest this please


---




[GitHub] spark issue #22779: [SPARK-25786][CORE]If the ByteBuffer.hasArray is false ,...

2018-10-20 Thread 10110346
Github user 10110346 commented on the issue:

https://github.com/apache/spark/pull/22779
  
retest this please


---




[GitHub] spark issue #22779: [SPARK-25786][CORE]If the ByteBuffer.hasArray is false ,...

2018-10-20 Thread 10110346
Github user 10110346 commented on the issue:

https://github.com/apache/spark/pull/22779
  
retest this please


---




[GitHub] spark pull request #22779: [SPARK-25786][CORE]If the ByteBuffer.hasArray is ...

2018-10-19 Thread 10110346
GitHub user 10110346 opened a pull request:

https://github.com/apache/spark/pull/22779

[SPARK-25786][CORE] If ByteBuffer.hasArray is false, it will throw UnsupportedOperationException for Kryo

## What changes were proposed in this pull request?
In `deserialize` for Kryo, the input parameter is a `ByteBuffer`; if it is not backed by an accessible byte array, it will throw `UnsupportedOperationException`.

Exception Info:
```
java.lang.UnsupportedOperationException was thrown.
java.lang.UnsupportedOperationException
at java.nio.ByteBuffer.array(ByteBuffer.java:994)
at 
org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:362)
```
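
A minimal sketch of how to read such a buffer without triggering the exception (a generic helper under my own naming, not the actual `KryoSerializerInstance` change):

```scala
import java.nio.ByteBuffer
import java.util.Arrays

// Sketch: get the bytes of a ByteBuffer without calling array(), which throws
// UnsupportedOperationException for direct (non-heap) buffers where hasArray is false.
object ByteBufferBytes {
  def bytesOf(buf: ByteBuffer): Array[Byte] =
    if (buf.hasArray) {
      // Heap buffer: the backing array is accessible.
      Arrays.copyOfRange(buf.array(), buf.arrayOffset() + buf.position(), buf.arrayOffset() + buf.limit())
    } else {
      // Direct buffer: copy through the ByteBuffer API instead of array().
      val copy = new Array[Byte](buf.remaining())
      buf.duplicate().get(copy)
      copy
    }

  def main(args: Array[String]): Unit = {
    val direct = ByteBuffer.allocateDirect(4).put(Array[Byte](1, 2, 3, 4))
    direct.flip()
    println(bytesOf(direct).mkString(",")) // 1,2,3,4 -- no UnsupportedOperationException
  }
}
```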

## How was this patch tested?

Added a unit test


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/10110346/spark InputStreamKryo

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22779.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22779


commit 943e3988dcb70d17e65b5e508f6f35b87fc71d28
Author: liuxian 
Date:   2018-10-19T11:08:10Z

fix




---




[GitHub] spark issue #22725: [SPARK-25753][[CORE][FOLLOW-UP]fix reading small files v...

2018-10-19 Thread 10110346
Github user 10110346 commented on the issue:

https://github.com/apache/spark/pull/22725
  
ok, thanks @dongjoon-hyun


---




[GitHub] spark issue #22774: [SPARK-25780][CORE]Scheduling the tasks which have no hi...

2018-10-19 Thread 10110346
Github user 10110346 commented on the issue:

https://github.com/apache/spark/pull/22774
  
retest this please


---




[GitHub] spark pull request #22774: [SPARK-25780][CORE]Scheduling the tasks which hav...

2018-10-19 Thread 10110346
GitHub user 10110346 opened a pull request:

https://github.com/apache/spark/pull/22774

[SPARK-25780][CORE]Scheduling the tasks which have no higher level locality 
first

## What changes were proposed in this pull request?
For example:
An application has two executors, (exec1, host1) and (exec2, host2), and 3 tasks with these locality preferences: {task0, Seq(TaskLocation("host1", "exec1"))}, {task1, Seq(TaskLocation("host1", "exec1"), TaskLocation("host2"))}, {task2, Seq(TaskLocation("host2"))}

If task0 is running on exec1, then when `allowedLocality` is NODE_LOCAL for exec2, it is better to schedule task2 first, not task1, because task1 may still be scheduled onto exec1 later.
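
A self-contained sketch of the scheduling idea (hypothetical types, not the real `TaskSetManager`): when offering a NODE_LOCAL slot on host2, prefer the task whose only preference is host2 (task2) over a task that also has an executor-level preference on exec1 (task1), since task1 may still get a PROCESS_LOCAL slot later.

```scala
// Hypothetical sketch, not the real TaskSetManager: for a node-local offer, prefer
// pending tasks with fewer higher-level (executor-specific) locality preferences.
object LocalityPickSketch {
  case class Pref(host: String, executor: Option[String] = None)
  case class Task(id: Int, prefs: Seq[Pref])

  def pickNodeLocal(host: String, pending: Seq[Task]): Option[Task] = {
    val onHost = pending.filter(_.prefs.exists(_.host == host))
    // Tasks with fewer executor-level (higher-locality) preferences come first.
    onHost.sortBy(_.prefs.count(_.executor.isDefined)).headOption
  }

  def main(args: Array[String]): Unit = {
    val task1 = Task(1, Seq(Pref("host1", Some("exec1")), Pref("host2")))
    val task2 = Task(2, Seq(Pref("host2")))
    println(pickNodeLocal("host2", Seq(task1, task2))) // picks task2 first
  }
}
```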

## How was this patch tested?
Added a UT


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/10110346/spark higher_locality

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22774.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22774


commit 7076bdef5c633739a17e6e9f7ed0c80ed5cb11de
Author: liuxian 
Date:   2018-10-19T06:36:30Z

fix




---




[GitHub] spark issue #22754: [MINOR][CORE]The disk write buffer size must be greater ...

2018-10-18 Thread 10110346
Github user 10110346 commented on the issue:

https://github.com/apache/spark/pull/22754
  
@kiszk Thanks, I will create a JIRA.


---




[GitHub] spark issue #22754: [MINOR][CORE]The disk write buffer size must be greater ...

2018-10-18 Thread 10110346
Github user 10110346 commented on the issue:

https://github.com/apache/spark/pull/22754
  
If we set this to 12, `freeSpaceInWriteBuffer` will be 0, and the length passed to `copyMemory` will always be 0, so setting this property to 12 should not be allowed.
   
https://github.com/apache/spark/blob/dfc4b65088602ee45e0babe22e64e205fab3e82b/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeSorterSpillWriter.java#L128


---




[GitHub] spark pull request #22754: [CORE][MINOR]The disk write buffer size must be g...

2018-10-17 Thread 10110346
GitHub user 10110346 opened a pull request:

https://github.com/apache/spark/pull/22754

[CORE][MINOR]The disk write buffer size must be greater than 12

## What changes were proposed in this pull request?

In `UnsafeSorterSpillWriter.java`, when we write a record to a spill file with `void write(Object baseObject, long baseOffset, int recordLength, long keyPrefix)`, `recordLength` and `keyPrefix` are written to the disk write buffer first, and they take 12 bytes, so the disk write buffer size must be greater than 12.
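
A plain-arithmetic sketch of why 12 is the lower bound (simplified, not the actual `UnsafeSorterSpillWriter` code):

```scala
// Each record starts with a 4-byte length and an 8-byte key prefix written into the
// in-memory write buffer, so only diskWriteBufferSize - 12 bytes remain for record data.
val recordLengthBytes = 4 // written as an int
val keyPrefixBytes = 8    // written as a long

def freeSpaceForRecordData(diskWriteBufferSize: Int): Int =
  diskWriteBufferSize - recordLengthBytes - keyPrefixBytes

// With diskWriteBufferSize = 12 there is no room left for record data, so the copy
// loop would copy 0 bytes per iteration and never make progress.
println(freeSpaceForRecordData(12))          // 0
println(freeSpaceForRecordData(1024 * 1024)) // a 1 MiB buffer leaves plenty of room
```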

## How was this patch tested?
Existing UT in `UnsafeExternalSorterSuite`


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/10110346/spark diskWriteBufferSize

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22754.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22754


commit dfc4b65088602ee45e0babe22e64e205fab3e82b
Author: liuxian 
Date:   2018-10-17T06:03:00Z

fix




---




[GitHub] spark issue #22725: [SPARK-24610][[CORE][FOLLOW-UP]fix reading small files v...

2018-10-16 Thread 10110346
Github user 10110346 commented on the issue:

https://github.com/apache/spark/pull/22725
  
@tgravescs ok, I will do it, thanks


---




[GitHub] spark issue #22723: [SPARK-25729][CORE]It is better to replace `minPartition...

2018-10-15 Thread 10110346
Github user 10110346 commented on the issue:

https://github.com/apache/spark/pull/22723
  
cc @gatorsmile 


---




[GitHub] spark issue #22725: [SPARK-24610][[CORE][FOLLOW-UP]fix reading small files v...

2018-10-15 Thread 10110346
Github user 10110346 commented on the issue:

https://github.com/apache/spark/pull/22725
  
cc @dhruve @tgravescs


---




[GitHub] spark pull request #22731: [SPARK-25674][FOLLOW-UP] Update the stats for eac...

2018-10-15 Thread 10110346
Github user 10110346 commented on a diff in the pull request:

https://github.com/apache/spark/pull/22731#discussion_r225362470
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala
 ---
@@ -106,15 +106,16 @@ class FileScanRDD(
 // don't need to run this `if` for every record.
 val preNumRecordsRead = inputMetrics.recordsRead
 if (nextElement.isInstanceOf[ColumnarBatch]) {
+  incTaskInputMetricsBytesRead()
--- End diff --

Considering that the default value of `spark.sql.parquet.columnarReaderBatchSize` is 4096, this change is better.


---




[GitHub] spark pull request #22725: [SPARK-24610][[CORE][FOLLOW-UP]fix reading small ...

2018-10-15 Thread 10110346
GitHub user 10110346 opened a pull request:

https://github.com/apache/spark/pull/22725

[SPARK-24610][CORE][FOLLOW-UP] Fix reading small files via BinaryFileRDD

## What changes were proposed in this pull request?

This is a follow-up of #21601; `StreamFileInputFormat` and `WholeTextFileInputFormat` have the same problem.

```
Minimum split size pernode 5123456 cannot be larger than maximum split size 4194304
java.io.IOException: Minimum split size pernode 5123456 cannot be larger than maximum split size 4194304
	at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:201)
	at org.apache.spark.rdd.BinaryFileRDD.getPartitions(BinaryFileRDD.scala:52)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:254)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
	at scala.Option.getOrElse(Option.scala:121)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:252)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2138)
```
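
A minimal sketch of the kind of guard this follow-up is about (a hypothetical `CombineFileInputFormat` subclass, not the exact Spark change): when computing `maxSplitSize` from the total input length, also cap the configured per-node and per-rack minimum split sizes so `getSplits` does not reject the configuration.

```scala
import org.apache.hadoop.mapreduce.JobContext
import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat

// Hypothetical subclass, not the exact Spark fix: never let the configured per-node /
// per-rack minimum split sizes exceed the maximum split size we just computed, which
// is what triggers "Minimum split size per node ... larger than maximum split size".
abstract class SafeSplitsInputFormat[K, V] extends CombineFileInputFormat[K, V] {
  def setSafeMaxSplitSize(context: JobContext, maxSplitSize: Long): Unit = {
    val conf = context.getConfiguration
    val minPerNode = conf.getLong("mapreduce.input.fileinputformat.split.minsize.per.node", 0L)
    val minPerRack = conf.getLong("mapreduce.input.fileinputformat.split.minsize.per.rack", 0L)
    super.setMaxSplitSize(maxSplitSize)
    super.setMinSplitSizeNode(math.min(minPerNode, maxSplitSize))
    super.setMinSplitSizeRack(math.min(minPerRack, maxSplitSize))
  }
}
```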

## How was this patch tested?
Added a unit test


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/10110346/spark maxSplitSize_node_rack

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22725.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22725


commit 54ffcdb7a18471a7a24fe36a000ca0cc4e8d0eba
Author: liuxian 
Date:   2018-10-15T07:28:31Z

fix




---




[GitHub] spark pull request #22723: [SPARK-25729][CORE]It is better to replace `minPa...

2018-10-15 Thread 10110346
GitHub user 10110346 opened a pull request:

https://github.com/apache/spark/pull/22723

[SPARK-25729][CORE]It is better to replace `minPartitions` with 
`defaultParallelism` , when `minPartitions` is less than `defaultParallelism`

## What changes were proposed in this pull request?

In `WholeTextFileRDD`, when `minPartitions` is less than `defaultParallelism`, it is better to replace `minPartitions` with `defaultParallelism`, because this makes better use of resources and improves concurrency.
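
A plain-arithmetic sketch of the effect (illustrative numbers, not the actual `WholeTextFileInputFormat` code):

```scala
// With defaultParallelism = 8 and, say, minPartitions = 2, taking the max yields
// a smaller maxSplitSize and therefore more, smaller partitions.
val totalLen = 1024L * 1024 * 1024 // 1 GiB of small files in total
val minPartitions = 2
val defaultParallelism = 8

val oldMaxSplitSize = math.ceil(totalLen * 1.0 / minPartitions).toLong
val newMaxSplitSize = math.ceil(totalLen * 1.0 / math.max(defaultParallelism, minPartitions)).toLong

println(s"old maxSplitSize = $oldMaxSplitSize (about 2 partitions)")
println(s"new maxSplitSize = $newMaxSplitSize (about 8 partitions)")
```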


## How was this patch tested?
Added a unit test


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/10110346/spark minPartNum

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22723.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22723


commit b80bf66a8109faa7f58d45b92417a981666866a0
Author: liuxian 
Date:   2018-10-15T06:39:17Z

fix




---




[GitHub] spark pull request #22594: [SPARK-25674][SQL] If the records are incremented...

2018-10-10 Thread 10110346
Github user 10110346 commented on a diff in the pull request:

https://github.com/apache/spark/pull/22594#discussion_r224286853
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala
 ---
@@ -70,6 +70,8 @@ class FileScanRDD(
 
   private val ignoreCorruptFiles = 
sparkSession.sessionState.conf.ignoreCorruptFiles
   private val ignoreMissingFiles = 
sparkSession.sessionState.conf.ignoreMissingFiles
+  // only for test
+  private val inputMetricsTest = 
sparkSession.sessionState.conf.contains("spark.inputmetrics.test")
--- End diff --

If this place is controlled by `spark.testing`, other unit tests may fail.
Yeah, I agree with you; this is a simple change, so it is better to drop this.
Thanks @srowen


---




[GitHub] spark issue #22590: [SPARK-25574][SQL]Add an option `keepQuotes` for parsing...

2018-10-09 Thread 10110346
Github user 10110346 commented on the issue:

https://github.com/apache/spark/pull/22590
  
Normally, it's better to have no quotes, but in our production environment, 
the user requests quotes to be displayed, so we need this option.


---




[GitHub] spark pull request #22594: [SPARK-25674][SQL] If the records are incremented...

2018-10-09 Thread 10110346
Github user 10110346 commented on a diff in the pull request:

https://github.com/apache/spark/pull/22594#discussion_r223612070
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/metric/SQLMetricsSuite.scala
 ---
@@ -570,4 +572,33 @@ class SQLMetricsSuite extends SparkFunSuite with 
SQLMetricsTestUtils with Shared
   }
 }
   }
+
+  test("InputMetrics---bytesRead") {
--- End diff --

It is too hard; the test needs to involve `ColumnarBatch`.
In addition, we must capture `bytesRead` during execution, not at task end.


---




[GitHub] spark pull request #22590: [SPARK-25574][SQL]Add an option `keepQuotes` for ...

2018-10-09 Thread 10110346
Github user 10110346 commented on a diff in the pull request:

https://github.com/apache/spark/pull/22590#discussion_r223590113
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
 ---
@@ -194,6 +195,22 @@ class CSVSuite extends QueryTest with SharedSQLContext 
with SQLTestUtils with Te
 checkAnswer(rows, expectedRows)
   }
 
+  test("keep escaped quotes") {
+val notKeepQuotes = spark.read
+  .format("csv")
+  .option("keepQuotes", false)
+  .load(testFile(keepQuotesFile))
+var expectedRows = Seq(Row("\"a\"b", "ccc", null, "ddd"), Row("ab", 
"cc", null, "c,ddd"))
--- End diff --

ok, thanks


---




[GitHub] spark issue #22594: [MINOR][SQL] When batch reading, the number of bytes can...

2018-10-04 Thread 10110346
Github user 10110346 commented on the issue:

https://github.com/apache/spark/pull/22594
  
@srowen Yes, I will update, thanks


---




[GitHub] spark pull request #22594: [MINOR][SQL] When batch reading, the number of by...

2018-09-30 Thread 10110346
GitHub user 10110346 opened a pull request:

https://github.com/apache/spark/pull/22594

[MINOR][SQL] When batch reading, the number of bytes cannot be updated as expected.

## What changes were proposed in this pull request?

When batch reading, the number of bytes cannot be updated as expected.

## How was this patch tested?
N/A


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/10110346/spark inputMetrics

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22594.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22594


commit e589e1ef83418a485c9d55a72209c0c86cf7b044
Author: liuxian 
Date:   2018-09-30T09:14:20Z

fix




---




[GitHub] spark pull request #22590: [SPARK-25574][SQL]Add an option `keepQuotes` for ...

2018-09-29 Thread 10110346
GitHub user 10110346 opened a pull request:

https://github.com/apache/spark/pull/22590

[SPARK-25574][SQL]Add an option `keepQuotes` for parsing csv file

## What changes were proposed in this pull request?

In this PR, I added a new option for CSV files - `keepQuotes`.
In our project, when we read a CSV file, we want to keep the quotes.

For example:
We have such a record in the CSV file:
`ab,cc,,"c,ddd"`

We hope it displays like this:

+-----+-----+------+-----------+
| _c0 | _c1 | _c2  | _c3       |
+-----+-----+------+-----------+
| ab  | cc  | null | `"c,ddd"` |
+-----+-----+------+-----------+

Not like this:

+-----+-----+------+-------+
| _c0 | _c1 | _c2  | _c3   |
+-----+-----+------+-------+
| ab  | cc  | null | c,ddd |
+-----+-----+------+-------+
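
A usage sketch of the option proposed here (`keepQuotes` is the name proposed in this PR, not an existing Spark CSV option; assumes a `spark` session as in the shell):

```scala
// Read with the proposed option enabled: the quoted field is kept verbatim,
// so _c3 stays as "c,ddd" instead of being unquoted to c,ddd.
val withQuotes = spark.read
  .format("csv")
  .option("keepQuotes", true)
  .load("/path/to/file.csv")
withQuotes.show()
```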



## How was this patch tested?
Added a unit test.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/10110346/spark keepquotes

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22590.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22590


commit 9c46a72517e5235e10ba0325b63817eefe5d71dd
Author: liuxian 
Date:   2018-09-29T07:15:47Z

fix




---




[GitHub] spark issue #22163: [SPARK-25166][CORE]Reduce the number of write operations...

2018-09-20 Thread 10110346
Github user 10110346 commented on the issue:

https://github.com/apache/spark/pull/22163
  
retest this please


---




[GitHub] spark pull request #22358: [SPARK-25366][SQL]Zstd and brotli CompressionCode...

2018-09-09 Thread 10110346
Github user 10110346 commented on a diff in the pull request:

https://github.com/apache/spark/pull/22358#discussion_r216180370
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -398,10 +398,10 @@ object SQLConf {
   "`parquet.compression` is specified in the table-specific 
options/properties, the " +
   "precedence would be `compression`, `parquet.compression`, " +
   "`spark.sql.parquet.compression.codec`. Acceptable values include: 
none, uncompressed, " +
-  "snappy, gzip, lzo, brotli, lz4, zstd.")
+  "snappy, gzip, lzo, lz4.")
 .stringConf
 .transform(_.toLowerCase(Locale.ROOT))
-.checkValues(Set("none", "uncompressed", "snappy", "gzip", "lzo", 
"lz4", "brotli", "zstd"))
--- End diff --

I agree with you, removing is not a good idea.
Thanks.


---




[GitHub] spark pull request #22358: [SPARK-25366][SQL]Zstd and brotil CompressionCode...

2018-09-07 Thread 10110346
Github user 10110346 commented on a diff in the pull request:

https://github.com/apache/spark/pull/22358#discussion_r215901781
  
--- Diff: docs/sql-programming-guide.md ---
@@ -964,7 +964,7 @@ Configuration of Parquet can be done using the 
`setConf` method on `SparkSession
 Sets the compression codec used when writing Parquet files. If either 
`compression` or
 `parquet.compression` is specified in the table-specific 
options/properties, the precedence would be
 `compression`, `parquet.compression`, 
`spark.sql.parquet.compression.codec`. Acceptable values include:
-none, uncompressed, snappy, gzip, lzo, brotli, lz4, zstd.
--- End diff --

got it,thanks @wangyum 


---




[GitHub] spark issue #22358: [SPARK-25366][SQL]Zstd and brotil CompressionCodec are n...

2018-09-07 Thread 10110346
Github user 10110346 commented on the issue:

https://github.com/apache/spark/pull/22358
  
Yeah, the error message is output from an external jar (parquet-common-1.10.0.jar).
I think Spark + Parquet should avoid these Hadoop dependencies.


---




[GitHub] spark issue #22358: [SPARK-25366][SQL]Zstd and brotil CompressionCodec are n...

2018-09-07 Thread 10110346
Github user 10110346 commented on the issue:

https://github.com/apache/spark/pull/22358
  
Thanks. If the codecs can be found, we support those compressions, but how do I detect that? @HyukjinKwon 


---




[GitHub] spark pull request #22358: [SPARK-25366][SQL]Zstd and brotil CompressionCode...

2018-09-07 Thread 10110346
Github user 10110346 commented on a diff in the pull request:

https://github.com/apache/spark/pull/22358#discussion_r215887803
  
--- Diff: docs/sql-programming-guide.md ---
@@ -964,7 +964,7 @@ Configuration of Parquet can be done using the 
`setConf` method on `SparkSession
 Sets the compression codec used when writing Parquet files. If either 
`compression` or
 `parquet.compression` is specified in the table-specific 
options/properties, the precedence would be
 `compression`, `parquet.compression`, 
`spark.sql.parquet.compression.codec`. Acceptable values include:
-none, uncompressed, snappy, gzip, lzo, brotli, lz4, zstd.
--- End diff --

Installation may not be able to solve it.


---




[GitHub] spark issue #22358: [SPARK-25366][SQL]Zstd and brotil CompressionCodec are n...

2018-09-07 Thread 10110346
Github user 10110346 commented on the issue:

https://github.com/apache/spark/pull/22358
  
Parquet uses reflection to load the Hadoop compression codec classes, which are not in the supplied dependencies (hadoop-common-2.6.5.jar, hadoop-common-2.7.0.jar, hadoop-common-3.1.0.jar):

`BROTLI("org.apache.hadoop.io.compress.BrotliCodec", CompressionCodec.BROTLI, ".br"),
ZSTD("org.apache.hadoop.io.compress.ZStandardCodec", CompressionCodec.ZSTD, ".zstd");`


---




[GitHub] spark pull request #22358: [SPARK-25366][SQL]Zstd and brotil CompressionCode...

2018-09-07 Thread 10110346
GitHub user 10110346 opened a pull request:

https://github.com/apache/spark/pull/22358

[SPARK-25366][SQL]Zstd and brotli CompressionCodec are not supported for parquet files

## What changes were proposed in this pull request?
Hadoop 2.6 and Hadoop 2.7 do not contain the zstd and brotli compression codecs, and Hadoop 3.1 only contains the zstd codec.
So I think we should remove zstd and brotli for the time being.
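A minimal sketch for reproducing the errors below (the output path is a placeholder; either codec value triggers the corresponding failure):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// "brotli" reproduces the first stack trace, "zstd" the second.
spark.conf.set("spark.sql.parquet.compression.codec", "brotli")

// Fails at write time because the Hadoop codec class is not on the classpath.
spark.range(10).write.mode("overwrite").parquet("/tmp/parquet-codec-test")
```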

**Set `spark.sql.parquet.compression.codec=brotli`:**
Caused by: org.apache.parquet.hadoop.BadConfigurationException: Class org.apache.hadoop.io.compress.BrotliCodec was not found
    at org.apache.parquet.hadoop.CodecFactory.getCodec(CodecFactory.java:235)
    at org.apache.parquet.hadoop.CodecFactory$HeapBytesCompressor.<init>(CodecFactory.java:142)
    at org.apache.parquet.hadoop.CodecFactory.createCompressor(CodecFactory.java:206)
    at org.apache.parquet.hadoop.CodecFactory.getCompressor(CodecFactory.java:189)
    at org.apache.parquet.hadoop.ParquetRecordWriter.<init>(ParquetRecordWriter.java:153)
    at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:411)
    at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:349)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.<init>(ParquetOutputWriter.scala:37)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anon$1.newInstance(ParquetFileFormat.scala:161)


**Set `spark.sql.parquet.compression.codec=zstd`:**
Caused by: org.apache.parquet.hadoop.BadConfigurationException: Class org.apache.hadoop.io.compress.ZStandardCodec was not found
    at org.apache.parquet.hadoop.CodecFactory.getCodec(CodecFactory.java:235)
    at org.apache.parquet.hadoop.CodecFactory$HeapBytesCompressor.<init>(CodecFactory.java:142)
    at org.apache.parquet.hadoop.CodecFactory.createCompressor(CodecFactory.java:206)
    at org.apache.parquet.hadoop.CodecFactory.getCompressor(CodecFactory.java:189)
    at org.apache.parquet.hadoop.ParquetRecordWriter.<init>(ParquetRecordWriter.java:153)
    at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:411)
    at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:349)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.<init>(ParquetOutputWriter.scala:37)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anon$1.newInstance(ParquetFileFormat.scala:161)

## How was this patch tested?
Existing unit tests.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/10110346/spark notsupportzstdandbrotil

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22358.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22358


commit 1db036ad725bc7a3c60dbb9aede0f91cf0d798d0
Author: liuxian 
Date:   2018-09-07T07:12:36Z

fix




---




[GitHub] spark pull request #22350: [SPARK-25356][SQL]Add Parquet block size option t...

2018-09-06 Thread 10110346
Github user 10110346 closed the pull request at:

https://github.com/apache/spark/pull/22350


---




[GitHub] spark pull request #22350: [SPARK-25356][SQL]Add Parquet block size option t...

2018-09-06 Thread 10110346
Github user 10110346 commented on a diff in the pull request:

https://github.com/apache/spark/pull/22350#discussion_r215819785
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala
 ---
@@ -123,6 +123,9 @@ class ParquetFileFormat
 // Sets compression scheme
 conf.set(ParquetOutputFormat.COMPRESSION, 
parquetOptions.compressionCodecClassName)
 
+// Sets Parquet block size
+conf.setInt(ParquetOutputFormat.BLOCK_SIZE, 
sparkSession.sessionState.conf.parquetBlockSize)
--- End diff --

Sounds reasonable. I will close it now, thanks.


---




[GitHub] spark pull request #22350: [SPARK-25356][SQL]Add Parquet block size option t...

2018-09-06 Thread 10110346
Github user 10110346 commented on a diff in the pull request:

https://github.com/apache/spark/pull/22350#discussion_r215598798
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala
 ---
@@ -123,6 +123,9 @@ class ParquetFileFormat
 // Sets compression scheme
 conf.set(ParquetOutputFormat.COMPRESSION, 
parquetOptions.compressionCodecClassName)
 
+// Sets Parquet block size
+conf.setInt(ParquetOutputFormat.BLOCK_SIZE, 
sparkSession.sessionState.conf.parquetBlockSize)
--- End diff --

Yes, we are already able to set this via `parquet.block.size`.
I think we should document this parameter in "sql-programming-guide.md".
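
For reference, a minimal sketch of setting the Parquet block (row group) size today without a new SQL option; the 64 MB value and output path are placeholders, and the write-option form assumes data source options are copied into the Hadoop configuration used for the write:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
val blockSize = 64 * 1024 * 1024

// Either set it globally on the Hadoop configuration used for writes...
spark.sparkContext.hadoopConfiguration.setInt("parquet.block.size", blockSize)

// ...or pass it per write as a data source option.
spark.range(1000000).write
  .option("parquet.block.size", blockSize.toString)
  .mode("overwrite")
  .parquet("/tmp/parquet-block-size-test")
```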


---




[GitHub] spark pull request #22350: [SPARK-25356][SQL]Add Parquet block size option t...

2018-09-06 Thread 10110346
GitHub user 10110346 opened a pull request:

https://github.com/apache/spark/pull/22350

[SPARK-25356][SQL]Add Parquet block size  option to SparkSQL configuration

## What changes were proposed in this pull request?


I think we should be able to configure the Parquet block size when using the Parquet format.
For HDFS, `dfs.block.size` is configurable, and sometimes we want the Parquet block size to be consistent with it.
It is also worth considering whether `spark.sql.files.maxPartitionBytes` should be kept consistent with the Parquet block size when using the Parquet format.
Also, we may want to shrink the Parquet block size in some tests.

## How was this patch tested?
N/A


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/10110346/spark addblocksize

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22350.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22350


commit 3485b523d54e83ed3388febd06b3ac4914d181ed
Author: liuxian 
Date:   2018-09-06T10:35:43Z

fix




---




[GitHub] spark issue #22306: [SPARK-25300][CORE]Unified the configuration parameter `...

2018-09-03 Thread 10110346
Github user 10110346 commented on the issue:

https://github.com/apache/spark/pull/22306
  
retest this please


---




[GitHub] spark issue #22306: [SPARK-25300][CORE]Unified the configuration parameter `...

2018-09-02 Thread 10110346
Github user 10110346 commented on the issue:

https://github.com/apache/spark/pull/22306
  
Thanks, I will apply them to the test cases @kiszk 


---




[GitHub] spark pull request #22306: [SPARK-25300][CORE]Unified the configuration para...

2018-08-31 Thread 10110346
GitHub user 10110346 opened a pull request:

https://github.com/apache/spark/pull/22306

[SPARK-25300][CORE]Unified the configuration parameter 
`spark.shuffle.service.enabled`

## What changes were proposed in this pull request?

The configuration parameter "spark.shuffle.service.enabled"  has defined in 
`package.scala`,  and it  is also used in many place,  so we can replace it 
with `SHUFFLE_SERVICE_ENABLED`.
and unified  this configuration parameter "spark.shuffle.service.port"  
together.
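
A rough sketch of the pattern this refers to, following Spark's internal `org.apache.spark.internal.config` package object (the entry names come from this PR; the defaults shown are illustrative, not authoritative):

```scala
package org.apache.spark.internal

package object config {
  // Typed config entries declared once in package.scala...
  private[spark] val SHUFFLE_SERVICE_ENABLED =
    ConfigBuilder("spark.shuffle.service.enabled").booleanConf.createWithDefault(false)

  private[spark] val SHUFFLE_SERVICE_PORT =
    ConfigBuilder("spark.shuffle.service.port").intConf.createWithDefault(7337)
}

// ...so call sites read the typed entry instead of repeating the string literal, e.g.:
//   val shuffleServiceEnabled = sparkConf.get(config.SHUFFLE_SERVICE_ENABLED)
```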

## How was this patch tested?
N/A

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/10110346/spark unifiedserviceenable

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22306.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22306


commit 82525d753b80aef856217b9a161966a7ad499eca
Author: liuxian 
Date:   2018-09-01T03:09:08Z

fix




---




[GitHub] spark pull request #22241: [SPARK-25249][CORE][TEST]add a unit test for Open...

2018-08-27 Thread 10110346
Github user 10110346 commented on a diff in the pull request:

https://github.com/apache/spark/pull/22241#discussion_r212902991
  
--- Diff: 
core/src/test/scala/org/apache/spark/util/collection/OpenHashMapSuite.scala ---
@@ -194,4 +194,42 @@ class OpenHashMapSuite extends SparkFunSuite with 
Matchers {
 val numInvalidValues = map.iterator.count(_._2 == 0)
 assertResult(0)(numInvalidValues)
   }
+
+  test("distinguish between the 0/0.0/0L and null") {
+val specializedMap1 = new OpenHashMap[String, Long]
+specializedMap1("a") = null.asInstanceOf[Long]
+specializedMap1("b") = 0L
+assert(specializedMap1.contains("a"))
+assert(!specializedMap1.contains("c"))
+assert(Some(specializedMap1("a")).contains(0L))
+assert(Some(specializedMap1("b")).contains(0L))
+assert(Some(specializedMap1("c")).contains(0L))
--- End diff --

OK, I will add it.


---




  1   2   3   4   >