[GitHub] spark issue #23228: [MINOR][DOC] Update the condition description of seriali...
Github user 10110346 commented on the issue: https://github.com/apache/spark/pull/23228 retest this please
[GitHub] spark issue #23251: [SPARK-26300][SS] Remove a redundant `checkForStreaming`...
Github user 10110346 commented on the issue: https://github.com/apache/spark/pull/23251 cc @cloud-fan
[GitHub] spark issue #23228: [MINOR][DOC] Update the condition description of seriali...
Github user 10110346 commented on the issue: https://github.com/apache/spark/pull/23228 I have updated, thanks all.
[GitHub] spark issue #23228: [MINOR][DOC]The condition description of serialized shuf...
Github user 10110346 commented on the issue: https://github.com/apache/spark/pull/23228 cc @JoshRosen @cloud-fan
[GitHub] spark pull request #23251: [SPARK-26300][SS] The `checkForStreaming` method ...
GitHub user 10110346 opened a pull request: https://github.com/apache/spark/pull/23251

[SPARK-26300][SS] The `checkForStreaming` method may be called twice in `createQuery`

## What changes were proposed in this pull request?

If `checkForContinuous` is called (`checkForStreaming` is called inside `checkForContinuous`), the `checkForStreaming` method will be called twice in `createQuery`. This is unnecessary, and the `checkForStreaming` method does a lot of work, so it is better to remove the redundant call.

## How was this patch tested?

Existing unit tests in `StreamingQueryManagerSuite`

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/10110346/spark isUnsupportedOperationCheckEnabled

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/23251.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #23251

commit b1e71ee7a723d63f1cf3c0754f2372eb185439d3
Author: liuxian
Date: 2018-12-07T03:08:26Z

    fix
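A minimal sketch of the redundant pattern the PR removes, assuming the shape of `StreamingQueryManager.createQuery` described above (names abbreviated; not the exact upstream code):

```scala
// Sketch: checkForContinuous already invokes checkForStreaming internally, so for
// continuous queries the unconditional call below repeats the same analysis checks.
if (conf.isUnsupportedOperationCheckEnabled) {
  UnsupportedOperationChecker.checkForStreaming(analyzedPlan, outputMode)    // first call
}
trigger match {
  case _: ContinuousTrigger =>
    UnsupportedOperationChecker.checkForContinuous(analyzedPlan, outputMode) // calls checkForStreaming again
  case _ =>
}
```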
[GitHub] spark pull request #23228: [MINOR][DOC]The condition description of serializ...
GitHub user 10110346 opened a pull request: https://github.com/apache/spark/pull/23228

[MINOR][DOC] The condition description of serialized shuffle is not very accurate

## What changes were proposed in this pull request?

`1. The shuffle dependency specifies no aggregation or output ordering.`
If the shuffle dependency specifies aggregation, but it only aggregates at the reducer side, serialized shuffle can still be used.

`3. The shuffle produces fewer than 16777216 output partitions.`
If the number of output partitions is exactly 16777216, we can still use serialized shuffle.

## How was this patch tested?

N/A

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/10110346/spark SerializedShuffle_doc

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/23228.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #23228

commit d5dadbf30d5429c36ec3d5c2845a71c2717fd6f3
Author: liuxian
Date: 2018-12-05T08:55:20Z

    fix
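For context, a sketch of the check this doc describes, modeled on `SortShuffleManager.canUseSerializedShuffle` (assuming the upstream constant and field names; 16777216 = 2^24):

```scala
// Serialized (tungsten-sort) shuffle applies only while all three conditions hold:
def canUseSerializedShuffle(dep: ShuffleDependency[_, _, _]): Boolean = {
  dep.serializer.supportsRelocationOfSerializedObjects &&  // serializer can relocate serialized bytes
    !dep.mapSideCombine &&                                 // no map-side aggregation (reducer-side is fine)
    dep.partitioner.numPartitions <= MAX_SHUFFLE_OUTPUT_PARTITIONS_FOR_SERIALIZED_MODE // inclusive bound: 16777216
}
```

This supports both of the PR's points: only map-side aggregation disqualifies serialized shuffle, and the partition bound is inclusive.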
[GitHub] spark pull request #23216: [SPARK-26264][CORE]It is better to add @transient...
Github user 10110346 closed the pull request at: https://github.com/apache/spark/pull/23216
[GitHub] spark issue #23216: [SPARK-26264][CORE]It is better to add @transient to fie...
Github user 10110346 commented on the issue: https://github.com/apache/spark/pull/23216 Ok, I will close this PR, thank you very much
[GitHub] spark issue #23216: [SPARK-26264][CORE]It is better to add @transient to fie...
Github user 10110346 commented on the issue: https://github.com/apache/spark/pull/23216

> Are you sure it's even a field in the class? It looks like it's only used to define this:
>
> ```
> @transient private[this] val preferredLocs: Seq[TaskLocation] = {
>   if (locs == null) Nil else locs.toSet.toSeq
> }
> ```
>
> I'd expect Scala would not generate a field. Indeed the thing it is used to make is transient.

Yeah, it would not generate a field, thanks @srowen. By the way, is it better to remove `transient` for `ShuffleMapTask`?
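A minimal, self-contained illustration of the point above (hypothetical classes, not Spark code): whether a constructor parameter survives as a serializable field depends on where it is referenced.

```scala
// Used only while initializing another val: scalac does not retain `locs` as a
// field, so there is nothing for @transient to suppress.
class OnlyInInit(locs: Seq[String]) extends Serializable {
  @transient private[this] val preferredLocs: Seq[String] =
    if (locs == null) Nil else locs.toSet.toSeq
}

// Referenced from a method: scalac must keep `locs` as a field, and that field
// would be serialized with the instance unless the parameter were @transient.
class UsedInMethod(locs: Seq[String]) extends Serializable {
  def preferred: Seq[String] = if (locs == null) Nil else locs
}
```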
[GitHub] spark pull request #23216: [SPARK-26264][CORE]It is better to add @transient...
GitHub user 10110346 opened a pull request: https://github.com/apache/spark/pull/23216

[SPARK-26264][CORE] It is better to add @transient to field 'locs' for class `ResultTask`

## What changes were proposed in this pull request?

The field 'locs' is only used on the driver side for class `ResultTask`, so it does not need to be serialized when sending the `ResultTask` to an executor. Although it is not very big, it is serialized very frequently, so we can add `@transient` to it, as in `ShuffleMapTask`.

## How was this patch tested?

Existing unit tests

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/10110346/spark locs_transient

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/23216.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #23216

commit b3ede8be1a9073f057cc46fb82eacd7fa3ec36c6
Author: liuxian
Date: 2018-12-04T08:55:40Z

    fix
[GitHub] spark issue #23162: [MINOR][DOC] Correct some document description errors
Github user 10110346 commented on the issue: https://github.com/apache/spark/pull/23162 retest this please
[GitHub] spark pull request #23162: [MINOR][DOC] Correct some document description er...
Github user 10110346 commented on a diff in the pull request: https://github.com/apache/spark/pull/23162#discussion_r237713245

--- Diff: core/src/main/scala/org/apache/spark/internal/config/package.scala ---
```
@@ -430,8 +430,8 @@ package object config {
       .doc("The chunk size in bytes during writing out the bytes of ChunkedByteBuffer.")
       .bytesConf(ByteUnit.BYTE)
       .checkValue(_ <= ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH,
-        "The chunk size during writing out the bytes of" +
-          " ChunkedByteBuffer should not larger than Int.MaxValue - 15.")
+        "The chunk size during writing out the bytes of ChunkedByteBuffer should" +
+          s" not larger than ${ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH}.")
```
--- End diff --

ok, thanks
[GitHub] spark issue #23162: [MINOR][DOC] Correct some document description errors
Github user 10110346 commented on the issue: https://github.com/apache/spark/pull/23162 retest this please
[GitHub] spark pull request #21957: [SPARK-24994][SQL] When the data type of the fiel...
Github user 10110346 commented on a diff in the pull request: https://github.com/apache/spark/pull/21957#discussion_r236965962

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala ---
```
@@ -269,7 +269,8 @@ case class FileSourceScanExec(
   }

   @transient
-  private val pushedDownFilters = dataFilters.flatMap(DataSourceStrategy.translateFilter)
+  private val pushedDownFilters = dataFilters.flatMap(DataSourceStrategy.
+    translateFilter(_, !relation.fileFormat.isInstanceOf[ParquetSource]))
```
--- End diff --

Thanks. Yeah, this is not a good solution; I can't solve this problem any better right now.
[GitHub] spark issue #22590: [SPARK-25574][SQL]Add an option `keepQuotes` for parsing...
Github user 10110346 commented on the issue: https://github.com/apache/spark/pull/22590 @HyukjinKwon I think it is not that important, but our customers need this feature. Yeah, it would be better to find a way to set arbitrary parser settings options.
[GitHub] spark pull request #23162: [MINOR][DOC] Correct some document description er...
GitHub user 10110346 opened a pull request: https://github.com/apache/spark/pull/23162

[MINOR][DOC] Correct some document description errors

## What changes were proposed in this pull request?

Correct some document description errors.

## How was this patch tested?

N/A

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/10110346/spark docerror

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/23162.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #23162

commit e9aba19b526610f3f31fa6a5b56140f6be8dc1c1
Author: liuxian
Date: 2018-11-28T06:06:51Z

    fix
[GitHub] spark issue #22163: [SPARK-25166][CORE]Reduce the number of write operations...
Github user 10110346 commented on the issue: https://github.com/apache/spark/pull/22163 cc @kiszk @maropu
[GitHub] spark issue #23154: [SPARK-26195][SQL] Correct exception messages in some cl...
Github user 10110346 commented on the issue: https://github.com/apache/spark/pull/23154 LGTM, thanks
[GitHub] spark pull request #23154: [SPARK-26195][SQL] Correct exception messages in ...
Github user 10110346 commented on a diff in the pull request: https://github.com/apache/spark/pull/23154#discussion_r236920634

--- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedRleValuesReader.java ---
```
@@ -510,42 +510,42 @@ public void readIntegers(int total, WritableColumnVector c, int rowId) {
   @Override
   public byte readByte() {
-    throw new UnsupportedOperationException("only readInts is valid.");
+    throw new UnsupportedOperationException("only readByte is valid.");
```
--- End diff --

These exception messages seem to be correct?
[GitHub] spark issue #22163: [SPARK-25166][CORE]Reduce the number of write operations...
Github user 10110346 commented on the issue: https://github.com/apache/spark/pull/22163 retest this please
[GitHub] spark issue #22779: [SPARK-25786][CORE]If the ByteBuffer.hasArray is false ,...
Github user 10110346 commented on the issue: https://github.com/apache/spark/pull/22779 @srowen Thanks. I am sorry, I am on holiday; I will update it next week. I am replying on my phone.
[GitHub] spark pull request #22723: [SPARK-25729][CORE]It is better to replace `minPa...
Github user 10110346 commented on a diff in the pull request: https://github.com/apache/spark/pull/22723#discussion_r230579427

--- Diff: core/src/main/scala/org/apache/spark/input/WholeTextFileInputFormat.scala ---
```
@@ -48,11 +50,11 @@ private[spark] class WholeTextFileInputFormat
    * Allow minPartitions set by end-user in order to keep compatibility with old Hadoop API,
    * which is set through setMaxSplitSize
    */
-  def setMinPartitions(context: JobContext, minPartitions: Int) {
+  def setMinPartitions(sc: SparkContext, context: JobContext, minPartitions: Int) {
```
--- End diff --

Ok, thanks
[GitHub] spark pull request #22723: [SPARK-25729][CORE]It is better to replace `minPa...
Github user 10110346 commented on a diff in the pull request: https://github.com/apache/spark/pull/22723#discussion_r230579423

--- Diff: core/src/main/scala/org/apache/spark/input/WholeTextFileInputFormat.scala ---
```
@@ -48,11 +50,11 @@ private[spark] class WholeTextFileInputFormat
    * Allow minPartitions set by end-user in order to keep compatibility with old Hadoop API,
    * which is set through setMaxSplitSize
    */
-  def setMinPartitions(context: JobContext, minPartitions: Int) {
+  def setMinPartitions(sc: SparkContext, context: JobContext, minPartitions: Int) {
     val files = listStatus(context).asScala
     val totalLen = files.map(file => if (file.isDirectory) 0L else file.getLen).sum
-    val maxSplitSize = Math.ceil(totalLen * 1.0 /
-      (if (minPartitions == 0) 1 else minPartitions)).toLong
+    val minPartNum = Math.max(sc.defaultParallelism, minPartitions)
+    val maxSplitSize = Math.ceil(totalLen * 1.0 / minPartNum).toLong
```
--- End diff --

Thanks, I have changed the description. I think the number of partitions belongs to the input format, and it can also be used by other RDDs.
[GitHub] spark pull request #22723: [SPARK-25729][CORE]It is better to replace `minPa...
Github user 10110346 commented on a diff in the pull request: https://github.com/apache/spark/pull/22723#discussion_r230579084

--- Diff: core/src/main/scala/org/apache/spark/rdd/WholeTextFileRDD.scala ---
```
@@ -51,7 +51,7 @@ private[spark] class WholeTextFileRDD(
       case _ =>
     }
     val jobContext = new JobContextImpl(conf, jobId)
-    inputFormat.setMinPartitions(jobContext, minPartitions)
+    inputFormat.setMinPartitions(sc, jobContext, minPartitions)
```
--- End diff --

Yeah, thanks
[GitHub] spark pull request #22855: [SPARK-25839] [Core] Implement use of KryoPool in...
Github user 10110346 commented on a diff in the pull request: https://github.com/apache/spark/pull/22855#discussion_r228844982

--- Diff: core/src/main/scala/org/apache/spark/serializer/KryoSerializer.scala ---
```
@@ -298,30 +312,40 @@ class KryoDeserializationStream(
   }
 }

-private[spark] class KryoSerializerInstance(ks: KryoSerializer, useUnsafe: Boolean)
+private[spark] class KryoSerializerInstance(
+    ks: KryoSerializer, useUnsafe: Boolean, usePool: Boolean)
   extends SerializerInstance {
   /**
    * A re-used [[Kryo]] instance. Methods will borrow this instance by calling `borrowKryo()`, do
    * their work, then release the instance by calling `releaseKryo()`. Logically, this is a caching
    * pool of size one. SerializerInstances are not thread-safe, hence accesses to this field are
    * not synchronized.
    */
-  @Nullable private[this] var cachedKryo: Kryo = borrowKryo()
+  @Nullable private[this] var cachedKryo: Kryo = if (usePool) null else borrowKryo()

   /**
    * Borrows a [[Kryo]] instance. If possible, this tries to re-use a cached Kryo instance;
    * otherwise, it allocates a new instance.
    */
   private[serializer] def borrowKryo(): Kryo = {
-    if (cachedKryo != null) {
-      val kryo = cachedKryo
-      // As a defensive measure, call reset() to clear any Kryo state that might have been modified
-      // by the last operation to borrow this instance (see SPARK-7766 for discussion of this issue)
+    if (usePool) {
+      val kryo = ks.pool.borrow()
       kryo.reset()
-      cachedKryo = null
       kryo
     } else {
-      ks.newKryo()
+      if (cachedKryo != null) {
+        val kryo = cachedKryo
+        /**
+        * As a defensive measure, call reset() to clear any Kryo state that might have
```
--- End diff --

The `*` after the first line must be aligned with the first `*` of the first line.
[GitHub] spark issue #22723: [SPARK-25729][CORE]It is better to replace `minPartition...
Github user 10110346 commented on the issue: https://github.com/apache/spark/pull/22723 retest this please
[GitHub] spark issue #22723: [SPARK-25729][CORE]It is better to replace `minPartition...
Github user 10110346 commented on the issue: https://github.com/apache/spark/pull/22723 Thanks, yes, you are right. After you reminded me, I realized there were other places, such as `HadoopRDD`. But I wonder if it's better to just modify `WholeTextFileInputFormat`, like `StreamFileInputFormat`?
[GitHub] spark issue #22802: [SPARK-25806][SQL]The instance of FileSplit is redundant
Github user 10110346 commented on the issue: https://github.com/apache/spark/pull/22802 @srowen Thanks, I have checked them all and updated it
[GitHub] spark issue #22723: [SPARK-25729][CORE]It is better to replace `minPartition...
Github user 10110346 commented on the issue: https://github.com/apache/spark/pull/22723 In fact, `BinaryFileRDD` uses `max(defaultParallelism, minPartitions)`: `BinaryFileRDD ---> setMinPartitions ---> Math.max(sc.defaultParallelism, minPartitions)`. In addition, for this unit test: https://github.com/apache/spark/blob/b80bf66a8109faa7f58d45b92417a981666866a0/core/src/test/scala/org/apache/spark/FileSuite.scala#L304 if we set `spark.default.parallelism=3`, the result will not be what we expected.
[GitHub] spark issue #22723: [SPARK-25729][CORE]It is better to replace `minPartition...
Github user 10110346 commented on the issue: https://github.com/apache/spark/pull/22723 What you say is reasonable. But from the perspective of resource utilization, I think it is better to replace `minPartitions` with `defaultParallelism`; we can see that `BinaryFileRDD` and `createNonBucketedReadRDD` both do it this way.
[GitHub] spark pull request #22754: [SPARK-25776][CORE]The disk write buffer size mus...
Github user 10110346 commented on a diff in the pull request: https://github.com/apache/spark/pull/22754#discussion_r228009755

--- Diff: core/src/main/scala/org/apache/spark/internal/config/package.scala ---
```
@@ -495,8 +495,8 @@ package object config {
     ConfigBuilder("spark.shuffle.spill.diskWriteBufferSize")
       .doc("The buffer size, in bytes, to use when writing the sorted records to an on-disk file.")
       .bytesConf(ByteUnit.BYTE)
-      .checkValue(v => v > 0 && v <= Int.MaxValue,
-        s"The buffer size must be greater than 0 and less than ${Int.MaxValue}.")
+      .checkValue(v => v > 12 && v <= Int.MaxValue,
+        s"The buffer size must be greater than 12 and less than ${Int.MaxValue}.")
```
--- End diff --

Yeah, I think this value is better
[GitHub] spark issue #22754: [SPARK-25776][CORE]The disk write buffer size must be gr...
Github user 10110346 commented on the issue: https://github.com/apache/spark/pull/22754 retest this please
[GitHub] spark issue #22754: [SPARK-25776][CORE]The disk write buffer size must be gr...
Github user 10110346 commented on the issue: https://github.com/apache/spark/pull/22754 retest this please
[GitHub] spark pull request #22754: [SPARK-25776][CORE]The disk write buffer size mus...
Github user 10110346 commented on a diff in the pull request: https://github.com/apache/spark/pull/22754#discussion_r227392626

--- Diff: core/src/main/scala/org/apache/spark/internal/config/package.scala ---
```
@@ -495,8 +495,8 @@ package object config {
     ConfigBuilder("spark.shuffle.spill.diskWriteBufferSize")
       .doc("The buffer size, in bytes, to use when writing the sorted records to an on-disk file.")
       .bytesConf(ByteUnit.BYTE)
-      .checkValue(v => v > 0 && v <= Int.MaxValue,
-        s"The buffer size must be greater than 0 and less than ${Int.MaxValue}.")
+      .checkValue(v => v > 12 && v <= Int.MaxValue,
+        s"The buffer size must be greater than 12 and less than ${Int.MaxValue}.")
```
--- End diff --

So strange, I have handled it, but we cannot see the change here
[GitHub] spark pull request #22802: [SPARK-25806][SQL][MINOR] The instance of FileSplit...
GitHub user 10110346 opened a pull request: https://github.com/apache/spark/pull/22802

[SPARK-25806][SQL][MINOR] The instance of FileSplit is redundant for ParquetFileFormat

## What changes were proposed in this pull request?

The instance of `FileSplit` is redundant for `buildReaderWithPartitionValues` in the `ParquetFileFormat` class.

## How was this patch tested?

Existing unit tests in `ParquetQuerySuite.scala`

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/10110346/spark FileSplitnotneed

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22802.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22802

commit 52140fa64adb47047b2cb10377799a6b9fd3ab73
Author: liuxian
Date: 2018-10-23T06:43:51Z

    fix
[GitHub] spark issue #22754: [SPARK-25776][CORE][MINOR]The disk write buffer size mus...
Github user 10110346 commented on the issue: https://github.com/apache/spark/pull/22754 retest this please
[GitHub] spark pull request #22754: [SPARK-25776][CORE][MINOR]The disk write buffer s...
Github user 10110346 commented on a diff in the pull request: https://github.com/apache/spark/pull/22754#discussion_r227200492

--- Diff: core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeSorterSpillWriter.java ---
```
@@ -62,6 +62,8 @@ public UnsafeSorterSpillWriter(
       int fileBufferSize,
       ShuffleWriteMetrics writeMetrics,
       int numRecordsToWrite) throws IOException {
+    // Space used by prefix + len + recordLength is more than 4 + 8 bytes
+    assert (diskWriteBufferSize > 12);
```
--- End diff --

Yes, it can guarantee this. The comment explains why it must be greater than 12.
[GitHub] spark issue #22723: [SPARK-25729][CORE]It is better to replace `minPartition...
Github user 10110346 commented on the issue: https://github.com/apache/spark/pull/22723 retest this please
[GitHub] spark pull request #22754: [SPARK-25776][CORE][MINOR]The disk write buffer s...
Github user 10110346 commented on a diff in the pull request: https://github.com/apache/spark/pull/22754#discussion_r227191411

--- Diff: core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeSorterSpillWriter.java ---
```
@@ -62,6 +62,8 @@ public UnsafeSorterSpillWriter(
       int fileBufferSize,
       ShuffleWriteMetrics writeMetrics,
       int numRecordsToWrite) throws IOException {
+    // Space used by prefix + len + recordLength is more than 4 + 8 bytes
+    assert (diskWriteBufferSize > 12);
```
--- End diff --

I am not sure either, but I see many places (`BitSetMethods.java`, `HeapMemoryAllocator.java`, `LongArray.java`) that use it like this.
[GitHub] spark issue #22754: [SPARK-25776][CORE][MINOR]The disk write buffer size mus...
Github user 10110346 commented on the issue: https://github.com/apache/spark/pull/22754 Thank you for your review, I will update it @kiszk
[GitHub] spark issue #22779: [SPARK-25786][CORE]If the ByteBuffer.hasArray is false ,...
Github user 10110346 commented on the issue: https://github.com/apache/spark/pull/22779 retest this please
[GitHub] spark issue #22779: [SPARK-25786][CORE]If the ByteBuffer.hasArray is false ,...
Github user 10110346 commented on the issue: https://github.com/apache/spark/pull/22779 retest this please
[GitHub] spark issue #22779: [SPARK-25786][CORE]If the ByteBuffer.hasArray is false ,...
Github user 10110346 commented on the issue: https://github.com/apache/spark/pull/22779 retest this please
[GitHub] spark pull request #22779: [SPARK-25786][CORE]If the ByteBuffer.hasArray is ...
GitHub user 10110346 opened a pull request: https://github.com/apache/spark/pull/22779

[SPARK-25786][CORE] If ByteBuffer.hasArray is false, it will throw UnsupportedOperationException for Kryo

## What changes were proposed in this pull request?

In `deserialize` for Kryo, the type of the input parameter is `ByteBuffer`; if it is not backed by an accessible byte array, it will throw `UnsupportedOperationException`.

Exception info:
```
java.lang.UnsupportedOperationException was thrown.
java.lang.UnsupportedOperationException
    at java.nio.ByteBuffer.array(ByteBuffer.java:994)
    at org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:362)
```

## How was this patch tested?

Added a unit test

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/10110346/spark InputStreamKryo

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22779.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22779

commit 943e3988dcb70d17e65b5e508f6f35b87fc71d28
Author: liuxian
Date: 2018-10-19T11:08:10Z

    fix
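One possible way to handle a non-array-backed buffer, sketched under the assumption that copying the remaining bytes to a heap array is acceptable (the helper below is hypothetical and illustrative; the PR itself may use a different mechanism, such as an input-stream wrapper):

```scala
import java.nio.ByteBuffer
import com.esotericsoftware.kryo.io.Input

// ByteBuffer.array() throws UnsupportedOperationException for direct buffers,
// so fall back to copying the remaining bytes into a heap array.
def kryoInputFor(bytes: ByteBuffer): Input = {
  if (bytes.hasArray) {
    new Input(bytes.array(), bytes.arrayOffset() + bytes.position(), bytes.remaining())
  } else {
    val copy = new Array[Byte](bytes.remaining())
    bytes.duplicate().get(copy) // duplicate() leaves the caller's position untouched
    new Input(copy)
  }
}
```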
[GitHub] spark issue #22725: [SPARK-25753][CORE][FOLLOW-UP] fix reading small files v...
Github user 10110346 commented on the issue: https://github.com/apache/spark/pull/22725 ok, thanks @dongjoon-hyun
[GitHub] spark issue #22774: [SPARK-25780][CORE]Scheduling the tasks which have no hi...
Github user 10110346 commented on the issue: https://github.com/apache/spark/pull/22774 retest this please
[GitHub] spark pull request #22774: [SPARK-25780][CORE]Scheduling the tasks which hav...
GitHub user 10110346 opened a pull request: https://github.com/apache/spark/pull/22774

[SPARK-25780][CORE] Schedule the tasks which have no higher-level locality first

## What changes were proposed in this pull request?

For example, an application has two executors, (exec1, host1) and (exec2, host2), and 3 tasks with the locality preferences:
{task0, Seq(TaskLocation("host1", "exec1"))},
{task1, Seq(TaskLocation("host1", "exec1"), TaskLocation("host2"))},
{task2, Seq(TaskLocation("host2"))}

If task0 is running in exec1, then when `allowedLocality` is NODE_LOCAL for exec2, it is better to schedule task2 first, not task1, because task1 may still be scheduled to exec1 later.

## How was this patch tested?

Added a UT

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/10110346/spark higher_locality

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22774.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22774

commit 7076bdef5c633739a17e6e9f7ed0c80ed5cb11de
Author: liuxian
Date: 2018-10-19T06:36:30Z

    fix
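A toy sketch of the preference the PR argues for (all names hypothetical; this is not the `TaskSetManager` code): among tasks runnable at NODE_LOCAL on a host, schedule those with no executor-specific preference first.

```scala
case class TaskPref(id: Int, hosts: Set[String], execs: Set[String])

// Tasks with a pending PROCESS_LOCAL (executor-specific) preference may still get
// a better slot on their preferred executor later, so serve host-only tasks first.
def orderForHost(pending: Seq[TaskPref], host: String): Seq[TaskPref] = {
  val candidates = pending.filter(_.hosts.contains(host))
  val (hostOnly, withExecPref) = candidates.partition(_.execs.isEmpty)
  hostOnly ++ withExecPref
}

// The example from the PR description:
val tasks = Seq(
  TaskPref(1, Set("host1", "host2"), Set("exec1")), // task1: PROCESS_LOCAL on exec1, NODE_LOCAL on host2
  TaskPref(2, Set("host2"), Set.empty))             // task2: NODE_LOCAL on host2 only
orderForHost(tasks, "host2").map(_.id)              // Seq(2, 1): task2 first
```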
[GitHub] spark issue #22754: [MINOR][CORE]The disk write buffer size must be greater ...
Github user 10110346 commented on the issue: https://github.com/apache/spark/pull/22754 @kiszk Thanks, I will create a JIRA.
[GitHub] spark issue #22754: [MINOR][CORE]The disk write buffer size must be greater ...
Github user 10110346 commented on the issue: https://github.com/apache/spark/pull/22754 If we set this property to 12, `freeSpaceInWriteBuffer` will be 0, and the length of `copyMemory` will always be 0, so it is not allowed to set this property to 12. https://github.com/apache/spark/blob/dfc4b65088602ee45e0babe22e64e205fab3e82b/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeSorterSpillWriter.java#L128
[GitHub] spark pull request #22754: [CORE][MINOR]The disk write buffer size must be g...
GitHub user 10110346 opened a pull request: https://github.com/apache/spark/pull/22754

[CORE][MINOR] The disk write buffer size must be greater than 12

## What changes were proposed in this pull request?

In `UnsafeSorterSpillWriter.java`, when we write a record to a spill file with `void write(Object baseObject, long baseOffset, int recordLength, long keyPrefix)`, `recordLength` and `keyPrefix` will be written to the disk write buffer first, and these take 12 bytes, so the disk write buffer size must be greater than 12.

## How was this patch tested?

Existing UT in `UnsafeExternalSorterSuite`

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/10110346/spark diskWriteBufferSize

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22754.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22754

commit dfc4b65088602ee45e0babe22e64e205fab3e82b
Author: liuxian
Date: 2018-10-17T06:03:00Z

    fix
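A small illustration of the arithmetic (a sketch of the framing described above, not the exact `UnsafeSorterSpillWriter` code):

```scala
// Each spilled record is framed in the write buffer as:
//   [recordLength: Int, 4 bytes][keyPrefix: Long, 8 bytes][record data ...]
val headerBytes = 4 + 8 // 12 bytes written before any record data

def freeSpaceForData(diskWriteBufferSize: Int): Int =
  diskWriteBufferSize - headerBytes

freeSpaceForData(12) // 0: no room for record data, so the copy loop cannot make progress
freeSpaceForData(13) // 1: the smallest buffer size that can make progress
```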
[GitHub] spark issue #22725: [SPARK-24610][CORE][FOLLOW-UP] fix reading small files v...
Github user 10110346 commented on the issue: https://github.com/apache/spark/pull/22725 @tgravescs ok, I will do it, thanks
[GitHub] spark issue #22723: [SPARK-25729][CORE]It is better to replace `minPartition...
Github user 10110346 commented on the issue: https://github.com/apache/spark/pull/22723 cc @gatorsmile
[GitHub] spark issue #22725: [SPARK-24610][[CORE][FOLLOW-UP]fix reading small files v...
Github user 10110346 commented on the issue: https://github.com/apache/spark/pull/22725 cc @dhruve @tgravescs
[GitHub] spark pull request #22731: [SPARK-25674][FOLLOW-UP] Update the stats for eac...
Github user 10110346 commented on a diff in the pull request: https://github.com/apache/spark/pull/22731#discussion_r225362470

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala ---
```
@@ -106,15 +106,16 @@ class FileScanRDD(
         // don't need to run this `if` for every record.
         val preNumRecordsRead = inputMetrics.recordsRead
         if (nextElement.isInstanceOf[ColumnarBatch]) {
+          incTaskInputMetricsBytesRead()
```
--- End diff --

Considering that the default value of `spark.sql.parquet.columnarReaderBatchSize` is 4096, this change is better.
[GitHub] spark pull request #22725: [SPARK-24610][CORE][FOLLOW-UP] fix reading small ...
GitHub user 10110346 opened a pull request: https://github.com/apache/spark/pull/22725

[SPARK-24610][CORE][FOLLOW-UP] fix reading small files via BinaryFileRDD

## What changes were proposed in this pull request?

This is a follow-up of #21601; `StreamFileInputFormat` and `WholeTextFileInputFormat` have the same problem.

```
Minimum split size pernode 5123456 cannot be larger than maximum split size 4194304
java.io.IOException: Minimum split size pernode 5123456 cannot be larger than maximum split size 4194304
    at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:201)
    at org.apache.spark.rdd.BinaryFileRDD.getPartitions(BinaryFileRDD.scala:52)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:254)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:252)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2138)
```

## How was this patch tested?

Added a unit test

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/10110346/spark maxSplitSize_node_rack

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22725.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22725

commit 54ffcdb7a18471a7a24fe36a000ca0cc4e8d0eba
Author: liuxian
Date: 2018-10-15T07:28:31Z

    fix
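For context, the constraint `CombineFileInputFormat` enforces, sketched below (the clamp is illustrative, not necessarily the PR's exact fix):

```scala
import org.apache.hadoop.conf.Configuration

// CombineFileInputFormat.getSplits throws an IOException when maxSplitSize is
// smaller than the configured per-node / per-rack minimum split sizes, so a
// computed maxSplitSize has to be raised to at least those values.
def safeMaxSplitSize(conf: Configuration, computedMaxSplitSize: Long): Long = {
  val minPerNode = conf.getLong("mapreduce.input.fileinputformat.split.minsize.per.node", 0L)
  val minPerRack = conf.getLong("mapreduce.input.fileinputformat.split.minsize.per.rack", 0L)
  Seq(computedMaxSplitSize, minPerNode, minPerRack).max
}
```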
[GitHub] spark pull request #22723: [SPARK-25729][CORE]It is better to replace `minPa...
GitHub user 10110346 opened a pull request: https://github.com/apache/spark/pull/22723

[SPARK-25729][CORE] It is better to replace `minPartitions` with `defaultParallelism` when `minPartitions` is less than `defaultParallelism`

## What changes were proposed in this pull request?

In `WholeTextFileRDD`, when `minPartitions` is less than `defaultParallelism`, it is better to replace `minPartitions` with `defaultParallelism`, because this can make better use of resources and improve concurrency.

## How was this patch tested?

Added a unit test

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/10110346/spark minPartNum

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22723.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22723

commit b80bf66a8109faa7f58d45b92417a981666866a0
Author: liuxian
Date: 2018-10-15T06:39:17Z

    fix
[GitHub] spark pull request #22594: [SPARK-25674][SQL] If the records are incremented...
Github user 10110346 commented on a diff in the pull request: https://github.com/apache/spark/pull/22594#discussion_r224286853

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala ---
```
@@ -70,6 +70,8 @@ class FileScanRDD(
   private val ignoreCorruptFiles = sparkSession.sessionState.conf.ignoreCorruptFiles
   private val ignoreMissingFiles = sparkSession.sessionState.conf.ignoreMissingFiles

+  // only for test
+  private val inputMetricsTest = sparkSession.sessionState.conf.contains("spark.inputmetrics.test")
```
--- End diff --

If this place is controlled by `spark.testing`, other unit tests may fail. Yeah, I agree with you; this is a simple change, so it is better to drop this. Thanks @srowen
[GitHub] spark issue #22590: [SPARK-25574][SQL]Add an option `keepQuotes` for parsing...
Github user 10110346 commented on the issue: https://github.com/apache/spark/pull/22590 Normally, it's better to have no quotes, but in our production environment, the user requests quotes to be displayed, so we need this option.
[GitHub] spark pull request #22594: [SPARK-25674][SQL] If the records are incremented...
Github user 10110346 commented on a diff in the pull request: https://github.com/apache/spark/pull/22594#discussion_r223612070

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/metric/SQLMetricsSuite.scala ---
```
@@ -570,4 +572,33 @@ class SQLMetricsSuite extends SparkFunSuite with SQLMetricsTestUtils with Shared
     }
   }
 }
+
+  test("InputMetrics---bytesRead") {
```
--- End diff --

It is too hard; the test needs to involve `ColumnarBatch`, and in addition, we must capture `bytesRead` during execution, not at task end.
[GitHub] spark pull request #22590: [SPARK-25574][SQL]Add an option `keepQuotes` for ...
Github user 10110346 commented on a diff in the pull request: https://github.com/apache/spark/pull/22590#discussion_r223590113

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala ---
```
@@ -194,6 +195,22 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils with Te
     checkAnswer(rows, expectedRows)
   }

+  test("keep escaped quotes") {
+    val notKeepQuotes = spark.read
+      .format("csv")
+      .option("keepQuotes", false)
+      .load(testFile(keepQuotesFile))
+    var expectedRows = Seq(Row("\"a\"b", "ccc", null, "ddd"), Row("ab", "cc", null, "c,ddd"))
```
--- End diff --

ok, thanks
[GitHub] spark issue #22594: [MINOR][SQL] When batch reading, the number of bytes can...
Github user 10110346 commented on the issue: https://github.com/apache/spark/pull/22594 @srowen Yes, I will update, thanks
[GitHub] spark pull request #22594: [MINOR][SQL] When batch reading, the number of by...
GitHub user 10110346 opened a pull request: https://github.com/apache/spark/pull/22594

[MINOR][SQL] When batch reading, the number of bytes cannot be updated as expected

## What changes were proposed in this pull request?

When batch reading, the number of bytes cannot be updated as expected.

## How was this patch tested?

N/A

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/10110346/spark inputMetrics

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22594.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22594

commit e589e1ef83418a485c9d55a72209c0c86cf7b044
Author: liuxian
Date: 2018-09-30T09:14:20Z

    fix
[GitHub] spark pull request #22590: [SPARK-25574][SQL]Add an option `keepQuotes` for ...
GitHub user 10110346 opened a pull request: https://github.com/apache/spark/pull/22590

[SPARK-25574][SQL] Add an option `keepQuotes` for parsing CSV files

## What changes were proposed in this pull request?

In this PR, I added a new option for CSV files: `keepQuotes`. In our project, when we read a CSV file, we hope to keep the quotes. For example, we have this record in the CSV file:

`ab,cc,,"c,ddd"`

We hope it displays like this:

```
+-----+-----+------+---------+
| _c0 | _c1 | _c2  | _c3     |
+-----+-----+------+---------+
| ab  | cc  | null | "c,ddd" |
+-----+-----+------+---------+
```

Not like this:

```
+-----+-----+------+-------+
| _c0 | _c1 | _c2  | _c3   |
+-----+-----+------+-------+
| ab  | cc  | null | c,ddd |
+-----+-----+------+-------+
```

## How was this patch tested?

Added a unit test.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/10110346/spark keepquotes

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22590.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22590

commit 9c46a72517e5235e10ba0325b63817eefe5d71dd
Author: liuxian
Date: 2018-09-29T07:15:47Z

    fix
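Hypothetical usage of the proposed option (`keepQuotes` is this PR's proposal, not an option in upstream Spark; the path is illustrative):

```scala
// Read the sample record ab,cc,,"c,ddd" while preserving the original quoting in _c3.
val df = spark.read
  .format("csv")
  .option("keepQuotes", true) // proposed option; false would keep today's behavior
  .load("path/to/file.csv")
```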
[GitHub] spark issue #22163: [SPARK-25166][CORE]Reduce the number of write operations...
Github user 10110346 commented on the issue: https://github.com/apache/spark/pull/22163 retest this please
[GitHub] spark pull request #22358: [SPARK-25366][SQL]Zstd and brotli CompressionCode...
Github user 10110346 commented on a diff in the pull request: https://github.com/apache/spark/pull/22358#discussion_r216180370

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
```
@@ -398,10 +398,10 @@ object SQLConf {
         "`parquet.compression` is specified in the table-specific options/properties, the " +
         "precedence would be `compression`, `parquet.compression`, " +
         "`spark.sql.parquet.compression.codec`. Acceptable values include: none, uncompressed, " +
-        "snappy, gzip, lzo, brotli, lz4, zstd.")
+        "snappy, gzip, lzo, lz4.")
       .stringConf
       .transform(_.toLowerCase(Locale.ROOT))
-      .checkValues(Set("none", "uncompressed", "snappy", "gzip", "lzo", "lz4", "brotli", "zstd"))
```
--- End diff --

I agree with you, removing is not a good idea. Thanks.
[GitHub] spark pull request #22358: [SPARK-25366][SQL]Zstd and brotli CompressionCode...
Github user 10110346 commented on a diff in the pull request: https://github.com/apache/spark/pull/22358#discussion_r215901781

--- Diff: docs/sql-programming-guide.md ---
```
@@ -964,7 +964,7 @@ Configuration of Parquet can be done using the `setConf` method on `SparkSession
     Sets the compression codec used when writing Parquet files. If either `compression` or
     `parquet.compression` is specified in the table-specific options/properties, the precedence
     would be `compression`, `parquet.compression`, `spark.sql.parquet.compression.codec`.
     Acceptable values include:
-    none, uncompressed, snappy, gzip, lzo, brotli, lz4, zstd.
```
--- End diff --

Got it, thanks @wangyum
[GitHub] spark issue #22358: [SPARK-25366][SQL]Zstd and brotli CompressionCodec are n...
Github user 10110346 commented on the issue: https://github.com/apache/spark/pull/22358 Yeah, the error message is output from an external jar (parquet-common-1.10.0.jar); I think Spark + Parquet should avoid the Hadoop dependencies.
[GitHub] spark issue #22358: [SPARK-25366][SQL]Zstd and brotli CompressionCodec are n...
Github user 10110346 commented on the issue: https://github.com/apache/spark/pull/22358 Thanks. If the codecs can be found, we can support those compressions, but how do I find that out? @HyukjinKwon
[GitHub] spark pull request #22358: [SPARK-25366][SQL]Zstd and brotli CompressionCode...
Github user 10110346 commented on a diff in the pull request: https://github.com/apache/spark/pull/22358#discussion_r215887803

--- Diff: docs/sql-programming-guide.md ---
```
@@ -964,7 +964,7 @@ Configuration of Parquet can be done using the `setConf` method on `SparkSession
     Sets the compression codec used when writing Parquet files. If either `compression` or
     `parquet.compression` is specified in the table-specific options/properties, the precedence
     would be `compression`, `parquet.compression`, `spark.sql.parquet.compression.codec`.
     Acceptable values include:
-    none, uncompressed, snappy, gzip, lzo, brotli, lz4, zstd.
```
--- End diff --

Installation may not be able to solve it.
[GitHub] spark issue #22358: [SPARK-25366][SQL]Zstd and brotli CompressionCodec are n...
Github user 10110346 commented on the issue: https://github.com/apache/spark/pull/22358 It is using reflection to acquire the Hadoop classes for compression, which are not in the supplied dependencies (hadoop-common-2.6.5.jar, hadoop-common-2.7.0.jar, hadoop-common-3.1.0.jar):

```
BROTLI("org.apache.hadoop.io.compress.BrotliCodec", CompressionCodec.BROTLI, ".br"),
ZSTD("org.apache.hadoop.io.compress.ZStandardCodec", CompressionCodec.ZSTD, ".zstd");
```
[GitHub] spark pull request #22358: [SPARK-25366][SQL]Zstd and brotli CompressionCode...
GitHub user 10110346 opened a pull request: https://github.com/apache/spark/pull/22358

[SPARK-25366][SQL] Zstd and brotli CompressionCodec are not supported for parquet files

## What changes were proposed in this pull request?

Hadoop 2.6 and Hadoop 2.7 do not contain the zstd and brotli compression codecs, and Hadoop 3.1 contains only the zstd codec. So I think we should remove zstd and brotli for the time being.

**Set `spark.sql.parquet.compression.codec=brotli`:**
```
Caused by: org.apache.parquet.hadoop.BadConfigurationException: Class org.apache.hadoop.io.compress.BrotliCodec was not found
    at org.apache.parquet.hadoop.CodecFactory.getCodec(CodecFactory.java:235)
    at org.apache.parquet.hadoop.CodecFactory$HeapBytesCompressor.<init>(CodecFactory.java:142)
    at org.apache.parquet.hadoop.CodecFactory.createCompressor(CodecFactory.java:206)
    at org.apache.parquet.hadoop.CodecFactory.getCompressor(CodecFactory.java:189)
    at org.apache.parquet.hadoop.ParquetRecordWriter.<init>(ParquetRecordWriter.java:153)
    at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:411)
    at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:349)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.<init>(ParquetOutputWriter.scala:37)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anon$1.newInstance(ParquetFileFormat.scala:161)
```

**Set `spark.sql.parquet.compression.codec=zstd`:**
```
Caused by: org.apache.parquet.hadoop.BadConfigurationException: Class org.apache.hadoop.io.compress.ZStandardCodec was not found
    at org.apache.parquet.hadoop.CodecFactory.getCodec(CodecFactory.java:235)
    at org.apache.parquet.hadoop.CodecFactory$HeapBytesCompressor.<init>(CodecFactory.java:142)
    at org.apache.parquet.hadoop.CodecFactory.createCompressor(CodecFactory.java:206)
    at org.apache.parquet.hadoop.CodecFactory.getCompressor(CodecFactory.java:189)
    at org.apache.parquet.hadoop.ParquetRecordWriter.<init>(ParquetRecordWriter.java:153)
    at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:411)
    at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:349)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.<init>(ParquetOutputWriter.scala:37)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anon$1.newInstance(ParquetFileFormat.scala:161)
```

## How was this patch tested?

Existing unit tests

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/10110346/spark notsupportzstdandbrotil

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22358.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22358

commit 1db036ad725bc7a3c60dbb9aede0f91cf0d798d0
Author: liuxian
Date: 2018-09-07T07:12:36Z

    fix
[GitHub] spark pull request #22350: [SPARK-25356][SQL]Add Parquet block size option t...
Github user 10110346 closed the pull request at: https://github.com/apache/spark/pull/22350 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22350: [SPARK-25356][SQL]Add Parquet block size option t...
Github user 10110346 commented on a diff in the pull request: https://github.com/apache/spark/pull/22350#discussion_r215819785

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala ---
@@ -123,6 +123,9 @@ class ParquetFileFormat
     // Sets compression scheme
     conf.set(ParquetOutputFormat.COMPRESSION, parquetOptions.compressionCodecClassName)
+    // Sets Parquet block size
+    conf.setInt(ParquetOutputFormat.BLOCK_SIZE, sparkSession.sessionState.conf.parquetBlockSize)
--- End diff --

Sounds reasonable. I'll close it now, thanks.

---
- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22350: [SPARK-25356][SQL]Add Parquet block size option t...
Github user 10110346 commented on a diff in the pull request: https://github.com/apache/spark/pull/22350#discussion_r215598798

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala ---
@@ -123,6 +123,9 @@ class ParquetFileFormat
     // Sets compression scheme
     conf.set(ParquetOutputFormat.COMPRESSION, parquetOptions.compressionCodecClassName)
+    // Sets Parquet block size
+    conf.setInt(ParquetOutputFormat.BLOCK_SIZE, sparkSession.sessionState.conf.parquetBlockSize)
--- End diff --

Yes, we are already able to set this via `parquet.block.size`. I think we should add this parameter to `sql-programming-guide.md`.

---
- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
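An illustrative sketch of the existing knob referred to above, assuming a `spark` session and a DataFrame `df` in scope (e.g. in spark-shell); it relies on Parquet write options being copied into the underlying Hadoop configuration, which is how the Parquet data source typically behaves:

```scala
// Session-wide: every Parquet write in this session uses 128 MB row groups.
spark.sparkContext.hadoopConfiguration
  .setInt("parquet.block.size", 128 * 1024 * 1024)

// Per write: override the row-group size for a single output (example path).
df.write
  .option("parquet.block.size", (64 * 1024 * 1024).toString)
  .parquet("/tmp/parquet-block-size-demo")
```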
[GitHub] spark pull request #22350: [SPARK-25356][SQL]Add Parquet block size option t...
GitHub user 10110346 opened a pull request: https://github.com/apache/spark/pull/22350 [SPARK-25356][SQL]Add Parquet block size option to SparkSQL configuration

## What changes were proposed in this pull request?
I think we should make the Parquet block size configurable when using the Parquet format. For HDFS, `dfs.block.size` is configurable, and sometimes we want the Parquet block size to be consistent with it. It may also be best for `spark.sql.files.maxPartitionBytes` to be consistent with the Parquet block size when using the Parquet format. Additionally, we may want to shrink the Parquet block size in some tests.

## How was this patch tested?
N/A

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/10110346/spark addblocksize

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22350.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22350

commit 3485b523d54e83ed3388febd06b3ac4914d181ed
Author: liuxian
Date: 2018-09-06T10:35:43Z

    fix

---
- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
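A minimal sketch of the alignment the description argues for, with a hypothetical 128 MB value chosen to match `dfs.block.size` in this example:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical 128 MB block size, matching dfs.block.size in this example.
val blockSize = 128 * 1024 * 1024

val spark = SparkSession.builder()
  .master("local[*]") // for local testing
  .appName("parquet-block-size-alignment")
  .config("spark.sql.files.maxPartitionBytes", blockSize.toLong)
  .getOrCreate()

// Keep the Parquet row-group (block) size consistent with the split size above.
spark.sparkContext.hadoopConfiguration.setInt("parquet.block.size", blockSize)
```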
[GitHub] spark issue #22306: [SPARK-25300][CORE]Unified the configuration parameter `...
Github user 10110346 commented on the issue: https://github.com/apache/spark/pull/22306 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22306: [SPARK-25300][CORE]Unified the configuration parameter `...
Github user 10110346 commented on the issue: https://github.com/apache/spark/pull/22306 Thanks, I will apply them to the test cases @kiszk --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22306: [SPARK-25300][CORE]Unified the configuration para...
GitHub user 10110346 opened a pull request: https://github.com/apache/spark/pull/22306 [SPARK-25300][CORE]Unified the configuration parameter `spark.shuffle.service.enabled`

## What changes were proposed in this pull request?
The configuration parameter "spark.shuffle.service.enabled" is defined in `package.scala` and is used in many places, so we can replace the string literal with `SHUFFLE_SERVICE_ENABLED`. This change also unifies the configuration parameter "spark.shuffle.service.port" in the same way.

## How was this patch tested?
N/A

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/10110346/spark unifiedserviceenable

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22306.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22306

commit 82525d753b80aef856217b9a161966a7ad499eca
Author: liuxian
Date: 2018-09-01T03:09:08Z

    fix

---
- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
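A rough sketch of the pattern being unified; note that `ConfigBuilder` is `private[spark]`, so code like this compiles only inside Spark's own source tree, where the real entries live in `org.apache.spark.internal.config`:

```scala
import org.apache.spark.internal.config.ConfigBuilder

// Typed entries defined once, mirroring the internal config package object.
val SHUFFLE_SERVICE_ENABLED =
  ConfigBuilder("spark.shuffle.service.enabled").booleanConf.createWithDefault(false)
val SHUFFLE_SERVICE_PORT =
  ConfigBuilder("spark.shuffle.service.port").intConf.createWithDefault(7337)

// Call sites then read through the entry instead of repeating the raw string:
//   before: conf.getBoolean("spark.shuffle.service.enabled", false)
//   after:  conf.get(SHUFFLE_SERVICE_ENABLED)
```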
[GitHub] spark pull request #22241: [SPARK-25249][CORE][TEST]add a unit test for Open...
Github user 10110346 commented on a diff in the pull request: https://github.com/apache/spark/pull/22241#discussion_r212902991

--- Diff: core/src/test/scala/org/apache/spark/util/collection/OpenHashMapSuite.scala ---
@@ -194,4 +194,42 @@ class OpenHashMapSuite extends SparkFunSuite with Matchers {
     val numInvalidValues = map.iterator.count(_._2 == 0)
     assertResult(0)(numInvalidValues)
   }
+
+  test("distinguish between the 0/0.0/0L and null") {
+    val specializedMap1 = new OpenHashMap[String, Long]
+    specializedMap1("a") = null.asInstanceOf[Long]
+    specializedMap1("b") = 0L
+    assert(specializedMap1.contains("a"))
+    assert(!specializedMap1.contains("c"))
+    assert(Some(specializedMap1("a")).contains(0L))
+    assert(Some(specializedMap1("b")).contains(0L))
+    assert(Some(specializedMap1("c")).contains(0L))
--- End diff --

OK, I will add it.

---
- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
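One subtlety behind the test above: in Scala, casting `null` to a primitive type unboxes it to the type's default value, so the `null` write stores the same `0L` as an explicit zero. A two-line check:

```scala
// Unboxing null produces the primitive default, not a null reference.
val fromNull: Long = null.asInstanceOf[Long]
assert(fromNull == 0L) // passes
```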