[GitHub] spark issue #22339: [SPARK-17159][STREAM] Significant speed up for running s...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22339 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22339: [SPARK-17159][STREAM] Significant speed up for running s...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22339 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96885/ Test PASSed.
[GitHub] spark issue #22339: [SPARK-17159][STREAM] Significant speed up for running s...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22339

**[Test build #96885 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96885/testReport)** for PR 22339 at commit [`542872c`](https://github.com/apache/spark/commit/542872cb5459fae1a66ee45aa193986e9a37fb96).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request #22527: [SPARK-17952][SQL] Nested Java beans support in c...
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/22527#discussion_r222183494

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala ---
@@ -1098,16 +1098,24 @@ object SQLContext {
       data: Iterator[_],
       beanClass: Class[_],
       attrs: Seq[AttributeReference]): Iterator[InternalRow] = {
-    val extractors =
-      JavaTypeInference.getJavaBeanReadableProperties(beanClass).map(_.getReadMethod)
-    val methodsToConverts = extractors.zip(attrs).map { case (e, attr) =>
-      (e, CatalystTypeConverters.createToCatalystConverter(attr.dataType))
+    def createStructConverter(cls: Class[_], fieldTypes: Iterator[DataType]): Any => InternalRow = {
--- End diff --

nit: `Seq[DataType]` instead of `Iterator[DataType]`?
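The reviewer's preference for `Seq[DataType]` over `Iterator[DataType]` likely comes down to iterators being single-pass: once zipped or traversed, an iterator is exhausted, while a sequence can be traversed again. A minimal sketch of the same pitfall in Python (illustrative only, not the Spark code):

```python
# An iterator parameter is consumed by its first traversal; a sequence is not.
field_types = ["int", "string", "double"]

it = iter(field_types)
first_pass = list(zip(["a", "b", "c"], it))   # consumes the iterator
second_pass = list(zip(["a", "b", "c"], it))  # the iterator is now empty

seq_first = list(zip(["a", "b", "c"], field_types))
seq_second = list(zip(["a", "b", "c"], field_types))  # a sequence can be reused

print(len(first_pass), len(second_pass))  # 3 0
print(len(seq_first), len(seq_second))    # 3 3
```

Accepting a sequence keeps the helper safe to call (or traverse) more than once, which is why it is the more defensive signature.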
[GitHub] spark pull request #22527: [SPARK-17952][SQL] Nested Java beans support in c...
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/22527#discussion_r222185482

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala ---
@@ -1098,16 +1098,24 @@ object SQLContext {
       data: Iterator[_],
       beanClass: Class[_],
       attrs: Seq[AttributeReference]): Iterator[InternalRow] = {
-    val extractors =
-      JavaTypeInference.getJavaBeanReadableProperties(beanClass).map(_.getReadMethod)
-    val methodsToConverts = extractors.zip(attrs).map { case (e, attr) =>
-      (e, CatalystTypeConverters.createToCatalystConverter(attr.dataType))
+    def createStructConverter(cls: Class[_], fieldTypes: Iterator[DataType]): Any => InternalRow = {
+      val methodConverters =
+        JavaTypeInference.getJavaBeanReadableProperties(cls).iterator.zip(fieldTypes)
+          .map { case (property, fieldType) =>
+            val method = property.getReadMethod
+            method -> createConverter(method.getReturnType, fieldType)
+          }.toArray
+      value => new GenericInternalRow(
--- End diff --

We should check whether `value` is `null`. Also, could you add a test for that case?
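The null check the reviewer asks for matters because the converter closure immediately invokes getters on `value`; a null nested bean would fail at that point instead of producing a null struct. A hedged Python sketch of the defensive pattern (all names hypothetical, not Spark's implementation):

```python
def create_struct_converter(getters):
    """Build a converter mapping a bean-like object to a row (tuple).

    Returns None for a null input instead of failing on attribute access,
    mirroring the null check requested in the review.
    """
    def convert(value):
        if value is None:  # null bean -> null struct, not an exception
            return None
        return tuple(g(value) for g in getters)
    return convert

class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y

conv = create_struct_converter([lambda p: p.x, lambda p: p.y])
print(conv(Point(1, 2)))  # (1, 2)
print(conv(None))         # None
```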
[GitHub] spark issue #22491: [SPARK-25483][TEST] Refactor UnsafeArrayDataBenchmark to...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22491 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96883/ Test PASSed.
[GitHub] spark issue #22491: [SPARK-25483][TEST] Refactor UnsafeArrayDataBenchmark to...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22491 Merged build finished. Test PASSed.
[GitHub] spark issue #22491: [SPARK-25483][TEST] Refactor UnsafeArrayDataBenchmark to...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22491

**[Test build #96883 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96883/testReport)** for PR 22491 at commit [`4b69423`](https://github.com/apache/spark/commit/4b6942337bab89fad4d40f120d53197544a799b8).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #22500: [SPARK-25488][TEST] Refactor MiscBenchmark to use main m...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22500 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/3650/ Test PASSed.
[GitHub] spark issue #22500: [SPARK-25488][TEST] Refactor MiscBenchmark to use main m...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22500 Merged build finished. Test PASSed.
[GitHub] spark issue #22500: [SPARK-25488][TEST] Refactor MiscBenchmark to use main m...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22500 **[Test build #96888 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96888/testReport)** for PR 22500 at commit [`c791249`](https://github.com/apache/spark/commit/c791249fdd3f97736d977087434575ca3480ae9d).
[GitHub] spark issue #22500: [SPARK-25488][TEST] Refactor MiscBenchmark to use main m...
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/22500 retest this please
[GitHub] spark issue #22619: [SQL][MINOR] Make use of TypeCoercion.findTightestCommon...
Github user dilipbiswal commented on the issue: https://github.com/apache/spark/pull/22619 cc @HyukjinKwon @MaxGekk
[GitHub] spark issue #22619: [SQL][MINOR] Make use of TypeCoercion.findTightestCommon...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22619 Merged build finished. Test PASSed.
[GitHub] spark issue #22619: [SQL][MINOR] Make use of TypeCoercion.findTightestCommon...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22619 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96881/ Test PASSed.
[GitHub] spark issue #22619: [SQL][MINOR] Make use of TypeCoercion.findTightestCommon...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22619

**[Test build #96881 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96881/testReport)** for PR 22619 at commit [`d4e0bdb`](https://github.com/apache/spark/commit/d4e0bdb5a06f59ff1df3933cb43804e71cde8259).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #22339: [SPARK-17159][STREAM] Significant speed up for running s...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22339 **[Test build #96887 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96887/testReport)** for PR 22339 at commit [`d91c815`](https://github.com/apache/spark/commit/d91c815774bff070bdb3cb149678ff080bc06b45).
[GitHub] spark issue #22339: [SPARK-17159][STREAM] Significant speed up for running s...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22339 Merged build finished. Test PASSed.
[GitHub] spark issue #22339: [SPARK-17159][STREAM] Significant speed up for running s...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22339 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/3649/ Test PASSed.
[GitHub] spark issue #22339: [SPARK-17159][STREAM] Significant speed up for running s...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22339 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96886/ Test FAILed.
[GitHub] spark issue #22339: [SPARK-17159][STREAM] Significant speed up for running s...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22339

**[Test build #96886 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96886/testReport)** for PR 22339 at commit [`dab9bf3`](https://github.com/apache/spark/commit/dab9bf3771994989e5de2857f91d117dc8b74623).
* This patch **fails to build**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #22339: [SPARK-17159][STREAM] Significant speed up for running s...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22339 Merged build finished. Test FAILed.
[GitHub] spark issue #22339: [SPARK-17159][STREAM] Significant speed up for running s...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22339 **[Test build #96886 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96886/testReport)** for PR 22339 at commit [`dab9bf3`](https://github.com/apache/spark/commit/dab9bf3771994989e5de2857f91d117dc8b74623).
[GitHub] spark issue #22339: [SPARK-17159][STREAM] Significant speed up for running s...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22339 Merged build finished. Test PASSed.
[GitHub] spark issue #22339: [SPARK-17159][STREAM] Significant speed up for running s...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22339 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/3648/ Test PASSed.
[GitHub] spark issue #22339: [SPARK-17159][STREAM] Significant speed up for running s...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22339 **[Test build #96885 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96885/testReport)** for PR 22339 at commit [`542872c`](https://github.com/apache/spark/commit/542872cb5459fae1a66ee45aa193986e9a37fb96).
[GitHub] spark issue #22339: [SPARK-17159][STREAM] Significant speed up for running s...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22339 Merged build finished. Test PASSed.
[GitHub] spark issue #22339: [SPARK-17159][STREAM] Significant speed up for running s...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22339 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/3647/ Test PASSed.
[GitHub] spark issue #22500: [SPARK-25488][TEST] Refactor MiscBenchmark to use main m...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22500 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96882/ Test FAILed.
[GitHub] spark issue #22500: [SPARK-25488][TEST] Refactor MiscBenchmark to use main m...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22500 Merged build finished. Test FAILed.
[GitHub] spark issue #22500: [SPARK-25488][TEST] Refactor MiscBenchmark to use main m...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22500

**[Test build #96882 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96882/testReport)** for PR 22500 at commit [`c791249`](https://github.com/apache/spark/commit/c791249fdd3f97736d977087434575ca3480ae9d).
* This patch **fails PySpark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #22488: [SPARK-25479][TEST] Refactor DatasetBenchmark to use mai...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22488 Merged build finished. Test PASSed.
[GitHub] spark issue #22488: [SPARK-25479][TEST] Refactor DatasetBenchmark to use mai...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22488 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96880/ Test PASSed.
[GitHub] spark issue #22488: [SPARK-25479][TEST] Refactor DatasetBenchmark to use mai...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22488

**[Test build #96880 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96880/testReport)** for PR 22488 at commit [`27c6493`](https://github.com/apache/spark/commit/27c649337e326aa8081afcd9de1a7097ea2788e2).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request #22610: [WIP][SPARK-25461][PySpark][SQL] Print warning wh...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/22610#discussion_r222173904

--- Diff: python/pyspark/worker.py ---
@@ -84,13 +84,36 @@ def wrap_scalar_pandas_udf(f, return_type):
     arrow_return_type = to_arrow_type(return_type)

     def verify_result_length(*a):
+        import pyarrow as pa
         result = f(*a)
         if not hasattr(result, "__len__"):
             raise TypeError("Return type of the user-defined function should be "
                             "Pandas.Series, but is {}".format(type(result)))
         if len(result) != len(a[0]):
             raise RuntimeError("Result vector from pandas_udf was not the required length: "
                                "expected %d, got %d" % (len(a[0]), len(result)))
+
+        # Ensure return type of Pandas.Series matches the arrow return type of the user-defined
+        # function. Otherwise, we may produce incorrect serialized data.
+        # Note: for timestamp type, we only need to ensure both types are timestamp because the
+        # serializer will do conversion.
+        try:
+            arrow_type_of_result = pa.from_numpy_dtype(result.dtype)
+            both_are_timestamp = pa.types.is_timestamp(arrow_type_of_result) and \
+                pa.types.is_timestamp(arrow_return_type)
+            if not both_are_timestamp and arrow_return_type != arrow_type_of_result:
+                print("WARN: Arrow type %s of return Pandas.Series of the user-defined function's "
                      "dtype %s doesn't match the arrow type %s "
                      "of defined return type %s" % (arrow_type_of_result, result.dtype,
                                                     arrow_return_type, return_type),
                      file=sys.stderr)
+        except:
+            print("WARN: Can't infer arrow type of Pandas.Series's dtype: %s, which might not "
                  "match the arrow type %s of defined return type %s" % (result.dtype,
                                                                         arrow_return_type,
                                                                         return_type),
--- End diff --

ok. thanks. :-)
[GitHub] spark pull request #22610: [WIP][SPARK-25461][PySpark][SQL] Print warning wh...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22610#discussion_r222173421

--- Diff: python/pyspark/worker.py ---
@@ -84,13 +84,36 @@ def wrap_scalar_pandas_udf(f, return_type):
     arrow_return_type = to_arrow_type(return_type)

     def verify_result_length(*a):
+        import pyarrow as pa
         result = f(*a)
         if not hasattr(result, "__len__"):
             raise TypeError("Return type of the user-defined function should be "
                             "Pandas.Series, but is {}".format(type(result)))
         if len(result) != len(a[0]):
             raise RuntimeError("Result vector from pandas_udf was not the required length: "
                                "expected %d, got %d" % (len(a[0]), len(result)))
+
+        # Ensure return type of Pandas.Series matches the arrow return type of the user-defined
+        # function. Otherwise, we may produce incorrect serialized data.
+        # Note: for timestamp type, we only need to ensure both types are timestamp because the
+        # serializer will do conversion.
+        try:
+            arrow_type_of_result = pa.from_numpy_dtype(result.dtype)
+            both_are_timestamp = pa.types.is_timestamp(arrow_type_of_result) and \
+                pa.types.is_timestamp(arrow_return_type)
+            if not both_are_timestamp and arrow_return_type != arrow_type_of_result:
+                print("WARN: Arrow type %s of return Pandas.Series of the user-defined function's "
                      "dtype %s doesn't match the arrow type %s "
                      "of defined return type %s" % (arrow_type_of_result, result.dtype,
                                                     arrow_return_type, return_type),
                      file=sys.stderr)
+        except:
+            print("WARN: Can't infer arrow type of Pandas.Series's dtype: %s, which might not "
                  "match the arrow type %s of defined return type %s" % (result.dtype,
                                                                         arrow_return_type,
                                                                         return_type),
--- End diff --

I would fix the indentation here tho :-)
[GitHub] spark issue #22610: [WIP][SPARK-25461][PySpark][SQL] Print warning when retu...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22610

> The only other idea I have is to provide an option to raise an error if the type needs to be cast.

Actually, that sounds good to me. I think the problem is we are not quite clear about when the type is mismatched in UDFs (see also https://github.com/apache/spark/pull/20163 for a reminder). IIRC, we rather roughly agreed upon documenting it (and allowing exact type match (?)). @viirya and @BryanCutler, how about we document that return types should be matched (we can leave a chart or map referring to `types.to_arrow_type`)? One additional improvement might be to describe that the type-casting behaviour is, say, not guaranteed, but I am not sure how we can nicely document this. Probably only mentioning the type mapping is fine?
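The check being debated boils down to comparing the inferred type of the returned series against the declared return type, with an escape hatch for pairs the serializer can reconcile itself (e.g. two timestamp types). A simplified pure-Python sketch of that decision, without pyarrow and with hypothetical names, just to show the shape of the logic:

```python
import sys

def check_return_type(result_dtype, declared_type,
                      coercible=("timestamp[ns]", "timestamp[us]")):
    """Warn (rather than fail) when the result dtype does not match the
    declared return type; tolerate pairs the serializer converts itself."""
    both_timestamp = result_dtype in coercible and declared_type in coercible
    if not both_timestamp and result_dtype != declared_type:
        print("WARN: result dtype %s doesn't match declared return type %s"
              % (result_dtype, declared_type), file=sys.stderr)
        return False
    return True

print(check_return_type("int64", "int64"))                  # True
print(check_return_type("timestamp[ns]", "timestamp[us]"))  # True (serializer converts)
print(check_return_type("float64", "int64"))                # False, with a warning
```

The "option to raise" idea discussed above would replace the `print` with a `raise` behind a flag; the type comparison itself stays the same.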
[GitHub] spark pull request #22611: [SPARK-25595] Ignore corrupt Avro files if flag I...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22611#discussion_r222169616

--- Diff: external/avro/src/test/scala/org/apache/spark/sql/avro/AvroSuite.scala ---
@@ -342,6 +342,53 @@ class AvroSuite extends QueryTest with SharedSQLContext with SQLTestUtils {
     }
   }

+  private def createDummyCorruptFile(dir: File): Unit = {
+    FileUtils.forceMkdir(dir)
+    val corruptFile = new File(dir, "corrupt.avro")
+    val writer = new BufferedWriter(new FileWriter(corruptFile))
+    writer.write("corrupt")
+    writer.close()
+  }
+
+  test("Ignore corrupt Avro file if flag IGNORE_CORRUPT_FILES enabled") {
+    withSQLConf(SQLConf.IGNORE_CORRUPT_FILES.key -> "true") {
+      withTempPath { dir =>
+        createDummyCorruptFile(dir)
+        val message = intercept[FileNotFoundException] {
+          spark.read.format("avro").load(dir.getAbsolutePath).schema
+        }.getMessage
+        assert(message.contains("No Avro files found."))
+
+        val srcFile = new File("src/test/resources/episodes.avro")
+        val destFile = new File(dir, "episodes.avro")
+        FileUtils.copyFile(srcFile, destFile)
+
+        val df = spark.read.format("avro").load(srcFile.getAbsolutePath)
+        val schema = df.schema
+        val result = df.collect()
+        // Schema inference picks random readable sample file.
+        // Here we use a loop to eliminate randomness.
+        (1 to 5).foreach { _ =>
+          assert(spark.read.format("avro").load(dir.getAbsolutePath).schema == schema)
+          checkAnswer(spark.read.format("avro").load(dir.getAbsolutePath), result)
+        }
+      }
+    }
+  }
+
+  test("Throws IOException on reading corrupt Avro file if flag IGNORE_CORRUPT_FILES disabled") {
+    withSQLConf(SQLConf.IGNORE_CORRUPT_FILES.key -> "false") {
+      withTempPath { dir =>
+        createDummyCorruptFile(dir)
+        val message = intercept[org.apache.spark.SparkException] {
+          spark.read.format("avro").load(dir.getAbsolutePath).schema
--- End diff --

`.schema` probably wouldn't be needed.
[GitHub] spark pull request #22611: [SPARK-25595] Ignore corrupt Avro files if flag I...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22611#discussion_r222169458

--- Diff: external/avro/src/test/scala/org/apache/spark/sql/avro/AvroSuite.scala ---
@@ -342,6 +342,53 @@ class AvroSuite extends QueryTest with SharedSQLContext with SQLTestUtils {
     }
   }

+  private def createDummyCorruptFile(dir: File): Unit = {
+    FileUtils.forceMkdir(dir)
+    val corruptFile = new File(dir, "corrupt.avro")
+    val writer = new BufferedWriter(new FileWriter(corruptFile))
+    writer.write("corrupt")
+    writer.close()
+  }
+
+  test("Ignore corrupt Avro file if flag IGNORE_CORRUPT_FILES enabled") {
+    withSQLConf(SQLConf.IGNORE_CORRUPT_FILES.key -> "true") {
+      withTempPath { dir =>
+        createDummyCorruptFile(dir)
+        val message = intercept[FileNotFoundException] {
+          spark.read.format("avro").load(dir.getAbsolutePath).schema
+        }.getMessage
+        assert(message.contains("No Avro files found."))
+
+        val srcFile = new File("src/test/resources/episodes.avro")
+        val destFile = new File(dir, "episodes.avro")
+        FileUtils.copyFile(srcFile, destFile)
+
+        val df = spark.read.format("avro").load(srcFile.getAbsolutePath)
+        val schema = df.schema
+        val result = df.collect()
+        // Schema inference picks random readable sample file.
+        // Here we use a loop to eliminate randomness.
--- End diff --

Actually, I don't think it's randomness in this test. HDFS lists files in alphabetical order under the hood, although that's not guaranteed. I think the picking order here, at least, is deterministic.
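The reviewer's determinism point can be made concrete with a small sketch: if the reader lists the directory in sorted order and skips unreadable files when an ignore-corrupt flag is set, then `corrupt.avro` sorts before `episodes.avro` and is always tried first, so the outcome is deterministic. A hedged Python illustration (not the Spark implementation; only the Avro magic bytes `Obj\x01` are taken from the Avro container format):

```python
import os
import tempfile

def infer_from_first_readable(directory, ignore_corrupt):
    """Pick the first file (sorted, like an alphabetical HDFS listing) whose
    header looks like an Avro file; optionally skip corrupt files."""
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        with open(path, "rb") as f:
            header = f.read(4)
        if header == b"Obj\x01":  # Avro object container magic bytes
            return name
        if not ignore_corrupt:
            raise IOError("corrupt file: %s" % name)
    raise FileNotFoundError("No Avro files found.")

d = tempfile.mkdtemp()
with open(os.path.join(d, "corrupt.avro"), "wb") as f:
    f.write(b"corrupt")            # not a valid Avro header
with open(os.path.join(d, "episodes.avro"), "wb") as f:
    f.write(b"Obj\x01" + b"\x00")  # minimal valid-looking header

print(infer_from_first_readable(d, ignore_corrupt=True))  # episodes.avro
```

With `ignore_corrupt=False` the same call raises on `corrupt.avro` instead, which is the behavior the second test in the diff asserts.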
[GitHub] spark issue #22616: [SPARK-25586][Core] Remove outer objects from logdebug s...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22616 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96876/ Test PASSed.
[GitHub] spark issue #22616: [SPARK-25586][Core] Remove outer objects from logdebug s...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22616 Merged build finished. Test PASSed.
[GitHub] spark issue #22616: [SPARK-25586][Core] Remove outer objects from logdebug s...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22616

**[Test build #96876 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96876/testReport)** for PR 22616 at commit [`2723b47`](https://github.com/apache/spark/commit/2723b47ded05c2eb2ca6cdb3f893965120c04920).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #22491: [SPARK-25483][TEST] Refactor UnsafeArrayDataBenchmark to...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22491 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96878/ Test PASSed.
[GitHub] spark issue #22491: [SPARK-25483][TEST] Refactor UnsafeArrayDataBenchmark to...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22491 Merged build finished. Test PASSed.
[GitHub] spark issue #22488: [SPARK-25479][TEST] Refactor DatasetBenchmark to use mai...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22488 Merged build finished. Test PASSed.
[GitHub] spark issue #22488: [SPARK-25479][TEST] Refactor DatasetBenchmark to use mai...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22488 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96879/ Test PASSed.
[GitHub] spark issue #22491: [SPARK-25483][TEST] Refactor UnsafeArrayDataBenchmark to...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22491

**[Test build #96878 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96878/testReport)** for PR 22491 at commit [`5c819fb`](https://github.com/apache/spark/commit/5c819fb972c6c5a87375d9f8d6418d1de186449c).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #22488: [SPARK-25479][TEST] Refactor DatasetBenchmark to use mai...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22488

**[Test build #96879 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96879/testReport)** for PR 22488 at commit [`d2d0a3e`](https://github.com/apache/spark/commit/d2d0a3e9fbb27b6a531241bbe6641d5950a7a703).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #22500: [SPARK-25488][TEST] Refactor MiscBenchmark to use main m...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22500 Merged build finished. Test PASSed.
[GitHub] spark issue #22500: [SPARK-25488][TEST] Refactor MiscBenchmark to use main m...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22500 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96877/ Test PASSed.
[GitHub] spark issue #22500: [SPARK-25488][TEST] Refactor MiscBenchmark to use main m...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22500

**[Test build #96877 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96877/testReport)** for PR 22500 at commit [`f7a14e3`](https://github.com/apache/spark/commit/f7a14e3e63bb3f78b750bb17fc66938308647b63).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #22613: [SPARK-25583][DOC][BRANCH-2.3]Add history-server related...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22613 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96884/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22613: [SPARK-25583][DOC][BRANCH-2.3]Add history-server related...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22613 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22613: [SPARK-25583][DOC][BRANCH-2.3]Add history-server related...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22613 **[Test build #96884 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96884/testReport)** for PR 22613 at commit [`f0947a3`](https://github.com/apache/spark/commit/f0947a31f13035e490385a73b027cd5dbe789072). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22611: [SPARK-25595] Ignore corrupt Avro files if flag I...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22611#discussion_r222167078

--- Diff: external/avro/src/test/scala/org/apache/spark/sql/avro/AvroSuite.scala ---
@@ -342,6 +342,53 @@ class AvroSuite extends QueryTest with SharedSQLContext with SQLTestUtils {
     }
   }

+  private def createDummyCorruptFile(dir: File): Unit = {
+    FileUtils.forceMkdir(dir)
+    val corruptFile = new File(dir, "corrupt.avro")
+    val writer = new BufferedWriter(new FileWriter(corruptFile))
+    writer.write("corrupt")
+    writer.close()
--- End diff --

ditto for `tryWithResource`

---
- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22611: [SPARK-25595] Ignore corrupt Avro files if flag I...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22611#discussion_r222167036

--- Diff: external/avro/src/main/scala/org/apache/spark/sql/avro/AvroFileFormat.scala ---
@@ -100,6 +77,50 @@ private[avro] class AvroFileFormat extends FileFormat
     }
   }

+  private def inferAvroSchemaFromFiles(
+      files: Seq[FileStatus],
+      conf: Configuration,
+      ignoreExtension: Boolean): Schema = {
+    val ignoreCorruptFiles = SQLConf.get.ignoreCorruptFiles
+    // Schema evolution is not supported yet. Here we only pick first random readable sample file to
+    // figure out the schema of the whole dataset.
+    val avroReader = files.iterator.map { f =>
+      val path = f.getPath
+      if (!ignoreExtension && !path.getName.endsWith(".avro")) {
+        None
+      } else {
+        val in = new FsInput(path, conf)
+        try {
+          Some(DataFileReader.openReader(in, new GenericDatumReader[GenericRecord]()))
+        } catch {
+          case e: IOException =>
+            if (ignoreCorruptFiles) {
+              logWarning(s"Skipped the footer in the corrupted file: $path", e)
+              None
+            } else {
+              throw new SparkException(s"Could not read file: $path", e)
+            }
+        } finally {
+          in.close()
+        }
+      }
+    }.collectFirst {
+      case Some(reader) => reader
+    }
+
+    avroReader match {
+      case Some(reader) =>
+        try {
--- End diff --

ditto for `tryWithResource`

---
- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22611: [SPARK-25595] Ignore corrupt Avro files if flag I...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22611#discussion_r222166950

--- Diff: external/avro/src/main/scala/org/apache/spark/sql/avro/AvroFileFormat.scala ---
@@ -100,6 +77,50 @@ private[avro] class AvroFileFormat extends FileFormat
     }
   }

+  private def inferAvroSchemaFromFiles(
+      files: Seq[FileStatus],
+      conf: Configuration,
+      ignoreExtension: Boolean): Schema = {
+    val ignoreCorruptFiles = SQLConf.get.ignoreCorruptFiles
+    // Schema evolution is not supported yet. Here we only pick first random readable sample file to
+    // figure out the schema of the whole dataset.
+    val avroReader = files.iterator.map { f =>
+      val path = f.getPath
+      if (!ignoreExtension && !path.getName.endsWith(".avro")) {
+        None
+      } else {
+        val in = new FsInput(path, conf)
--- End diff --

Not a big deal but we can use `Utils.tryWithResource`

---
- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
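The `Utils.tryWithResource` suggested above is Spark's loan-pattern helper: it guarantees the resource is closed whether the body succeeds or throws, replacing hand-written `try`/`finally` blocks like the one in the diff. A minimal stand-alone sketch of the pattern (the helper below is illustrative, not Spark's actual `org.apache.spark.util.Utils` implementation):

```scala
// Loan pattern: acquire a resource, run a function on it, always close it.
// `withResource` is an illustrative name standing in for Utils.tryWithResource.
def withResource[R <: AutoCloseable, T](resource: R)(f: R => T): T =
  try f(resource) finally resource.close()

// A tiny probe resource so we can observe that close() really runs.
var closed = false
class Probe extends AutoCloseable {
  def read(): Int = 42
  override def close(): Unit = closed = true
}

val result = withResource(new Probe)(_.read())
println(s"result=$result closed=$closed") // close() has run even on the success path
```

The same helper shields the caller from leaks on the exception path too: if `f` throws, `close()` still runs in the `finally` block before the exception propagates.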
[GitHub] spark pull request #22611: [SPARK-25595] Ignore corrupt Avro files if flag I...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22611#discussion_r222166498

--- Diff: external/avro/src/main/scala/org/apache/spark/sql/avro/AvroFileFormat.scala ---
@@ -100,6 +77,50 @@ private[avro] class AvroFileFormat extends FileFormat
     }
   }

+  private def inferAvroSchemaFromFiles(
+      files: Seq[FileStatus],
+      conf: Configuration,
+      ignoreExtension: Boolean): Schema = {
+    val ignoreCorruptFiles = SQLConf.get.ignoreCorruptFiles
--- End diff --

How about matching it to `sparkSession.sessionState.conf.ignoreCorruptFiles` like other occurrences?

---
- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22613: [SPARK-25583][DOC][BRANCH-2.3]Add history-server related...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22613 **[Test build #96884 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96884/testReport)** for PR 22613 at commit [`f0947a3`](https://github.com/apache/spark/commit/f0947a31f13035e490385a73b027cd5dbe789072). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22613: [SPARK-25583][DOC][BRANCH-2.3]Add history-server related...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22613 ok to test --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22616: [SPARK-25586][Core] Remove outer objects from logdebug s...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22616 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96875/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22616: [SPARK-25586][Core] Remove outer objects from logdebug s...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22616 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22616: [SPARK-25586][Core] Remove outer objects from logdebug s...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22616 **[Test build #96875 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96875/testReport)** for PR 22616 at commit [`cc8926d`](https://github.com/apache/spark/commit/cc8926d6a69786412e7630b03d738da8fda7effe). * This patch passes all tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `logDebug(s\" + cloning the object of class $` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22615: [SPARK-25016][BUILD][CORE] Remove support for Hadoop 2.6
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/22615

We can simplify this: https://github.com/apache/spark/blob/7ad18ee9f26e75dbe038c6034700f9cd4c0e2baa/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L273-L295 to

```scala
sparkConf.get(ROLLED_LOG_INCLUDE_PATTERN).foreach { includePattern =>
  val logAggregationContext = Records.newRecord(classOf[LogAggregationContext])
  logAggregationContext.setRolledLogsIncludePattern(includePattern)
  sparkConf.get(ROLLED_LOG_EXCLUDE_PATTERN).foreach { excludePattern =>
    logAggregationContext.setRolledLogsExcludePattern(excludePattern)
  }
  appContext.setLogAggregationContext(logAggregationContext)
}
```

---
- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
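The rewrite above hinges on `Option.foreach`: the block runs only when the config is set, replacing explicit `isDefined`/`get` checks, and the nested `foreach` applies the optional exclude pattern only when it is present. A hedged, Spark-free sketch of the same shape (the map keys and `LogContext` class are illustrative, not YARN's API):

```scala
// Optional-config pattern via Option.foreach, standing in for SparkConf lookups.
val conf = Map("includePattern" -> "stdout*") // "excludePattern" deliberately unset

final case class LogContext(include: String, exclude: Option[String])

var applied: Option[LogContext] = None

// Outer foreach: nothing happens at all unless the include pattern is configured.
conf.get("includePattern").foreach { includePattern =>
  var ctx = LogContext(includePattern, exclude = None)
  // Inner foreach: the exclude pattern is applied only if present.
  conf.get("excludePattern").foreach { excludePattern =>
    ctx = ctx.copy(exclude = Some(excludePattern))
  }
  applied = Some(ctx)
}

println(applied)
```

With the include pattern set and the exclude pattern absent, only the outer block executes, mirroring how the suggested `Client.scala` rewrite skips `setRolledLogsExcludePattern` when that config is missing.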
[GitHub] spark issue #22618: [SPARK-25321][ML] Revert SPARK-14681 to avoid API breaki...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22618 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22618: [SPARK-25321][ML] Revert SPARK-14681 to avoid API breaki...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22618 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96874/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22618: [SPARK-25321][ML] Revert SPARK-14681 to avoid API breaki...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22618 **[Test build #96874 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96874/testReport)** for PR 22618 at commit [`90eb1d7`](https://github.com/apache/spark/commit/90eb1d7f5895e442a86506e3e7dae382e138b3b0). * This patch passes all tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `sealed abstract class Node extends Serializable ` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22491: [SPARK-25483][TEST] Refactor UnsafeArrayDataBenchmark to...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22491 **[Test build #96883 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96883/testReport)** for PR 22491 at commit [`4b69423`](https://github.com/apache/spark/commit/4b6942337bab89fad4d40f120d53197544a799b8). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22488: [SPARK-25479][TEST] Refactor DatasetBenchmark to use mai...
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/22488 Congratulations, @jiangxb1987 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22491: [SPARK-25483][TEST] Refactor UnsafeArrayDataBenchmark to...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22491 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22491: [SPARK-25483][TEST] Refactor UnsafeArrayDataBenchmark to...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22491 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/3646/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22491: [SPARK-25483][TEST] Refactor UnsafeArrayDataBenchmark to...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/22491 Could you review and merge https://github.com/wangyum/spark/pull/14 ? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22347: [SPARK-25353][SQL] executeTake in SparkPlan is mo...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/22347#discussion_r222158101

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala ---
@@ -348,30 +349,30 @@ abstract class SparkPlan extends QueryPlan[SparkPlan] with Logging with Serializ
         // Otherwise, interpolate the number of partitions we need to try, but overestimate
         // it by 50%. We also cap the estimation in the end.
         val limitScaleUpFactor = Math.max(sqlContext.conf.limitScaleUpFactor, 2)
-        if (buf.isEmpty) {
+        if (scannedRowCount == 0) {
           numPartsToTry = partsScanned * limitScaleUpFactor
         } else {
-          val left = n - buf.size
+          val left = n - scannedRowCount
           // As left > 0, numPartsToTry is always >= 1
-          numPartsToTry = Math.ceil(1.5 * left * partsScanned / buf.size).toInt
+          numPartsToTry = Math.ceil(1.5 * left * partsScanned / scannedRowCount).toInt
           numPartsToTry = Math.min(numPartsToTry, partsScanned * limitScaleUpFactor)
         }
       }

       val p = partsScanned.until(math.min(partsScanned + numPartsToTry, totalParts).toInt)
       val sc = sqlContext.sparkContext
-      val res = sc.runJob(childRDD,
-        (it: Iterator[Array[Byte]]) => if (it.hasNext) it.next() else Array.empty[Byte], p)
-
-      buf ++= res.flatMap(decodeUnsafeRows)
+      val res = sc.runJob(childRDD, (it: Iterator[(Long, Array[Byte])]) =>
+        if (it.hasNext) it.next() else (0L, Array.empty[Byte]), p)
+      buf ++= res.map(_._2)
+      scannedRowCount += res.map(_._1).sum

       partsScanned += p.size
     }

-    if (buf.size > n) {
-      buf.take(n).toArray
+    if (scannedRowCount > n) {
+      buf.toArray.view.flatMap(decodeUnsafeRows).take(n).force
     } else {
-      buf.toArray
+      buf.toArray.view.flatMap(decodeUnsafeRows).force
--- End diff --

`buf.flatMap(decodeUnsafeRows).toArray` ?

---
- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22347: [SPARK-25353][SQL] executeTake in SparkPlan is mo...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/22347#discussion_r222158041

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala ---
@@ -348,30 +349,30 @@ abstract class SparkPlan extends QueryPlan[SparkPlan] with Logging with Serializ
         // Otherwise, interpolate the number of partitions we need to try, but overestimate
         // it by 50%. We also cap the estimation in the end.
         val limitScaleUpFactor = Math.max(sqlContext.conf.limitScaleUpFactor, 2)
-        if (buf.isEmpty) {
+        if (scannedRowCount == 0) {
          numPartsToTry = partsScanned * limitScaleUpFactor
        } else {
-          val left = n - buf.size
+          val left = n - scannedRowCount
           // As left > 0, numPartsToTry is always >= 1
-          numPartsToTry = Math.ceil(1.5 * left * partsScanned / buf.size).toInt
+          numPartsToTry = Math.ceil(1.5 * left * partsScanned / scannedRowCount).toInt
           numPartsToTry = Math.min(numPartsToTry, partsScanned * limitScaleUpFactor)
         }
       }

       val p = partsScanned.until(math.min(partsScanned + numPartsToTry, totalParts).toInt)
       val sc = sqlContext.sparkContext
-      val res = sc.runJob(childRDD,
-        (it: Iterator[Array[Byte]]) => if (it.hasNext) it.next() else Array.empty[Byte], p)
-
-      buf ++= res.flatMap(decodeUnsafeRows)
+      val res = sc.runJob(childRDD, (it: Iterator[(Long, Array[Byte])]) =>
+        if (it.hasNext) it.next() else (0L, Array.empty[Byte]), p)
+      buf ++= res.map(_._2)
+      scannedRowCount += res.map(_._1).sum

       partsScanned += p.size
     }

-    if (buf.size > n) {
-      buf.take(n).toArray
+    if (scannedRowCount > n) {
+      buf.toArray.view.flatMap(decodeUnsafeRows).take(n).force
--- End diff --

Can we simplify like the following?
```scala
buf.flatMap(decodeUnsafeRows).take(n).toArray
```

---
- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
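The `view ... force` idiom under discussion matters because a lazy view only runs `decodeUnsafeRows` for as many serialized blocks as `take(n)` actually needs, whereas the strict `flatMap` suggested as a simplification decodes every block up front. A hedged, Spark-free sketch of that difference (`decode` below is a hypothetical stand-in for `decodeUnsafeRows`):

```scala
// Count how many "blocks" get decoded under strict vs. lazy evaluation.
var decodeCalls = 0
def decode(block: Int): Seq[Int] = {
  decodeCalls += 1
  Seq(block, block) // pretend each serialized block holds two rows
}

val blocks = List(1, 2, 3, 4, 5)

// Strict: every block is decoded, then take(3) discards most of the work.
val strict = blocks.flatMap(decode).take(3)
val strictCalls = decodeCalls

// Lazy view: decoding stops as soon as three rows have been produced.
decodeCalls = 0
val lazyRows = blocks.view.flatMap(decode).take(3).toList
val lazyCalls = decodeCalls

println(s"strict decoded $strictCalls blocks, lazy decoded $lazyCalls")
```

Both paths yield the same rows; only the amount of decoding work differs, which is the trade-off behind keeping the `view` in the limit-exceeded branch.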
[GitHub] spark issue #22500: [SPARK-25488][TEST] Refactor MiscBenchmark to use main m...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22500 **[Test build #96882 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96882/testReport)** for PR 22500 at commit [`c791249`](https://github.com/apache/spark/commit/c791249fdd3f97736d977087434575ca3480ae9d). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22619: [SQL][MINOR] Make use of TypeCoercion.findTightestCommon...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22619 **[Test build #96881 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96881/testReport)** for PR 22619 at commit [`d4e0bdb`](https://github.com/apache/spark/commit/d4e0bdb5a06f59ff1df3933cb43804e71cde8259). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22500: [SPARK-25488][TEST] Refactor MiscBenchmark to use main m...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22500 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22500: [SPARK-25488][TEST] Refactor MiscBenchmark to use main m...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22500 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/3645/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22619: [SQL][MINOR] Make use of TypeCoercion.findTightestCommon...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22619 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/3644/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22619: [SQL][MINOR] Make use of TypeCoercion.findTightestCommon...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22619 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22619: [SQL][MINOR] Make use of TypeCoercion.findTightes...
GitHub user dilipbiswal opened a pull request: https://github.com/apache/spark/pull/22619

[SQL][MINOR] Make use of TypeCoercion.findTightestCommonType while inferring CSV schema.

## What changes were proposed in this pull request?

Currently the CSV schema-inference code inlines `TypeCoercion.findTightestCommonType`. This is a minor refactor to use the common type-coercion code where applicable, so we can take advantage of any improvement to the base method. Thanks to @MaxGekk for finding this while reviewing another PR.

## How was this patch tested?

This is a minor refactor. Existing tests are used to verify the change.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dilipbiswal/spark csv_minor

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22619.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #22619

commit d4e0bdb5a06f59ff1df3933cb43804e71cde8259
Author: Dilip Biswal
Date: 2018-10-03T00:47:42Z

    Make use of TypeCoercion.findTightestCommonType while inferring CSV schema

---
- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
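For context on what is being deduplicated: `findTightestCommonType` resolves two types to the narrowest type both can widen to without loss, which CSV inference applies pairwise as it scans rows. A hedged stand-alone sketch over a toy numeric lattice (Spark's real `TypeCoercion` covers many more cases; the types and precedence list below are illustrative only):

```scala
// Toy model of tightest-common-type resolution, not Spark's API.
sealed trait DType
case object IntT extends DType
case object LongT extends DType
case object DoubleT extends DType
case object StringT extends DType

// Numeric widening order, narrowest first: Int widens to Long widens to Double.
val numericPrecedence: List[DType] = List(IntT, LongT, DoubleT)

def findTightestCommonType(a: DType, b: DType): Option[DType] = (a, b) match {
  case (x, y) if x == y => Some(x)
  case (x, y) if numericPrecedence.contains(x) && numericPrecedence.contains(y) =>
    // The wider of the two, i.e. whichever sits later in the precedence list.
    val idx = math.max(numericPrecedence.indexOf(x), numericPrecedence.indexOf(y))
    Some(numericPrecedence(idx))
  case _ => None // no tightest common type, e.g. Int vs String
}

println(findTightestCommonType(IntT, DoubleT))
```

The benefit of the refactor is exactly that inference shares one such function with the rest of the analyzer instead of re-implementing the lattice walk inline.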
[GitHub] spark issue #22047: [SPARK-19851] Add support for EVERY and ANY (SOME) aggre...
Github user dilipbiswal commented on the issue: https://github.com/apache/spark/pull/22047 @gatorsmile Thanks.. I will check. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22488: [SPARK-25479][TEST] Refactor DatasetBenchmark to use mai...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/22488 Hi, @jiangxb1987 . Could you review (and merge) this PR? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19601: [SPARK-22383][SQL] Generate code to directly get value o...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/19601 Hi, @kiszk . Is this still valid? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22610: [WIP][SPARK-25461][PySpark][SQL] Print warning when retu...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/22610 Thanks @BryanCutler! Yes, this should not be a bug; it is meant as a warning to users that a type conversion they did not notice may be happening in their Pandas UDFs. For now the conversion is done silently behind the scenes, and as the case in the JIRA shows, it is easy to miss that the `Pandas.Series` returned from a UDF doesn't match the UDF's declared return type. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22598: [SPARK-25501][SS] Add kafka delegation token support.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22598 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96871/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22598: [SPARK-25501][SS] Add kafka delegation token support.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22598 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22598: [SPARK-25501][SS] Add kafka delegation token support.
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22598 **[Test build #96871 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96871/testReport)** for PR 22598 at commit [`2a79d59`](https://github.com/apache/spark/commit/2a79d598a3789def989454560d9efce5664318f0). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22237 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22237 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96873/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22237 **[Test build #96873 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96873/testReport)** for PR 22237 at commit [`7e7af88`](https://github.com/apache/spark/commit/7e7af88cba38d8f5c1dbb9c1fc8204cada9d0f69). * This patch passes all tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class YarnShuffleServiceMetrics implements MetricsSource ` * `case class SchemaOfJson(` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22488: [SPARK-25479][TEST] Refactor DatasetBenchmark to use mai...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22488 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/3643/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22488: [SPARK-25479][TEST] Refactor DatasetBenchmark to use mai...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22488 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22047: [SPARK-19851] Add support for EVERY and ANY (SOME) aggre...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/22047

Let me post something I wrote recently. Could you add test cases to ensure that we do not break the "Ignore NULLs" policy?

> All the set/aggregate functions ignore NULLs. The typical built-in set/aggregate functions are AVG, COUNT, MAX, MIN, SUM, GROUPING.
>
> Note, COUNT(*) is actually equivalent to COUNT(1). Thus, it still includes rows containing NULL.
>
> Tip, because of the "Ignore NULLs" policy, Sum(a) + Sum(b) is not the same as Sum(a+b).
>
> Note, although the set functions follow the "Ignore NULLs" policy, MIN, MAX, SUM, AVG, EVERY, ANY and SOME return NULL if 1) every value is NULL or 2) SELECT returns no row at all. COUNT never returns NULL.
>
> TODO: When a set function eliminates NULLs, Spark SQL does not follow others in issuing the warning message SQLSTATE 01003 "null value eliminated in set function".
>
> TODO: Check whether all the expressions that extend AggregateFunction follow the "Ignore NULLs" policy. If not, we need more investigation to see whether we should correct them.
>
> TODO: When Spark SQL supports EVERY, ANY, and SOME, they follow the same "Ignore NULLs" policy.

---
- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
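The tip that `Sum(a) + Sum(b)` differs from `Sum(a+b)` under the "Ignore NULLs" policy can be checked with a small `Option`-based model of SQL NULL semantics (a sketch, not Spark code: `Option[Int]` plays the role of a nullable column):

```scala
// SQL NULL semantics modeled with Option: SUM skips NULLs; a + b is NULL if either side is.
val rows: Seq[(Option[Int], Option[Int])] = Seq(
  (Some(1), None),      // b is NULL on this row
  (Some(2), Some(3))
)

// SUM ignores NULLs: only the non-NULL values contribute; an all-NULL input yields NULL.
def sqlSum(xs: Seq[Option[Int]]): Option[Int] = {
  val present = xs.flatten
  if (present.isEmpty) None else Some(present.sum)
}

val sumA = sqlSum(rows.map(_._1))                      // 1 + 2
val sumB = sqlSum(rows.map(_._2))                      // just 3; the NULL is skipped
val sumOfSums = for (a <- sumA; b <- sumB) yield a + b

// SUM(a + b): the row whose operand is NULL contributes a NULL sum, which SUM then skips.
val sumAPlusB = sqlSum(rows.map { case (a, b) => for (x <- a; y <- b) yield x + y })

println(s"Sum(a) + Sum(b) = $sumOfSums, Sum(a + b) = $sumAPlusB")
```

The first expression counts the lone non-NULL `a` on the first row; the second drops that row entirely, which is exactly the discrepancy the quoted tip warns about.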
[GitHub] spark issue #22613: [SPARK-25583][DOC][BRANCH-2.3]Add history-server related...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/22613 ok to test --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22488: [SPARK-25479][TEST] Refactor DatasetBenchmark to use mai...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22488 **[Test build #96880 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96880/testReport)** for PR 22488 at commit [`27c6493`](https://github.com/apache/spark/commit/27c649337e326aa8081afcd9de1a7097ea2788e2). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22615: [SPARK-25016][BUILD][CORE] Remove support for Hadoop 2.6
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22615 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22615: [SPARK-25016][BUILD][CORE] Remove support for Hadoop 2.6
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22615 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96867/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22615: [SPARK-25016][BUILD][CORE] Remove support for Hadoop 2.6
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22615 **[Test build #96867 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96867/testReport)** for PR 22615 at commit [`3b313bb`](https://github.com/apache/spark/commit/3b313bb83c84429b4d5840055523b1ca48489d19). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22488: [SPARK-25479][TEST] Refactor DatasetBenchmark to use mai...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/22488 @wangyum . Could you review and merge https://github.com/wangyum/spark/pull/13 ? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org