[GitHub] spark pull request #16451: [SPARK-18922][SQL][CORE][STREAMING][TESTS] Fix al...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/16451#discussion_r94576403

--- Diff: external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaTestUtils.scala ---
@@ -138,10 +139,15 @@ class KafkaTestUtils extends Logging {
    if (server != null) {
      server.shutdown()
+     server.awaitShutdown()
      server = null
    }
-   brokerConf.logDirs.foreach { f => Utils.deleteRecursively(new File(f)) }
+   // On Windows, `logDirs` is sometimes still held open even after the Kafka server above is
+   // completely shut down. This leads to test failures on Windows if these failures are not ignored.
+   brokerConf.logDirs.map(new File(_))
+     .filterNot(FileUtils.deleteQuietly)
--- End diff --

Oh, your comment didn't show up while I was writing mine.
[GitHub] spark pull request #16451: [SPARK-18922][SQL][CORE][STREAMING][TESTS] Fix al...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/16451#discussion_r94576764

--- Diff: external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaTestUtils.scala ---
@@ -138,10 +139,15 @@ class KafkaTestUtils extends Logging {
    if (server != null) {
      server.shutdown()
+     server.awaitShutdown()
      server = null
    }
-   brokerConf.logDirs.foreach { f => Utils.deleteRecursively(new File(f)) }
+   // On Windows, `logDirs` is sometimes still held open even after the Kafka server above is
+   // completely shut down. This leads to test failures on Windows if these failures are not ignored.
+   brokerConf.logDirs.map(new File(_))
+     .filterNot(FileUtils.deleteQuietly)
--- End diff --

This seems to be a Windows-specific problem: Windows holds an exclusive lock on open files, so unlike on Linux or macOS, a file that has not been closed cannot be deleted or renamed. I assume it could be deleted if we called `System.gc()`, though.
[GitHub] spark pull request #16451: [SPARK-18922][SQL][CORE][STREAMING][TESTS] Fix al...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/16451#discussion_r94577026

--- Diff: external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaTestUtils.scala ---
@@ -138,10 +139,15 @@ class KafkaTestUtils extends Logging {
    if (server != null) {
      server.shutdown()
+     server.awaitShutdown()
      server = null
    }
-   brokerConf.logDirs.foreach { f => Utils.deleteRecursively(new File(f)) }
+   // On Windows, `logDirs` is sometimes still held open even after the Kafka server above is
+   // completely shut down. This leads to test failures on Windows if these failures are not ignored.
+   brokerConf.logDirs.map(new File(_))
+     .filterNot(FileUtils.deleteQuietly)
--- End diff --

So, I _guess_ it can't be deleted until the object holding the lock is actually garbage-collected.
[GitHub] spark pull request #16451: [SPARK-18922][SQL][CORE][STREAMING][TESTS] Fix al...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/16451#discussion_r94577735

--- Diff: external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaTestUtils.scala ---
@@ -138,10 +139,15 @@ class KafkaTestUtils extends Logging {
    if (server != null) {
      server.shutdown()
+     server.awaitShutdown()
      server = null
    }
-   brokerConf.logDirs.foreach { f => Utils.deleteRecursively(new File(f)) }
+   // On Windows, `logDirs` is sometimes still held open even after the Kafka server above is
+   // completely shut down. This leads to test failures on Windows if these failures are not ignored.
+   brokerConf.logDirs.map(new File(_))
+     .filterNot(FileUtils.deleteQuietly)
--- End diff --

Yes, it seems the directory contents always have to be deleted first.
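For context, a minimal sketch of the delete-quietly pattern discussed above, assuming Apache Commons IO's `FileUtils.deleteQuietly`; the helper name, the paths, and the use of `println` instead of Spark's `Logging` trait are illustrative assumptions, not the PR's actual code:

```scala
import java.io.File
import org.apache.commons.io.FileUtils

// Delete each Kafka log dir quietly; on Windows an open file handle can make deletion fail,
// so collect the leftovers and only report them instead of throwing.
def cleanupLogDirs(logDirs: Seq[String]): Unit = {
  val undeleted = logDirs
    .map(new File(_))
    .filterNot(FileUtils.deleteQuietly) // returns false when the file/dir could not be deleted
  if (undeleted.nonEmpty) {
    // Plain println here; the real test utility would use Spark's Logging trait instead.
    println(s"Ignoring undeletable log dirs (likely Windows file locks): ${undeleted.mkString(", ")}")
  }
}

// Example usage with hypothetical paths:
cleanupLogDirs(Seq("/tmp/kafka-logs-0", "/tmp/kafka-logs-1"))
```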
[GitHub] spark issue #16451: [SPARK-18922][SQL][CORE][STREAMING][TESTS] Fix all ident...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/16451 retest this please
[GitHub] spark issue #16451: [SPARK-18922][SQL][CORE][STREAMING][TESTS] Fix all ident...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/16451 I really don't think using the full path makes `HiveSparkSubmitSuite.scala` flaky, but if it does seem to fail frequently, let me use the original code path for all OSes except Windows.
[GitHub] spark issue #16429: [SPARK-19019][PYTHON] Fix hijacked `collections.namedtup...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/16429 Could you check whether this patch is applied properly? That error is exactly what this PR fixes, and the line numbers in the errors do not seem to match the ones in this PR.
[GitHub] spark issue #16451: [SPARK-18922][SQL][CORE][STREAMING][TESTS] Fix all ident...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/16451 retest this please
[GitHub] spark issue #16451: [SPARK-18922][SQL][CORE][STREAMING][TESTS] Fix all ident...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/16451 retest this please
[GitHub] spark issue #16451: [SPARK-18922][SQL][CORE][STREAMING][TESTS] Fix all ident...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/16451 retest this please
[GitHub] spark issue #16451: [SPARK-18922][SQL][CORE][STREAMING][TESTS] Fix all ident...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/16451 retest this please
[GitHub] spark issue #16451: [SPARK-18922][SQL][CORE][STREAMING][TESTS] Fix all ident...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/16451 I just checked that every suite except the one below passes on Windows, via the concatenated [full-log](https://gist.github.com/HyukjinKwon/2d199ac9156c380015ad5a71f77866be). It seems `DirectKafkaStreamSuite` is flaky due to the occasional failure to remove the created temp directory (the same problem as https://github.com/apache/spark/pull/16451#discussion_r94562756, but that one fails constantly). It sometimes passes [![PR-16451](https://ci.appveyor.com/api/projects/status/github/spark-test/spark?branch=A7615F8B-58B0-4D9B-A914-32E7BF7DCB65&svg=true)](https://ci.appveyor.com/project/spark-test/spark/branch/A7615F8B-58B0-4D9B-A914-32E7BF7DCB65) but sometimes fails as below:

```
DirectKafkaStreamSuite:
Exception encountered when attempting to run a suite with class name: org.apache.spark.streaming.kafka010.DirectKafkaStreamSuite *** ABORTED *** (59 seconds, 626 milliseconds)
  java.io.IOException: Failed to delete: C:\projects\spark\target\tmp\spark-426107da-68cf-4d94-b0d6-1f428f1c53f6
```
[GitHub] spark pull request #13252: [SPARK-15473][SQL] CSV data source writes header ...
Github user HyukjinKwon closed the pull request at: https://github.com/apache/spark/pull/13252
[GitHub] spark issue #13252: [SPARK-15473][SQL] CSV data source writes header for emp...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/13252 Let me suggest a more generalized approach later, because this does not look like a clean fix.
[GitHub] spark pull request #13988: [SPARK-16101][SQL] Refactoring CSV data source to...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/13988#discussion_r94754860

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala ---
@@ -228,3 +150,35 @@ class CSVFileFormat extends TextBasedFileFormat with DataSourceRegister {
    schema.foreach(field => verifyType(field.dataType))
  }
}
+
--- End diff --

These just came from `CSVRelation`.
[GitHub] spark pull request #13988: [SPARK-16101][SQL] Refactoring CSV data source to...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/13988#discussion_r94755002

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchema.scala ---
@@ -39,22 +38,43 @@ private[csv] object CSVInferSchema {
   * 3. Replace any null types with string type
   */
  def infer(
--- End diff --

This is kind of an important change to introduce functionality similar to JSON's (e.g., creating a DataFrame from an `RDD[String]` or `Dataset[String]`).
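As an illustration of the "functionality similar to JSON's" this refactoring is meant to enable, here is a hedged sketch: the JSON call is the existing reader API, while the CSV analog shown in the trailing comment is an assumption about a possible future overload, not something added by this PR.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[1]").appName("csv-refactor-example").getOrCreate()

// The JSON reader already accepts an in-memory collection of JSON strings:
val jsonRDD = spark.sparkContext.parallelize(Seq("""{"a": 1, "b": "x"}"""))
val jsonDF = spark.read.json(jsonRDD)
jsonDF.printSchema()

// A CSV analog that a standalone parser/schema-inference path would make possible
// (assumed future API, shown for illustration only):
// import spark.implicits._
// val csvDF = spark.read.option("header", "true").csv(Seq("a,b", "1,x").toDS())
```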
[GitHub] spark pull request #13988: [SPARK-16101][SQL] Refactoring CSV data source to...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/13988#discussion_r94755096 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchema.scala --- @@ -214,127 +234,47 @@ private[csv] object CSVInferSchema { case _ => None } -} - -private[csv] object CSVTypeCast { - // A `ValueConverter` is responsible for converting the given value to a desired type. - private type ValueConverter = String => Any /** - * Create converters which cast each given string datum to each specified type in given schema. - * Currently, we do not support complex types (`ArrayType`, `MapType`, `StructType`). - * - * For string types, this is simply the datum. - * For other types, this is converted into the value according to the type. - * For other nullable types, returns null if it is null or equals to the value specified - * in `nullValue` option. - * - * @param schema schema that contains data types to cast the given value into. - * @param options CSV options. + * Generates a header from the given row which is null-safe and duplicate-safe. */ - def makeConverters( - schema: StructType, - options: CSVOptions = CSVOptions()): Array[ValueConverter] = { -schema.map(f => makeConverter(f.name, f.dataType, f.nullable, options)).toArray - } - - /** - * Create a converter which converts the string value to a value according to a desired type. - */ - def makeConverter( - name: String, - dataType: DataType, - nullable: Boolean = true, - options: CSVOptions = CSVOptions()): ValueConverter = dataType match { -case _: ByteType => (d: String) => - nullSafeDatum(d, name, nullable, options)(_.toByte) - -case _: ShortType => (d: String) => - nullSafeDatum(d, name, nullable, options)(_.toShort) - -case _: IntegerType => (d: String) => - nullSafeDatum(d, name, nullable, options)(_.toInt) - -case _: LongType => (d: String) => - nullSafeDatum(d, name, nullable, options)(_.toLong) - -case _: FloatType => (d: String) => - nullSafeDatum(d, name, nullable, options) { -case options.nanValue => Float.NaN -case options.negativeInf => Float.NegativeInfinity -case options.positiveInf => Float.PositiveInfinity -case datum => - Try(datum.toFloat) - .getOrElse(NumberFormat.getInstance(Locale.US).parse(datum).floatValue()) - } - -case _: DoubleType => (d: String) => - nullSafeDatum(d, name, nullable, options) { -case options.nanValue => Double.NaN -case options.negativeInf => Double.NegativeInfinity -case options.positiveInf => Double.PositiveInfinity -case datum => - Try(datum.toDouble) - .getOrElse(NumberFormat.getInstance(Locale.US).parse(datum).doubleValue()) + private def makeSafeHeader( --- End diff -- This just came from `CSVFileFormat`. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13988: [SPARK-16101][SQL] Refactoring CSV data source to...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/13988#discussion_r94755470

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala ---
@@ -126,6 +127,39 @@ private[csv] class CSVOptions(@transient private val parameters: CaseInsensitive
  val inputBufferSize = 128
  val isCommentSet = this.comment != '\u'
+
+  def asWriterSettings: CsvWriterSettings = {
--- End diff --

These just came from `CSVParser`.
[GitHub] spark pull request #13988: [SPARK-16101][SQL] Refactoring CSV data source to...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/13988#discussion_r94755659 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala --- @@ -0,0 +1,272 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.datasources.csv + +import java.math.BigDecimal +import java.text.NumberFormat +import java.util.Locale + +import scala.util.Try +import scala.util.control.NonFatal + +import com.univocity.parsers.csv.CsvParser + +import org.apache.spark.internal.Logging +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.catalyst.expressions.GenericInternalRow +import org.apache.spark.sql.catalyst.util.DateTimeUtils +import org.apache.spark.sql.types._ +import org.apache.spark.unsafe.types.UTF8String + +private[csv] class UnivocityParser( +schema: StructType, +requiredSchema: StructType, +options: CSVOptions) extends Logging { + def this(schema: StructType, options: CSVOptions) = this(schema, schema, options) + + val valueConverters = makeConverters(schema, options) + val parser = new CsvParser(options.asParserSettings) + + // A `ValueConverter` is responsible for converting the given value to a desired type. + private type ValueConverter = String => Any + + var numMalformedRecords = 0 + val row = new GenericInternalRow(requiredSchema.length) --- End diff -- Now, we reuse the single row. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
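As a side note on "reuse the single row": below is a minimal, simplified sketch of the pattern, not the PR's actual code; the class shape and the converter type are assumptions. The idea is that the parser allocates one mutable row and overwrites its fields for every record, so callers must not hold on to the returned row across records.

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.GenericInternalRow

// `converters(i)` turns the i-th string token into the i-th column's value.
class ReusingRowParser(numFields: Int, converters: Array[String => Any]) {
  private val row = new GenericInternalRow(numFields) // allocated once, reused for every record

  def parse(tokens: Array[String]): InternalRow = {
    var i = 0
    while (i < numFields) {
      row(i) = converters(i)(tokens(i)) // overwrite in place, no per-record allocation
      i += 1
    }
    row
  }
}
```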
[GitHub] spark pull request #13988: [SPARK-16101][SQL] Refactoring CSV data source to...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/13988#discussion_r94755607 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityGenerator.scala --- @@ -0,0 +1,89 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.datasources.csv + +import java.io.Writer + +import com.univocity.parsers.csv.CsvWriter + +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.catalyst.util.DateTimeUtils +import org.apache.spark.sql.types._ + +private[csv] class UnivocityGenerator( +schema: StructType, +writer: Writer, +options: CSVOptions = new CSVOptions(Map.empty[String, String])) { + private val writerSettings = options.asWriterSettings + writerSettings.setHeaders(schema.fieldNames: _*) + private val gen = new CsvWriter(writer, writerSettings) + + // A `ValueConverter` is responsible for converting a value of an `InternalRow` to `String`. --- End diff -- These below just mostly came from `CSVTypeCast` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13988: [SPARK-16101][SQL] Refactoring CSV data source to...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/13988#discussion_r94755764 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala --- @@ -0,0 +1,272 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.datasources.csv + +import java.math.BigDecimal +import java.text.NumberFormat +import java.util.Locale + +import scala.util.Try +import scala.util.control.NonFatal + +import com.univocity.parsers.csv.CsvParser + +import org.apache.spark.internal.Logging +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.catalyst.expressions.GenericInternalRow +import org.apache.spark.sql.catalyst.util.DateTimeUtils +import org.apache.spark.sql.types._ +import org.apache.spark.unsafe.types.UTF8String + +private[csv] class UnivocityParser( +schema: StructType, +requiredSchema: StructType, +options: CSVOptions) extends Logging { + def this(schema: StructType, options: CSVOptions) = this(schema, schema, options) + + val valueConverters = makeConverters(schema, options) + val parser = new CsvParser(options.asParserSettings) + + // A `ValueConverter` is responsible for converting the given value to a desired type. + private type ValueConverter = String => Any + + var numMalformedRecords = 0 + val row = new GenericInternalRow(requiredSchema.length) + val indexArr: Array[Int] = { +val fields = if (options.dropMalformed) { + // If `dropMalformed` is enabled, then it needs to parse all the values + // so that we can decide which row is malformed. + requiredSchema ++ schema.filterNot(requiredSchema.contains(_)) +} else { + requiredSchema +} +fields.filter(schema.contains).map(schema.indexOf).toArray + } + + /** + * Create converters which cast each given string datum to each specified type in given schema. + * Currently, we do not support complex types (`ArrayType`, `MapType`, `StructType`). + * + * For string types, this is simply the datum. + * For other types, this is converted into the value according to the type. + * For other nullable types, returns null if it is null or equals to the value specified + * in `nullValue` option. + * + * @param schema schema that contains data types to cast the given value into. + * @param options CSV options. + */ + private def makeConverters( + schema: StructType, + options: CSVOptions = CSVOptions()): Array[ValueConverter] = { +schema.map(f => makeConverter(f.name, f.dataType, f.nullable, options)).toArray + } + + /** + * Create a converter which converts the string value to a value according to a desired type. 
+ */ + def makeConverter( + name: String, + dataType: DataType, + nullable: Boolean = true, + options: CSVOptions = CSVOptions()): ValueConverter = dataType match { +case _: ByteType => (d: String) => + nullSafeDatum(d, name, nullable, options)(_.toByte) + +case _: ShortType => (d: String) => + nullSafeDatum(d, name, nullable, options)(_.toShort) + +case _: IntegerType => (d: String) => + nullSafeDatum(d, name, nullable, options)(_.toInt) + +case _: LongType => (d: String) => + nullSafeDatum(d, name, nullable, options)(_.toLong) + +case _: FloatType => (d: String) => + nullSafeDatum(d, name, nullable, options) { +case options.nanValue => Float.NaN +case options.negativeInf => Float.NegativeInfinity +case options.positiveInf => Float.PositiveInfinity +case datum => + Try(datum.toFloat) + .getOrElse(NumberFormat.getInstance(Locale.US).parse
[GitHub] spark pull request #13988: [SPARK-16101][SQL] Refactoring CSV data source to...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/13988#discussion_r94756023 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala --- @@ -0,0 +1,272 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.datasources.csv + +import java.math.BigDecimal +import java.text.NumberFormat +import java.util.Locale + +import scala.util.Try +import scala.util.control.NonFatal + +import com.univocity.parsers.csv.CsvParser + +import org.apache.spark.internal.Logging +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.catalyst.expressions.GenericInternalRow +import org.apache.spark.sql.catalyst.util.DateTimeUtils +import org.apache.spark.sql.types._ +import org.apache.spark.unsafe.types.UTF8String + +private[csv] class UnivocityParser( +schema: StructType, +requiredSchema: StructType, +options: CSVOptions) extends Logging { + def this(schema: StructType, options: CSVOptions) = this(schema, schema, options) + + val valueConverters = makeConverters(schema, options) + val parser = new CsvParser(options.asParserSettings) + + // A `ValueConverter` is responsible for converting the given value to a desired type. + private type ValueConverter = String => Any + + var numMalformedRecords = 0 + val row = new GenericInternalRow(requiredSchema.length) + val indexArr: Array[Int] = { +val fields = if (options.dropMalformed) { + // If `dropMalformed` is enabled, then it needs to parse all the values + // so that we can decide which row is malformed. + requiredSchema ++ schema.filterNot(requiredSchema.contains(_)) +} else { + requiredSchema +} +fields.filter(schema.contains).map(schema.indexOf).toArray + } + + /** + * Create converters which cast each given string datum to each specified type in given schema. + * Currently, we do not support complex types (`ArrayType`, `MapType`, `StructType`). + * + * For string types, this is simply the datum. + * For other types, this is converted into the value according to the type. + * For other nullable types, returns null if it is null or equals to the value specified + * in `nullValue` option. + * + * @param schema schema that contains data types to cast the given value into. + * @param options CSV options. + */ + private def makeConverters( + schema: StructType, + options: CSVOptions = CSVOptions()): Array[ValueConverter] = { +schema.map(f => makeConverter(f.name, f.dataType, f.nullable, options)).toArray + } + + /** + * Create a converter which converts the string value to a value according to a desired type. 
+ */ + def makeConverter( + name: String, + dataType: DataType, + nullable: Boolean = true, + options: CSVOptions = CSVOptions()): ValueConverter = dataType match { +case _: ByteType => (d: String) => + nullSafeDatum(d, name, nullable, options)(_.toByte) + +case _: ShortType => (d: String) => + nullSafeDatum(d, name, nullable, options)(_.toShort) + +case _: IntegerType => (d: String) => + nullSafeDatum(d, name, nullable, options)(_.toInt) + +case _: LongType => (d: String) => + nullSafeDatum(d, name, nullable, options)(_.toLong) + +case _: FloatType => (d: String) => + nullSafeDatum(d, name, nullable, options) { +case options.nanValue => Float.NaN +case options.negativeInf => Float.NegativeInfinity +case options.positiveInf => Float.PositiveInfinity +case datum => + Try(datum.toFloat) + .getOrElse(NumberFormat.getInstance(Locale.US).parse
[GitHub] spark pull request #15329: [SPARK-17763][SQL] JacksonParser silently parses ...
Github user HyukjinKwon closed the pull request at: https://github.com/apache/spark/pull/15329
[GitHub] spark issue #15329: [SPARK-17763][SQL] JacksonParser silently parses null as...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/15329 After rethinking, I think this might be blocked by https://github.com/apache/spark/pull/14124 (or is at least closely related to it). In any case, it looks dependent on it. Let me close this for now and try to reopen it if I find I can proceed further independently.
[GitHub] spark pull request #13988: [SPARK-16101][SQL] Refactoring CSV data source to...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/13988#discussion_r94773568 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala --- @@ -0,0 +1,272 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.datasources.csv + +import java.math.BigDecimal +import java.text.NumberFormat +import java.util.Locale + +import scala.util.Try +import scala.util.control.NonFatal + +import com.univocity.parsers.csv.CsvParser + +import org.apache.spark.internal.Logging +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.catalyst.expressions.GenericInternalRow +import org.apache.spark.sql.catalyst.util.DateTimeUtils +import org.apache.spark.sql.types._ +import org.apache.spark.unsafe.types.UTF8String + +private[csv] class UnivocityParser( +schema: StructType, +requiredSchema: StructType, +options: CSVOptions) extends Logging { + def this(schema: StructType, options: CSVOptions) = this(schema, schema, options) + + val valueConverters = makeConverters(schema, options) + val parser = new CsvParser(options.asParserSettings) + + // A `ValueConverter` is responsible for converting the given value to a desired type. + private type ValueConverter = String => Any + + var numMalformedRecords = 0 + val row = new GenericInternalRow(requiredSchema.length) --- End diff -- Also, it separates numMalformedRecords when it calls `parse (...)`. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13771: [SPARK-13748][PYSPARK][DOC] Add the description for expl...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/13771 Hi @davies, this PR adds documentation that describes the potentially confusing behaviour as it currently stands. I know this is super trivial, but it is an improvement without any regression. Judging from the JIRA https://issues.apache.org/jira/browse/SPARK-12624, it seems we don't want to allow omitting it, and the documentation describes the correct behaviour. Could I ask you to take a look, please?
[GitHub] spark issue #16429: [SPARK-19019][PYTHON] Fix hijacked `collections.namedtup...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/16429 ping @joshrosen and @davies
[GitHub] spark issue #15751: [SPARK-18246][SQL] Throws an exception before execution ...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/15751 Hi @marmbrus, is there anything you are worried about that I should take a careful look at?
[GitHub] spark issue #15049: [SPARK-17310][SQL] Add an option to disable record-level...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/15049 Hi all, could you please guide me a bit further on what I should do in this PR?
[GitHub] spark issue #14788: [SPARK-17174][SQL] Add the support for TimestampType for...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/14788 gentle ping...
[GitHub] spark issue #14451: [SPARK-16848][SQL] Make jdbc() and read.format("jdbc") c...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/14451 Hi @gatorsmile, could I ask what you think about this PR for now?
[GitHub] spark issue #16451: [SPARK-18922][SQL][CORE][STREAMING][TESTS] Fix all ident...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/16451 test this please
[GitHub] spark issue #16451: [SPARK-18922][SQL][CORE][STREAMING][TESTS] Fix all ident...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/16451 test this please
[GitHub] spark issue #16451: [SPARK-18922][SQL][CORE][STREAMING][TESTS] Fix all ident...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/16451 retest this please
[GitHub] spark issue #16451: [SPARK-18922][SQL][CORE][STREAMING][TESTS] Fix all ident...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/16451 retest this please
[GitHub] spark issue #13988: [SPARK-16101][SQL] Refactoring CSV data source to be con...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/13988 Hi @cloud-fan, would you mind checking whether this looks reasonable?
[GitHub] spark pull request #13988: [SPARK-16101][SQL] Refactoring CSV data source to...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/13988#discussion_r94892239 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala --- @@ -0,0 +1,272 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.datasources.csv + +import java.math.BigDecimal +import java.text.NumberFormat +import java.util.Locale + +import scala.util.Try +import scala.util.control.NonFatal + +import com.univocity.parsers.csv.CsvParser + +import org.apache.spark.internal.Logging +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.catalyst.expressions.GenericInternalRow +import org.apache.spark.sql.catalyst.util.DateTimeUtils +import org.apache.spark.sql.types._ +import org.apache.spark.unsafe.types.UTF8String + +private[csv] class UnivocityParser( +schema: StructType, +requiredSchema: StructType, +options: CSVOptions) extends Logging { + def this(schema: StructType, options: CSVOptions) = this(schema, schema, options) + + val valueConverters = makeConverters(schema, options) + val parser = new CsvParser(options.asParserSettings) + + // A `ValueConverter` is responsible for converting the given value to a desired type. + private type ValueConverter = String => Any + + var numMalformedRecords = 0 + val row = new GenericInternalRow(requiredSchema.length) + val indexArr: Array[Int] = { +val fields = if (options.dropMalformed) { + // If `dropMalformed` is enabled, then it needs to parse all the values + // so that we can decide which row is malformed. + requiredSchema ++ schema.filterNot(requiredSchema.contains(_)) +} else { + requiredSchema +} +fields.filter(schema.contains).map(schema.indexOf).toArray + } + + /** + * Create converters which cast each given string datum to each specified type in given schema. + * Currently, we do not support complex types (`ArrayType`, `MapType`, `StructType`). + * + * For string types, this is simply the datum. + * For other types, this is converted into the value according to the type. + * For other nullable types, returns null if it is null or equals to the value specified + * in `nullValue` option. + * + * @param schema schema that contains data types to cast the given value into. + * @param options CSV options. + */ + private def makeConverters( + schema: StructType, + options: CSVOptions = CSVOptions()): Array[ValueConverter] = { +schema.map(f => makeConverter(f.name, f.dataType, f.nullable, options)).toArray + } + + /** + * Create a converter which converts the string value to a value according to a desired type. 
+ */ + def makeConverter( + name: String, + dataType: DataType, + nullable: Boolean = true, + options: CSVOptions = CSVOptions()): ValueConverter = dataType match { +case _: ByteType => (d: String) => + nullSafeDatum(d, name, nullable, options)(_.toByte) + +case _: ShortType => (d: String) => + nullSafeDatum(d, name, nullable, options)(_.toShort) + +case _: IntegerType => (d: String) => + nullSafeDatum(d, name, nullable, options)(_.toInt) + +case _: LongType => (d: String) => + nullSafeDatum(d, name, nullable, options)(_.toLong) + +case _: FloatType => (d: String) => + nullSafeDatum(d, name, nullable, options) { +case options.nanValue => Float.NaN +case options.negativeInf => Float.NegativeInfinity +case options.positiveInf => Float.PositiveInfinity +case datum => + Try(datum.toFloat) + .getOrElse(NumberFormat.getInstance(Locale.US).parse
[GitHub] spark pull request #13988: [SPARK-16101][SQL] Refactoring CSV data source to...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/13988#discussion_r94892718 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala --- @@ -0,0 +1,272 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.datasources.csv + +import java.math.BigDecimal +import java.text.NumberFormat +import java.util.Locale + +import scala.util.Try +import scala.util.control.NonFatal + +import com.univocity.parsers.csv.CsvParser + +import org.apache.spark.internal.Logging +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.catalyst.expressions.GenericInternalRow +import org.apache.spark.sql.catalyst.util.DateTimeUtils +import org.apache.spark.sql.types._ +import org.apache.spark.unsafe.types.UTF8String + +private[csv] class UnivocityParser( +schema: StructType, +requiredSchema: StructType, +options: CSVOptions) extends Logging { + def this(schema: StructType, options: CSVOptions) = this(schema, schema, options) + + val valueConverters = makeConverters(schema, options) --- End diff -- Some changes about converting here came from `CSVTypeCast`. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14451: [SPARK-16848][SQL] Make jdbc() and read.format("jdbc") c...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/14451 Ah, thanks. Let me check it out.
[GitHub] spark issue #14451: [SPARK-16848][SQL] Make jdbc() and read.format("jdbc") c...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/14451 @gatorsmile, I just updated the PR description and added another `table(..)` API and the test case.
[GitHub] spark pull request #16451: [SPARK-18922][SQL][CORE][STREAMING][TESTS] Fix al...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/16451#discussion_r94900183

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveSparkSubmitSuite.scala ---
@@ -339,7 +339,13 @@ class HiveSparkSubmitSuite
  private def runSparkSubmit(args: Seq[String]): Unit = {
    val sparkHome = sys.props.getOrElse("spark.test.home", fail("spark.test.home is not set!"))
    val history = ArrayBuffer.empty[String]
-   val commands = Seq("./bin/spark-submit") ++ args
+   val sparkSubmit = if (Utils.isWindows) {
+     // On Windows, `ProcessBuilder.directory` does not change the current working directory.
+     new File("..\\..\\bin\\spark-submit.cmd").getAbsolutePath
+   } else {
+     "./bin/spark-submit"
--- End diff --

I really don't get why or how using the full path (possibly) makes `HiveSparkSubmitSuite` flaky, because `SparkSubmitSuite` uses the full path too. I just kept the same approach as the original code path simply because 5 out of 18 builds failed in `HiveSparkSubmitSuite`. I can't think of another workaround for Windows in this case.
[GitHub] spark issue #16451: [SPARK-18922][SQL][CORE][STREAMING][TESTS] Fix all ident...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/16451 Build started: [TESTS] `org.apache.spark.sql.hive.HiveSparkSubmitSuite` [![PR-16451](https://ci.appveyor.com/api/projects/status/github/spark-test/spark?branch=3E2A36EC-E0E9-4A49-9142-50EBB669F9C0&svg=true)](https://ci.appveyor.com/project/spark-test/spark/branch/3E2A36EC-E0E9-4A49-9142-50EBB669F9C0)
[GitHub] spark issue #16451: [SPARK-18922][SQL][CORE][STREAMING][TESTS] Fix all ident...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/16451 @srowen, regarding https://github.com/apache/spark/pull/16451#discussion_r94572449, would it be okay to leave this as is, or should I give `Utils.deleteRecursively` a shot?
[GitHub] spark issue #14451: [SPARK-16848][SQL] Make jdbc() and read.format("jdbc") c...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/14451 Sure, I will add it. It seems `format("jdbc").load()` already throws an exception, which is being tested [here](https://github.com/apache/spark/blob/f923c849e5b8f7e7aeafee59db598a9bf4970f50/sql/core/src/test/scala/org/apache/spark/sql/test/DataFrameReaderWriterSuite.scala#L294-L308). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14451: [SPARK-16848][SQL] Make jdbc() and read.format("jdbc") c...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/14451 Yeap. Then, let me add a test case and fix. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14451: [SPARK-16848][SQL] Check schema validation for us...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/14451#discussion_r94903231 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala --- @@ -536,10 +539,17 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging { */ @scala.annotation.varargs def textFile(paths: String*): Dataset[String] = { +assertNoSpecifiedSchema("textFile") +text(paths : _*).select("value").as[String](sparkSession.implicits.newStringEncoder) + } + + /** + * A convenient function for schema validation in APIs. + */ + private def assertNoSpecifiedSchema(operation: String): Unit = { if (userSpecifiedSchema.nonEmpty) { - throw new AnalysisException("User specified schema not supported with `textFile`") + throw new AnalysisException(s"User specified schema not supported with `$operation`") } -text(paths : _*).select("value").as[String](sparkSession.implicits.newStringEncoder) --- End diff -- This prints weird.. I just added ```scala /** * A convenient function for schema validation in APIs. */ private def assertNoSpecifiedSchema(operation: String): Unit = { if (userSpecifiedSchema.nonEmpty) { throw new AnalysisException(s"User specified schema not supported with `$operation`") } } ``` at the end. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16429: [SPARK-19019][PYTHON] Fix hijacked `collections.namedtup...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/16429 @holdenk, Yup. (actually, it is cloudpipe/cloudpickle@cbd3f34 because there were more minor fixes). It seems there is quite a bit of customized code, such as injecting `numpy` as below: ```python def inject_numpy(self): numpy = sys.modules.get('numpy') if not numpy or not hasattr(numpy, 'ufunc'): return self.dispatch[numpy.ufunc] = self.__class__.save_ufunc ``` So, I tried to take out only the valid changes from there for now. I also agree with using it as a third-party library if possible. That would be a bigger job than this one, though, because we now have quite a lot of customized code there. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16429: [SPARK-19019][PYTHON] Fix hijacked `collections.namedtup...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/16429 BTW, this at least gets Spark into a working state on Python 3.6.0 with minimal changes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14451: [SPARK-16848][SQL] Check schema validation for user-spec...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/14451 retest this please --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13988: [SPARK-16101][SQL] Refactoring CSV data source to...
Github user HyukjinKwon closed the pull request at: https://github.com/apache/spark/pull/13988 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13988: [SPARK-16101][SQL] Refactoring CSV data source to be con...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/13988 Sure! Let me split this into reading and writing ones. Thank you for your comments. Let me close this for now. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16451: [SPARK-18922][SQL][CORE][STREAMING][TESTS] Fix al...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/16451#discussion_r95055056 --- Diff: external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaTestUtils.scala --- @@ -138,10 +139,15 @@ class KafkaTestUtils extends Logging { if (server != null) { server.shutdown() + server.awaitShutdown() server = null } -brokerConf.logDirs.foreach { f => Utils.deleteRecursively(new File(f)) } +// On Windows, `logDirs` is left open even after Kafka server above is completely shut-downed +// in some cases. It leads to test failures on Windows if these are not ignored. +brokerConf.logDirs.map(new File(_)) + .filterNot(FileUtils.deleteQuietly) --- End diff -- > Is the point just that the former method is "quiet" instead of throwing an exception Yes! I will try to clean this up, including addressing your other comment. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16451: [SPARK-18922][SQL][CORE][STREAMING][TESTS] Fix all ident...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/16451 > Utils.deleteRecursively does seem to delete the contents first, so I don't know why it fails on Windows. Maybe I misunderstood this comment, and I might be wrong, but let me try to explain as best I can. On Windows, if we have the folder below and call `Utils.deleteRecursively(".\\tmp")`, ``` tmp └── resource1.txt ``` it will try to remove `resource1.txt` first. If anything opened `resource1.txt` without closing it, Windows does not seem to allow deleting `resource1.txt`, or the folder `tmp`, until the object referencing `resource1.txt` closes it or is garbage-collected. For example, - Windows ```scala scala> new java.io.FileInputStream("resource1.txt") res2: java.io.FileInputStream = java.io.FileInputStream@4980d3 scala> new java.io.File("resource1.txt").delete() res3: Boolean = false scala> new java.io.File("resource1.txt").exists() res4: Boolean = true ``` - Mac ```scala scala> new java.io.FileInputStream("resource1.txt") res6: java.io.FileInputStream = java.io.FileInputStream@13805618 scala> new java.io.File("resource1.txt").delete() res7: Boolean = true scala> new java.io.File("resource1.txt").exists() res8: Boolean = false ``` So, deleting `resource1.txt` throws an exception in the `finally` block, because `!file.delete()` is `true` and `file.exists()` is `true`. Afterwards, another exception is thrown in the `finally` block by the failure to remove `tmp`, because the directory is not empty. This suppresses the original exception message from the failure to remove the contents and instead throws a new exception about the failure to remove the parent directory. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
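A minimal sketch of the pattern that follows from this explanation (hypothetical helper, not the code in the PR): attempt the recursive delete, and on Windows downgrade a locked-file failure to a warning instead of failing the test.

```scala
import java.io.{File, IOException}
import org.apache.commons.io.FileUtils

// Sketch: delete a test directory, tolerating files that are still locked on
// Windows. On Linux/macOS the delete normally succeeds, so any IOException is
// rethrown there.
def deleteTolerantOnWindows(dir: File, isWindows: Boolean): Unit = {
  try {
    FileUtils.deleteDirectory(dir) // throws IOException if anything cannot be removed
  } catch {
    case e: IOException if isWindows =>
      System.err.println(s"Ignoring delete failure on Windows for $dir: ${e.getMessage}")
  }
}
```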
[GitHub] spark issue #16451: [SPARK-18922][SQL][CORE][STREAMING][TESTS] Fix all ident...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/16451 Build started: [TESTS] `org.apache.spark.streaming.kafka010.DirectKafkaStreamSuite` [![PR-16451](https://ci.appveyor.com/api/projects/status/github/spark-test/spark?branch=DC5D6D27-6DDA-4D4A-8579-24BC1A5B487B&svg=true)](https://ci.appveyor.com/project/spark-test/spark/branch/DC5D6D27-6DDA-4D4A-8579-24BC1A5B487B) Build started: [TESTS] `org.apache.spark.streaming.kafka.ReliableKafkaStreamSuite` [![PR-16451](https://ci.appveyor.com/api/projects/status/github/spark-test/spark?branch=872A75B1-A478-4007-B02B-882A85DA94BC&svg=true)](https://ci.appveyor.com/project/spark-test/spark/branch/872A75B1-A478-4007-B02B-882A85DA94BC) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16451: [SPARK-18922][SQL][CORE][STREAMING][TESTS] Fix all ident...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/16451 I made a new function simply because the same logic is repeated. Either way is fine with me. So.. do you mean something like the one below? ```scala try { Utils.deleteRecursively(snapshotDir) } catch { case e: IOException if Utils.isWindows => logWarning(e.getMessage) } try { Utils.deleteRecursively(logDir) } catch { case e: IOException if Utils.isWindows => logWarning(e.getMessage) } ``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
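For comparison, the shared helper alluded to above could look like the following sketch (the helper name is hypothetical; `Utils.deleteRecursively`, `Utils.isWindows` and `logWarning` are the Spark utilities already used in the snippet above, with `logWarning` assumed to come from a `Logging` mixin in scope):

```scala
import java.io.{File, IOException}
import org.apache.spark.util.Utils

// Sketch: factor the repeated try/catch into one helper so each call site
// stays a single line.
def tryDeleteRecursively(dir: File): Unit = {
  try {
    Utils.deleteRecursively(dir)
  } catch {
    // Locked files keep directories from being removed on Windows; warn instead of failing.
    case e: IOException if Utils.isWindows => logWarning(e.getMessage)
  }
}

// Call sites then become:
//   tryDeleteRecursively(snapshotDir)
//   tryDeleteRecursively(logDir)
```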
[GitHub] spark issue #16451: [SPARK-18922][SQL][CORE][STREAMING][TESTS] Fix all ident...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/16451 Ah, thank you. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16451: [SPARK-18922][SQL][CORE][STREAMING][TESTS] Fix all ident...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/16451 Build started: [TESTS] `org.apache.spark.streaming.kafka010.DirectKafkaStreamSuite` [![PR-16451](https://ci.appveyor.com/api/projects/status/github/spark-test/spark?branch=B6F011A1-2E0F-4A58-AAA5-A9C04DF8A5F8&svg=true)](https://ci.appveyor.com/project/spark-test/spark/branch/B6F011A1-2E0F-4A58-AAA5-A9C04DF8A5F8) Build started: [TESTS] `org.apache.spark.streaming.kafka.ReliableKafkaStreamSuite` [![PR-16451](https://ci.appveyor.com/api/projects/status/github/spark-test/spark?branch=8D9D4C4E-4137-4A39-84E9-68DC944F4B5A&svg=true)](https://ci.appveyor.com/project/spark-test/spark/branch/8D9D4C4E-4137-4A39-84E9-68DC944F4B5A) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16496: [SPARK-16101][SQL] Refactoring CSV write path to ...
GitHub user HyukjinKwon opened a pull request: https://github.com/apache/spark/pull/16496 [SPARK-16101][SQL] Refactoring CSV write path to be consistent with JSON data source ## What changes were proposed in this pull request? This PR refactors the CSV write path to be consistent with the JSON data source. It makes the methods in these classes take arguments consistent with their JSON counterparts. - `UnivocityGenerator` and `JacksonGenerator` ``` scala private[csv] class UnivocityGenerator( schema: StructType, writer: Writer, options: CSVOptions = new CSVOptions(Map.empty[String, String])) { ... def write ... def close ... def flush ... ``` ``` scala private[sql] class JacksonGenerator( schema: StructType, writer: Writer, options: JSONOptions = new JSONOptions(Map.empty[String, String])) { ... def write ... def close ... def flush ... ``` - This PR also puts the classes together in a manner consistent with JSON. - `CsvFileFormat` ``` scala CsvFileFormat CsvOutputWriter ``` - `JsonFileFormat` ``` scala JsonFileFormat JsonOutputWriter ``` ## How was this patch tested? Existing tests should cover this. You can merge this pull request into a Git repository by running: $ git pull https://github.com/HyukjinKwon/spark SPARK-16101-write Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/16496.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #16496 commit b8d97f7c391ea1f833d7a4025e3c627fc6385cfb Author: hyukjinkwon Date: 2017-01-07T14:47:25Z Refactoring CSV write path to be consistent with JSON data source --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
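A rough sketch of the generator shape the PR describes (simplified, not the actual `UnivocityGenerator`; the real class takes a `StructType` schema and `CSVOptions` and writes internal rows):

```scala
import java.io.Writer

// Sketch: a CSV generator exposing the same write/flush/close surface as the
// JSON side. Rows are pre-rendered field strings here for brevity.
class SimpleCsvGenerator(writer: Writer, delimiter: String = ",") {
  def write(row: Seq[String]): Unit = {
    writer.write(row.mkString(delimiter))
    writer.write("\n")
  }
  def flush(): Unit = writer.flush()
  def close(): Unit = writer.close()
}

// Usage:
//   val gen = new SimpleCsvGenerator(new java.io.PrintWriter(System.out))
//   gen.write(Seq("1", "a")); gen.flush(); gen.close()
```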
[GitHub] spark issue #16496: [SPARK-16101][SQL] Refactoring CSV write path to be cons...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/16496 cc @cloud-fan --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16496: [SPARK-16101][SQL] Refactoring CSV write path to be cons...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/16496 retest this please --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16451: [SPARK-18922][SQL][CORE][STREAMING][TESTS] Fix all ident...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/16451 Ah, I just found another one. Let me check all the other errors again. There are pretty few of them now (30-ish). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16501: [WIP][SPARK-19117][TESTS] Skip the tests using sc...
GitHub user HyukjinKwon opened a pull request: https://github.com/apache/spark/pull/16501 [WIP][SPARK-19117][TESTS] Skip the tests using script transformation on Windows ## What changes were proposed in this pull request? This PR proposes to skip the tests for script transformation failed on Windows due to fixed bash location. ``` SQLQuerySuite: - script *** FAILED *** (553 milliseconds) org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 56.0 failed 1 times, most recent failure: Lost task 0.0 in stage 56.0 (TID 54, localhost, executor driver): java.io.IOException: Cannot run program "/bin/bash": CreateProcess error=2, The system cannot find the file specified - Star Expansion - script transform *** FAILED *** (2 seconds, 375 milliseconds) org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 389.0 failed 1 times, most recent failure: Lost task 0.0 in stage 389.0 (TID 725, localhost, executor driver): java.io.IOException: Cannot run program "/bin/bash": CreateProcess error=2, The system cannot find the file specified - test script transform for stdout *** FAILED *** (2 seconds, 813 milliseconds) org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 391.0 failed 1 times, most recent failure: Lost task 0.0 in stage 391.0 (TID 726, localhost, executor driver): java.io.IOException: Cannot run program "/bin/bash": CreateProcess error=2, The system cannot find the file specified - test script transform for stderr *** FAILED *** (2 seconds, 407 milliseconds) org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 393.0 failed 1 times, most recent failure: Lost task 0.0 in stage 393.0 (TID 727, localhost, executor driver): java.io.IOException: Cannot run program "/bin/bash": CreateProcess error=2, The system cannot find the file specified - test script transform data type *** FAILED *** (171 milliseconds) org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 395.0 failed 1 times, most recent failure: Lost task 0.0 in stage 395.0 (TID 728, localhost, executor driver): java.io.IOException: Cannot run program "/bin/bash": CreateProcess error=2, The system cannot find the file specified ``` ``` HiveQuerySuite: - transform *** FAILED *** (359 milliseconds) Failed to execute query using catalyst: Error: Job aborted due to stage failure: Task 0 in stage 1347.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1347.0 (TID 2395, localhost, executor driver): java.io.IOException: Cannot run program "/bin/bash": CreateProcess error=2, The system cannot find the file specified - schema-less transform *** FAILED *** (344 milliseconds) Failed to execute query using catalyst: Error: Job aborted due to stage failure: Task 0 in stage 1348.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1348.0 (TID 2396, localhost, executor driver): java.io.IOException: Cannot run program "/bin/bash": CreateProcess error=2, The system cannot find the file specified - transform with custom field delimiter *** FAILED *** (296 milliseconds) Failed to execute query using catalyst: Error: Job aborted due to stage failure: Task 0 in stage 1349.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1349.0 (TID 2397, localhost, executor driver): java.io.IOException: Cannot run program "/bin/bash": CreateProcess error=2, The system cannot find the file specified - transform with custom field delimiter2 *** FAILED *** (297 milliseconds) Failed to execute query using 
catalyst: Error: Job aborted due to stage failure: Task 0 in stage 1350.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1350.0 (TID 2398, localhost, executor driver): java.io.IOException: Cannot run program "/bin/bash": CreateProcess error=2, The system cannot find the file specified - transform with custom field delimiter3 *** FAILED *** (312 milliseconds) Failed to execute query using catalyst: Error: Job aborted due to stage failure: Task 0 in stage 1351.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1351.0 (TID 2399, localhost, executor driver): java.io.IOException: Cannot run program "/bin/bash": CreateProcess error=2, The system cannot find the file specified - transform with SerDe2 *** FAILED *** (437 milliseconds) org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1355.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1355.0 (TID 2403, localhost, executor driver): java.io.IOException: Cannot run program "/bin/bash": Cr
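One way to express the skip described above (a sketch using ScalaTest's `assume`; the PR itself may use a different helper to detect whether bash is available):

```scala
import org.scalatest.Assertions.assume

// Sketch: cancel (rather than fail) tests that shell out to /bin/bash,
// which does not exist on Windows.
def assumeBashAvailable(): Unit = {
  val bashExists = new java.io.File("/bin/bash").exists()
  assume(bashExists, "/bin/bash is required for script transformation tests")
}

// In a test body:
//   assumeBashAvailable()
//   sql("SELECT TRANSFORM (key, value) USING 'cat' FROM src").collect()
```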
[GitHub] spark issue #16451: [SPARK-18922][SQL][CORE][STREAMING][TESTS] Fix all ident...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/16451 ``` - recovery with file input stream *** FAILED *** (10 seconds, 205 milliseconds) The code passed to eventually never returned normally. Attempted 660 times over 10.0142724 seconds. Last failure message: Unexpected internal error near index 1 \ ^. (CheckpointSuite.scala:680) - SPARK-18220: read Hive orc table with varchar column *** FAILED *** (2 seconds, 563 milliseconds) org.apache.spark.sql.execution.QueryExecutionException: FAILED: Execution Error, return code -101 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask. org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Ljava/lang/String;I)Z - rolling file appender - size-based rolling (compressed) *** FAILED *** (15 milliseconds) 1000 was not less than 1000 (FileAppenderSuite.scala:128) - recover from node failures with replication *** FAILED *** (34 seconds, 613 milliseconds) org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 6.0 failed 4 times, most recent failure: Lost task 1.3 in stage 6.0 (TID 33, localhost, executor 28): java.io.IOException: org.apache.spark.SparkException: Failed to get broadcast_6_piece0 of broadcast_6 ... Caused by: org.apache.spark.SparkException: Failed to get broadcast_6_piece0 of broadcast_6 ... ``` `recovery with file input stream` - seems to be the same problem as this one. `SPARK-18220: read Hive orc table with varchar column` - I could not see the cause at first glance, but it seems unrelated to this problem because the test apparently does not use any path. `rolling file appender - size-based rolling (compressed)` - this test seems possibly flaky. It passed in https://ci.appveyor.com/project/spark-test/spark/build/503-6DEDA384-4A91-45CD-AD26-EE0757D3D2AC/job/etb359vk0fwbrqgo `recover from node failures with replication` - this test seems possibly flaky too. It passed in https://ci.appveyor.com/project/spark-test/spark/build/503-6DEDA384-4A91-45CD-AD26-EE0757D3D2AC/job/etb359vk0fwbrqgo The other test failures are dealt with in https://github.com/apache/spark/pull/16501. Let me add a fix for the first one in `CheckpointSuite` after verifying it. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16451: [SPARK-18922][SQL][CORE][STREAMING][TESTS] Fix al...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/16451#discussion_r95078070 --- Diff: streaming/src/test/scala/org/apache/spark/streaming/CheckpointSuite.scala --- @@ -629,7 +629,7 @@ class CheckpointSuite extends TestSuiteBase with DStreamCheckpointTester ssc.graph.getInputStreams().head.asInstanceOf[FileInputDStream[_, _, _]] val filenames = fileInputDStream.batchTimeToSelectedFiles.synchronized { fileInputDStream.batchTimeToSelectedFiles.values.flatten } - filenames.map(_.split(File.separator).last.toInt).toSeq.sorted + filenames.map(_.split("/").last.toInt).toSeq.sorted --- End diff -- This is always "/" because `FileInputDStream.batchTimeToSelectedFiles` is in form of `/a/b/c` via https://github.com/apache/spark/blob/39e2bad6a866d27c3ca594d15e574a1da3ee84cc/streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala#L207 ```scala scala> import org.apache.hadoop.fs.Path import org.apache.hadoop.fs.Path scala> new Path("C:\\a\\b\\c").toString res0: String = C:/a/b/c ``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16451: [SPARK-18922][SQL][CORE][STREAMING][TESTS] Fix al...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/16451#discussion_r95078082 --- Diff: streaming/src/test/scala/org/apache/spark/streaming/CheckpointSuite.scala --- @@ -755,7 +755,14 @@ class CheckpointSuite extends TestSuiteBase with DStreamCheckpointTester assert(outputBuffer.asScala.flatten.toSet === expectedOutput.toSet) } } finally { - Utils.deleteRecursively(testDir) + try { +// As the driver shuts down in the middle of processing above, `testDir` is not closed +// correctly which causes the test failure on Windows. +Utils.deleteRecursively(testDir) + } catch { +case e: IOException if Utils.isWindows => + logWarning(e.getMessage) + } --- End diff -- Here, it seems the driver shuts down in the middle of processing on purpose for the test. So, it looks like it ends up not closing `testDir`. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16451: [SPARK-18922][SQL][CORE][STREAMING][TESTS] Fix all ident...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/16451 Build started: [TESTS] `org.apache.spark.streaming.CheckpointSuite` [![PR-16451](https://ci.appveyor.com/api/projects/status/github/spark-test/spark?branch=19AF9EA6-D757-4E20-A27E-A6A86D6401EA&svg=true)](https://ci.appveyor.com/project/spark-test/spark/branch/19AF9EA6-D757-4E20-A27E-A6A86D6401EA) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16451: [SPARK-18922][SQL][CORE][STREAMING][TESTS] Fix all ident...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/16451 (FWIW, I am pretty sure of this PR for now) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16501: [WIP][SPARK-19117][TESTS] Skip the tests using script tr...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/16501 Build started: [TESTS] `ALL` [![PR-16501](https://ci.appveyor.com/api/projects/status/github/spark-test/spark?branch=16201E76-7067-408C-A359-61B524963D3B&svg=true)](https://ci.appveyor.com/project/spark-test/spark/branch/16201E76-7067-408C-A359-61B524963D3B) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16502: Branch 2.1
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/16502 Hi @bupt2012, it seems this PR was opened by mistake. Could you please close it? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16501: [WIP][SPARK-19117][TESTS] Skip the tests using script tr...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/16501 Build started: [TESTS] `org.apache.spark.sql.hive.execution.SQLQuerySuite` [![PR-16501](https://ci.appveyor.com/api/projects/status/github/spark-test/spark?branch=82B29FFC-BA11-4AA9-ACFC-F529FFAADB12&svg=true)](https://ci.appveyor.com/project/spark-test/spark/branch/82B29FFC-BA11-4AA9-ACFC-F529FFAADB12) Build started: [TESTS] `org.apache.spark.sql.hive.execution.HiveQuerySuite` [![PR-16501](https://ci.appveyor.com/api/projects/status/github/spark-test/spark?branch=1D8572C0-31FF-43ED-AC2E-476C76127BF7&svg=true)](https://ci.appveyor.com/project/spark-test/spark/branch/1D8572C0-31FF-43ED-AC2E-476C76127BF7) Build started: [TESTS] `org.apache.spark.sql.catalyst.LogicalPlanToSQLSuite` [![PR-16501](https://ci.appveyor.com/api/projects/status/github/spark-test/spark?branch=0E18DDFB-25FC-4868-B3CD-F18E29A0618D&svg=true)](https://ci.appveyor.com/project/spark-test/spark/branch/0E18DDFB-25FC-4868-B3CD-F18E29A0618D) Build started: [TESTS] `org.apache.spark.sql.hive.execution.ScriptTransformationSuite` [![PR-16501](https://ci.appveyor.com/api/projects/status/github/spark-test/spark?branch=3935B8AA-FEB7-4232-91D5-B9C8C13E596F&svg=true)](https://ci.appveyor.com/project/spark-test/spark/branch/3935B8AA-FEB7-4232-91D5-B9C8C13E596F) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16429: [SPARK-19019][PYTHON] Fix hijacked `collections.namedtup...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/16429 @joshrosen and @davies, could you review this if you have some time please? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16501: [WIP][SPARK-19117][TESTS] Skip the tests using script tr...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/16501 Build started: [TESTS] `org.apache.spark.sql.hive.execution.HiveQuerySuite` [![PR-16501](https://ci.appveyor.com/api/projects/status/github/spark-test/spark?branch=072A48DF-9BFF-488E-9510-4FE37B211F68&svg=true)](https://ci.appveyor.com/project/spark-test/spark/branch/072A48DF-9BFF-488E-9510-4FE37B211F68) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16515: [MINOR][PYTHON][EXAMPLE] Fix binary classificatio...
GitHub user HyukjinKwon opened a pull request: https://github.com/apache/spark/pull/16515 [MINOR][PYTHON][EXAMPLE] Fix binary classification metrics example to work ## What changes were proposed in this pull request? The LibSVM datasource loads `ml.linalg.SparseVector` whereas the example requires it to be `mllib.linalg.SparseVector`. The Scala example, `BinaryClassificationMetricsExample.scala`, is fine. ``` File ".../spark/examples/src/main/python/mllib/binary_classification_metrics_example.py", line 39, in .rdd.map(lambda row: LabeledPoint(row[0], row[1])) File ".../spark/python/pyspark/mllib/regression.py", line 54, in __init__ self.features = _convert_to_vector(features) File ".../spark/python/pyspark/mllib/linalg/__init__.py", line 80, in _convert_to_vector raise TypeError("Cannot convert type %s into Vector" % type(l)) TypeError: Cannot convert type into Vector ``` ## How was this patch tested? Manually via ``` ./bin/spark-submit examples/src/main/python/mllib/binary_classification_metrics_example.py ``` You can merge this pull request into a Git repository by running: $ git pull https://github.com/HyukjinKwon/spark minor-example-fix Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/16515.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #16515 commit 9a4fd40609d6ef71dc3fd3db0f72502eb3f070f0 Author: hyukjinkwon Date: 2017-01-09T10:31:15Z Fix binary classification metrics example to work --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16515: [MINOR][PYTHON][EXAMPLE] Fix binary classificatio...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/16515#discussion_r95132326 --- Diff: examples/src/main/python/mllib/binary_classification_metrics_example.py --- @@ -18,25 +18,20 @@ Binary Classification Metrics Example. """ from __future__ import print_function -from pyspark.sql import SparkSession +from pyspark import SparkContext # $example on$ from pyspark.mllib.classification import LogisticRegressionWithLBFGS from pyspark.mllib.evaluation import BinaryClassificationMetrics -from pyspark.mllib.regression import LabeledPoint +from pyspark.mllib.util import MLUtils # $example off$ if __name__ == "__main__": -spark = SparkSession\ -.builder\ -.appName("BinaryClassificationMetricsExample")\ -.getOrCreate() +sc = SparkContext(appName="BinaryClassificationMetricsExample") --- End diff -- I just used `SparkContext` to be consistent with other examples. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16515: [MINOR][PYTHON][EXAMPLE] Fix binary classification metri...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/16515 @yanboliang Could I ask you to take a look, please? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16515: [MINOR][PYTHON][EXAMPLE] Fix binary classification metri...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/16515 Hm.. actually, it seems there are more. Let me open a JIRA and sweep it. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16515: [SPARK-19134][PYTHON][EXAMPLE] Fix several Python...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/16515#discussion_r95140703 --- Diff: examples/src/main/python/mllib/binary_classification_metrics_example.py --- @@ -18,25 +18,20 @@ Binary Classification Metrics Example. """ from __future__ import print_function -from pyspark.sql import SparkSession +from pyspark import SparkContext # $example on$ from pyspark.mllib.classification import LogisticRegressionWithLBFGS from pyspark.mllib.evaluation import BinaryClassificationMetrics -from pyspark.mllib.regression import LabeledPoint +from pyspark.mllib.util import MLUtils # $example off$ if __name__ == "__main__": -spark = SparkSession\ -.builder\ -.appName("BinaryClassificationMetricsExample")\ -.getOrCreate() +sc = SparkContext(appName="BinaryClassificationMetricsExample") --- End diff -- Yes, as far as I understand. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16515: [SPARK-19134][EXAMPLE] Fix several sql, mllib and status...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/16515 Thank you all! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16553: [SPARK-9435][SQL] Reuse function in Java UDF to c...
GitHub user HyukjinKwon opened a pull request: https://github.com/apache/spark/pull/16553 [SPARK-9435][SQL] Reuse function in Java UDF to correctly support expressions that require equality comparison ## What changes were proposed in this pull request? Currently, running the code below in Java ```java spark.udf().register("inc", new UDF1() { @Override public Long call(Long i) { return i + 1; } }, DataTypes.LongType); spark.range(10).toDF("x").createOrReplaceTempView("tmp"); Row result = spark.sql("SELECT inc(x) FROM tmp GROUP BY inc(x)").head(); Assert.assertEquals(7, result.getLong(0)); ``` fails as below: ``` org.apache.spark.sql.AnalysisException: expression 'tmp.`x`' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;; Aggregate [UDF(x#19L)], [UDF(x#19L) AS UDF(x)#23L] +- SubqueryAlias tmp, `tmp` +- Project [id#16L AS x#19L] +- Range (0, 10, step=1, splits=Some(8)) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:40) at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:57) ``` The root cause is that we were creating the function value every time it needs to be built, as below: ```scala scala> def inc(i: Int) = i + 1 inc: (i: Int)Int scala> (inc(_: Int)).hashCode res15: Int = 1231799381 scala> (inc(_: Int)).hashCode res16: Int = 2109839984 scala> (inc(_: Int)) == (inc(_: Int)) res17: Boolean = false ``` This seems to lead to comparison failures between `ScalaUDF`s created from the Java UDF API, for example, in `Expression.semanticEquals`. In the case of the Scala API, it seems already fine. Both can be tested easily as below if any reviewer is comfortable with Scala: ```scala val df = Seq((1, 10), (2, 11), (3, 12)).toDF("x", "y") val javaUDF = new UDF1[Int, Int] { override def call(i: Int): Int = i + 1 } // spark.udf.register("inc", javaUDF, IntegerType) // Uncomment this for Java API // spark.udf.register("inc", (i: Int) => i + 1)// Uncomment this for Scala API df.createOrReplaceTempView("tmp") spark.sql("SELECT inc(y) FROM tmp GROUP BY inc(y)").show() ``` ## How was this patch tested? Unit test in `JavaUDFSuite.java`. You can merge this pull request into a Git repository by running: $ git pull https://github.com/HyukjinKwon/spark SPARK-9435 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/16553.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #16553 commit 30ed14f38b5c38091d07d0e014a49e494aeb73cc Author: hyukjinkwon Date: 2017-01-11T18:02:08Z Reuse function in Java UDF to support correctly expression equality comparison --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
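To make the fix concrete, a tiny illustrative sketch (not the patch itself): Scala function values compare by reference identity, so creating the closure once and reusing it is what keeps the resulting expressions comparable.

```scala
// Sketch: two separately-created function values are never equal,
// while a single value reused twice is equal to itself.
def inc(i: Int): Int = i + 1

val fresh1 = inc(_: Int)
val fresh2 = inc(_: Int)
assert(fresh1 != fresh2)   // distinct closure instances, distinct identity

val reused = inc(_: Int)   // create the function value once...
assert(reused == reused)   // ...and reuse it: equality and hashCode stay stable
```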
[GitHub] spark issue #16553: [SPARK-9435][SQL] Reuse function in Java UDF to correctl...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/16553 cc @marmbrus, I just saw you in the JIRA. Could you please take a look? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16553: [SPARK-9435][SQL] Reuse function in Java UDF to c...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/16553#discussion_r95694410 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/UDFRegistration.scala --- @@ -488,219 +488,241 @@ class UDFRegistration private[sql] (functionRegistry: FunctionRegistry) extends * @since 1.3.0 */ def register(name: String, f: UDF1[_, _], returnType: DataType): Unit = { +val func = f.asInstanceOf[UDF1[Any, Any]].call(_: Any) --- End diff -- Ah, sure. Thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16429: [SPARK-19019][PYTHON] Fix hijacked `collections.namedtup...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/16429 gentle ping @joshrosen and @davies --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16553: [SPARK-9435][SQL] Reuse function in Java UDF to c...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/16553#discussion_r95705218 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/UDFRegistration.scala --- @@ -109,9 +109,10 @@ class UDFRegistration private[sql] (functionRegistry: FunctionRegistry) extends | * @since 1.3.0 | */ |def register(name: String, f: UDF$i[$extTypeArgs, _], returnType: DataType): Unit = { + | val func = f$anyCast.call($anyParams) | functionRegistry.registerFunction( |name, - |(e: Seq[Expression]) => ScalaUDF(func, returnType, e)) +(e: Seq[Expression]) => ScalaUDF(func, returnType, e)) --- End diff -- I verified this by regenerating the code from the template, pasting it over the current changes, and checking that there was no diff. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14451: [SPARK-16848][SQL] Check schema validation for user-spec...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/14451 I just rebased this. Could this be merged by any chance after the build? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16502: Branch 2.1
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/16502 Hi @bupt2012, could you just click the "Close pull request" button below? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14451: [SPARK-16848][SQL] Check schema validation for user-spec...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/14451 Thank you @gatorsmile! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16584: [SPARK-19221][PROJECT INFRA][R] Add winutils bina...
GitHub user HyukjinKwon opened a pull request: https://github.com/apache/spark/pull/16584 [SPARK-19221][PROJECT INFRA][R] Add winutils binaries to the path in AppVeyor tests for Hadoop libraries to call native codes properly ## What changes were proposed in this pull request? It seems Hadoop libraries need winutils binaries for native libraries in the path. It is not a problem in tests for now because we are only testing SparkR on Windows via AppVeyor but it can be a problem if we run Scala tests via AppVeyor as below: ``` - SPARK-18220: read Hive orc table with varchar column *** FAILED *** (3 seconds, 937 milliseconds) org.apache.spark.sql.execution.QueryExecutionException: FAILED: Execution Error, return code -101 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask. org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Ljava/lang/String;I)Z at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:625) at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:609) at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:283) ... ``` This PR proposes to add it to the `Path` for AppVeyor tests. ## How was this patch tested? Manually via AppVeyor. **Before** https://ci.appveyor.com/project/spark-test/spark/build/572-windows-complete/job/c4vrysr5uvj2hgu7 **After** https://ci.appveyor.com/project/spark-test/spark/build/549-windows-complete/job/gc8a1pjua2bc4i8m You can merge this pull request into a Git repository by running: $ git pull https://github.com/HyukjinKwon/spark set-path-appveyor Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/16584.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #16584 commit 7a8163cd006f669371b5188e8b55fc12fe5375d3 Author: hyukjinkwon Date: 2017-01-13T16:10:03Z Add hadoop.dll in AppVeyor tests for Hadoop libraries to call native libraries properly --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16584: [SPARK-19221][PROJECT INFRA][R] Add winutils binaries to...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/16584 Here are the SparkR tests to verify it works fine after being merged into master. Build started: [SparkR] `ALL` [![PR-16584](https://ci.appveyor.com/api/projects/status/github/spark-test/spark?branch=D02D7C5B-ED94-4326-B72C-DDC1229ECC88&svg=true)](https://ci.appveyor.com/project/spark-test/spark/branch/D02D7C5B-ED94-4326-B72C-DDC1229ECC88) And here is `OrcSourceSuite`, which contains the problematic test that seems to use the winutils binaries. Build started: [org.apache.spark.sql.hive.orc.OrcSourceSuite] `ALL` [![PR-16584](https://ci.appveyor.com/api/projects/status/github/spark-test/spark?branch=ADC9D5F6-726B-4691-B60F-4B8DAB18010B&svg=true)](https://ci.appveyor.com/project/spark-test/spark/branch/ADC9D5F6-726B-4691-B60F-4B8DAB18010B)
[GitHub] spark issue #16584: [SPARK-19221][PROJECT INFRA][R] Add winutils binaries to...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/16584 cc @srowen, this is related to the last test failure I have identified so far. Let me finish fixing the test failures on Windows (found in [the way I have tested](https://ci.appveyor.com/project/spark-test/spark/build/530-windows-test)) in another PR, including newly introduced ones, (too) flaky ones and potentially missed ones, by showing a green build.
[GitHub] spark issue #16584: [SPARK-19221][PROJECT INFRA][R] Add winutils binaries to...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/16584 cc @shivaram too.
[GitHub] spark issue #16577: [SPARK-19214][SQL] Typed aggregate count output field na...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/16577 (This just rings a bell for me. It seems it is okay to break the default names if required - https://github.com/apache/spark/pull/1#issuecomment-246932928)
[GitHub] spark issue #16553: [SPARK-9435][SQL] Reuse function in Java UDF to correctl...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/16553 @marmbrus, could you take another look when you have some time?
[GitHub] spark issue #16429: [SPARK-19019][PYTHON] Fix hijacked `collections.namedtup...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/16429 gentle ping
[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/16578 Does this take over https://github.com/apache/spark/pull/14957? If so, we might need `Closes #14957` in the PR description so the merge script closes that one, or we should let the author know that this takes over that PR.
[GitHub] spark issue #16429: [SPARK-19019][PYTHON] Fix hijacked `collections.namedtup...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/16429 The only added comments are as below:

```python
# __kwdefaults__ contains the default values of keyword-only arguments which are
# introduced from Python 3. The possible cases for __kwdefaults__ in namedtuple
# are as below:
#
# - Does not exist in Python 2.
# - Returns None in <= Python 3.5.x.
# - Returns a dictionary containing the default values to the keys from Python 3.6.x
#   (See https://bugs.python.org/issue25628).
```
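(To make the three cases above concrete, here is a small standalone check, not part of the patch, that can be run under different interpreters.)

```python
import sys
import collections

# On Python 2 the namedtuple function has no __kwdefaults__ attribute at all;
# on Python 3.0-3.5 it is None because namedtuple has no keyword-only arguments;
# from Python 3.6 it is a dict holding the defaults of the keyword-only arguments
# (see https://bugs.python.org/issue25628).
print(sys.version_info)
print(getattr(collections.namedtuple, "__kwdefaults__", "does not exist"))
```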
[GitHub] spark issue #16429: [SPARK-19019][PYTHON] Fix hijacked `collections.namedtup...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/16429 (I re-ran `./run-tests --python-executables=python3.6` just to be sure.)
[GitHub] spark issue #16553: [SPARK-9435][SQL] Reuse function in Java UDF to correctl...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/16553 @gatorsmile Thanks!
[GitHub] spark pull request #16553: [SPARK-9435][SQL] Reuse function in Java UDF to c...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/16553#discussion_r96137414
--- Diff: sql/core/src/test/java/test/org/apache/spark/sql/JavaUDFSuite.java ---
@@ -108,4 +109,24 @@ public void udf3Test() {
     result = spark.sql("SELECT stringLengthTest('test', 'test2')").head();
     Assert.assertEquals(9, result.getInt(0));
   }
+
+  @SuppressWarnings("unchecked")
+  @Test
+  public void udf4Test() {
+    spark.udf().register("inc", new UDF1() {
+      @Override
+      public Long call(Long i) {
+        return i + 1;
+      }
+    }, DataTypes.LongType);
+
+    spark.range(10).toDF("x").createOrReplaceTempView("tmp");
+    List results = spark.sql("SELECT inc(x) FROM tmp GROUP BY inc(x)").collectAsList();
--- End diff --
Sure, makes sense. Thanks!