[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/12268#issuecomment-221157062 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/12268#issuecomment-221157064 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/59175/
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/12268#issuecomment-221156923 **[Test build #59175 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59175/consoleFull)** for PR 12268 at commit [`66b1757`](https://github.com/apache/spark/commit/66b17570a8d1ad53b5073bbfa439eb01b05413c1).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/12268#issuecomment-221156930 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/59174/
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/12268#issuecomment-221156929 Merged build finished. Test PASSed.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/12268#issuecomment-221156828 **[Test build #59174 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59174/consoleFull)** for PR 12268 at commit [`d1f616e`](https://github.com/apache/spark/commit/d1f616e2880e1100f9ffe71981a6039720d0eff4).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/12268#issuecomment-221147030 **[Test build #59174 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59174/consoleFull)** for PR 12268 at commit [`d1f616e`](https://github.com/apache/spark/commit/d1f616e2880e1100f9ffe71981a6039720d0eff4).
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/12268#issuecomment-221147706 **[Test build #59175 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59175/consoleFull)** for PR 12268 at commit [`66b1757`](https://github.com/apache/spark/commit/66b17570a8d1ad53b5073bbfa439eb01b05413c1).
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/12268#issuecomment-218956629 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/58538/
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/12268#issuecomment-218956628 Merged build finished. Test PASSed.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/12268#issuecomment-218956508 **[Test build #58538 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/58538/consoleFull)** for PR 12268 at commit [`cbb1674`](https://github.com/apache/spark/commit/cbb1674ecb4a82bfdb3fed97cdd14adbdd14ffb6).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/12268#issuecomment-218948489 **[Test build #58538 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/58538/consoleFull)** for PR 12268 at commit [`cbb1674`](https://github.com/apache/spark/commit/cbb1674ecb4a82bfdb3fed97cdd14adbdd14ffb6).
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/12268#issuecomment-217603358 Merged build finished. Test PASSed.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/12268#issuecomment-217603319 **[Test build #58046 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/58046/consoleFull)** for PR 12268 at commit [`f2234e3`](https://github.com/apache/spark/commit/f2234e3f7bac02c396a8638f69baab740bc83bb1).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `class NoSuchPermanentFunctionException(db: String, func: String)`
  * `class NoSuchFunctionException(db: String, func: String)`
  * `case class GetExternalRowField(`
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/12268#issuecomment-217603359 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/58046/
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/12268#issuecomment-217599175 **[Test build #58046 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/58046/consoleFull)** for PR 12268 at commit [`f2234e3`](https://github.com/apache/spark/commit/f2234e3f7bac02c396a8638f69baab740bc83bb1).
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/12268#issuecomment-217075071 Merged build finished. Test PASSed.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/12268#issuecomment-217075074 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/57834/
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/12268#issuecomment-217074901 **[Test build #57834 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57834/consoleFull)** for PR 12268 at commit [`a0aed27`](https://github.com/apache/spark/commit/a0aed27b7169caee50d0e97bceb6653202ba3f04).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/12268#issuecomment-217066498 **[Test build #57834 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57834/consoleFull)** for PR 12268 at commit [`a0aed27`](https://github.com/apache/spark/commit/a0aed27b7169caee50d0e97bceb6653202ba3f04).
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/12268#issuecomment-216097352 Merged build finished. Test PASSed.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/12268#issuecomment-216097353 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/57498/
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/12268#issuecomment-216097282 **[Test build #57498 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57498/consoleFull)** for PR 12268 at commit [`8e1bdf7`](https://github.com/apache/spark/commit/8e1bdf7176296eb9bd10f1249dd951abd0094191).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/12268#issuecomment-216094877 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/57496/
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/12268#issuecomment-216094875 Merged build finished. Test FAILed.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/12268#issuecomment-216094838 **[Test build #57496 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57496/consoleFull)** for PR 12268 at commit [`bd510c2`](https://github.com/apache/spark/commit/bd510c2b309f1da0099205838dd7856737c8ab61).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/12268#issuecomment-216091185 **[Test build #57498 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57498/consoleFull)** for PR 12268 at commit [`8e1bdf7`](https://github.com/apache/spark/commit/8e1bdf7176296eb9bd10f1249dd951abd0094191).
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/12268#issuecomment-216090185 **[Test build #57496 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57496/consoleFull)** for PR 12268 at commit [`bd510c2`](https://github.com/apache/spark/commit/bd510c2b309f1da0099205838dd7856737c8ab61).
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/12268#issuecomment-215928432 Since this is almost a complete rewrite, I think we should only consider it early in the release cycle, i.e. for 2.1, not for 2.0 when we are so close.
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/12268#issuecomment-215907300 @rxin, @hvanhovell Do you mind if I ask your thoughts on this please?
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/12268#discussion_r61366691

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala ---
@@ -0,0 +1,189 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.csv
+
+import scala.util.control.NonFatal
+
+import com.univocity.parsers.csv.{CsvParser, CsvParserSettings}
+
+import org.apache.spark.internal.Logging
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.expressions.GenericMutableRow
+import org.apache.spark.sql.types.{StructField, StructType}
+
+/**
+ * Converts CSV string to a sequence of string
+ */
+private[csv] object UnivocityParser extends Logging {
+  /**
+   * Convert the input iterator to a iterator having [[InternalRow]]
+   */
+  def parseCsv(
+      iter: Iterator[String],
+      schema: StructType,
+      requiredSchema: StructType,
+      headers: Array[String],
+      shouldDropHeader: Boolean,
+      options: CSVOptions): Iterator[InternalRow] = {
+    if (shouldDropHeader) {
+      CSVUtils.dropHeaderLine(iter, options)
+    }
+    val csv = CSVUtils.filterCommentAndEmpty(iter, options)
+
+    val schemaFields = schema.fields
+    val requiredFields = requiredSchema.fields
+    val safeRequiredFields = if (options.dropMalformed) {
+      // If `dropMalformed` is enabled, then it needs to parse all the values
+      // so that we can decide which row is malformed.
+      requiredFields ++ schemaFields.filterNot(requiredFields.contains(_))
+    } else {
+      requiredFields
+    }
+    val safeRequiredIndices = new Array[Int](safeRequiredFields.length)
+    schemaFields.zipWithIndex.filter {
+      case (field, _) => safeRequiredFields.contains(field)
+    }.foreach {
+      case (field, index) => safeRequiredIndices(safeRequiredFields.indexOf(field)) = index
+    }
+    val requiredSize = requiredFields.length
+
+    tokenizeData(csv, options, headers).flatMap { tokens =>
+      if (options.dropMalformed && schemaFields.length != tokens.length) {
+        logWarning(s"Dropping malformed line: ${tokens.mkString(options.delimiter.toString)}")
+        None
+      } else if (options.failFast && schemaFields.length != tokens.length) {
+        throw new RuntimeException(s"Malformed line in FAILFAST mode: " +
+          s"${tokens.mkString(options.delimiter.toString)}")
+      } else {
+        val indexSafeTokens = if (options.permissive && schemaFields.length > tokens.length) {
+          tokens ++ new Array[String](schemaFields.length - tokens.length)
+        } else if (options.permissive && schemaFields.length < tokens.length) {
+          tokens.take(schemaFields.length)
+        } else {
+          tokens
+        }
+        try {
+          val row = convertTokens(
+            indexSafeTokens,
+            safeRequiredIndices,
+            schemaFields,
+            requiredSize,
+            options)
+          Some(row)
+        } catch {
+          case NonFatal(e) if options.dropMalformed =>
+            logWarning("Parse exception. " +
+              s"Dropping malformed line: ${tokens.mkString(options.delimiter.toString)}")
+            None
+        }
+      }
+    }
+  }
+
+  /**
+   * Convert the tokens to [[InternalRow]]
+   */
+  private def convertTokens(
+      tokens: Array[String],
+      requiredIndices: Array[Int],
+      schemaFields: Array[StructField],
+      requiredSize: Int,
+      options: CSVOptions): InternalRow = {
+    val row = new GenericMutableRow(requiredSize)
--- End diff --

Oh yes! I noticed this too. The JSON data source does this as well, as far as I remember. This might have
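The index bookkeeping in the `parseCsv` diff above (`safeRequiredIndices`) can be illustrated outside Spark. The following is a hedged Python sketch of the idea, not Spark's actual code; the helper name `required_indices` and the sample schema are invented for illustration:

```python
def required_indices(schema_fields, required_fields):
    """For each required field, record its position in the full schema,
    mirroring the safeRequiredIndices bookkeeping in the quoted diff."""
    indices = [0] * len(required_fields)
    for schema_idx, field in enumerate(schema_fields):
        if field in required_fields:
            indices[required_fields.index(field)] = schema_idx
    return indices

# A projection that only needs "price" and "id" from a four-column schema:
schema = ["id", "name", "price", "qty"]
required = ["price", "id"]
print(required_indices(schema, required))  # → [2, 0]
```

In the Scala code, the same mapping lets `convertTokens` pull only the required columns out of each tokenized line, while `dropMalformed` can still force all columns to be parsed so malformed rows are detectable.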
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/12268#issuecomment-215275474 @hvanhovell If you think it makes sense, I will change the title of this PR and the JIRA, and will add some more commits to deal with minor things (code style, etc.).
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/12268#issuecomment-215274872 @hvanhovell Thank you for the close look! I think I need to change the title of this PR and the JIRA, because "better performance" might be too broad. The main purposes of this PR were:
- Refactoring this to be consistent with the JSON data source
- Removing the unnecessary conversion from `Iterator` to `Reader`

Could I please handle the rest in separate JIRAs and follow-up PRs, if that makes sense?
[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/12268#discussion_r61359944 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala --- @@ -0,0 +1,189 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.sql.execution.datasources.csv + +import scala.util.control.NonFatal + +import com.univocity.parsers.csv.{CsvParser, CsvParserSettings} + +import org.apache.spark.internal.Logging +import org.apache.spark.rdd.RDD +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.catalyst.expressions.GenericMutableRow +import org.apache.spark.sql.types.{StructField, StructType} + +/** + * Converts CSV string to a sequence of string + */ +private[csv] object UnivocityParser extends Logging { + /** + * Convert the input iterator to a iterator having [[InternalRow]] + */ + def parseCsv( + iter: Iterator[String], + schema: StructType, + requiredSchema: StructType, + headers: Array[String], + shouldDropHeader: Boolean, + options: CSVOptions): Iterator[InternalRow] = { +if (shouldDropHeader) { + CSVUtils.dropHeaderLine(iter, options) +} +val csv = CSVUtils.filterCommentAndEmpty(iter, options) + +val schemaFields = schema.fields +val requiredFields = requiredSchema.fields +val safeRequiredFields = if (options.dropMalformed) { + // If `dropMalformed` is enabled, then it needs to parse all the values + // so that we can decide which row is malformed. 
+ requiredFields ++ schemaFields.filterNot(requiredFields.contains(_)) +} else { + requiredFields +} +val safeRequiredIndices = new Array[Int](safeRequiredFields.length) +schemaFields.zipWithIndex.filter { + case (field, _) => safeRequiredFields.contains(field) +}.foreach { + case (field, index) => safeRequiredIndices(safeRequiredFields.indexOf(field)) = index +} +val requiredSize = requiredFields.length + +tokenizeData(csv, options, headers).flatMap { tokens => + if (options.dropMalformed && schemaFields.length != tokens.length) { +logWarning(s"Dropping malformed line: ${tokens.mkString(options.delimiter.toString)}") +None + } else if (options.failFast && schemaFields.length != tokens.length) { +throw new RuntimeException(s"Malformed line in FAILFAST mode: " + + s"${tokens.mkString(options.delimiter.toString)}") + } else { +val indexSafeTokens = if (options.permissive && schemaFields.length > tokens.length) { + tokens ++ new Array[String](schemaFields.length - tokens.length) +} else if (options.permissive && schemaFields.length < tokens.length) { + tokens.take(schemaFields.length) --- End diff -- Oh, I haven't tested this yet, but I am fairly sure it would work even without this logic. Still, I think it is safer to slice here: `tokens` can be larger than `schemaFields`. I can remove this if you feel strongly about it, but I feel it is okay to leave it.
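The permissive-mode handling quoted above — pad short rows out to the schema length with nulls, trim extra trailing tokens from long rows — can be sketched as a standalone function. This is a simplified illustration; `PermissiveTokens` and `schemaLength` are hypothetical names, not part of the PR:

```scala
object PermissiveTokens {
  // Make a token array "index safe" for a schema of the given length:
  // pad short rows with nulls (a fresh Array[String] is null-initialized)
  // and drop trailing extra tokens from long rows.
  def indexSafeTokens(tokens: Array[String], schemaLength: Int): Array[String] =
    if (tokens.length < schemaLength) {
      tokens ++ new Array[String](schemaLength - tokens.length)
    } else if (tokens.length > schemaLength) {
      tokens.take(schemaLength)
    } else {
      tokens
    }
}
```

Either way, every downstream access `tokens(i)` for `i < schemaLength` is then safe, which is the point of the slicing being discussed.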
[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/12268#discussion_r61359353 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala --- (same hunk as quoted above) + tokens ++ new Array[String](schemaFields.length - tokens.length) --- End diff -- Thanks for pointing this out. I will think about it further; maybe I could address it in a separate PR, if you think that is sensible. This code was copied from the original.
[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/12268#discussion_r61359253 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala --- (same hunk as quoted above, continuing:) +} else { + tokens +} +try { + val row = convertTokens( +indexSafeTokens, +safeRequiredIndices, +schemaFields, +requiredSize, +options) + Some(row) +} catch { + case NonFatal(e) if options.dropMalformed => +logWarning("Parse exception. " + + s"Dropping malformed line: ${tokens.mkString(options.delimiter.toString)}") +None +} + } +} + } + + /** + * Convert the tokens to [[InternalRow]] + */ + private def convertTokens( --- End diff -- I see. Could I also do this in a separate PR dedicated to that purpose? The code was just copied from the original; I only extracted it into a function, with a name consistent with the JSON data source's `convertX()` methods.
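The `safeRequiredIndices` computation in the hunk above maps each required (projected) field to its position in the full schema, so token lookup during conversion is a plain array index. A minimal sketch over plain field names (the `RequiredIndices` helper is hypothetical, not the PR's code):

```scala
object RequiredIndices {
  // For each required field, find its index in the full schema:
  // indices(i) is the schema position of requiredFields(i).
  def requiredIndices(schemaFields: Array[String], requiredFields: Array[String]): Array[Int] = {
    val indices = new Array[Int](requiredFields.length)
    schemaFields.zipWithIndex.filter { case (field, _) => requiredFields.contains(field) }
      .foreach { case (field, schemaIndex) => indices(requiredFields.indexOf(field)) = schemaIndex }
    indices
  }
}
```

For a schema `a, b, c` with required columns `c, a`, this yields `Array(2, 0)`: the projection order is preserved while the indices point back into the full token array.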
[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/12268#discussion_r61359080 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityGenerator.scala --- @@ -0,0 +1,80 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.sql.execution.datasources.csv + +import com.univocity.parsers.csv.{CsvWriter, CsvWriterSettings} + +import org.apache.spark.internal.Logging +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.types.StructType + +/** + * Converts a sequence of string to CSV string + */ +private[csv] object UnivocityGenerator extends Logging { + /** + * Transforms a single InternalRow to CSV using Univocity + * + * @param rowSchema the schema object used for conversion + * @param writer a CsvWriter object + * @param headers headers to write + * @param writeHeader true if it needs to write header + * @param options CSVOptions object containing options + * @param row The row to convert + */ + def apply( + rowSchema: StructType, + writer: CsvWriter, + headers: Array[String], + writeHeader: Boolean, + options: CSVOptions)(row: InternalRow): Unit = { +val tokens = { + row.toSeq(rowSchema).map { field => --- End diff -- Thank you! Could I maybe do this in a separate PR dedicated to that purpose? This was just copied from the original code.
[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/12268#discussion_r6135 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVRelation.scala --- @@ -17,152 +17,162 @@ package org.apache.spark.sql.execution.datasources.csv -import scala.util.control.NonFatal - -import org.apache.hadoop.fs.Path -import org.apache.hadoop.io.{NullWritable, Text} -import org.apache.hadoop.mapreduce.RecordWriter -import org.apache.hadoop.mapreduce.TaskAttemptContext +import java.io.CharArrayWriter +import java.nio.charset.{Charset, StandardCharsets} + +import com.univocity.parsers.csv.CsvWriter +import org.apache.hadoop.conf.Configuration +import org.apache.hadoop.fs.{FileStatus, Path} +import org.apache.hadoop.io.{LongWritable, NullWritable, Text} +import org.apache.hadoop.mapred.TextInputFormat +import org.apache.hadoop.mapreduce.{Job, RecordWriter, TaskAttemptContext} import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat import org.apache.spark.internal.Logging import org.apache.spark.rdd.RDD import org.apache.spark.sql._ import org.apache.spark.sql.catalyst.InternalRow -import org.apache.spark.sql.catalyst.expressions.GenericMutableRow -import org.apache.spark.sql.execution.datasources.{OutputWriter, OutputWriterFactory, PartitionedFile} +import org.apache.spark.sql.catalyst.expressions.JoinedRow +import org.apache.spark.sql.catalyst.expressions.codegen.GenerateUnsafeProjection +import org.apache.spark.sql.execution.datasources._ +import org.apache.spark.sql.sources._ import org.apache.spark.sql.types._ +import org.apache.spark.util.SerializableConfiguration -object CSVRelation extends Logging { - - def univocityTokenizer( - file: RDD[String], - header: Seq[String], - firstLine: String, - params: CSVOptions): RDD[Array[String]] = { -// If header is set, make sure firstLine is materialized before sending to executors. 
-file.mapPartitions { iter => - new BulkCsvReader( -if (params.headerFlag) iter.filterNot(_ == firstLine) else iter, -params, -headers = header) -} - } +/** + * Provides access to CSV data from pure SQL statements. + */ +class DefaultSource extends FileFormat with DataSourceRegister { + + override def shortName(): String = "csv" + + override def toString: String = "CSV" + + override def hashCode(): Int = getClass.hashCode() + + override def equals(other: Any): Boolean = other.isInstanceOf[DefaultSource] - def csvParser( - schema: StructType, - requiredColumns: Array[String], - params: CSVOptions): Array[String] => Option[InternalRow] = { -val schemaFields = schema.fields -val requiredFields = StructType(requiredColumns.map(schema(_))).fields -val safeRequiredFields = if (params.dropMalformed) { - // If `dropMalformed` is enabled, then it needs to parse all the values - // so that we can decide which row is malformed. - requiredFields ++ schemaFields.filterNot(requiredFields.contains(_)) + override def inferSchema( + sparkSession: SparkSession, + options: Map[String, String], + files: Seq[FileStatus]): Option[StructType] = { +val csvOptions = new CSVOptions(options) + +// TODO: Move filtering. 
+val paths = files.filterNot(_.getPath.getName startsWith "_").map(_.getPath.toString) +val rdd = createBaseRdd(sparkSession, csvOptions, paths) +val schema = if (csvOptions.inferSchemaFlag) { + InferSchema.infer(rdd, csvOptions) } else { - requiredFields -} -val safeRequiredIndices = new Array[Int](safeRequiredFields.length) -schemaFields.zipWithIndex.filter { - case (field, _) => safeRequiredFields.contains(field) -}.foreach { - case (field, index) => safeRequiredIndices(safeRequiredFields.indexOf(field)) = index -} -val requiredSize = requiredFields.length -val row = new GenericMutableRow(requiredSize) - -(tokens: Array[String]) => { - if (params.dropMalformed && schemaFields.length != tokens.length) { -logWarning(s"Dropping malformed line: ${tokens.mkString(params.delimiter.toString)}") -None - } else if (params.failFast && schemaFields.length != tokens.length) { -throw new RuntimeException(s"Malformed line in FAILFAST mode: " + - s"${tokens.mkString(params.delimiter.toString)}") + // By default fields are assumed to be StringType + val filteredRdd = rdd.mapPartitions(CSVUtils.filterCommentAndEmpty(_, csvOptions)) +
[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/12268#discussion_r61358822 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVRelation.scala --- (same hunk as quoted above) +val paths = files.filterNot(_.getPath.getName startsWith "_").map(_.getPath.toString) --- End diff -- I see; I cannot guarantee that. The JSON data source also skips files matching `name.startsWith("_") || name.startsWith(".")`, so let me follow that first. Can I maybe revisit this together with the JSON data source, after figuring it out, in a separate PR or a follow-up?
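The filtering discussed above — skipping metadata files whose names start with `_` or `.`, as the JSON data source does — can be sketched over plain file names. The `PathFilter` helper is hypothetical; the real code operates on Hadoop `Path` objects, not strings:

```scala
object PathFilter {
  // Keep only data files: drop names like "_SUCCESS" or ".hidden" that
  // writers and filesystem tools leave alongside the actual data files.
  def dataFiles(names: Seq[String]): Seq[String] =
    names.filterNot(n => n.startsWith("_") || n.startsWith("."))
}
```

With input `Seq("_SUCCESS", ".hidden", "part-00000.csv")`, only `part-00000.csv` survives, which matches the behavior being proposed for schema inference.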
[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/12268#discussion_r61358639 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala --- (same hunk as quoted above) +private[csv] object UnivocityParser extends Logging { --- End diff -- The name was also modeled after the JSON data source's `JacksonParser`. If it looks problematic, can I rename both together in a follow-up or another PR?
[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/12268#discussion_r61358548 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityGenerator.scala --- (same hunk as quoted above) + def apply( --- End diff -- The name was also modeled after the JSON data source's `JacksonGenerator`.
[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/12268#discussion_r61358501 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityGenerator.scala --- (same hunk as quoted above) +private[csv] object UnivocityGenerator extends Logging { --- End diff -- Thanks! The name was also modeled after the JSON data source's `JacksonGenerator`. Maybe I can rename them together once this one is merged.
[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/12268#discussion_r61358445 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/InferSchema.scala --- @@ -30,22 +30,37 @@ import org.apache.spark.sql.catalyst.util.DateTimeUtils import org.apache.spark.sql.types._ import org.apache.spark.unsafe.types.UTF8String -private[csv] object CSVInferSchema { +private[csv] object InferSchema { /** * Similar to the JSON schema inference * 1. Infer type of each row * 2. Merge row types to find common type * 3. Replace any null types with string type */ - def infer( - tokenRdd: RDD[Array[String]], - header: Array[String], - nullValue: String = ""): StructType = { + def infer(csv: RDD[String], options: CSVOptions): StructType = { --- End diff -- Actually, it does call this class's method in `DefaultSource.inferSchema`. I intentionally kept the same structure as `JSONRelation`; the JSON data source also has a class with the same name and the same method, so that future issues can be fixed in both together. (Actually, the main motivation for this refactoring is the inconsistency between the two structures, even though they could be almost identical.)
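The three inference steps listed in the hunk above (infer a type per row, merge row types to a common type, fall back to string for remaining null types) can be illustrated with a toy version over plain string type names. All names here are hypothetical and the type lattice is drastically simplified relative to Spark's:

```scala
object InferSketch {
  // Step 1: infer a "type" for one field value.
  def inferField(value: String): String =
    if (value.isEmpty) "null"
    else if (value.forall(_.isDigit)) "int"
    else "string"

  // Step 2: merge two row types into their common type;
  // "null" is absorbed, conflicting types widen to "string".
  def merge(a: String, b: String): String =
    if (a == b) a
    else if (a == "null") b
    else if (b == "null") a
    else "string"

  // Step 3: after folding over all rows, replace a remaining
  // null type with string.
  def infer(values: Seq[String]): String = {
    val merged = values.map(inferField).reduceOption(merge).getOrElse("null")
    if (merged == "null") "string" else merged
  }
}
```

For example, `Seq("1", "2")` infers `int`, while `Seq("1", "abc")` widens to `string`; an all-empty column falls back to `string`, mirroring step 3.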
[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...
Github user hvanhovell commented on a diff in the pull request: https://github.com/apache/spark/pull/12268#discussion_r61269972 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala --- @@ -0,0 +1,189 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.sql.execution.datasources.csv + +import scala.util.control.NonFatal + +import com.univocity.parsers.csv.{CsvParser, CsvParserSettings} + +import org.apache.spark.internal.Logging +import org.apache.spark.rdd.RDD +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.catalyst.expressions.GenericMutableRow +import org.apache.spark.sql.types.{StructField, StructType} + +/** + * Converts CSV string to a sequence of string + */ +private[csv] object UnivocityParser extends Logging { + /** + * Convert the input iterator to a iterator having [[InternalRow]] + */ + def parseCsv( + iter: Iterator[String], + schema: StructType, + requiredSchema: StructType, + headers: Array[String], + shouldDropHeader: Boolean, + options: CSVOptions): Iterator[InternalRow] = { +if (shouldDropHeader) { + CSVUtils.dropHeaderLine(iter, options) +} +val csv = CSVUtils.filterCommentAndEmpty(iter, options) + +val schemaFields = schema.fields +val requiredFields = requiredSchema.fields +val safeRequiredFields = if (options.dropMalformed) { + // If `dropMalformed` is enabled, then it needs to parse all the values + // so that we can decide which row is malformed. 
+ requiredFields ++ schemaFields.filterNot(requiredFields.contains(_)) +} else { + requiredFields +} +val safeRequiredIndices = new Array[Int](safeRequiredFields.length) +schemaFields.zipWithIndex.filter { + case (field, _) => safeRequiredFields.contains(field) +}.foreach { + case (field, index) => safeRequiredIndices(safeRequiredFields.indexOf(field)) = index +} +val requiredSize = requiredFields.length + +tokenizeData(csv, options, headers).flatMap { tokens => + if (options.dropMalformed && schemaFields.length != tokens.length) { +logWarning(s"Dropping malformed line: ${tokens.mkString(options.delimiter.toString)}") +None + } else if (options.failFast && schemaFields.length != tokens.length) { +throw new RuntimeException(s"Malformed line in FAILFAST mode: " + + s"${tokens.mkString(options.delimiter.toString)}") + } else { +val indexSafeTokens = if (options.permissive && schemaFields.length > tokens.length) { + tokens ++ new Array[String](schemaFields.length - tokens.length) +} else if (options.permissive && schemaFields.length < tokens.length) { + tokens.take(schemaFields.length) +} else { + tokens +} +try { + val row = convertTokens( +indexSafeTokens, +safeRequiredIndices, +schemaFields, +requiredSize, +options) + Some(row) +} catch { + case NonFatal(e) if options.dropMalformed => +logWarning("Parse exception. " + + s"Dropping malformed line: ${tokens.mkString(options.delimiter.toString)}") +None +} + } +} + } + + /** + * Convert the tokens to [[InternalRow]] + */ + private def convertTokens( + tokens: Array[String], + requiredIndices: Array[Int], + schemaFields: Array[StructField], + requiredSize: Int, --- End diff -- Never mind, I got it.
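The parse-mode handling in the quoted `flatMap` (drop the row under DROPMALFORMED, throw under FAILFAST, pad or truncate under PERMISSIVE) can be isolated into a small Spark-free sketch. The `Mode` ADT and the `align` helper below are illustrative stand-ins, not the PR's API:

```scala
// Spark-free sketch of the three CSV parse modes handled in parseCsv:
// a token array whose arity disagrees with the schema is dropped,
// rejected, or repaired depending on the mode.
object ParseModeSketch {
  sealed trait Mode
  case object Permissive extends Mode
  case object DropMalformed extends Mode
  case object FailFast extends Mode

  /** Align a token array with an expected column count under a given mode. */
  def align(tokens: Array[String], expected: Int, mode: Mode): Option[Array[String]] =
    if (tokens.length == expected) Some(tokens)
    else mode match {
      case DropMalformed => None // silently drop rows with the wrong arity
      case FailFast =>
        throw new RuntimeException(s"Malformed line: ${tokens.mkString(",")}")
      case Permissive if tokens.length < expected =>
        // pad short rows with nulls, mirroring `tokens ++ new Array[String](...)`
        Some(tokens ++ Array.fill[String](expected - tokens.length)(null))
      case Permissive =>
        Some(tokens.take(expected)) // truncate extra columns
    }
}
```

For instance, `align(Array("a"), 2, Permissive)` yields `Some(Array("a", null))`, while the same input under `DropMalformed` yields `None`.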
[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...
Github user hvanhovell commented on the pull request: https://github.com/apache/spark/pull/12268#issuecomment-215109697 @HyukjinKwon I have taken a pass. The PR looks pretty solid. I do think we can make it a bit more concise in some places, and I think we can make it a bit faster as well. Let me know what you think.
[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...
Github user hvanhovell commented on a diff in the pull request: https://github.com/apache/spark/pull/12268#discussion_r61271578 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala --- + /** + * Convert the tokens to [[InternalRow]] + */ + private def convertTokens( + tokens: Array[String], + requiredIndices: Array[Int], + schemaFields: Array[StructField], + requiredSize: Int, + options: CSVOptions): InternalRow = { +val row = new GenericMutableRow(requiredSize) --- End diff -- I am not sure about datasources, but in a lot of places within SparkSQL we just return update a
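hvanhovell's (truncated) remark about `new GenericMutableRow(requiredSize)` appears to point at a common Spark SQL pattern: reuse a single mutable row buffer across records instead of allocating one per record. A minimal illustrative sketch of that pattern, using a hypothetical `ReusableRow` rather than Spark's actual class:

```scala
// Illustrative sketch (not Spark's GenericMutableRow): allocate one
// mutable buffer outside the loop and overwrite it for each record.
// Callers must copy the row if they need to retain it, which is the
// usual contract for this reuse pattern.
final class ReusableRow(size: Int) {
  private val values = new Array[Any](size)
  def update(i: Int, v: Any): Unit = values(i) = v
  def get(i: Int): Any = values(i)
  def copy(): Array[Any] = values.clone()
}

object RowReuse {
  // Convert (name, count) token pairs; the row buffer is allocated once.
  def convertAll(lines: Seq[Array[String]]): Seq[Array[Any]] = {
    val row = new ReusableRow(2) // single allocation for the whole batch
    lines.map { tokens =>
      row.update(0, tokens(0))
      row.update(1, tokens(1).toInt)
      row.copy() // copy only when materializing the result
    }
  }
}
```

The trade-off is that any consumer holding a reference to the shared buffer would see it overwritten, hence the defensive `copy()` at the point of materialization.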
[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...
Github user hvanhovell commented on a diff in the pull request: https://github.com/apache/spark/pull/12268#discussion_r61271158 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala --- +val indexSafeTokens = if (options.permissive && schemaFields.length > tokens.length) { + tokens ++ new Array[String](schemaFields.length - tokens.length) +} else if (options.permissive && schemaFields.length < tokens.length) { + tokens.take(schemaFields.length) --- End diff -- Why do we want this? `convertTokens` can't read beyond the `schemaFields.length` right?
[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...
Github user hvanhovell commented on a diff in the pull request: https://github.com/apache/spark/pull/12268#discussion_r61267542 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala --- + private def convertTokens( + tokens: Array[String], + requiredIndices: Array[Int], + schemaFields: Array[StructField], + requiredSize: Int, --- End diff -- Can an entry in `requiredIndices` lie outside of the `requiredSize` range? Why?
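For reference, the `safeRequiredIndices` bookkeeping this question refers to maps each required field (in required-field order) to its position in the *full* schema. The stored values are schema positions, so they can exceed `requiredSize`; only the array's own slots are bounded by the number of required fields. A Spark-free sketch using field names (the `RequiredIndicesSketch` object is assumed for illustration):

```scala
// Sketch of the safeRequiredIndices computation from the diff above:
// for each required field, record its index in the full schema, stored
// at that field's position in the required-field ordering.
object RequiredIndicesSketch {
  def requiredIndices(schema: Seq[String], required: Seq[String]): Array[Int] = {
    val indices = new Array[Int](required.length)
    schema.zipWithIndex
      .filter { case (field, _) => required.contains(field) }
      .foreach { case (field, idx) => indices(required.indexOf(field)) = idx }
    indices
  }
}
```

With schema `a, b, c` and required fields `c, a`, the result is `Array(2, 0)`: the value `2` exceeds the required-field count, which illustrates why an entry can lie outside the `requiredSize` range even though the array length cannot.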
[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...
Github user hvanhovell commented on a diff in the pull request: https://github.com/apache/spark/pull/12268#discussion_r61270381 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala --- +val indexSafeTokens = if (options.permissive && schemaFields.length > tokens.length) { + tokens ++ new Array[String](schemaFields.length - tokens.length) --- End diff -- Do you think there is a way we can do this without appending an array? Using an extra limit in `convertTokens` is probably quicker and causes less GC.
[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...
Github user hvanhovell commented on a diff in the pull request: https://github.com/apache/spark/pull/12268#discussion_r61266412 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala --- + /** + * Convert the tokens to [[InternalRow]] + */ + private def convertTokens( --- End diff -- This might be a wild idea: We might be able to use an encoder here.
[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...
Github user hvanhovell commented on a diff in the pull request: https://github.com/apache/spark/pull/12268#discussion_r61265498 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityGenerator.scala --- @@ -0,0 +1,80 @@ +package org.apache.spark.sql.execution.datasources.csv + +import com.univocity.parsers.csv.{CsvWriter, CsvWriterSettings} + +import org.apache.spark.internal.Logging +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.types.StructType + +/** + * Converts a sequence of string to CSV string + */ +private[csv] object UnivocityGenerator extends Logging { --- End diff -- Come to think of it, why not integrate this with the `CsvOutputWriter`.
[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...
Github user hvanhovell commented on a diff in the pull request: https://github.com/apache/spark/pull/12268#discussion_r61265122 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityGenerator.scala --- + def apply( + rowSchema: StructType, + writer: CsvWriter, + headers: Array[String], + writeHeader: Boolean, + options: CSVOptions)(row: InternalRow): Unit = { +val tokens = { + row.toSeq(rowSchema).map { field => --- End diff -- You are calling this a lot, right? So it might be better not to rely on functional constructs here. Also take a look at the `InternalRow.toSeq` method; there might be some room for improvement if you just pass in the `DataType`s directly.
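The suggestion, precomputing per-column converters once and filling the token array with a plain loop rather than calling `row.toSeq(rowSchema).map(...)` per record, can be sketched as follows. The `Row` and `Converter` type aliases are simplified stand-ins for Spark's `InternalRow` and `DataType` machinery, assumed here for illustration only:

```scala
// Sketch of the per-record hot path the reviewer is describing: build
// one converter per column up front, then fill a token array with a
// while loop, avoiding an intermediate Seq and closure per row.
object GeneratorSketch {
  type Row = Array[Any]
  type Converter = Any => String

  // Precompute converters once per schema (here: a toy date/plain split).
  def makeConverters(isDate: Array[Boolean]): Array[Converter] =
    isDate.map { d =>
      if (d) (v: Any) => s"date(${v})"
      else (v: Any) => String.valueOf(v)
    }

  // Called once per record: no Seq allocation, no per-row closures.
  def toTokens(row: Row, converters: Array[Converter]): Array[String] = {
    val tokens = new Array[String](row.length)
    var i = 0
    while (i < row.length) {
      tokens(i) = if (row(i) == null) "" else converters(i)(row(i))
      i += 1
    }
    tokens
  }
}
```

The design point is simply hoisting the schema-dependent dispatch out of the per-record loop; whether it pays off in the actual writer would need measurement.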
[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...
Github user hvanhovell commented on a diff in the pull request: https://github.com/apache/spark/pull/12268#discussion_r61260986

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVRelation.scala ---

@@ -17,152 +17,162 @@ package org.apache.spark.sql.execution.datasources.csv

-import scala.util.control.NonFatal
-
-import org.apache.hadoop.fs.Path
-import org.apache.hadoop.io.{NullWritable, Text}
-import org.apache.hadoop.mapreduce.RecordWriter
-import org.apache.hadoop.mapreduce.TaskAttemptContext
+import java.io.CharArrayWriter
+import java.nio.charset.{Charset, StandardCharsets}
+
+import com.univocity.parsers.csv.CsvWriter
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs.{FileStatus, Path}
+import org.apache.hadoop.io.{LongWritable, NullWritable, Text}
+import org.apache.hadoop.mapred.TextInputFormat
+import org.apache.hadoop.mapreduce.{Job, RecordWriter, TaskAttemptContext}
 import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat

 import org.apache.spark.internal.Logging
 import org.apache.spark.rdd.RDD
 import org.apache.spark.sql._
 import org.apache.spark.sql.catalyst.InternalRow
-import org.apache.spark.sql.catalyst.expressions.GenericMutableRow
-import org.apache.spark.sql.execution.datasources.{OutputWriter, OutputWriterFactory, PartitionedFile}
+import org.apache.spark.sql.catalyst.expressions.JoinedRow
+import org.apache.spark.sql.catalyst.expressions.codegen.GenerateUnsafeProjection
+import org.apache.spark.sql.execution.datasources._
+import org.apache.spark.sql.sources._
 import org.apache.spark.sql.types._
+import org.apache.spark.util.SerializableConfiguration

-object CSVRelation extends Logging {
-
-  def univocityTokenizer(
-      file: RDD[String],
-      header: Seq[String],
-      firstLine: String,
-      params: CSVOptions): RDD[Array[String]] = {
-    // If header is set, make sure firstLine is materialized before sending to executors.
-    file.mapPartitions { iter =>
-      new BulkCsvReader(
-        if (params.headerFlag) iter.filterNot(_ == firstLine) else iter,
-        params,
-        headers = header)
-    }
-  }
+/**
+ * Provides access to CSV data from pure SQL statements.
+ */
+class DefaultSource extends FileFormat with DataSourceRegister {
+
+  override def shortName(): String = "csv"
+
+  override def toString: String = "CSV"
+
+  override def hashCode(): Int = getClass.hashCode()
+
+  override def equals(other: Any): Boolean = other.isInstanceOf[DefaultSource]

-  def csvParser(
-      schema: StructType,
-      requiredColumns: Array[String],
-      params: CSVOptions): Array[String] => Option[InternalRow] = {
-    val schemaFields = schema.fields
-    val requiredFields = StructType(requiredColumns.map(schema(_))).fields
-    val safeRequiredFields = if (params.dropMalformed) {
-      // If `dropMalformed` is enabled, then it needs to parse all the values
-      // so that we can decide which row is malformed.
-      requiredFields ++ schemaFields.filterNot(requiredFields.contains(_))
+  override def inferSchema(
+      sparkSession: SparkSession,
+      options: Map[String, String],
+      files: Seq[FileStatus]): Option[StructType] = {
+    val csvOptions = new CSVOptions(options)
+
+    // TODO: Move filtering.
+    val paths = files.filterNot(_.getPath.getName startsWith "_").map(_.getPath.toString)
+    val rdd = createBaseRdd(sparkSession, csvOptions, paths)
+    val schema = if (csvOptions.inferSchemaFlag) {
+      InferSchema.infer(rdd, csvOptions)
     } else {
-      requiredFields
-    }
-    val safeRequiredIndices = new Array[Int](safeRequiredFields.length)
-    schemaFields.zipWithIndex.filter {
-      case (field, _) => safeRequiredFields.contains(field)
-    }.foreach {
-      case (field, index) => safeRequiredIndices(safeRequiredFields.indexOf(field)) = index
-    }
-    val requiredSize = requiredFields.length
-    val row = new GenericMutableRow(requiredSize)
-
-    (tokens: Array[String]) => {
-      if (params.dropMalformed && schemaFields.length != tokens.length) {
-        logWarning(s"Dropping malformed line: ${tokens.mkString(params.delimiter.toString)}")
-        None
-      } else if (params.failFast && schemaFields.length != tokens.length) {
-        throw new RuntimeException(s"Malformed line in FAILFAST mode: " +
-          s"${tokens.mkString(params.delimiter.toString)}")
+      // By default fields are assumed to be StringType
+      val filteredRdd = rdd.mapPartitions(CSVUtils.filterCommentAndEmpty(_, csvOptions))
[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...
Github user hvanhovell commented on a diff in the pull request: https://github.com/apache/spark/pull/12268#discussion_r61260702

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVRelation.scala ---

@@ -17,152 +17,162 @@ package org.apache.spark.sql.execution.datasources.csv
[...]
+  override def inferSchema(
+      sparkSession: SparkSession,
+      options: Map[String, String],
+      files: Seq[FileStatus]): Option[StructType] = {
+    val csvOptions = new CSVOptions(options)
+
+    // TODO: Move filtering.
+    val paths = files.filterNot(_.getPath.getName startsWith "_").map(_.getPath.toString)

--- End diff --

code style: `_.getPath.getName.startsWith("_")`? Is it safe to skip all files with an underscore?

---
If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.

---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
Github user hvanhovell commented on a diff in the pull request: https://github.com/apache/spark/pull/12268#discussion_r61258605

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala ---

@@ -0,0 +1,189 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.csv
+
+import scala.util.control.NonFatal
+
+import com.univocity.parsers.csv.{CsvParser, CsvParserSettings}
+
+import org.apache.spark.internal.Logging
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.expressions.GenericMutableRow
+import org.apache.spark.sql.types.{StructField, StructType}
+
+/**
+ * Converts CSV string to a sequence of string
+ */
+private[csv] object UnivocityParser extends Logging {

--- End diff --

Again naming. At least add csv to the name.
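The object under discussion wraps univocity's `CsvParser`, which handles quoting, escaping, and records spanning multiple lines. As a rough, stdlib-only sketch of what "CSV string to a sequence of string" means for simple input (no quoted delimiters — this is not the real parser):

```scala
// Naive tokenizer: splits one line on an unquoted delimiter. A real CSV
// parser (e.g. univocity's CsvParser) must also handle quoted fields,
// escaped quotes and multi-line records, which this sketch ignores.
def tokenize(line: String, delimiter: Char = ','): Array[String] =
  line.split(delimiter)

val tokens = tokenize("2015,Chevy,Volt")
```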
Github user hvanhovell commented on a diff in the pull request: https://github.com/apache/spark/pull/12268#discussion_r61258460

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityGenerator.scala ---

@@ -0,0 +1,80 @@
[...]
+package org.apache.spark.sql.execution.datasources.csv
+
+import com.univocity.parsers.csv.{CsvWriter, CsvWriterSettings}
+
+import org.apache.spark.internal.Logging
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.types.StructType
+
+/**
+ * Converts a sequence of string to CSV string
+ */
+private[csv] object UnivocityGenerator extends Logging {
+  /**
+   * Transforms a single InternalRow to CSV using Univocity
+   *
+   * @param rowSchema the schema object used for conversion
+   * @param writer a CsvWriter object
+   * @param headers headers to write
+   * @param writeHeader true if it needs to write header
+   * @param options CSVOptions object containing options
+   * @param row The row to convert
+   */
+  def apply(

--- End diff --

Please use a more descriptive name? `writeToCsv`?
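The generator under review delegates the actual field escaping to univocity's `CsvWriter`. A minimal stdlib-only sketch of the row-to-line conversion it performs, with simplified RFC 4180-style quoting (this is an illustration, not the real implementation):

```scala
// Quote a field if it contains the delimiter, a double quote or a newline;
// embedded quotes are doubled, per the common RFC 4180 convention.
def toCsvLine(fields: Seq[String], delimiter: Char = ','): String =
  fields.map { f =>
    if (f.exists(c => c == delimiter || c == '"' || c == '\n')) {
      "\"" + f.replace("\"", "\"\"") + "\""
    } else {
      f
    }
  }.mkString(delimiter.toString)

val line = toCsvLine(Seq("2015", "Chevy", "go, team"))
```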
Github user hvanhovell commented on a diff in the pull request: https://github.com/apache/spark/pull/12268#discussion_r61258326

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityGenerator.scala ---

@@ -0,0 +1,80 @@
[...]
+package org.apache.spark.sql.execution.datasources.csv
+
+import com.univocity.parsers.csv.{CsvWriter, CsvWriterSettings}
+
+import org.apache.spark.internal.Logging
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.types.StructType
+
+/**
+ * Converts a sequence of string to CSV string
+ */
+private[csv] object UnivocityGenerator extends Logging {

--- End diff --

Are we ever going to use a different generator? Why not call it `CsvGenerator`?
Github user hvanhovell commented on a diff in the pull request: https://github.com/apache/spark/pull/12268#discussion_r61258077

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/InferSchema.scala ---

@@ -30,22 +30,37 @@ import org.apache.spark.sql.catalyst.util.DateTimeUtils
 import org.apache.spark.sql.types._
 import org.apache.spark.unsafe.types.UTF8String

-private[csv] object CSVInferSchema {
+private[csv] object InferSchema {

   /**
    * Similar to the JSON schema inference
    *   1. Infer type of each row
    *   2. Merge row types to find common type
    *   3. Replace any null types with string type
    */
-  def infer(
-      tokenRdd: RDD[Array[String]],
-      header: Array[String],
-      nullValue: String = ""): StructType = {
+  def infer(csv: RDD[String], options: CSVOptions): StructType = {

--- End diff --

This looks very similar to `DefaultSource.inferSchema`; why not move the common functionality into a single method?
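The three steps in the quoted doc comment (infer a type per value, merge row types, replace null with string) can be sketched with a toy type lattice in place of Spark's `DataType`. Everything below — `inferField`, `merge`, the string type names — is invented for illustration:

```scala
import scala.util.Try

// Step 1 (toy version): per-value inference with integer < double < string,
// and "null" for empty values.
def inferField(v: String): String =
  if (v.isEmpty) "null"
  else if (Try(v.toLong).isSuccess) "integer"
  else if (Try(v.toDouble).isSuccess) "double"
  else "string"

// Step 2: merge two inferred types into their common supertype.
def merge(a: String, b: String): String = (a, b) match {
  case (x, y) if x == y => x
  case ("null", t) => t
  case (t, "null") => t
  case ("integer", "double") | ("double", "integer") => "double"
  case _ => "string"
}

val rows = Seq(Array("1", "2.0", ""), Array("2", "3", "x"))
val merged = rows
  .map(_.map(inferField))                                           // step 1
  .reduce((r1, r2) => r1.zip(r2).map { case (a, b) => merge(a, b) }) // step 2
  .map(t => if (t == "null") "string" else t)                       // step 3
```

In the real code the merge runs as a distributed aggregation over the RDD of tokenized rows, but the column-wise reduce has the same shape.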
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/12268#issuecomment-214939729 Merged build finished. Test PASSed.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/12268#issuecomment-214939730 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/57058/ Test PASSed.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/12268#issuecomment-214939534 **[Test build #57058 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57058/consoleFull)** for PR 12268 at commit [`ee71064`](https://github.com/apache/spark/commit/ee7106416ef17e5168a91bab044c6f6db9dbd53b).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `class MultivariateGaussian(`
  * `class DecisionTreeClassifier @Since("1.4.0") (`
  * `class GBTClassifier @Since("1.4.0") (`
  * `class RandomForestClassifier @Since("1.4.0") (`
  * `class AFTSurvivalRegressionWrapperWriter(instance: AFTSurvivalRegressionWrapper)`
  * `class AFTSurvivalRegressionWrapperReader extends MLReader[AFTSurvivalRegressionWrapper]`
  * `class DecisionTreeRegressor @Since("1.4.0") (@Since("1.4.0") override val uid: String)`
  * `class GBTRegressor @Since("1.4.0") (@Since("1.4.0") override val uid: String)`
  * `class RandomForestRegressor @Since("1.4.0") (@Since("1.4.0") override val uid: String)`
  * `case class CartesianProductExec(`
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/12268#issuecomment-214926444 **[Test build #57058 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57058/consoleFull)** for PR 12268 at commit [`ee71064`](https://github.com/apache/spark/commit/ee7106416ef17e5168a91bab044c6f6db9dbd53b).
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/12268#issuecomment-214925695 @rxin No problem. Let me just rebase it if it has conflicts anyway. It is easier to track the changes.
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/12268#issuecomment-214638048 cc @hvanhovell would you have some time to take a look at this? @HyukjinKwon most of us are very busy trying to get things out for 2.0 so this one will very likely slip.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/12268#issuecomment-214625564 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/56965/ Test PASSed.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/12268#issuecomment-214625562 Merged build finished. Test PASSed.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/12268#issuecomment-214625432 **[Test build #56965 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56965/consoleFull)** for PR 12268 at commit [`fe63ba2`](https://github.com/apache/spark/commit/fe63ba22d70c1427657b4967e769270d1956be38).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/12268#issuecomment-214624336 Build finished. Test PASSed.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/12268#issuecomment-214624337 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/56955/ Test PASSed.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/12268#issuecomment-214624209 **[Test build #56955 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56955/consoleFull)** for PR 12268 at commit [`ad21b8e`](https://github.com/apache/spark/commit/ad21b8eea981f61cb35de646f3568b27dd2141a3).
* This patch passes all tests.
* This patch **does not merge cleanly**.
* This patch adds no public classes.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/12268#issuecomment-214615366 **[Test build #56965 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56965/consoleFull)** for PR 12268 at commit [`fe63ba2`](https://github.com/apache/spark/commit/fe63ba22d70c1427657b4967e769270d1956be38).
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/12268#issuecomment-214614924 Fixed in https://github.com/apache/spark/commit/f8709218115f6c7aa4fb321865cdef8ceb443bd1
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/12268#issuecomment-214614770 @rxin It looks like this is still failing: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56962 https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56963
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/12268#issuecomment-214613784 **[Test build #56963 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56963/consoleFull)** for PR 12268 at commit [`f62755e`](https://github.com/apache/spark/commit/f62755e0875ae8f2947abf8a62505dd77b2ed9f5).
* This patch **fails to build**.
* This patch merges cleanly.
* This patch adds no public classes.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/12268#issuecomment-214613795 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/56963/ Test FAILed.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/12268#issuecomment-214613793 Merged build finished. Test FAILed.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/12268#issuecomment-214611901 **[Test build #56963 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56963/consoleFull)** for PR 12268 at commit [`f62755e`](https://github.com/apache/spark/commit/f62755e0875ae8f2947abf8a62505dd77b2ed9f5).
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/12268#issuecomment-214609981 This was due to https://github.com/apache/spark/commit/d2614eaadb93a48fba27fe7de64aff942e345f8e
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/12268#issuecomment-214609252 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/56961/ Test FAILed.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/12268#issuecomment-214609239 **[Test build #56961 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56961/consoleFull)** for PR 12268 at commit [`d59c7e9`](https://github.com/apache/spark/commit/d59c7e98f306fa9ff5dfe3b4caae14a2de746315).
* This patch **fails to build**.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `sealed abstract class LDAModel protected[ml] (`
  * `class LocalLDAModel protected[ml] (`
  * `class DistributedLDAModel protected[ml] (`
  * `class ContinuousQueryManager(sparkSession: SparkSession)`
  * `class DataFrameReader protected[sql](sparkSession: SparkSession) extends Logging`
  * `class Dataset[T] protected[sql](`
  * `class QueryExecution(val sparkSession: SparkSession, val logical: LogicalPlan)`
  * `class FileStreamSinkLog(sparkSession: SparkSession, path: String)`
  * `class HDFSMetadataLog[T: ClassTag](sparkSession: SparkSession, path: String)`
  * `class StreamFileCatalog(sparkSession: SparkSession, path: Path) extends FileCatalog with Logging`
  * `case class PlanSubqueries(sparkSession: SparkSession) extends Rule[SparkPlan]`
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/12268#issuecomment-214609249 Merged build finished. Test FAILed.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/12268#issuecomment-214608734 **[Test build #56961 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56961/consoleFull)** for PR 12268 at commit [`d59c7e9`](https://github.com/apache/spark/commit/d59c7e98f306fa9ff5dfe3b4caae14a2de746315).
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/12268#issuecomment-214608678 retest this please
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/12268#issuecomment-214608176 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/56959/ Test FAILed.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/12268#issuecomment-214608175 Merged build finished. Test FAILed.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/12268#issuecomment-214608169 **[Test build #56959 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56959/consoleFull)** for PR 12268 at commit [`d59c7e9`](https://github.com/apache/spark/commit/d59c7e98f306fa9ff5dfe3b4caae14a2de746315).
* This patch **fails to build**.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `sealed abstract class LDAModel protected[ml] (`
  * `class LocalLDAModel protected[ml] (`
  * `class DistributedLDAModel protected[ml] (`
  * `class ContinuousQueryManager(sparkSession: SparkSession) `
  * `class DataFrameReader protected[sql](sparkSession: SparkSession) extends Logging `
  * `class Dataset[T] protected[sql](`
  * `class QueryExecution(val sparkSession: SparkSession, val logical: LogicalPlan) `
  * `class FileStreamSinkLog(sparkSession: SparkSession, path: String)`
  * `class HDFSMetadataLog[T: ClassTag](sparkSession: SparkSession, path: String)`
  * `class StreamFileCatalog(sparkSession: SparkSession, path: Path) extends FileCatalog with Logging `
  * `case class PlanSubqueries(sparkSession: SparkSession) extends Rule[SparkPlan] `
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/12268#issuecomment-214607369 **[Test build #56959 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56959/consoleFull)** for PR 12268 at commit [`d59c7e9`](https://github.com/apache/spark/commit/d59c7e98f306fa9ff5dfe3b4caae14a2de746315).
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/12268#issuecomment-214605811 **[Test build #56955 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56955/consoleFull)** for PR 12268 at commit [`ad21b8e`](https://github.com/apache/spark/commit/ad21b8eea981f61cb35de646f3568b27dd2141a3).
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/12268#issuecomment-214554406 ping @rxin
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/12268#issuecomment-213663159 Merged build finished. Test PASSed.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/12268#issuecomment-213663160 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/56768/ Test PASSed.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/12268#issuecomment-213663125 **[Test build #56768 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56768/consoleFull)** for PR 12268 at commit [`92f8f38`](https://github.com/apache/spark/commit/92f8f387cec10cb61e178b312748f86bd75b1b55).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/12268#issuecomment-213658568 **[Test build #56768 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56768/consoleFull)** for PR 12268 at commit [`92f8f38`](https://github.com/apache/spark/commit/92f8f387cec10cb61e178b312748f86bd75b1b55).
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/12268#issuecomment-213658002 @rxin Could you please review this?
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/12268#issuecomment-212227632 Merged build finished. Test PASSed.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/12268#issuecomment-212227633 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/56309/ Test PASSed.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/12268#issuecomment-212227517 **[Test build #56309 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56309/consoleFull)** for PR 12268 at commit [`d9ea3cb`](https://github.com/apache/spark/commit/d9ea3cb5ccb8db5d8ff9e36fa1e8d4df45ea4fb2).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/12268#issuecomment-212206817 **[Test build #56309 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56309/consoleFull)** for PR 12268 at commit [`d9ea3cb`](https://github.com/apache/spark/commit/d9ea3cb5ccb8db5d8ff9e36fa1e8d4df45ea4fb2).
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/12268#issuecomment-210971456 will try to take a look in the next few days.
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/12268#issuecomment-210971265 Please excuse my pings, @cloud-fan , @rxin , @falaki , @yhuai