[GitHub] spark pull request #16929: [SPARK-19595][SQL] Support json array in from_jso...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/16929#discussion_r103334238 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala --- @@ -480,36 +480,79 @@ case class JsonTuple(children: Seq[Expression]) } /** - * Converts an json input string to a [[StructType]] with the specified schema. + * Converts an json input string to a [[StructType]] or [[ArrayType]] with the specified schema. */ case class JsonToStruct( -schema: StructType, +schema: DataType, options: Map[String, String], child: Expression, timeZoneId: Option[String] = None) extends UnaryExpression with TimeZoneAwareExpression with CodegenFallback with ExpectsInputTypes { override def nullable: Boolean = true - def this(schema: StructType, options: Map[String, String], child: Expression) = + def this(schema: DataType, options: Map[String, String], child: Expression) = this(schema, options, child, None) + override def checkInputDataTypes(): TypeCheckResult = schema match { --- End diff -- Uh.. I thought `schema` is not a child but just a parameter. Let me check! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16929: [SPARK-19595][SQL] Support json array in from_jso...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/16929#discussion_r103337914 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala --- @@ -480,36 +480,79 @@ case class JsonTuple(children: Seq[Expression]) } /** - * Converts an json input string to a [[StructType]] with the specified schema. + * Converts an json input string to a [[StructType]] or [[ArrayType]] with the specified schema. */ case class JsonToStruct( -schema: StructType, +schema: DataType, options: Map[String, String], child: Expression, timeZoneId: Option[String] = None) extends UnaryExpression with TimeZoneAwareExpression with CodegenFallback with ExpectsInputTypes { override def nullable: Boolean = true - def this(schema: StructType, options: Map[String, String], child: Expression) = + def this(schema: DataType, options: Map[String, String], child: Expression) = this(schema, options, child, None) + override def checkInputDataTypes(): TypeCheckResult = schema match { +case _: StructType | ArrayType(_: StructType, _) => + super.checkInputDataTypes() +case _ => TypeCheckResult.TypeCheckFailure( + s"Input schema ${schema.simpleString} must be a struct or an array of structs.") + } + + @transient + lazy val rowSchema = schema match { +case st: StructType => st +case ArrayType(st: StructType, _) => st + } + + // This converts parsed rows to the desired output by the given schema. + @transient + lazy val converter = schema match { +case _: StructType => + // These are always produced from json objects by `objectSupport` in `JacksonParser`. + (rows: Seq[InternalRow]) => rows.head + +case ArrayType(_: StructType, _) => + // These are always produced from json arrays by `arraySupport` in `JacksonParser`. 
+ (rows: Seq[InternalRow]) => new GenericArrayData(rows) + } + @transient lazy val parser = new JacksonParser( - schema, - new JSONOptions(options + ("mode" -> ParseModes.FAIL_FAST_MODE), timeZoneId.get)) + rowSchema, + new JSONOptions(options + ("mode" -> ParseModes.FAIL_FAST_MODE), timeZoneId.get), + objectSupport = schema.isInstanceOf[StructType], --- End diff -- > What does the input look like, and what are they specifying? `JacksonParser.parse` produces `Seq[InternalRow]` and it takes a `StructType`. What I meant by `objectSupport` and `arraySupport` is JSON object and JSON array: we support both as a root JSON value, and intend to produce `null` if one of them is disabled.
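The object-vs-array dispatch described here can be sketched in plain Python; `make_converter` and the string schema labels below are illustrative stand-ins, not Spark APIs:

```python
# Hypothetical sketch of the converter dispatch: the parser always yields a
# sequence of rows, and a schema-dependent converter shapes the final value.
def make_converter(schema_kind):
    """schema_kind is 'struct' or 'array_of_structs' (illustrative labels)."""
    if schema_kind == "struct":
        # A JSON object parses to a single row; take the head.
        return lambda rows: rows[0]
    if schema_kind == "array_of_structs":
        # A JSON array parses to many rows; keep them all.
        return lambda rows: list(rows)
    raise TypeError("schema must be a struct or an array of structs")

rows = [("a", 1), ("b", 2)]  # stand-in for Seq[InternalRow]
assert make_converter("struct")(rows) == ("a", 1)
assert make_converter("array_of_structs")(rows) == [("a", 1), ("b", 2)]
```

The key point the sketch shows is that both branches accept the same parsed input and differ only in how the rows are packaged.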
[GitHub] spark pull request #16929: [SPARK-19595][SQL] Support json array in from_jso...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/16929#discussion_r103338371 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala --- @@ -480,36 +480,79 @@ case class JsonTuple(children: Seq[Expression]) } /** - * Converts an json input string to a [[StructType]] with the specified schema. + * Converts an json input string to a [[StructType]] or [[ArrayType]] with the specified schema. */ case class JsonToStruct( -schema: StructType, +schema: DataType, options: Map[String, String], child: Expression, timeZoneId: Option[String] = None) extends UnaryExpression with TimeZoneAwareExpression with CodegenFallback with ExpectsInputTypes { override def nullable: Boolean = true - def this(schema: StructType, options: Map[String, String], child: Expression) = + def this(schema: DataType, options: Map[String, String], child: Expression) = this(schema, options, child, None) + override def checkInputDataTypes(): TypeCheckResult = schema match { --- End diff -- Uh.. I thought `schema` is not the child of the expression. Let me check again!
[GitHub] spark issue #16929: [SPARK-19595][SQL] Support json array in from_json
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/16929 Thanks for your detailed look. Let me check again and address the comments!
[GitHub] spark issue #16929: [SPARK-19595][SQL] Support json array in from_json
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/16929 Then, let me fix this as below:
- `from_json` with `StructType`
  - JSON array -> `null`
  - JSON object -> `Row(...)`
- `from_json` with `ArrayType`
  - JSON array -> `Array(Row(...), ...)`
  - JSON object -> `Array(Row(...), ...)`
- exposed API
  - `from_json(..., schema: StructType)`
  - `from_json(..., schema: DataType)`
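The behaviour matrix above can be modelled with a toy Python function; `from_json_sketch` and the string schema labels are illustrative only, not the actual Spark API:

```python
import json

def from_json_sketch(s, schema):
    """Toy model of the proposed semantics; schema is 'struct' or 'array'
    (illustrative labels standing in for StructType / ArrayType)."""
    parsed = json.loads(s)
    if schema == "struct":
        # A JSON array does not match a struct schema -> null.
        return parsed if isinstance(parsed, dict) else None
    if schema == "array":
        # Both a JSON array and a single JSON object yield an array of rows.
        return parsed if isinstance(parsed, list) else [parsed]

assert from_json_sketch('{"a": 1}', "struct") == {"a": 1}
assert from_json_sketch('[{"a": 1}]', "struct") is None
assert from_json_sketch('[{"a": 1}, {"a": 2}]', "array") == [{"a": 1}, {"a": 2}]
assert from_json_sketch('{"a": 1}', "array") == [{"a": 1}]
```

The asymmetry is intentional in the proposal: an array schema tolerates a bare object (it wraps it), while a struct schema rejects an array.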
[GitHub] spark pull request #16929: [SPARK-19595][SQL] Support json array in from_jso...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/16929#discussion_r103386481 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala --- @@ -480,36 +480,79 @@ case class JsonTuple(children: Seq[Expression]) } /** - * Converts an json input string to a [[StructType]] with the specified schema. + * Converts an json input string to a [[StructType]] or [[ArrayType]] with the specified schema. */ case class JsonToStruct( -schema: StructType, +schema: DataType, options: Map[String, String], child: Expression, timeZoneId: Option[String] = None) extends UnaryExpression with TimeZoneAwareExpression with CodegenFallback with ExpectsInputTypes { override def nullable: Boolean = true - def this(schema: StructType, options: Map[String, String], child: Expression) = + def this(schema: DataType, options: Map[String, String], child: Expression) = this(schema, options, child, None) + override def checkInputDataTypes(): TypeCheckResult = schema match { --- End diff -- I tried several combinations with `TypeCollection` but it did not seem to work.
[GitHub] spark issue #16929: [SPARK-19595][SQL] Support json array in from_json
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/16929 @brkyvz, @marmbrus - I think it is ready for another look. Could you see if I understood your comments correctly?
[GitHub] spark issue #13036: [SPARK-15243][ML][SQL][PYSPARK] Param methods should use...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/13036 I am happy to do so. It seems almost done already, except for https://github.com/apache/spark/pull/13036#discussion_r84476560?
[GitHub] spark pull request #17096: [SPARK-15243][ML][SQL][PYSPARK] Add missing suppo...
GitHub user HyukjinKwon opened a pull request: https://github.com/apache/spark/pull/17096 [SPARK-15243][ML][SQL][PYSPARK] Add missing support for unicode in Param methods/functions in dataframe/types

## What changes were proposed in this pull request?

This PR proposes to support unicode in Param methods in ML, in other missed functions in DataFrame, and in a missed column-name case in types. For example, this causes a `ValueError` in Python 2.x when the param is a unicode string:

```python
>>> from pyspark.ml.classification import LogisticRegression
>>> lr = LogisticRegression()
>>> lr.hasParam("threshold")
True
>>> lr.hasParam(u"threshold")
Traceback (most recent call last):
...
raise TypeError("hasParam(): paramName must be a string")
TypeError: hasParam(): paramName must be a string
```

This PR is based on https://github.com/apache/spark/pull/13036

## How was this patch tested?

Unit tests in `python/pyspark/ml/tests.py` and `python/pyspark/sql/tests.py`.

You can merge this pull request into a Git repository by running: $ git pull https://github.com/HyukjinKwon/spark SPARK-15243 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/17096.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #17096 commit 762997d266c78c16e2e6caaee8daa557ec07820b Author: sethah Date: 2016-05-10T22:44:10Z check for basestring in param methods commit d421a82e1a713b9c7351dfc6a2a6a5fb13eca1e2 Author: hyukjinkwon Date: 2017-02-28T08:46:58Z Fix updated approxQuantile to take basestrings and fix the tests
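A hedged sketch of the kind of check such a fix needs — accepting both `str` and `unicode` on Python 2, plain `str` on Python 3 — might look as follows; `has_param` here is a hypothetical stand-in, not the actual pyspark method:

```python
import sys

# Version-tolerant set of "string" types; on Python 2 this would also
# include `unicode` (the branch is never taken on Python 3).
if sys.version_info[0] >= 3:
    string_types = (str,)
else:
    string_types = (str, unicode)  # noqa: F821  (Python 2 only)

def has_param(params, param_name):
    """Illustrative stand-in for a Param lookup guarded by a string check."""
    if not isinstance(param_name, string_types):
        raise TypeError("hasParam(): paramName must be a string")
    return param_name in params

assert has_param({"threshold"}, "threshold")
assert has_param({"threshold"}, u"threshold")  # no longer rejected
```

Checking against a tuple of types rather than `type(x) == str` is what makes the `u"threshold"` case pass on Python 2.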
[GitHub] spark issue #17096: [SPARK-15243][ML][SQL][PYSPARK] Add missing support for ...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/17096 cc @sethah, @holdenk, @viirya and @k-yokoshi
[GitHub] spark issue #16929: [SPARK-19595][SQL] Support json array in from_json
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/16929 retest this please
[GitHub] spark issue #17115: [Doc][Minor] Update R doc
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/17115 ( @actuaryzhang , maybe we could make the PR title more meaningful, to summarise the change, just for a better-shaped PR )
[GitHub] spark issue #17096: [SPARK-15243][ML][SQL][PYTHON] Add missing support for u...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/17096 Thank you @viirya. I think it is ready now.
[GitHub] spark pull request #16854: [WIP][SPARK-15463][SQL] Add an API to load DataFr...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/16854#discussion_r103894945 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala --- @@ -344,4 +346,36 @@ private[csv] object UnivocityParser { CSVUtils.filterCommentAndEmpty(linesWithoutHeader, options) filteredLines.flatMap(line => parser.parse(line)) } + + /** + * Parses a `Dataset` that contains CSV strings and turns it into an `RDD` of rows. + */ + def tokenizeDataset( + csvDataset: Dataset[String], + maybeFirstLine: Option[String], + options: CSVOptions): RDD[Array[String]] = { +val filtered = CSVUtils.filterCommentAndEmpty(csvDataset, options) +val linesWithoutHeader = maybeFirstLine.map { firstLine => + filtered.rdd.mapPartitions(CSVUtils.filterHeaderLine(_, firstLine, options)) +}.getOrElse(filtered.rdd) + +linesWithoutHeader.mapPartitions { iter => + val parser = new CsvParser(options.asParserSettings) + iter.map(line => parser.parseLine(line)) +} + } + + /** + * Parses a `Dataset` that contains CSV strings and turns it into an `RDD` of rows. + */ + def parseDataset( + csvDataset: Dataset[String], + schema: StructType, + maybeFirstLine: Option[String], + options: CSVOptions): RDD[InternalRow] = { +tokenizeDataset(csvDataset, maybeFirstLine, options).mapPartitions { iter => + val parser = new UnivocityParser(schema, options) + iter.flatMap(line => parser.convert(line)) +} + } --- End diff -- cc @cloud-fan, this is still a WIP but I am trying to put the different execution paths for CSV parsing in here. For example:
- `spark.read.csv(file)`
  - data: `parseIterator` (note that this one is read from a partitioned file)
  - schema: `tokenizeDataset`
- `spark.read.csv(file)` with `wholeFile`
  - data: `parseStream`
  - schema: `tokenizeStream`
- `spark.read.csv(dataset)`
  - data: `parseDataset`
  - schema: `tokenizeDataset`

However, it seems to end up with somewhat weird arguments here.. do you think it is okay?
[GitHub] spark pull request #16854: [WIP][SPARK-15463][SQL] Add an API to load DataFr...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/16854#discussion_r103895193 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala --- @@ -344,4 +346,36 @@ private[csv] object UnivocityParser { CSVUtils.filterCommentAndEmpty(linesWithoutHeader, options) filteredLines.flatMap(line => parser.parse(line)) } + + /** + * Parses a `Dataset` that contains CSV strings and turns it into an `RDD` of rows. + */ + def tokenizeDataset( + csvDataset: Dataset[String], + maybeFirstLine: Option[String], + options: CSVOptions): RDD[Array[String]] = { +val filtered = CSVUtils.filterCommentAndEmpty(csvDataset, options) +val linesWithoutHeader = maybeFirstLine.map { firstLine => + filtered.rdd.mapPartitions(CSVUtils.filterHeaderLine(_, firstLine, options)) +}.getOrElse(filtered.rdd) + +linesWithoutHeader.mapPartitions { iter => + val parser = new CsvParser(options.asParserSettings) + iter.map(line => parser.parseLine(line)) +} + } + + /** + * Parses a `Dataset` that contains CSV strings and turns it into an `RDD` of rows. + */ + def parseDataset( + csvDataset: Dataset[String], + schema: StructType, + maybeFirstLine: Option[String], + options: CSVOptions): RDD[InternalRow] = { +tokenizeDataset(csvDataset, maybeFirstLine, options).mapPartitions { iter => + val parser = new UnivocityParser(schema, options) + iter.flatMap(line => parser.convert(line)) +} + } --- End diff -- If you are not sure or it looks painful to review, let me take out all the changes and put those into `DataFrameReader.csv` for now.
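The two-step tokenize-then-parse pipeline discussed in this diff can be sketched without Spark; `tokenize_dataset` and `parse_dataset` below are toy stand-ins operating on plain Python lists, not the actual `UnivocityParser` helpers:

```python
import csv
import io

def tokenize_dataset(lines, skip_header=None):
    """Toy model of the tokenize step: split each CSV line into tokens,
    optionally dropping a header line (illustrative; not Spark's CsvParser)."""
    out = []
    for line in lines:
        if skip_header is not None and line == skip_header:
            continue
        out.append(next(csv.reader(io.StringIO(line))))
    return out

def parse_dataset(lines, convert, skip_header=None):
    """Toy model of the parse step: tokenize, then convert tokens to 'rows'."""
    return [convert(tokens) for tokens in tokenize_dataset(lines, skip_header)]

lines = ["a,b", "1,x", "2,y"]
assert tokenize_dataset(lines, skip_header="a,b") == [["1", "x"], ["2", "y"]]
assert parse_dataset(lines, tuple, skip_header="a,b") == [("1", "x"), ("2", "y")]
```

Splitting the pipeline this way mirrors why schema inference only needs the tokenize step, while reading data needs both.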
[GitHub] spark issue #17136: [SPARK-19783][SQL] Treat shorter/longer lengths of token...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/17136 Oh, @maropu, I have been looking into R's `read.csv` before adding some comments in the JIRAs. With the data below:

```
a,b,c
a,b,c,d,e,d,d
```

```
> read.csv("test.csv")
Error in read.table(file = file, header = header, sep = sep, quote = quote, :
```

With the data below:

```csv
a,b,c,d,e,d,d
a,b,c
```

```r
> read.csv("test.csv")
  a b c  d  e d.1 d.2
1 a b c NA NA  NA  NA
```

So, IMHO, we might be better off following R's `read.csv` for now, but of course I guess we should take a look at other libraries. I am actually a bit worried about a behaviour change, because `PERMISSIVE` has been the default mode since this was a third-party library (Spark 1.3+). Another concern is, it seems we should produce `columnNameOfCorruptRecord` as we do in the JSON datasource if we treat those tokens as malformed ones in `PERMISSIVE` mode.
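One possible PERMISSIVE-style handling of the token-count mismatch can be sketched as a toy function. This is only an assumption about a possible behaviour: it pads short rows the way R pads with `NA`, and truncates long rows, whereas R (as shown above) instead errors when a later row is longer than the header:

```python
def permissive_align(tokens, schema_len):
    """Toy model of aligning a token list to a fixed-width schema
    (illustrative; not Spark's UnivocityParser)."""
    if len(tokens) > schema_len:
        return tokens[:schema_len]                       # drop extra tokens
    return tokens + [None] * (schema_len - len(tokens))  # pad like R's NA

assert permissive_align(["a", "b", "c"], 5) == ["a", "b", "c", None, None]
assert permissive_align(["a", "b", "c", "d"], 3) == ["a", "b", "c"]
```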
[GitHub] spark pull request #14788: [SPARK-17174][SQL] Add the support for TimestampT...
Github user HyukjinKwon closed the pull request at: https://github.com/apache/spark/pull/14788
[GitHub] spark pull request #17142: [SPARK-18699][SQL][FOLLOWUP] Add explanation in C...
GitHub user HyukjinKwon opened a pull request: https://github.com/apache/spark/pull/17142 [SPARK-18699][SQL][FOLLOWUP] Add explanation in CSV parser and minor cleanup

## What changes were proposed in this pull request?

This PR suggests adding some comments in the `UnivocityParser` logic to explain what happens. Also, it proposes an IMHO slightly cleaner structure (at least easier for me to explain).

## How was this patch tested?

Unit tests in `CSVSuite`.

You can merge this pull request into a Git repository by running: $ git pull https://github.com/HyukjinKwon/spark SPARK-18699 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/17142.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #17142
[GitHub] spark pull request #17142: [SPARK-18699][SQL][FOLLOWUP] Add explanation in C...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/17142#discussion_r103986635 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala --- @@ -54,39 +54,77 @@ private[csv] class UnivocityParser( private val dataSchema = StructType(schema.filter(_.name != options.columnNameOfCorruptRecord)) - private val valueConverters = -dataSchema.map(f => makeConverter(f.name, f.dataType, f.nullable, options)).toArray - private val tokenizer = new CsvParser(options.asParserSettings) private var numMalformedRecords = 0 private val row = new GenericInternalRow(requiredSchema.length) - // This gets the raw input that is parsed lately. + // In `PERMISSIVE` parse mode, we should be able to put the raw malformed row into the field + // specified in `columnNameOfCorruptRecord`. The raw input is retrieved by this method. private def getCurrentInput(): String = tokenizer.getContext.currentParsedContent().stripLineEnd - // This parser loads an `indexArr._1`-th position value in input tokens, - // then put the value in `row(indexArr._2)`. - private val indexArr: Array[(Int, Int)] = { -val fields = if (options.dropMalformed) { - // If `dropMalformed` is enabled, then it needs to parse all the values - // so that we can decide which row is malformed. - requiredSchema ++ schema.filterNot(requiredSchema.contains(_)) -} else { - requiredSchema -} -// TODO: Revisit this; we need to clean up code here for readability. -// See an URL below for related discussions: -// https://github.com/apache/spark/pull/16928#discussion_r102636720 -val fieldsWithIndexes = fields.zipWithIndex -corruptFieldIndex.map { case corrFieldIndex => - fieldsWithIndexes.filter { case (_, i) => i != corrFieldIndex } -}.getOrElse { - fieldsWithIndexes -}.map { case (f, i) => - (dataSchema.indexOf(f), i) -}.toArray --- End diff -- cc @maropu and @cloud-fan, could you check if this comment looks nicer to you?
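The `(token position, row position)` mapping that `indexArr` builds can be sketched in Python; `make_index_pairs` is an illustrative stand-in using field names in place of Spark's `StructField`s:

```python
def make_index_pairs(data_schema, required_schema, corrupt_index=None):
    """Sketch of the (token position, row position) pairs: for each required
    field, find where its token sits in the full data schema, skipping the
    corrupt-record column (illustrative; not Spark's indexArr)."""
    pairs = []
    for i, field in enumerate(required_schema):
        if i == corrupt_index:
            continue  # the corrupt-record column is filled separately
        pairs.append((data_schema.index(field), i))
    return pairs

data_schema = ["a", "b", "c"]
assert make_index_pairs(data_schema, ["c", "a"]) == [(2, 0), (0, 1)]
assert make_index_pairs(data_schema, ["c", "_corrupt", "a"],
                        corrupt_index=1) == [(2, 0), (0, 2)]
```

Each pair says: read the token at the first position and write it at the second position in the output row, which is exactly the indirection the quoted comment describes.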
[GitHub] spark pull request #17142: [SPARK-18699][SQL][FOLLOWUP] Add explanation in C...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/17142#discussion_r104063545 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala --- @@ -54,39 +54,77 @@ private[csv] class UnivocityParser( private val dataSchema = StructType(schema.filter(_.name != options.columnNameOfCorruptRecord)) - private val valueConverters = -dataSchema.map(f => makeConverter(f.name, f.dataType, f.nullable, options)).toArray - private val tokenizer = new CsvParser(options.asParserSettings) private var numMalformedRecords = 0 private val row = new GenericInternalRow(requiredSchema.length) - // This gets the raw input that is parsed lately. + // In `PERMISSIVE` parse mode, we should be able to put the raw malformed row into the field + // specified in `columnNameOfCorruptRecord`. The raw input is retrieved by this method. private def getCurrentInput(): String = tokenizer.getContext.currentParsedContent().stripLineEnd - // This parser loads an `indexArr._1`-th position value in input tokens, - // then put the value in `row(indexArr._2)`. - private val indexArr: Array[(Int, Int)] = { -val fields = if (options.dropMalformed) { - // If `dropMalformed` is enabled, then it needs to parse all the values - // so that we can decide which row is malformed. - requiredSchema ++ schema.filterNot(requiredSchema.contains(_)) -} else { - requiredSchema -} -// TODO: Revisit this; we need to clean up code here for readability. -// See an URL below for related discussions: -// https://github.com/apache/spark/pull/16928#discussion_r102636720 -val fieldsWithIndexes = fields.zipWithIndex -corruptFieldIndex.map { case corrFieldIndex => - fieldsWithIndexes.filter { case (_, i) => i != corrFieldIndex } -}.getOrElse { - fieldsWithIndexes -}.map { case (f, i) => - (dataSchema.indexOf(f), i) -}.toArray --- End diff -- It is just a little bit for codes.. actually :). 
I hope the comment makes reading this code easier and does not look too verbose.
[GitHub] spark issue #17096: [SPARK-15243][ML][SQL][PYTHON] Add missing support for u...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/17096 @viirya, thank you so much for taking a look and your time. So, basically, in the second case it compares str to unicode as below:

```python
>>> u"測試" == u"測試".encode("utf-8")
False
```

Apparently, it seems we could pass unicode as is? Let me raise another issue for this after testing and looking into it. Actually, the support in `StructType.add` seems not to be the problem specified in the JIRA.
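For reference, modern Python surfaces the same pitfall as `str` vs `bytes`; this snippet just reproduces the comparison above, not any Spark code:

```python
# Equal-looking values compare unequal once one side is encoded:
# text (str) never equals its bytes encoding.
s = u"測試"
encoded = s.encode("utf-8")
assert s != encoded                  # str vs bytes -> False for ==
assert s == encoded.decode("utf-8")  # decoding restores equality
```

This is why comparing a user-supplied unicode value against an encoded internal value silently fails the equality check.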
[GitHub] spark issue #17096: [SPARK-15243][ML][SQL][PYTHON] Add missing support for u...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/17096 Let me check if each is fine for sure.
[GitHub] spark issue #17096: [SPARK-15243][ML][SQL][PYTHON] Add missing support for u...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/17096 @holdenk and @viirya, I got rid of the changes in `types.py` and only left the ones I am pretty sure about. There are two kinds of changes here, which look used only in local scope. One seems used for `getattr`; I guess it is fine as below:

```python
>>> getattr("a", u"__str__")
>>> getattr("a", "__str__")
```

and the other one seems used for setting a parameter on the JVM side, which is already used much more in the code base.
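The `getattr` point can be checked directly: both spellings of the attribute name resolve to the same method, so passing a unicode name through is safe:

```python
# getattr accepts the attribute name as text regardless of how the literal
# is written; both lookups bind the same method and give the same result.
a = "a"
assert getattr(a, u"__str__")() == getattr(a, "__str__")() == "a"
```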
[GitHub] spark issue #17142: [SPARK-18699][SQL][FOLLOWUP] Add explanation in CSV pars...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/17142 I definitely will. Thank you so much @cloud-fan and @maropu.
[GitHub] spark issue #16929: [SPARK-19595][SQL] Support json array in from_json
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/16929 @marmbrus and @brkyvz, would there be other things I should double check?
[GitHub] spark pull request #17160: [SPARK-19701][SQL][PYTHON] Throws a correct excep...
GitHub user HyukjinKwon opened a pull request: https://github.com/apache/spark/pull/17160 [SPARK-19701][SQL][PYTHON] Throws a correct exception for 'in' operator against column ## What changes were proposed in this pull request? This PR proposes to remove an incorrect implementation that has not been executed so far (at least since Spark 1.5.2) for the `in` operator, and to throw a correct exception rather than one about converting a column into bool. I tested the code below in 1.5.2, 1.6.3, 2.1.0 and in the master branch: **1.5.2** ```python >>> df = sqlContext.createDataFrame([[1]]) >>> 1 in df._1 Traceback (most recent call last): File "", line 1, in File ".../spark-1.5.2-bin-hadoop2.6/python/pyspark/sql/column.py", line 418, in __nonzero__ raise ValueError("Cannot convert column into bool: please use '&' for 'and', '|' for 'or', " ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions. ``` **1.6.3** ```python >>> 1 in sqlContext.range(1).id Traceback (most recent call last): File "", line 1, in File ".../spark-1.6.3-bin-hadoop2.6/python/pyspark/sql/column.py", line 447, in __nonzero__ raise ValueError("Cannot convert column into bool: please use '&' for 'and', '|' for 'or', " ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions. ``` **2.1.0** ```python >>> 1 in spark.range(1).id Traceback (most recent call last): File "", line 1, in File ".../spark-2.1.0-bin-hadoop2.7/python/pyspark/sql/column.py", line 426, in __nonzero__ raise ValueError("Cannot convert column into bool: please use '&' for 'and', '|' for 'or', " ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions. 
``` **Current Master** ```python >>> 1 in spark.range(1).id Traceback (most recent call last): File "", line 1, in File ".../spark/python/pyspark/sql/column.py", line 452, in __nonzero__ raise ValueError("Cannot convert column into bool: please use '&' for 'and', '|' for 'or', " ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions. ``` **After** ```python >>> 1 in spark.range(1).id Traceback (most recent call last): File "", line 1, in File ".../spark/python/pyspark/sql/column.py", line 184, in __contains__ raise ValueError("Cannot apply 'in' operator against a column: please use 'contains' " ValueError: Cannot apply 'in' operator against a column: please use 'contains' in a string column or 'array_contains' function for an array column. ``` In more detail, it seems the implementation intended to support this: ```python 1 in df.column ``` However, currently, it throws an exception as below: ```python Traceback (most recent call last): File "", line 1, in File ".../spark/python/pyspark/sql/column.py", line 426, in __nonzero__ raise ValueError("Cannot convert column into bool: please use '&' for 'and', '|' for 'or', " ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions. ``` What happens here is as below: ```python class Column(object): def __contains__(self, item): print "I am contains" return Column() def __nonzero__(self): raise Exception("I am nonzero.") >>> 1 in Column() I am contains Traceback (most recent call last): File "", line 1, in File "", line 6, in __nonzero__ Exception: I am nonzero. ``` It seems it calls `__contains__` first and then coerces the returned value through `__nonzero__`.
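The same call chain can be sketched in plain Python 3, where `__nonzero__` from the Python 2 snippet corresponds to `__bool__`; `FakeColumn` below is a hypothetical stand-in for pyspark's `Column`, not the real class:

```python
class FakeColumn:
    """Illustrative stand-in for pyspark.sql.Column."""

    def __contains__(self, item):
        # 'in' calls this first, then coerces the result with bool(),
        # which invokes __bool__ on the returned object below.
        return FakeColumn()

    def __bool__(self):
        raise ValueError("Cannot convert column into bool")

try:
    1 in FakeColumn()
except ValueError as exc:
    print(exc)  # raised from __bool__, not from __contains__ itself
```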
[GitHub] spark issue #17160: [SPARK-19701][SQL][PYTHON] Throws a correct exception fo...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/17160 cc @cloud-fan, @davies and @holdenk.
[GitHub] spark pull request #17160: [SPARK-19701][SQL][PYTHON] Throws a correct excep...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/17160#discussion_r104278452 --- Diff: python/pyspark/sql/column.py --- @@ -180,7 +180,9 @@ def __init__(self, jc): __ror__ = _bin_op("or") # container operators -__contains__ = _bin_op("contains") +def __contains__(self, item): +raise ValueError("Cannot apply 'in' operator against a column: please use 'contains' " + "in a string column or 'array_contains' function for an array column.") --- End diff -- What I meant here is to use ```python >>> df = spark.range(1) >>> df.select(df.id.contains(0)).show() ``` ``` +---------------+ |contains(id, 0)| +---------------+ | true| +---------------+ ``` or ```python >>> from pyspark.sql.functions import array_contains >>> df = spark.createDataFrame([[[0]]], ["id"]) >>> df.select(array_contains(df.id, 0)).show() ``` ``` +---------------------+ |array_contains(id, 0)| +---------------------+ | true| +---------------------+ ```
[GitHub] spark issue #16929: [SPARK-19595][SQL] Support json array in from_json
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/16929 Thank you so much. Let me clean up.
[GitHub] spark issue #16854: [WIP][SPARK-15463][SQL] Add an API to load DataFrame fro...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/16854 Oh, no. It does not need to. I just meant to de-duplicate some logic by https://github.com/apache/spark/pull/16854#discussion_r103894945. Let me just remove that part and leave only the code changes dedicated to this JIRA. It seems to be confusing reviewers. Let me clean up soon.
[GitHub] spark pull request #17161: [SPARK-19819][SparkR] Use concrete data in SparkR...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/17161#discussion_r104283930 --- Diff: R/pkg/R/DataFrame.R --- @@ -92,8 +92,7 @@ dataFrame <- function(sdf, isCached = FALSE) { #' @examples #'\dontrun{ #' sparkR.session() -#' path <- "path/to/file.json" -#' df <- read.json(path) +#' df <- createDataFrame(mtcars) --- End diff -- Should we define `mtcars` to run this example?
[GitHub] spark issue #17161: [SPARK-19819][SparkR] Use concrete data in SparkR DataFr...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/17161 This rings a bell to me. I opened a PR https://github.com/apache/spark/pull/16824 to make the examples in the Python API doc self-contained, but closed it as it did not seem to draw much interest from other committers. I would like to see if the R committers support this, and then reopen it if they like the idea.
[GitHub] spark issue #17160: [SPARK-19701][SQL][PYTHON] Throws a correct exception fo...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/17160 **without `__contains__`, `__nonzero__` and `__bool__`** ```python class Column(object): pass >>> 1 in Column() Traceback (most recent call last): File "", line 1, in TypeError: argument of type 'Column' is not iterable >>> bool(Column()) True ``` **without `__nonzero__` and `__bool__`** ```python class Column(object): def __contains__(self, item): print "I am contains" return Column() >>> 1 in Column() I am contains True >>> bool(Column()) True ``` **without `__contains__`** ```python class Column(object): def __nonzero__(self): print "I am nonzero" return Column() >>> 1 in Column() Traceback (most recent call last): File "", line 1, in TypeError: argument of type 'Column' is not iterable >>> bool(Column()) I am nonzero Traceback (most recent call last): File "", line 1, in TypeError: __nonzero__ should return bool or int, returned Column ```
[GitHub] spark issue #17160: [SPARK-19701][SQL][PYTHON] Throws a correct exception fo...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/17160 So.. it seems ```python TypeError: argument of type 'int' is not iterable ``` vs ```python ValueError: Cannot apply 'in' operator against a column: please use 'contains' in a string column or 'array_contains' function for an array column. ```
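The `TypeError` side of that comparison is just Python's default behaviour when a type supports neither `__contains__` nor iteration, as this small sketch shows:

```python
# With no __contains__ and no iteration protocol, 'in' fails outright.
try:
    1 in 2  # int supports neither containment nor iteration
except TypeError as exc:
    print(exc)  # e.g. "argument of type 'int' is not iterable"
```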
[GitHub] spark pull request #17068: [SPARK-19709][SQL] Read empty file with CSV data ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/17068#discussion_r104288502 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala --- @@ -133,8 +133,19 @@ object TextInputCSVDataSource extends CSVDataSource { sparkSession: SparkSession, inputPaths: Seq[FileStatus], parsedOptions: CSVOptions): Option[StructType] = { -val csv: Dataset[String] = createBaseDataset(sparkSession, inputPaths, parsedOptions) -val firstLine: String = CSVUtils.filterCommentAndEmpty(csv, parsedOptions).first() +val csv = createBaseDataset(sparkSession, inputPaths, parsedOptions) +CSVUtils.filterCommentAndEmpty(csv, parsedOptions) + .take(1) + .headOption + .map(firstLine => infer(sparkSession, parsedOptions, csv, firstLine)) + .orElse(Some(StructType(Seq( --- End diff -- Could we maybe just match it to [CSVDataSource.scala#L204-L224](https://github.com/wojtek-szymanski/spark/blob/bdf189087c52e8934cee4ee5563dffea9dde6a99/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala#L204-L224) just for consistency for now? Personally, I think such chaining sometimes makes the code hard to read. Maybe we could consider de-duplicating this code in another PR. It would be easier if they looked similar, at least.
[GitHub] spark pull request #17068: [SPARK-19709][SQL] Read empty file with CSV data ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/17068#discussion_r104288543 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala --- @@ -305,13 +305,21 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils { test("test with empty file and known schema") { val result = spark.read .format("csv") - .schema(StructType(List(StructField("column", StringType, false + .schema(StructType(List(StructField("column", StringType, nullable = false .load(testFile(emptyFile)) -assert(result.collect.size === 0) +assert(result.collect().isEmpty) assert(result.schema.fieldNames.size === 1) } + test("test with empty file without schema") { --- End diff -- Let's re-use the test in [CSVSuite.scala#L1083](https://github.com/wojtek-szymanski/spark/blob/bdf189087c52e8934cee4ee5563dffea9dde6a99/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala#L1083). We could.. ```scala test("Empty file produces empty dataframe with empty schema") { Seq(false, true).foreach { wholeFile => val df = spark.read.format("csv") .option("header", true) .option("wholeFile", wholeFile) .load(testFile(emptyFile)) assert(df.schema === spark.emptyDataFrame.schema) checkAnswer(df, spark.emptyDataFrame) } } ```
[GitHub] spark issue #17068: [SPARK-19709][SQL] Read empty file with CSV data source
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/17068 Hi @wojtek-szymanski, these are all from me. Let me cc @cloud-fan, as my PRs related to this were reviewed by him and I guess I can't trigger the test.
[GitHub] spark pull request #16854: [WIP][SPARK-15463][SQL] Add an API to load DataFr...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/16854#discussion_r104289988 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala --- @@ -399,6 +395,52 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging { } /** + * Loads an `Dataset[String]` storing CSV rows and returns the result as a `DataFrame`. + * + * Unless the schema is specified using `schema` function, this function goes through the + * input once to determine the input schema. + * + * @param csvDataset input Dataset with one CSV row per record + * @since 2.2.0 + */ + def csv(csvDataset: Dataset[String]): DataFrame = { +val parsedOptions: CSVOptions = new CSVOptions( + extraOptions.toMap, + sparkSession.sessionState.conf.sessionLocalTimeZone) +val filteredLines = CSVUtils.filterCommentAndEmpty(csvDataset, parsedOptions) +val maybeFirstLine = filteredLines.take(1).headOption +if (maybeFirstLine.isEmpty) { + return sparkSession.emptyDataFrame --- End diff -- The reason why this one exists, unlike json, is that CSV always needs to read the first line first (even if it does not infer the schema, it needs at least the number of values). In this case, we can return an empty one fast.
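As a rough non-Spark analogy (using Python's stdlib `csv` module on made-up sample data), the "CSV always needs the first line" point can be sketched as follows: even with no type inference at all, the parser must read one row to learn the column count.

```python
import csv
import io

# Hypothetical sample input; only the first row is needed to know
# how many values each record carries.
data = io.StringIO("year,make,model\n2015,Chevy,Volt\n")
first_row = next(csv.reader(data))
num_columns = len(first_row)
print(num_columns)  # 3
```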
[GitHub] spark pull request #16854: [WIP][SPARK-15463][SQL] Add an API to load DataFr...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/16854#discussion_r104290005 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala --- @@ -399,6 +395,52 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging { } /** + * Loads an `Dataset[String]` storing CSV rows and returns the result as a `DataFrame`. + * + * Unless the schema is specified using `schema` function, this function goes through the + * input once to determine the input schema. + * + * @param csvDataset input Dataset with one CSV row per record + * @since 2.2.0 + */ + def csv(csvDataset: Dataset[String]): DataFrame = { +val parsedOptions: CSVOptions = new CSVOptions( + extraOptions.toMap, + sparkSession.sessionState.conf.sessionLocalTimeZone) +val filteredLines = CSVUtils.filterCommentAndEmpty(csvDataset, parsedOptions) +val maybeFirstLine = filteredLines.take(1).headOption +if (maybeFirstLine.isEmpty) { + return sparkSession.emptyDataFrame +} + +val firstLine = maybeFirstLine.get +val linesWithoutHeader: RDD[String] = filteredLines.rdd.mapPartitions( + CSVUtils.filterHeaderLine(_, firstLine, parsedOptions)) + +val schema = userSpecifiedSchema.getOrElse { --- End diff -- There is a similar code path in https://github.com/apache/spark/blob/7e5359be5ca038fdb579712b18e7f226d705c276/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala#L132-L150
[GitHub] spark pull request #16854: [WIP][SPARK-15463][SQL] Add an API to load DataFr...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/16854#discussion_r104289951 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala --- @@ -604,6 +646,22 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging { } } + /** + * A convenient function for schema validation that takes `columnNameOfCorruptRecord` + * as an option. + */ + private def verifyColumnNameOfCorruptRecord( --- End diff -- Maybe, this is too much. I am willing to revert this back.
[GitHub] spark issue #16854: [WIP][SPARK-15463][SQL] Add an API to load DataFrame fro...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/16854 Let me double check before getting rid of `[WIP]` tomorrow.
[GitHub] spark pull request #16854: [WIP][SPARK-15463][SQL] Add an API to load DataFr...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/16854#discussion_r104290082 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala --- @@ -604,6 +646,22 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging { } } + /** + * A convenient function for schema validation in datasources supporting + * `columnNameOfCorruptRecord` as an option. + */ + private def verifyColumnNameOfCorruptRecord( --- End diff -- Maybe, this is too much. I am willing to revert this back.
[GitHub] spark pull request #16854: [WIP][SPARK-15463][SQL] Add an API to load DataFr...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/16854#discussion_r104309644 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala --- @@ -59,11 +58,21 @@ abstract class CSVDataSource extends Serializable { sparkSession: SparkSession, inputPaths: Seq[FileStatus], parsedOptions: CSVOptions): Option[StructType] +} + +object CSVDataSource { + def apply(options: CSVOptions): CSVDataSource = { +if (options.wholeFile) { + WholeFileCSVDataSource +} else { + TextInputCSVDataSource +} + } /** * Generates a header from the given row which is null-safe and duplicate-safe. */ - protected def makeSafeHeader( + def makeSafeHeader( --- End diff -- `makeSafeHeader` was moved from `CSVDataSource` class to `CSVDataSource` companion object so that this can be accessed in `DataFrameReader`.
[GitHub] spark pull request #17136: [SPARK-19783][SQL] Treat shorter/longer lengths o...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/17136#discussion_r104309730 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala --- @@ -1085,4 +1062,23 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils { checkAnswer(df, spark.emptyDataFrame) } } + + test("SPARK-X regard as malformed records if the length is not equal to expected one") { --- End diff -- Maybe `SPARK-19783` :).
[GitHub] spark pull request #17136: [SPARK-19783][SQL] Treat shorter/longer lengths o...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/17136#discussion_r104309710 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala --- @@ -232,43 +243,27 @@ private[csv] class UnivocityParser( throw new RuntimeException(s"Malformed line in FAILFAST mode: " + s"${tokens.mkString(options.delimiter.toString)}") } else { --- End diff -- Could we make this ``` else { if { ... } else { ... } } ``` to ``` else if { ... } else { ... } ```
[GitHub] spark pull request #17136: [SPARK-19783][SQL] Treat shorter/longer lengths o...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/17136#discussion_r104310045 --- Diff: R/pkg/inst/tests/testthat/test_sparkSQL.R --- @@ -246,8 +246,8 @@ test_that("read/write csv as DataFrame", { mockLinesCsv <- c("year,make,model,comment,blank", "\"2012\",\"Tesla\",\"S\",\"No comment\",", "1997,Ford,E350,\"Go get one now they are going fast\",", - "2015,Chevy,Volt", - "NA,Dummy,Placeholder") + "2015,Chevy,Volt,,", --- End diff -- We might need a way to handle it if we clean up and define the behaviour of parse modes..
[GitHub] spark pull request #17136: [SPARK-19783][SQL] Treat shorter/longer lengths o...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/17136#discussion_r104310061 --- Diff: R/pkg/inst/tests/testthat/test_sparkSQL.R --- @@ -246,8 +246,8 @@ test_that("read/write csv as DataFrame", { mockLinesCsv <- c("year,make,model,comment,blank", "\"2012\",\"Tesla\",\"S\",\"No comment\",", "1997,Ford,E350,\"Go get one now they are going fast\",", - "2015,Chevy,Volt", - "NA,Dummy,Placeholder") + "2015,Chevy,Volt,,", --- End diff -- We might need a way to handle it after we clean up and define the behaviour of parse modes..
[GitHub] spark issue #16929: [SPARK-19595][SQL] Support json array in from_json
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/16929 I just updated the PR description to prevent confusion.
[GitHub] spark issue #16929: [SPARK-19595][SQL] Support json array in from_json
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/16929 Cc @brkyvz and @marmbrus, could this be merged by any chance?
[GitHub] spark issue #16929: [SPARK-19595][SQL] Support json array in from_json
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/16929 Thank you @brkyvz.
[GitHub] spark pull request #16854: [SPARK-15463][SQL] Add an API to load DataFrame f...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/16854#discussion_r104330267 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala --- @@ -399,6 +395,52 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging { } /** + * Loads an `Dataset[String]` storing CSV rows and returns the result as a `DataFrame`. + * + * Unless the schema is specified using `schema` function, this function goes through the + * input once to determine the input schema. + * + * @param csvDataset input Dataset with one CSV row per record + * @since 2.2.0 + */ + def csv(csvDataset: Dataset[String]): DataFrame = { +val parsedOptions: CSVOptions = new CSVOptions( + extraOptions.toMap, + sparkSession.sessionState.conf.sessionLocalTimeZone) +val filteredLines = CSVUtils.filterCommentAndEmpty(csvDataset, parsedOptions) +val maybeFirstLine = filteredLines.take(1).headOption +if (maybeFirstLine.isEmpty) { + return sparkSession.emptyDataFrame +} + +val firstLine = maybeFirstLine.get +val linesWithoutHeader: RDD[String] = filteredLines.rdd.mapPartitions( + CSVUtils.filterHeaderLine(_, firstLine, parsedOptions)) + +val schema = userSpecifiedSchema.getOrElse { --- End diff -- Yes, I think we could with `wholeFile` explicitly set to false. Let me give it a shot and show you if it looks okay.
[GitHub] spark pull request #16854: [SPARK-15463][SQL] Add an API to load DataFrame f...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/16854#discussion_r104330396 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala --- @@ -399,6 +395,52 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging { } /** + * Loads an `Dataset[String]` storing CSV rows and returns the result as a `DataFrame`. + * + * Unless the schema is specified using `schema` function, this function goes through the + * input once to determine the input schema. + * + * @param csvDataset input Dataset with one CSV row per record + * @since 2.2.0 + */ + def csv(csvDataset: Dataset[String]): DataFrame = { +val parsedOptions: CSVOptions = new CSVOptions( + extraOptions.toMap, + sparkSession.sessionState.conf.sessionLocalTimeZone) +val filteredLines = CSVUtils.filterCommentAndEmpty(csvDataset, parsedOptions) +val maybeFirstLine = filteredLines.take(1).headOption +if (maybeFirstLine.isEmpty) { + return sparkSession.emptyDataFrame +} + +val firstLine = maybeFirstLine.get +val linesWithoutHeader: RDD[String] = filteredLines.rdd.mapPartitions( + CSVUtils.filterHeaderLine(_, firstLine, parsedOptions)) + +val schema = userSpecifiedSchema.getOrElse { --- End diff -- Oh, sorry, I overlooked. It seems `TextInputCSVDataSource.infer` takes input paths whereas we want a `Dataset` here. Let me take a look and see if we could reuse it.
[GitHub] spark issue #17160: [SPARK-19701][SQL][PYTHON] Throws a correct exception fo...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/17160 It is not needed for `__contains__` but is needed for `bool`, because I guess we would not want to return a bool when other operators return `Column`. For example, ```python >>> not spark.range(1).id Traceback (most recent call last): File "<stdin>", line 1, in <module> File ".../spark/python/pyspark/sql/column.py", line 452, in __nonzero__ raise ValueError("Cannot convert column into bool: please use '&' for 'and', '|' for 'or', " ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions. ``` If we remove this, then ``` >>> not spark.range(1).id False ``` I think we want `Column` as below: ```python >>> 1 < spark.range(1).id Column<(id > 1)> ``` but for `bool`, it seems not.
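The behaviour discussed above can be sketched with a minimal, made-up Column class (purely illustrative, not pyspark's actual implementation): comparisons build a new Column expression, while truth-testing raises so that `not col` cannot silently return a bool.

```python
class Column:
    """A toy stand-in for pyspark's Column, for illustration only."""

    def __init__(self, expr):
        self.expr = expr

    def __lt__(self, other):
        # comparison operators return a Column wrapping the new expression
        return Column("(%s < %s)" % (self.expr, other))

    def __bool__(self):  # spelled __nonzero__ under Python 2
        raise ValueError(
            "Cannot convert column into bool: please use '&' for 'and', "
            "'|' for 'or', '~' for 'not'")


col = Column("id")
print((col < 1).expr)  # → (id < 1)
try:
    not col  # `not` goes through __bool__, which raises
except ValueError as e:
    print("raised:", e)
```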
[GitHub] spark issue #17160: [SPARK-19701][SQL][PYTHON] Throws a correct exception fo...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/17160 @cloud-fan Thank you so much for looking into this more deeply and asking about the details.
[GitHub] spark issue #17161: [SPARK-19819][SparkR] Use concrete data in SparkR DataFr...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/17161 @felixcheung, please let me leave my humble opinion. In general, I like this change. Maybe, it should not necessarily be always runnable (at least for now) but I believe it is still an improvement to make them runnable; for example, it allows users to directly copy and paste the example and easily test it. To me, this is like the case where we (at least I) prefer a JIRA to have a self-contained reproducer. Personally, I just copy and paste the example and check the input/output when I first learn and try new APIs as a user. > Firstly, I see this as slightly different from Python, in that in R it is common to have built-in datasets and possibly users are used to having them and having examples using them. In the case of Python examples, to my knowledge, it is also common to document self-runnable examples (e.g., [pandas](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.pivot.html#pandas.DataFrame.pivot) and [numpy](https://docs.scipy.org/doc/numpy/reference/arrays.ndarray.html)). Moreover, they are tested in doctests, so there is no overhead of checking that they are runnable every time. Maybe, let's talk about this in the PR or JIRA (I am sorry for dragging the Python one into this). > I have done a pass on the changes in this PR and I'm happy with changing from non-existing json file to mtcars. I'm slightly concerned with the few cases of artificial 3-row data (like here) - more on that below on small dataset. Maybe, we could take out the examples that you are concerned about for now.. For the four concerns you described, I agree in general; I would disagree if the downsides outweighed the improvements, but to me it sounds like the improvement outweighs the downsides of this change.
> how much work and how much change is it to change all examples (this is only 1 .R out of 20-something files we have, in a total of 300+ methods which is on the high side for R packages) I guess the first one is about reviewing cost/backporting/conflict concerns. In terms of reviewing cost, I am willing to help verify the examples by manually running them, even if there are many, if the change is big. > how much churn will it be to keep them up-to-date when we are having changes to API (eg. sparkR.session()); especially since in order to have examples self-contained we tend to add additional calls to manipulate data and thereby increasing the number of references of API calls; which could get stale easily like the example with insertInto For the second concern, if it makes the maintenance harder, it'd be an important point, but I think we are being conservative on the API changes from Spark 2.x. So, I guess we would be less likely to need to sweep these. I think it is okay even if it would change in the near future, as long as the changes are not quite big. > perhaps more importantly, how practical or useful it would be to use built-in datasets or native R data.frame (mtcars, cars, Titanic, iris, or make up some; that are super small) on a scalable data platform like Spark? perhaps it is better to demonstrate, in examples, how to work with external data sources, multiple file formats etc.? > and lastly, we still have about a dozen methods that are without example that are being flagged by CRAN checks (but not enough to fail it yet) For the third and fourth concerns, I think we could improve this by adding more examples, explaining and describing the details. For example, we could leave some comments about this such as "this is an example, but in practice ...". Maybe we could do this in another PR. Otherwise, we could leave the examples as they are, with some proper comments. As for other thoughts, I think it would be great if they are at least run in the tests.
[GitHub] spark pull request #17068: [SPARK-19709][SQL] Read empty file with CSV data ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/17068#discussion_r104350586 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala --- @@ -190,28 +194,28 @@ object WholeFileCSVDataSource extends CSVDataSource { sparkSession: SparkSession, inputPaths: Seq[FileStatus], parsedOptions: CSVOptions): Option[StructType] = { -val csv: RDD[PortableDataStream] = createBaseRdd(sparkSession, inputPaths, parsedOptions) -val maybeFirstRow: Option[Array[String]] = csv.flatMap { lines => +val csv = createBaseRdd(sparkSession, inputPaths, parsedOptions) +csv.flatMap { lines => UnivocityParser.tokenizeStream( CodecStreams.createInputStreamWithCloseResource(lines.getConfiguration, lines.getPath()), -false, +shouldDropHeader = false, new CsvParser(parsedOptions.asParserSettings)) -}.take(1).headOption - -if (maybeFirstRow.isDefined) { - val firstRow = maybeFirstRow.get - val caseSensitive = sparkSession.sessionState.conf.caseSensitiveAnalysis - val header = makeSafeHeader(firstRow, caseSensitive, parsedOptions) - val tokenRDD = csv.flatMap { lines => -UnivocityParser.tokenizeStream( - CodecStreams.createInputStreamWithCloseResource(lines.getConfiguration, lines.getPath()), - parsedOptions.headerFlag, - new CsvParser(parsedOptions.asParserSettings)) - } - Some(CSVInferSchema.infer(tokenRDD, header, parsedOptions)) -} else { - // If the first row could not be read, just return the empty schema. - Some(StructType(Nil)) +}.take(1).headOption match { --- End diff -- IMHO, `Option.isDefined` with `Option.get`, `Option.map` with `Option.getOrElse`, and `Option` with `match case Some... case None` all might be fine. But, how about minimising the change by matching the above one to `Option.isDefined` with `Option.get`? Then, it would not require the changes here.
[GitHub] spark pull request #17068: [SPARK-19709][SQL] Read empty file with CSV data ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/17068#discussion_r104356921 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala --- @@ -190,28 +194,28 @@ object WholeFileCSVDataSource extends CSVDataSource { sparkSession: SparkSession, inputPaths: Seq[FileStatus], parsedOptions: CSVOptions): Option[StructType] = { -val csv: RDD[PortableDataStream] = createBaseRdd(sparkSession, inputPaths, parsedOptions) -val maybeFirstRow: Option[Array[String]] = csv.flatMap { lines => +val csv = createBaseRdd(sparkSession, inputPaths, parsedOptions) +csv.flatMap { lines => UnivocityParser.tokenizeStream( CodecStreams.createInputStreamWithCloseResource(lines.getConfiguration, lines.getPath()), -false, +shouldDropHeader = false, new CsvParser(parsedOptions.asParserSettings)) -}.take(1).headOption - -if (maybeFirstRow.isDefined) { - val firstRow = maybeFirstRow.get - val caseSensitive = sparkSession.sessionState.conf.caseSensitiveAnalysis - val header = makeSafeHeader(firstRow, caseSensitive, parsedOptions) - val tokenRDD = csv.flatMap { lines => -UnivocityParser.tokenizeStream( - CodecStreams.createInputStreamWithCloseResource(lines.getConfiguration, lines.getPath()), - parsedOptions.headerFlag, - new CsvParser(parsedOptions.asParserSettings)) - } - Some(CSVInferSchema.infer(tokenRDD, header, parsedOptions)) -} else { - // If the first row could not be read, just return the empty schema. - Some(StructType(Nil)) +}.take(1).headOption match { --- End diff -- All three patterns I mentioned are being used across the code base. There is no style guide for this in either https://github.com/databricks/scala-style-guide or http://spark.apache.org/contributing.html In this case, matching the new one to the other similar ones is a better choice to reduce changed lines, rather than doing the opposite. Personal taste might be secondary.
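The three Scala `Option`-handling styles debated above are equivalent; a rough Python analogue, with `None` standing in for an empty `Option` (the helper names here are hypothetical, not Spark's):

```python
def first_row_checked(rows, default=()):
    # style 1: Option.isDefined + Option.get → explicit check, then access
    maybe_first = rows[0] if rows else None
    if maybe_first is not None:
        return maybe_first
    return default


def first_row_mapped(rows, default=()):
    # style 2: Option.map + Option.getOrElse → one conditional expression
    # (style 3, `match case Some ... case None`, would be Python's
    # structural pattern matching, omitted here for brevity)
    maybe_first = rows[0] if rows else None
    return maybe_first if maybe_first is not None else default


print(first_row_checked([("a", "b")]))  # → ('a', 'b')
print(first_row_mapped([]))             # → ()
```

Which style to use is largely taste, which is exactly why the thread settles for minimising the diff instead.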
[GitHub] spark pull request #17068: [SPARK-19709][SQL] Read empty file with CSV data ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/17068#discussion_r104357269 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala --- @@ -190,28 +194,28 @@ object WholeFileCSVDataSource extends CSVDataSource { sparkSession: SparkSession, inputPaths: Seq[FileStatus], parsedOptions: CSVOptions): Option[StructType] = { -val csv: RDD[PortableDataStream] = createBaseRdd(sparkSession, inputPaths, parsedOptions) -val maybeFirstRow: Option[Array[String]] = csv.flatMap { lines => +val csv = createBaseRdd(sparkSession, inputPaths, parsedOptions) +csv.flatMap { lines => UnivocityParser.tokenizeStream( CodecStreams.createInputStreamWithCloseResource(lines.getConfiguration, lines.getPath()), -false, +shouldDropHeader = false, new CsvParser(parsedOptions.asParserSettings)) -}.take(1).headOption - -if (maybeFirstRow.isDefined) { - val firstRow = maybeFirstRow.get - val caseSensitive = sparkSession.sessionState.conf.caseSensitiveAnalysis - val header = makeSafeHeader(firstRow, caseSensitive, parsedOptions) - val tokenRDD = csv.flatMap { lines => -UnivocityParser.tokenizeStream( - CodecStreams.createInputStreamWithCloseResource(lines.getConfiguration, lines.getPath()), - parsedOptions.headerFlag, - new CsvParser(parsedOptions.asParserSettings)) - } - Some(CSVInferSchema.infer(tokenRDD, header, parsedOptions)) -} else { - // If the first row could not be read, just return the empty schema. - Some(StructType(Nil)) +}.take(1).headOption match { --- End diff -- https://github.com/apache/spark/pull/17068#discussion_r104356866 did not show up when I wrote my comment. I am fine as is. I am not supposed to decide this.
[GitHub] spark pull request #15666: [SPARK-11421] [Core][Python][R] Added ability for...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/15666#discussion_r104358056 --- Diff: core/src/main/scala/org/apache/spark/TestUtils.scala --- @@ -168,6 +168,27 @@ private[spark] object TestUtils { createCompiledClass(className, destDir, sourceFile, classpathUrls) } + /** Create a dummy compile jar for a given package, classname. Jar will be placed in destDir */ + def createDummyJar(destDir: String, packageName: String, className: String): String = { +val srcDir = new File(destDir, packageName) +srcDir.mkdirs() +val excSource = new JavaSourceFromString(new File(srcDir, className).getAbsolutePath, + s"""package $packageName; + | + |public class $className implements java.io.Serializable { + | public static String helloWorld(String arg) { return "Hello " + arg; } + | public static int addStuff(int arg1, int arg2) { return arg1 + arg2; } + |} +""". +stripMargin) +val excFile = createCompiledClass(className, srcDir, excSource, Seq.empty) --- End diff -- Can we use the URI form here? It seems this one is problematic.
[GitHub] spark pull request #15666: [SPARK-11421] [Core][Python][R] Added ability for...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/15666#discussion_r104357841 --- Diff: R/pkg/R/context.R --- @@ -319,6 +319,34 @@ spark.addFile <- function(path, recursive = FALSE) { invisible(callJMethod(sc, "addFile", suppressWarnings(normalizePath(path)), recursive)) } + +#' Adds a JAR dependency for all tasks to be executed on this SparkContext in the future. +#' +#' The \code{path} passed can be either a local file, a file in HDFS (or other Hadoop-supported +#' filesystems), an HTTP, HTTPS or FTP URI, or local:/path for a file on every worker node. +#' If \code{addToCurrentClassLoader} is true, add the jar to the current threads' classloader. In +#' general adding to the current threads' class loader will impact all other application threads +#' unless they have explicitly changed their class loader. +#' +#' @rdname spark.addJar +#' @param path The path of the jar to be added +#' @param addToCurrentClassLoader Whether to add the jar to the current driver classloader. +#' Default is FALSE. +#' @export +#' @examples +#'\dontrun{
#' spark.addJar("/path/to/something.jar", TRUE) +#'} +#' @note spark.addJar since 2.2.0 +spark.addJar <- function(path, addToCurrentClassLoader = FALSE) { + sc <- getSparkContext() + normalizedPath <- suppressWarnings(normalizePath(path)) + scala_sc <- callJMethod(sc, "sc") + invisible(callJMethod(scala_sc, "addJar", normalizedPath, addToCurrentClassLoader)) --- End diff -- This seems to fail as below: ``` 1.
Error: add jar should work and allow usage of the jar on the driver node (@test_context.R#174) java.lang.IllegalArgumentException: Illegal character in path at index 12: string:///C:\Users\appveyor\AppData\Local\Temp\1\RtmpCqEUJL\testjar\sparkrTests\DummyClassForAddJarTest.java at java.net.URI.create(URI.java:852) at org.apache.spark.TestUtils$.org$apache$spark$TestUtils$$createURI(TestUtils.scala:119) at org.apache.spark.TestUtils$JavaSourceFromString.<init>(TestUtils.scala:123) ```
[GitHub] spark pull request #15666: [SPARK-11421] [Core][Python][R] Added ability for...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/15666#discussion_r104358500 --- Diff: core/src/main/scala/org/apache/spark/TestUtils.scala --- @@ -168,6 +168,27 @@ private[spark] object TestUtils { createCompiledClass(className, destDir, sourceFile, classpathUrls) } + /** Create a dummy compile jar for a given package, classname. Jar will be placed in destDir */ + def createDummyJar(destDir: String, packageName: String, className: String): String = { +val srcDir = new File(destDir, packageName) +srcDir.mkdirs() +val excSource = new JavaSourceFromString(new File(srcDir, className).getAbsolutePath, --- End diff -- Can we use the URI form here? It seems this one is problematic.
[GitHub] spark pull request #15666: [SPARK-11421] [Core][Python][R] Added ability for...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/15666#discussion_r104360211 --- Diff: R/pkg/inst/tests/testthat/test_context.R --- @@ -167,6 +167,18 @@ test_that("spark.lapply should perform simple transforms", { sparkR.session.stop() }) +test_that("add jar should work and allow usage of the jar on the driver node", { --- End diff -- This seems to fail as below: ``` 1. Error: add jar should work and allow usage of the jar on the driver node (@test_context.R#174) java.lang.IllegalArgumentException: Illegal character in path at index 12: string:///C:\Users\appveyor\AppData\Local\Temp\1\RtmpCqEUJL\testjar\sparkrTests\DummyClassForAddJarTest.java at java.net.URI.create(URI.java:852) at org.apache.spark.TestUtils$.org$apache$spark$TestUtils$$createURI(TestUtils.scala:119) at org.apache.spark.TestUtils$JavaSourceFromString.<init>(TestUtils.scala:123) ``` (Sorry, I left the comments on the wrong line)
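The `Illegal character in path` failure above is the classic result of splicing a raw Windows path (with backslashes) into a URI string. Converting the path to file-URI form first avoids it; this is a general Python sketch of the idea, not the actual SparkR/TestUtils fix:

```python
from pathlib import PureWindowsPath

# A raw Windows path like the one in the stack trace: backslashes are
# illegal inside a URI, so URI.create() on it throws.
raw = r"C:\Users\appveyor\AppData\Local\Temp\file.java"

# Converting to file-URI form produces forward slashes and a proper scheme.
uri = PureWindowsPath(raw).as_uri()
print(uri)  # → file:///C:/Users/appveyor/AppData/Local/Temp/file.java
```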
[GitHub] spark pull request #17178: [SPARK-19828][R] Support array type in from_json ...
GitHub user HyukjinKwon opened a pull request: https://github.com/apache/spark/pull/17178 [SPARK-19828][R] Support array type in from_json in R ## What changes were proposed in this pull request? Since we cannot directly define an array type in R, this PR proposes to support array types in R as the string types that are used in `structField`, as below: ```R jsonArr <- "[{\"name\":\"Bob\"}, {\"name\":\"Alice\"}]" df <- as.DataFrame(list(list("people" = jsonArr))) collect(select(df, alias(from_json(df$people, "array<struct<name:string>>"), "arrcol"))) ``` prints ```R arrcol 1 Bob, Alice ``` ## How was this patch tested? Unit tests in `test_sparkSQL.R`. You can merge this pull request into a Git repository by running: $ git pull https://github.com/HyukjinKwon/spark SPARK-19828 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/17178.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #17178 commit 10feb9d2e9e4d027f874e6f508e7e7421c95cd6d Author: hyukjinkwon Date: 2017-03-06T11:15:45Z Support array in from_json in R
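Conceptually, the array support being proposed maps each element of a top-level JSON array to one row, while a top-level object yields a single row. A rough pure-Python sketch of that converter logic (`parse_people` is a hypothetical helper, not a Spark API):

```python
import json

def parse_people(s):
    # A top-level JSON object becomes one "row"; a top-level JSON array
    # becomes one "row" per element -- mirroring the struct vs
    # array-of-structs handling described in this thread.
    parsed = json.loads(s)
    rows = parsed if isinstance(parsed, list) else [parsed]
    return [row.get("name") for row in rows]

print(parse_people('[{"name":"Bob"}, {"name":"Alice"}]'))  # → ['Bob', 'Alice']
print(parse_people('{"name":"Bob"}'))                      # → ['Bob']
```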
[GitHub] spark pull request #17178: [SPARK-19828][R] Support array type in from_json ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/17178#discussion_r104398168 --- Diff: R/pkg/inst/tests/testthat/test_sparkSQL.R --- @@ -1342,28 +1342,52 @@ test_that("column functions", { df <- read.json(mapTypeJsonPath) j <- collect(select(df, alias(to_json(df$info), "json"))) expect_equal(j[order(j$json), ][1], "{\"age\":16,\"height\":176.5}") - df <- as.DataFrame(j) - schema <- structType(structField("age", "integer"), - structField("height", "double")) - s <- collect(select(df, alias(from_json(df$json, schema), "structcol"))) - expect_equal(ncol(s), 1) - expect_equal(nrow(s), 3) - expect_is(s[[1]][[1]], "struct") - expect_true(any(apply(s, 1, function(x) { x[[1]]$age == 16 } ))) - - # passing option - df <- as.DataFrame(list(list("col" = "{\"date\":\"21/10/2014\"}"))) - schema2 <- structType(structField("date", "date")) - expect_error(tryCatch(collect(select(df, from_json(df$col, schema2))), -error = function(e) { stop(e) }), - paste0(".*(java.lang.NumberFormatException: For input string:).*")) - s <- collect(select(df, from_json(df$col, schema2, dateFormat = "dd/MM/yyyy"))) - expect_is(s[[1]][[1]]$date, "Date") - expect_equal(as.character(s[[1]][[1]]$date), "2014-10-21") - - # check for unparseable - df <- as.DataFrame(list(list("a" = ""))) - expect_equal(collect(select(df, from_json(df$a, schema)))[[1]][[1]], NA) + + schemas <- list(structType(structField("age", "integer"), structField("height", "double")), + "struct<age:int,height:double>") + for (schema in schemas) { --- End diff -- I re-used the existing test code with the loop below: ```R schemas <- list(structType(structField("age", "integer"), structField("height", "double")), "struct<age:int,height:double>") for (schema in schemas) { ... } ```
[GitHub] spark pull request #17178: [SPARK-19828][R] Support array type in from_json ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/17178#discussion_r104398323 --- Diff: R/pkg/R/functions.R --- @@ -2430,33 +2430,43 @@ setMethod("date_format", signature(y = "Column", x = "character"), column(jc) }) +setClassUnion("characterOrstructType", c("character", "structType")) --- End diff -- Can I use a union class here in this file?
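The `setClassUnion("characterOrstructType", ...)` question above is about accepting either a type string or a `structType` object in one method signature. A Python sketch of the same dispatch using plain `isinstance` checks; `StructTypeSketch` is a made-up stand-in, not Spark's `StructType`:

```python
class StructTypeSketch:
    """Toy stand-in for a structured schema object (illustrative only)."""

    def __init__(self, ddl):
        self.ddl = ddl


def schema_to_ddl(schema):
    # Normalise either accepted form down to the DDL type string,
    # like a characterOrstructType union parameter would.
    if isinstance(schema, str):
        return schema
    if isinstance(schema, StructTypeSketch):
        return schema.ddl
    raise TypeError("schema must be a character string or a structType")


print(schema_to_ddl("struct<age:int>"))                    # → struct<age:int>
print(schema_to_ddl(StructTypeSketch("struct<age:int>")))  # → struct<age:int>
```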
[GitHub] spark issue #17178: [SPARK-19828][R] Support array type in from_json in R
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/17178 @felixcheung, this is a bit different from what we talked about, but I opened this because I thought you might like it more. This takes the type string given to `structField` now. If you are worried about this, let me change it back to the optional-parameter one. Could you check if this makes sense to you?
[GitHub] spark pull request #15666: [SPARK-11421] [Core][Python][R] Added ability for...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/15666#discussion_r104408350 --- Diff: core/src/main/scala/org/apache/spark/TestUtils.scala --- @@ -168,6 +168,27 @@ private[spark] object TestUtils { createCompiledClass(className, destDir, sourceFile, classpathUrls) } + /** Create a dummy compile jar for a given package, classname. Jar will be placed in destDir */ + def createDummyJar(destDir: String, packageName: String, className: String): String = { +val srcDir = new File(destDir, packageName) +srcDir.mkdirs() +val excSource = new JavaSourceFromString(new File(srcDir, className).getAbsolutePath, --- End diff -- It's fine to push a commit and use AppVeyor, IMHO. There is a guide for this - https://github.com/apache/spark/blob/master/R/WINDOWS.md - if you are willing to test manually, but it is kind of grunt work. In my case, I use my personal AppVeyor account. Otherwise, I can send a PR for your branch if you are happy with waiting till this weekend :).
[GitHub] spark issue #15666: [SPARK-11421] [Core][Python][R] Added ability for addJar...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/15666 Ah.. it only runs the tests when there is a change in R.. give me a moment. Let me trigger the test from my account.
[GitHub] spark issue #15666: [SPARK-11421] [Core][Python][R] Added ability for addJar...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/15666 Build started: [SparkR] `ALL` [![PR-15666](https://ci.appveyor.com/api/projects/status/github/spark-test/spark?branch=2978E854-F59E-4033-9A82-C28E6980E95E&svg=true)](https://ci.appveyor.com/project/spark-test/spark/branch/2978E854-F59E-4033-9A82-C28E6980E95E) PS: It is a funny workaround but you could close and reopen. Then, I believe it triggers the AppVeyor test.
[GitHub] spark pull request #15666: [SPARK-11421] [Core][Python][R] Added ability for...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/15666#discussion_r104429730 --- Diff: R/pkg/inst/tests/testthat/test_context.R --- @@ -167,6 +167,18 @@ test_that("spark.lapply should perform simple transforms", { sparkR.session.stop() }) +test_that("add jar should work and allow usage of the jar on the driver node", { + sparkR.sparkContext() + + destDir <- paste0(tempdir(), "/", "testjar") + jarName <- callJStatic("org.apache.spark.TestUtils", "createDummyJar", + destDir, "sparkrTests", "DummyClassForAddJarTest") + + spark.addJar(jarName, addToCurrentClassLoader = TRUE) + testClass <- newJObject("sparkrTests.DummyClassForAddJarTest") --- End diff -- Hm.. it seems this line this time. ``` 1. Error: add jar should work and allow usage of the jar on the driver node (@test_context.R#178) 1: newJObject("sparkrTests.DummyClassForAddJarTest") at C:/projects/spark/R/lib/SparkR/tests/testthat/test_context.R:178 2: invokeJava(isStatic = TRUE, className, methodName = "", ...) 3: handleErrors(returnStatus, conn) 4: stop(readString(conn)) ```
[GitHub] spark pull request #15666: [SPARK-11421] [Core][Python][R] Added ability for...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/15666#discussion_r104431249 --- Diff: R/pkg/inst/tests/testthat/test_context.R --- @@ -167,6 +167,18 @@ test_that("spark.lapply should perform simple transforms", { sparkR.session.stop() }) +test_that("add jar should work and allow usage of the jar on the driver node", { + sparkR.sparkContext() + + destDir <- paste0(tempdir(), "/", "testjar") + jarName <- callJStatic("org.apache.spark.TestUtils", "createDummyJar", + destDir, "sparkrTests", "DummyClassForAddJarTest") + + spark.addJar(jarName, addToCurrentClassLoader = TRUE) + testClass <- newJObject("sparkrTests.DummyClassForAddJarTest") --- End diff -- Maybe, helpful log: ``` ERROR RBackendHandler: on sparkrTests.DummyClassForAddJarTest failed java.lang.ClassNotFoundException: sparkrTests.DummyClassForAddJarTest at java.net.URLClassLoader.findClass(URLClassLoader.java:381) at java.lang.ClassLoader.loadClass(ClassLoader.java:424) at java.lang.ClassLoader.loadClass(ClassLoader.java:357) at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:348) at org.apache.spark.util.Utils$.classForName(Utils.scala:230) at org.apache.spark.api.r.RBackendHandler.handleMethodCall(RBackendHandler.scala:143) at org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:108) at org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:40) at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336) at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:287) at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336) at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336) at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:293) at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:267) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336) at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1294) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343) at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:911) at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131) at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:643) at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:566) at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:480) 
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:442) at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131) at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144) at java.lang.Thread.run(Thread.java:745) ```
[GitHub] spark pull request #15666: [SPARK-11421] [Core][Python][R] Added ability for...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/15666#discussion_r104432962 --- Diff: core/src/main/scala/org/apache/spark/TestUtils.scala --- @@ -168,6 +168,27 @@ private[spark] object TestUtils { createCompiledClass(className, destDir, sourceFile, classpathUrls) } + /** Create a dummy compile jar for a given package, classname. Jar will be placed in destDir */ + def createDummyJar(destDir: String, packageName: String, className: String): String = { +val srcDir = new File(destDir, packageName) +srcDir.mkdirs() +val excSource = new JavaSourceFromString(new File(srcDir, className).toURI.getPath, + s"""package $packageName; + | + |public class $className implements java.io.Serializable { --- End diff -- (This indentation seems inconsistent BTW)
[GitHub] spark pull request #15666: [SPARK-11421] [Core][Python][R] Added ability for...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/15666#discussion_r104433213 --- Diff: core/src/test/scala/org/apache/spark/SparkContextSuite.scala --- @@ -308,6 +308,36 @@ class SparkContextSuite extends SparkFunSuite with LocalSparkContext with Eventu sc.listJars().head should include (tmpJar.getName) } + for ( +schedulingMode <- Seq("local_mode", "non_local_mode") + ) { --- End diff -- Just personal taste.. maybe ```scala Seq("local_mode", "non_local_mode").foreach { schedulingMode => ... } ``` ?
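To make the style suggestion above concrete, a minimal standalone sketch (plain Scala, no Spark; the test body is replaced with a string, and only `schedulingMode` is taken from the diff):

```scala
val modes = Seq("local_mode", "non_local_mode")

// For-comprehension style, as written in the patch:
val fromFor = for (schedulingMode <- modes) yield s"addJar test in $schedulingMode"

// Suggested foreach/map style with a block argument:
val fromMap = modes.map { schedulingMode => s"addJar test in $schedulingMode" }

// The two forms desugar to the same collection calls, so they are equivalent;
// the choice is purely a matter of taste, as the comment says.
assert(fromFor == fromMap)
fromFor.foreach(println)
```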
[GitHub] spark issue #15666: [SPARK-11421] [Core][Python][R] Added ability for addJar...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/15666 Feel free to push any commit. Let me trigger the build with my account like a bot.
[GitHub] spark pull request #15666: [SPARK-11421] [Core][Python][R] Added ability for...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/15666#discussion_r104432976 --- Diff: core/src/main/scala/org/apache/spark/TestUtils.scala --- @@ -168,6 +168,27 @@ private[spark] object TestUtils { createCompiledClass(className, destDir, sourceFile, classpathUrls) } + /** Create a dummy compile jar for a given package, classname. Jar will be placed in destDir */ + def createDummyJar(destDir: String, packageName: String, className: String): String = { +val srcDir = new File(destDir, packageName) +srcDir.mkdirs() +val excSource = new JavaSourceFromString(new File(srcDir, className).toURI.getPath, + s"""package $packageName; + | + |public class $className implements java.io.Serializable { + | public static String helloWorld(String arg) { return "Hello " + arg; } + | public static int addStuff(int arg1, int arg2) { return arg1 + arg2; } + |} +""". +stripMargin) --- End diff -- (We could make this inlined.)
[GitHub] spark pull request #15666: [SPARK-11421] [Core][Python][R] Added ability for...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/15666#discussion_r104433603 --- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala --- @@ -1842,6 +1864,14 @@ class SparkContext(config: SparkConf) extends Logging { logInfo(s"Added JAR $path at $key with timestamp $timestamp") postEnvironmentUpdate() } +if (addToCurrentClassLoader) { + val currentCL = Utils.getContextOrSparkClassLoader + currentCL match { +case cl: MutableURLClassLoader => + cl.addURL(uri.toURL) --- End diff -- (We could make this inlined too..)
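For context on the pattern being reviewed, a hedged sketch using only the JDK: the `MutableURLClassLoader` below is a stand-in that, like Spark's class of the same name, simply widens the visibility of `URLClassLoader.addURL`; `addJarToLoader` is a hypothetical helper, not Spark's API.

```scala
import java.net.{URL, URLClassLoader}

// Stand-in for Spark's MutableURLClassLoader: URLClassLoader.addURL is
// protected, so a subclass is needed to register jars after construction.
class MutableURLClassLoader(urls: Array[URL], parent: ClassLoader)
    extends URLClassLoader(urls, parent) {
  override def addURL(url: URL): Unit = super.addURL(url)
}

// Only loaders that expose addURL can pick up a newly added jar; any other
// loader (e.g. the system class loader) falls through, and a real
// implementation would log a warning there.
def addJarToLoader(loader: ClassLoader, jarUrl: URL): Boolean = loader match {
  case cl: MutableURLClassLoader => cl.addURL(jarUrl); true
  case _ => false
}
```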
[GitHub] spark issue #16854: [SPARK-15463][SQL] Add an API to load DataFrame from Dat...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/16854 @cloud-fan, I think this is ready for another look.
[GitHub] spark pull request #17177: [SPARK-19834][SQL] csv encoding/decoding using es...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/17177#discussion_r104436338 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala --- @@ -465,6 +465,47 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils { } } + test("save csv with quote escaping enabled, avoiding double backslash") { +withTempDir { dir => + val csvDir = new File(dir, "csv").getCanonicalPath + + val df1 = spark.sqlContext.createDataFrame(List( +(1, "aa\\\"bb,cc"), // aa\"bb,cc +(2, "aa\"bb,cc"), // aa\\\"bb,cc +(3, "aa\\\"\\\"bb,cc"), // aa\"\"bb,cc +(4, "aa\"\"bb,cc") // aa\\\"\\\"bb,cc + )) + + df1.coalesce(1).write + .format("csv") + .option("quote", "\"") + .option("escape", "\\") + .save(csvDir) + + val df2 = spark.read.csv(csvDir).orderBy($"_c0") + + val df1Str = df1.collect().map(_.getString(1)).mkString(" ") + + val df2Str = df2.select("_c1").collect().map(_.getString(0)).mkString(" ") + + val text = spark.read --- End diff -- We should use double-spaced line indentation here (please refer to https://github.com/databricks/scala-style-guide).
[GitHub] spark pull request #17177: [SPARK-19834][SQL] csv encoding/decoding using es...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/17177#discussion_r104401595 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala --- @@ -465,6 +465,47 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils { } } + test("save csv with quote escaping enabled, avoiding double backslash") { +withTempDir { dir => + val csvDir = new File(dir, "csv").getCanonicalPath + + val df1 = spark.sqlContext.createDataFrame(List( +(1, "aa\\\"bb,cc"), // aa\"bb,cc +(2, "aa\"bb,cc"), // aa\\\"bb,cc +(3, "aa\\\"\\\"bb,cc"), // aa\"\"bb,cc +(4, "aa\"\"bb,cc") // aa\\\"\\\"bb,cc + )) + + df1.coalesce(1).write + .format("csv") + .option("quote", "\"") + .option("escape", "\\") + .save(csvDir) + + val df2 = spark.read.csv(csvDir).orderBy($"_c0") + + val df1Str = df1.collect().map(_.getString(1)).mkString(" ") + + val df2Str = df2.select("_c1").collect().map(_.getString(0)).mkString(" ") + + val text = spark.read + .format("text") + .option("quote", "\"") + .option("escape", "\\") + .load(csvDir) + .collect() + .map(_.getString(0)) + .sortBy(_(0)).map(_.drop(3).init).mkString(" ") + + val textExpected = +"aa\"bb,cc aa\\\"bb,cc aa\"\"\"bb,cc aa\\\"\"\\\"\"bb,cc" --- End diff -- I think this would be more readable if it were wrapped in triple quotes (`"""..."""`) in this case.
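A quick standalone illustration of why triple quotes help here: escape sequences are not processed inside them, so a backslash-quote pair can be written exactly as it should appear (plain Scala, no Spark needed; the payload mirrors the first test value from the diff above):

```scala
// With ordinary quotes, a literal backslash followed by a quote needs
// three characters of escaping: \\ for the backslash, \" for the quote.
val escaped = "aa\\\"bb,cc"

// Inside triple quotes the same nine-character payload is written as typed.
// (Note a triple-quoted literal still cannot contain three consecutive
// quote characters, which limits how far this trick goes in the test above.)
val triple = """aa\"bb,cc"""

assert(escaped == triple) // both are the string: aa\"bb,cc
```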
[GitHub] spark pull request #17177: [SPARK-19834][SQL] csv encoding/decoding using es...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/17177#discussion_r104401044 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala --- @@ -465,6 +465,47 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils { } } + test("save csv with quote escaping enabled, avoiding double backslash") { --- End diff -- Could we maybe narrow down what it tests and reduce the test code? It seems a little bit confusing what it tests.
[GitHub] spark pull request #17177: [SPARK-19834][SQL] csv encoding/decoding using es...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/17177#discussion_r104456057 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala --- @@ -167,6 +168,7 @@ private[csv] class CSVOptions( format.setDelimiter(delimiter) format.setQuote(quote) format.setQuoteEscape(escape) +format.setCharToEscapeQuoteEscaping(quote) --- End diff -- It seems in https://github.com/uniVocity/univocity-parsers/blob/master/README.md#escaping-quote-escape-characters it sets the same `escape` for both `setQuoteEscape` and `setCharToEscapeQuoteEscaping`, and that sounds correct to me. However, it seems different here. Do you mind if I ask the reason?
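To make the question concrete, a toy quoting routine (this is an illustrative sketch, not univocity's implementation; `quoteField` and its parameters are hypothetical names) showing why the character chosen to escape the quote-escape matters when a field contains the escape character itself:

```scala
// Quote a CSV field: `escape` protects the quote character, and
// `escapeEscape` protects literal occurrences of the escape character,
// so a reader can tell a literal backslash apart from an escape.
def quoteField(s: String, quote: Char = '"', escape: Char = '\\',
               escapeEscape: Char = '\\'): String = {
  val body = s.flatMap {
    case c if c == quote  => s"$escape$quote"
    case c if c == escape => s"$escapeEscape$escape"
    case c                => c.toString
  }
  s"$quote$body$quote"
}

// aa"bb  ->  "aa\"bb"   (the quote is escaped)
// aa\bb  ->  "aa\\bb"   (the literal backslash is doubled)
```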
[GitHub] spark pull request #17178: [SPARK-19828][R] Support array type in from_json ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/17178#discussion_r104553042 --- Diff: R/pkg/inst/tests/testthat/test_sparkSQL.R --- @@ -1342,28 +1342,52 @@ test_that("column functions", { df <- read.json(mapTypeJsonPath) j <- collect(select(df, alias(to_json(df$info), "json"))) expect_equal(j[order(j$json), ][1], "{\"age\":16,\"height\":176.5}") - df <- as.DataFrame(j) - schema <- structType(structField("age", "integer"), - structField("height", "double")) - s <- collect(select(df, alias(from_json(df$json, schema), "structcol"))) - expect_equal(ncol(s), 1) - expect_equal(nrow(s), 3) - expect_is(s[[1]][[1]], "struct") - expect_true(any(apply(s, 1, function(x) { x[[1]]$age == 16 } ))) - - # passing option - df <- as.DataFrame(list(list("col" = "{\"date\":\"21/10/2014\"}"))) - schema2 <- structType(structField("date", "date")) - expect_error(tryCatch(collect(select(df, from_json(df$col, schema2))), -error = function(e) { stop(e) }), - paste0(".*(java.lang.NumberFormatException: For input string:).*")) - s <- collect(select(df, from_json(df$col, schema2, dateFormat = "dd/MM/"))) - expect_is(s[[1]][[1]]$date, "Date") - expect_equal(as.character(s[[1]][[1]]$date), "2014-10-21") - - # check for unparseable - df <- as.DataFrame(list(list("a" = ""))) - expect_equal(collect(select(df, from_json(df$a, schema)))[[1]][[1]], NA) + + schemas <- list(structType(structField("age", "integer"), structField("height", "double")), + "struct") + for (schema in schemas) { +df <- as.DataFrame(j) +s <- collect(select(df, alias(from_json(df$json, schema), "structcol"))) +expect_equal(ncol(s), 1) +expect_equal(nrow(s), 3) +expect_is(s[[1]][[1]], "struct") +expect_true(any(apply(s, 1, function(x) { x[[1]]$age == 16 } ))) + +# passing option +df <- as.DataFrame(list(list("col" = "{\"date\":\"21/10/2014\"}"))) +schema2 <- structType(structField("date", "date")) +expect_error(tryCatch(collect(select(df, from_json(df$col, schema2))), +error = 
function(e) { stop(e) }), paste0(".*(java.lang.NumberFormatException: For input string:).*")) s <- collect(select(df, from_json(df$col, schema2, dateFormat = "dd/MM/"))) expect_is(s[[1]][[1]]$date, "Date") expect_equal(as.character(s[[1]][[1]]$date), "2014-10-21") + +# check for unparseable +df <- as.DataFrame(list(list("a" = ""))) +expect_equal(collect(select(df, from_json(df$a, schema)))[[1]][[1]], NA) + } + + # check if array type in string is correctly supported. + jsonArr <- "[{\"name\":\"Bob\"}, {\"name\":\"Alice\"}]" + df <- as.DataFrame(list(list("people" = jsonArr))) + arr <- collect(select(df, alias(from_json(df$people, "array<struct<name:string>>"), "arrcol"))) --- End diff -- (Just in case, `from_json` takes both `DataType` and json string from `DataType.json`. In that sense, I thought it'd be nicer if it takes what `structField` takes in R)
[GitHub] spark pull request #17068: [SPARK-19709][SQL] Read empty file with CSV data ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/17068#discussion_r104581628 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala --- @@ -190,28 +194,28 @@ object WholeFileCSVDataSource extends CSVDataSource { sparkSession: SparkSession, inputPaths: Seq[FileStatus], parsedOptions: CSVOptions): Option[StructType] = { -val csv: RDD[PortableDataStream] = createBaseRdd(sparkSession, inputPaths, parsedOptions) -val maybeFirstRow: Option[Array[String]] = csv.flatMap { lines => +val csv = createBaseRdd(sparkSession, inputPaths, parsedOptions) +csv.flatMap { lines => UnivocityParser.tokenizeStream( CodecStreams.createInputStreamWithCloseResource(lines.getConfiguration, lines.getPath()), -false, +shouldDropHeader = false, new CsvParser(parsedOptions.asParserSettings)) -}.take(1).headOption - -if (maybeFirstRow.isDefined) { - val firstRow = maybeFirstRow.get - val caseSensitive = sparkSession.sessionState.conf.caseSensitiveAnalysis - val header = makeSafeHeader(firstRow, caseSensitive, parsedOptions) - val tokenRDD = csv.flatMap { lines => -UnivocityParser.tokenizeStream( - CodecStreams.createInputStreamWithCloseResource(lines.getConfiguration, lines.getPath()), - parsedOptions.headerFlag, - new CsvParser(parsedOptions.asParserSettings)) - } - Some(CSVInferSchema.infer(tokenRDD, header, parsedOptions)) -} else { - // If the first row could not be read, just return the empty schema. - Some(StructType(Nil)) +}.take(1).headOption match { --- End diff -- Thank you both for bearing with me.
[GitHub] spark pull request #17178: [SPARK-19828][R] Support array type in from_json ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/17178#discussion_r104604621 --- Diff: R/pkg/inst/tests/testthat/test_sparkSQL.R --- @@ -1342,28 +1342,52 @@ test_that("column functions", { df <- read.json(mapTypeJsonPath) j <- collect(select(df, alias(to_json(df$info), "json"))) expect_equal(j[order(j$json), ][1], "{\"age\":16,\"height\":176.5}") - df <- as.DataFrame(j) - schema <- structType(structField("age", "integer"), - structField("height", "double")) - s <- collect(select(df, alias(from_json(df$json, schema), "structcol"))) - expect_equal(ncol(s), 1) - expect_equal(nrow(s), 3) - expect_is(s[[1]][[1]], "struct") - expect_true(any(apply(s, 1, function(x) { x[[1]]$age == 16 } ))) - - # passing option - df <- as.DataFrame(list(list("col" = "{\"date\":\"21/10/2014\"}"))) - schema2 <- structType(structField("date", "date")) - expect_error(tryCatch(collect(select(df, from_json(df$col, schema2))), -error = function(e) { stop(e) }), - paste0(".*(java.lang.NumberFormatException: For input string:).*")) - s <- collect(select(df, from_json(df$col, schema2, dateFormat = "dd/MM/"))) - expect_is(s[[1]][[1]]$date, "Date") - expect_equal(as.character(s[[1]][[1]]$date), "2014-10-21") - - # check for unparseable - df <- as.DataFrame(list(list("a" = ""))) - expect_equal(collect(select(df, from_json(df$a, schema)))[[1]][[1]], NA) + + schemas <- list(structType(structField("age", "integer"), structField("height", "double")), + "struct<age:integer,height:double>") --- End diff -- Yes.. I persuaded myself that it is a valid string for `structField`. If you prefer the optional-parameter one, I could try. Otherwise, let me close this for now if you are not sure of both ways :).
[GitHub] spark issue #17096: [SPARK-15243][ML][SQL][PYTHON] Add missing support for u...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/17096 Thank you @viirya for your sign-off.
[GitHub] spark pull request #16854: [SPARK-15463][SQL] Add an API to load DataFrame f...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/16854#discussion_r104651799 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala --- @@ -134,23 +133,33 @@ object TextInputCSVDataSource extends CSVDataSource { inputPaths: Seq[FileStatus], parsedOptions: CSVOptions): Option[StructType] = { val csv = createBaseDataset(sparkSession, inputPaths, parsedOptions) -CSVUtils.filterCommentAndEmpty(csv, parsedOptions).take(1).headOption match { - case Some(firstLine) => -val firstRow = new CsvParser(parsedOptions.asParserSettings).parseLine(firstLine) -val caseSensitive = sparkSession.sessionState.conf.caseSensitiveAnalysis -val header = makeSafeHeader(firstRow, caseSensitive, parsedOptions) -val tokenRDD = csv.rdd.mapPartitions { iter => - val filteredLines = CSVUtils.filterCommentAndEmpty(iter, parsedOptions) - val linesWithoutHeader = -CSVUtils.filterHeaderLine(filteredLines, firstLine, parsedOptions) - val parser = new CsvParser(parsedOptions.asParserSettings) - linesWithoutHeader.map(parser.parseLine) -} -Some(CSVInferSchema.infer(tokenRDD, header, parsedOptions)) - case None => -// If the first line could not be read, just return the empty schema. -Some(StructType(Nil)) -} +val maybeFirstLine = CSVUtils.filterCommentAndEmpty(csv, parsedOptions).take(1).headOption +Some(inferFromDataset(sparkSession, csv, maybeFirstLine, parsedOptions)) + } + + /** + * Infers the schema from `Dataset` that stores CSV string records. + */ + def inferFromDataset( --- End diff -- There is almost no code modification here. Just moved from above.
[GitHub] spark pull request #17178: [SPARK-19828][R] Support array type in from_json ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/17178#discussion_r104658406 --- Diff: R/pkg/inst/tests/testthat/test_sparkSQL.R --- @@ -1342,28 +1342,52 @@ test_that("column functions", { df <- read.json(mapTypeJsonPath) j <- collect(select(df, alias(to_json(df$info), "json"))) expect_equal(j[order(j$json), ][1], "{\"age\":16,\"height\":176.5}") - df <- as.DataFrame(j) - schema <- structType(structField("age", "integer"), - structField("height", "double")) - s <- collect(select(df, alias(from_json(df$json, schema), "structcol"))) - expect_equal(ncol(s), 1) - expect_equal(nrow(s), 3) - expect_is(s[[1]][[1]], "struct") - expect_true(any(apply(s, 1, function(x) { x[[1]]$age == 16 } ))) - - # passing option - df <- as.DataFrame(list(list("col" = "{\"date\":\"21/10/2014\"}"))) - schema2 <- structType(structField("date", "date")) - expect_error(tryCatch(collect(select(df, from_json(df$col, schema2))), -error = function(e) { stop(e) }), - paste0(".*(java.lang.NumberFormatException: For input string:).*")) - s <- collect(select(df, from_json(df$col, schema2, dateFormat = "dd/MM/"))) - expect_is(s[[1]][[1]]$date, "Date") - expect_equal(as.character(s[[1]][[1]]$date), "2014-10-21") - - # check for unparseable - df <- as.DataFrame(list(list("a" = ""))) - expect_equal(collect(select(df, from_json(df$a, schema)))[[1]][[1]], NA) + + schemas <- list(structType(structField("age", "integer"), structField("height", "double")), + "struct<age:integer,height:double>") --- End diff -- Let me give it a shot and do my best.
[GitHub] spark pull request #17192: [SPARK-19849][SQL] Support ArrayType in to_json t...
GitHub user HyukjinKwon opened a pull request: https://github.com/apache/spark/pull/17192

[SPARK-19849][SQL] Support ArrayType in to_json to produce JSON array

## What changes were proposed in this pull request?

This PR proposes to support an array of struct type in `to_json` as below:

```scala
import org.apache.spark.sql.functions._

val df = Seq(Tuple1(Tuple1(1) :: Nil)).toDF("a")
df.select(to_json($"a").as("json")).show()
```

```
+----------+
|      json|
+----------+
|[{"_1":1}]|
+----------+
```

Currently, it throws an exception as below (a newline manually inserted for readability):

```
org.apache.spark.sql.AnalysisException: cannot resolve 'structtojson(`array`)' due to data type mismatch:
structtojson requires that the expression is a struct expression.;;
```

This allows the roundtrip with `from_json` as below:

```scala
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val schema = ArrayType(StructType(StructField("a", IntegerType) :: Nil))
val df = Seq("""[{"a":1}, {"a":2}]""").toDF("json").select(from_json($"json", schema).as("array"))
df.show()

// Read back.
df.select(to_json($"array").as("json")).show()
```

```
+----------+
|     array|
+----------+
|[[1], [2]]|
+----------+

+-----------------+
|             json|
+-----------------+
|[{"a":1},{"a":2}]|
+-----------------+
```

Also, this PR proposes to rename `StructToJson` to `StructOrArrayToJson` and `JsonToStruct` to `JsonToStructOrArray`.

## How was this patch tested?

Unit tests in `JsonFunctionsSuite` and `JsonExpressionsSuite` for Scala, a doctest for Python, and a test in `test_sparkSQL.R` for R.
You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/HyukjinKwon/spark SPARK-19849

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/17192.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #17192

commit 92922650ca6bf44ef4f4daf02653f66125e881d2
Author: hyukjinkwon
Date: 2017-03-07T14:43:37Z

    Support ArrayType in to_json to produce JSON array
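To make the struct-versus-array symmetry concrete outside Spark, the intended semantics can be sketched in plain Python, with dicts and lists of dicts standing in for struct rows (a hypothetical illustration only, not Spark's implementation):

```python
import json

def to_json_value(value):
    # A single struct (here a dict) serializes to a JSON object, while an
    # array of structs (a list of dicts) serializes to a JSON array of
    # objects -- mirroring what the PR proposes for to_json.
    if isinstance(value, dict):
        return json.dumps(value, separators=(",", ":"))
    if isinstance(value, list) and all(isinstance(v, dict) for v in value):
        return json.dumps(value, separators=(",", ":"))
    # Anything else is rejected, like the analysis-time type check in Spark.
    raise TypeError("expected a struct (dict) or an array of structs")

print(to_json_value({"a": 1}))              # {"a":1}
print(to_json_value([{"a": 1}, {"a": 2}]))  # [{"a":1},{"a":2}]
```

Parsing the emitted string back with `json.loads` gives the original dict or list, which is the roundtrip property the PR description demonstrates with `from_json`.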
[GitHub] spark issue #17192: [SPARK-19849][SQL] Support ArrayType in to_json to produ...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/17192 cc @brkyvz and @marmbrus, could you please take a look and see if it makes sense?
[GitHub] spark pull request #17096: [SPARK-15243][ML][SQL][PYTHON] Add missing suppor...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/17096#discussion_r104856634

--- Diff: python/pyspark/ml/tests.py ---
@@ -320,6 +320,13 @@ def test_hasparam(self):
     testParams = TestParams()
     self.assertTrue(all([testParams.hasParam(p.name) for p in testParams.params]))
     self.assertFalse(testParams.hasParam("notAParameter"))
+    self.assertTrue(testParams.hasParam(u"maxIter"))
+
+def test_resolveparam(self):
+    testParams = TestParams()
+    self.assertEqual(testParams._resolveParam(testParams.maxIter), testParams.maxIter)
+    self.assertEqual(testParams._resolveParam("maxIter"), testParams.maxIter)
+    self.assertEqual(testParams._resolveParam(u"maxIter"), testParams.maxIter)
--- End diff --

Yes, I think that makes sense, though it seems to require a closer look and more tests. I believe the current state resolves the specific JIRA. Could we merge this as is if you think either way is fine? I feel it has been dragged out by related changes and hope it can be merged if that is okay with you. If it does not sound good to you, let me try to take a look.
[GitHub] spark pull request #17178: [SPARK-19828][R] Support array type in from_json ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/17178#discussion_r104876261

--- Diff: R/pkg/inst/tests/testthat/test_sparkSQL.R ---
@@ -1342,28 +1342,52 @@ test_that("column functions", {
  df <- read.json(mapTypeJsonPath)
  j <- collect(select(df, alias(to_json(df$info), "json")))
  expect_equal(j[order(j$json), ][1], "{\"age\":16,\"height\":176.5}")
- df <- as.DataFrame(j)
- schema <- structType(structField("age", "integer"),
-                      structField("height", "double"))
- s <- collect(select(df, alias(from_json(df$json, schema), "structcol")))
- expect_equal(ncol(s), 1)
- expect_equal(nrow(s), 3)
- expect_is(s[[1]][[1]], "struct")
- expect_true(any(apply(s, 1, function(x) { x[[1]]$age == 16 } )))
-
- # passing option
- df <- as.DataFrame(list(list("col" = "{\"date\":\"21/10/2014\"}")))
- schema2 <- structType(structField("date", "date"))
- expect_error(tryCatch(collect(select(df, from_json(df$col, schema2))),
-              error = function(e) { stop(e) }),
-              paste0(".*(java.lang.NumberFormatException: For input string:).*"))
- s <- collect(select(df, from_json(df$col, schema2, dateFormat = "dd/MM/")))
- expect_is(s[[1]][[1]]$date, "Date")
- expect_equal(as.character(s[[1]][[1]]$date), "2014-10-21")
-
- # check for unparseable
- df <- as.DataFrame(list(list("a" = "")))
- expect_equal(collect(select(df, from_json(df$a, schema)))[[1]][[1]], NA)
+
+ schemas <- list(structType(structField("age", "integer"), structField("height", "double")),
+                 "struct")
--- End diff --

Hm, @felixcheung, I think this resembles a catalog string; maybe we could reuse `CatalystSqlParser.parseDataType` to make this more formal and to avoid duplicating the effort of defining a format and documentation. This is a big change, but if this is what we want in the future, I would argue that we should keep it this way. For a JSON string schema, there is an overloaded version of `from_json` that takes that schema string. If we are going to expose it, it can be easily done.
However, I think you meant it is a bigger change because we need to provide a way to produce this JSON string from types. To my knowledge, we can only manually specify the schema via this catalog string. Is this true? If so, I don't have a good idea for supporting this right now, and I would rather close this if you think so as well.
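As a rough illustration of what parsing such a catalog-style string involves, here is a toy parser for flat `struct<...>` strings (an assumption-laden sketch: the real `CatalystSqlParser.parseDataType` handles nesting, arrays, maps, and many more types, none of which this toy supports):

```python
import re

def parse_flat_struct(schema):
    # Accept only the flat form "struct<name:type,name:type,...>".
    # Nested types would break the naive comma split below.
    m = re.fullmatch(r"struct<(.*)>", schema.strip())
    if m is None:
        raise ValueError("expected a struct<...> schema string")
    fields = []
    for part in m.group(1).split(","):
        name, _, typ = part.partition(":")
        fields.append((name.strip(), typ.strip()))
    return fields

print(parse_flat_struct("struct<age:int,height:double>"))
# [('age', 'int'), ('height', 'double')]
```

The appeal of reusing the Catalyst parser, as suggested above, is precisely that nobody has to maintain a second, incomplete grammar like this one.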
[GitHub] spark pull request #16854: [SPARK-15463][SQL] Add an API to load DataFrame f...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/16854#discussion_r104879395

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala ---
@@ -1088,4 +1104,20 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils {
     checkAnswer(df, spark.emptyDataFrame)
   }
 }
+
+  test("Empty dataframe produces empty dataframe") {
+    // Empty dataframe with schema.
+    val emptyDF = spark.createDataFrame(
--- End diff --

Sure, let me fix it up.
[GitHub] spark pull request #17192: [SPARK-19849][SQL] Support ArrayType in to_json t...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/17192#discussion_r104886708

--- Diff: sql/core/src/test/resources/sql-tests/results/json-functions.sql.out ---
@@ -32,32 +34,40 @@ Usage: to_json(expr[, options]) - Returns a json string with a given struct valu
 -- !query 2
 select to_json(named_struct('a', 1, 'b', 2))
 -- !query 2 schema
-struct
+struct
 -- !query 2 output
 {"a":1,"b":2}

 -- !query 3
 select to_json(named_struct('time', to_timestamp('2015-08-26', '-MM-dd')), map('timestampFormat', 'dd/MM/'))
 -- !query 3 schema
-struct
+struct
 -- !query 3 output
 {"time":"26/08/2015"}

 -- !query 4
-select to_json(named_struct('a', 1, 'b', 2), named_struct('mode', 'PERMISSIVE'))
+select to_json(array(named_struct('a', 1, 'b', 2)))
 -- !query 4 schema
-struct<>
+struct
 -- !query 4 output
-org.apache.spark.sql.AnalysisException
-Must use a map() function for options;; line 1 pos 7
+[{"a":1,"b":2}]

 -- !query 5
-select to_json()
+select to_json(named_struct('a', 1, 'b', 2), named_struct('mode', 'PERMISSIVE'))
--- End diff --

This happened to look a bit weird, but I had to add this in the middle of the sql file - https://github.com/apache/spark/pull/17192/files#diff-3b8a538abd658a260aa32c4aa593bed7R6 - to show that this is not an error case.
[GitHub] spark issue #15751: [SPARK-18246][SQL] Throws an exception before execution ...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/15751 Do we want this change? If there is an approval, I will rebase; it seems to run into conflicts easily.
[GitHub] spark pull request #17178: [SPARK-19828][R] Support array type in from_json ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/17178#discussion_r105049248

--- Diff: R/pkg/inst/tests/testthat/test_sparkSQL.R ---
@@ -1342,28 +1342,52 @@ test_that("column functions", {
  df <- read.json(mapTypeJsonPath)
  j <- collect(select(df, alias(to_json(df$info), "json")))
  expect_equal(j[order(j$json), ][1], "{\"age\":16,\"height\":176.5}")
- df <- as.DataFrame(j)
- schema <- structType(structField("age", "integer"),
-                      structField("height", "double"))
- s <- collect(select(df, alias(from_json(df$json, schema), "structcol")))
- expect_equal(ncol(s), 1)
- expect_equal(nrow(s), 3)
- expect_is(s[[1]][[1]], "struct")
- expect_true(any(apply(s, 1, function(x) { x[[1]]$age == 16 } )))
-
- # passing option
- df <- as.DataFrame(list(list("col" = "{\"date\":\"21/10/2014\"}")))
- schema2 <- structType(structField("date", "date"))
- expect_error(tryCatch(collect(select(df, from_json(df$col, schema2))),
-              error = function(e) { stop(e) }),
-              paste0(".*(java.lang.NumberFormatException: For input string:).*"))
- s <- collect(select(df, from_json(df$col, schema2, dateFormat = "dd/MM/")))
- expect_is(s[[1]][[1]]$date, "Date")
- expect_equal(as.character(s[[1]][[1]]$date), "2014-10-21")
-
- # check for unparseable
- df <- as.DataFrame(list(list("a" = "")))
- expect_equal(collect(select(df, from_json(df$a, schema)))[[1]][[1]], NA)
+
+ schemas <- list(structType(structField("age", "integer"), structField("height", "double")),
+                 "struct")
--- End diff --

Doh, I meant there is a public `from_json` that takes a JSON string schema - https://github.com/apache/spark/blob/369a148e591bb16ec7da54867610b207602cd698/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L3061-L3062

Yup, let me give it a shot and provide an option for it first, to show whether it looks okay to you.
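For context on that overload, a JSON string schema is a structured document rather than a catalog string. The layout below is an assumption about the shape of Spark's `DataType.json` output (shown purely as an illustration built with the standard library, not generated by Spark):

```python
import json

# Assumed JSON-string form of struct<age:int,height:double>; the exact
# field layout is an assumption about Spark's DataType.json format.
schema = {
    "type": "struct",
    "fields": [
        {"name": "age", "type": "integer", "nullable": True, "metadata": {}},
        {"name": "height", "type": "double", "nullable": True, "metadata": {}},
    ],
}

# This is the kind of string the overloaded from_json(e, schema: String, ...)
# discussed above would receive.
schema_json = json.dumps(schema)

parsed = json.loads(schema_json)
print([f["name"] for f in parsed["fields"]])  # ['age', 'height']
```

The open question in the thread is exactly the one this sketch sidesteps: R users would need a supported way to produce such a string from a `structType`, not just to hand-write it.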
[GitHub] spark pull request #17192: [SPARK-19849][SQL] Support ArrayType in to_json t...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/17192#discussion_r105052696

--- Diff: R/pkg/inst/tests/testthat/test_sparkSQL.R ---
@@ -1339,6 +1339,11 @@ test_that("column functions", {
   expect_equal(collect(select(df, bround(df$x, 0)))[[1]][2], 4)

   # Test to_json(), from_json()
+  arr <- list(listToStruct(list("name" = "bob")))
+  df <- as.DataFrame(list(listToStruct(list("people" = arr
+  j <- collect(select(df, alias(to_json(df$people), "json")))
+  expect_equal(j[order(j$json), ][1], "[{\"name\":\"bob\"}]")
+
--- End diff --

@felixcheung, there is a little bit of R code here; could you check the test and documentation?
[GitHub] spark pull request #17192: [SPARK-19849][SQL] Support ArrayType in to_json t...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/17192#discussion_r105052901

--- Diff: sql/core/src/test/resources/sql-tests/results/json-functions.sql.out ---
@@ -32,32 +34,40 @@ Usage: to_json(expr[, options]) - Returns a json string with a given struct valu
 -- !query 2
 select to_json(named_struct('a', 1, 'b', 2))
 -- !query 2 schema
-struct
+struct
 -- !query 2 output
 {"a":1,"b":2}

 -- !query 3
 select to_json(named_struct('time', to_timestamp('2015-08-26', '-MM-dd')), map('timestampFormat', 'dd/MM/'))
 -- !query 3 schema
-struct
+struct
 -- !query 3 output
 {"time":"26/08/2015"}

 -- !query 4
-select to_json(named_struct('a', 1, 'b', 2), named_struct('mode', 'PERMISSIVE'))
+select to_json(array(named_struct('a', 1, 'b', 2)))
--- End diff --

cc @maropu, this adds a test to the file you lately wrote. Could you check whether this follows your intention?
[GitHub] spark pull request #17192: [SPARK-19849][SQL] Support ArrayType in to_json t...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/17192#discussion_r105054224

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/JsonFunctionsSuite.scala ---
@@ -220,4 +242,5 @@ class JsonFunctionsSuite extends QueryTest with SharedSQLContext {
     assert(errMsg2.getMessage.startsWith(
       "A type of keys and values in map() must be string, but got"))
   }
+
--- End diff --

I found that a newline at the end of tests prevents a conflict in some cases (_if I am not mistaken_). I am happy to revert this change if anyone is sure that it is useless or feels it is an unrelated change.
[GitHub] spark pull request #17192: [SPARK-19849][SQL] Support ArrayType in to_json t...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/17192#discussion_r105059270

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala ---
@@ -422,7 +422,7 @@ object FunctionRegistry {
     expression[BitwiseXor]("^"),

     // json
-    expression[StructToJson]("to_json"),
+    expression[StructOrArrayToJson]("to_json"),
--- End diff --

I think that one is not particularly better. `StructsToJson` then does not convey that it can be either an array or a single struct, if the reason is only ambiguity. Either way is fine to me. If any committer picks one, let me follow.