[GitHub] spark pull request #16929: [SPARK-19595][SQL] Support json array in from_jso...

2017-02-27 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/16929#discussion_r103334238
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala ---
@@ -480,36 +480,79 @@ case class JsonTuple(children: Seq[Expression])
 }
 
 /**
- * Converts an json input string to a [[StructType]] with the specified schema.
+ * Converts an json input string to a [[StructType]] or [[ArrayType]] with the specified schema.
  */
 case class JsonToStruct(
-schema: StructType,
+schema: DataType,
 options: Map[String, String],
 child: Expression,
 timeZoneId: Option[String] = None)
   extends UnaryExpression with TimeZoneAwareExpression with CodegenFallback with ExpectsInputTypes {
   override def nullable: Boolean = true
 
-  def this(schema: StructType, options: Map[String, String], child: Expression) =
+  def this(schema: DataType, options: Map[String, String], child: Expression) =
 this(schema, options, child, None)
 
+  override def checkInputDataTypes(): TypeCheckResult = schema match {
--- End diff --

Uh.. I thought `schema` is not a child but just a parameter. Let me check!





[GitHub] spark pull request #16929: [SPARK-19595][SQL] Support json array in from_jso...

2017-02-27 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/16929#discussion_r103337914
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala ---
@@ -480,36 +480,79 @@ case class JsonTuple(children: Seq[Expression])
 }
 
 /**
- * Converts an json input string to a [[StructType]] with the specified schema.
+ * Converts an json input string to a [[StructType]] or [[ArrayType]] with the specified schema.
  */
 case class JsonToStruct(
-schema: StructType,
+schema: DataType,
 options: Map[String, String],
 child: Expression,
 timeZoneId: Option[String] = None)
   extends UnaryExpression with TimeZoneAwareExpression with CodegenFallback with ExpectsInputTypes {
   override def nullable: Boolean = true
 
-  def this(schema: StructType, options: Map[String, String], child: Expression) =
+  def this(schema: DataType, options: Map[String, String], child: Expression) =
 this(schema, options, child, None)
 
+  override def checkInputDataTypes(): TypeCheckResult = schema match {
+case _: StructType | ArrayType(_: StructType, _) =>
+  super.checkInputDataTypes()
+case _ => TypeCheckResult.TypeCheckFailure(
+  s"Input schema ${schema.simpleString} must be a struct or an array of structs.")
+  }
+
+  @transient
+  lazy val rowSchema = schema match {
+case st: StructType => st
+case ArrayType(st: StructType, _) => st
+  }
+
+  // This converts parsed rows to the desired output by the given schema.
+  @transient
+  lazy val converter = schema match {
+case _: StructType =>
+  // These are always produced from json objects by `objectSupport` in `JacksonParser`.
+  (rows: Seq[InternalRow]) => rows.head
+
+case ArrayType(_: StructType, _) =>
+  // These are always produced from json arrays by `arraySupport` in `JacksonParser`.
+  (rows: Seq[InternalRow]) => new GenericArrayData(rows)
+  }
+
   @transient
   lazy val parser =
 new JacksonParser(
-  schema,
-  new JSONOptions(options + ("mode" -> ParseModes.FAIL_FAST_MODE), timeZoneId.get))
+  rowSchema,
+  new JSONOptions(options + ("mode" -> ParseModes.FAIL_FAST_MODE), timeZoneId.get),
+  objectSupport = schema.isInstanceOf[StructType],
--- End diff --

> What does the input look like, and what are they specifying?

`JacksonParser.parse` produces a `Seq[InternalRow]` and takes a `StructType`. What I meant by `objectSupport` and `arraySupport` is support for a JSON object and a JSON array as the root value: we support both as the root, and the intention is to produce `null` when the disabled kind appears at the root.







[GitHub] spark pull request #16929: [SPARK-19595][SQL] Support json array in from_jso...

2017-02-27 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/16929#discussion_r103338371
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala ---
@@ -480,36 +480,79 @@ case class JsonTuple(children: Seq[Expression])
 }
 
 /**
- * Converts an json input string to a [[StructType]] with the specified schema.
+ * Converts an json input string to a [[StructType]] or [[ArrayType]] with the specified schema.
  */
 case class JsonToStruct(
-schema: StructType,
+schema: DataType,
 options: Map[String, String],
 child: Expression,
 timeZoneId: Option[String] = None)
   extends UnaryExpression with TimeZoneAwareExpression with CodegenFallback with ExpectsInputTypes {
   override def nullable: Boolean = true
 
-  def this(schema: StructType, options: Map[String, String], child: Expression) =
+  def this(schema: DataType, options: Map[String, String], child: Expression) =
 this(schema, options, child, None)
 
+  override def checkInputDataTypes(): TypeCheckResult = schema match {
--- End diff --

Uh.. I thought `schema` is not the child of the expression. Let me check 
again!





[GitHub] spark issue #16929: [SPARK-19595][SQL] Support json array in from_json

2017-02-27 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/16929
  
Thanks for your detailed look. Let me check again and address the comments!





[GitHub] spark issue #16929: [SPARK-19595][SQL] Support json array in from_json

2017-02-27 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/16929
  
Then, let me fix this as below:

- `from_json` with `StructType`
  - JSON array -> `null` 
  - JSON object -> `Row(...)` 

- `from_json` with `ArrayType`
  - JSON array -> `Array(Row(...), ...)`
  - JSON object -> `Array(Row(...), ...)`

- exposed API 
  - `from_json(..., schema: StructType)`
  - `from_json(..., schema: DataType)` 
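
For reference, a rough sketch of how this would look from the user side, assuming the proposed `DataType` overload above lands as described (illustrative only, not the final API):

```scala
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types._
import spark.implicits._  // assumes a running SparkSession named `spark`

val schema = new StructType().add("a", IntegerType)
val df = Seq("""{"a": 1}""", """[{"a": 1}, {"a": 2}]""").toDF("json")

// StructType schema: the JSON object row parses into a struct, the JSON array row becomes null.
df.select(from_json($"json", schema, Map.empty[String, String]))

// ArrayType schema (the proposed DataType overload): both rows become an array of structs.
df.select(from_json($"json", ArrayType(schema), Map.empty[String, String]))
```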






[GitHub] spark pull request #16929: [SPARK-19595][SQL] Support json array in from_jso...

2017-02-27 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/16929#discussion_r103386481
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala ---
@@ -480,36 +480,79 @@ case class JsonTuple(children: Seq[Expression])
 }
 
 /**
- * Converts an json input string to a [[StructType]] with the specified schema.
+ * Converts an json input string to a [[StructType]] or [[ArrayType]] with the specified schema.
  */
 case class JsonToStruct(
-schema: StructType,
+schema: DataType,
 options: Map[String, String],
 child: Expression,
 timeZoneId: Option[String] = None)
   extends UnaryExpression with TimeZoneAwareExpression with CodegenFallback with ExpectsInputTypes {
   override def nullable: Boolean = true
 
-  def this(schema: StructType, options: Map[String, String], child: Expression) =
+  def this(schema: DataType, options: Map[String, String], child: Expression) =
 this(schema, options, child, None)
 
+  override def checkInputDataTypes(): TypeCheckResult = schema match {
--- End diff --

I tried several combinations with `TypeCollection` but it does not seem to work.





[GitHub] spark issue #16929: [SPARK-19595][SQL] Support json array in from_json

2017-02-27 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/16929
  
@brkyvz, @marmbrus - I think it is ready for another look. Could you see if 
I understood your comments correctly?





[GitHub] spark issue #13036: [SPARK-15243][ML][SQL][PYSPARK] Param methods should use...

2017-02-27 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/13036
  
I am happy to do so. It seems almost done already, except for https://github.com/apache/spark/pull/13036#discussion_r84476560?





[GitHub] spark pull request #17096: [SPARK-15243][ML][SQL][PYSPARK] Add missing suppo...

2017-02-28 Thread HyukjinKwon
GitHub user HyukjinKwon opened a pull request:

https://github.com/apache/spark/pull/17096

[SPARK-15243][ML][SQL][PYSPARK] Add missing support for unicode in Param 
methods/functions in dataframe/types

## What changes were proposed in this pull request?

This PR proposes to support unicode in the Param methods in ML, in other functions that missed it in DataFrame, and in another spot that missed it where a column name is given in types.

For example, this causes a `TypeError` in Python 2.x when the param name is a unicode string:

```python
>>> from pyspark.ml.classification import LogisticRegression
>>> lr = LogisticRegression()
>>> lr.hasParam("threshold")
True
>>> lr.hasParam(u"threshold")
Traceback (most recent call last):
 ...
raise TypeError("hasParam(): paramName must be a string")
TypeError: hasParam(): paramName must be a string
```

This PR is based on https://github.com/apache/spark/pull/13036

## How was this patch tested?

Unit tests in `python/pyspark/ml/tests.py` and 
`python/pyspark/sql/tests.py`.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/HyukjinKwon/spark SPARK-15243

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17096.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17096


commit 762997d266c78c16e2e6caaee8daa557ec07820b
Author: sethah 
Date:   2016-05-10T22:44:10Z

check for basestring in param methods

commit d421a82e1a713b9c7351dfc6a2a6a5fb13eca1e2
Author: hyukjinkwon 
Date:   2017-02-28T08:46:58Z

Fix updated approxQuantile to take basestrings and fix the tests







[GitHub] spark issue #17096: [SPARK-15243][ML][SQL][PYSPARK] Add missing support for ...

2017-02-28 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/17096
  
cc @sethah, @holdenk, @viirya and @k-yokoshi





[GitHub] spark issue #16929: [SPARK-19595][SQL] Support json array in from_json

2017-02-28 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/16929
  
retest this please





[GitHub] spark issue #17115: [Doc][Minor] Update R doc

2017-03-01 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/17115
  
( @actuaryzhang, maybe we could make the PR title more meaningful, to summarise the change, just for a better-shaped PR )





[GitHub] spark issue #17096: [SPARK-15243][ML][SQL][PYTHON] Add missing support for u...

2017-03-01 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/17096
  
Thank you @viirya. I think it is ready now.





[GitHub] spark pull request #16854: [WIP][SPARK-15463][SQL] Add an API to load DataFr...

2017-03-02 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/16854#discussion_r103894945
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala
 ---
@@ -344,4 +346,36 @@ private[csv] object UnivocityParser {
   CSVUtils.filterCommentAndEmpty(linesWithoutHeader, options)
 filteredLines.flatMap(line => parser.parse(line))
   }
+
+  /**
+   * Parses a `Dataset` that contains CSV strings and turns it into an 
`RDD` of rows.
+   */
+  def tokenizeDataset(
+  csvDataset: Dataset[String],
+  maybeFirstLine: Option[String],
+  options: CSVOptions): RDD[Array[String]] = {
+val filtered = CSVUtils.filterCommentAndEmpty(csvDataset, options)
+val linesWithoutHeader = maybeFirstLine.map { firstLine =>
+  filtered.rdd.mapPartitions(CSVUtils.filterHeaderLine(_, firstLine, 
options))
+}.getOrElse(filtered.rdd)
+
+linesWithoutHeader.mapPartitions { iter =>
+  val parser = new CsvParser(options.asParserSettings)
+  iter.map(line => parser.parseLine(line))
+}
+  }
+
+  /**
+   * Parses a `Dataset` that contains CSV strings and turns it into an 
`RDD` of rows.
+   */
+  def parseDataset(
+  csvDataset: Dataset[String],
+  schema: StructType,
+  maybeFirstLine: Option[String],
+  options: CSVOptions): RDD[InternalRow] = {
+tokenizeDataset(csvDataset, maybeFirstLine, options).mapPartitions { 
iter =>
+  val parser = new UnivocityParser(schema, options)
+  iter.flatMap(line => parser.convert(line))
+}
+  }
--- End diff --

cc @cloud-fan, this is still a WIP, but I am trying to put the different CSV parsing execution paths in here.

For example,

- `spark.read.csv(file)`
  - data: `parseIterator` (note that this one reads from a partitioned file)
  - schema: `tokenizeDataset`

- `spark.read.csv(file)` with `wholeFile`
  - data: `parseStream`
  - schema: `tokenizeStream`

- `spark.read.csv(dataset)`
  - data: `parseDataset`
  - schema: `tokenizeDataset`

However, it seems to end up with slightly odd arguments here.. do you think it is okay?
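
For context, a rough usage-level sketch of how these paths are reached (assuming `spark` is a `SparkSession` and `csvDataset` is a `Dataset[String]`; the `csv(Dataset[String])` variant is the one proposed in this PR):

```scala
// File-based path: parseIterator for data, tokenizeDataset for schema inference.
val fromFile = spark.read.csv("path/to/file.csv")

// wholeFile path: parseStream for data, tokenizeStream for schema inference.
val fromWholeFile = spark.read.option("wholeFile", true).csv("path/to/file.csv")

// Dataset[String]-based path (this PR): parseDataset for data, tokenizeDataset for schema inference.
val fromDataset = spark.read.csv(csvDataset)
```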





[GitHub] spark pull request #16854: [WIP][SPARK-15463][SQL] Add an API to load DataFr...

2017-03-02 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/16854#discussion_r103895193
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala
 ---
@@ -344,4 +346,36 @@ private[csv] object UnivocityParser {
   CSVUtils.filterCommentAndEmpty(linesWithoutHeader, options)
 filteredLines.flatMap(line => parser.parse(line))
   }
+
+  /**
+   * Parses a `Dataset` that contains CSV strings and turns it into an 
`RDD` of rows.
+   */
+  def tokenizeDataset(
+  csvDataset: Dataset[String],
+  maybeFirstLine: Option[String],
+  options: CSVOptions): RDD[Array[String]] = {
+val filtered = CSVUtils.filterCommentAndEmpty(csvDataset, options)
+val linesWithoutHeader = maybeFirstLine.map { firstLine =>
+  filtered.rdd.mapPartitions(CSVUtils.filterHeaderLine(_, firstLine, 
options))
+}.getOrElse(filtered.rdd)
+
+linesWithoutHeader.mapPartitions { iter =>
+  val parser = new CsvParser(options.asParserSettings)
+  iter.map(line => parser.parseLine(line))
+}
+  }
+
+  /**
+   * Parses a `Dataset` that contains CSV strings and turns it into an 
`RDD` of rows.
+   */
+  def parseDataset(
+  csvDataset: Dataset[String],
+  schema: StructType,
+  maybeFirstLine: Option[String],
+  options: CSVOptions): RDD[InternalRow] = {
+tokenizeDataset(csvDataset, maybeFirstLine, options).mapPartitions { 
iter =>
+  val parser = new UnivocityParser(schema, options)
+  iter.flatMap(line => parser.convert(line))
+}
+  }
--- End diff --

If you are not sure, or it looks painful to review, let me take out all these changes and put them into `DataFrameReader.csv` for now.





[GitHub] spark issue #17136: [SPARK-19783][SQL] Treat shorter/longer lengths of token...

2017-03-02 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/17136
  
Oh, @maropu, I have been looking into R's `read.csv` before adding some 
comments in the JIRAs.

with the data below:

```
a,b,c
a,b,c,d,e,d,d
```

```
> read.csv("test.csv")
Error in read.table(file = file, header = header, sep = sep, quote = quote, 
 :
```

with the data below:

```csv
a,b,c,d,e,d,d
a,b,c
```

```r
> read.csv("test.csv")
  a b c  d  e d.1 d.2
1 a b c NA NA  NA  NA
```

So, IMHO, we might be better off following R's `read.csv` for now, but of course I guess we should take a look at other libraries as well.

I am actually a bit worried about a behaviour change, because `PERMISSIVE` has been the default mode ever since this was a third-party library (Spark 1.3+).

Another concern is that it seems we should produce `columnNameOfCorruptRecord`, as we do in the JSON datasource, if we are going to treat such rows as malformed in `PERMISSIVE` mode.
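
For comparison, this is roughly how the JSON datasource handles it today; a sketch only, assuming a `SparkSession` named `spark`:

```scala
import org.apache.spark.sql.types._

// In PERMISSIVE mode, malformed records are kept and their raw text is put into
// the column named by `columnNameOfCorruptRecord`.
val schema = new StructType()
  .add("a", IntegerType)
  .add("_corrupt_record", StringType)

val df = spark.read
  .schema(schema)
  .option("mode", "PERMISSIVE")
  .option("columnNameOfCorruptRecord", "_corrupt_record")
  .json("path/to/file.json")
```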





[GitHub] spark pull request #14788: [SPARK-17174][SQL] Add the support for TimestampT...

2017-03-02 Thread HyukjinKwon
Github user HyukjinKwon closed the pull request at:

https://github.com/apache/spark/pull/14788





[GitHub] spark pull request #17142: [SPARK-18699][SQL][FOLLOWUP] Add explanation in C...

2017-03-02 Thread HyukjinKwon
GitHub user HyukjinKwon opened a pull request:

https://github.com/apache/spark/pull/17142

[SPARK-18699][SQL][FOLLOWUP] Add explanation in CSV parser and minor cleanup

## What changes were proposed in this pull request?

This PR suggests adding some comments to the `UnivocityParser` logic to explain what happens. It also proposes, IMHO, slightly cleaner code (at least easier for me to explain).

## How was this patch tested?

Unit tests in `CSVSuite`.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/HyukjinKwon/spark SPARK-18699

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17142.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17142









[GitHub] spark pull request #17142: [SPARK-18699][SQL][FOLLOWUP] Add explanation in C...

2017-03-02 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17142#discussion_r103986635
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala
 ---
@@ -54,39 +54,77 @@ private[csv] class UnivocityParser(
 
   private val dataSchema = StructType(schema.filter(_.name != 
options.columnNameOfCorruptRecord))
 
-  private val valueConverters =
-dataSchema.map(f => makeConverter(f.name, f.dataType, f.nullable, 
options)).toArray
-
   private val tokenizer = new CsvParser(options.asParserSettings)
 
   private var numMalformedRecords = 0
 
   private val row = new GenericInternalRow(requiredSchema.length)
 
-  // This gets the raw input that is parsed lately.
+  // In `PERMISSIVE` parse mode, we should be able to put the raw 
malformed row into the field
+  // specified in `columnNameOfCorruptRecord`. The raw input is retrieved 
by this method.
   private def getCurrentInput(): String = 
tokenizer.getContext.currentParsedContent().stripLineEnd
 
-  // This parser loads an `indexArr._1`-th position value in input tokens,
-  // then put the value in `row(indexArr._2)`.
-  private val indexArr: Array[(Int, Int)] = {
-val fields = if (options.dropMalformed) {
-  // If `dropMalformed` is enabled, then it needs to parse all the 
values
-  // so that we can decide which row is malformed.
-  requiredSchema ++ schema.filterNot(requiredSchema.contains(_))
-} else {
-  requiredSchema
-}
-// TODO: Revisit this; we need to clean up code here for readability.
-// See an URL below for related discussions:
-// https://github.com/apache/spark/pull/16928#discussion_r102636720
-val fieldsWithIndexes = fields.zipWithIndex
-corruptFieldIndex.map { case corrFieldIndex =>
-  fieldsWithIndexes.filter { case (_, i) => i != corrFieldIndex }
-}.getOrElse {
-  fieldsWithIndexes
-}.map { case (f, i) =>
-  (dataSchema.indexOf(f), i)
-}.toArray
--- End diff --

cc @maropu and @cloud-fan, could you check whether this comment looks nicer to you?





[GitHub] spark pull request #17142: [SPARK-18699][SQL][FOLLOWUP] Add explanation in C...

2017-03-02 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17142#discussion_r104063545
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala
 ---
@@ -54,39 +54,77 @@ private[csv] class UnivocityParser(
 
   private val dataSchema = StructType(schema.filter(_.name != 
options.columnNameOfCorruptRecord))
 
-  private val valueConverters =
-dataSchema.map(f => makeConverter(f.name, f.dataType, f.nullable, 
options)).toArray
-
   private val tokenizer = new CsvParser(options.asParserSettings)
 
   private var numMalformedRecords = 0
 
   private val row = new GenericInternalRow(requiredSchema.length)
 
-  // This gets the raw input that is parsed lately.
+  // In `PERMISSIVE` parse mode, we should be able to put the raw 
malformed row into the field
+  // specified in `columnNameOfCorruptRecord`. The raw input is retrieved 
by this method.
   private def getCurrentInput(): String = 
tokenizer.getContext.currentParsedContent().stripLineEnd
 
-  // This parser loads an `indexArr._1`-th position value in input tokens,
-  // then put the value in `row(indexArr._2)`.
-  private val indexArr: Array[(Int, Int)] = {
-val fields = if (options.dropMalformed) {
-  // If `dropMalformed` is enabled, then it needs to parse all the 
values
-  // so that we can decide which row is malformed.
-  requiredSchema ++ schema.filterNot(requiredSchema.contains(_))
-} else {
-  requiredSchema
-}
-// TODO: Revisit this; we need to clean up code here for readability.
-// See an URL below for related discussions:
-// https://github.com/apache/spark/pull/16928#discussion_r102636720
-val fieldsWithIndexes = fields.zipWithIndex
-corruptFieldIndex.map { case corrFieldIndex =>
-  fieldsWithIndexes.filter { case (_, i) => i != corrFieldIndex }
-}.getOrElse {
-  fieldsWithIndexes
-}.map { case (f, i) =>
-  (dataSchema.indexOf(f), i)
-}.toArray
--- End diff --

It is just a small change code-wise, actually :). I hope the comment makes reading this code easier and does not look too verbose.





[GitHub] spark issue #17096: [SPARK-15243][ML][SQL][PYTHON] Add missing support for u...

2017-03-02 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/17096
  
@viirya, thank you so much for taking a look and your time. 

So, basically, in the second case it compares `str` to `unicode` as below:

```python
>>> u"測試" == u"測試".encode("utf-8")
False
```

Apparently, it seems we could pass unicode as-is? Let me raise another issue for this after testing and looking into it. Actually, the support in `StructType.add` does not seem to be the problem specified in the JIRA.






[GitHub] spark issue #17096: [SPARK-15243][ML][SQL][PYTHON] Add missing support for u...

2017-03-02 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/17096
  
Let me check each of them to be sure.





[GitHub] spark issue #17096: [SPARK-15243][ML][SQL][PYTHON] Add missing support for u...

2017-03-02 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/17096
  
@holdenk and @viirya, I got rid of the changes in `types.py` and only left the ones I am pretty sure about.

There are two kinds of changes here, and both look like they are only used in a local scope.

One is used for `getattr`, which I guess is fine as below:

```python
>>> getattr("a", u"__str__")

>>> getattr("a", "__str__")

```

and the other one is used for setting a parameter in the JVM, which is a pattern already used much more widely in the code base.






[GitHub] spark issue #17142: [SPARK-18699][SQL][FOLLOWUP] Add explanation in CSV pars...

2017-03-03 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/17142
  
I definitely will. Thank you so much @cloud-fan and @maropu.





[GitHub] spark issue #16929: [SPARK-19595][SQL] Support json array in from_json

2017-03-03 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/16929
  
@marmbrus and @brkyvz, would there be other things I should double check?





[GitHub] spark pull request #17160: [SPARK-19701][SQL][PYTHON] Throws a correct excep...

2017-03-03 Thread HyukjinKwon
GitHub user HyukjinKwon opened a pull request:

https://github.com/apache/spark/pull/17160

[SPARK-19701][SQL][PYTHON] Throws a correct exception for 'in' operator 
against column

## What changes were proposed in this pull request?

This PR proposes to remove an incorrect implementation of the `in` operator that has never been executed (at least since Spark 1.5.2) and to throw a correct exception rather than one complaining about converting a column into a bool. I tested this in 1.5.2, 1.6.3, 2.1.0 and the master branch as below:

**1.5.2**

```python
>>> df = sqlContext.createDataFrame([[1]])
>>> 1 in df._1
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../spark-1.5.2-bin-hadoop2.6/python/pyspark/sql/column.py", line 418, in __nonzero__
    raise ValueError("Cannot convert column into bool: please use '&' for 'and', '|' for 'or', "
ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
```

**1.6.3**

```python
>>> 1 in sqlContext.range(1).id
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../spark-1.6.3-bin-hadoop2.6/python/pyspark/sql/column.py", line 447, in __nonzero__
    raise ValueError("Cannot convert column into bool: please use '&' for 'and', '|' for 'or', "
ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
```

**2.1.0**

```python
>>> 1 in spark.range(1).id
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../spark-2.1.0-bin-hadoop2.7/python/pyspark/sql/column.py", line 426, in __nonzero__
    raise ValueError("Cannot convert column into bool: please use '&' for 'and', '|' for 'or', "
ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
```

**Current Master**

```python
>>> 1 in spark.range(1).id
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../spark/python/pyspark/sql/column.py", line 452, in __nonzero__
    raise ValueError("Cannot convert column into bool: please use '&' for 'and', '|' for 'or', "
ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
```

**After**

```python
>>> 1 in spark.range(1).id
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../spark/python/pyspark/sql/column.py", line 184, in __contains__
    raise ValueError("Cannot apply 'in' operator against a column: please use 'contains' "
ValueError: Cannot apply 'in' operator against a column: please use 'contains' in a string column or 'array_contains' function for an array column.
```

In more detail,

It seems the implementation intended to support this
```python
1 in df.column
```

However, currently, it throws an exception as below:

```python
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../spark/python/pyspark/sql/column.py", line 426, in __nonzero__
    raise ValueError("Cannot convert column into bool: please use '&' for 'and', '|' for 'or', "
ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
```

What happens here is as below:

```python
class Column(object):
def __contains__(self, item):
print "I am contains"
return Column()
def __nonzero__(self):
raise Exception("I am nonzero.")

>>> 1 in Column()
I am contains
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 6, in __nonzero__
Exception: I am nonzero.
```

It seems it calls `__contains__` first and then `__nonzero__` (or `__bool__`) on the returned value to coerce it into a bool, which is why the misleading error above is raised.

[GitHub] spark issue #17160: [SPARK-19701][SQL][PYTHON] Throws a correct exception fo...

2017-03-03 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/17160
  
cc @cloud-fan, @davies and @holdenk.





[GitHub] spark pull request #17160: [SPARK-19701][SQL][PYTHON] Throws a correct excep...

2017-03-03 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17160#discussion_r104278452
  
--- Diff: python/pyspark/sql/column.py ---
@@ -180,7 +180,9 @@ def __init__(self, jc):
 __ror__ = _bin_op("or")
 
 # container operators
-__contains__ = _bin_op("contains")
+def __contains__(self, item):
+raise ValueError("Cannot apply 'in' operator against a column: please use 'contains' "
+ "in a string column or 'array_contains' function for an array column.")
--- End diff --

What I meant here is to use

```python
>>> df = spark.range(1)
>>> df.select(df.id.contains(0)).show()
```
```
+---------------+
|contains(id, 0)|
+---------------+
|           true|
+---------------+
```

or

```python
>>> from pyspark.sql.functions import array_contains
>>> df = spark.createDataFrame([[[0]]], ["id"])
>>> df.select(array_contains(df.id, 0)).show()
```
```
+---------------------+
|array_contains(id, 0)|
+---------------------+
|                 true|
+---------------------+
```





[GitHub] spark issue #16929: [SPARK-19595][SQL] Support json array in from_json

2017-03-03 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/16929
  
Thank you so much. Let me clean up.





[GitHub] spark issue #16854: [WIP][SPARK-15463][SQL] Add an API to load DataFrame fro...

2017-03-03 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/16854
  
Oh, no. It does not need to. 

I just meant to de-duplicate some logic via https://github.com/apache/spark/pull/16854#discussion_r103894945. Let me just remove that part and leave only the code changes dedicated to this JIRA. It seems to be confusing reviewers. Let me clean up soon.





[GitHub] spark pull request #17161: [SPARK-19819][SparkR] Use concrete data in SparkR...

2017-03-04 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17161#discussion_r104283930
  
--- Diff: R/pkg/R/DataFrame.R ---
@@ -92,8 +92,7 @@ dataFrame <- function(sdf, isCached = FALSE) {
 #' @examples
 #'\dontrun{
 #' sparkR.session()
-#' path <- "path/to/file.json"
-#' df <- read.json(path)
+#' df <- createDataFrame(mtcars)
--- End diff --

Should we define `mtcars` to run this example?





[GitHub] spark issue #17161: [SPARK-19819][SparkR] Use concrete data in SparkR DataFr...

2017-03-04 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/17161
  
This rings a bell. I opened a PR, https://github.com/apache/spark/pull/16824, to make the examples in the Python API doc self-contained, but closed it as it did not seem to draw much interest from other committers. I would like to see whether the R committers support this idea, and reopen it if they do.





[GitHub] spark issue #17160: [SPARK-19701][SQL][PYTHON] Throws a correct exception fo...

2017-03-04 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/17160
  
**without `__contains__`, `__nonzero__` and `__bool__`**

```python
class Column(object): pass

>>> 1 in Column()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: argument of type 'Column' is not iterable
>>> bool(Column())
True
```

**without `__nonzero__` and `__bool__`**

```python
class Column(object):
def __contains__(self, item):
print "I am contains"
return Column()

>>> 1 in Column()
I am contains
True
>>> bool(Column())
True
```


**without `__contains__`**

```python
class Column(object):
def __nonzero__(self):
print "I am nonzero"
return Column()

>>> 1 in Column()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: argument of type 'Column' is not iterable
>>> bool(Column())
I am nonzero
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: __nonzero__ should return bool or int, returned Column
```





[GitHub] spark issue #17160: [SPARK-19701][SQL][PYTHON] Throws a correct exception fo...

2017-03-04 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/17160
  
So.. it seems 

```python
TypeError: argument of type 'int' is not iterable
```

vs

```python
ValueError: Cannot apply 'in' operator against a column: please use 
'contains' in a string column or 'array_contains' function for an array column.
```





[GitHub] spark pull request #17068: [SPARK-19709][SQL] Read empty file with CSV data ...

2017-03-04 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17068#discussion_r104288502
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala ---
@@ -133,8 +133,19 @@ object TextInputCSVDataSource extends CSVDataSource {
   sparkSession: SparkSession,
   inputPaths: Seq[FileStatus],
   parsedOptions: CSVOptions): Option[StructType] = {
-val csv: Dataset[String] = createBaseDataset(sparkSession, inputPaths, parsedOptions)
-val firstLine: String = CSVUtils.filterCommentAndEmpty(csv, parsedOptions).first()
+val csv = createBaseDataset(sparkSession, inputPaths, parsedOptions)
+CSVUtils.filterCommentAndEmpty(csv, parsedOptions)
+  .take(1)
+  .headOption
+  .map(firstLine => infer(sparkSession, parsedOptions, csv, firstLine))
+  .orElse(Some(StructType(Seq(
--- End diff --

Could we maybe just match it to 
[CSVDataSource.scala#L204-L224](https://github.com/wojtek-szymanski/spark/blob/bdf189087c52e8934cee4ee5563dffea9dde6a99/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala#L204-L224)
 just for consistency for now? 

Personally, I think such chaining sometimes makes the code hard to read. Maybe we could consider de-duplicating this code in another PR. It would be easier if they looked similar, at least.
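
A rough sketch of the non-chained shape being suggested, using the names from the diff above (the empty schema is written as `StructType(Nil)` here only for illustration):

```scala
// Same logic as the chained version above, written as an explicit match so it
// reads like the other code paths in CSVDataSource.
val maybeFirstLine = CSVUtils.filterCommentAndEmpty(csv, parsedOptions).take(1).headOption
maybeFirstLine match {
  case Some(firstLine) =>
    Some(infer(sparkSession, parsedOptions, csv, firstLine))
  case None =>
    // The input is effectively empty, so infer an empty schema.
    Some(StructType(Nil))
}
```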





[GitHub] spark pull request #17068: [SPARK-19709][SQL] Read empty file with CSV data ...

2017-03-04 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17068#discussion_r104288543
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala ---
@@ -305,13 +305,21 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils {
   test("test with empty file and known schema") {
 val result = spark.read
   .format("csv")
-  .schema(StructType(List(StructField("column", StringType, false))))
+  .schema(StructType(List(StructField("column", StringType, nullable = false))))
   .load(testFile(emptyFile))
 
-assert(result.collect.size === 0)
+assert(result.collect().isEmpty)
 assert(result.schema.fieldNames.size === 1)
   }
 
+  test("test with empty file without schema") {
--- End diff --

Let's re-use the test in 
[CSVSuite.scala#L1083](https://github.com/wojtek-szymanski/spark/blob/bdf189087c52e8934cee4ee5563dffea9dde6a99/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala#L1083).

We could..

```scala
  test("Empty file produces empty dataframe with empty schema") {
    Seq(false, true).foreach { wholeFile =>
      val df = spark.read.format("csv")
        .option("header", true)
        .option("wholeFile", wholeFile)  // exercise both the default and the wholeFile path
        .load(testFile(emptyFile))

      assert(df.schema === spark.emptyDataFrame.schema)
      checkAnswer(df, spark.emptyDataFrame)
    }
  }
```





[GitHub] spark issue #17068: [SPARK-19709][SQL] Read empty file with CSV data source

2017-03-04 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/17068
  
Hi @wojtek-szymanski, these are all my comments for now. Let me cc @cloud-fan as my related PRs were reviewed by him, and I guess I can't trigger the tests myself.





[GitHub] spark pull request #16854: [WIP][SPARK-15463][SQL] Add an API to load DataFr...

2017-03-04 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/16854#discussion_r104289988
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala ---
@@ -399,6 +395,52 @@ class DataFrameReader private[sql](sparkSession: 
SparkSession) extends Logging {
   }
 
   /**
+   * Loads an `Dataset[String]` storing CSV rows and returns the result as 
a `DataFrame`.
+   *
+   * Unless the schema is specified using `schema` function, this function 
goes through the
+   * input once to determine the input schema.
+   *
+   * @param csvDataset input Dataset with one CSV row per record
+   * @since 2.2.0
+   */
+  def csv(csvDataset: Dataset[String]): DataFrame = {
+val parsedOptions: CSVOptions = new CSVOptions(
+  extraOptions.toMap,
+  sparkSession.sessionState.conf.sessionLocalTimeZone)
+val filteredLines = CSVUtils.filterCommentAndEmpty(csvDataset, 
parsedOptions)
+val maybeFirstLine = filteredLines.take(1).headOption
+if (maybeFirstLine.isEmpty) {
+  return sparkSession.emptyDataFrame
--- End diff --

The reason why this exists, unlike JSON, is that CSV always needs to read the first line first (even if it does not infer the schema, it needs at least the number of values). In this case, we can return an empty DataFrame fast.





[GitHub] spark pull request #16854: [WIP][SPARK-15463][SQL] Add an API to load DataFr...

2017-03-04 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/16854#discussion_r104290005
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala ---
@@ -399,6 +395,52 @@ class DataFrameReader private[sql](sparkSession: 
SparkSession) extends Logging {
   }
 
   /**
+   * Loads an `Dataset[String]` storing CSV rows and returns the result as 
a `DataFrame`.
+   *
+   * Unless the schema is specified using `schema` function, this function 
goes through the
+   * input once to determine the input schema.
+   *
+   * @param csvDataset input Dataset with one CSV row per record
+   * @since 2.2.0
+   */
+  def csv(csvDataset: Dataset[String]): DataFrame = {
+val parsedOptions: CSVOptions = new CSVOptions(
+  extraOptions.toMap,
+  sparkSession.sessionState.conf.sessionLocalTimeZone)
+val filteredLines = CSVUtils.filterCommentAndEmpty(csvDataset, 
parsedOptions)
+val maybeFirstLine = filteredLines.take(1).headOption
+if (maybeFirstLine.isEmpty) {
+  return sparkSession.emptyDataFrame
+}
+
+val firstLine = maybeFirstLine.get
+val linesWithoutHeader: RDD[String] = filteredLines.rdd.mapPartitions(
+  CSVUtils.filterHeaderLine(_, firstLine, parsedOptions))
+
+val schema = userSpecifiedSchema.getOrElse {
--- End diff --

There is a similar code path in 
https://github.com/apache/spark/blob/7e5359be5ca038fdb579712b18e7f226d705c276/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala#L132-L150





[GitHub] spark pull request #16854: [WIP][SPARK-15463][SQL] Add an API to load DataFr...

2017-03-04 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/16854#discussion_r104289951
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala ---
@@ -604,6 +646,22 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging {
 }
   }
 
+  /**
+   * A convenient function for schema validation that takes `columnNameOfCorruptRecord`
+   * as an option.
+   */
+  private def verifyColumnNameOfCorruptRecord(
--- End diff --

Maybe, this is too much. I am willing to revert this back.





[GitHub] spark issue #16854: [WIP][SPARK-15463][SQL] Add an API to load DataFrame fro...

2017-03-04 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/16854
  
Let me double check before getting rid of `[WIP]` tomorrow.


---



[GitHub] spark pull request #16854: [WIP][SPARK-15463][SQL] Add an API to load DataFr...

2017-03-04 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/16854#discussion_r104290082
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala ---
@@ -604,6 +646,22 @@ class DataFrameReader private[sql](sparkSession: 
SparkSession) extends Logging {
 }
   }
 
+  /**
+   * A convenient function for schema validation in datasources supporting
+   * `columnNameOfCorruptRecord` as an option.
+   */
+  private def verifyColumnNameOfCorruptRecord(
--- End diff --

Maybe this is too much. I am willing to revert it.


---






[GitHub] spark pull request #16854: [WIP][SPARK-15463][SQL] Add an API to load DataFr...

2017-03-05 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/16854#discussion_r104309644
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala
 ---
@@ -59,11 +58,21 @@ abstract class CSVDataSource extends Serializable {
   sparkSession: SparkSession,
   inputPaths: Seq[FileStatus],
   parsedOptions: CSVOptions): Option[StructType]
+}
+
+object CSVDataSource {
+  def apply(options: CSVOptions): CSVDataSource = {
+if (options.wholeFile) {
+  WholeFileCSVDataSource
+} else {
+  TextInputCSVDataSource
+}
+  }
 
   /**
* Generates a header from the given row which is null-safe and 
duplicate-safe.
*/
-  protected def makeSafeHeader(
+  def makeSafeHeader(
--- End diff --

`makeSafeHeader` was moved from the `CSVDataSource` class to the `CSVDataSource` companion object so that it can be accessed from `DataFrameReader`.
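
(A tiny, self-contained illustration of the idea is below; the names are made up. The point is only that a companion-object method can be called without an instance, which is what lets a `DataFrameReader`-style call site reach it.)

```scala
// Made-up names; only the class-vs-companion-object move matters here.
class CsvSource
object CsvSource {
  // Previously an instance method; as a companion-object member it can be
  // called statically from any other class.
  def makeSafeHeader(row: Array[String]): Array[String] =
    row.zipWithIndex.map { case (value, i) => if (value == null || value.isEmpty) s"_c$i" else value }
}

val header = CsvSource.makeSafeHeader(Array("year", "", "model"))  // Array(year, _c1, model)
```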


---



[GitHub] spark pull request #17136: [SPARK-19783][SQL] Treat shorter/longer lengths o...

2017-03-05 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17136#discussion_r104309730
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
 ---
@@ -1085,4 +1062,23 @@ class CSVSuite extends QueryTest with 
SharedSQLContext with SQLTestUtils {
   checkAnswer(df, spark.emptyDataFrame)
 }
   }
+
+  test("SPARK-X regard as malformed records if the length is not equal 
to expected one") {
--- End diff --

Maybe `SPARK-19783` :).


---



[GitHub] spark pull request #17136: [SPARK-19783][SQL] Treat shorter/longer lengths o...

2017-03-05 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17136#discussion_r104309710
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala
 ---
@@ -232,43 +243,27 @@ private[csv] class UnivocityParser(
   throw new RuntimeException(s"Malformed line in FAILFAST mode: " +
 s"${tokens.mkString(options.delimiter.toString)}")
 } else {
--- End diff --

Could we make this 


```
else {
  if {
...
  } else {
...
  }
}
```

to

```
else if {
  ...
} else {
  ...
}
```
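
A generic before/after sketch of that flattening (the condition and branches are placeholders, not the real parser code):

```scala
// Before: a nested if inside the else branch.
def statusNested(tokens: Seq[String], expected: Int): String = {
  if (tokens.isEmpty) {
    "empty"
  } else {
    if (tokens.length != expected) {
      "malformed"
    } else {
      "ok"
    }
  }
}

// After: the same logic with `else if`, one level of nesting less.
def statusFlat(tokens: Seq[String], expected: Int): String = {
  if (tokens.isEmpty) {
    "empty"
  } else if (tokens.length != expected) {
    "malformed"
  } else {
    "ok"
  }
}
```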



---



[GitHub] spark pull request #17136: [SPARK-19783][SQL] Treat shorter/longer lengths o...

2017-03-05 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17136#discussion_r104310045
  
--- Diff: R/pkg/inst/tests/testthat/test_sparkSQL.R ---
@@ -246,8 +246,8 @@ test_that("read/write csv as DataFrame", {
   mockLinesCsv <- c("year,make,model,comment,blank",
"\"2012\",\"Tesla\",\"S\",\"No comment\",",
"1997,Ford,E350,\"Go get one now they are going 
fast\",",
-   "2015,Chevy,Volt",
-   "NA,Dummy,Placeholder")
+   "2015,Chevy,Volt,,",
--- End diff --

We might need a way to handle this once we clean up and define the behaviour of the parse modes.


---



[GitHub] spark pull request #17136: [SPARK-19783][SQL] Treat shorter/longer lengths o...

2017-03-05 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17136#discussion_r104310061
  
--- Diff: R/pkg/inst/tests/testthat/test_sparkSQL.R ---
@@ -246,8 +246,8 @@ test_that("read/write csv as DataFrame", {
   mockLinesCsv <- c("year,make,model,comment,blank",
"\"2012\",\"Tesla\",\"S\",\"No comment\",",
"1997,Ford,E350,\"Go get one now they are going 
fast\",",
-   "2015,Chevy,Volt",
-   "NA,Dummy,Placeholder")
+   "2015,Chevy,Volt,,",
--- End diff --

We might need a way to handle this after we clean up and define the behaviour of the parse modes.



---



[GitHub] spark issue #16929: [SPARK-19595][SQL] Support json array in from_json

2017-03-05 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/16929
  
I just updated the PR description to prevent confusion.


---



[GitHub] spark issue #16929: [SPARK-19595][SQL] Support json array in from_json

2017-03-05 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/16929
  
Cc @brkyvz and @marmbrus, could this be merged by any chance?


---



[GitHub] spark issue #16929: [SPARK-19595][SQL] Support json array in from_json

2017-03-05 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/16929
  
Thank you @brkyvz.


---



[GitHub] spark pull request #16854: [SPARK-15463][SQL] Add an API to load DataFrame f...

2017-03-05 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/16854#discussion_r104330267
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala ---
@@ -399,6 +395,52 @@ class DataFrameReader private[sql](sparkSession: 
SparkSession) extends Logging {
   }
 
   /**
+   * Loads an `Dataset[String]` storing CSV rows and returns the result as 
a `DataFrame`.
+   *
+   * Unless the schema is specified using `schema` function, this function 
goes through the
+   * input once to determine the input schema.
+   *
+   * @param csvDataset input Dataset with one CSV row per record
+   * @since 2.2.0
+   */
+  def csv(csvDataset: Dataset[String]): DataFrame = {
+val parsedOptions: CSVOptions = new CSVOptions(
+  extraOptions.toMap,
+  sparkSession.sessionState.conf.sessionLocalTimeZone)
+val filteredLines = CSVUtils.filterCommentAndEmpty(csvDataset, 
parsedOptions)
+val maybeFirstLine = filteredLines.take(1).headOption
+if (maybeFirstLine.isEmpty) {
+  return sparkSession.emptyDataFrame
+}
+
+val firstLine = maybeFirstLine.get
+val linesWithoutHeader: RDD[String] = filteredLines.rdd.mapPartitions(
+  CSVUtils.filterHeaderLine(_, firstLine, parsedOptions))
+
+val schema = userSpecifiedSchema.getOrElse {
--- End diff --

Yes, I think we could, with `wholeFile` explicitly set to false. Let me give it a shot and show you if it looks okay.


---



[GitHub] spark pull request #16854: [SPARK-15463][SQL] Add an API to load DataFrame f...

2017-03-05 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/16854#discussion_r104330396
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala ---
@@ -399,6 +395,52 @@ class DataFrameReader private[sql](sparkSession: 
SparkSession) extends Logging {
   }
 
   /**
+   * Loads an `Dataset[String]` storing CSV rows and returns the result as 
a `DataFrame`.
+   *
+   * Unless the schema is specified using `schema` function, this function 
goes through the
+   * input once to determine the input schema.
+   *
+   * @param csvDataset input Dataset with one CSV row per record
+   * @since 2.2.0
+   */
+  def csv(csvDataset: Dataset[String]): DataFrame = {
+val parsedOptions: CSVOptions = new CSVOptions(
+  extraOptions.toMap,
+  sparkSession.sessionState.conf.sessionLocalTimeZone)
+val filteredLines = CSVUtils.filterCommentAndEmpty(csvDataset, 
parsedOptions)
+val maybeFirstLine = filteredLines.take(1).headOption
+if (maybeFirstLine.isEmpty) {
+  return sparkSession.emptyDataFrame
+}
+
+val firstLine = maybeFirstLine.get
+val linesWithoutHeader: RDD[String] = filteredLines.rdd.mapPartitions(
+  CSVUtils.filterHeaderLine(_, firstLine, parsedOptions))
+
+val schema = userSpecifiedSchema.getOrElse {
--- End diff --

Oh, sorry, I overlooked that. It seems `TextInputCSVDataSource.infer` takes input paths, whereas we want a `Dataset` here. Let me take a look and see if we can reuse it.


---



[GitHub] spark issue #17160: [SPARK-19701][SQL][PYTHON] Throws a correct exception fo...

2017-03-05 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/17160
  
It is not needed for `__contains__`, but it is needed for `bool`, because I guess we would not want to return a bool while the other operators return a `Column`.

For example,

```python
>>> not spark.range(1).id
Traceback (most recent call last):
  File "", line 1, in 
  File ".../spark/python/pyspark/sql/column.py", line 452, in __nonzero__
raise ValueError("Cannot convert column into bool: please use '&' for 
'and', '|' for 'or', "
ValueError: Cannot convert column into bool: please use '&' for 'and', '|' 
for 'or', '~' for 'not' when building DataFrame boolean expressions.
```

If we remove this, then

```python
>>> not spark.range(1).id
False
```

I think we want `Column` as below:

```python
>>> 1 < spark.range(1).id
Column<(id > 1)>
```

but for `bool`, it seems not.


---



[GitHub] spark issue #17160: [SPARK-19701][SQL][PYTHON] Throws a correct exception fo...

2017-03-05 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/17160
  
@cloud-fan Thank you so much for looking into this deeply and asking about the details.


---



[GitHub] spark issue #17161: [SPARK-19819][SparkR] Use concrete data in SparkR DataFr...

2017-03-05 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/17161
  
@felixcheung, please let me leave my humble opinion.

In general, I like this change. Maybe the examples do not all need to be runnable (at least for now), but I believe making them runnable is still an improvement; for example, it allows users to copy and paste an example directly and test it easily.
To me, this is similar to how we (or at least I) prefer a JIRA to have a self-contained reproducer. Personally, when I first learn and try new APIs as a user, I just copy and paste the example and check the input/output.

> Firstly, I see this as slightly different from Python, in that in R it is 
common to have built-in datasets and possibly users are used to having them and 
having examples using them.

In the case of Python examples, to my knowledge, it is also common to document self-runnable examples (e.g., 
[pandas](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.pivot.html#pandas.DataFrame.pivot)
 and [numpy](https://docs.scipy.org/doc/numpy/reference/arrays.ndarray.html)). Moreover, they are tested in doctests, so there is no overhead to check whether they are runnable every time. Maybe let's talk about this in the PR or JIRA (I am sorry for dragging the Python one into this).

> I have done a pass on the changes in this PR and I'm happy with changing 
from non-existing json file to mtcars. I'm slightly concerned with the few 
cases of artificial 3 rows data (like here) - more on that below on small 
dataset.

Maybe we could take out the examples that you are concerned about for now.


For the four concerns you described, I agree in general. If the downsides outweighed the improvements I would disagree, but to me the improvement seems greater than the downsides of the change itself.

>how much work and how much change is it to change all examples (this is 
only 1 .R out of 20-something files we have, in a total of 300+ methods which 
is on the high side for R packages)

I guess the first one is about the reviewing cost and the backporting/conflict concerns. Even if the change is big and the examples are many, I am willing to help verify them by running them manually.

> how much churn will it be to keep them up-to-date when we are having 
changes to API (eg. sparkR.session()); especially since in order to have 
examples self-contained we tend to add additional calls to manipulate data and 
thereby increasing the number of references of API calls; which could get stale 
easily like the example with insertInto

For the second concern, if it made maintenance harder that would be an important point, but I think we are being conservative about API changes from Spark 2.x, so I guess we would be less likely to need to sweep these. I think it is okay even if some of this changes in the near future, as long as the changes are not too big.

> perhaps more importantly, how practical or useful it would be to use 
built-in datasets or native R data.frame (mtcars, cars, Titanic, iris, or make 
up some; that are super small) on a scalable data platform like Spark? perhaps 
it is better to demonstrate, in examples, how to work with external data 
sources, multiple file formats etc.?

> and lastly, we still have about a dozen methods that are without example 
that are being flagged by CRAN checks (but not enough to fail it yet)

For the third and fourth concerns, I think we could improve this by adding more examples and by explaining and describing the details. For example, we could leave some comments such as "this is an example, but in practice ...". Maybe we could do this in another PR. Otherwise, we could leave the examples as they are, with some proper comments.

As another thought, I think it would be great if they were at least run in the tests.


---



[GitHub] spark pull request #17068: [SPARK-19709][SQL] Read empty file with CSV data ...

2017-03-05 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17068#discussion_r104350586
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala
 ---
@@ -190,28 +194,28 @@ object WholeFileCSVDataSource extends CSVDataSource {
   sparkSession: SparkSession,
   inputPaths: Seq[FileStatus],
   parsedOptions: CSVOptions): Option[StructType] = {
-val csv: RDD[PortableDataStream] = createBaseRdd(sparkSession, 
inputPaths, parsedOptions)
-val maybeFirstRow: Option[Array[String]] = csv.flatMap { lines =>
+val csv = createBaseRdd(sparkSession, inputPaths, parsedOptions)
+csv.flatMap { lines =>
   UnivocityParser.tokenizeStream(
 
CodecStreams.createInputStreamWithCloseResource(lines.getConfiguration, 
lines.getPath()),
-false,
+shouldDropHeader = false,
 new CsvParser(parsedOptions.asParserSettings))
-}.take(1).headOption
-
-if (maybeFirstRow.isDefined) {
-  val firstRow = maybeFirstRow.get
-  val caseSensitive = 
sparkSession.sessionState.conf.caseSensitiveAnalysis
-  val header = makeSafeHeader(firstRow, caseSensitive, parsedOptions)
-  val tokenRDD = csv.flatMap { lines =>
-UnivocityParser.tokenizeStream(
-  
CodecStreams.createInputStreamWithCloseResource(lines.getConfiguration, 
lines.getPath()),
-  parsedOptions.headerFlag,
-  new CsvParser(parsedOptions.asParserSettings))
-  }
-  Some(CSVInferSchema.infer(tokenRDD, header, parsedOptions))
-} else {
-  // If the first row could not be read, just return the empty schema.
-  Some(StructType(Nil))
+}.take(1).headOption match {
--- End diff --

IMHO, `Option.isDefined` with `Option.get`, `Option.map` with `Option.getOrElse`, and `Option` with `match ... case Some ... case None` might all be fine. But how about minimising the change by matching the one above to the `Option.isDefined` with `Option.get` style? Then it would not require the changes here.
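
For reference, the three styles on a toy `Option` (placeholder values, just to show they are interchangeable):

```scala
val maybeFirstRow: Option[Array[String]] = Some(Array("a", "b", "c"))

// 1. isDefined / get
val n1 = if (maybeFirstRow.isDefined) maybeFirstRow.get.length else 0

// 2. map / getOrElse
val n2 = maybeFirstRow.map(_.length).getOrElse(0)

// 3. pattern matching
val n3 = maybeFirstRow match {
  case Some(row) => row.length
  case None => 0
}
```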


---



[GitHub] spark pull request #17068: [SPARK-19709][SQL] Read empty file with CSV data ...

2017-03-05 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17068#discussion_r104356921
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala
 ---
@@ -190,28 +194,28 @@ object WholeFileCSVDataSource extends CSVDataSource {
   sparkSession: SparkSession,
   inputPaths: Seq[FileStatus],
   parsedOptions: CSVOptions): Option[StructType] = {
-val csv: RDD[PortableDataStream] = createBaseRdd(sparkSession, 
inputPaths, parsedOptions)
-val maybeFirstRow: Option[Array[String]] = csv.flatMap { lines =>
+val csv = createBaseRdd(sparkSession, inputPaths, parsedOptions)
+csv.flatMap { lines =>
   UnivocityParser.tokenizeStream(
 
CodecStreams.createInputStreamWithCloseResource(lines.getConfiguration, 
lines.getPath()),
-false,
+shouldDropHeader = false,
 new CsvParser(parsedOptions.asParserSettings))
-}.take(1).headOption
-
-if (maybeFirstRow.isDefined) {
-  val firstRow = maybeFirstRow.get
-  val caseSensitive = 
sparkSession.sessionState.conf.caseSensitiveAnalysis
-  val header = makeSafeHeader(firstRow, caseSensitive, parsedOptions)
-  val tokenRDD = csv.flatMap { lines =>
-UnivocityParser.tokenizeStream(
-  
CodecStreams.createInputStreamWithCloseResource(lines.getConfiguration, 
lines.getPath()),
-  parsedOptions.headerFlag,
-  new CsvParser(parsedOptions.asParserSettings))
-  }
-  Some(CSVInferSchema.infer(tokenRDD, header, parsedOptions))
-} else {
-  // If the first row could not be read, just return the empty schema.
-  Some(StructType(Nil))
+}.take(1).headOption match {
--- End diff --

All three patterns I mentioned are used across the code base. There is no style guide for this in either https://github.com/databricks/scala-style-guide or http://spark.apache.org/contributing.html.

In this case, matching the new code to the other, similar code is the better choice because it reduces the number of changed lines, rather than doing the opposite. Personal taste is secondary.


---



[GitHub] spark pull request #17068: [SPARK-19709][SQL] Read empty file with CSV data ...

2017-03-05 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17068#discussion_r104357269
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala
 ---
@@ -190,28 +194,28 @@ object WholeFileCSVDataSource extends CSVDataSource {
   sparkSession: SparkSession,
   inputPaths: Seq[FileStatus],
   parsedOptions: CSVOptions): Option[StructType] = {
-val csv: RDD[PortableDataStream] = createBaseRdd(sparkSession, 
inputPaths, parsedOptions)
-val maybeFirstRow: Option[Array[String]] = csv.flatMap { lines =>
+val csv = createBaseRdd(sparkSession, inputPaths, parsedOptions)
+csv.flatMap { lines =>
   UnivocityParser.tokenizeStream(
 
CodecStreams.createInputStreamWithCloseResource(lines.getConfiguration, 
lines.getPath()),
-false,
+shouldDropHeader = false,
 new CsvParser(parsedOptions.asParserSettings))
-}.take(1).headOption
-
-if (maybeFirstRow.isDefined) {
-  val firstRow = maybeFirstRow.get
-  val caseSensitive = 
sparkSession.sessionState.conf.caseSensitiveAnalysis
-  val header = makeSafeHeader(firstRow, caseSensitive, parsedOptions)
-  val tokenRDD = csv.flatMap { lines =>
-UnivocityParser.tokenizeStream(
-  
CodecStreams.createInputStreamWithCloseResource(lines.getConfiguration, 
lines.getPath()),
-  parsedOptions.headerFlag,
-  new CsvParser(parsedOptions.asParserSettings))
-  }
-  Some(CSVInferSchema.infer(tokenRDD, header, parsedOptions))
-} else {
-  // If the first row could not be read, just return the empty schema.
-  Some(StructType(Nil))
+}.take(1).headOption match {
--- End diff --

https://github.com/apache/spark/pull/17068#discussion_r104356866 did not show up while I was writing my comment. I am fine with it as is; I am not the one to decide this.


---



[GitHub] spark pull request #15666: [SPARK-11421] [Core][Python][R] Added ability for...

2017-03-05 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/15666#discussion_r104358056
  
--- Diff: core/src/main/scala/org/apache/spark/TestUtils.scala ---
@@ -168,6 +168,27 @@ private[spark] object TestUtils {
 createCompiledClass(className, destDir, sourceFile, classpathUrls)
   }
 
+  /** Create a dummy compile jar for a given package, classname.  Jar will 
be placed in destDir */
+  def createDummyJar(destDir: String, packageName: String, className: 
String): String = {
+val srcDir = new File(destDir, packageName)
+srcDir.mkdirs()
+val excSource = new JavaSourceFromString(new File(srcDir, 
className).getAbsolutePath,
+  s"""package $packageName;
+ |
+  |public class $className implements java.io.Serializable {
+ |  public static String helloWorld(String arg) { return "Hello " 
+ arg; }
+ |  public static int addStuff(int arg1, int arg2) { return arg1 + 
arg2; }
+ |}
+""".
+stripMargin)
+val excFile = createCompiledClass(className, srcDir, excSource, 
Seq.empty)
--- End diff --

Can we use the URI form here? It seems this one is problematic.


---



[GitHub] spark pull request #15666: [SPARK-11421] [Core][Python][R] Added ability for...

2017-03-05 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/15666#discussion_r104357841
  
--- Diff: R/pkg/R/context.R ---
@@ -319,6 +319,34 @@ spark.addFile <- function(path, recursive = FALSE) {
   invisible(callJMethod(sc, "addFile", 
suppressWarnings(normalizePath(path)), recursive))
 }
 
+
+#' Adds a JAR dependency for all tasks to be executed on this SparkContext 
in the future.
+#'
+#' The \code{path} passed can be either a local file, a file in HDFS (or 
other Hadoop-supported
+#' filesystems), an HTTP, HTTPS or FTP URI, or local:/path for a file on 
every worker node.
+#' If \code{addToCurrentClassLoader} is true, add the jar to the current 
threads' classloader. In
+#' general adding to the current threads' class loader will impact all 
other application threads
+#' unless they have explicitly changed their class loader.
+#'
+#' @rdname spark.addJar
+#' @param path The path of the jar to be added
+#' @param addToCurrentClassLoader Whether to add the jar to the current 
driver classloader.
+#' Default is FALSE.
+#' @export
+#' @examples
+#'\dontrun{
+#' spark.addJar("/path/to/something.jar", TRUE)
+#'}
+#' @note spark.addJar since 2.2.0
+spark.addJar <- function(path, addToCurrentClassLoader = FALSE) {
+  sc <- getSparkContext()
+  normalizedPath <- suppressWarnings(normalizePath(path))
+  scala_sc <- callJMethod(sc, "sc")
+  invisible(callJMethod(scala_sc, "addJar", normalizedPath, 
addToCurrentClassLoader))
--- End diff --

This seems to have failed as below:

```
1. Error: add jar should work and allow usage of the jar on the driver node 
(@test_context.R#174) 
java.lang.IllegalArgumentException: Illegal character in path at index 12: 
string:///C:\Users\appveyor\AppData\Local\Temp\1\RtmpCqEUJL\testjar\sparkrTests\DummyClassForAddJarTest.java
at java.net.URI.create(URI.java:852)
at 
org.apache.spark.TestUtils$.org$apache$spark$TestUtils$$createURI(TestUtils.scala:119)
at 
org.apache.spark.TestUtils$JavaSourceFromString.<init>(TestUtils.scala:123)
```
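
A small standalone illustration of why the URI form helps here. The Windows path is taken from the failure above; `File.toURI` escapes it properly, whereas gluing the raw path into a URI string does not:

```scala
import java.io.File
import java.net.URI

// Windows-style path from the AppVeyor failure; '\' is not a legal URI character.
val rawPath = """C:\Users\appveyor\AppData\Local\Temp\DummyClassForAddJarTest.java"""

try {
  URI.create("string:///" + rawPath)
} catch {
  case e: IllegalArgumentException => println(s"raw path rejected: ${e.getMessage}")
}

// The File-based URI form is escaped and safe to pass around.
println(new File(rawPath).toURI)
```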


---



[GitHub] spark pull request #15666: [SPARK-11421] [Core][Python][R] Added ability for...

2017-03-05 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/15666#discussion_r104358500
  
--- Diff: core/src/main/scala/org/apache/spark/TestUtils.scala ---
@@ -168,6 +168,27 @@ private[spark] object TestUtils {
 createCompiledClass(className, destDir, sourceFile, classpathUrls)
   }
 
+  /** Create a dummy compile jar for a given package, classname.  Jar will 
be placed in destDir */
+  def createDummyJar(destDir: String, packageName: String, className: 
String): String = {
+val srcDir = new File(destDir, packageName)
+srcDir.mkdirs()
+val excSource = new JavaSourceFromString(new File(srcDir, 
className).getAbsolutePath,
--- End diff --

Can we use the URI form here? It seems this one is problematic.


---



[GitHub] spark pull request #15666: [SPARK-11421] [Core][Python][R] Added ability for...

2017-03-06 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/15666#discussion_r104360211
  
--- Diff: R/pkg/inst/tests/testthat/test_context.R ---
@@ -167,6 +167,18 @@ test_that("spark.lapply should perform simple 
transforms", {
   sparkR.session.stop()
 })
 
+test_that("add jar should work and allow usage of the jar on the driver 
node", {
--- End diff --

This seems to have failed as below:

```
1. Error: add jar should work and allow usage of the jar on the driver node 
(@test_context.R#174) 
java.lang.IllegalArgumentException: Illegal character in path at index 12: 
string:///C:\Users\appveyor\AppData\Local\Temp\1\RtmpCqEUJL\testjar\sparkrTests\DummyClassForAddJarTest.java
at java.net.URI.create(URI.java:852)
at 
org.apache.spark.TestUtils$.org$apache$spark$TestUtils$$createURI(TestUtils.scala:119)
at 
org.apache.spark.TestUtils$JavaSourceFromString.<init>(TestUtils.scala:123)
```

(Sorry, I left the comments on the wrong line.)


---



[GitHub] spark pull request #17178: [SPARK-19828][R] Support array type in from_json ...

2017-03-06 Thread HyukjinKwon
GitHub user HyukjinKwon opened a pull request:

https://github.com/apache/spark/pull/17178

[SPARK-19828][R] Support array type in from_json in R

## What changes were proposed in this pull request?

Since we cannot directly define an array type in R, this PR proposes to support array types in R via the type strings used in `structField`, as below:

```R
jsonArr <- "[{\"name\":\"Bob\"}, {\"name\":\"Alice\"}]"
df <- as.DataFrame(list(list("people" = jsonArr)))
collect(select(df, alias(from_json(df$people, 
"array>"), "arrcol")))
```

prints

```R
  arrcol
1 Bob, Alice
```
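
For comparison, a rough Scala counterpart (a sketch that assumes the `DataType`-accepting `from_json` overload from SPARK-19595; column names mirror the R example above):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{ArrayType, StringType, StructType}

val spark = SparkSession.builder().master("local[*]").appName("from-json-array").getOrCreate()
import spark.implicits._

val df = Seq("""[{"name":"Bob"}, {"name":"Alice"}]""").toDF("people")

// An ArrayType schema is accepted by from_json once SPARK-19595 is in.
val schema = ArrayType(new StructType().add("name", StringType))
df.select(from_json($"people", schema, Map.empty[String, String]).as("arrcol")).show(false)
```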

## How was this patch tested?

Unit tests in `test_sparkSQL.R`.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/HyukjinKwon/spark SPARK-19828

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17178.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17178


commit 10feb9d2e9e4d027f874e6f508e7e7421c95cd6d
Author: hyukjinkwon 
Date:   2017-03-06T11:15:45Z

Support array in from_json in R




---



[GitHub] spark pull request #17178: [SPARK-19828][R] Support array type in from_json ...

2017-03-06 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17178#discussion_r104398168
  
--- Diff: R/pkg/inst/tests/testthat/test_sparkSQL.R ---
@@ -1342,28 +1342,52 @@ test_that("column functions", {
   df <- read.json(mapTypeJsonPath)
   j <- collect(select(df, alias(to_json(df$info), "json")))
   expect_equal(j[order(j$json), ][1], "{\"age\":16,\"height\":176.5}")
-  df <- as.DataFrame(j)
-  schema <- structType(structField("age", "integer"),
-   structField("height", "double"))
-  s <- collect(select(df, alias(from_json(df$json, schema), "structcol")))
-  expect_equal(ncol(s), 1)
-  expect_equal(nrow(s), 3)
-  expect_is(s[[1]][[1]], "struct")
-  expect_true(any(apply(s, 1, function(x) { x[[1]]$age == 16 } )))
-
-  # passing option
-  df <- as.DataFrame(list(list("col" = "{\"date\":\"21/10/2014\"}")))
-  schema2 <- structType(structField("date", "date"))
-  expect_error(tryCatch(collect(select(df, from_json(df$col, schema2))),
-error = function(e) { stop(e) }),
-   paste0(".*(java.lang.NumberFormatException: For input 
string:).*"))
-  s <- collect(select(df, from_json(df$col, schema2, dateFormat = 
"dd/MM/")))
-  expect_is(s[[1]][[1]]$date, "Date")
-  expect_equal(as.character(s[[1]][[1]]$date), "2014-10-21")
-
-  # check for unparseable
-  df <- as.DataFrame(list(list("a" = "")))
-  expect_equal(collect(select(df, from_json(df$a, schema)))[[1]][[1]], NA)
+
+  schemas <- list(structType(structField("age", "integer"), 
structField("height", "double")),
+  "struct")
+  for (schema in schemas) {
--- End diff --

I re-used the existing test code with the loop below:

```R
schemas <- list(structType(structField("age", "integer"), 
structField("height", "double")),
"struct")
for (schema in schemas) {
  ...
}
```


---



[GitHub] spark pull request #17178: [SPARK-19828][R] Support array type in from_json ...

2017-03-06 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17178#discussion_r104398323
  
--- Diff: R/pkg/R/functions.R ---
@@ -2430,33 +2430,43 @@ setMethod("date_format", signature(y = "Column", x 
= "character"),
 column(jc)
   })
 
+setClassUnion("characterOrstructType", c("character", "structType"))
--- End diff --

Can I use a union class here in this file?


---



[GitHub] spark issue #17178: [SPARK-19828][R] Support array type in from_json in R

2017-03-06 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/17178
  
@felixcheung, this is a bit different from what we talked about, but I opened this because I thought you might like this more. This now takes the type string given to `structField`. If you are worried about this, let me change it back to the optional-parameter approach.

Could you check if this makes sense to you?






---



[GitHub] spark pull request #15666: [SPARK-11421] [Core][Python][R] Added ability for...

2017-03-06 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/15666#discussion_r104408350
  
--- Diff: core/src/main/scala/org/apache/spark/TestUtils.scala ---
@@ -168,6 +168,27 @@ private[spark] object TestUtils {
 createCompiledClass(className, destDir, sourceFile, classpathUrls)
   }
 
+  /** Create a dummy compile jar for a given package, classname.  Jar will 
be placed in destDir */
+  def createDummyJar(destDir: String, packageName: String, className: 
String): String = {
+val srcDir = new File(destDir, packageName)
+srcDir.mkdirs()
+val excSource = new JavaSourceFromString(new File(srcDir, 
className).getAbsolutePath,
--- End diff --

It's fine to push a commit and use AppVeyor, IMHO. There is a guide for this - https://github.com/apache/spark/blob/master/R/WINDOWS.md - if you are willing to test manually, but it is kind of grunt work. In my case, I use my personal AppVeyor account.

Otherwise, I can send a PR against your branch if you are happy to wait until this weekend :).


---



[GitHub] spark issue #15666: [SPARK-11421] [Core][Python][R] Added ability for addJar...

2017-03-06 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/15666
  
Ah, it only runs the tests when there is a change in R. Give me a moment; let me trigger the test from my account.


---



[GitHub] spark issue #15666: [SPARK-11421] [Core][Python][R] Added ability for addJar...

2017-03-06 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/15666
  
Build started: [SparkR] `ALL` 
[![PR-15666](https://ci.appveyor.com/api/projects/status/github/spark-test/spark?branch=2978E854-F59E-4033-9A82-C28E6980E95E&svg=true)](https://ci.appveyor.com/project/spark-test/spark/branch/2978E854-F59E-4033-9A82-C28E6980E95E)

PS: It is a funny workaround, but you could close and reopen the PR; I believe that triggers the AppVeyor test.


---



[GitHub] spark pull request #15666: [SPARK-11421] [Core][Python][R] Added ability for...

2017-03-06 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/15666#discussion_r104429730
  
--- Diff: R/pkg/inst/tests/testthat/test_context.R ---
@@ -167,6 +167,18 @@ test_that("spark.lapply should perform simple 
transforms", {
   sparkR.session.stop()
 })
 
+test_that("add jar should work and allow usage of the jar on the driver 
node", {
+  sparkR.sparkContext()
+
+  destDir <- paste0(tempdir(), "/", "testjar")
+  jarName <- callJStatic("org.apache.spark.TestUtils", "createDummyJar",
+  destDir, "sparkrTests", "DummyClassForAddJarTest")
+
+  spark.addJar(jarName, addToCurrentClassLoader = TRUE)
+  testClass <- newJObject("sparkrTests.DummyClassForAddJarTest")
--- End diff --

Hm, it seems it is this line this time.

```
1. Error: add jar should work and allow usage of the jar on the driver node 
(@test_context.R#178) 
 
1: newJObject("sparkrTests.DummyClassForAddJarTest") at 
C:/projects/spark/R/lib/SparkR/tests/testthat/test_context.R:178
2: invokeJava(isStatic = TRUE, className, methodName = "<init>", ...)
3: handleErrors(returnStatus, conn)
4: stop(readString(conn))
```



---



[GitHub] spark pull request #15666: [SPARK-11421] [Core][Python][R] Added ability for...

2017-03-06 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/15666#discussion_r104431249
  
--- Diff: R/pkg/inst/tests/testthat/test_context.R ---
@@ -167,6 +167,18 @@ test_that("spark.lapply should perform simple 
transforms", {
   sparkR.session.stop()
 })
 
+test_that("add jar should work and allow usage of the jar on the driver 
node", {
+  sparkR.sparkContext()
+
+  destDir <- paste0(tempdir(), "/", "testjar")
+  jarName <- callJStatic("org.apache.spark.TestUtils", "createDummyJar",
+  destDir, "sparkrTests", "DummyClassForAddJarTest")
+
+  spark.addJar(jarName, addToCurrentClassLoader = TRUE)
+  testClass <- newJObject("sparkrTests.DummyClassForAddJarTest")
--- End diff --

Maybe a helpful log:

```
ERROR RBackendHandler: <init> on sparkrTests.DummyClassForAddJarTest failed
java.lang.ClassNotFoundException: sparkrTests.DummyClassForAddJarTest
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.spark.util.Utils$.classForName(Utils.scala:230)
at 
org.apache.spark.api.r.RBackendHandler.handleMethodCall(RBackendHandler.scala:143)
at 
org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:108)
at 
org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:40)
at 
io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336)
at 
io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:287)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336)
at 
io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336)
at 
io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:293)
at 
io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:267)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336)
at 
io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1294)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)
at 
io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:911)
at 
io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:643)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:566)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:480)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:442)
at 
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
at 
io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
at java.lang.Thread.run(Thread.java:745)
```


---

[GitHub] spark pull request #15666: [SPARK-11421] [Core][Python][R] Added ability for...

2017-03-06 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/15666#discussion_r104432962
  
--- Diff: core/src/main/scala/org/apache/spark/TestUtils.scala ---
@@ -168,6 +168,27 @@ private[spark] object TestUtils {
 createCompiledClass(className, destDir, sourceFile, classpathUrls)
   }
 
+  /** Create a dummy compile jar for a given package, classname.  Jar will 
be placed in destDir */
+  def createDummyJar(destDir: String, packageName: String, className: 
String): String = {
+val srcDir = new File(destDir, packageName)
+srcDir.mkdirs()
+val excSource = new JavaSourceFromString(new File(srcDir, 
className).toURI.getPath,
+  s"""package $packageName;
+ |
+  |public class $className implements java.io.Serializable {
--- End diff --

(This indentation seems inconsistent BTW)


---



[GitHub] spark pull request #15666: [SPARK-11421] [Core][Python][R] Added ability for...

2017-03-06 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/15666#discussion_r104433213
  
--- Diff: core/src/test/scala/org/apache/spark/SparkContextSuite.scala ---
@@ -308,6 +308,36 @@ class SparkContextSuite extends SparkFunSuite with 
LocalSparkContext with Eventu
 sc.listJars().head should include (tmpJar.getName)
   }
 
+  for (
+schedulingMode <- Seq("local_mode", "non_local_mode")
+  ) {
--- End diff --

Just personal taste, but maybe

```scala
Seq("local_mode", "non_local_mode").foreach { schedulingMode =>
  ...
}
```
?


---



[GitHub] spark issue #15666: [SPARK-11421] [Core][Python][R] Added ability for addJar...

2017-03-06 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/15666
  
Feel free to push any commit. Let me trigger the build with my account like 
a bot.


---



[GitHub] spark pull request #15666: [SPARK-11421] [Core][Python][R] Added ability for...

2017-03-06 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/15666#discussion_r104432976
  
--- Diff: core/src/main/scala/org/apache/spark/TestUtils.scala ---
@@ -168,6 +168,27 @@ private[spark] object TestUtils {
 createCompiledClass(className, destDir, sourceFile, classpathUrls)
   }
 
+  /** Create a dummy compile jar for a given package, classname.  Jar will 
be placed in destDir */
+  def createDummyJar(destDir: String, packageName: String, className: 
String): String = {
+val srcDir = new File(destDir, packageName)
+srcDir.mkdirs()
+val excSource = new JavaSourceFromString(new File(srcDir, 
className).toURI.getPath,
+  s"""package $packageName;
+ |
+  |public class $className implements java.io.Serializable {
+ |  public static String helloWorld(String arg) { return "Hello " 
+ arg; }
+ |  public static int addStuff(int arg1, int arg2) { return arg1 + 
arg2; }
+ |}
+""".
+stripMargin)
--- End diff --

(We could make this inlined.)


---



[GitHub] spark pull request #15666: [SPARK-11421] [Core][Python][R] Added ability for...

2017-03-06 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/15666#discussion_r104433603
  
--- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---
@@ -1842,6 +1864,14 @@ class SparkContext(config: SparkConf) extends 
Logging {
   logInfo(s"Added JAR $path at $key with timestamp $timestamp")
   postEnvironmentUpdate()
 }
+if (addToCurrentClassLoader) {
+  val currentCL = Utils.getContextOrSparkClassLoader
+  currentCL match {
+case cl: MutableURLClassLoader =>
+  cl.addURL(uri.toURL)
--- End diff --

(We could make this inlined too..)
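
i.e., roughly (a sketch reusing the names from the diff; the fallback case is 
just a placeholder):

```scala
Utils.getContextOrSparkClassLoader match {
  case cl: MutableURLClassLoader => cl.addURL(uri.toURL)
  case _ => // keep whatever the existing behaviour is for other classloaders
}
```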





[GitHub] spark issue #16854: [SPARK-15463][SQL] Add an API to load DataFrame from Dat...

2017-03-06 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/16854
  
@cloud-fan, I think this is ready for another look.





[GitHub] spark pull request #17177: [SPARK-19834][SQL] csv encoding/decoding using es...

2017-03-06 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17177#discussion_r104436338
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
 ---
@@ -465,6 +465,47 @@ class CSVSuite extends QueryTest with SharedSQLContext 
with SQLTestUtils {
 }
   }
 
+  test("save csv with quote escaping enabled, avoiding double backslash") {
+withTempDir { dir =>
+  val csvDir = new File(dir, "csv").getCanonicalPath
+
+  val df1 = spark.sqlContext.createDataFrame(List(
+(1, "aa\\\"bb,cc"), // aa\"bb,cc
+(2, "aa\"bb,cc"),   // aa\\\"bb,cc
+(3, "aa\\\"\\\"bb,cc"), // aa\"\"bb,cc
+(4, "aa\"\"bb,cc")  // aa\\\"\\\"bb,cc
+  ))
+
+  df1.coalesce(1).write
+  .format("csv")
+  .option("quote", "\"")
+  .option("escape", "\\")
+  .save(csvDir)
+
+  val df2 = spark.read.csv(csvDir).orderBy($"_c0")
+
+  val df1Str = df1.collect().map(_.getString(1)).mkString(" ")
+
+  val df2Str = 
df2.select("_c1").collect().map(_.getString(0)).mkString(" ")
+
+  val text = spark.read
--- End diff --

We should use two-space line indentation here (please refer to 
https://github.com/databricks/scala-style-guide).
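
i.e., something like this (a sketch with the variable names from the diff):

```scala
val text = spark.read
  .format("text")
  .option("quote", "\"")
  .option("escape", "\\")
  .load(csvDir)
```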





[GitHub] spark pull request #17177: [SPARK-19834][SQL] csv encoding/decoding using es...

2017-03-06 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17177#discussion_r104401595
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
 ---
@@ -465,6 +465,47 @@ class CSVSuite extends QueryTest with SharedSQLContext 
with SQLTestUtils {
 }
   }
 
+  test("save csv with quote escaping enabled, avoiding double backslash") {
+withTempDir { dir =>
+  val csvDir = new File(dir, "csv").getCanonicalPath
+
+  val df1 = spark.sqlContext.createDataFrame(List(
+(1, "aa\\\"bb,cc"), // aa\"bb,cc
+(2, "aa\"bb,cc"),   // aa\\\"bb,cc
+(3, "aa\\\"\\\"bb,cc"), // aa\"\"bb,cc
+(4, "aa\"\"bb,cc")  // aa\\\"\\\"bb,cc
+  ))
+
+  df1.coalesce(1).write
+  .format("csv")
+  .option("quote", "\"")
+  .option("escape", "\\")
+  .save(csvDir)
+
+  val df2 = spark.read.csv(csvDir).orderBy($"_c0")
+
+  val df1Str = df1.collect().map(_.getString(1)).mkString(" ")
+
+  val df2Str = 
df2.select("_c1").collect().map(_.getString(0)).mkString(" ")
+
+  val text = spark.read
+  .format("text")
+  .option("quote", "\"")
+  .option("escape", "\\")
+  .load(csvDir)
+  .collect()
+  .map(_.getString(0))
+  .sortBy(_(0)).map(_.drop(3).init).mkString(" ")
+
+  val textExpected =
+"aa\"bb,cc aa\\\"bb,cc aa\"\"\"bb,cc 
aa\\\"\"\\\"\"bb,cc"
--- End diff --

I think this would be more readable if it were wrapped in triple quotes 
(`"""..."""`) in this case.
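
For example (a sketch only, not the exact expected value), backslashes and 
quotes can then be written as-is:

```scala
// the literal content is exactly: aa\"bb,cc aa\\\"bb,cc
val expected = """aa\"bb,cc aa\\\"bb,cc"""
```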





[GitHub] spark pull request #17177: [SPARK-19834][SQL] csv encoding/decoding using es...

2017-03-06 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17177#discussion_r104401044
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
 ---
@@ -465,6 +465,47 @@ class CSVSuite extends QueryTest with SharedSQLContext 
with SQLTestUtils {
 }
   }
 
+  test("save csv with quote escaping enabled, avoiding double backslash") {
--- End diff --

Could we maybe narrow down what it tests and reduce the test code? It seems a 
little bit confusing what it actually tests.





[GitHub] spark pull request #17177: [SPARK-19834][SQL] csv encoding/decoding using es...

2017-03-06 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17177#discussion_r104456057
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala
 ---
@@ -167,6 +168,7 @@ private[csv] class CSVOptions(
 format.setDelimiter(delimiter)
 format.setQuote(quote)
 format.setQuoteEscape(escape)
+format.setCharToEscapeQuoteEscaping(quote)
--- End diff --

It seems in 
https://github.com/uniVocity/univocity-parsers/blob/master/README.md#escaping-quote-escape-characters
 it sets the same `escape` for both `setQuoteEscape` and 
`setCharToEscapeQuoteEscaping`, and that sounds correct to me. However, it seems 
different here. Do you mind if I ask the reason?
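
Just to make the question concrete, a sketch contrasting the two settings 
(using the `format`, `quote` and `escape` values from this file):

```scala
format.setQuote(quote)
format.setQuoteEscape(escape)

// What the univocity README shows: escape the quote-escape with the escape char itself.
format.setCharToEscapeQuoteEscaping(escape)

// What this patch does instead: use the quote character.
// format.setCharToEscapeQuoteEscaping(quote)
```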





[GitHub] spark pull request #17178: [SPARK-19828][R] Support array type in from_json ...

2017-03-06 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17178#discussion_r104553042
  
--- Diff: R/pkg/inst/tests/testthat/test_sparkSQL.R ---
@@ -1342,28 +1342,52 @@ test_that("column functions", {
   df <- read.json(mapTypeJsonPath)
   j <- collect(select(df, alias(to_json(df$info), "json")))
   expect_equal(j[order(j$json), ][1], "{\"age\":16,\"height\":176.5}")
-  df <- as.DataFrame(j)
-  schema <- structType(structField("age", "integer"),
-   structField("height", "double"))
-  s <- collect(select(df, alias(from_json(df$json, schema), "structcol")))
-  expect_equal(ncol(s), 1)
-  expect_equal(nrow(s), 3)
-  expect_is(s[[1]][[1]], "struct")
-  expect_true(any(apply(s, 1, function(x) { x[[1]]$age == 16 } )))
-
-  # passing option
-  df <- as.DataFrame(list(list("col" = "{\"date\":\"21/10/2014\"}")))
-  schema2 <- structType(structField("date", "date"))
-  expect_error(tryCatch(collect(select(df, from_json(df$col, schema2))),
-error = function(e) { stop(e) }),
-   paste0(".*(java.lang.NumberFormatException: For input 
string:).*"))
-  s <- collect(select(df, from_json(df$col, schema2, dateFormat = 
"dd/MM/")))
-  expect_is(s[[1]][[1]]$date, "Date")
-  expect_equal(as.character(s[[1]][[1]]$date), "2014-10-21")
-
-  # check for unparseable
-  df <- as.DataFrame(list(list("a" = "")))
-  expect_equal(collect(select(df, from_json(df$a, schema)))[[1]][[1]], NA)
+
+  schemas <- list(structType(structField("age", "integer"), 
structField("height", "double")),
+  "struct")
+  for (schema in schemas) {
+df <- as.DataFrame(j)
+s <- collect(select(df, alias(from_json(df$json, schema), 
"structcol")))
+expect_equal(ncol(s), 1)
+expect_equal(nrow(s), 3)
+expect_is(s[[1]][[1]], "struct")
+expect_true(any(apply(s, 1, function(x) { x[[1]]$age == 16 } )))
+
+# passing option
+df <- as.DataFrame(list(list("col" = "{\"date\":\"21/10/2014\"}")))
+schema2 <- structType(structField("date", "date"))
+expect_error(tryCatch(collect(select(df, from_json(df$col, schema2))),
+error = function(e) { stop(e) }),
+paste0(".*(java.lang.NumberFormatException: For input string:).*"))
+s <- collect(select(df, from_json(df$col, schema2, dateFormat = 
"dd/MM/")))
+expect_is(s[[1]][[1]]$date, "Date")
+expect_equal(as.character(s[[1]][[1]]$date), "2014-10-21")
+
+# check for unparseable
+df <- as.DataFrame(list(list("a" = "")))
+expect_equal(collect(select(df, from_json(df$a, schema)))[[1]][[1]], 
NA)
+  }
+
+  # check if array type in string is correctly supported.
+  jsonArr <- "[{\"name\":\"Bob\"}, {\"name\":\"Alice\"}]"
+  df <- as.DataFrame(list(list("people" = jsonArr)))
+  arr <- collect(select(df, alias(from_json(df$people, 
"array>"), "arrcol")))
--- End diff --

(Just in case, `from_json` takes both a `DataType` and the JSON string from 
`DataType.json`. In that sense, I thought it'd be nicer if it took what 
`structField` takes in R.)
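
For reference, on the Scala side the two forms roughly look like this (a 
sketch; a DataFrame `df` with a string column `people` is assumed):

```scala
import scala.collection.JavaConverters._
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types._

val schema = ArrayType(new StructType().add("name", StringType))

// passing the DataType itself
df.select(from_json(col("people"), schema))

// passing the JSON string produced by DataType.json
df.select(from_json(col("people"), schema.json, Map.empty[String, String].asJava))
```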





[GitHub] spark pull request #17068: [SPARK-19709][SQL] Read empty file with CSV data ...

2017-03-06 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17068#discussion_r104581628
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala
 ---
@@ -190,28 +194,28 @@ object WholeFileCSVDataSource extends CSVDataSource {
   sparkSession: SparkSession,
   inputPaths: Seq[FileStatus],
   parsedOptions: CSVOptions): Option[StructType] = {
-val csv: RDD[PortableDataStream] = createBaseRdd(sparkSession, 
inputPaths, parsedOptions)
-val maybeFirstRow: Option[Array[String]] = csv.flatMap { lines =>
+val csv = createBaseRdd(sparkSession, inputPaths, parsedOptions)
+csv.flatMap { lines =>
   UnivocityParser.tokenizeStream(
 
CodecStreams.createInputStreamWithCloseResource(lines.getConfiguration, 
lines.getPath()),
-false,
+shouldDropHeader = false,
 new CsvParser(parsedOptions.asParserSettings))
-}.take(1).headOption
-
-if (maybeFirstRow.isDefined) {
-  val firstRow = maybeFirstRow.get
-  val caseSensitive = 
sparkSession.sessionState.conf.caseSensitiveAnalysis
-  val header = makeSafeHeader(firstRow, caseSensitive, parsedOptions)
-  val tokenRDD = csv.flatMap { lines =>
-UnivocityParser.tokenizeStream(
-  
CodecStreams.createInputStreamWithCloseResource(lines.getConfiguration, 
lines.getPath()),
-  parsedOptions.headerFlag,
-  new CsvParser(parsedOptions.asParserSettings))
-  }
-  Some(CSVInferSchema.infer(tokenRDD, header, parsedOptions))
-} else {
-  // If the first row could not be read, just return the empty schema.
-  Some(StructType(Nil))
+}.take(1).headOption match {
--- End diff --

Thank you both for bearing with me.





[GitHub] spark pull request #17178: [SPARK-19828][R] Support array type in from_json ...

2017-03-06 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17178#discussion_r104604621
  
--- Diff: R/pkg/inst/tests/testthat/test_sparkSQL.R ---
@@ -1342,28 +1342,52 @@ test_that("column functions", {
   df <- read.json(mapTypeJsonPath)
   j <- collect(select(df, alias(to_json(df$info), "json")))
   expect_equal(j[order(j$json), ][1], "{\"age\":16,\"height\":176.5}")
-  df <- as.DataFrame(j)
-  schema <- structType(structField("age", "integer"),
-   structField("height", "double"))
-  s <- collect(select(df, alias(from_json(df$json, schema), "structcol")))
-  expect_equal(ncol(s), 1)
-  expect_equal(nrow(s), 3)
-  expect_is(s[[1]][[1]], "struct")
-  expect_true(any(apply(s, 1, function(x) { x[[1]]$age == 16 } )))
-
-  # passing option
-  df <- as.DataFrame(list(list("col" = "{\"date\":\"21/10/2014\"}")))
-  schema2 <- structType(structField("date", "date"))
-  expect_error(tryCatch(collect(select(df, from_json(df$col, schema2))),
-error = function(e) { stop(e) }),
-   paste0(".*(java.lang.NumberFormatException: For input 
string:).*"))
-  s <- collect(select(df, from_json(df$col, schema2, dateFormat = 
"dd/MM/")))
-  expect_is(s[[1]][[1]]$date, "Date")
-  expect_equal(as.character(s[[1]][[1]]$date), "2014-10-21")
-
-  # check for unparseable
-  df <- as.DataFrame(list(list("a" = "")))
-  expect_equal(collect(select(df, from_json(df$a, schema)))[[1]][[1]], NA)
+
+  schemas <- list(structType(structField("age", "integer"), 
structField("height", "double")),
+  "struct")
--- End diff --

Yes.. I persuaded myself that it is a valid string for `structField`. If you 
prefer the optional-parameter approach, I could try that. Otherwise, let me 
close this for now if you are not sure about either way :).






[GitHub] spark issue #17096: [SPARK-15243][ML][SQL][PYTHON] Add missing support for u...

2017-03-06 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/17096
  
Thank you @viirya for your sign-off.





[GitHub] spark pull request #16854: [SPARK-15463][SQL] Add an API to load DataFrame f...

2017-03-07 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/16854#discussion_r104651799
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala
 ---
@@ -134,23 +133,33 @@ object TextInputCSVDataSource extends CSVDataSource {
   inputPaths: Seq[FileStatus],
   parsedOptions: CSVOptions): Option[StructType] = {
 val csv = createBaseDataset(sparkSession, inputPaths, parsedOptions)
-CSVUtils.filterCommentAndEmpty(csv, parsedOptions).take(1).headOption 
match {
-  case Some(firstLine) =>
-val firstRow = new 
CsvParser(parsedOptions.asParserSettings).parseLine(firstLine)
-val caseSensitive = 
sparkSession.sessionState.conf.caseSensitiveAnalysis
-val header = makeSafeHeader(firstRow, caseSensitive, parsedOptions)
-val tokenRDD = csv.rdd.mapPartitions { iter =>
-  val filteredLines = CSVUtils.filterCommentAndEmpty(iter, 
parsedOptions)
-  val linesWithoutHeader =
-CSVUtils.filterHeaderLine(filteredLines, firstLine, 
parsedOptions)
-  val parser = new CsvParser(parsedOptions.asParserSettings)
-  linesWithoutHeader.map(parser.parseLine)
-}
-Some(CSVInferSchema.infer(tokenRDD, header, parsedOptions))
-  case None =>
-// If the first line could not be read, just return the empty 
schema.
-Some(StructType(Nil))
-}
+val maybeFirstLine = CSVUtils.filterCommentAndEmpty(csv, 
parsedOptions).take(1).headOption
+Some(inferFromDataset(sparkSession, csv, maybeFirstLine, 
parsedOptions))
+  }
+
+  /**
+   * Infers the schema from `Dataset` that stores CSV string records.
+   */
+  def inferFromDataset(
--- End diff --

There is almost no code modification here. Just moved from above.





[GitHub] spark pull request #17178: [SPARK-19828][R] Support array type in from_json ...

2017-03-07 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17178#discussion_r104658406
  
--- Diff: R/pkg/inst/tests/testthat/test_sparkSQL.R ---
@@ -1342,28 +1342,52 @@ test_that("column functions", {
   df <- read.json(mapTypeJsonPath)
   j <- collect(select(df, alias(to_json(df$info), "json")))
   expect_equal(j[order(j$json), ][1], "{\"age\":16,\"height\":176.5}")
-  df <- as.DataFrame(j)
-  schema <- structType(structField("age", "integer"),
-   structField("height", "double"))
-  s <- collect(select(df, alias(from_json(df$json, schema), "structcol")))
-  expect_equal(ncol(s), 1)
-  expect_equal(nrow(s), 3)
-  expect_is(s[[1]][[1]], "struct")
-  expect_true(any(apply(s, 1, function(x) { x[[1]]$age == 16 } )))
-
-  # passing option
-  df <- as.DataFrame(list(list("col" = "{\"date\":\"21/10/2014\"}")))
-  schema2 <- structType(structField("date", "date"))
-  expect_error(tryCatch(collect(select(df, from_json(df$col, schema2))),
-error = function(e) { stop(e) }),
-   paste0(".*(java.lang.NumberFormatException: For input 
string:).*"))
-  s <- collect(select(df, from_json(df$col, schema2, dateFormat = 
"dd/MM/")))
-  expect_is(s[[1]][[1]]$date, "Date")
-  expect_equal(as.character(s[[1]][[1]]$date), "2014-10-21")
-
-  # check for unparseable
-  df <- as.DataFrame(list(list("a" = "")))
-  expect_equal(collect(select(df, from_json(df$a, schema)))[[1]][[1]], NA)
+
+  schemas <- list(structType(structField("age", "integer"), 
structField("height", "double")),
+  "struct")
--- End diff --

Let me give it my best shot.





[GitHub] spark pull request #17192: [SPARK-19849][SQL] Support ArrayType in to_json t...

2017-03-07 Thread HyukjinKwon
GitHub user HyukjinKwon opened a pull request:

https://github.com/apache/spark/pull/17192

[SPARK-19849][SQL] Support ArrayType in to_json to produce JSON array

## What changes were proposed in this pull request?

This PR proposes to support an array of struct type in `to_json` as below:

```scala
import org.apache.spark.sql.functions._

val df = Seq(Tuple1(Tuple1(1) :: Nil)).toDF("a")
df.select(to_json($"a").as("json")).show()
```

```
+----------+
|      json|
+----------+
|[{"_1":1}]|
+----------+
```

Currently, it throws an exception as below (a newline manually inserted for 
readability):

```
org.apache.spark.sql.AnalysisException: cannot resolve 'structtojson(`array`)' due to
data type mismatch: structtojson requires that the expression is a struct expression.;;
```

This allows the roundtrip with `from_json` as below:

```scala
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val schema = ArrayType(StructType(StructField("a", IntegerType) :: Nil))
val df = Seq("""[{"a":1}, {"a":2}]""").toDF("json").select(from_json($"json", schema).as("array"))
df.show()

// Read back.
df.select(to_json($"array").as("json")).show()
```

```
+----------+
|     array|
+----------+
|[[1], [2]]|
+----------+

+-----------------+
|             json|
+-----------------+
|[{"a":1},{"a":2}]|
+-----------------+
```


Also, this PR proposes to rename `StructToJson` to `StructOrArrayToJson` and 
`JsonToStruct` to `JsonToStructOrArray`.

## How was this patch tested?

Unit tests in `JsonFunctionsSuite` and `JsonExpressionsSuite` for Scala, a 
doctest for Python and a test in `test_sparkSQL.R` for R.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/HyukjinKwon/spark SPARK-19849

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17192.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17192


commit 92922650ca6bf44ef4f4daf02653f66125e881d2
Author: hyukjinkwon 
Date:   2017-03-07T14:43:37Z

Support ArrayType in to_json to produce JSON array







[GitHub] spark issue #17192: [SPARK-19849][SQL] Support ArrayType in to_json to produ...

2017-03-07 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/17192
  
cc @brkyvz and @marmbrus, could you please take a look and see if it makes 
sense?





[GitHub] spark pull request #17096: [SPARK-15243][ML][SQL][PYTHON] Add missing suppor...

2017-03-07 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17096#discussion_r104856634
  
--- Diff: python/pyspark/ml/tests.py ---
@@ -320,6 +320,13 @@ def test_hasparam(self):
 testParams = TestParams()
 self.assertTrue(all([testParams.hasParam(p.name) for p in 
testParams.params]))
 self.assertFalse(testParams.hasParam("notAParameter"))
+self.assertTrue(testParams.hasParam(u"maxIter"))
+
+def test_resolveparam(self):
+testParams = TestParams()
+self.assertEqual(testParams._resolveParam(testParams.maxIter), 
testParams.maxIter)
+self.assertEqual(testParams._resolveParam("maxIter"), 
testParams.maxIter)
+self.assertEqual(testParams._resolveParam(u"maxIter"), 
testParams.maxIter)
--- End diff --

Yes, I think that makes sense, though it seems that would require a closer look 
and more tests. I believe the current state resolves the specific JIRA. Maybe, 
could we merge this as is if you think either way is fine? I feel it has been 
dragged out by related changes and hope it could be merged if it is okay to you. 
If it does not sound good to you, then let me try to take a look.





[GitHub] spark pull request #17178: [SPARK-19828][R] Support array type in from_json ...

2017-03-08 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17178#discussion_r104876261
  
--- Diff: R/pkg/inst/tests/testthat/test_sparkSQL.R ---
@@ -1342,28 +1342,52 @@ test_that("column functions", {
   df <- read.json(mapTypeJsonPath)
   j <- collect(select(df, alias(to_json(df$info), "json")))
   expect_equal(j[order(j$json), ][1], "{\"age\":16,\"height\":176.5}")
-  df <- as.DataFrame(j)
-  schema <- structType(structField("age", "integer"),
-   structField("height", "double"))
-  s <- collect(select(df, alias(from_json(df$json, schema), "structcol")))
-  expect_equal(ncol(s), 1)
-  expect_equal(nrow(s), 3)
-  expect_is(s[[1]][[1]], "struct")
-  expect_true(any(apply(s, 1, function(x) { x[[1]]$age == 16 } )))
-
-  # passing option
-  df <- as.DataFrame(list(list("col" = "{\"date\":\"21/10/2014\"}")))
-  schema2 <- structType(structField("date", "date"))
-  expect_error(tryCatch(collect(select(df, from_json(df$col, schema2))),
-error = function(e) { stop(e) }),
-   paste0(".*(java.lang.NumberFormatException: For input 
string:).*"))
-  s <- collect(select(df, from_json(df$col, schema2, dateFormat = 
"dd/MM/")))
-  expect_is(s[[1]][[1]]$date, "Date")
-  expect_equal(as.character(s[[1]][[1]]$date), "2014-10-21")
-
-  # check for unparseable
-  df <- as.DataFrame(list(list("a" = "")))
-  expect_equal(collect(select(df, from_json(df$a, schema)))[[1]][[1]], NA)
+
+  schemas <- list(structType(structField("age", "integer"), 
structField("height", "double")),
+  "struct")
--- End diff --

Hm, @felixcheung, I think this resembles the catalog string; maybe we could 
reuse `CatalystSqlParser.parseDataType` to make this more formal and to avoid 
duplicating the effort of defining a format and documentation. This is a big 
change but if this is what we want in the future, I would like to argue that we 
should keep it this way.

For the JSON string schema, there is an overloaded version of `from_json` that 
takes that schema string. If we are going to expose it, it can be easily done.

However, I think you meant it is a bigger change because we need to provide 
a way to produce this JSON string from types. Up to my knowledge, we can only 
manually specify the schema via this catalog string. Is this true? If so, I 
don't have a good idea for now to support this and I would rather close this if 
you think so as well.
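
For what it's worth, a rough sketch of the two string representations being 
discussed (the type used here is only an example):

```scala
import org.apache.spark.sql.catalyst.parser.CatalystSqlParser
import org.apache.spark.sql.types.DataType

// catalog-style string -> DataType
val fromCatalogString = CatalystSqlParser.parseDataType("array<struct<name:string>>")

// DataType -> JSON string, and back again
val jsonString = fromCatalogString.json
val roundTripped = DataType.fromJson(jsonString)
```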







[GitHub] spark pull request #16854: [SPARK-15463][SQL] Add an API to load DataFrame f...

2017-03-08 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/16854#discussion_r104879395
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
 ---
@@ -1088,4 +1104,20 @@ class CSVSuite extends QueryTest with 
SharedSQLContext with SQLTestUtils {
   checkAnswer(df, spark.emptyDataFrame)
 }
   }
+
+  test("Empty dataframe produces empty dataframe") {
+// Empty dataframe with schema.
+val emptyDF = spark.createDataFrame(
--- End diff --

Sure, let me fix it up.





[GitHub] spark pull request #17192: [SPARK-19849][SQL] Support ArrayType in to_json t...

2017-03-08 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17192#discussion_r104886708
  
--- Diff: 
sql/core/src/test/resources/sql-tests/results/json-functions.sql.out ---
@@ -32,32 +34,40 @@ Usage: to_json(expr[, options]) - Returns a json string 
with a given struct valu
 -- !query 2
 select to_json(named_struct('a', 1, 'b', 2))
 -- !query 2 schema
-struct
+struct
 -- !query 2 output
 {"a":1,"b":2}
 
 
 -- !query 3
 select to_json(named_struct('time', to_timestamp('2015-08-26', 
'-MM-dd')), map('timestampFormat', 'dd/MM/'))
 -- !query 3 schema
-struct
+struct
 -- !query 3 output
 {"time":"26/08/2015"}
 
 
 -- !query 4
-select to_json(named_struct('a', 1, 'b', 2), named_struct('mode', 
'PERMISSIVE'))
+select to_json(array(named_struct('a', 1, 'b', 2)))
 -- !query 4 schema
-struct<>
+struct
 -- !query 4 output
-org.apache.spark.sql.AnalysisException
-Must use a map() function for options;; line 1 pos 7
+[{"a":1,"b":2}]
 
 
 -- !query 5
-select to_json()
+select to_json(named_struct('a', 1, 'b', 2), named_struct('mode', 
'PERMISSIVE'))
--- End diff --

This happens to look a bit weird, but I had to add this in the middle of 
the SQL file - 
https://github.com/apache/spark/pull/17192/files#diff-3b8a538abd658a260aa32c4aa593bed7R6
 to show that this one is not an error case.





[GitHub] spark issue #15751: [SPARK-18246][SQL] Throws an exception before execution ...

2017-03-08 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/15751
  
Do we want this change? If there is an approval, I will rebase. It seems to 
easily cause conflicts.





[GitHub] spark pull request #17178: [SPARK-19828][R] Support array type in from_json ...

2017-03-08 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17178#discussion_r105049248
  
--- Diff: R/pkg/inst/tests/testthat/test_sparkSQL.R ---
@@ -1342,28 +1342,52 @@ test_that("column functions", {
   df <- read.json(mapTypeJsonPath)
   j <- collect(select(df, alias(to_json(df$info), "json")))
   expect_equal(j[order(j$json), ][1], "{\"age\":16,\"height\":176.5}")
-  df <- as.DataFrame(j)
-  schema <- structType(structField("age", "integer"),
-   structField("height", "double"))
-  s <- collect(select(df, alias(from_json(df$json, schema), "structcol")))
-  expect_equal(ncol(s), 1)
-  expect_equal(nrow(s), 3)
-  expect_is(s[[1]][[1]], "struct")
-  expect_true(any(apply(s, 1, function(x) { x[[1]]$age == 16 } )))
-
-  # passing option
-  df <- as.DataFrame(list(list("col" = "{\"date\":\"21/10/2014\"}")))
-  schema2 <- structType(structField("date", "date"))
-  expect_error(tryCatch(collect(select(df, from_json(df$col, schema2))),
-error = function(e) { stop(e) }),
-   paste0(".*(java.lang.NumberFormatException: For input 
string:).*"))
-  s <- collect(select(df, from_json(df$col, schema2, dateFormat = 
"dd/MM/")))
-  expect_is(s[[1]][[1]]$date, "Date")
-  expect_equal(as.character(s[[1]][[1]]$date), "2014-10-21")
-
-  # check for unparseable
-  df <- as.DataFrame(list(list("a" = "")))
-  expect_equal(collect(select(df, from_json(df$a, schema)))[[1]][[1]], NA)
+
+  schemas <- list(structType(structField("age", "integer"), 
structField("height", "double")),
+  "struct")
--- End diff --

Doh, I meant there is a public `from_json` that takes a JSON string schema - 
https://github.com/apache/spark/blob/369a148e591bb16ec7da54867610b207602cd698/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L3061-L3062

Yup, let me give it a shot and provide an option for it first to see if it 
looks okay to you.





[GitHub] spark pull request #17192: [SPARK-19849][SQL] Support ArrayType in to_json t...

2017-03-08 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17192#discussion_r105052696
  
--- Diff: R/pkg/inst/tests/testthat/test_sparkSQL.R ---
@@ -1339,6 +1339,11 @@ test_that("column functions", {
   expect_equal(collect(select(df, bround(df$x, 0)))[[1]][2], 4)
 
   # Test to_json(), from_json()
+  arr <- list(listToStruct(list("name" = "bob")))
+  df <- as.DataFrame(list(listToStruct(list("people" = arr
+  j <- collect(select(df, alias(to_json(df$people), "json")))
+  expect_equal(j[order(j$json), ][1], "[{\"name\":\"bob\"}]")
+
--- End diff --

@felixcheung, it is a little bit of R code here, but could you check the test 
and documentation?





[GitHub] spark pull request #17192: [SPARK-19849][SQL] Support ArrayType in to_json t...

2017-03-08 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17192#discussion_r105052901
  
--- Diff: 
sql/core/src/test/resources/sql-tests/results/json-functions.sql.out ---
@@ -32,32 +34,40 @@ Usage: to_json(expr[, options]) - Returns a json string 
with a given struct valu
 -- !query 2
 select to_json(named_struct('a', 1, 'b', 2))
 -- !query 2 schema
-struct
+struct
 -- !query 2 output
 {"a":1,"b":2}
 
 
 -- !query 3
 select to_json(named_struct('time', to_timestamp('2015-08-26', 
'-MM-dd')), map('timestampFormat', 'dd/MM/'))
 -- !query 3 schema
-struct
+struct
 -- !query 3 output
 {"time":"26/08/2015"}
 
 
 -- !query 4
-select to_json(named_struct('a', 1, 'b', 2), named_struct('mode', 
'PERMISSIVE'))
+select to_json(array(named_struct('a', 1, 'b', 2)))
--- End diff --

Cc @maropu, this adds a test to the file you recently wrote. Could you 
check if this follows your intention?





[GitHub] spark pull request #17192: [SPARK-19849][SQL] Support ArrayType in to_json t...

2017-03-08 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17192#discussion_r105054224
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/JsonFunctionsSuite.scala ---
@@ -220,4 +242,5 @@ class JsonFunctionsSuite extends QueryTest with 
SharedSQLContext {
 assert(errMsg2.getMessage.startsWith(
   "A type of keys and values in map() must be string, but got"))
   }
+
--- End diff --

I found that a newline at the end of tests prevents a conflict in some cases 
(_if I am not mistaken_). I am happy to revert this change if anyone is sure 
that it is useless or feels it is an unrelated change.





[GitHub] spark pull request #17192: [SPARK-19849][SQL] Support ArrayType in to_json t...

2017-03-08 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17192#discussion_r105059270
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala
 ---
@@ -422,7 +422,7 @@ object FunctionRegistry {
 expression[BitwiseXor]("^"),
 
 // json
-expression[StructToJson]("to_json"),
+expression[StructOrArrayToJson]("to_json"),
--- End diff --

I think that one is not particularly better. `StructsToJson` then does not 
convey that it can be an array or a single struct, if the reason is only 
ambiguity. Either way is fine to me. If any committer picks one, let me follow.




