[GitHub] spark issue #22383: [SPARK-25362][JavaAPI] Replace Spark Optional class with...

2018-10-18 Thread mmolimar
Github user mmolimar commented on the issue:

https://github.com/apache/spark/pull/22383
  
I agree @srowen.
What do you think about reusing an existing implementation, for example the one in the Guava lib, instead of having that class in Spark?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22383: [SPARK-25362][JavaAPI] Replace Spark Optional cla...

2018-10-12 Thread mmolimar
Github user mmolimar commented on a diff in the pull request:

https://github.com/apache/spark/pull/22383#discussion_r224948273
  
--- Diff: project/MimaExcludes.scala ---
@@ -36,6 +36,8 @@ object MimaExcludes {
 
   // Exclude rules for 3.0.x
   lazy val v30excludes = v24excludes ++ Seq(
+    // [SPARK-25362][JavaAPI] Replace Spark Optional class with Java Optional
+    ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.api.java.Optional")
--- End diff --

No worries. Done ;-)


---




[GitHub] spark issue #22383: [SPARK-25362][JavaAPI] Replace Spark Optional class with...

2018-10-11 Thread mmolimar
Github user mmolimar commented on the issue:

https://github.com/apache/spark/pull/22383
  
No problem. Done ;-)


---




[GitHub] spark issue #22383: [SPARK-25395][JavaAPI] Replace Spark Optional class with...

2018-10-11 Thread mmolimar
Github user mmolimar commented on the issue:

https://github.com/apache/spark/pull/22383
  
Updated @srowen 
The PR title already contains SPARK-25395; is that what you're expecting, or another PR?


---




[GitHub] spark issue #22383: [SPARK-25395][JavaAPI] Replace Spark Optional class with...

2018-09-11 Thread mmolimar
Github user mmolimar commented on the issue:

https://github.com/apache/spark/pull/22383
  
Done @srowen 


---




[GitHub] spark pull request #22383: [SPARK-25395][JavaAPI] Removing Optional Spark Ja...

2018-09-10 Thread mmolimar
GitHub user mmolimar opened a pull request:

https://github.com/apache/spark/pull/22383

[SPARK-25395][JavaAPI] Removing Optional Spark Java API

## What changes were proposed in this pull request?

Previous Spark versions didn't require Java 8, so an ``Optional`` class had to be implemented in the Spark Java API to support optional values.

Since Spark 2.4 uses Java 8, the Spark Java API ``Optional`` class should be removed so that Spark uses the native Java API.

## How was this patch tested?

The ``OptionalSuite`` class, which tested the Spark Java API ``Optional`` class, was removed along with that class.
Notice that the ``get`` method in the Spark Java API ``Optional`` class throws a ``NullPointerException`` when the value is not set, whereas the native Java API ``java.util.Optional`` throws a ``NoSuchElementException``.
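The exception-type difference can be seen with a minimal sketch against ``java.util.Optional`` (the removed Spark class is not shown here; its ``NullPointerException`` behavior is as described above):

```java
import java.util.NoSuchElementException;
import java.util.Optional;

public class OptionalGetDemo {
    public static void main(String[] args) {
        Optional<String> empty = Optional.empty();
        try {
            empty.get();
        } catch (NoSuchElementException e) {
            // java.util.Optional#get throws NoSuchElementException on an empty value,
            // not the NullPointerException thrown by the removed Spark Optional#get.
            System.out.println(e.getClass().getSimpleName()); // prints NoSuchElementException
        }
    }
}
```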

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/mmolimar/spark SPARK-25395

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22383.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22383


commit 4daf0ece245a2dc640be217cf4ad481ea430f996
Author: Mario Molina 
Date:   2018-09-10T14:48:13Z

Removing Optional Spark Java API




---




[GitHub] spark pull request #22234: [SPARK-25241][SQL] Configurable empty values when...

2018-09-10 Thread mmolimar
Github user mmolimar commented on a diff in the pull request:

https://github.com/apache/spark/pull/22234#discussion_r216337792
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala ---
@@ -91,9 +91,10 @@ abstract class CSVDataSource extends Serializable {
   }
 
   row.zipWithIndex.map { case (value, index) =>
-if (value == null || value.isEmpty || value == options.nullValue) {
-  // When there are empty strings or the values set in `nullValue`, put the
-  // index as the suffix.
+if (value == null || value.isEmpty || value == options.nullValue ||
+  value == options.emptyValueInRead) {
--- End diff --

Do I revert both of these changes then, @HyukjinKwon?
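For reference, the branch in the diff above can be sketched as a standalone method (a simplification; only the option names `nullValue` and `emptyValueInRead` come from the PR, the rest is illustrative):

```java
import java.util.Arrays;

public class SafeHeaderDemo {
    // Mirrors the condition in the diff: fall back to an index-based column name
    // when a header cell is null, empty, or equals the configured null/empty marker.
    static String[] makeSafeHeader(String[] row, String nullValue, String emptyValue) {
        String[] out = new String[row.length];
        for (int i = 0; i < row.length; i++) {
            String value = row[i];
            if (value == null || value.isEmpty()
                    || value.equals(nullValue) || value.equals(emptyValue)) {
                out[i] = "_c" + i; // put the index as the suffix
            } else {
                out[i] = value;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(
            makeSafeHeader(new String[]{null, "name", ""}, "NULL", "EMPTY")));
        // prints [_c0, name, _c2]
    }
}
```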


---




[GitHub] spark pull request #18447: [SPARK-21232][SQL][SparkR][PYSPARK] New built-in ...

2018-09-10 Thread mmolimar
Github user mmolimar closed the pull request at:

https://github.com/apache/spark/pull/18447


---




[GitHub] spark pull request #22234: [SPARK-25241][SQL] Configurable empty values when...

2018-08-26 Thread mmolimar
Github user mmolimar commented on a diff in the pull request:

https://github.com/apache/spark/pull/22234#discussion_r212851409
  
--- Diff: python/pyspark/sql/readwriter.py ---
@@ -457,9 +459,9 @@ def csv(self, path, schema=None, sep=None, encoding=None, quote=None, escape=None,
 schema=schema, sep=sep, encoding=encoding, quote=quote, escape=escape, comment=comment,
 header=header, inferSchema=inferSchema, ignoreLeadingWhiteSpace=ignoreLeadingWhiteSpace,
 ignoreTrailingWhiteSpace=ignoreTrailingWhiteSpace, nullValue=nullValue,
-nanValue=nanValue, positiveInf=positiveInf, negativeInf=negativeInf,
-dateFormat=dateFormat, timestampFormat=timestampFormat, maxColumns=maxColumns,
-maxCharsPerColumn=maxCharsPerColumn,
+emptyValue=emptyValue, nanValue=nanValue, positiveInf=positiveInf,
--- End diff --

Done!


---




[GitHub] spark pull request #22234: [SPARK-25241][SQL] Configurable empty values when...

2018-08-26 Thread mmolimar
Github user mmolimar commented on a diff in the pull request:

https://github.com/apache/spark/pull/22234#discussion_r212850822
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala ---
@@ -117,6 +117,9 @@ class CSVOptions(
 
   val nullValue = parameters.getOrElse("nullValue", "")
 
+  val emptyValueInRead = parameters.getOrElse("emptyValue", "")
--- End diff --

I thought that as well. Just for the sake of providing backwards 
compatibility, as we already have with `ignoreLeadingWhiteSpaceInRead` and 
`ignoreLeadingWhiteSpaceFlagInWrite`, I implemented it that way.
What do you say?


---




[GitHub] spark issue #22234: [SPARK-25241][SQL] Configurable empty values when readin...

2018-08-26 Thread mmolimar
Github user mmolimar commented on the issue:

https://github.com/apache/spark/pull/22234
  
@MaxGekk I added what you suggested as well.


---




[GitHub] spark pull request #22234: [SPARK-25241][SQL] Configurable empty values when...

2018-08-26 Thread mmolimar
Github user mmolimar commented on a diff in the pull request:

https://github.com/apache/spark/pull/22234#discussion_r212842706
  
--- Diff: python/pyspark/sql/readwriter.py ---
@@ -345,11 +345,11 @@ def text(self, paths, wholetext=False, lineSep=None):
 @since(2.0)
def csv(self, path, schema=None, sep=None, encoding=None, quote=None, escape=None,
 comment=None, header=None, inferSchema=None, ignoreLeadingWhiteSpace=None,
-ignoreTrailingWhiteSpace=None, nullValue=None, nanValue=None, positiveInf=None,
-negativeInf=None, dateFormat=None, timestampFormat=None, maxColumns=None,
-maxCharsPerColumn=None, maxMalformedLogPerPartition=None, mode=None,
-columnNameOfCorruptRecord=None, multiLine=None, charToEscapeQuoteEscaping=None,
-samplingRatio=None, enforceSchema=None):
+ignoreTrailingWhiteSpace=None, nullValue=None, emptyValue=None, nanValue=None,
--- End diff --

Done!


---




[GitHub] spark pull request #22234: [SPARK-25241][SQL] Configurable empty values when...

2018-08-25 Thread mmolimar
GitHub user mmolimar opened a pull request:

https://github.com/apache/spark/pull/22234

[SPARK-25241][SQL] Configurable empty values when reading/writing CSV files

## What changes were proposed in this pull request?
The CSV parser has an option to substitute a custom value when there are empty values in the CSV files or in our dataframes.
Currently, this option cannot be configured and always applies a default value (an empty string for reading and `""` for writing).
This PR enables a new CSV option in the reader/writer to set custom empty values when reading/writing CSV files.
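As a rough sketch of what the new option changes on the write side (illustrative only; Spark's actual CSV writer is not shown, and the `writeField` helper and `N/A` marker below are hypothetical):

```java
public class EmptyValueDemo {
    // Current behavior hard-codes the marker written for empty fields;
    // the PR makes this marker configurable via the `emptyValue` option.
    static String writeField(String value, String emptyValue) {
        if (value == null) return "";           // nulls follow the nullValue path, blank here
        if (value.isEmpty()) return emptyValue; // empty strings get the configured marker
        return value;
    }

    public static void main(String[] args) {
        System.out.println(writeField("", "\"\"")); // default-like behavior
        System.out.println(writeField("", "N/A"));  // configured marker: prints N/A
    }
}
```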

## How was this patch tested?
The changes were tested by adding two unit tests to CSVSuite.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/mmolimar/spark SPARK-25241

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22234.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22234


commit 8b5180021d246ab2fdf0824c01b9f180136837ce
Author: Mario Molina 
Date:   2018-08-25T17:42:03Z

Configurable empty values when reading/writing CSV files




---




[GitHub] spark issue #18447: [SPARK-21232][SQL][SparkR][PYSPARK] New built-in SQL fun...

2018-07-20 Thread mmolimar
Github user mmolimar commented on the issue:

https://github.com/apache/spark/pull/18447
  
Hi @HyukjinKwon 
For me it's fine:
"In some SQL db you have to query explicitly the table schema, ie: select 
data_type from all_tab_columns where table_name = 'my_table'or something like 
that.
In case of the ARQ engine from Apache Jena you can call this function in 
SPARQL (see 
[W3C-SPARQL](https://www.w3.org/TR/rdf-sparql-query/#func-datatype)).
I find it useful in order to avoid to query the schema."


---




[GitHub] spark issue #18447: [SPARK-21232][SQL][SparkR][PYSPARK] New built-in SQL fun...

2018-06-04 Thread mmolimar
Github user mmolimar commented on the issue:

https://github.com/apache/spark/pull/18447
  
so @felixcheung ?


---




[GitHub] spark issue #18447: [SPARK-21232][SQL][SparkR][PYSPARK] New built-in SQL fun...

2018-05-08 Thread mmolimar
Github user mmolimar commented on the issue:

https://github.com/apache/spark/pull/18447
  
@felixcheung I think it should be fine now.


---




[GitHub] spark issue #18447: [SPARK-21232][SQL][SparkR][PYSPARK] New built-in SQL fun...

2018-05-04 Thread mmolimar
Github user mmolimar commented on the issue:

https://github.com/apache/spark/pull/18447
  
@felixcheung Everything done!


---




[GitHub] spark issue #18447: [SPARK-21232][SQL][SparkR][PYSPARK] New built-in SQL fun...

2017-10-28 Thread mmolimar
Github user mmolimar commented on the issue:

https://github.com/apache/spark/pull/18447
  
In some SQL databases you have to query the table schema explicitly, e.g. ``select 
data_type from all_tab_columns where table_name = 'my_table'`` or something like 
that.
In the case of the ARQ engine from Apache Jena, you can call this function in 
SPARQL (see [W3C-SPARQL](https://www.w3.org/TR/rdf-sparql-query/#func-datatype)).
I find it useful to avoid querying the schema.


---




[GitHub] spark pull request #18447: [SPARK-21232][SQL][SparkR][PYSPARK] New built-in ...

2017-07-28 Thread mmolimar
Github user mmolimar commented on a diff in the pull request:

https://github.com/apache/spark/pull/18447#discussion_r130025210
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala ---
@@ -209,6 +209,18 @@ class DataFrameFunctionsSuite extends QueryTest with SharedSQLContext {
   Row(2743272264L, 2180413220L))
   }
 
+  test("misc data_type function") {
+val df = Seq(("a", false)).toDF("a", "b")
+
+checkAnswer(
+  df.select(data_type($"a"), data_type($"b")),
--- End diff --

@HyukjinKwon so?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---




[GitHub] spark pull request #18447: [SPARK-21232][SQL][SparkR][PYSPARK] New built-in ...

2017-07-11 Thread mmolimar
Github user mmolimar commented on a diff in the pull request:

https://github.com/apache/spark/pull/18447#discussion_r126710311
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala ---
@@ -209,6 +209,18 @@ class DataFrameFunctionsSuite extends QueryTest with SharedSQLContext {
   Row(2743272264L, 2180413220L))
   }
 
+  test("misc data_type function") {
+val df = Seq(("a", false)).toDF("a", "b")
+
+checkAnswer(
+  df.select(data_type($"a"), data_type($"b")),
--- End diff --

I can't think of a SQL db which queries the data type without using the 
table schema. However, in MongoDB for example, you can get something like that 
using the `$type` operator or `typeof`.


---




[GitHub] spark pull request #18447: [SPARK-21232][SQL][SparkR][PYSPARK] New built-in ...

2017-06-28 Thread mmolimar
Github user mmolimar commented on a diff in the pull request:

https://github.com/apache/spark/pull/18447#discussion_r124545289
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala ---
@@ -209,6 +209,18 @@ class DataFrameFunctionsSuite extends QueryTest with SharedSQLContext {
   Row(2743272264L, 2180413220L))
   }
 
+  test("misc data_type function") {
+val df = Seq(("a", false)).toDF("a", "b")
+
+checkAnswer(
+  df.select(data_type($"a"), data_type($"b")),
--- End diff --

The idea would be to know the type based on the value itself, not from the 
schema (i.e. the value could be null):
```scala
case class StringData(s: String)

val df = spark.sparkContext.parallelize(
  StringData(null) ::
  StringData("a") :: Nil).toDF()

df.select(data_type(col("s")))         // you get null and string in this case
df.schema.map(_.dataType.simpleString) // you just get string
```
On the other hand, it'd be nice to have this SQL function as we do in some 
databases.
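The same value-based (rather than schema-based) idea can be sketched outside Spark; the `dataType` helper below is hypothetical and only illustrates deriving a type name from the runtime value:

```java
public class DataTypeDemo {
    // Derive a type name from the value itself; a null value reports "null"
    // instead of whatever the column's declared type would be.
    static String dataType(Object value) {
        if (value == null) return "null";
        return value.getClass().getSimpleName().toLowerCase();
    }

    public static void main(String[] args) {
        System.out.println(dataType(null));  // prints null
        System.out.println(dataType("a"));   // prints string
        System.out.println(dataType(false)); // prints boolean
    }
}
```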



---




[GitHub] spark pull request #18447: [SPARK-21232][SQL][SparkR][PYSPARK] New built-in ...

2017-06-28 Thread mmolimar
GitHub user mmolimar opened a pull request:

https://github.com/apache/spark/pull/18447

[SPARK-21232][SQL][SparkR][PYSPARK] New built-in SQL function - Data_Type

## What changes were proposed in this pull request?

New built-in function to get the data type of columns in SQL.

## How was this patch tested?

Unit tests included.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/mmolimar/spark SPARK-21232

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/18447.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #18447


commit ef2b2189994f4d790560dbf8bfddf0008a520ccf
Author: Mario Molina <mmoli...@gmail.com>
Date:   2017-06-28T06:22:29Z

New data_type SQL function

commit f4bea7061d09246c8fcaa97865043a70228913e3
Author: Mario Molina <mmoli...@gmail.com>
Date:   2017-06-28T06:22:57Z

Tests for data_type function

commit 6fad8e9f518567b503345a21fcea8a4ddf1e5d9b
Author: Mario Molina <mmoli...@gmail.com>
Date:   2017-06-28T06:25:18Z

Python support for data_type function

commit 959cccf4357abef2bd90957c5402c2a2d67c6262
Author: Mario Molina <mmoli...@gmail.com>
Date:   2017-06-28T06:25:31Z

R support for data_type function




---
