[GitHub] spark issue #22383: [SPARK-25362][JavaAPI] Replace Spark Optional class with...
Github user mmolimar commented on the issue: https://github.com/apache/spark/pull/22383 I agree @srowen. What do you think about reusing the current implementation we already have, for example, in the guava lib instead of having that class in Spark? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22383: [SPARK-25362][JavaAPI] Replace Spark Optional cla...
Github user mmolimar commented on a diff in the pull request: https://github.com/apache/spark/pull/22383#discussion_r224948273 --- Diff: project/MimaExcludes.scala --- @@ -36,6 +36,8 @@ object MimaExcludes { // Exclude rules for 3.0.x lazy val v30excludes = v24excludes ++ Seq( +// [SPARK-25362][JavaAPI] Replace Spark Optional class with Java Optional + ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.api.java.Optional") --- End diff -- No worries. Done ;-)
[GitHub] spark issue #22383: [SPARK-25362][JavaAPI] Replace Spark Optional class with...
Github user mmolimar commented on the issue: https://github.com/apache/spark/pull/22383 No problem. Done ;-)
[GitHub] spark issue #22383: [SPARK-25395][JavaAPI] Replace Spark Optional class with...
Github user mmolimar commented on the issue: https://github.com/apache/spark/pull/22383 Updated, @srowen. The PR title already contains SPARK-25395; is that what you're expecting, or another PR?
[GitHub] spark issue #22383: [SPARK-25395][JavaAPI] Replace Spark Optional class with...
Github user mmolimar commented on the issue: https://github.com/apache/spark/pull/22383 Done @srowen
[GitHub] spark pull request #22383: [SPARK-25395][JavaAPI] Removing Optional Spark Ja...
GitHub user mmolimar opened a pull request: https://github.com/apache/spark/pull/22383 [SPARK-25395][JavaAPI] Removing Optional Spark Java API ## What changes were proposed in this pull request? Previous Spark versions didn't require Java 8, so an ``Optional`` Spark Java API had to be implemented to support optional values. Since Spark 2.4 uses Java 8, the ``Optional`` Spark Java API should be removed so that Spark uses the original Java API. ## How was this patch tested? The ``OptionalSuite`` class, which tested the Spark Java API ``Optional`` class, was removed along with that class. Notice that the ``get`` method in the Spark Java API ``Optional`` class throws a ``NullPointerException`` when the value is not set, whereas the native Java API ``java.util.Optional`` throws a ``NoSuchElementException``. You can merge this pull request into a Git repository by running: $ git pull https://github.com/mmolimar/spark SPARK-25395 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22383.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22383 commit 4daf0ece245a2dc640be217cf4ad481ea430f996 Author: Mario Molina Date: 2018-09-10T14:48:13Z Removing Optional Spark Java API
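The exception difference called out in the PR description can be reproduced on the ``java.util.Optional`` side with plain Java (a minimal illustration; the removed Spark class is described only in the comment):

```java
import java.util.NoSuchElementException;
import java.util.Optional;

public class OptionalGetDemo {
    public static void main(String[] args) {
        // java.util.Optional.get() on an empty value throws NoSuchElementException,
        // whereas the removed org.apache.spark.api.java.Optional.get() threw
        // NullPointerException (per the PR description above).
        Optional<String> empty = Optional.empty();
        try {
            empty.get();
        } catch (NoSuchElementException e) {
            System.out.println("java.util.Optional: " + e.getClass().getSimpleName());
        }
    }
}
```

Callers migrating from the Spark class to ``java.util.Optional`` therefore need to adjust any code that caught ``NullPointerException`` from ``get``.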
[GitHub] spark pull request #22234: [SPARK-25241][SQL] Configurable empty values when...
Github user mmolimar commented on a diff in the pull request: https://github.com/apache/spark/pull/22234#discussion_r216337792 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala --- @@ -91,9 +91,10 @@ abstract class CSVDataSource extends Serializable { } row.zipWithIndex.map { case (value, index) => -if (value == null || value.isEmpty || value == options.nullValue) { - // When there are empty strings or the values set in `nullValue`, put the - // index as the suffix. +if (value == null || value.isEmpty || value == options.nullValue || + value == options.emptyValueInRead) { --- End diff -- Do I revert both of these changes then, @HyukjinKwon?
[GitHub] spark pull request #18447: [SPARK-21232][SQL][SparkR][PYSPARK] New built-in ...
Github user mmolimar closed the pull request at: https://github.com/apache/spark/pull/18447
[GitHub] spark pull request #22234: [SPARK-25241][SQL] Configurable empty values when...
Github user mmolimar commented on a diff in the pull request: https://github.com/apache/spark/pull/22234#discussion_r212851409 --- Diff: python/pyspark/sql/readwriter.py --- @@ -457,9 +459,9 @@ def csv(self, path, schema=None, sep=None, encoding=None, quote=None, escape=Non schema=schema, sep=sep, encoding=encoding, quote=quote, escape=escape, comment=comment, header=header, inferSchema=inferSchema, ignoreLeadingWhiteSpace=ignoreLeadingWhiteSpace, ignoreTrailingWhiteSpace=ignoreTrailingWhiteSpace, nullValue=nullValue, -nanValue=nanValue, positiveInf=positiveInf, negativeInf=negativeInf, -dateFormat=dateFormat, timestampFormat=timestampFormat, maxColumns=maxColumns, -maxCharsPerColumn=maxCharsPerColumn, +emptyValue=emptyValue, nanValue=nanValue, positiveInf=positiveInf, --- End diff -- Done!
[GitHub] spark pull request #22234: [SPARK-25241][SQL] Configurable empty values when...
Github user mmolimar commented on a diff in the pull request: https://github.com/apache/spark/pull/22234#discussion_r212850822 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala --- @@ -117,6 +117,9 @@ class CSVOptions( val nullValue = parameters.getOrElse("nullValue", "") + val emptyValueInRead = parameters.getOrElse("emptyValue", "") --- End diff -- I thought that as well. Just for the sake of providing backwards compatibility, as we already have with `ignoreLeadingWhiteSpaceInRead` and `ignoreLeadingWhiteSpaceFlagInWrite`, I implemented it that way. What do you say?
[GitHub] spark issue #22234: [SPARK-25241][SQL] Configurable empty values when readin...
Github user mmolimar commented on the issue: https://github.com/apache/spark/pull/22234 @MaxGekk I added what you suggested as well.
[GitHub] spark pull request #22234: [SPARK-25241][SQL] Configurable empty values when...
Github user mmolimar commented on a diff in the pull request: https://github.com/apache/spark/pull/22234#discussion_r212842706 --- Diff: python/pyspark/sql/readwriter.py --- @@ -345,11 +345,11 @@ def text(self, paths, wholetext=False, lineSep=None): @since(2.0) def csv(self, path, schema=None, sep=None, encoding=None, quote=None, escape=None, comment=None, header=None, inferSchema=None, ignoreLeadingWhiteSpace=None, -ignoreTrailingWhiteSpace=None, nullValue=None, nanValue=None, positiveInf=None, -negativeInf=None, dateFormat=None, timestampFormat=None, maxColumns=None, -maxCharsPerColumn=None, maxMalformedLogPerPartition=None, mode=None, -columnNameOfCorruptRecord=None, multiLine=None, charToEscapeQuoteEscaping=None, -samplingRatio=None, enforceSchema=None): +ignoreTrailingWhiteSpace=None, nullValue=None, emptyValue=None, nanValue=None, --- End diff -- Done!
[GitHub] spark pull request #22234: [SPARK-25241][SQL] Configurable empty values when...
GitHub user mmolimar opened a pull request: https://github.com/apache/spark/pull/22234 [SPARK-25241][SQL] Configurable empty values when reading/writing CSV files ## What changes were proposed in this pull request? There is an option in the CSV parser to set values when we have empty values in the CSV files or in our dataframes. Currently, this option cannot be configured and always sets a default value (empty string for reading and `""` for writing). This PR is about enabling a new CSV option in the reader/writer to set custom empty values when reading/writing CSV files. ## How was this patch tested? The changes were tested by CSVSuite adding two unit tests. You can merge this pull request into a Git repository by running: $ git pull https://github.com/mmolimar/spark SPARK-25241 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22234.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22234 commit 8b5180021d246ab2fdf0824c01b9f180136837ce Author: Mario Molina Date: 2018-08-25T17:42:03Z Configurable empty values when reading/writing CSV files
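The read-side semantics of such an `emptyValue` option can be sketched in plain Java (an illustration of the idea only, not Spark's parser; `EMPTY_VALUE_READ` and `readField` are hypothetical stand-ins for the configurable setting):

```java
import java.util.Arrays;

public class EmptyValueSketch {
    // Hypothetical stand-in for a user-configured emptyValue setting for reading.
    static final String EMPTY_VALUE_READ = "NA";

    // Substitute the configured placeholder when a parsed field is empty.
    static String readField(String raw) {
        if (raw == null || raw.isEmpty()) {
            return EMPTY_VALUE_READ;
        }
        return raw;
    }

    public static void main(String[] args) {
        // The middle field of this row is empty and gets the placeholder.
        String[] row = "a,,c".split(",", -1);
        String[] parsed = Arrays.stream(row)
                                .map(EmptyValueSketch::readField)
                                .toArray(String[]::new);
        System.out.println(Arrays.toString(parsed)); // [a, NA, c]
    }
}
```

Without a configurable option, the placeholder is fixed to the defaults the PR description mentions (empty string for reading, `""` for writing).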
[GitHub] spark issue #18447: [SPARK-21232][SQL][SparkR][PYSPARK] New built-in SQL fun...
Github user mmolimar commented on the issue: https://github.com/apache/spark/pull/18447 Hi @HyukjinKwon For me it's fine: "In some SQL db you have to query explicitly the table schema, ie: select data_type from all_tab_columns where table_name = 'my_table' or something like that. In case of the ARQ engine from Apache Jena you can call this function in SPARQL (see [W3C-SPARQL](https://www.w3.org/TR/rdf-sparql-query/#func-datatype)). I find it useful to avoid querying the schema."
[GitHub] spark issue #18447: [SPARK-21232][SQL][SparkR][PYSPARK] New built-in SQL fun...
Github user mmolimar commented on the issue: https://github.com/apache/spark/pull/18447 so @felixcheung ?
[GitHub] spark issue #18447: [SPARK-21232][SQL][SparkR][PYSPARK] New built-in SQL fun...
Github user mmolimar commented on the issue: https://github.com/apache/spark/pull/18447 @felixcheung I think it should be fine now.
[GitHub] spark issue #18447: [SPARK-21232][SQL][SparkR][PYSPARK] New built-in SQL fun...
Github user mmolimar commented on the issue: https://github.com/apache/spark/pull/18447 @felixcheung Everything done!
[GitHub] spark issue #18447: [SPARK-21232][SQL][SparkR][PYSPARK] New built-in SQL fun...
Github user mmolimar commented on the issue: https://github.com/apache/spark/pull/18447 In some SQL db you have to query the table schema explicitly, ie: ``select data_type from all_tab_columns where table_name = 'my_table'`` or something like that. In case of the ARQ engine from Apache Jena you can call this function in SPARQL (see [W3C-SPARQL](https://www.w3.org/TR/rdf-sparql-query/#func-datatype)). I find it useful to avoid querying the schema.
[GitHub] spark pull request #18447: [SPARK-21232][SQL][SparkR][PYSPARK] New built-in ...
Github user mmolimar commented on a diff in the pull request: https://github.com/apache/spark/pull/18447#discussion_r130025210 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala --- @@ -209,6 +209,18 @@ class DataFrameFunctionsSuite extends QueryTest with SharedSQLContext { Row(2743272264L, 2180413220L)) } + test("misc data_type function") { +val df = Seq(("a", false)).toDF("a", "b") + +checkAnswer( + df.select(data_type($"a"), data_type($"b")), --- End diff -- @HyukjinKwon so? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.
[GitHub] spark pull request #18447: [SPARK-21232][SQL][SparkR][PYSPARK] New built-in ...
Github user mmolimar commented on a diff in the pull request: https://github.com/apache/spark/pull/18447#discussion_r126710311 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala --- @@ -209,6 +209,18 @@ class DataFrameFunctionsSuite extends QueryTest with SharedSQLContext { Row(2743272264L, 2180413220L)) } + test("misc data_type function") { +val df = Seq(("a", false)).toDF("a", "b") + +checkAnswer( + df.select(data_type($"a"), data_type($"b")), --- End diff -- I can't think of a SQL db which queries the datatype without using the table schema. However, for example in MongoDB, you can get something like that using the $type operator or 'typeof'.
[GitHub] spark pull request #18447: [SPARK-21232][SQL][SparkR][PYSPARK] New built-in ...
Github user mmolimar commented on a diff in the pull request: https://github.com/apache/spark/pull/18447#discussion_r124545289 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala --- @@ -209,6 +209,18 @@ class DataFrameFunctionsSuite extends QueryTest with SharedSQLContext { Row(2743272264L, 2180413220L)) } + test("misc data_type function") { +val df = Seq(("a", false)).toDF("a", "b") + +checkAnswer( + df.select(data_type($"a"), data_type($"b")), --- End diff -- The idea would be to know the type based on the value itself, not by the schema (i.e. the value could be null):

```scala
val df = spark.sparkContext.parallelize(
  StringData(null) :: StringData("a") :: Nil).toDF()

df.select(data_type(col("s")))         // you get null and string in this case
df.schema.map(_.dataType.simpleString) // you just get string
```

On the other hand, it'd be nice to have this SQL function as we do in some databases.
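The value-versus-schema distinction described in that comment can be sketched outside Spark in plain Java (an illustration only; this `dataType` helper is a hypothetical analogue of the proposed function, not Spark code):

```java
import java.util.Arrays;
import java.util.List;

public class DataTypeByValue {
    // Hypothetical analogue of the proposed data_type function: derive a type
    // name from each value itself rather than from a declared schema.
    static String dataType(Object value) {
        if (value == null) return "null";
        if (value instanceof String) return "string";
        if (value instanceof Boolean) return "boolean";
        return value.getClass().getSimpleName().toLowerCase();
    }

    public static void main(String[] args) {
        // A column declared as string still holds a null value; per-value
        // inspection reports "null" for it, while a schema would only ever
        // say "string" for the whole column.
        List<Object> column = Arrays.asList(null, "a");
        column.forEach(v -> System.out.println(dataType(v)));
    }
}
```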
[GitHub] spark pull request #18447: [SPARK-21232][SQL][SparkR][PYSPARK] New built-in ...
GitHub user mmolimar opened a pull request: https://github.com/apache/spark/pull/18447 [SPARK-21232][SQL][SparkR][PYSPARK] New built-in SQL function - Data_Type ## What changes were proposed in this pull request? New built-in function to get the data type of columns in SQL. ## How was this patch tested? Unit tests included. You can merge this pull request into a Git repository by running: $ git pull https://github.com/mmolimar/spark SPARK-21232 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/18447.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #18447 commit ef2b2189994f4d790560dbf8bfddf0008a520ccf Author: Mario Molina <mmoli...@gmail.com> Date: 2017-06-28T06:22:29Z New data_type SQL function commit f4bea7061d09246c8fcaa97865043a70228913e3 Author: Mario Molina <mmoli...@gmail.com> Date: 2017-06-28T06:22:57Z Tests for data_type function commit 6fad8e9f518567b503345a21fcea8a4ddf1e5d9b Author: Mario Molina <mmoli...@gmail.com> Date: 2017-06-28T06:25:18Z Python support for data_type function commit 959cccf4357abef2bd90957c5402c2a2d67c6262 Author: Mario Molina <mmoli...@gmail.com> Date: 2017-06-28T06:25:31Z R support for data_type function