[GitHub] spark pull request #20535: [SPARK-23341][SQL] define some standard options f...

2018-04-23 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/20535#discussion_r183580907
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala ---
@@ -193,10 +196,13 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging {
       if (ds.isInstanceOf[ReadSupport] || ds.isInstanceOf[ReadSupportWithSchema]) {
         val sessionOptions = DataSourceV2Utils.extractSessionConfigs(
           ds = ds, conf = sparkSession.sessionState.conf)
+        val pathsOption = {
+          val objectMapper = new ObjectMapper()
+          DataSourceOptions.PATHS_KEY -> objectMapper.writeValueAsString(paths.toArray)
+        }
         Dataset.ofRows(sparkSession, DataSourceV2Relation.create(
-          ds, extraOptions.toMap ++ sessionOptions,
+          ds, extraOptions.toMap ++ sessionOptions + pathsOption,
--- End diff --

Basically we may have duplicated entries in session configs and 
`DataFrameReader/Writer` options, not only for `path`. The rule is that 
`DataFrameReader/Writer` options should overwrite session configs.

cc @jiangxb1987 can you submit a PR to explicitly document it in 
`SessionConfigSupport`?
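To make the precedence rule concrete, here is a minimal stdlib-only Java sketch (class and method names are mine, not Spark's): in a left-to-right merge, whichever map is applied last wins on duplicate keys, so reader/writer options must be merged after the session configs.

```java
import java.util.HashMap;
import java.util.Map;

public class OptionPrecedence {
    // Merge maps so that later maps overwrite earlier ones, mirroring the
    // rule that DataFrameReader/Writer options overwrite session configs.
    @SafeVarargs
    static Map<String, String> merge(Map<String, String>... maps) {
        Map<String, String> result = new HashMap<>();
        for (Map<String, String> m : maps) {
            result.putAll(m); // entries from later maps win on duplicate keys
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> sessionConfigs = new HashMap<>();
        sessionConfigs.put("path", "/session/default");
        sessionConfigs.put("compression", "snappy");

        Map<String, String> readerOptions = new HashMap<>();
        readerOptions.put("path", "/user/supplied");

        // Reader options are applied last, so they overwrite session configs.
        Map<String, String> merged = merge(sessionConfigs, readerOptions);
        System.out.println(merged.get("path"));        // /user/supplied
        System.out.println(merged.get("compression")); // snappy
    }
}
```

This is also why the argument order in expressions like `extraOptions.toMap ++ sessionOptions` matters: with Scala's `++`, the right-hand operand wins on duplicate keys.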


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20535: [SPARK-23341][SQL] define some standard options f...

2018-04-23 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/20535#discussion_r183470867
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala ---
@@ -193,10 +196,13 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging {
       if (ds.isInstanceOf[ReadSupport] || ds.isInstanceOf[ReadSupportWithSchema]) {
         val sessionOptions = DataSourceV2Utils.extractSessionConfigs(
           ds = ds, conf = sparkSession.sessionState.conf)
+        val pathsOption = {
+          val objectMapper = new ObjectMapper()
+          DataSourceOptions.PATHS_KEY -> objectMapper.writeValueAsString(paths.toArray)
+        }
         Dataset.ofRows(sparkSession, DataSourceV2Relation.create(
-          ds, extraOptions.toMap ++ sessionOptions,
+          ds, extraOptions.toMap ++ sessionOptions + pathsOption,
--- End diff --

Should we issue an exception when `extraOptions("path")` is not empty?


---




[GitHub] spark pull request #20535: [SPARK-23341][SQL] define some standard options f...

2018-04-17 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/20535


---




[GitHub] spark pull request #20535: [SPARK-23341][SQL] define some standard options f...

2018-04-11 Thread gengliangwang
Github user gengliangwang commented on a diff in the pull request:

https://github.com/apache/spark/pull/20535#discussion_r180714763
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/sources/v2/DataSourceOptions.java 
---
@@ -17,16 +17,61 @@
 
 package org.apache.spark.sql.sources.v2;
 
+import java.io.IOException;
 import java.util.HashMap;
 import java.util.Locale;
 import java.util.Map;
 import java.util.Optional;
+import java.util.stream.Stream;
+
+import com.fasterxml.jackson.databind.ObjectMapper;
 
 import org.apache.spark.annotation.InterfaceStability;
 
 /**
  * An immutable string-to-string map in which keys are case-insensitive. 
This is used to represent
  * data source options.
+ *
+ * Each data source implementation can define its own options and teach its
+ * users how to set them. Spark doesn't have any restrictions about what
+ * options a data source should or should not have. Instead, Spark defines
+ * some standard options that data sources can optionally adopt. It's
+ * possible that some options are very common and many data sources use them.
+ * However, different data sources may define the common options (key and
+ * meaning) differently, which is quite confusing to end users.
+ *
+ * The standard options defined by Spark:
+ *
+ * - path: A path string of the data files/directories, like
+ *   path1, /absolute/file2, path3/*. The path can either be relative or
+ *   absolute, point to either a file or a directory, and can contain
+ *   wildcards. This option is commonly used by file-based data sources.
+ * - paths: A JSON array style paths string of the data files/directories,
+ *   like ["path1", "/absolute/file2"]. The format of each path is the same
+ *   as the path option, plus it should follow JSON string literal format,
+ *   e.g. quotes should be escaped: pa\"th means pa"th.
--- End diff --

pa\"th?
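The escaping rule being questioned here is ordinary JSON string-literal escaping. A stdlib-only Java sketch (hand-rolled purely for illustration — the PR itself uses Jackson's `ObjectMapper.writeValueAsString`) shows how a quote inside a path ends up escaped in the option value:

```java
public class JsonEscape {
    // Minimal JSON string-array writer, for illustration only. It escapes
    // backslashes and double quotes, which is enough for the path examples
    // discussed here; a real implementation (Jackson) handles all of JSON.
    static String toJsonArray(String[] paths) {
        StringBuilder sb = new StringBuilder("[");
        for (int i = 0; i < paths.length; i++) {
            if (i > 0) sb.append(",");
            sb.append('"')
              .append(paths[i].replace("\\", "\\\\").replace("\"", "\\\""))
              .append('"');
        }
        return sb.append("]").toString();
    }

    public static void main(String[] args) {
        // The quote inside pa"th is escaped, so the option value carries pa\"th.
        System.out.println(toJsonArray(new String[]{"pa\"th", "/absolute/file2"}));
        // -> ["pa\"th","/absolute/file2"]
    }
}
```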


---




[GitHub] spark pull request #20535: [SPARK-23341][SQL] define some standard options f...

2018-04-09 Thread gengliangwang
Github user gengliangwang commented on a diff in the pull request:

https://github.com/apache/spark/pull/20535#discussion_r180171421
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/sources/v2/DataSourceOptions.java 
---
@@ -97,4 +142,66 @@ public double getDouble(String key, double defaultValue) {
     return keyLowerCasedMap.containsKey(lcaseKey) ?
       Double.parseDouble(keyLowerCasedMap.get(lcaseKey)) : defaultValue;
   }
+
+  /**
+   * The option key for singular path.
+   */
+  public static final String PATH_KEY = "path";
+
+  /**
+   * The option key for multiple paths.
+   */
+  public static final String PATHS_KEY = "paths";
+
+  /**
+   * The option key for table name.
+   */
+  public static final String TABLE_KEY = "table";
+
+  /**
+   * The option key for database name.
+   */
+  public static final String DATABASE_KEY = "database";
+
+  /**
+   * Returns the value of the singular path option.
+   */
+  public Optional<String> path() {
+    return get(PATH_KEY);
+  }
+
+  /**
+   * Returns all the paths specified by both the singular path option and
+   * the multiple paths option.
+   */
+  public String[] paths() {
+    String[] singularPath = path().map(s -> new String[]{s}).orElseGet(() -> new String[0]);
+    Optional<String> pathsStr = get(PATHS_KEY);
+    System.out.println(pathsStr);
--- End diff --

remove println :)
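Aside from the leftover `println`, the combining logic in `paths()` can be sketched standalone as follows (a Java approximation of the snippet above; `parsedPaths` stands in for the result of Jackson's `objectMapper.readValue` on the raw `paths` option string):

```java
import java.util.Arrays;
import java.util.Optional;
import java.util.stream.Stream;

public class PathsOption {
    // Combine the singular "path" option with the multiple "paths" option:
    // an absent singular path contributes an empty array, and the two arrays
    // are concatenated, singular path first.
    static String[] allPaths(Optional<String> path, String[] parsedPaths) {
        String[] singular = path.map(s -> new String[]{s})
                                .orElseGet(() -> new String[0]);
        return Stream.concat(Arrays.stream(singular), Arrays.stream(parsedPaths))
                     .toArray(String[]::new);
    }

    public static void main(String[] args) {
        String[] combined = allPaths(
            Optional.of("path1"), new String[]{"/absolute/file2", "path3/*"});
        System.out.println(Arrays.toString(combined));
        // [path1, /absolute/file2, path3/*]
    }
}
```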


---




[GitHub] spark pull request #20535: [SPARK-23341][SQL] define some standard options f...

2018-02-21 Thread rdblue
Github user rdblue commented on a diff in the pull request:

https://github.com/apache/spark/pull/20535#discussion_r169708283
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala ---
@@ -190,10 +190,15 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging {
     val cls = DataSource.lookupDataSource(source, sparkSession.sessionState.conf)
     if (classOf[DataSourceV2].isAssignableFrom(cls)) {
       val ds = cls.newInstance()
-      val options = new DataSourceOptions((extraOptions ++
-        DataSourceV2Utils.extractSessionConfigs(
-          ds = ds.asInstanceOf[DataSourceV2],
-          conf = sparkSession.sessionState.conf)).asJava)
+      val sessionOptions = DataSourceV2Utils.extractSessionConfigs(
+        ds = ds.asInstanceOf[DataSourceV2],
+        conf = sparkSession.sessionState.conf)
+      val pathOption = if (paths.isEmpty) {
+        Map.empty
+      } else {
+        Map(DataSourceOptions.KEY_PATH -> paths.mkString(","))
--- End diff --

`KEY_PATH` sounds like the path for a key. It would be more clear if the 
name was `PATH_KEY`.


---




[GitHub] spark pull request #20535: [SPARK-23341][SQL] define some standard options f...

2018-02-21 Thread rdblue
Github user rdblue commented on a diff in the pull request:

https://github.com/apache/spark/pull/20535#discussion_r169707938
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/sources/v2/DataSourceOptions.java 
---
@@ -97,4 +130,20 @@ public double getDouble(String key, double defaultValue) {
     return keyLowerCasedMap.containsKey(lcaseKey) ?
       Double.parseDouble(keyLowerCasedMap.get(lcaseKey)) : defaultValue;
   }
+
+  public static final String KEY_PATH = "path";
+  public static final String KEY_TABLE = "table";
+  public static final String KEY_DATABASE = "database";
+
+  public Optional<String> getPath() {
--- End diff --

I think it is more friendly when using this in Scala to drop the `get` and 
use just `path` or `database`.


---




[GitHub] spark pull request #20535: [SPARK-23341][SQL] define some standard options f...

2018-02-21 Thread rdblue
Github user rdblue commented on a diff in the pull request:

https://github.com/apache/spark/pull/20535#discussion_r169707644
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/sources/v2/DataSourceOptions.java 
---
@@ -27,6 +27,39 @@
 /**
  * An immutable string-to-string map in which keys are case-insensitive. 
This is used to represent
  * data source options.
+ *
+ * Each data source implementation can define its own options and teach its
+ * users how to set them. Spark doesn't have any restrictions about what
+ * options a data source should or should not have. Instead, Spark defines
+ * some standard options that data sources can optionally adopt. It's
+ * possible that some options are very common and many data sources use them.
+ * However, different data sources may define the common options (key and
+ * meaning) differently, which is quite confusing to end users.
+ *
+ * The standard options defined by Spark:
+ *
+ * - path: A comma separated paths string of the data files/directories, like
+ *   path1,/absolute/file2,path3/*. Each path can either be relative or
+ *   absolute, point to either a file or a directory, and can contain
+ *   wildcards. This option is commonly used by file-based data sources.
+ * - table: A table name string representing the table name directly
+ *   without any interpretation.
--- End diff --

I think this is clear with the examples.
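With the comma separated encoding described in the quoted doc, a consumer would recover the individual paths with a plain split (a stdlib-only sketch, not Spark code):

```java
import java.util.Arrays;

public class SplitPaths {
    public static void main(String[] args) {
        // This early revision of the "path" option packed multiple paths
        // into one comma separated string; splitting recovers them.
        String option = "path1,/absolute/file2,path3/*";
        String[] paths = option.split(",");
        System.out.println(Arrays.toString(paths));
        // [path1, /absolute/file2, path3/*]
    }
}
```

Note that this encoding cannot represent a path that itself contains a comma, which is the kind of ambiguity the later JSON-array `paths` option avoids.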


---




[GitHub] spark pull request #20535: [SPARK-23341][SQL] define some standard options f...

2018-02-07 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/20535#discussion_r166826237
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala ---
@@ -171,7 +171,8 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging {
     * @since 1.4.0
     */
   def load(path: String): DataFrame = {
-    option("path", path).load(Seq.empty: _*) // force invocation of `load(...varargs...)`
+    // force invocation of `load(...varargs...)`
+    option(DataSourceOptions.KEY_PATH, path).load(Seq.empty: _*)
--- End diff --

makes sense, let me change it back.


---




[GitHub] spark pull request #20535: [SPARK-23341][SQL] define some standard options f...

2018-02-07 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/20535#discussion_r166826138
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/sources/v2/DataSourceOptions.java 
---
@@ -27,6 +27,39 @@
 /**
  * An immutable string-to-string map in which keys are case-insensitive. 
This is used to represent
  * data source options.
+ *
+ * Each data source implementation can define its own options and teach its
+ * users how to set them. Spark doesn't have any restrictions about what
+ * options a data source should or should not have. Instead, Spark defines
+ * some standard options that data sources can optionally adopt. It's
+ * possible that some options are very common and many data sources use them.
+ * However, different data sources may define the common options (key and
+ * meaning) differently, which is quite confusing to end users.
+ *
+ * The standard options defined by Spark:
+ *
+ * - path: A comma separated paths string of the data files/directories, like
+ *   path1,/absolute/file2,path3/*. Each path can either be relative or
+ *   absolute, point to either a file or a directory, and can contain
+ *   wildcards. This option is commonly used by file-based data sources.
+ * - table: A table name string representing the table name directly
+ *   without any interpretation.
--- End diff --

It means it's a pure string; there is no parsing rule for it like a SQL 
identifier. I put some examples below and hopefully they can explain it well.


---




[GitHub] spark pull request #20535: [SPARK-23341][SQL] define some standard options f...

2018-02-07 Thread rdblue
Github user rdblue commented on a diff in the pull request:

https://github.com/apache/spark/pull/20535#discussion_r166709990
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala ---
@@ -171,7 +171,8 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging {
     * @since 1.4.0
     */
   def load(path: String): DataFrame = {
-    option("path", path).load(Seq.empty: _*) // force invocation of `load(...varargs...)`
+    // force invocation of `load(...varargs...)`
+    option(DataSourceOptions.KEY_PATH, path).load(Seq.empty: _*)
--- End diff --

It seems odd to me to change this string. While there's no behavior change, 
the constant is for a key in v2's DataSourceOptions, not for the 
DataFrameReader API. We could change it to "PATH" and it would be perfectly 
fine for v2, but would change the behavior here. Such a change is incredibly 
unlikely, which is why I say it is just "odd".


---




[GitHub] spark pull request #20535: [SPARK-23341][SQL] define some standard options f...

2018-02-07 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/20535#discussion_r166701501
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/sources/v2/DataSourceOptions.java 
---
@@ -27,6 +27,39 @@
 /**
  * An immutable string-to-string map in which keys are case-insensitive. 
This is used to represent
  * data source options.
+ *
+ * Each data source implementation can define its own options and teach its
+ * users how to set them. Spark doesn't have any restrictions about what
+ * options a data source should or should not have. Instead, Spark defines
+ * some standard options that data sources can optionally adopt. It's
+ * possible that some options are very common and many data sources use them.
+ * However, different data sources may define the common options (key and
+ * meaning) differently, which is quite confusing to end users.
+ *
+ * The standard options defined by Spark:
+ *
+ * - path: A comma separated paths string of the data files/directories, like
+ *   path1,/absolute/file2,path3/*. Each path can either be relative or
+ *   absolute, point to either a file or a directory, and can contain
+ *   wildcards. This option is commonly used by file-based data sources.
+ * - table: A table name string representing the table name directly
+ *   without any interpretation.
--- End diff --

what do you mean by "without any interpretation"?


---




[GitHub] spark pull request #20535: [SPARK-23341][SQL] define some standard options f...

2018-02-07 Thread cloud-fan
GitHub user cloud-fan opened a pull request:

https://github.com/apache/spark/pull/20535

[SPARK-23341][SQL] define some standard options for data source v2

## What changes were proposed in this pull request?

Each data source implementation can define its own options and teach its 
users how to set them. Spark doesn't have any restrictions about what options a 
data source should or should not have. It's possible that some options are very 
common and many data sources use them. However different data sources may 
define the common options(key and meaning) differently, which is quite 
confusing to end users.

This PR defines some standard options that data sources can optionally 
adopt: path, table and database.

## How was this patch tested?

a new test case.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/cloud-fan/spark options

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20535.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20535


commit c9009d85d3f147eb652f141bbcb0424767e90477
Author: Wenchen Fan 
Date:   2018-02-07T14:56:53Z

define some standard options for data source v2




---
