[GitHub] spark pull request #20535: [SPARK-23341][SQL] define some standard options f...
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20535#discussion_r183580907

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala ---
@@ -193,10 +196,13 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging {
     if (ds.isInstanceOf[ReadSupport] || ds.isInstanceOf[ReadSupportWithSchema]) {
       val sessionOptions = DataSourceV2Utils.extractSessionConfigs(
         ds = ds, conf = sparkSession.sessionState.conf)
+      val pathsOption = {
+        val objectMapper = new ObjectMapper()
+        DataSourceOptions.PATHS_KEY -> objectMapper.writeValueAsString(paths.toArray)
+      }
       Dataset.ofRows(sparkSession, DataSourceV2Relation.create(
-        ds, extraOptions.toMap ++ sessionOptions,
+        ds, extraOptions.toMap ++ sessionOptions + pathsOption,
--- End diff --

Basically we may have duplicated entries in the session configs and the `DataFrameReader/Writer` options, not only `path`. The rule is: `DataFrameReader/Writer` options should overwrite session configs. cc @jiangxb1987 can you submit a PR to explicitly document this in `SessionConfigSupport`?

---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
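The precedence rule stated in the comment can be sketched with plain Java maps (the class and method names below are illustrative, not Spark's API): merging the per-query reader/writer options *after* the session configs means they win on key collisions.

```java
import java.util.HashMap;
import java.util.Map;

public class OptionPrecedence {
    // Hypothetical helper illustrating the rule from the review: entries merged
    // later overwrite earlier ones, so per-query reader/writer options must be
    // merged after the session configs in order to take precedence.
    static Map<String, String> merge(Map<String, String> sessionConfigs,
                                     Map<String, String> readerOptions) {
        Map<String, String> merged = new HashMap<>(sessionConfigs); // session defaults first
        merged.putAll(readerOptions); // reader/writer options overwrite on key collision
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> session = new HashMap<>();
        session.put("compression", "snappy");
        Map<String, String> options = new HashMap<>();
        options.put("compression", "gzip");
        System.out.println(merge(session, options).get("compression")); // gzip
    }
}
```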
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20535#discussion_r183470867

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala ---
@@ -193,10 +196,13 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging {
     if (ds.isInstanceOf[ReadSupport] || ds.isInstanceOf[ReadSupportWithSchema]) {
       val sessionOptions = DataSourceV2Utils.extractSessionConfigs(
         ds = ds, conf = sparkSession.sessionState.conf)
+      val pathsOption = {
+        val objectMapper = new ObjectMapper()
+        DataSourceOptions.PATHS_KEY -> objectMapper.writeValueAsString(paths.toArray)
+      }
       Dataset.ofRows(sparkSession, DataSourceV2Relation.create(
-        ds, extraOptions.toMap ++ sessionOptions,
+        ds, extraOptions.toMap ++ sessionOptions + pathsOption,
--- End diff --

Issue an exception when `extraOptions("path")` is not empty?
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/20535
Github user gengliangwang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20535#discussion_r180714763

--- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/DataSourceOptions.java ---
@@ -17,16 +17,61 @@
 package org.apache.spark.sql.sources.v2;
 
+import java.io.IOException;
 import java.util.HashMap;
 import java.util.Locale;
 import java.util.Map;
 import java.util.Optional;
+import java.util.stream.Stream;
+
+import com.fasterxml.jackson.databind.ObjectMapper;
 
 import org.apache.spark.annotation.InterfaceStability;
 
 /**
  * An immutable string-to-string map in which keys are case-insensitive. This is used to represent
  * data source options.
+ *
+ * Each data source implementation can define its own options and teach its users how to set them.
+ * Spark doesn't have any restrictions about what options a data source should or should not have.
+ * Instead Spark defines some standard options that data sources can optionally adopt. It's possible
+ * that some options are very common and many data sources use them. However, different data
+ * sources may define the common options (key and meaning) differently, which is quite confusing to
+ * end users.
+ *
+ * The standard options defined by Spark:
+ *
+ *   Option key: path
+ *   Option value: A path string of the data files/directories, like
+ *     path1, /absolute/file2, path3/*. The path can be either relative or absolute,
+ *     point to either a file or a directory, and can contain wildcards. This option is commonly
+ *     used by file-based data sources.
+ *
+ *   Option key: paths
+ *   Option value: A JSON-array-style paths string of the data files/directories, like
+ *     ["path1", "/absolute/file2"]. The format of each path is the same as the
+ *     path option, plus it should follow the JSON string literal format, e.g. quotes
+ *     should be escaped: pa\"th means pa"th.
--- End diff --

pa\"th?
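The escaping rule the reviewer asks about follows directly from JSON string-literal syntax. Spark builds this string with Jackson's `ObjectMapper`; the hand-rolled helper below is only an illustrative stand-in showing what the `paths` option value looks like when a path contains a quote.

```java
public class PathsOptionEncoder {
    // Minimal sketch (NOT Spark's implementation): encode a list of paths as
    // the JSON-array string that the "paths" option expects, escaping
    // backslashes and quotes per JSON string-literal rules.
    static String encode(String[] paths) {
        StringBuilder sb = new StringBuilder("[");
        for (int i = 0; i < paths.length; i++) {
            if (i > 0) sb.append(",");
            sb.append('"')
              .append(paths[i].replace("\\", "\\\\").replace("\"", "\\\""))
              .append('"');
        }
        return sb.append("]").toString();
    }

    public static void main(String[] args) {
        // A quote inside a path must be escaped: pa"th is written as "pa\"th"
        System.out.println(encode(new String[]{"path1", "pa\"th"})); // prints ["path1","pa\"th"]
    }
}
```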
Github user gengliangwang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20535#discussion_r180171421

--- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/DataSourceOptions.java ---
@@ -97,4 +142,66 @@ public double getDouble(String key, double defaultValue) {
     return keyLowerCasedMap.containsKey(lcaseKey) ?
       Double.parseDouble(keyLowerCasedMap.get(lcaseKey)) : defaultValue;
   }
+
+  /**
+   * The option key for singular path.
+   */
+  public static final String PATH_KEY = "path";
+
+  /**
+   * The option key for multiple paths.
+   */
+  public static final String PATHS_KEY = "paths";
+
+  /**
+   * The option key for table name.
+   */
+  public static final String TABLE_KEY = "table";
+
+  /**
+   * The option key for database name.
+   */
+  public static final String DATABASE_KEY = "database";
+
+  /**
+   * Returns the value of the singular path option.
+   */
+  public Optional<String> path() {
+    return get(PATH_KEY);
+  }
+
+  /**
+   * Returns all the paths specified by both the singular path option and the multiple
+   * paths option.
+   */
+  public String[] paths() {
+    String[] singularPath = path().map(s -> new String[]{s}).orElseGet(() -> new String[0]);
+    Optional<String> pathsStr = get(PATHS_KEY);
+    System.out.println(pathsStr);
--- End diff --

remove println :)
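For the reading side, the intent of `paths()` is to concatenate the singular `path` option with the entries of the JSON `paths` option. A toy sketch of that behavior (Spark parses the JSON with Jackson; the hypothetical parser below only handles flat arrays of string literals with `\"` and `\\` escapes):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

public class PathsMerger {
    // Illustrative model of the documented contract: paths() returns the
    // singular "path" value (if any) followed by every entry of the JSON
    // "paths" array (if any). Names are illustrative, not Spark's API.
    static List<String> paths(Optional<String> path, Optional<String> pathsJson) {
        List<String> result = new ArrayList<>();
        path.ifPresent(result::add); // the singular "path" option comes first
        pathsJson.ifPresent(json -> result.addAll(parseJsonStringArray(json)));
        return result;
    }

    // Toy parser for a flat JSON array of string literals.
    static List<String> parseJsonStringArray(String json) {
        List<String> out = new ArrayList<>();
        StringBuilder cur = null; // non-null while inside a string literal
        for (int i = 0; i < json.length(); i++) {
            char c = json.charAt(i);
            if (cur == null) {
                if (c == '"') cur = new StringBuilder(); // string literal starts
            } else if (c == '\\') {
                cur.append(json.charAt(++i)); // keep the escaped char, drop the backslash
            } else if (c == '"') {
                out.add(cur.toString()); // string literal ends
                cur = null;
            } else {
                cur.append(c);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(paths(Optional.of("p0"), Optional.of("[\"path1\", \"pa\\\"th\"]")));
        // -> [p0, path1, pa"th]
    }
}
```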
Github user rdblue commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20535#discussion_r169708283

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala ---
@@ -190,10 +190,15 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging {
     val cls = DataSource.lookupDataSource(source, sparkSession.sessionState.conf)
     if (classOf[DataSourceV2].isAssignableFrom(cls)) {
       val ds = cls.newInstance()
-      val options = new DataSourceOptions((extraOptions ++
-        DataSourceV2Utils.extractSessionConfigs(
-          ds = ds.asInstanceOf[DataSourceV2],
-          conf = sparkSession.sessionState.conf)).asJava)
+      val sessionOptions = DataSourceV2Utils.extractSessionConfigs(
+        ds = ds.asInstanceOf[DataSourceV2],
+        conf = sparkSession.sessionState.conf)
+      val pathOption = if (paths.isEmpty) {
+        Map.empty
+      } else {
+        Map(DataSourceOptions.KEY_PATH -> paths.mkString(","))
--- End diff --

`KEY_PATH` sounds like the path for a key. It would be clearer if the name were `PATH_KEY`.
Github user rdblue commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20535#discussion_r169707938

--- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/DataSourceOptions.java ---
@@ -97,4 +130,20 @@ public double getDouble(String key, double defaultValue) {
     return keyLowerCasedMap.containsKey(lcaseKey) ?
       Double.parseDouble(keyLowerCasedMap.get(lcaseKey)) : defaultValue;
   }
+
+  public static final String KEY_PATH = "path";
+  public static final String KEY_TABLE = "table";
+  public static final String KEY_DATABASE = "database";
+
+  public Optional<String> getPath() {
--- End diff --

I think it is friendlier, when using this from Scala, to drop the `get` and use just `path` or `database`.
Github user rdblue commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20535#discussion_r169707644

--- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/DataSourceOptions.java ---
@@ -27,6 +27,39 @@
 /**
  * An immutable string-to-string map in which keys are case-insensitive. This is used to represent
  * data source options.
+ *
+ * Each data source implementation can define its own options and teach its users how to set them.
+ * Spark doesn't have any restrictions about what options a data source should or should not have.
+ * Instead Spark defines some standard options that data sources can optionally adopt. It's possible
+ * that some options are very common and many data sources use them. However, different data
+ * sources may define the common options (key and meaning) differently, which is quite confusing to
+ * end users.
+ *
+ * The standard options defined by Spark:
+ *
+ *   Option key: path
+ *   Option value: A comma-separated paths string of the data files/directories, like
+ *     path1,/absolute/file2,path3/*. Each path can be either relative or absolute, point to
+ *     either a file or a directory, and can contain wildcards. This option is commonly used
+ *     by file-based data sources.
+ *
+ *   Option key: table
+ *   Option value: A table name string representing the table name directly, without any
+ *     interpretation.
--- End diff --

I think this is clear with the examples.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20535#discussion_r166826237

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala ---
@@ -171,7 +171,8 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging {
    * @since 1.4.0
    */
   def load(path: String): DataFrame = {
-    option("path", path).load(Seq.empty: _*) // force invocation of `load(...varargs...)`
+    // force invocation of `load(...varargs...)`
+    option(DataSourceOptions.KEY_PATH, path).load(Seq.empty: _*)
--- End diff --

Makes sense, let me change it back.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20535#discussion_r166826138

--- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/DataSourceOptions.java ---
@@ -27,6 +27,39 @@
 /**
  * An immutable string-to-string map in which keys are case-insensitive. This is used to represent
  * data source options.
+ *
+ * Each data source implementation can define its own options and teach its users how to set them.
+ * Spark doesn't have any restrictions about what options a data source should or should not have.
+ * Instead Spark defines some standard options that data sources can optionally adopt. It's possible
+ * that some options are very common and many data sources use them. However, different data
+ * sources may define the common options (key and meaning) differently, which is quite confusing to
+ * end users.
+ *
+ * The standard options defined by Spark:
+ *
+ *   Option key: path
+ *   Option value: A comma-separated paths string of the data files/directories, like
+ *     path1,/absolute/file2,path3/*. Each path can be either relative or absolute, point to
+ *     either a file or a directory, and can contain wildcards. This option is commonly used
+ *     by file-based data sources.
+ *
+ *   Option key: table
+ *   Option value: A table name string representing the table name directly, without any
+ *     interpretation.
--- End diff --

It means it's a pure string; there is no parsing rule for it, unlike a SQL identifier. I put some examples below and hopefully they explain it well.
Github user rdblue commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20535#discussion_r166709990

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala ---
@@ -171,7 +171,8 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging {
    * @since 1.4.0
    */
   def load(path: String): DataFrame = {
-    option("path", path).load(Seq.empty: _*) // force invocation of `load(...varargs...)`
+    // force invocation of `load(...varargs...)`
+    option(DataSourceOptions.KEY_PATH, path).load(Seq.empty: _*)
--- End diff --

It seems odd to me to change this string. While there's no behavior change, the constant is for a key in v2's `DataSourceOptions`, not for the `DataFrameReader` API. We could change it to "PATH" and it would be perfectly fine for v2, but it would change the behavior here. Such a change is incredibly unlikely, which is why I say it is just "odd".
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20535#discussion_r166701501

--- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/DataSourceOptions.java ---
@@ -27,6 +27,39 @@
 /**
  * An immutable string-to-string map in which keys are case-insensitive. This is used to represent
  * data source options.
+ *
+ * Each data source implementation can define its own options and teach its users how to set them.
+ * Spark doesn't have any restrictions about what options a data source should or should not have.
+ * Instead Spark defines some standard options that data sources can optionally adopt. It's possible
+ * that some options are very common and many data sources use them. However, different data
+ * sources may define the common options (key and meaning) differently, which is quite confusing to
+ * end users.
+ *
+ * The standard options defined by Spark:
+ *
+ *   Option key: path
+ *   Option value: A comma-separated paths string of the data files/directories, like
+ *     path1,/absolute/file2,path3/*. Each path can be either relative or absolute, point to
+ *     either a file or a directory, and can contain wildcards. This option is commonly used
+ *     by file-based data sources.
+ *
+ *   Option key: table
+ *   Option value: A table name string representing the table name directly, without any
+ *     interpretation.
--- End diff --

What do you mean by "without any interpretation"?
GitHub user cloud-fan opened a pull request:

    https://github.com/apache/spark/pull/20535

    [SPARK-23341][SQL] define some standard options for data source v2

## What changes were proposed in this pull request?

Each data source implementation can define its own options and teach its users how to set them. Spark doesn't have any restrictions about what options a data source should or should not have. It's possible that some options are very common and many data sources use them. However, different data sources may define the common options (key and meaning) differently, which is quite confusing to end users.

This PR defines some standard options that data sources can optionally adopt: `path`, `table` and `database`.

## How was this patch tested?

A new test case.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/cloud-fan/spark options

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20535.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #20535

----

commit c9009d85d3f147eb652f141bbcb0424767e90477
Author: Wenchen Fan
Date: 2018-02-07T14:56:53Z

    define some standard options for data source v2
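The class javadoc describes `DataSourceOptions` as an immutable string-to-string map with case-insensitive keys. A minimal model of that documented behavior (class and method names below are illustrative, not Spark's API):

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class CaseInsensitiveOptions {
    // Toy model of the documented contract: keys are lower-cased on the way
    // in, so lookups match regardless of the caller's casing.
    private final Map<String, String> lowerCased = new HashMap<>();

    public CaseInsensitiveOptions(Map<String, String> options) {
        options.forEach((k, v) -> lowerCased.put(k.toLowerCase(Locale.ROOT), v));
    }

    public String get(String key, String defaultValue) {
        return lowerCased.getOrDefault(key.toLowerCase(Locale.ROOT), defaultValue);
    }

    public static void main(String[] args) {
        CaseInsensitiveOptions opts =
            new CaseInsensitiveOptions(Map.of("Path", "/data/input"));
        System.out.println(opts.get("PATH", "<none>")); // keys match case-insensitively
    }
}
```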