[jira] [Assigned] (SPARK-42690) Implement CSV/JSON parsing functions
[ https://issues.apache.org/jira/browse/SPARK-42690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ruifeng Zheng reassigned SPARK-42690:
------------------------------------

    Assignee: Yang Jie

> Implement CSV/JSON parsing functions
> ------------------------------------
>
>                 Key: SPARK-42690
>                 URL: https://issues.apache.org/jira/browse/SPARK-42690
>             Project: Spark
>          Issue Type: New Feature
>          Components: Connect
>    Affects Versions: 3.4.0
>            Reporter: Herman van Hövell
>            Assignee: Yang Jie
>            Priority: Major
>
> Implement the following two methods in DataFrameReader:
>
> {code:java}
> /**
>  * Loads a `Dataset[String]` storing JSON objects (<a href="http://jsonlines.org/">JSON Lines
>  * text format or newline-delimited JSON</a>) and returns the result as a `DataFrame`.
>  *
>  * Unless the schema is specified using `schema` function, this function goes through the
>  * input once to determine the input schema.
>  *
>  * @param jsonDataset input Dataset with one JSON object per record
>  * @since 3.4.0
>  */
> def json(jsonDataset: Dataset[String]): DataFrame
>
> /**
>  * Loads a `Dataset[String]` storing CSV rows and returns the result as a `DataFrame`.
>  *
>  * If the schema is not specified using `schema` function and `inferSchema` option is enabled,
>  * this function goes through the input once to determine the input schema.
>  *
>  * If the schema is not specified using `schema` function and `inferSchema` option is disabled,
>  * it determines the columns as string types and it reads only the first line to determine the
>  * names and the number of fields.
>  *
>  * If the enforceSchema is set to `false`, only the CSV header in the first line is checked
>  * to conform to the specified or inferred schema.
>  *
>  * @note if `header` option is set to `true` when calling this API, all lines same with
>  * the header will be removed if exists.
>  *
>  * @param csvDataset input Dataset with one CSV row per record
>  * @since 3.4.0
>  */
> def csv(csvDataset: Dataset[String]): DataFrame
> {code}
>
> For this we need a new message. We cannot use project because we don't know the schema upfront.
>
> {code:java}
> message Parse {
>   // (Required) Input relation to Parse. The input is expected to have a single text column.
>   Relation input = 1;
>   // (Required) The expected format of the text.
>   ParseFormat format = 2;
>
>   enum ParseFormat {
>     PARSE_FORMAT_UNSPECIFIED = 0;
>     PARSE_FORMAT_CSV = 1;
>     PARSE_FORMAT_JSON = 2;
>   }
> }
> {code}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
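The header rules in the `csv` scaladoc quoted above can be illustrated with a small, Spark-free model. This is only a sketch under stated assumptions, not Spark's implementation: `CsvHeaderSketch`, `ParsedCsv` and `parse` are hypothetical names, fields are split naively on commas (no quoting or escaping), and every value stays a string, as when `inferSchema` is disabled. The `_c0`, `_c1`, ... names mirror the defaults Spark assigns when there is no header.

```scala
// Simplified, Spark-free model of the CSV header rules in the quoted scaladoc.
// All names here (CsvHeaderSketch, ParsedCsv, parse) are hypothetical.
object CsvHeaderSketch {
  // Column names plus rows; every value stays a String (inferSchema disabled).
  final case class ParsedCsv(columns: Seq[String], rows: Seq[Seq[String]])

  // Naive comma split: no quoting, escaping, or multi-line records.
  private def fields(line: String): Seq[String] = line.split(",", -1).toSeq

  def parse(lines: Seq[String], header: Boolean): ParsedCsv =
    lines.headOption match {
      case None => ParsedCsv(Seq.empty, Seq.empty)
      case Some(first) =>
        if (header) {
          // Names come from the first line; every line identical to it is dropped.
          ParsedCsv(fields(first), lines.filterNot(_ == first).map(fields))
        } else {
          // Without a header, only the field count is taken from the first line.
          val width = fields(first).length
          ParsedCsv((0 until width).map(i => s"_c$i"), lines.map(fields))
        }
    }
}
```

For example, `CsvHeaderSketch.parse(Seq("id,name", "1,a", "id,name", "2,b"), header = true)` yields the columns `id` and `name` and the two rows `1,a` and `2,b`; the repeated header line in the middle of the input is removed, matching the `@note` in the scaladoc.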
[jira] [Assigned] (SPARK-42690) Implement CSV/JSON parsing functions
[ https://issues.apache.org/jira/browse/SPARK-42690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-42690:
------------------------------------

    Assignee: Apache Spark
[jira] [Assigned] (SPARK-42690) Implement CSV/JSON parsing functions
[ https://issues.apache.org/jira/browse/SPARK-42690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-42690:
------------------------------------

    Assignee:     (was: Apache Spark)