[ https://issues.apache.org/jira/browse/SPARK-17101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15818648#comment-15818648 ]

Shuai Lin commented on SPARK-17101:
-----------------------------------

Seems this issue has already been resolved by
https://github.com/apache/spark/pull/14680? cc [~rxin]
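
For context, the {{TextFileFormat@262e2c8c}} string in the plans below is just the default JVM {{Object#toString}} (class name plus hex hash code); formats like CSV and JSON print a stable name because they override {{toString}}. A minimal standalone sketch of the difference (simplified stand-in classes, not the actual Spark sources):

```scala
// Simplified stand-ins for Spark's FileFormat implementations (sketch only).
trait FileFormat

// No toString override: falls back to java.lang.Object's
// "ClassName@hexHashCode", which is what leaks into the explain output.
class TextFileFormat extends FileFormat

// A CSV-style format with an override, so a plan would print "Format: CSV".
class CsvFileFormat extends FileFormat {
  override def toString: String = "CSV"
}

object FormatIdDemo extends App {
  println(new TextFileFormat) // e.g. TextFileFormat@262e2c8c
  println(new CsvFileFormat)  // CSV
}
```

Giving {{TextFileFormat}} (and {{ParquetFileFormat}}) such an override would make the {{explain}} output consistent with the CSV and JSON formats.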

> Provide consistent format identifiers for TextFileFormat and ParquetFileFormat
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-17101
>                 URL: https://issues.apache.org/jira/browse/SPARK-17101
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.1.0
>            Reporter: Jacek Laskowski
>            Priority: Trivial
>
> Define the format identifier used in the {{Optimized Logical Plan}} shown by
> {{explain}} for the {{text}} file format.
> {code}
> scala> spark.read.text("people.csv").cache.explain(extended = true)
> ...
> == Optimized Logical Plan ==
> InMemoryRelation [value#24], true, 10000, StorageLevel(disk, memory, 
> deserialized, 1 replicas)
>    +- *FileScan text [value#24] Batched: false, Format: 
> org.apache.spark.sql.execution.datasources.text.TextFileFormat@262e2c8c, 
> InputPaths: file:/Users/jacek/dev/oss/spark/people.csv, PartitionFilters: [], 
> PushedFilters: [], ReadSchema: struct<value:string>
> == Physical Plan ==
> InMemoryTableScan [value#24]
>    +- InMemoryRelation [value#24], true, 10000, StorageLevel(disk, memory, 
> deserialized, 1 replicas)
>          +- *FileScan text [value#24] Batched: false, Format: 
> org.apache.spark.sql.execution.datasources.text.TextFileFormat@262e2c8c, 
> InputPaths: file:/Users/jacek/dev/oss/spark/people.csv, PartitionFilters: [], 
> PushedFilters: [], ReadSchema: struct<value:string>
> {code}
> When you {{explain}} the {{csv}} format, you can see {{Format: CSV}}.
> {code}
> scala> spark.read.csv("people.csv").cache.explain(extended = true)
> == Parsed Logical Plan ==
> Relation[_c0#39,_c1#40,_c2#41,_c3#42] csv
> == Analyzed Logical Plan ==
> _c0: string, _c1: string, _c2: string, _c3: string
> Relation[_c0#39,_c1#40,_c2#41,_c3#42] csv
> == Optimized Logical Plan ==
> InMemoryRelation [_c0#39, _c1#40, _c2#41, _c3#42], true, 10000, 
> StorageLevel(disk, memory, deserialized, 1 replicas)
>    +- *FileScan csv [_c0#39,_c1#40,_c2#41,_c3#42] Batched: false, Format: 
> CSV, InputPaths: file:/Users/jacek/dev/oss/spark/people.csv, 
> PartitionFilters: [], PushedFilters: [], ReadSchema: 
> struct<_c0:string,_c1:string,_c2:string,_c3:string>
> == Physical Plan ==
> InMemoryTableScan [_c0#39, _c1#40, _c2#41, _c3#42]
>    +- InMemoryRelation [_c0#39, _c1#40, _c2#41, _c3#42], true, 10000, 
> StorageLevel(disk, memory, deserialized, 1 replicas)
>          +- *FileScan csv [_c0#39,_c1#40,_c2#41,_c3#42] Batched: false, 
> Format: CSV, InputPaths: file:/Users/jacek/dev/oss/spark/people.csv, 
> PartitionFilters: [], PushedFilters: [], ReadSchema: 
> struct<_c0:string,_c1:string,_c2:string,_c3:string>
> {code}
> A custom format identifier is defined for JSON, too.
> {code}
> scala> spark.read.json("people.csv").cache.explain(extended = true)
> == Parsed Logical Plan ==
> Relation[_corrupt_record#93] json
> == Analyzed Logical Plan ==
> _corrupt_record: string
> Relation[_corrupt_record#93] json
> == Optimized Logical Plan ==
> InMemoryRelation [_corrupt_record#93], true, 10000, StorageLevel(disk, 
> memory, deserialized, 1 replicas)
>    +- *FileScan json [_corrupt_record#93] Batched: false, Format: JSON, 
> InputPaths: file:/Users/jacek/dev/oss/spark/people.csv, PartitionFilters: [], 
> PushedFilters: [], ReadSchema: struct<_corrupt_record:string>
> == Physical Plan ==
> InMemoryTableScan [_corrupt_record#93]
>    +- InMemoryRelation [_corrupt_record#93], true, 10000, StorageLevel(disk, 
> memory, deserialized, 1 replicas)
>          +- *FileScan json [_corrupt_record#93] Batched: false, Format: JSON, 
> InputPaths: file:/Users/jacek/dev/oss/spark/people.csv, PartitionFilters: [], 
> PushedFilters: [], ReadSchema: struct<_corrupt_record:string>
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
