[jira] [Commented] (SPARK-17101) Provide consistent format identifiers for TextFileFormat and ParquetFileFormat

2017-01-18 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15828253#comment-15828253
 ] 

Hyukjin Kwon commented on SPARK-17101:
--

I see. Thank you for correcting me. I will keep it in mind and try to be 
careful.

> Provide consistent format identifiers for TextFileFormat and ParquetFileFormat
> --
>
> Key: SPARK-17101
> URL: https://issues.apache.org/jira/browse/SPARK-17101
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Jacek Laskowski
>Priority: Trivial
>
> Define the format identifier that is used in the {{Optimized Logical Plan}} of 
> {{explain}} output for the {{text}} file format.
> {code}
> scala> spark.read.text("people.csv").cache.explain(extended = true)
> ...
> == Optimized Logical Plan ==
> InMemoryRelation [value#24], true, 1, StorageLevel(disk, memory, 
> deserialized, 1 replicas)
>+- *FileScan text [value#24] Batched: false, Format: 
> org.apache.spark.sql.execution.datasources.text.TextFileFormat@262e2c8c, 
> InputPaths: file:/Users/jacek/dev/oss/spark/people.csv, PartitionFilters: [], 
> PushedFilters: [], ReadSchema: struct
> == Physical Plan ==
> InMemoryTableScan [value#24]
>+- InMemoryRelation [value#24], true, 1, StorageLevel(disk, memory, 
> deserialized, 1 replicas)
>  +- *FileScan text [value#24] Batched: false, Format: 
> org.apache.spark.sql.execution.datasources.text.TextFileFormat@262e2c8c, 
> InputPaths: file:/Users/jacek/dev/oss/spark/people.csv, PartitionFilters: [], 
> PushedFilters: [], ReadSchema: struct
> {code}
> When you {{explain}} the {{csv}} format, you can see {{Format: CSV}}.
> {code}
> scala> spark.read.csv("people.csv").cache.explain(extended = true)
> == Parsed Logical Plan ==
> Relation[_c0#39,_c1#40,_c2#41,_c3#42] csv
> == Analyzed Logical Plan ==
> _c0: string, _c1: string, _c2: string, _c3: string
> Relation[_c0#39,_c1#40,_c2#41,_c3#42] csv
> == Optimized Logical Plan ==
> InMemoryRelation [_c0#39, _c1#40, _c2#41, _c3#42], true, 1, 
> StorageLevel(disk, memory, deserialized, 1 replicas)
>+- *FileScan csv [_c0#39,_c1#40,_c2#41,_c3#42] Batched: false, Format: 
> CSV, InputPaths: file:/Users/jacek/dev/oss/spark/people.csv, 
> PartitionFilters: [], PushedFilters: [], ReadSchema: 
> struct<_c0:string,_c1:string,_c2:string,_c3:string>
> == Physical Plan ==
> InMemoryTableScan [_c0#39, _c1#40, _c2#41, _c3#42]
>+- InMemoryRelation [_c0#39, _c1#40, _c2#41, _c3#42], true, 1, 
> StorageLevel(disk, memory, deserialized, 1 replicas)
>  +- *FileScan csv [_c0#39,_c1#40,_c2#41,_c3#42] Batched: false, 
> Format: CSV, InputPaths: file:/Users/jacek/dev/oss/spark/people.csv, 
> PartitionFilters: [], PushedFilters: [], ReadSchema: 
> struct<_c0:string,_c1:string,_c2:string,_c3:string>
> {code}
> A proper format identifier is defined for JSON, too.
> {code}
> scala> spark.read.json("people.csv").cache.explain(extended = true)
> == Parsed Logical Plan ==
> Relation[_corrupt_record#93] json
> == Analyzed Logical Plan ==
> _corrupt_record: string
> Relation[_corrupt_record#93] json
> == Optimized Logical Plan ==
> InMemoryRelation [_corrupt_record#93], true, 1, StorageLevel(disk, 
> memory, deserialized, 1 replicas)
>+- *FileScan json [_corrupt_record#93] Batched: false, Format: JSON, 
> InputPaths: file:/Users/jacek/dev/oss/spark/people.csv, PartitionFilters: [], 
> PushedFilters: [], ReadSchema: struct<_corrupt_record:string>
> == Physical Plan ==
> InMemoryTableScan [_corrupt_record#93]
>+- InMemoryRelation [_corrupt_record#93], true, 1, StorageLevel(disk, 
> memory, deserialized, 1 replicas)
>  +- *FileScan json [_corrupt_record#93] Batched: false, Format: JSON, 
> InputPaths: file:/Users/jacek/dev/oss/spark/people.csv, PartitionFilters: [], 
> PushedFilters: [], ReadSchema: struct<_corrupt_record:string>
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17101) Provide consistent format identifiers for TextFileFormat and ParquetFileFormat

2017-01-18 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15828241#comment-15828241
 ] 

Sean Owen commented on SPARK-17101:
---

Looks like the correct resolution, [~hyukjin.kwon], but the duplicates 
relationship needs to be reversed from its earlier state: this issue duplicates 
the other, while the old link between the JIRAs says the reverse. While you're 
cleaning up, fixing things like that would make the history even more useful to 
future readers.




[jira] [Commented] (SPARK-17101) Provide consistent format identifiers for TextFileFormat and ParquetFileFormat

2017-01-11 Thread Shuai Lin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15818648#comment-15818648
 ] 

Shuai Lin commented on SPARK-17101:
---

It seems this issue has already been resolved by 
https://github.com/apache/spark/pull/14680? cc [~rxin]
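
For readers following along: the likely shape of such a fix is to give each 
{{FileFormat}} a stable string identifier, e.g. by overriding {{toString}} so 
{{explain}} no longer falls back to the default {{className@hashCode}} 
rendering. The class below is a hypothetical standalone stand-in for 
illustration, not Spark's actual implementation:

```scala
// Hedged sketch, not the actual PR: a standalone stand-in for Spark's
// TextFileFormat. explain() renders the format via toString, so overriding
// it yields a stable identifier instead of the default Object.toString
// output (e.g. "TextFileFormat@262e2c8c").
class TextFileFormat {
  override def toString: String = "Text"
}

// With the override, the format renders consistently across plans.
println(new TextFileFormat) // prints "Text"
```

This mirrors how the CSV and JSON formats already render as {{Format: CSV}} 
and {{Format: JSON}} in the plans quoted below.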
