[jira] [Commented] (SPARK-17101) Provide consistent format identifiers for TextFileFormat and ParquetFileFormat
[ https://issues.apache.org/jira/browse/SPARK-17101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15828253#comment-15828253 ]

Hyukjin Kwon commented on SPARK-17101:
--------------------------------------

I see. Thank you for correcting me. I will keep it in mind and try to be careful.

> Provide consistent format identifiers for TextFileFormat and ParquetFileFormat
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-17101
>                 URL: https://issues.apache.org/jira/browse/SPARK-17101
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.1.0
>            Reporter: Jacek Laskowski
>            Priority: Trivial
>
> Define the format identifier that is used in the {{Optimized Logical Plan}} shown by {{explain}} for the {{text}} file format.
> {code}
> scala> spark.read.text("people.csv").cache.explain(extended = true)
> ...
> == Optimized Logical Plan ==
> InMemoryRelation [value#24], true, 1, StorageLevel(disk, memory, deserialized, 1 replicas)
>    +- *FileScan text [value#24] Batched: false, Format: org.apache.spark.sql.execution.datasources.text.TextFileFormat@262e2c8c, InputPaths: file:/Users/jacek/dev/oss/spark/people.csv, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<value:string>
> == Physical Plan ==
> InMemoryTableScan [value#24]
>    +- InMemoryRelation [value#24], true, 1, StorageLevel(disk, memory, deserialized, 1 replicas)
>          +- *FileScan text [value#24] Batched: false, Format: org.apache.spark.sql.execution.datasources.text.TextFileFormat@262e2c8c, InputPaths: file:/Users/jacek/dev/oss/spark/people.csv, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<value:string>
> {code}
> When you {{explain}} the csv format, you can see {{Format: CSV}}.
> {code}
> scala> spark.read.csv("people.csv").cache.explain(extended = true)
> == Parsed Logical Plan ==
> Relation[_c0#39,_c1#40,_c2#41,_c3#42] csv
> == Analyzed Logical Plan ==
> _c0: string, _c1: string, _c2: string, _c3: string
> Relation[_c0#39,_c1#40,_c2#41,_c3#42] csv
> == Optimized Logical Plan ==
> InMemoryRelation [_c0#39, _c1#40, _c2#41, _c3#42], true, 1, StorageLevel(disk, memory, deserialized, 1 replicas)
>    +- *FileScan csv [_c0#39,_c1#40,_c2#41,_c3#42] Batched: false, Format: CSV, InputPaths: file:/Users/jacek/dev/oss/spark/people.csv, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<_c0:string,_c1:string,_c2:string,_c3:string>
> == Physical Plan ==
> InMemoryTableScan [_c0#39, _c1#40, _c2#41, _c3#42]
>    +- InMemoryRelation [_c0#39, _c1#40, _c2#41, _c3#42], true, 1, StorageLevel(disk, memory, deserialized, 1 replicas)
>          +- *FileScan csv [_c0#39,_c1#40,_c2#41,_c3#42] Batched: false, Format: CSV, InputPaths: file:/Users/jacek/dev/oss/spark/people.csv, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<_c0:string,_c1:string,_c2:string,_c3:string>
> {code}
> A custom format identifier is defined for JSON, too.
> {code}
> scala> spark.read.json("people.csv").cache.explain(extended = true)
> == Parsed Logical Plan ==
> Relation[_corrupt_record#93] json
> == Analyzed Logical Plan ==
> _corrupt_record: string
> Relation[_corrupt_record#93] json
> == Optimized Logical Plan ==
> InMemoryRelation [_corrupt_record#93], true, 1, StorageLevel(disk, memory, deserialized, 1 replicas)
>    +- *FileScan json [_corrupt_record#93] Batched: false, Format: JSON, InputPaths: file:/Users/jacek/dev/oss/spark/people.csv, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<_corrupt_record:string>
> == Physical Plan ==
> InMemoryTableScan [_corrupt_record#93]
>    +- InMemoryRelation [_corrupt_record#93], true, 1, StorageLevel(disk, memory, deserialized, 1 replicas)
>          +- *FileScan json [_corrupt_record#93] Batched: false, Format: JSON, InputPaths: file:/Users/jacek/dev/oss/spark/people.csv, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<_corrupt_record:string>
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
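[Editor's note] The contrast the report draws between the plans can be sketched in plain Scala. This is a hedged illustration, not Spark's actual internals: the `FileFormat` trait and the class bodies below are simplified stand-ins. It shows only the JVM mechanism behind the symptom: a class with no `toString` override prints as `ClassName@hexHash` (the text plan), while one that overrides `toString` prints a readable identifier (the `Format: CSV` in the csv plan).

```scala
// Hedged sketch: illustrative names, not Spark's real FileFormat hierarchy.
trait FileFormat {
  def shortName: String
  // No toString defined here, so implementations inherit java.lang.Object's
  // default rendering: fully.qualified.ClassName@hexHash.
}

// Mirrors the reported text behaviour: no toString override, so any
// plan-style output falls back to the Object default.
class TextFileFormat extends FileFormat {
  def shortName: String = "text"
}

// Mirrors what the csv/json plans suggest: a toString override supplies
// the readable identifier that appears as "Format: CSV".
class CsvFileFormat extends FileFormat {
  def shortName: String = "csv"
  override def toString: String = "CSV"
}

object FormatDemo {
  def main(args: Array[String]): Unit = {
    println(new TextFileFormat) // e.g. TextFileFormat@262e2c8c
    println(new CsvFileFormat)  // CSV
  }
}
```

Under this reading, the fix is to give `TextFileFormat` (and `ParquetFileFormat`) the same kind of `toString` override the other formats already have.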
[jira] [Commented] (SPARK-17101) Provide consistent format identifiers for TextFileFormat and ParquetFileFormat
[ https://issues.apache.org/jira/browse/SPARK-17101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15828241#comment-15828241 ]

Sean Owen commented on SPARK-17101:
-----------------------------------

Looks like the correct resolution [~hyukjin.kwon], but it looks like the duplicates relationship needs to be reversed from its earlier state. This one duplicates the other, but the old relationship between the JIRAs shows the reverse. While you're cleaning up, if you're able to fix up things like that, it would be even more useful to future readers.
[jira] [Commented] (SPARK-17101) Provide consistent format identifiers for TextFileFormat and ParquetFileFormat
[ https://issues.apache.org/jira/browse/SPARK-17101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15818648#comment-15818648 ]

Shuai Lin commented on SPARK-17101:
-----------------------------------

Seems this issue has already been resolved by https://github.com/apache/spark/pull/14680 ? cc [~rxin]