Hi Michael,

Thanks a lot for your help. See the explain output below for csv and text. Do you see anything worth investigating?
scala> spark.read.csv("people.csv").cache.explain(extended = true)
== Parsed Logical Plan ==
Relation[_c0#39,_c1#40,_c2#41,_c3#42] csv

== Analyzed Logical Plan ==
_c0: string, _c1: string, _c2: string, _c3: string
Relation[_c0#39,_c1#40,_c2#41,_c3#42] csv

== Optimized Logical Plan ==
InMemoryRelation [_c0#39, _c1#40, _c2#41, _c3#42], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
   +- *FileScan csv [_c0#39,_c1#40,_c2#41,_c3#42] Batched: false, Format: CSV, InputPaths: file:/Users/jacek/dev/oss/spark/people.csv, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<_c0:string,_c1:string,_c2:string,_c3:string>

== Physical Plan ==
InMemoryTableScan [_c0#39, _c1#40, _c2#41, _c3#42]
   +- InMemoryRelation [_c0#39, _c1#40, _c2#41, _c3#42], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
         +- *FileScan csv [_c0#39,_c1#40,_c2#41,_c3#42] Batched: false, Format: CSV, InputPaths: file:/Users/jacek/dev/oss/spark/people.csv, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<_c0:string,_c1:string,_c2:string,_c3:string>

scala> spark.read.text("people.csv").cache.explain(extended = true)
== Parsed Logical Plan ==
Relation[value#24] text

== Analyzed Logical Plan ==
value: string
Relation[value#24] text

== Optimized Logical Plan ==
InMemoryRelation [value#24], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
   +- *FileScan text [value#24] Batched: false, Format: org.apache.spark.sql.execution.datasources.text.TextFileFormat@262e2c8c, InputPaths: file:/Users/jacek/dev/oss/spark/people.csv, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<value:string>

== Physical Plan ==
InMemoryTableScan [value#24]
   +- InMemoryRelation [value#24], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
         +- *FileScan text [value#24] Batched: false, Format: org.apache.spark.sql.execution.datasources.text.TextFileFormat@262e2c8c, InputPaths: file:/Users/jacek/dev/oss/spark/people.csv, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<value:string>

The only thing I could find "interesting" is that TextFileFormat does not print TEXT the way CSV prints CSV. Anything special you see?

Pozdrawiam,
Jacek Laskowski
----
https://medium.com/@jaceklaskowski/
Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


On Tue, Aug 16, 2016 at 7:24 PM, Michael Armbrust <mich...@databricks.com> wrote:
> Try running explain on each of these. My guess would be that caching is broken
> in some cases.
>
> On Tue, Aug 16, 2016 at 6:05 PM, Jacek Laskowski <ja...@japila.pl> wrote:
>>
>> Hi,
>>
>> Can anyone explain why spark.read.csv("people.csv").cache.show ends up
>> with a WARN while spark.read.text("people.csv").cache.show does not?
>> It happens in 2.0 and today's build.
>>
>> scala> sc.version
>> res5: String = 2.1.0-SNAPSHOT
>>
>> scala> spark.read.csv("people.csv").cache.show
>> +---------+---------+-------+----+
>> |      _c0|      _c1|    _c2| _c3|
>> +---------+---------+-------+----+
>> |kolumna 1|kolumna 2|kolumn3|size|
>> |    Jacek| Warszawa| Polska|  40|
>> +---------+---------+-------+----+
>>
>> scala> spark.read.csv("people.csv").cache.show
>> 16/08/16 18:01:52 WARN CacheManager: Asked to cache already cached data.
>> +---------+---------+-------+----+
>> |      _c0|      _c1|    _c2| _c3|
>> +---------+---------+-------+----+
>> |kolumna 1|kolumna 2|kolumn3|size|
>> |    Jacek| Warszawa| Polska|  40|
>> +---------+---------+-------+----+
>>
>> scala> spark.read.text("people.csv").cache.show
>> +--------------------+
>> |               value|
>> +--------------------+
>> |kolumna 1,kolumna...|
>> |Jacek,Warszawa,Po...|
>> +--------------------+
>>
>> scala> spark.read.text("people.csv").cache.show
>> +--------------------+
>> |               value|
>> +--------------------+
>> |kolumna 1,kolumna...|
>> |Jacek,Warszawa,Po...|
>> +--------------------+
>>
>> Pozdrawiam,
>> Jacek Laskowski
>> ----
>> https://medium.com/@jaceklaskowski/
>> Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
>> Follow me at https://twitter.com/jaceklaskowski
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
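PS. The way I understand the WARN (a sketch of my mental model, not Spark's actual code): CacheManager keys cached queries by the query's logical plan, so calling spark.read.csv("people.csv").cache twice produces two equal plans, and the second cache request finds an existing entry and logs the warning instead of caching again. A toy model in plain Scala, with `LogicalPlan` and `ToyCacheManager` as hypothetical stand-ins:

```scala
// Toy stand-in for a logical plan; case-class equality plays the role
// of Spark's plan comparison (same format + same path => "same" plan).
case class LogicalPlan(format: String, path: String)

// Toy stand-in for Spark's CacheManager: remembers which plans were
// cached and reports a WARN-like message on a duplicate request.
class ToyCacheManager {
  private var cached = Set.empty[LogicalPlan]

  def cacheQuery(plan: LogicalPlan): String =
    if (cached.contains(plan)) {
      "WARN CacheManager: Asked to cache already cached data."
    } else {
      cached += plan
      "cached"
    }
}

val cm = new ToyCacheManager
println(cm.cacheQuery(LogicalPlan("csv", "people.csv")))  // cached
println(cm.cacheQuery(LogicalPlan("csv", "people.csv")))  // hits the existing entry, warns
```

Under this model the interesting question becomes why the two text plans would not compare equal while the two csv plans do; reusing one reference (val people = spark.read.csv("people.csv").cache) sidesteps the warning either way.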