Hi Michael,
Thanks a lot for your help. See the explain output below for csv and
text. Do you see anything worth investigating?
scala> spark.read.csv("people.csv").cache.explain(extended = true)
== Parsed Logical Plan ==
Relation[_c0#39,_c1#40,_c2#41,_c3#42] csv
== Analyzed Logical Plan ==
_c0: string, _c1: string, _c2: string, _c3: string
Relation[_c0#39,_c1#40,_c2#41,_c3#42] csv
== Optimized Logical Plan ==
InMemoryRelation [_c0#39, _c1#40, _c2#41, _c3#42], true, 1,
StorageLevel(disk, memory, deserialized, 1 replicas)
+- *FileScan csv [_c0#39,_c1#40,_c2#41,_c3#42] Batched: false,
Format: CSV, InputPaths: file:/Users/jacek/dev/oss/spark/people.csv,
PartitionFilters: [], PushedFilters: [], ReadSchema:
struct<_c0:string,_c1:string,_c2:string,_c3:string>
== Physical Plan ==
InMemoryTableScan [_c0#39, _c1#40, _c2#41, _c3#42]
+- InMemoryRelation [_c0#39, _c1#40, _c2#41, _c3#42], true, 1,
StorageLevel(disk, memory, deserialized, 1 replicas)
+- *FileScan csv [_c0#39,_c1#40,_c2#41,_c3#42] Batched:
false, Format: CSV, InputPaths:
file:/Users/jacek/dev/oss/spark/people.csv, PartitionFilters: [],
PushedFilters: [], ReadSchema:
struct<_c0:string,_c1:string,_c2:string,_c3:string>
scala> spark.read.text("people.csv").cache.explain(extended = true)
== Parsed Logical Plan ==
Relation[value#24] text
== Analyzed Logical Plan ==
value: string
Relation[value#24] text
== Optimized Logical Plan ==
InMemoryRelation [value#24], true, 1, StorageLevel(disk, memory,
deserialized, 1 replicas)
+- *FileScan text [value#24] Batched: false, Format:
org.apache.spark.sql.execution.datasources.text.TextFileFormat@262e2c8c,
InputPaths: file:/Users/jacek/dev/oss/spark/people.csv,
PartitionFilters: [], PushedFilters: [], ReadSchema:
struct
== Physical Plan ==
InMemoryTableScan [value#24]
+- InMemoryRelation [value#24], true, 1, StorageLevel(disk,
memory, deserialized, 1 replicas)
+- *FileScan text [value#24] Batched: false, Format:
org.apache.spark.sql.execution.datasources.text.TextFileFormat@262e2c8c,
InputPaths: file:/Users/jacek/dev/oss/spark/people.csv,
PartitionFilters: [], PushedFilters: [], ReadSchema:
struct
The only thing I could find "interesting" is that TextFileFormat does
not print "text" the way the CSV format prints "CSV" (it falls back to
the class's toString). Anything special you see?
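In case it helps, here's a quick sketch of how I'd compare what the
cache matcher sees for two identical reads (assuming the public
queryExecution API in 2.x; the exact CacheManager lookup is internal,
so this only approximates it):

```scala
// Two identical reads of the same file
val a = spark.read.csv("people.csv")
val b = spark.read.csv("people.csv")

// Caching is keyed by logical-plan equality, so if the analyzed plans
// match, a second cache call should find (and warn about) the cached
// data rather than cache it again.
println(a.queryExecution.analyzed.sameResult(b.queryExecution.analyzed))
```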
Pozdrawiam,
Jacek Laskowski
https://medium.com/@jaceklaskowski/
Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski
On Tue, Aug 16, 2016 at 7:24 PM, Michael Armbrust wrote:
> Try running explain on each of these. My guess would be that caching
> is broken in some cases.
>
> On Tue, Aug 16, 2016 at 6:05 PM, Jacek Laskowski wrote:
>>
>> Hi,
>>
>> Can anyone explain why spark.read.csv("people.csv").cache.show ends up
>> with a WARN while spark.read.text("people.csv").cache.show does not?
>> It happens in 2.0 and today's build.
>>
>> scala> sc.version
>> res5: String = 2.1.0-SNAPSHOT
>>
>> scala> spark.read.csv("people.csv").cache.show
>> +---------+---------+-------+----+
>> |      _c0|      _c1|    _c2| _c3|
>> +---------+---------+-------+----+
>> |kolumna 1|kolumna 2|kolumn3|size|
>> |    Jacek| Warszawa| Polska|  40|
>> +---------+---------+-------+----+
>>
>> scala> spark.read.csv("people.csv").cache.show
>> 16/08/16 18:01:52 WARN CacheManager: Asked to cache already cached data.
>> +---------+---------+-------+----+
>> |      _c0|      _c1|    _c2| _c3|
>> +---------+---------+-------+----+
>> |kolumna 1|kolumna 2|kolumn3|size|
>> |    Jacek| Warszawa| Polska|  40|
>> +---------+---------+-------+----+
>>
>> scala> spark.read.text("people.csv").cache.show
>> +--------------------+
>> |               value|
>> +--------------------+
>> |kolumna 1,kolumna...|
>> |Jacek,Warszawa,Po...|
>> +--------------------+
>>
>> scala> spark.read.text("people.csv").cache.show
>> +--------------------+
>> |               value|
>> +--------------------+
>> |kolumna 1,kolumna...|
>> |Jacek,Warszawa,Po...|
>> +--------------------+
>>
>> Pozdrawiam,
>> Jacek Laskowski
>>
>> https://medium.com/@jaceklaskowski/
>> Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
>> Follow me at https://twitter.com/jaceklaskowski
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>