Re: [SQL] Why does spark.read.csv.cache give me a WARN about cache but not text?!

2016-08-17 Thread Jacek Laskowski
Hi Michael,

Thanks a lot for your help. See below for the explain output for csv
and text. Do you see anything worth investigating?

scala> spark.read.csv("people.csv").cache.explain(extended = true)
== Parsed Logical Plan ==
Relation[_c0#39,_c1#40,_c2#41,_c3#42] csv

== Analyzed Logical Plan ==
_c0: string, _c1: string, _c2: string, _c3: string
Relation[_c0#39,_c1#40,_c2#41,_c3#42] csv

== Optimized Logical Plan ==
InMemoryRelation [_c0#39, _c1#40, _c2#41, _c3#42], true, 1, StorageLevel(disk, memory, deserialized, 1 replicas)
   +- *FileScan csv [_c0#39,_c1#40,_c2#41,_c3#42] Batched: false, Format: CSV, InputPaths: file:/Users/jacek/dev/oss/spark/people.csv, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<_c0:string,_c1:string,_c2:string,_c3:string>

== Physical Plan ==
InMemoryTableScan [_c0#39, _c1#40, _c2#41, _c3#42]
   +- InMemoryRelation [_c0#39, _c1#40, _c2#41, _c3#42], true, 1, StorageLevel(disk, memory, deserialized, 1 replicas)
         +- *FileScan csv [_c0#39,_c1#40,_c2#41,_c3#42] Batched: false, Format: CSV, InputPaths: file:/Users/jacek/dev/oss/spark/people.csv, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<_c0:string,_c1:string,_c2:string,_c3:string>


scala> spark.read.text("people.csv").cache.explain(extended = true)
== Parsed Logical Plan ==
Relation[value#24] text

== Analyzed Logical Plan ==
value: string
Relation[value#24] text

== Optimized Logical Plan ==
InMemoryRelation [value#24], true, 1, StorageLevel(disk, memory, deserialized, 1 replicas)
   +- *FileScan text [value#24] Batched: false, Format: org.apache.spark.sql.execution.datasources.text.TextFileFormat@262e2c8c, InputPaths: file:/Users/jacek/dev/oss/spark/people.csv, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<value:string>

== Physical Plan ==
InMemoryTableScan [value#24]
   +- InMemoryRelation [value#24], true, 1, StorageLevel(disk, memory, deserialized, 1 replicas)
         +- *FileScan text [value#24] Batched: false, Format: org.apache.spark.sql.execution.datasources.text.TextFileFormat@262e2c8c, InputPaths: file:/Users/jacek/dev/oss/spark/people.csv, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<value:string>

The only "interesting" thing I could find is that the text scan reports
its Format as the object reference TextFileFormat@262e2c8c, while the
csv scan reports CSV. Do you see anything special?
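
One possible reading of that difference, sketched as a toy model rather
than Spark's actual code: if the cache lookup compares plans for equality,
and a relation's equality includes its file format object, then a format
with value-based equals (and a toString like CSV) would make a second read
of the same file match the cached plan and trigger the WARN, while a format
that inherits reference equality (and prints as Class@hash) would never
match. Every name below (CsvLikeFormat, TextLikeFormat, Relation,
ToyCacheManager) is hypothetical, and whether Spark's TextFileFormat really
lacks equals is an assumption that nothing in this thread confirms.

```scala
// Toy model only: hypothetical stand-ins, not Spark's real classes.
import scala.collection.mutable.ArrayBuffer

// A format with value-based equality and a readable name.
class CsvLikeFormat {
  override def toString: String = "CSV"
  override def hashCode(): Int = getClass.hashCode()
  override def equals(other: Any): Boolean = other.isInstanceOf[CsvLikeFormat]
}

// A format that inherits reference equality and the default toString.
class TextLikeFormat

// Case-class equality compares fields, so it delegates to the format's equals.
final case class Relation(path: String, format: AnyRef)

class ToyCacheManager {
  private val cached = ArrayBuffer.empty[Relation]
  val warnings = ArrayBuffer.empty[String]

  // Warn when an equal plan is already cached, otherwise remember the plan.
  def cache(plan: Relation): Unit =
    if (cached.contains(plan)) warnings += "Asked to cache already cached data."
    else cached += plan
}

object CacheEqualityDemo {
  def main(args: Array[String]): Unit = {
    val csv = new ToyCacheManager
    csv.cache(Relation("people.csv", new CsvLikeFormat))
    csv.cache(Relation("people.csv", new CsvLikeFormat)) // equal plan: warns
    println(s"csv-like warnings: ${csv.warnings.size}")

    val text = new ToyCacheManager
    text.cache(Relation("people.csv", new TextLikeFormat))
    text.cache(Relation("people.csv", new TextLikeFormat)) // distinct instance: no match
    println(s"text-like warnings: ${text.warnings.size}")
  }
}
```

Running CacheEqualityDemo reports one warning for the CSV-like format and
none for the text-like one, which mirrors the behaviour seen in the shell.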

Regards,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski



-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: [SQL] Why does spark.read.csv.cache give me a WARN about cache but not text?!

2016-08-16 Thread Michael Armbrust
try running explain on each of these.  my guess would be caching is broken
in some cases.



[SQL] Why does spark.read.csv.cache give me a WARN about cache but not text?!

2016-08-16 Thread Jacek Laskowski
Hi,

Can anyone explain why spark.read.csv("people.csv").cache.show ends up
with a WARN while spark.read.text("people.csv").cache.show does not?
It happens in 2.0 and today's build.

scala> sc.version
res5: String = 2.1.0-SNAPSHOT

scala> spark.read.csv("people.csv").cache.show
+---------+---------+-------+----+
|      _c0|      _c1|    _c2| _c3|
+---------+---------+-------+----+
|kolumna 1|kolumna 2|kolumn3|size|
|    Jacek| Warszawa| Polska|  40|
+---------+---------+-------+----+

scala> spark.read.csv("people.csv").cache.show
16/08/16 18:01:52 WARN CacheManager: Asked to cache already cached data.
+---------+---------+-------+----+
|      _c0|      _c1|    _c2| _c3|
+---------+---------+-------+----+
|kolumna 1|kolumna 2|kolumn3|size|
|    Jacek| Warszawa| Polska|  40|
+---------+---------+-------+----+

scala> spark.read.text("people.csv").cache.show
+--------------------+
|               value|
+--------------------+
|kolumna 1,kolumna...|
|Jacek,Warszawa,Po...|
+--------------------+

scala> spark.read.text("people.csv").cache.show
+--------------------+
|               value|
+--------------------+
|kolumna 1,kolumna...|
|Jacek,Warszawa,Po...|
+--------------------+

Regards,
Jacek Laskowski
