Re: Rdd - zip with index

Mich Talebzadeh Wed, 24 Mar 2021 08:56:26 -0700

As a matter of interest, how does Spark establish that there is a CSV header?

Example

val df = spark.read.option("header", true).csv(location)

So I need to tell Spark that the first line is a header, correct?

From Spark Read CSV file into DataFrame — SparkByExamples
<https://sparkbyexamples.com/spark/spark-read-csv-file-into-dataframe/>

If you have a header with column names in the file, you need to explicitly
specify true for the header option using option("header",true)
<https://sparkbyexamples.com/spark/spark-read-csv-file-into-dataframe/#header>.
Without it, the API treats the header as a data record.
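
To see the difference the header option makes, here is a minimal sketch. It
assumes a local Spark session and a small throwaway CSV written to a temp
file; the object name HeaderOptionSketch, the describe helper and the sample
rows are made up for illustration and are not from the thread.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

object HeaderOptionSketch {
  // Hypothetical helper: returns the column names and row count Spark
  // reports for the CSV at `location`, with the header option on or off.
  def describe(spark: SparkSession, location: String, header: Boolean): (Seq[String], Long) = {
    val df = spark.read.option("header", header).csv(location)
    (df.columns.toSeq, df.count())
  }

  def main(args: Array[String]): Unit = {
    // Assumption: local mode, purely for demonstration
    val spark = SparkSession.builder()
      .appName("header-option-sketch")
      .master("local[*]")
      .getOrCreate()

    // Made-up two-line sample: one header line, one data line
    val path = Files.createTempFile("sample", ".csv")
    Files.write(path, "Title Number,Tenure\nABC123,Freehold\n".getBytes("UTF-8"))
    val location = path.toUri.toString

    // header = true: the first line supplies the column names
    println(describe(spark, location, header = true))
    // header = false (the default): generated names _c0, _c1, ... are used
    // and the first line is counted as an ordinary row
    println(describe(spark, location, header = false))

    spark.stop()
  }
}
```

So Spark does not detect the header by itself; it only takes the first line
as column names when told to.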

A second point, which may not apply to newer versions of Spark: my
understanding is that a gz file is not splittable, so Spark has to read the
whole file on a single core, which slows things down (CPU intensive). Once
the read is done, the data can be shuffled to increase parallelism.
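
A rough sketch of that read-then-shuffle idea, assuming a local Spark session
and a tiny gzip file created on the fly. The object name
GzipRepartitionSketch, the readGzCsv helper and the partition count of 4 are
illustrative choices only; the right number depends on the cluster.

```scala
import java.nio.file.Files
import java.util.zip.GZIPOutputStream
import org.apache.spark.sql.{DataFrame, SparkSession}

object GzipRepartitionSketch {
  // Hypothetical helper: reads a gzip-compressed CSV with a header line,
  // then shuffles the rows into `numPartitions` partitions.
  def readGzCsv(spark: SparkSession, location: String, numPartitions: Int): DataFrame = {
    val df = spark.read.option("header", true).csv(location)
    // A single .gz file cannot be split, so expect few partitions here
    println(s"partitions after read: ${df.rdd.getNumPartitions}")
    df.repartition(numPartitions)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: local mode, purely for demonstration
    val spark = SparkSession.builder()
      .appName("gzip-repartition-sketch")
      .master("local[*]")
      .getOrCreate()

    // Made-up sample data written as a .csv.gz temp file
    val path = Files.createTempFile("sample", ".csv.gz")
    val out = new GZIPOutputStream(Files.newOutputStream(path))
    out.write("Title Number,Tenure\nABC123,Freehold\nDEF456,Leasehold\n".getBytes("UTF-8"))
    out.close()

    val repartitioned = readGzCsv(spark, path.toUri.toString, numPartitions = 4)
    println(s"partitions after repartition: ${repartitioned.rdd.getNumPartitions}")

    spark.stop()
  }
}
```

The initial read is still done by one task; repartition only helps the stages
that come after it.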

HTH


On Wed, 24 Mar 2021 at 12:40, Sean Owen <sro...@gmail.com> wrote:

> No need to do that. Reading the header with Spark automatically is trivial.
>
> On Wed, Mar 24, 2021 at 5:25 AM Mich Talebzadeh <mich.talebza...@gmail.com>
> wrote:
>
>> If it is a csv then it is a flat file somewhere in a directory I guess.
>>
>> Get the header out by doing
>>
>> */usr/bin/zcat csvfile.gz |head -n 1*
>> Title Number,Tenure,Property
>> Address,District,County,Region,Postcode,Multiple Address Indicator,Price
>> Paid,Proprietor Name (1),Company Registration No. (1),Proprietorship
>> Category (1),Country Incorporated (1),Proprietor (1) Address (1),Proprietor
>> (1) Address (2),Proprietor (1) Address (3),Proprietor Name (2),Company
>> Registration No. (2),Proprietorship Category (2),Country Incorporated
>> (2),Proprietor (2) Address (1),Proprietor (2) Address (2),Proprietor (2)
>> Address (3),Proprietor Name (3),Company Registration No. (3),Proprietorship
>> Category (3),Country Incorporated (3),Proprietor (3) Address (1),Proprietor
>> (3) Address (2),Proprietor (3) Address (3),Proprietor Name (4),Company
>> Registration No. (4),Proprietorship Category (4),Country Incorporated
>> (4),Proprietor (4) Address (1),Proprietor (4) Address (2),Proprietor (4)
>> Address (3),Date Proprietor Added,Additional Proprietor Indicator
>>
>>
>> 10GB is not much of a big CSV file
>>
>> that will resolve the header anyway.
>>
>>
>> Also how are you running the spark, in a local mode (single jvm) or
>> other distributed modes (yarn, standalone) ?
>>
>>
>> HTH
>>
>
