Re: Rdd - zip with index

Sean Owen Wed, 24 Mar 2021 05:40:49 -0700

No need to do that. Having Spark read the header automatically is trivial.
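
For example, a minimal PySpark sketch, assuming an existing SparkSession is passed in and using the csvfile.gz name from the quoted message below. The helper name read_csv_with_header is made up for illustration; the "header" reader option is what tells Spark to take column names from the first line.

```python
def read_csv_with_header(spark, path):
    """Load a CSV file (plain or gzip-compressed) into a DataFrame,
    taking the column names from the first line of the file.

    `spark` is assumed to be an existing pyspark.sql.SparkSession.
    """
    return (
        spark.read
        .option("header", True)  # first line supplies the column names
        .csv(path)
    )
```

With a session in hand, something like df = read_csv_with_header(spark, "csvfile.gz") followed by df.columns should list the header fields, so there is no need to extract them by hand first.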


On Wed, Mar 24, 2021 at 5:25 AM Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> If it is a CSV, then I guess it is a flat file somewhere in a directory.
>
> Get the header out by doing
>
> /usr/bin/zcat csvfile.gz | head -n 1
> Title Number,Tenure,Property
> Address,District,County,Region,Postcode,Multiple Address Indicator,Price
> Paid,Proprietor Name (1),Company Registration No. (1),Proprietorship
> Category (1),Country Incorporated (1),Proprietor (1) Address (1),Proprietor
> (1) Address (2),Proprietor (1) Address (3),Proprietor Name (2),Company
> Registration No. (2),Proprietorship Category (2),Country Incorporated
> (2),Proprietor (2) Address (1),Proprietor (2) Address (2),Proprietor (2)
> Address (3),Proprietor Name (3),Company Registration No. (3),Proprietorship
> Category (3),Country Incorporated (3),Proprietor (3) Address (1),Proprietor
> (3) Address (2),Proprietor (3) Address (3),Proprietor Name (4),Company
> Registration No. (4),Proprietorship Category (4),Country Incorporated
> (4),Proprietor (4) Address (1),Proprietor (4) Address (2),Proprietor (4)
> Address (3),Date Proprietor Added,Additional Proprietor Indicator
>
>
> 10GB is not that big for a CSV file.
>
> That will resolve the header anyway.
>
>
> Also, how are you running Spark: in local mode (single JVM) or in a
> distributed mode (YARN, standalone)?
>
>
> HTH
>
