If it is a CSV, then it is presumably a flat file sitting somewhere in a directory.

Get the header out with:

/usr/bin/zcat csvfile.gz | head -n 1
Title Number,Tenure,Property
Address,District,County,Region,Postcode,Multiple Address Indicator,Price
Paid,Proprietor Name (1),Company Registration No. (1),Proprietorship
Category (1),Country Incorporated (1),Proprietor (1) Address (1),Proprietor
(1) Address (2),Proprietor (1) Address (3),Proprietor Name (2),Company
Registration No. (2),Proprietorship Category (2),Country Incorporated
(2),Proprietor (2) Address (1),Proprietor (2) Address (2),Proprietor (2)
Address (3),Proprietor Name (3),Company Registration No. (3),Proprietorship
Category (3),Country Incorporated (3),Proprietor (3) Address (1),Proprietor
(3) Address (2),Proprietor (3) Address (3),Proprietor Name (4),Company
Registration No. (4),Proprietorship Category (4),Country Incorporated
(4),Proprietor (4) Address (1),Proprietor (4) Address (2),Proprietor (4)
Address (3),Date Proprietor Added,Additional Proprietor Indicator


10GB is not a particularly large CSV file.

That will at least settle what the header is.
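
You can cross-check the same thing from Spark itself. A minimal PySpark
sketch (the file path is an assumption, not from the thread):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv_header").getOrCreate()

# header=True takes the column names from the first line of the file
df = spark.read.option("header", True).csv("/tmp/csvfile.gz")
print(df.columns)   # should match the zcat output above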


Also, how are you running Spark: in local mode (a single JVM) or in a
distributed mode (YARN, standalone)?
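
If it turns out to be local mode, the master setting is worth checking. A
minimal sketch (illustrative only; the app name and script name are
placeholders):

from pyspark.sql import SparkSession

# local[*] = a single JVM using all available cores on this machine
spark = (SparkSession.builder
         .master("local[*]")
         .appName("csv_load")
         .getOrCreate())

# for YARN or standalone, omit .master() here and pass it at submit
# time instead, e.g.  spark-submit --master yarn your_script.py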


HTH


On Wed, 24 Mar 2021 at 03:24, ayan guha <guha.a...@gmail.com> wrote:

> The easiest approach is to use a DataFrame: df.columns will give you the
> column names automatically. Are you sure your file is really CSV? It may
> be easier to diagnose if you share the code.
>
> On Wed, 24 Mar 2021 at 2:12 pm, Sean Owen <sro...@gmail.com> wrote:
>
>> It would split 10GB of CSV into multiple partitions by default, unless
>> it's gzipped. Something else is going on here.
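>>
>> For illustration (a rough sketch; the paths and partition counts are
>> assumptions, not from the thread):
>>
>> # plain CSV is splittable: Spark creates ~128MB input partitions by
>> # default, so a 10GB uncompressed file yields roughly 80 of them
>> df = spark.read.option("header", True).csv("/tmp/csvfile.csv")
>> print(df.rdd.getNumPartitions())
>>
>> # gzip is not splittable, so the whole file lands in one partition
>> df = spark.read.option("header", True).csv("/tmp/csvfile.gz")
>> print(df.rdd.getNumPartitions())   # prints 1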
>>
>> On Tue, Mar 23, 2021 at 10:04 PM "Yuri Oleynikov (יורי אולייניקוב)"
>> <yur...@gmail.com> wrote:
>>
>>> I’m not a Spark core developer and do not want to confuse you, but it
>>> seems logical to me that reading from a single file (whatever the
>>> format) gives no parallelism unless you repartition by some column just
>>> after the CSV load. But you’re saying you’ve already tried repartition
>>> with no luck...
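>>>
>>> For what it’s worth, that would look something like this (a sketch; the
>>> path and the partition count are arbitrary examples):
>>>
>>> df = spark.read.option("header", True).csv("/tmp/csvfile.gz")
>>> # spread the single gzip partition across the cluster before heavy work
>>> df = df.repartition(64)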
>>>
>>>
>>> > On 24 Mar 2021, at 03:47, KhajaAsmath Mohammed
>>> > <mdkhajaasm...@gmail.com> wrote:
>>> >
>>> > So Spark by default doesn’t split the large 10GB file when loaded?
>>> >
>>> > Sent from my iPhone
>>> >
>>> >> On Mar 23, 2021, at 8:44 PM, Yuri Oleynikov (יורי אולייניקוב)
>>> >> <yur...@gmail.com> wrote:
>>> >>
>>> >> Hi Mohammed,
>>> >> I think the reason only one executor is running, with a single
>>> >> partition, is that you have a single file that is read/loaded as one
>>> >> unit.
>>> >>
>>> >> To achieve better parallelism, I’d suggest splitting the CSV file.
>>> >>
>>>
> --
> Best Regards,
> Ayan Guha
>
