Hi

"I still dont understand how the spark will split the large file" -- This
is actually achieved by something called InputFormat, which in turn depends
on what type of file it is and what is the block size. Ex: If you have
block size of 64MB, then a 10GB file will roughly translate to 10240/64 =
160 partitions. (Roughly because line boundaries are taken into account).
Spark launches 1 task for each partitions, so you should see 160 tasks
created.
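
You can check this yourself; a rough sketch below (the path is made up,
and in newer versions the exact count also depends on settings such as
spark.sql.files.maxPartitionBytes):

  // read a plain-text CSV; Spark plans one partition per split
  val df = spark.read.option("header", true).csv("/data/big.csv")
  // one task per partition, so expect roughly 10240/64 = 160 here
  println(df.rdd.getNumPartitions)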

Because .gz is not splittable, Spark uses a different InputFormat, and
hence the number of tasks is the same as the number of files, not the
number of splits (aka partitions). Hence, a 10GB .gz file will incur
only 1 task.
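
The same check makes the difference obvious (again just a sketch with a
made-up path):

  // a gzipped CSV cannot be split, so the whole file is one partition
  val gz = spark.read.option("header", true).csv("/data/big.csv.gz")
  println(gz.rdd.getNumPartitions)  // 1, no matter how large the file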

Now these tasks are the unit of parallelism, and they can run in
parallel. You can roughly translate this to the number of cores
available to the cluster while reading the file. How many cores are
available? Well, that depends on how you are launching the job. Ex: If
you launch with local[*], that means you want all of your local cores
to be used.
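
For example, a minimal session built in code (the app name is
arbitrary):

  import org.apache.spark.sql.SparkSession

  // local[*] = use every core on this machine; local[4] would cap it
  val spark = SparkSession.builder()
    .master("local[*]")
    .appName("parallelism-demo")
    .getOrCreate()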

In a distributed setting, you can ask Spark to group cores (and RAM),
and that grouping is called an executor. Each executor can have 1 or
more cores (SparkConf driven). So each executor takes some of the tasks
created above and runs them in parallel. That's what you see on the
Spark UI Executors page.
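
A sketch of the SparkConf side (the numbers are just examples; in
practice you would usually pass these as spark-submit options, and
spark.executor.instances only applies on cluster managers like YARN):

  import org.apache.spark.SparkConf

  // 5 executors x 4 cores each = up to 20 tasks running at once
  val conf = new SparkConf()
    .set("spark.executor.instances", "5")
    .set("spark.executor.cores", "4")
    .set("spark.executor.memory", "8g")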

So depending on how you are launching the job, you should see:
(a) how many executors are running, and with how many cores
(b) how many tasks are scheduled to run
(c) which executor is running those tasks

As a framework, Spark does all of this without you needing to do
anything, as Sean said above. The question, then, is why you see no
parallelism. Well, I hope these pointers lead you to at least look in
the right places. Please share the format of the file, how you are
launching the job and, if possible, screenshots of the Spark UI pages,
and I am sure the good people of this forum will help you out.

HTH....

On Thu, Mar 25, 2021 at 3:54 AM Sean Owen <sro...@gmail.com> wrote:

> Right, that's all you do to tell it to treat the first line of the files
> as a header defining column names.
> Yes, .gz files still aren't splittable by nature. One huge .csv file
> would be split into partitions, but one .gz file would not, which can be a
> problem.
> To be clear, you do not need to do anything to let Spark read parts of a
> large file in parallel (assuming compression isn't the issue).
>
> On Wed, Mar 24, 2021 at 11:00 AM Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> How does Spark establish there is a CSV header, as a matter of interest?
>>
>> Example
>>
>> val df = spark.read.option("header", true).csv(location)
>>
>> I need to tell Spark to ignore the header, correct?
>>
>> From Spark Read CSV file into DataFrame — SparkByExamples
>> <https://sparkbyexamples.com/spark/spark-read-csv-file-into-dataframe/>
>>
>> If you have a header with column names on file, you need to explicitly
>> specify true for the header option using option("header", true); not
>> mentioning this, the API treats the header as a data record.
>>
>> A second point, which may not be applicable to newer versions of Spark:
>> my understanding is that a gz file is not splittable, therefore Spark
>> needs to read the whole file using a single core, which will slow things
>> down (it is CPU intensive). After the read is done, the data can be
>> shuffled to increase parallelism.
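>>
>> A sketch of that read-then-shuffle pattern (the file name and the
>> partition count are made up):
>>
>>   // the single-core read of the .gz cannot be avoided...
>>   val df = spark.read.option("header", true).csv("/data/big.csv.gz")
>>   // ...but repartitioning spreads the downstream work across cores
>>   val parallel = df.repartition(160)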
>>
>> HTH
>>
>>
>> On Wed, 24 Mar 2021 at 12:40, Sean Owen <sro...@gmail.com> wrote:
>>
>>> No need to do that. Reading the header automatically with Spark is
>>> trivial.
>>>
>>> On Wed, Mar 24, 2021 at 5:25 AM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
>>>> If it is a CSV then it is a flat file somewhere in a directory, I guess.
>>>>
>>>> Get the header out by doing:
>>>>
>>>> /usr/bin/zcat csvfile.gz | head -n 1
>>>> Title Number,Tenure,Property
>>>> Address,District,County,Region,Postcode,Multiple Address Indicator,Price
>>>> Paid,Proprietor Name (1),Company Registration No. (1),Proprietorship
>>>> Category (1),Country Incorporated (1),Proprietor (1) Address (1),Proprietor
>>>> (1) Address (2),Proprietor (1) Address (3),Proprietor Name (2),Company
>>>> Registration No. (2),Proprietorship Category (2),Country Incorporated
>>>> (2),Proprietor (2) Address (1),Proprietor (2) Address (2),Proprietor (2)
>>>> Address (3),Proprietor Name (3),Company Registration No. (3),Proprietorship
>>>> Category (3),Country Incorporated (3),Proprietor (3) Address (1),Proprietor
>>>> (3) Address (2),Proprietor (3) Address (3),Proprietor Name (4),Company
>>>> Registration No. (4),Proprietorship Category (4),Country Incorporated
>>>> (4),Proprietor (4) Address (1),Proprietor (4) Address (2),Proprietor (4)
>>>> Address (3),Date Proprietor Added,Additional Proprietor Indicator
>>>>
>>>>
>>>> 10GB is not much of a big CSV file.
>>>>
>>>> That will resolve the header anyway.
>>>>
>>>>
>>>> Also, how are you running Spark: in local mode (single JVM) or one of
>>>> the distributed modes (YARN, standalone)?
>>>>
>>>>
>>>> HTH
>>>>
>>>

-- 
Best Regards,
Ayan Guha
