Hi
"I still dont understand how the spark will split the large file" -- This
is actually achieved by something called InputFormat, which in turn depends
on what type of file it is and what the block size is. Ex: with a block
size of 64MB, a 10GB file will roughly translate to 10240/64 = 160 input
splits, and hence about 160 partitions.
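As a back-of-the-envelope check, the split arithmetic above can be sketched in plain Scala (no Spark needed; the 10 GB / 64 MB figures are the ones from this thread):

```scala
// Rough split count for a 10 GB file with a 64 MB block size,
// mirroring the 10240/64 arithmetic above.
val fileSizeMB  = 10 * 1024   // 10 GB expressed in MB
val blockSizeMB = 64          // block size in MB
val numSplits   = math.ceil(fileSizeMB.toDouble / blockSizeMB).toInt
println(numSplits)            // 160 splits, so roughly 160 partitions
```

Each split is handed to a separate task, which is why a big plain-text file parallelises without any manual pre-splitting.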
Right, that's all you do to tell it to treat the first line of the files as
a header defining the column names.
Yes, .gz files still aren't splittable by nature. One huge uncompressed
.csv file would be split into partitions, but one .gz file would not, which
can be a problem.
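A quick JVM-level illustration of why a .gz file is not splittable (plain Scala, no Spark; the tiny in-memory payload is made up for the demo): gzip data can only be decompressed starting from byte 0, so a second task cannot jump in at an arbitrary offset inside the file.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import java.util.zip.{GZIPInputStream, GZIPOutputStream, ZipException}

// Compress a small payload in memory (a stand-in for a big CSV).
val raw = "Title Number,Tenure\nT1,Freehold\n".getBytes("UTF-8")
val buf = new ByteArrayOutputStream()
val gzOut = new GZIPOutputStream(buf)
gzOut.write(raw)
gzOut.close()
val compressed = buf.toByteArray

// From offset 0 the gzip header is present and decompression works.
val gzIn = new GZIPInputStream(new ByteArrayInputStream(compressed))
val roundTrip = gzIn.readAllBytes.sameElements(raw)

// Skip even one byte and the magic bytes (0x1f 0x8b) are gone, so
// GZIPInputStream refuses to start: there is no mid-file entry point
// where a second worker could begin reading.
val cannotStartMidStream =
  try { new GZIPInputStream(new ByteArrayInputStream(compressed.drop(1))); false }
  catch { case _: ZipException => true }

println(s"$roundTrip $cannotStartMidStream") // true true
```

This is why one .gz file lands in a single partition, while the same data uncompressed (or in a splittable codec such as bzip2) fans out across many.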
To be clear, you do not need to do that.
Thanks Mich. I understand what I am supposed to do now and will try these
options.
I still don't understand how Spark will split the large file. I have a
10 GB file which I want split automatically after reading. I could split
the file myself before loading it, but that is a very big requirement.
How does Spark establish that there is a csv header, as a matter of interest?
Example
val df = spark.read.option("header", true).csv(location)
I need to tell spark to ignore the header correct?
From Spark Read CSV file into DataFrame — SparkByExamples
No need to do that. Reading the header with Spark automatically is trivial.
On Wed, Mar 24, 2021 at 5:25 AM Mich Talebzadeh
wrote:
If it is a csv then it is a flat file somewhere in a directory I guess.
Get the header out by doing
*/usr/bin/zcat csvfile.gz |head -n 1*
Title Number,Tenure,Property Address,District,County,Region,Postcode,Multiple Address Indicator,Price Paid,Proprietor Name (1),Company Registration No.