Re: Rdd - zip with index

2021-03-24 Thread ayan guha
Hi, "I still don't understand how Spark will split the large file" -- This is actually achieved by something called InputFormat, which in turn depends on the type of file and the block size. Ex: If you have a block size of 64MB, then a 10GB file will roughly translate to 10240/64 =
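
The block-size arithmetic above can be sketched in plain Scala. This is only a rough estimate under a stated assumption: one split per block, rounded up. The object and method names are hypothetical, and a real InputFormat (or, for DataFrame file sources, the `spark.sql.files.maxPartitionBytes` setting) may produce a somewhat different count.

```scala
// Rough estimate of how many input splits a splittable file yields.
// Assumption: one split per block; actual InputFormats may size splits differently.
object SplitEstimate {
  /** Ceiling division: number of blocks needed to cover fileSizeMb. */
  def estimateSplits(fileSizeMb: Long, blockSizeMb: Long): Long = {
    require(fileSizeMb >= 0, "file size must be non-negative")
    require(blockSizeMb > 0, "block size must be positive")
    (fileSizeMb + blockSizeMb - 1) / blockSizeMb
  }

  def main(args: Array[String]): Unit = {
    val fileSizeMb = 10L * 1024 // the 10GB file discussed in the thread
    println(s"64MB blocks:  ${estimateSplits(fileSizeMb, 64)} splits")
    println(s"128MB blocks: ${estimateSplits(fileSizeMb, 128)} splits")
  }
}
```

This applies only to splittable input; a single gzip-compressed file is read as one split regardless of its size.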

Re: Rdd - zip with index

2021-03-24 Thread Sean Owen
Right, that's all you do to tell it to treat the first line of the files as a header defining col names. Yes, .gz files still aren't splittable by nature. One huge .csv file would be split into partitions, but one .gz file would not, which can be a problem. To be clear, you do not need to do
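
A minimal sketch of what the non-splittable .gz case looks like in practice, assuming a single gzip-compressed CSV with a header row. The object name, method name, and partition count are hypothetical, and the `repartition` call is a common workaround rather than something the thread prescribes: the file is still decompressed by a single task, and only the work after the shuffle is spread out.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object GzipCsvRead {
  // Reads a CSV (possibly gzip-compressed) whose first line is a header,
  // then redistributes the rows over `targetPartitions` partitions.
  def readAndRepartition(spark: SparkSession, path: String, targetPartitions: Int): DataFrame = {
    val df = spark.read.option("header", "true").csv(path)
    // For a single .gz file this is expected to be 1, since gzip is not splittable.
    println(s"partitions after read: ${df.rdd.getNumPartitions}")
    // Full shuffle; downstream stages can then run in parallel.
    df.repartition(targetPartitions)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("gzip-csv-read")
      .master("local[*]") // assumption: local run for illustration
      .getOrCreate()
    val df = readAndRepartition(spark, args(0), 16)
    println(s"partitions after repartition: ${df.rdd.getNumPartitions}")
    spark.stop()
  }
}
```

If the file can be recompressed, a splittable format or several smaller files would avoid the single-task read altogether.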

Re: Rdd - zip with index

2021-03-24 Thread KhajaAsmath Mohammed
Thanks Mich. I understand what I am supposed to do now and will try these options. I still don't understand how Spark will split the large file. I have a 10 GB file which I want to split automatically after reading. I can split and load the file before reading but it is a very big requirement

Re: Rdd - zip with index

2021-03-24 Thread Mich Talebzadeh
As a matter of interest, how does Spark establish that there is a CSV header? Example: val df = spark.read.option("header", true).csv(location) I need to tell Spark to ignore the header, correct? From Spark Read CSV file into DataFrame — SparkByExamples
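
On the header question, a small sketch may help. Spark does not detect a header on its own: with `header` set to true the first line supplies the column names, and without it that line is kept as an ordinary data row under default column names. The object name, method name, and sample data are hypothetical.

```scala
import java.nio.file.Files
import org.apache.spark.sql.{DataFrame, SparkSession}

object HeaderOptionDemo {
  // Reads the same file twice: once treating line 1 as a header, once not.
  def readBoth(spark: SparkSession, path: String): (DataFrame, DataFrame) = {
    val withHeader = spark.read.option("header", "true").csv(path)
    val withoutHeader = spark.read.csv(path) // header defaults to false
    (withHeader, withoutHeader)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("header-option-demo")
      .master("local[*]") // assumption: local run for illustration
      .getOrCreate()

    // Hypothetical sample data written to a temp file for the demo.
    val path = Files.createTempFile("header-demo", ".csv")
    Files.write(path, "id,name\n1,alice\n2,bob\n".getBytes("UTF-8"))

    val (withHeader, withoutHeader) = readBoth(spark, path.toString)
    // Column names come from the first line; two data rows remain.
    println(s"${withHeader.columns.mkString(",")} rows=${withHeader.count()}")
    // Default names _c0, _c1; the header line counts as a third row.
    println(s"${withoutHeader.columns.mkString(",")} rows=${withoutHeader.count()}")
    spark.stop()
  }
}
```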

Re: Rdd - zip with index

2021-03-24 Thread Sean Owen
No need to do that. Reading the header with Spark automatically is trivial.

On Wed, Mar 24, 2021 at 5:25 AM Mich Talebzadeh wrote:
> If it is a csv then it is a flat file somewhere in a directory I guess.
>
> Get the header out by doing
>
> */usr/bin/zcat csvfile.gz |head -n 1*
> Title

Re: Rdd - zip with index

2021-03-24 Thread Mich Talebzadeh
If it is a csv then it is a flat file somewhere in a directory I guess.

Get the header out by doing

*/usr/bin/zcat csvfile.gz |head -n 1*
Title Number,Tenure,Property Address,District,County,Region,Postcode,Multiple Address Indicator,Price Paid,Proprietor Name (1),Company Registration No.