Re: removing header from csv file

Michael Segel Tue, 03 May 2016 11:42:18 -0700

Hi, 
Another silly question… 

Don’t you want to use the header line to help create a schema for the RDD?
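
To make the question concrete, here is a minimal sketch in plain Python (no Spark needed) of keeping the header as column names instead of discarding it. The function name `parse_with_header` and the sample data are made up for illustration; on an RDD the same idea would presumably start from the `data.first()` line quoted below.

```python
import csv

def parse_with_header(lines):
    """Use the first line as column names and map every later line to a dict."""
    reader = csv.reader(lines)
    column_names = next(reader)  # the header line becomes the schema
    return [dict(zip(column_names, fields)) for fields in reader]

# Hypothetical sample input: header first, then data rows.
sample = ["name,age", "alice,30", "bob,25"]
records = parse_with_header(sample)
```

All values stay strings here; casting to real column types would be a separate step.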


Thx

-Mike

> On May 3, 2016, at 8:09 AM, Mathieu Longtin <math...@closetwork.org> wrote:
> 
> This only works if the files are "unsplittable". For example, with gzip files, 
> each partition is one file (if you have more partitions than files), so the 
> first line of each partition is the header.
> 
> The spark-csv extension reads the very first line of the RDD, assumes it's the 
> header, and then filters out every occurrence of that line. Something like this 
> (Python code here, but Scala should be very similar):
> 
> header = data.first()
> data = data.filter(lambda line: line != header)
> 
> Since I had lots of small CSV files, and not all of them have the same exact 
> header, I use the following:
> 
> file_list = sc.parallelize(list_of_csv)
> data = file_list.flatMap(function_that_reads_csvs_and_extracts_the_colums_I_want)
> 
> 
> 
> 
> On Tue, May 3, 2016 at 3:23 AM Abhishek Anand <abhis.anan...@gmail.com 
> <mailto:abhis.anan...@gmail.com>> wrote:
> You can use this function to remove the header from your dataset (applicable 
> to RDDs):
> 
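> import org.apache.spark.rdd.RDD
> 
> def dropHeader(data: RDD[String]): RDD[String] = {
>   // Partition 0 holds the first line of the input, so drop it there.
>   // The result of drop(1) must be returned; it does not modify `lines` in place.
>   data.mapPartitionsWithIndex((idx, lines) =>
>     if (idx == 0) lines.drop(1) else lines
>   )
> }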
> 
> 
> Abhi 
> 
> On Wed, Apr 27, 2016 at 12:55 PM, Marco Mistroni <mmistr...@gmail.com 
> <mailto:mmistr...@gmail.com>> wrote:
> If you are using the Scala API you can do
> Myrdd.zipWithIndex.filter(_._2 > 0).map(_._1)
> 
> Maybe a little bit complicated, but it will do the trick.
> As for spark-csv, you will get back a DataFrame, which you can convert back to 
> an RDD.
> Hth
> Marco
> 
> On 27 Apr 2016 6:59 am, "nihed mbarek" <nihe...@gmail.com 
> <mailto:nihe...@gmail.com>> wrote:
> You can add a filter on a string that you are sure appears only in the 
> header.
> 
> On Wednesday, 27 April 2016, Divya Gehlot <divya.htco...@gmail.com 
> <mailto:divya.htco...@gmail.com>> wrote:
> Yes, you can remove the header by removing the first row. 
> 
> You can use first() or head() to do that. 
> 
> 
> Thanks,
> Divya 
> 
> On 27 April 2016 at 13:24, Ashutosh Kumar <kmr.ashutos...@gmail.com> wrote:
> I see there is a library, spark-csv, which can be used for removing the header 
> and processing CSV files. But it seems it works with SQLContext only. Is there 
> a way to remove the header from CSV files without SQLContext? 
> 
> Thanks
> Ashutosh
> 
> 
> 
> -- 
> 
> M'BAREK Med Nihed,
> Fedora Ambassador, TUNISIA, Northern Africa
> http://www.nihed.com <http://www.nihed.com/>
> 
>  <http://tn.linkedin.com/in/nihed>
> 
> 
> -- 
> Mathieu Longtin
> 1-514-803-8977
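
For the per-file approach Mathieu describes in the quoted message above, a function of that kind might look roughly like the sketch below. This is only an illustration under assumptions: the name `read_selected_columns` and the column names in `WANTED_COLUMNS` are hypothetical, and it uses only the standard `csv` module.

```python
import csv

# Hypothetical column names; replace with the columns you actually need.
WANTED_COLUMNS = ("id", "amount")

def read_selected_columns(path, wanted=WANTED_COLUMNS):
    """Read one CSV file and return its rows reduced to the wanted columns.

    csv.DictReader consumes the header, so it never shows up as data, and
    columns are looked up by name, so files whose headers have extra or
    reordered columns still work. A column missing from a file yields None.
    """
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle)
        # Build the full list before the file is closed.
        return [tuple(row.get(name) for name in wanted) for row in reader]
```

With Spark this would presumably be used as `sc.parallelize(list_of_csv).flatMap(read_selected_columns)`, assuming every path in `list_of_csv` can be opened from the executors (for example on a shared filesystem).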
