Hi,

Another silly question… Don’t you want to use the header line to help create a schema for the RDD?
Thx

-Mike

> On May 3, 2016, at 8:09 AM, Mathieu Longtin <math...@closetwork.org> wrote:
>
> This only works if the files are "unsplittable". For example, with gzip files
> each partition is one file (if you have more partitions than files), so the
> first line of each partition is the header.
>
> The spark-csv extension reads the very first line of the RDD, assumes it's
> the header, and then filters out every occurrence of that line. Something
> like this (Python code here, but Scala should be very similar):
>
> header = data.first()
> data = data.filter(lambda line: line != header)
>
> Since I had lots of small CSV files, and not all of them have exactly the
> same header, I use the following:
>
> file_list = sc.parallelize(list_of_csv)
> data = file_list.flatMap(function_that_reads_csvs_and_extracts_the_colums_I_want)
>
> On Tue, May 3, 2016 at 3:23 AM Abhishek Anand <abhis.anan...@gmail.com> wrote:
> You can use this function to remove the header from your dataset (applicable
> to RDDs):
>
> import org.apache.spark.rdd.RDD
>
> // Drop the first line of partition 0 only; other partitions pass through.
> def dropHeader(data: RDD[String]): RDD[String] = {
>   data.mapPartitionsWithIndex((idx, lines) => {
>     if (idx == 0) lines.drop(1) else lines
>   })
> }
>
> Abhi
>
> On Wed, Apr 27, 2016 at 12:55 PM, Marco Mistroni <mmistr...@gmail.com> wrote:
> If you are using the Scala API you can do
> Myrdd.zipWithIndex.filter(_._2 > 0).map(_._1)
>
> Maybe a little bit complicated, but it will do the trick.
> As for spark-csv, you will get back a DataFrame, which you can convert back
> to an RDD.
> Hth
> Marco
>
> On 27 Apr 2016 6:59 am, "nihed mbarek" <nihe...@gmail.com> wrote:
> You can add a filter with a string that you are sure appears only in the
> header.
>
> On Wednesday, 27 April 2016, Divya Gehlot <divya.htco...@gmail.com> wrote:
> Yes, you can remove the headers by removing the first row.
>
> You can use first() or head() to do that.
>
> Thanks,
> Divya
>
> On 27 April 2016 at 13:24, Ashutosh Kumar <kmr.ashutos...@gmail.com> wrote:
> I see there is a library, spark-csv, which can be used for removing the
> header and processing CSV files. But it seems it works with SQLContext only.
> Is there a way to remove the header from CSV files without SQLContext?
>
> Thanks
> Ashutosh
>
> --
>
> M'BAREK Med Nihed,
> Fedora Ambassador, TUNISIA, Northern Africa
> http://www.nihed.com
>
> <http://tn.linkedin.com/in/nihed>
>
> --
> Mathieu Longtin
> 1-514-803-8977
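
For reference, here is a rough sketch of what a per-file function passed to flatMap (as in Mathieu's reply) might look like, and of how each file's header line can drive the column layout (the question at the top of the thread). It is plain Python using the standard csv module. The name read_selected_columns and its wanted parameter are hypothetical, not from the thread, and the sketch assumes every path in list_of_csv can be opened from whichever worker processes it.

```python
import csv


def read_selected_columns(path, wanted):
    """Yield one tuple per data row, holding only the `wanted` columns.

    The file's own header row is used to locate each column, so files whose
    headers differ in order (or carry extra columns) still line up.
    """
    with open(path, newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader, None)  # first row is assumed to be the header
        if header is None:
            return  # empty file: nothing to yield
        # Raises ValueError if a wanted column is missing from this file.
        positions = [header.index(name) for name in wanted]
        for row in reader:
            yield tuple(row[i] for i in positions)


# Hypothetical use with Spark, assuming `sc` and `list_of_csv` already exist:
#   file_list = sc.parallelize(list_of_csv)
#   data = file_list.flatMap(lambda p: read_selected_columns(p, ["id", "name"]))
```

Because the header is consumed inside the function, no header line ever reaches the RDD, so there is nothing to filter out afterwards. A file that lacks one of the wanted columns fails loudly here; skipping such files instead would be a design choice for the caller.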