This only works if the files are "unsplittable". For example, with gzip
files, each partition is one file (if you have more partitions than files),
so the first line of each partition is the header.
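
A minimal sketch of that unsplittable case, assuming every partition really
does start with a header line. The name drop_first_line and the path in the
comment are made up for illustration; the helper is plain Python, so it can
be tried without a cluster:

```python
from itertools import islice

def drop_first_line(lines):
    """Skip the first element of a partition's iterator and return the rest."""
    # islice is lazy, so the partition is never materialized in memory.
    return islice(lines, 1, None)

# With Spark this would be applied once per partition, for example:
#   data = sc.textFile("some/dir/*.csv.gz").mapPartitions(drop_first_line)
# ("some/dir/*.csv.gz" is a hypothetical path.)
```

If a file were split across several partitions, this would also drop one
data line from every partition after the first, which is why it depends on
the files being unsplittable.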

The spark-csv extension reads the very first line of the RDD, assumes it's
the header, and then filters out every occurrence of that line. Something
like this (Python code here, but Scala should be very similar):

header = data.first()
data = data.filter(lambda line: line != header)

Since I have lots of small CSV files, and not all of them have exactly the
same header, I use the following:

file_list = sc.parallelize(list_of_csv)
data = file_list.flatMap(function_that_reads_csvs_and_extracts_the_colums_I_want)
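
The function passed to flatMap is not shown above. A rough sketch of what
such a reader could look like follows. The name read_csv_columns and the
column names "id" and "value" are hypothetical, and it assumes every path in
list_of_csv can be opened from the worker that processes it (for example on
a shared filesystem):

```python
import csv

WANTED = ["id", "value"]  # hypothetical column names, adjust to your data

def read_csv_columns(path, wanted=WANTED):
    """Read one CSV file and return its rows as tuples of the wanted columns."""
    with open(path, newline="") as f:
        # DictReader consumes this file's own header row, so column order
        # and extra columns can differ from file to file.
        reader = csv.DictReader(f)
        # row.get returns None when a file lacks one of the wanted columns.
        return [tuple(row.get(col) for col in wanted) for row in reader]

# With Spark:
#   data = sc.parallelize(list_of_csv).flatMap(read_csv_columns)
```

Because each file is parsed with its own header, no header line ever reaches
the RDD and nothing has to be filtered afterwards.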




On Tue, May 3, 2016 at 3:23 AM Abhishek Anand <abhis.anan...@gmail.com>
wrote:

> You can use this function to remove the header from your
> dataset (applicable to RDDs)
>
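> import org.apache.spark.rdd.RDD
>
> def dropHeader(data: RDD[String]): RDD[String] = {
>   data.mapPartitionsWithIndex((idx, lines) => {
>     // Iterator.drop returns a new iterator, so its result must be returned
>     if (idx == 0) lines.drop(1) else lines
>   })
> }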
>
>
> Abhi
>
> On Wed, Apr 27, 2016 at 12:55 PM, Marco Mistroni <mmistr...@gmail.com>
> wrote:
>
>> If you are using the Scala API you can do
>> Myrdd.zipWithIndex.filter(_._2 > 0).map(_._1)
>>
>> Maybe a little bit complicated, but it will do the trick.
>> As for spark-csv, you will get back a DataFrame, which you can convert
>> back to an RDD.
>> Hth
>> Marco
>> On 27 Apr 2016 6:59 am, "nihed mbarek" <nihe...@gmail.com> wrote:
>>
>>> You can add a filter on a string that you are sure appears only in the
>>> header
>>>
>>> On Wednesday, 27 April 2016, Divya Gehlot <divya.htco...@gmail.com>
>>> wrote:
>>>
>>>> Yes, you can remove the headers by removing the first row.
>>>>
>>>> You can use first() or head() to do that.
>>>>
>>>>
>>>> Thanks,
>>>> Divya
>>>>
>>>> On 27 April 2016 at 13:24, Ashutosh Kumar <kmr.ashutos...@gmail.com>
>>>> wrote:
>>>>
>>>>> I see there is a library, spark-csv, which can be used for removing the
>>>>> header and processing CSV files. But it seems it works with SQLContext
>>>>> only. Is there a way to remove the header from CSV files without SQLContext?
>>>>>
>>>>> Thanks
>>>>> Ashutosh
>>>>>
>>>>
>>>>
>>>
>>> --
>>>
>>> M'BAREK Med Nihed,
>>> Fedora Ambassador, TUNISIA, Northern Africa
>>> http://www.nihed.com
>>>
>>> <http://tn.linkedin.com/in/nihed>
>>>
>>>
>>> --
Mathieu Longtin
1-514-803-8977
