There are two ways to do so.

First, this way cleanly guarantees that only the header is skipped, although
using mapPartitionsWithIndex adds some overhead:

rdd.mapPartitionsWithIndex { (idx, iter) =>
  if (idx == 0) iter.drop(1) else iter
}
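The per-partition logic above can be sketched without a cluster by simulating
the RDD's partitions as plain Scala collections (a minimal sketch; the
partition contents here are hypothetical):

```scala
// Simulate an RDD's partitions as a sequence of collections.
// Only partition 0 starts with the header, so only it drops a row.
val partitions = Seq(
  Seq("name,age", "alice,30", "bob,25"), // partition 0: header + data
  Seq("carol,41", "dave,19")             // partition 1: data only
)

// Equivalent of rdd.mapPartitionsWithIndex { (idx, iter) => ... }
val withoutHeader = partitions.zipWithIndex.flatMap { case (part, idx) =>
  if (idx == 0) part.drop(1) else part
}

println(withoutHeader) // List(alice,30, bob,25, carol,41, dave,19)
```

Because the header can only live in partition 0, no data row is ever dropped
by mistake, regardless of its contents.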


Second, you can do:

val header = rdd.first()
val data = rdd.filter(_ != header)

The second method does not guarantee that only the first row is skipped,
because other records might be exactly identical to the header.
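The pitfall can be demonstrated on plain collections (a minimal sketch; the
rows are hypothetical):

```scala
// The filter-out-the-header approach: if a data row happens to be
// identical to the header line, it gets dropped as well.
val rows = Seq("id,value", "1,a", "id,value", "2,b") // second "id,value" is real data

val header = rows.head
val data = rows.filter(_ != header)

println(data) // List(1,a, 2,b) -- the duplicate data row was also removed
```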


The CSV data source uses the second approach, so I left a TODO about this in
the PR I recently opened.



2016-04-27 14:59 GMT+09:00 nihed mbarek <nihe...@gmail.com>:

> You can add a filter on a string that you are sure appears only in the
> header.
>
>
> On Wednesday, April 27, 2016, Divya Gehlot <divya.htco...@gmail.com>
> wrote:
>
>> Yes, you can remove the header by removing the first row;
>>
>> you can use first() or head() to do that.
>>
>>
>> Thanks,
>> Divya
>>
>> On 27 April 2016 at 13:24, Ashutosh Kumar <kmr.ashutos...@gmail.com>
>> wrote:
>>
>>> I see there is a library, spark-csv, which can be used for removing the
>>> header and processing CSV files. But it seems it works with SQLContext
>>> only. Is there a way to remove the header from CSV files without SQLContext?
>>>
>>> Thanks
>>> Ashutosh
>>>
>>
>>
>
> --
>
> M'BAREK Med Nihed,
> Fedora Ambassador, TUNISIA, Northern Africa
> http://www.nihed.com
>
> <http://tn.linkedin.com/in/nihed>
>
>
>
