Re: removing header from csv file

2016-05-03 Thread Michael Segel
Hi, 
Another silly question… 

Don’t you want to use the header line to help create a schema for the RDD? 

Thx

-Mike
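
One way to read this suggestion, as a minimal sketch: keep the header line, split it into field names, and turn each remaining line into a record keyed by those names. The helpers `parse_header` and `to_record` are illustrative names, not part of Spark or spark-csv, and the sketch assumes comma-separated lines that the standard `csv` module can parse one at a time (no quoted newlines).

```python
import csv

def parse_header(header_line):
    """Split the header line into a list of column names."""
    return next(csv.reader([header_line]))

def to_record(line, columns):
    """Turn one data line into a dict keyed by the header's column names."""
    values = next(csv.reader([line]))
    return dict(zip(columns, values))

# Plain-Python demonstration. With an RDD of lines one might write, e.g.:
#   header = data.first()
#   columns = parse_header(header)
#   records = data.filter(lambda l: l != header).map(lambda l: to_record(l, columns))
lines = ["id,name,city", "1,Ann,Paris", "2,Bob,Oslo"]
columns = parse_header(lines[0])
records = [to_record(line, columns) for line in lines[1:]]
print(records)
```

All values stay strings here; converting them to typed fields would be a further step that depends on the data.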




Re: removing header from csv file

2016-05-03 Thread Mathieu Longtin
This only works if the files are "unsplittable". With gzip files, for
example, each partition is one file (if you have more partitions than
files), so the first line of each partition is the header.

The spark-csv extension reads the very first line of the RDD, assumes it's
the header, and then filters out every occurrence of that line. Something
like this (Python code here, but Scala should be very similar):

header = data.first()
data = data.filter(lambda line: line != header)

Since I had lots of small CSV files, and not all of them have exactly the
same header, I use the following:

file_list = sc.parallelize(list_of_csv)
data = file_list.flatMap(function_that_reads_csvs_and_extracts_the_colums_I_want)
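
The placeholder function passed to flatMap above could look roughly like the following. This is a sketch under assumptions, not the actual code used: `WANTED_COLUMNS`, `extract_columns` and `read_selected_columns` are hypothetical names, and it assumes every path in the list is readable from each executor (for example on a shared filesystem).

```python
import csv

# Hypothetical: the columns to keep, looked up by name in each file's own header.
WANTED_COLUMNS = ["id", "name"]

def extract_columns(lines):
    """Yield only the wanted columns from an iterable of CSV lines.

    csv.DictReader consumes the header itself, so no header row is yielded,
    and files whose headers differ in column order still line up by name.
    """
    for row in csv.DictReader(lines):
        yield tuple(row.get(col) for col in WANTED_COLUMNS)

def read_selected_columns(path):
    """Open one CSV file and yield the wanted columns of each of its rows."""
    with open(path, newline="") as handle:
        yield from extract_columns(handle)

# With Spark this would be plugged in as:
#   data = sc.parallelize(list_of_csv).flatMap(read_selected_columns)
```

Because each file is parsed with its own header, no header line ever reaches the resulting RDD, so nothing has to be filtered out afterwards.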




Mathieu Longtin
1-514-803-8977


Re: removing header from csv file

2016-05-03 Thread Abhishek Anand
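You can use this function to remove the header from your dataset
(applicable to RDDs):

import org.apache.spark.rdd.RDD

def dropHeader(data: RDD[String]): RDD[String] = {
  data.mapPartitionsWithIndex((idx, lines) => {
    // Only the first partition starts with the header line.
    // The result of drop(1) must be returned, not discarded.
    if (idx == 0) lines.drop(1) else lines
  })
}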


Abhi



Re: removing header from csv file

2016-04-27 Thread Marco Mistroni
If you are using the Scala API you can do
Myrdd.zipWithIndex.filter(_._2 > 0).map(_._1)

It may be a little complicated, but it will do the trick.
As for spark-csv, you will get back a DataFrame, which you can convert back
to an RDD.
HTH
Marco
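
In PySpark the same idea could be written as `rdd.zipWithIndex().filter(lambda pair: pair[1] > 0).map(lambda pair: pair[0])`; note that `zipWithIndex` has to run a Spark job when the RDD has more than one partition. Below is a plain-Python model of that pipeline, where `zip_with_index` and `drop_first` are illustrative helpers standing in for the RDD calls, not Spark APIs.

```python
def zip_with_index(lines):
    """Pair each element with its position, as (element, index)."""
    return ((line, idx) for idx, line in enumerate(lines))

def drop_first(lines):
    """Keep every element whose index is greater than 0."""
    return [line for line, idx in zip_with_index(lines) if idx > 0]

print(drop_first(["h1,h2", "1,2", "3,4"]))
```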


Re: removing header from csv file

2016-04-27 Thread Nachiketa
Why "without sqlcontext"? Could you please describe what it is that you
are trying to accomplish? Thanks.

Regards,
Nachiketa

On Wed, Apr 27, 2016 at 10:54 AM, Ashutosh Kumar 
wrote:

> I see there is a library spark-csv which can be used for removing header
> and processing of csv files. But it seems it works with sqlcontext only. Is
> there a way to remove header from csv files without sqlcontext ?
>
> Thanks
> Ashutosh
>



-- 
Regards,
-- Nachiketa


Re: removing header from csv file

2016-04-27 Thread Hyukjin Kwon
There are two ways to do so.


First, this way cleanly skips only the header. But of course the use of
mapPartitionsWithIndex decreases performance.

rdd.mapPartitionsWithIndex { (idx, iter) => if (idx == 0) iter.drop(1) else iter }


Second, you can do

val header = rdd.first()
val data = rdd.filter(_ != header)

The second method is not guaranteed to skip only the first line, because
there might be records that are exactly the same as the header.


The CSV data source uses the second way, so I left a TODO in the PR I
recently opened.
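
The difference between the two approaches can be shown without a cluster. The sketch below models partitions as plain Python lists; `drop_header` has the `(index, iterator)` signature that PySpark's `mapPartitionsWithIndex` expects, while `partitions` and the sample rows are made up for illustration.

```python
from itertools import islice

def drop_header(idx, lines):
    """Skip the first line of partition 0; pass other partitions through."""
    return islice(lines, 1, None) if idx == 0 else lines

# Made-up data: the last row happens to be identical to the header.
partitions = [["a,b", "1,2"], ["3,4", "a,b"]]
header = partitions[0][0]

# First way: drop by position (models rdd.mapPartitionsWithIndex(drop_header)).
by_index = [line for idx, part in enumerate(partitions)
            for line in drop_header(idx, iter(part))]

# Second way: drop by value (models rdd.filter(lambda line: line != header)).
by_value = [line for part in partitions for line in part if line != header]

print(by_index)  # the data row that equals the header survives
print(by_value)  # every line equal to the header is removed
```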





RE: removing header from csv file

2016-04-27 Thread Mishra, Abhishek
You should be doing something like this:


data = sc.textFile('file:///path1/path/test1.csv')
header = data.first()  # extract the header line
# print header
data = data.filter(lambda x: x != header)  # drop every line equal to the header
# print data
Hope it helps.

Sincerely,
Abhishek
+91-7259028700




Re: removing header from csv file

2016-04-26 Thread nihed mbarek
You can add a filter on a string that you are sure appears only in the
header.
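
A sketch of that idea, assuming the header is the only line that contains a known marker string. `HEADER_MARKER` and `is_data_line` are hypothetical names chosen for this example.

```python
# Hypothetical marker: a column name that never appears in a data row.
HEADER_MARKER = "customer_id"

def is_data_line(line):
    """Return True for lines that do not contain the header marker."""
    return HEADER_MARKER not in line

# With an RDD of lines: data = rdd.filter(is_data_line)
lines = ["customer_id,amount", "17,9.50", "18,3.25"]
print([line for line in lines if is_data_line(line)])
```

This also drops the header from every file when several files share it, but it silently removes any data row that happens to contain the marker.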

On Wednesday, 27 April 2016, Divya Gehlot wrote:

> yes you can remove the headers by removing the first row
>
> can first() or head() to do that
>
>
> Thanks,
> Divya
>
> On 27 April 2016 at 13:24, Ashutosh Kumar  > wrote:
>
>> I see there is a library spark-csv which can be used for removing header
>> and processing of csv files. But it seems it works with sqlcontext only. Is
>> there a way to remove header from csv files without sqlcontext ?
>>
>> Thanks
>> Ashutosh
>>
>
>

-- 

M'BAREK Med Nihed,
Fedora Ambassador, TUNISIA, Northern Africa
http://www.nihed.com




Re: removing header from csv file

2016-04-26 Thread Divya Gehlot
Yes, you can remove the header by removing the first row.

You can use first() or head() to do that.


Thanks,
Divya

On 27 April 2016 at 13:24, Ashutosh Kumar  wrote:

> I see there is a library spark-csv which can be used for removing header
> and processing of csv files. But it seems it works with sqlcontext only. Is
> there a way to remove header from csv files without sqlcontext ?
>
> Thanks
> Ashutosh
>


Re: removing header from csv file

2016-04-26 Thread Praveen Devarao
Hi Ashutosh,

Could you give more details about what you want to do and which feature of
Spark you want to use? Yes, spark-csv is a connector for the Spark SQL
module, hence it works with SQLContext only.

Thanking You
-
Praveen Devarao
Spark Technology Centre
IBM India Software Labs
-
"Courage doesn't always roar. Sometimes courage is the quiet voice at the 
end of the day saying I will try again"



From:   Ashutosh Kumar <kmr.ashutos...@gmail.com>
To: "user @spark" <user@spark.apache.org>
Date:   27/04/2016 10:55 am
Subject:    removing header from csv file



I see there is a library spark-csv which can be used for removing header 
and processing of csv files. But it seems it works with sqlcontext only. 
Is there a way to remove header from csv files without sqlcontext ? 

Thanks
Ashutosh





Re: removing header from csv file

2016-04-26 Thread Takeshi Yamamuro
Hi,

What do you mean by "with sqlcontext only"?
Do you mean you'd like to load CSV data as an RDD (SparkContext) or something?

// maropu

On Wed, Apr 27, 2016 at 2:24 PM, Ashutosh Kumar 
wrote:

> I see there is a library spark-csv which can be used for removing header
> and processing of csv files. But it seems it works with sqlcontext only. Is
> there a way to remove header from csv files without sqlcontext ?
>
> Thanks
> Ashutosh
>



-- 
---
Takeshi Yamamuro


removing header from csv file

2016-04-26 Thread Ashutosh Kumar
I see there is a library, spark-csv, which can be used for removing the
header and processing CSV files. But it seems it works with SQLContext only.
Is there a way to remove the header from CSV files without SQLContext?

Thanks
Ashutosh