In the past I've handled this by filtering out the header line, but it
seems to me that it would be useful to have a way of dealing with files
that would preserve sequence, so that e.g. you could just do
mySequentialRDD.drop(1) to get rid of the header. There are other use cases
like this, too.
You must be parsing each
line of the file at some point anyway, so adding a step to filter out
the header should work fine. It'll get executed at the same time as your
parsing/conversion to ints, so there's no significant overhead aside
from the check itself.
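A minimal sketch of that suggestion, using plain Python stand-ins for `RDD.filter` / `RDD.map` (the helper names `is_data_row` and `parse` are mine, and the sample rows come from the message below):

```python
def is_data_row(line):
    # The header "id, counts" starts with the literal column name,
    # so anything beginning with "id" is treated as the header.
    return not line.startswith("id")

def parse(line):
    key, value = line.split(",")
    return (key, int(value))

lines = ["id, counts", "1,2", "1,5", "2,20", "2,25"]
rows = [parse(l) for l in lines if is_data_row(l)]
# In Spark this would be: data.filter(is_data_row).map(parse)
```

Since the filter and the parse run in the same pass over the data, the header check adds essentially nothing on top of the parsing you already do.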
For standalone programs, there's a
I am not sure.. the suggestion is to open a TB file and remove a line?
That doesn't sound that good.
I am hacking my way around it by using a filter..
Can I put a try/except clause in my lambda function? Maybe I should just
try that out.
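One wrinkle: a Python lambda can only contain a single expression, so a try/except can't go inside it — but a named function passed to `map` works just as well. A sketch (the function name `safe_parse` is mine):

```python
def safe_parse(line):
    """Return (id, count) for a data row, or None for the header or a bad row."""
    try:
        key, value = line.split(",")
        return (key, int(value))
    except ValueError:
        # The header row ("id, counts") fails int() parsing and lands here.
        return None

lines = ["id, counts", "1,2", "2,20"]
parsed = [p for p in map(safe_parse, lines) if p is not None]
# Spark equivalent: data.map(safe_parse).filter(lambda p: p is not None)
```

This also silently drops any other malformed rows, which may or may not be what you want for a TB-scale file.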
But thanks for the suggestion.
Also, can I run scripts against Spark?
A bad solution is to run a mapper through the data and null the counts; a
good solution is to trim the header beforehand, without Spark.
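That pre-trim can be done in a single streaming pass, buffering one line at a time, so it works even for very large files. A sketch (generator-based, so it composes with an open file handle; the name `strip_header` is mine):

```python
def strip_header(lines):
    """Yield every line except the first from an iterable of lines."""
    it = iter(lines)
    next(it, None)  # consume and discard the header line, if any
    for line in it:
        yield line

# Usage against real files (paths are placeholders):
# with open("data.csv") as src, open("noheader.csv", "w") as dst:
#     dst.writelines(strip_header(src))
```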
On Feb 26, 2014 9:28 AM, "Chengi Liu" wrote:
> Hi,
> How do we deal with headers in a CSV file?
> For example:
> id, counts
> 1,2
> 1,5
> 2,20
> 2,25
> ... and so on
Hi,
How do we deal with headers in a CSV file?
For example:
id, counts
1,2
1,5
2,20
2,25
... and so on
And I want to sum up the counts for each id. So the result will be:
1,7
2,45
and so on..
My code:
counts = data.map(lambda x: (x[0], int(x[1]))).reduceByKey(lambda a, b: a + b)
But
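For reference, the same aggregation can be checked locally with plain Python standing in for `reduceByKey`, using the sample rows from the message above and skipping the header:

```python
from collections import defaultdict

lines = ["id, counts", "1,2", "1,5", "2,20", "2,25"]
totals = defaultdict(int)
for line in lines[1:]:  # skip the header row
    key, value = line.split(",")
    totals[key] += int(value)
# 1 -> 2 + 5 = 7, 2 -> 20 + 25 = 45, matching the expected result.
# Spark equivalent:
# counts = data.map(lambda x: (x[0], int(x[1]))).reduceByKey(lambda a, b: a + b)
```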