Hello!
Thank you for your responses. I was afraid that due to partitioning I would
lose the logic that the first element is the header. I observe that
rdd.first internally calls rdd.take(1).
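For what it's worth, a minimal sketch of that equivalence; the sc handle
and the data.csv path are assumptions, not from this thread:

    // Sketch only: first() is implemented in terms of take(1),
    // so both return the first line of the RDD.
    val data = sc.textFile("data.csv") // hypothetical input
    val viaFirst = data.first()
    val viaTake  = data.take(1).head   // take(1) returns an Array[String]
    assert(viaFirst == viaTake)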
Best regards,
Florin
On Tue, Mar 24, 2015 at 4:41 PM, Dean Wampler wrote:
> Instead of data.zipWithIndex().filter(_._2==0), which will cause Spark to
> read the whole file, use data.take(1), which is simpler.
;Dean Wampler"
Sent: March 24, 2015 9:19 AM
To: "Sean Owen"
Cc: "Spico Florin" , "user"
Subject: Re: Optimal solution for getting the header from CSV with Spark
Good point. There's no guarantee that you'll get the actual first
partition. One reason why I wouldn't allow a CSV header line in a real data
file, if I could avoid it.
Back to Spark, a safer approach is RDD.foreachPartition, which takes a
function expecting an iterator. You'll only need to grab the first element
of the iterator for the first partition.
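Not from the thread, but to make the per-partition idea concrete, here is a
minimal sketch using the related transform mapPartitionsWithIndex, which
hands each partition's iterator to a function along with the partition
index; the sc handle and the data.csv path are assumptions:

    // Sketch: treat the iterator of partition 0 specially to skip the header.
    val data = sc.textFile("data.csv") // hypothetical input
    val rows = data.mapPartitionsWithIndex { (idx, iter) =>
      if (idx == 0) iter.drop(1) // drop the header line in the first partition
      else iter
    }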
I think this works in practice, but I don't know that the first block
of the file is guaranteed to be in the first partition? Certainly
later down the pipeline that won't be true, but presumably this is
happening right after reading the file.
I've always just written some filter that would only match the header line.
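A minimal sketch of that filter idea, not Sean's actual code; it assumes
sc and reuses first() to learn what the header looks like:

    // Sketch: read the header once, then filter out any line equal to it.
    val data   = sc.textFile("data.csv") // hypothetical input
    val header = data.first()            // the header line
    val rows   = data.filter(line => line != header)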
Instead of data.zipWithIndex().filter(_._2==0), which will cause Spark to
read the whole file, use data.take(1), which is simpler.
From the RDD.take documentation: it works by first scanning one partition,
and using the results from that partition to estimate the number of
additional partitions needed to satisfy the limit.
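A minimal sketch of the take(1) approach; the sc handle and the path are
assumptions:

    // Sketch: take(1) scans only as many partitions as needed for one
    // element, so the whole file is not read.
    val data   = sc.textFile("data.csv")               // hypothetical input
    val header = data.take(1).headOption.getOrElse("") // Array[String] -> String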
Hello!
I would like to know the optimal solution for getting the header
from a CSV file with Spark. My approach was:
import org.apache.spark.rdd.RDD

def getHeader(data: RDD[String]): String =
  data.zipWithIndex().filter(_._2 == 0).map(_._1).take(1).mkString("")
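For context, a hypothetical call site; sc and the path are assumptions:

    val header = getHeader(sc.textFile("data.csv"))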
Thanks.