Instead of data.zipWithIndex().filter(_._2==0), which will cause Spark to
read the whole file, use data.take(1), which is both simpler and cheaper.

From the RDD.take documentation: it works by first scanning one partition,
and using the results from that partition to estimate the number of
additional partitions needed to satisfy the limit. In this case, it will
trivially stop at the first.
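
For example, your helper could be reduced to something like this (a minimal
sketch, assuming the input is the same RDD[String] you pass to getHeader
below):

  import org.apache.spark.rdd.RDD

  // Return the first line of the file. take(1) only scans as many
  // partitions as it needs, so here it stops after the first one.
  def getHeader(data: RDD[String]): String =
    data.take(1).mkString("")

The mkString("") just collapses the one-element array returned by take(1)
(or an empty array, if the RDD is empty) into a String.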


Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
<http://shop.oreilly.com/product/0636920033073.do> (O'Reilly)
Typesafe <http://typesafe.com>
@deanwampler <http://twitter.com/deanwampler>
http://polyglotprogramming.com

On Tue, Mar 24, 2015 at 7:12 AM, Spico Florin <spicoflo...@gmail.com> wrote:

> Hello!
>
> I would like to know what is the optimal solution for getting the header
> from a CSV file with Spark? My approach was:
>
> def getHeader(data: RDD[String]): String = {
>   data.zipWithIndex().filter(_._2 == 0).map(x => x._1).take(1).mkString("")
> }
>
> Thanks.
>
