I think this works in practice, but I don't know that the first block
of the file is guaranteed to end up in the first partition. Certainly
further down the pipeline that won't be true, but presumably this is
happening right after reading the file.
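If that assumption holds, something like this (just a sketch, not
tested) is what I'd expect the take(1) route to look like:

import org.apache.spark.rdd.RDD

// Grab the first line as the header, then drop it from the rest of the
// data. Assumes take(1) returns the file's first line, i.e. that the
// header lives in the first partition read.
def splitHeader(data: RDD[String]): (String, RDD[String]) = {
  val header = data.take(1).head         // driver-side: just one line
  val rows   = data.filter(_ != header)  // also drops rows identical to the header
  (header, rows)
}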

I've always just written a filter that matches only the header, along
the lines of the sketch below. That assumes the header is
distinguishable from the data rows, but it usually is.
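For example (the column names here are made up; adjust the predicate
to whatever makes your header recognisable):

import org.apache.spark.rdd.RDD

// Hypothetical header "id,name,value"; any test that only the header
// line can pass works here.
def isHeader(line: String): Boolean = line.startsWith("id,name,value")

def dropHeader(data: RDD[String]): RDD[String] =
  data.filter(line => !isHeader(line))

The nice thing is that this doesn't care which partition the header
ends up in.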

On Tue, Mar 24, 2015 at 2:41 PM, Dean Wampler <deanwamp...@gmail.com> wrote:
> Instead of data.zipWithIndex().filter(_._2==0), which will cause Spark to
> read the whole file, use data.take(1), which is simpler.
>
> From the RDD.take documentation, it works by first scanning one partition,
> and using the results from that partition to estimate the number of
> additional partitions needed to satisfy the limit. In this case, it will
> trivially stop at the first.
>
>
> Dean Wampler, Ph.D.
> Author: Programming Scala, 2nd Edition (O'Reilly)
> Typesafe
> @deanwampler
> http://polyglotprogramming.com
>
> On Tue, Mar 24, 2015 at 7:12 AM, Spico Florin <spicoflo...@gmail.com> wrote:
>>
>> Hello!
>>
>> I would like to know what the optimal solution is for getting the
>> header from a CSV file with Spark. My approach was:
>>
>> import org.apache.spark.rdd.RDD
>>
>> def getHeader(data: RDD[String]): String =
>>   data.zipWithIndex().filter(_._2 == 0).map(_._1).take(1).mkString("")
>>
>> Thanks.
>
>
