Re: Optimal solution for getting the header from CSV with Spark

2015-03-25 Thread Felix C
The spark-csv package can handle the header row, and the relevant code is at the link below. It can also use the header to infer field names for the schema. https://github.com/databricks/spark-csv/blob/master/src/main/scala/com/databricks/spark/csv/CsvRelation.scala
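
For reference, a minimal sketch of how the package is typically used with a header row (assuming Spark 1.4+ with the spark-csv artifact on the classpath; the file path is illustrative):

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)  // assumes an existing SparkContext `sc`

    // Treat the first line as column names and, optionally, infer column types.
    val df = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .load("path/to/file.csv")          // illustrative path

    df.printSchema()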

Re: Optimal solution for getting the header from CSV with Spark

2015-03-25 Thread Spico Florin
Hello! Thank you for your responses. I was afraid that, due to partitioning, I would lose the guarantee that the first element is the header. I also noticed that rdd.first delegates to rdd.take(1) under the hood. Best regards, Florin On Tue, Mar 24, 2015 at 4:41 PM, Dean Wampler deanwamp...@gmail.com wrote:

Optimal solution for getting the header from CSV with Spark

2015-03-24 Thread Spico Florin
Hello! I would like to know what the optimal solution is for getting the header from a CSV file with Spark. My approach was:

    def getHeader(data: RDD[String]): String =
      data.zipWithIndex().filter(_._2 == 0).map(x => x._1).take(1).mkString()

Thanks.

Re: Optimal solution for getting the header from CSV with Spark

2015-03-24 Thread Sean Owen
I think this works in practice, but I don't know that the first block of the file is guaranteed to be in the first partition. Certainly later down the pipeline that won't be true, but presumably this is happening right after reading the file. I've always just written a filter that strips out the header line.
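
A minimal sketch of that filter-based approach (the path and the exact header test are illustrative assumptions):

    val data = sc.textFile("path/to/file.csv")   // assumes an existing SparkContext `sc`
    val header = data.first()                    // first() delegates to take(1)
    val rows = data.filter(_ != header)          // note: also drops any data row identical to the header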

Re: Optimal solution for getting the header from CSV with Spark

2015-03-24 Thread Dean Wampler
Good point. There's no guarantee that you'll get the actual first partition. That's one reason why I wouldn't allow a CSV header line in a real data file if I could avoid it. Back to Spark, a safer approach is RDD.foreachPartition, which takes a function expecting an iterator. You'll only need to grab the first element from the first partition's iterator.
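
A sketch of that partition-aware idea, written here with mapPartitionsWithIndex rather than foreachPartition so the header comes back as a value; the path, and the assumption that partition 0 holds the first input split (as it does for Hadoop text files), are illustrative:

    val data = sc.textFile("path/to/file.csv")

    // Keep only the first element of partition 0, assumed to be the header line.
    val header: String = data.mapPartitionsWithIndex { (idx, iter) =>
      if (idx == 0) iter.take(1) else Iterator.empty
    }.first()

    // Drop that line from the data itself.
    val rows = data.mapPartitionsWithIndex { (idx, iter) =>
      if (idx == 0) iter.drop(1) else iter
    }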

Re: Optimal solution for getting the header from CSV with Spark

2015-03-24 Thread Dean Wampler
Instead of data.zipWithIndex().filter(_._2 == 0), which will cause Spark to read the whole file, use data.take(1), which is simpler. From the RDD.take documentation: it works by first scanning one partition, and using the results from that partition to estimate the number of additional partitions needed to satisfy the limit.
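
A minimal sketch of that simpler version (file path illustrative):

    val data = sc.textFile("path/to/file.csv")
    // take(1) scans only as many partitions as needed, usually just the first one.
    val header: String = data.take(1).mkString()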