zipWithIndex is fine. It will give you unique row IDs across your various
partitions.
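
For example, here's a minimal sketch (Spark 1.6 / Scala) of attaching row
numbers to a DataFrame read via spark-csv; the file path, header option, and
the "row_number" column name are placeholder assumptions:

    import org.apache.spark.sql.{Row, SQLContext}
    import org.apache.spark.sql.types.{LongType, StructField, StructType}

    val sqlContext = new SQLContext(sc)

    // Read the CSV with the databricks spark-csv package
    // (path and header option are assumptions).
    val df = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .load("data.csv")

    // zipWithIndex pairs each Row with a consecutive 0-based Long index.
    val indexed = df.rdd.zipWithIndex.map { case (row, idx) =>
      Row.fromSeq(row.toSeq :+ idx)
    }

    // Rebuild a DataFrame with an extra "row_number" column.
    val schema = StructType(df.schema.fields :+
      StructField("row_number", LongType, nullable = false))
    val dfWithRowNumbers = sqlContext.createDataFrame(indexed, schema)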

You can also use zipWithUniqueId, which saves the extra job that
zipWithIndex fires to count the elements in each partition. However, the two
differ in how indexes are assigned to rows: zipWithIndex produces
consecutive 0-based indexes, while zipWithUniqueId only guarantees that the
ids are unique. You can read more about the two APIs in the API
documentation.

https://spark.apache.org/docs/1.6.1/api/scala/index.html#org.apache.spark.rdd.RDD
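
To illustrate the difference, a small sketch (output shown for a
2-partition RDD; the exact ids depend on how elements are partitioned):

    val rdd = sc.parallelize(Seq("a", "b", "c", "d"), 2)

    // zipWithIndex: consecutive ids 0..n-1, at the cost of an extra
    // job to count the elements in each partition.
    rdd.zipWithIndex.collect()
    // Array((a,0), (b,1), (c,2), (d,3))

    // zipWithUniqueId: no extra job; element i of partition k gets id
    // i * numPartitions + k, so ids are unique but not consecutive.
    rdd.zipWithUniqueId.collect()
    // Array((a,0), (b,2), (c,1), (d,3))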

Hemant Bhanawat <https://www.linkedin.com/in/hemant-bhanawat-92a3811>
www.snappydata.io

On Thu, Sep 15, 2016 at 4:28 AM, Akshay Sachdeva <akshay.sachd...@gmail.com>
wrote:

> Environment:
> Apache Spark 1.6.2
> Scala: 2.10
>
> I am currently using the spark-csv package, courtesy of Databricks, and I
> would like to have a (pre-processing?) stage when reading the CSV file
> that also adds a row number to each row of data read from the file. This
> will allow for better traceability and data lineage in case of validation
> or data-processing issues downstream.
>
> From the research I've done, it seems like the zipWithIndex API is the
> right (or only) way to implement this pattern.
>
> Would this be the preferred route?  Would it be safe for parallel
> operations, i.e., with no collisions between row IDs?  Has anybody had a
> similar requirement and found a better solution you can point me to?
>
> Appreciate any help and responses anyone can offer.
>
> Thanks
> -a
>
