Record metadata with RDDs and DataFrames

RJ Nowling Wed, 15 Jul 2015 10:32:47 -0700

Hi all,

I'm working on an ETL task with Spark.  As part of this work, I'd like to
mark records with some info such as:


1. Whether the record is good or bad (e.g., with Either)
2. Originating file and lines

Part of my motivation is to prevent errors with individual records from
stopping the entire pipeline.  I'd also like to filter out and log bad
records at various stages.

I could use RDD[Either[E, T]] for everything, but that won't work for
DataFrames.  Has anyone had a similar situation and found an elegant way
to handle it?
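
To make the RDD-side idea concrete, here is a minimal sketch in plain Scala. It is only an illustration under stated assumptions: a Seq stands in for the RDD so it runs without Spark, and the names Provenance, BadRecord, GoodRecord, Tagged, parseLine, tagLines and split are all hypothetical, not part of any Spark API. Parsing a line as an integer is a placeholder for whatever the real ETL step does.

```scala
import scala.util.{Failure, Success, Try}

object RecordTagging {
  // Hypothetical metadata: originating file and 1-based line number.
  final case class Provenance(file: String, line: Long)

  // Hypothetical wrapper for a record that failed, with the reason kept for logging.
  final case class BadRecord(raw: String, reason: String, source: Provenance)

  // Hypothetical wrapper for a record that parsed successfully.
  final case class GoodRecord[T](value: T, source: Provenance)

  // Left = bad record, Right = good record.
  type Tagged[T] = Either[BadRecord, GoodRecord[T]]

  // Placeholder parse step: a failure becomes a Left instead of an exception,
  // so one bad line cannot stop the rest of the pipeline.
  def parseLine(file: String, raw: String, index: Int): Tagged[Int] = {
    val source = Provenance(file, index + 1L)
    Try(raw.trim.toInt) match {
      case Success(n) => Right(GoodRecord(n, source))
      case Failure(e) => Left(BadRecord(raw, e.getClass.getSimpleName, source))
    }
  }

  // Tag every line of one file with its outcome and provenance.
  def tagLines(file: String, lines: Seq[String]): Seq[Tagged[Int]] =
    lines.zipWithIndex.map { case (raw, idx) => parseLine(file, raw, idx) }

  // Separate bad records (to log) from good ones (to pass downstream).
  def split[T](tagged: Seq[Tagged[T]]): (Seq[BadRecord], Seq[GoodRecord[T]]) =
    (tagged.collect { case Left(b) => b }, tagged.collect { case Right(g) => g })
}
```

For example, RecordTagging.split(RecordTagging.tagLines("input.txt", Seq("1", "oops", "3"))) yields one bad record pointing at line 2 of input.txt and two good records. This covers only the RDD-style half of the question; how to carry the same tagging through DataFrames is left open.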

Thanks,
RJ
