How about just using two fields: a boolean field to mark each record as good or bad, and another to hold the source file?
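
Roughly what I have in mind, as a minimal sketch in plain Scala rather than a tested Spark job. The names `Tagged`, `TaggedEtl`, `parse`, and `parseFile`, the integer payload, and the extra `lineNo` and `raw` fields are all made up for illustration. Because every field is a primitive or a String, the same flat shape should also map onto DataFrame columns, which `Either` does not.

```scala
import scala.util.{Failure, Success, Try}

// Hypothetical flat record: no Either, only primitive and String fields.
case class Tagged(
  isGood: Boolean,     // the boolean good/bad marker
  sourceFile: String,  // originating file
  lineNo: Long,        // originating line, 1-based (assumed extra field)
  raw: String,         // original input, kept so bad records can be logged
  value: Int           // parsed payload; only meaningful when isGood is true
)

object TaggedEtl {
  // Parse one line. A failure is captured in the record instead of thrown,
  // so a single bad record cannot stop the whole pipeline.
  def parse(sourceFile: String, lineNo: Long, raw: String): Tagged =
    Try(raw.trim.toInt) match {
      case Success(v) =>
        Tagged(isGood = true, sourceFile = sourceFile, lineNo = lineNo, raw = raw, value = v)
      case Failure(_) =>
        Tagged(isGood = false, sourceFile = sourceFile, lineNo = lineNo, raw = raw, value = 0)
    }

  // Tag every line of one file with its origin.
  def parseFile(sourceFile: String, lines: Seq[String]): Seq[Tagged] =
    lines.zipWithIndex.map { case (raw, i) => parse(sourceFile, i + 1L, raw) }

  def main(args: Array[String]): Unit = {
    val rows = parseFile("input.txt", Seq("1", "oops", "3"))
    // Split on the flag at any stage: log the bad rows, keep the good ones.
    val (good, bad) = rows.partition(_.isGood)
    bad.foreach(r => println(s"bad record at ${r.sourceFile}:${r.lineNo}: ${r.raw}"))
    println(good.map(_.value).sum)
  }
}
```

The same map and filter steps should carry over to an RDD of `Tagged`, and filtering on the `isGood` column would play the role of `partition` once the data is in a DataFrame.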

On Wed, Jul 15, 2015 at 10:31 AM, RJ Nowling <rnowl...@gmail.com> wrote:
> Hi all,
>
> I'm working on an ETL task with Spark. As part of this work, I'd like to
> mark records with some info such as:
>
> 1. Whether the record is good or bad (e.g, Either)
> 2. Originating file and lines
>
> Part of my motivation is to prevent errors with individual records from
> stopping the entire pipeline. I'd also like to filter out and log bad
> records at various stages.
>
> I could use RDD[Either[T]] for everything but that won't work for
> DataFrames. I was wondering if anyone has had a similar situation and if
> they found elegant ways to handle this?
>
> Thanks,
> RJ