I'm considering a few approaches -- one of which is to provide new functions like mapLeft, mapRight, filterLeft, etc.
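To make that concrete, here is a rough sketch of what such helpers might look like. It runs on a plain Scala Seq so that it works without a cluster; on an RDD the bodies would presumably call rdd.map and rdd.filter instead. The names EitherSeqSketch, EitherSeqOps and parseInt are made up for illustration, and the semantics I picked for filterLeft (keep every good record, keep a bad record only if it matches the predicate) are just one possible choice.

```scala
object EitherSeqSketch {
  // Hypothetical enrichment, not part of Spark or the Scala standard library.
  implicit class EitherSeqOps[L, R](val records: Seq[Either[L, R]]) {
    // Transform good (Right) records; bad (Left) records pass through untouched.
    def mapRight[R2](f: R => R2): Seq[Either[L, R2]] = records.map {
      case Right(r) => Right(f(r))
      case Left(l)  => Left(l)
    }

    // Transform bad (Left) records, e.g. to attach more context to an error.
    def mapLeft[L2](f: L => L2): Seq[Either[L2, R]] = records.map {
      case Left(l)  => Left(f(l))
      case Right(r) => Right(r)
    }

    // Assumed semantics: keep all good records, keep bad ones only if p holds.
    def filterLeft(p: L => Boolean): Seq[Either[L, R]] = records.filter {
      case Left(l)  => p(l)
      case Right(_) => true
    }
  }

  // Hypothetical parser: a failure becomes a Left instead of an exception.
  def parseInt(s: String): Either[String, Int] =
    try Right(s.toInt)
    catch { case _: NumberFormatException => Left("not a number: " + s) }

  def main(args: Array[String]): Unit = {
    val parsed = Seq("1", "x", "3").map(parseInt)
    println(parsed.mapRight(_ * 2))        // the bad record is carried along unchanged
    println(parsed.mapLeft(_.toUpperCase)) // only the error message is rewritten
    println(parsed.filterLeft(_ => false)) // all bad records are dropped
  }
}
```

Pattern matching on Left/Right is used rather than a right-biased map so the sketch does not depend on a particular Scala version.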
But this all falls short with DataFrames. RDDs can easily be extended from RDD[T] to RDD[Record[T]]. I guess with DataFrames, I could add special columns?

On Wed, Jul 15, 2015 at 12:36 PM, Reynold Xin <r...@databricks.com> wrote:

> How about just using two fields, one boolean field to mark good/bad, and
> another to get the source file?
>
>
> On Wed, Jul 15, 2015 at 10:31 AM, RJ Nowling <rnowl...@gmail.com> wrote:
>
>> Hi all,
>>
>> I'm working on an ETL task with Spark. As part of this work, I'd like to
>> mark records with some info such as:
>>
>> 1. Whether the record is good or bad (e.g., Either)
>> 2. Originating file and lines
>>
>> Part of my motivation is to prevent errors with individual records from
>> stopping the entire pipeline. I'd also like to filter out and log bad
>> records at various stages.
>>
>> I could use RDD[Either[T]] for everything but that won't work for
>> DataFrames. I was wondering if anyone has had a similar situation and if
>> they found elegant ways to handle this?
>>
>> Thanks,
>> RJ
>>
>
>