I'm considering a few approaches -- one of which is to provide new
functions like mapLeft, mapRight, filterLeft, etc.
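
A minimal sketch of what such helpers might look like, shown on a plain Seq so it runs without a cluster. The Record wrapper, its field names, and the exact semantics of mapRight and filterLeft are assumptions made for illustration, not an existing Spark API. RDD offers map and filter with the same shape, so the idea should carry over.

```scala
import scala.util.control.NonFatal

object RecordOps {
  // Hypothetical wrapper: where a record came from, plus either an
  // error message (Left) or a successfully parsed value (Right).
  final case class Record[T](sourceFile: String, line: Long, payload: Either[String, T])

  // Apply f to good records only. An exception thrown by f turns that one
  // record into a bad record instead of failing the whole job.
  def mapRight[A, B](rs: Seq[Record[A]])(f: A => B): Seq[Record[B]] =
    rs.map { r =>
      val next: Either[String, B] = r.payload match {
        case Right(a) =>
          try Right(f(a))
          catch { case NonFatal(e) => Left(e.toString) }
        case Left(err) => Left(err)
      }
      Record(r.sourceFile, r.line, next)
    }

  // Keep only the bad records, e.g. for logging.
  def filterLeft[T](rs: Seq[Record[T]]): Seq[Record[T]] =
    rs.filter(_.payload.isLeft)

  // Keep only the good records, e.g. for the next pipeline stage.
  def filterRight[T](rs: Seq[Record[T]]): Seq[Record[T]] =
    rs.filter(_.payload.isRight)

  def main(args: Array[String]): Unit = {
    val raw = Seq(
      Record[String]("a.csv", 1L, Right("42")),
      Record[String]("a.csv", 2L, Right("oops"))
    )
    val parsed = mapRight(raw)(_.toInt)
    filterLeft(parsed).foreach(r => println(s"bad record at ${r.sourceFile}:${r.line}"))
    filterRight(parsed).foreach(r => println(s"good record: ${r.payload}"))
  }
}
```

Because the origin travels with each record, a bad record can still be traced back to its file and line at any later stage.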

But this all falls short with DataFrames.  RDDs can easily be extended
from RDD[T] to RDD[Record[T]].  I guess with DataFrames, I could add
special columns?
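
One way the special-columns idea could be sketched is below. The column names is_good and source_file and the helper names are hypothetical, and the "non-blank payload" rule is only a stand-in for real validation. The code assumes a Spark version that provides SparkSession, which is newer than the release current when this thread was written.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, length, trim}

object TaggedRows {
  // Hypothetical column names; nothing in Spark reserves them.
  val GoodCol = "is_good"
  val SourceCol = "source_file"

  // Mark a row as good when its payload is non-null and non-blank.
  def markGood(df: DataFrame, payload: String): DataFrame =
    df.withColumn(GoodCol, col(payload).isNotNull && length(trim(col(payload))) > 0)

  // Split on the marker column at any later stage.
  def good(df: DataFrame): DataFrame = df.filter(col(GoodCol))
  def bad(df: DataFrame): DataFrame = df.filter(!col(GoodCol))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("tagged-rows")
      .getOrCreate()
    import spark.implicits._

    // Toy input; in a real job source_file would be filled in at load time.
    val raw = Seq(("42", "a.csv"), ("", "a.csv"), ("7", "b.csv"))
      .toDF("value", SourceCol)

    val tagged = markGood(raw, "value")
    good(tagged).show()
    bad(tagged).select(SourceCol, "value").show() // candidates for logging

    spark.stop()
  }
}
```

The marker and source columns ride along through ordinary DataFrame transformations, at the cost of every stage having to remember to preserve them.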

On Wed, Jul 15, 2015 at 12:36 PM, Reynold Xin <r...@databricks.com> wrote:

> How about just using two fields, one boolean field to mark good/bad, and
> another to get the source file?
>
>
> On Wed, Jul 15, 2015 at 10:31 AM, RJ Nowling <rnowl...@gmail.com> wrote:
>
>> Hi all,
>>
>> I'm working on an ETL task with Spark.  As part of this work, I'd like to
>> mark records with some info such as:
>>
>> 1. Whether the record is good or bad (e.g., Either)
>> 2. Originating file and lines
>>
>> Part of my motivation is to prevent errors with individual records from
>> stopping the entire pipeline.  I'd also like to filter out and log bad
>> records at various stages.
>>
>> I could use RDD[Either[T]] for everything but that won't work for
>> DataFrames.  I was wondering if anyone has had a similar situation and if
>> they found elegant ways to handle this?
>>
>> Thanks,
>> RJ
>>
>
>
