Yeah, I'd just add a bunch of columns. It doesn't seem like that big of a deal.
On Wed, Jul 15, 2015 at 10:53 AM, RJ Nowling rnowl...@gmail.com wrote:
I'm considering a few approaches -- one of which is to provide new
functions like mapLeft, mapRight, filterLeft, etc.
But this all falls short with DataFrames. RDDs can easily be extended
from RDD[T] to RDD[Record[T]]. I guess with DataFrames, I could add
special columns?
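For illustration only, here is a rough, Spark-free sketch in plain Python of what helpers in the spirit of mapRight and filterLeft could look like. Everything in it is an assumption rather than an existing Spark API: the `Left` and `Right` classes, the snake_case function names, and the convention (borrowed from Scala's Either) that `Right` holds a good record and `Left` holds an error description.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable, Iterator, Union


@dataclass(frozen=True)
class Left:
    """A bad record: carries a description of what went wrong (hypothetical type)."""
    error: str


@dataclass(frozen=True)
class Right:
    """A good record: carries the payload (hypothetical type)."""
    value: Any


Record = Union[Left, Right]


def map_right(records: Iterable[Record], fn: Callable[[Any], Any]) -> Iterator[Record]:
    """Apply fn to good records only; an exception turns the record into a Left."""
    for rec in records:
        if isinstance(rec, Right):
            try:
                yield Right(fn(rec.value))
            except Exception as exc:  # one bad record must not stop the pipeline
                yield Left(f"{type(exc).__name__}: {exc}")
        else:
            yield rec  # already bad: pass it through untouched


def filter_right(records: Iterable[Record]) -> Iterator[Right]:
    """Keep only the good records."""
    return (rec for rec in records if isinstance(rec, Right))


def filter_left(records: Iterable[Record]) -> Iterator[Left]:
    """Keep only the bad records, e.g. to log them."""
    return (rec for rec in records if isinstance(rec, Left))


# Usage: parse strings to ints; the unparseable one becomes a Left.
raw = [Right("1"), Right("x"), Right("3")]
parsed = list(map_right(raw, int))
good = [rec.value for rec in filter_right(parsed)]
bad = [rec.error for rec in filter_left(parsed)]
```

The point of the wrapper is that a failure in one stage is recorded on the record itself, so later stages can skip it and a logging step can collect it.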
On Wed, Jul 15, 2015 at 12:36 PM, Reynold Xin r...@databricks.com wrote:
How about just using two fields, one boolean field to mark good/bad, and
another to get the source file?
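A minimal sketch of that two-field idea, written in plain Python without Spark so it stays self-contained. The field names (`is_good`, `source_file`, `line_no`, `error`), the `tag_records` helper, and the sample file name are all hypothetical; in a DataFrame they would presumably become ordinary columns next to the data.

```python
from typing import Any, Callable, Dict, Iterable, List


def tag_records(
    lines: Iterable[str],
    source_file: str,
    parse: Callable[[str], Any],
) -> List[Dict[str, Any]]:
    """Parse each line and tag the result with a good/bad flag and its origin."""
    rows = []
    for line_no, line in enumerate(lines, start=1):
        row = {
            "value": None,
            "is_good": False,
            "source_file": source_file,  # where the record came from
            "line_no": line_no,          # extra field, assumed useful for logging
            "error": None,               # extra field, assumed useful for logging
        }
        try:
            row["value"] = parse(line)
            row["is_good"] = True
        except Exception as exc:  # keep going: one bad line should not stop the run
            row["error"] = str(exc)
        rows.append(row)
    return rows


# Usage: the second line cannot be parsed as an integer.
rows = tag_records(["10", "oops", "30"], "part-0001.txt", int)
good_rows = [row for row in rows if row["is_good"]]
bad_rows = [row for row in rows if not row["is_good"]]
```

Filtering on the boolean field separates good from bad records at any stage, and the origin fields are still available when the bad ones are logged.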
On Wed, Jul 15, 2015 at 10:31 AM, RJ Nowling rnowl...@gmail.com wrote:
Hi all,
I'm working on an ETL task with Spark. As part of this work, I'd like
to mark records with some info such as:
1. Whether the record is good or bad (e.g., Either)
2. Originating file and lines
Part of my motivation is to prevent errors with individual records from
stopping the entire pipeline. I'd also like to filter out and log bad
records at various stages.
I could use RDD[Either[T]] for everything but that won't work for
DataFrames. I was wondering if anyone has had a similar situation and if
they found elegant ways to handle this?
Thanks,
RJ