Re: Record metadata with RDDs and DataFrames

2015-07-15 Thread RJ Nowling
I'm considering a few approaches -- one of which is to provide new
functions like mapLeft, mapRight, filterLeft, etc.

But this all falls short with DataFrames.  RDDs can easily be extended
from RDD[T] to RDD[Record[T]].  I guess with DataFrames, I could add
special columns?
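
A minimal sketch of the RDD[Record[T]] idea, for illustration only. The
names Source, mapRight, mapLeft and RecordDemo, the String error type, and
the file name input.txt are assumptions, not an existing Spark API. Plain
Scala collections stand in for an RDD here so the sketch runs without
Spark; the same functions could be passed to RDD.map and RDD.filter.

```scala
import scala.util.{Failure, Success, Try}

// Hypothetical provenance info: originating file and line number.
final case class Source(file: String, line: Long)

// Hypothetical wrapper: Left holds an error message, Right holds a good value.
final case class Record[T](value: Either[String, T], source: Source) {
  def isGood: Boolean = value.isRight

  // Apply f to a good value. A non-fatal exception thrown by f turns the
  // record into a bad one instead of propagating; bad records pass through.
  def mapRight[U](f: T => U): Record[U] = value match {
    case Right(v) =>
      Try(f(v)) match {
        case Success(u) => Record[U](Right(u), source)
        case Failure(e) => Record[U](Left(e.toString), source)
      }
    case Left(err) => Record[U](Left(err), source)
  }

  // Rewrite the error message of a bad record; good records are unchanged.
  def mapLeft(f: String => String): Record[T] = value match {
    case Left(err) => Record[T](Left(f(err)), source)
    case _         => this
  }
}

object RecordDemo {
  // Three lines of a made-up file, numbered from 1.
  val raw: Seq[Record[String]] =
    Seq("1", "x", "3").zipWithIndex.map { case (text, i) =>
      Record[String](Right(text), Source("input.txt", i + 1L))
    }

  // Parsing "x" fails, but only that record is marked bad.
  val parsed: Seq[Record[Int]] = raw.map(_.mapRight(_.toInt))

  // Good values continue down the pipeline; bad records can be logged.
  val good: Seq[Int] = parsed.collect { case Record(Right(n), _) => n }
  val bad: Seq[Record[Int]] = parsed.filterNot(_.isGood)
}
```

Each bad record keeps its Source, so a logging stage can report the file
and line that failed without stopping the rest of the job.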

On Wed, Jul 15, 2015 at 12:36 PM, Reynold Xin r...@databricks.com wrote:

 How about just using two fields, one boolean field to mark good/bad, and
 another to get the source file?


 On Wed, Jul 15, 2015 at 10:31 AM, RJ Nowling rnowl...@gmail.com wrote:

 Hi all,

 I'm working on an ETL task with Spark.  As part of this work, I'd like to
 mark records with some info such as:

 1. Whether the record is good or bad (e.g., Either)
 2. Originating file and lines

 Part of my motivation is to prevent errors with individual records from
 stopping the entire pipeline.  I'd also like to filter out and log bad
 records at various stages.

 I could use RDD[Either[T]] for everything but that won't work for
 DataFrames.  I was wondering if anyone has had a similar situation and if
 they found elegant ways to handle this?

 Thanks,
 RJ





Re: Record metadata with RDDs and DataFrames

2015-07-15 Thread Reynold Xin
Yea - I'd just add a bunch of columns. Doesn't seem like that big of a deal.
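
A rough sketch of that flat layout, assuming one payload column plus the
two metadata fields suggested earlier in the thread. TaggedRow, TaggedRows,
tag, split and the isValid check are hypothetical names. Plain Scala is
used so the sketch runs without Spark; with a DataFrame the two fields
would presumably be ordinary columns (for example added with withColumn).

```scala
// Hypothetical row layout: the payload plus two metadata "columns".
final case class TaggedRow(payload: String, isGood: Boolean, sourceFile: String)

object TaggedRows {
  // Tag every line of one file. `isValid` is a caller-supplied check
  // (an assumption, standing in for whatever validation the ETL job needs).
  def tag(sourceFile: String, lines: Seq[String])(isValid: String => Boolean): Seq[TaggedRow] =
    lines.map(line => TaggedRow(line, isValid(line), sourceFile))

  // Split into (good, bad) so bad rows can be logged and dropped at any stage.
  def split(rows: Seq[TaggedRow]): (Seq[TaggedRow], Seq[TaggedRow]) =
    rows.partition(_.isGood)
}
```

Because the metadata lives in ordinary fields, filtering bad rows is a
plain predicate on isGood rather than a pattern match on Either.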


On Wed, Jul 15, 2015 at 10:53 AM, RJ Nowling rnowl...@gmail.com wrote:

 I'm considering a few approaches -- one of which is to provide new
 functions like mapLeft, mapRight, filterLeft, etc.

 But this all falls short with DataFrames.  RDDs can easily be extended
 from RDD[T] to RDD[Record[T]].  I guess with DataFrames, I could add
 special columns?

 On Wed, Jul 15, 2015 at 12:36 PM, Reynold Xin r...@databricks.com wrote:

 How about just using two fields, one boolean field to mark good/bad, and
 another to get the source file?


 On Wed, Jul 15, 2015 at 10:31 AM, RJ Nowling rnowl...@gmail.com wrote:

 Hi all,

 I'm working on an ETL task with Spark.  As part of this work, I'd like
 to mark records with some info such as:

 1. Whether the record is good or bad (e.g., Either)
 2. Originating file and lines

 Part of my motivation is to prevent errors with individual records from
 stopping the entire pipeline.  I'd also like to filter out and log bad
 records at various stages.

 I could use RDD[Either[T]] for everything but that won't work for
 DataFrames.  I was wondering if anyone has had a similar situation and if
 they found elegant ways to handle this?

 Thanks,
 RJ