Re: Deduping events using Spark

2015-06-04 Thread William Briggs
Hi Lee,

You should be able to create a PairRDD using the nonce as the key and the
AnalyticsEvent as the value. I'm very new to Spark myself, but here is some
uncompilable pseudocode that may or may not help:

events.map(event => (event.getNonce, event)).reduceByKey((a, b) => a).map(_._2)

The above code is more Scala-like, since that's the syntax I'm more familiar
with. The Spark Java 8 API looks similar, but you won't get the implicit
conversion to a PairRDD when you map to a 2-tuple. Instead, you will need to
use the mapToPair function; there's a good example in the Spark Programming
Guide under Working With Key-Value Pairs:
https://spark.apache.org/docs/latest/programming-guide.html#working-with-key-value-pairs
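Outside Spark, the keep-one-per-nonce pattern above can be sketched in plain
Java. This is a hedged illustration, not your actual pipeline: Event is a
hypothetical stand-in for AnalyticsEvent, and the loop plays the role of
reduceByKey((a, b) => a) by keeping the first event seen for each nonce:

```java
import java.util.*;

public class DedupeSketch {
    // Hypothetical stand-in for AnalyticsEvent: a nonce plus a payload.
    record Event(String nonce, String payload) {}

    // Keep the first event seen per nonce -- the same effect as
    // mapToPair(e -> new Tuple2<>(e.getNonce(), e)).reduceByKey((a, b) -> a).values()
    static List<Event> dedupeByNonce(List<Event> events) {
        Map<String, Event> firstPerNonce = new LinkedHashMap<>();
        for (Event e : events) {
            firstPerNonce.putIfAbsent(e.nonce(), e); // later duplicates are ignored
        }
        return new ArrayList<>(firstPerNonce.values());
    }

    public static void main(String[] args) {
        List<Event> events = List.of(
            new Event("n1", "a"), new Event("n2", "b"), new Event("n1", "c"));
        List<Event> unique = dedupeByNonce(events);
        System.out.println(unique.size());           // 2
        System.out.println(unique.get(0).payload()); // a
    }
}
```

In Spark the merge function ((a, b) -> a) runs pairwise and in no guaranteed
order across partitions, so "first" is only deterministic if any event with a
given nonce is an acceptable survivor.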

Hope this helps!

Regards,
Will

On Thu, Jun 4, 2015 at 1:10 PM, lbierman leebier...@gmail.com wrote:



Re: Deduping events using Spark

2015-06-04 Thread Richard Marscher
If you create a bidirectional mapping from AnalyticsEvent to a wrapper type
that uses the nonce for its equality, you could then do something like
reduceByKey to group by nonce and map back to AnalyticsEvent afterward.
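As a rough sketch of that wrapper idea in plain Java (Event is a hypothetical
stand-in for AnalyticsEvent; in Spark the set would be replaced by distinct()
or reduceByKey on the wrapped RDD):

```java
import java.util.*;
import java.util.stream.*;

public class NonceWrapperSketch {
    // Hypothetical stand-in for AnalyticsEvent.
    record Event(String nonce, String payload) {}

    // Wrapper whose equals/hashCode depend only on the nonce, so two
    // different events with the same nonce compare as equal.
    static final class ByNonce {
        final Event event;
        ByNonce(Event event) { this.event = event; }
        @Override public boolean equals(Object o) {
            return o instanceof ByNonce other
                && event.nonce().equals(other.event.nonce());
        }
        @Override public int hashCode() { return event.nonce().hashCode(); }
    }

    // Wrap, deduplicate on nonce equality, unwrap -- the same shape as
    // map(ByNonce::new), then distinct(), then map back in Spark.
    static List<Event> dedupe(List<Event> events) {
        Set<ByNonce> seen = new LinkedHashSet<>();
        for (Event e : events) seen.add(new ByNonce(e)); // duplicates rejected here
        return seen.stream().map(w -> w.event).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Event> events = List.of(
            new Event("n1", "a"), new Event("n1", "b"), new Event("n2", "c"));
        System.out.println(dedupe(events).size()); // 2
    }
}
```

Note that for Spark's distinct() to honor this, the wrapper must also be
serializable, and its hashCode must be consistent across JVMs.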

On Thu, Jun 4, 2015 at 1:10 PM, lbierman leebier...@gmail.com wrote:



Deduping events using Spark

2015-06-04 Thread lbierman
I'm still a bit new to Spark and am struggling to figure out the best way to
dedupe my events.

I load my Avro files from HDFS and then I want to dedupe events that have
the same nonce. 

For example, my code so far:

JavaRDD<AnalyticsEvent> events = ((JavaRDD<AvroKey<AnalyticsEvent>>)
    context.newAPIHadoopRDD(
        context.hadoopConfiguration(),
        AvroKeyInputFormat.class,
        AvroKey.class,
        NullWritable.class
    ).keys())
    .map(event -> AnalyticsEvent.newBuilder(event.datum()).build())
    .filter(key -> Optional.ofNullable(key.getStepEventKey()).isPresent());

Now I want to get back an RDD of AnalyticsEvents that are unique. Basically,
if AnalyticsEvent.getNonce() == AnalyticsEvent2.getNonce(), I only want to
return one of them.

I'm not sure how to do this. If I use reduceByKey, does it reduce by the
AnalyticsEvent object itself rather than by the values inside it?

Any guidance on how I can walk this list of events and return only the events
with unique nonces would be much appreciated.






--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Deduping-events-using-Spark-tp23153.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org