I'm still a bit new to Spark and am struggling to figure out the best way to
dedupe my events.

I load my Avro files from HDFS and then I want to dedupe events that have
the same nonce. 

For example, my code so far:

    JavaRDD<AnalyticsEvent> events = ((JavaRDD<AvroKey<AnalyticsEvent>>)
            context.newAPIHadoopRDD(
                context.hadoopConfiguration(),
                AvroKeyInputFormat.class,
                AvroKey.class,
                NullWritable.class
            ).keys())
            .map(event -> AnalyticsEvent.newBuilder(event.datum()).build())
            .filter(event -> event.getStepEventKey() != null);

Now I want to get back an RDD of AnalyticsEvents that are unique. So
basically: if two events have the same nonce
(AnalyticsEvent1.getNonce().equals(AnalyticsEvent2.getNonce())), I only want
to keep one of them.

I'm not sure how to do this. If I use reduceByKey, doesn't it key on the
whole AnalyticsEvent rather than on the nonce value inside it?
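The closest I've come up with is to map each event to a (nonce, event) pair
first and then reduceByKey on the nonce, something like this (rough,
untested sketch; it assumes getNonce() returns a String and that
AnalyticsEvent is serializable):

    JavaRDD<AnalyticsEvent> uniqueEvents = events
        // key each event by its nonce so reduceByKey groups on the nonce,
        // not on the whole event (Tuple2 here is scala.Tuple2)
        .mapToPair(event -> new Tuple2<>(event.getNonce(), event))
        // arbitrarily keep one event per nonce
        .reduceByKey((first, second) -> first)
        // drop the nonce keys to get back a plain RDD of events
        .values();

Is that the right direction, or is there a better way?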

Any guidance on how I can walk this list of events and return only a
filtered version with unique nonces would be much appreciated.





