Re: Deduping events using Spark
Hi Lee,

You should be able to create a pair RDD using the nonce as the key and the AnalyticsEvent as the value. I'm very new to Spark, but here is some uncompiled pseudocode that may or may not help:

    events.map(event => (event.getNonce, event)).reduceByKey((a, b) => a).map(_._2)

The above is more Scala-like, since that's the syntax with which I have more familiarity. It looks like the Spark Java 8 API is similar, but you won't get an implicit conversion to a pair RDD when you map to a 2-tuple; instead, you will need to use the mapToPair function. There's a good example in the Spark Programming Guide under "Working With Key-Value Pairs": https://spark.apache.org/docs/latest/programming-guide.html#working-with-key-value-pairs

Hope this helps!

Regards,
Will

On Thu, Jun 4, 2015 at 1:10 PM, lbierman leebier...@gmail.com wrote:

> I'm still a bit new to Spark and am struggling to figure out the best way
> to dedupe my events. I load my Avro files from HDFS, and then I want to
> dedupe events that have the same nonce.
>
> For example, my code so far:
>
>     JavaRDD<AnalyticsEvent> events = ((JavaRDD<AvroKey<AnalyticsEvent>>) context.newAPIHadoopRDD(
>             context.hadoopConfiguration(),
>             AvroKeyInputFormat.class,
>             AvroKey.class,
>             NullWritable.class
>         ).keys())
>         .map(event -> AnalyticsEvent.newBuilder(event.datum()).build())
>         .filter(key -> { return Optional.ofNullable(key.getStepEventKey()).isPresent(); });
>
> Now I want to get back an RDD of AnalyticsEvents that are unique. So I
> basically want: if analyticsEvent1.getNonce() == analyticsEvent2.getNonce(),
> only return one of them. I'm not sure how to do this - if I do reduceByKey,
> does it reduce by the AnalyticsEvent object rather than by the nonce inside it?
>
> Any guidance on how I can walk this list of events and return only a
> filtered version with unique nonces would be much appreciated.
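To make the keep-first-per-key semantics of reduceByKey((a, b) => a) concrete without needing a Spark cluster on the classpath, here is a minimal sketch in plain Java. The `Event` record (with just a `nonce` and `payload` field) is a hypothetical stand-in for AnalyticsEvent, and the stream pipeline mirrors the mapToPair-then-reduceByKey shape: extract the nonce as the key, and merge colliding values by keeping the first.

```java
import java.util.*;
import java.util.stream.*;

// Plain-Java sketch of the keep-first-per-key semantics behind
// reduceByKey((a, b) => a). "Event" is a hypothetical stand-in for
// AnalyticsEvent, with just a nonce and a payload.
public class DedupeByNonce {
    record Event(String nonce, String payload) {}

    static List<Event> dedupe(List<Event> events) {
        // Key each event by its nonce; when two events collide on the
        // same key, the merge function keeps the first one -- the same
        // idea as reduceByKey((a, b) -> a) applied per key.
        Map<String, Event> byNonce = events.stream()
            .collect(Collectors.toMap(
                Event::nonce,          // key extractor (the "mapToPair" step)
                e -> e,                // value: the event itself
                (a, b) -> a,           // merge: keep the first occurrence
                LinkedHashMap::new));  // preserve encounter order
        return new ArrayList<>(byNonce.values());
    }

    public static void main(String[] args) {
        List<Event> events = List.of(
            new Event("n1", "first"),
            new Event("n2", "other"),
            new Event("n1", "duplicate"));
        List<Event> unique = dedupe(events);
        System.out.println(unique.size());           // 2
        System.out.println(unique.get(0).payload()); // first
    }
}
```

One caveat when translating this back to Spark: records arrive at the reducer in no guaranteed order, so `(a, b) -> a` keeps an arbitrary duplicate, not necessarily the "first" in any meaningful sense. That is fine for exact duplicates sharing a nonce, but if the copies can differ, the reduce function should pick a winner deterministically (e.g. by timestamp).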
--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Deduping-events-using-Spark-tp23153.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
Re: Deduping events using Spark
I think if you create a wrapper type around AnalyticsEvent that uses the nonce as its equality (i.e. its equals() and hashCode() delegate to the nonce), you could map each event into the wrapper, use something like reduceByKey to group by nonce, and map back to AnalyticsEvent afterwards.

On Thu, Jun 4, 2015 at 1:10 PM, lbierman leebier...@gmail.com wrote:

> I'm still a bit new to Spark and am struggling to figure out the best way
> to dedupe my events. I load my Avro files from HDFS, and then I want to
> dedupe events that have the same nonce. [...]
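The wrapper suggestion above can be sketched in plain Java as follows. `NonceKey` is a hypothetical class (not from the original thread) whose equals() and hashCode() look only at the wrapped event's nonce, so a set of wrappers deduplicates by nonce; in actual Spark code the wrapper would also need to be Serializable, and you would map into it, call distinct() (or reduceByKey), and map back out.

```java
import java.util.*;

// Sketch of the wrapper idea: equality defined purely by the nonce.
// "Event" and "NonceKey" are hypothetical stand-ins for illustration.
public class WrapperDedupe {
    record Event(String nonce, String payload) {}

    // Wrapper whose equals/hashCode delegate to the event's nonce, so
    // two wrappers compare equal iff their events share a nonce,
    // regardless of any other fields.
    static final class NonceKey {
        final Event event;
        NonceKey(Event event) { this.event = event; }
        @Override public boolean equals(Object o) {
            return o instanceof NonceKey k && k.event.nonce().equals(event.nonce());
        }
        @Override public int hashCode() { return event.nonce().hashCode(); }
    }

    static List<Event> dedupe(List<Event> events) {
        // The set plays the role of distinct(): the first event seen
        // per nonce is kept, later duplicates are rejected by add().
        Set<NonceKey> seen = new LinkedHashSet<>();
        for (Event e : events) seen.add(new NonceKey(e));
        List<Event> out = new ArrayList<>();
        for (NonceKey k : seen) out.add(k.event);  // unwrap back to events
        return out;
    }

    public static void main(String[] args) {
        List<Event> unique = dedupe(List.of(
            new Event("n1", "first"),
            new Event("n1", "duplicate"),
            new Event("n2", "other")));
        System.out.println(unique.size()); // 2
    }
}
```

The appeal of this design is that the dedupe criterion lives in one place (the wrapper's equality) rather than being repeated at every call site that needs per-nonce uniqueness.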
Deduping events using Spark
I'm still a bit new to Spark and am struggling to figure out the best way to dedupe my events. I load my Avro files from HDFS, and then I want to dedupe events that have the same nonce.

For example, my code so far:

    JavaRDD<AnalyticsEvent> events = ((JavaRDD<AvroKey<AnalyticsEvent>>) context.newAPIHadoopRDD(
            context.hadoopConfiguration(),
            AvroKeyInputFormat.class,
            AvroKey.class,
            NullWritable.class
        ).keys())
        .map(event -> AnalyticsEvent.newBuilder(event.datum()).build())
        .filter(key -> { return Optional.ofNullable(key.getStepEventKey()).isPresent(); });

Now I want to get back an RDD of AnalyticsEvents that are unique. So I basically want: if analyticsEvent1.getNonce() == analyticsEvent2.getNonce(), only return one of them. I'm not sure how to do this - if I do reduceByKey, does it reduce by the AnalyticsEvent object rather than by the nonce inside it?

Any guidance on how I can walk this list of events and return only a filtered version with unique nonces would be much appreciated.