ReduceByKey with a byte array as the key
I would like to work with RDD pairs of Tuple2&lt;byte[], obj&gt;, but byte[]s with the same contents are considered different keys because their reference values are different. I didn't see any way to pass in a custom comparer. I could convert the byte[] into a String with an explicit charset, but I'm wondering if there's a more efficient way.

Also posted on SO: http://stackoverflow.com/q/30785615/2687324

Thanks,
Mark
Re: ReduceByKey with a byte array as the key
I think if you wrap the byte[] into an object and implement equals and hashCode methods, you may be able to do this. There will be the overhead of an extra object, but conceptually it should work unless I am missing something.

Best Regards,
Sonal
Founder, Nube Technologies
http://www.nubetech.co
Check out Reifier at Spark Summit 2015: https://spark-summit.org/2015/events/real-time-fuzzy-matching-with-spark-and-elastic-search/
http://in.linkedin.com/in/sonalgoyal
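The wrapper Sonal describes might look something like the sketch below. The class name `ByteArrayKey` is illustrative; the key points are delegating to `Arrays.equals`/`Arrays.hashCode` for content-based comparison, and implementing `Serializable` so the key can be shuffled by Spark.

```java
import java.io.Serializable;
import java.util.Arrays;

// Hypothetical wrapper giving a byte[] value-based equality semantics,
// so it can serve as a key in reduceByKey and similar operations.
public final class ByteArrayKey implements Serializable {
    private final byte[] bytes;

    public ByteArrayKey(byte[] bytes) {
        this.bytes = bytes;
    }

    public byte[] bytes() {
        return bytes;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof ByteArrayKey)) return false;
        // Compare contents, not references.
        return Arrays.equals(bytes, ((ByteArrayKey) o).bytes);
    }

    @Override
    public int hashCode() {
        // Content-based hash, consistent with equals above.
        return Arrays.hashCode(bytes);
    }

    public static void main(String[] args) {
        ByteArrayKey a = new ByteArrayKey(new byte[]{1, 2, 3});
        ByteArrayKey b = new ByteArrayKey(new byte[]{1, 2, 3});
        System.out.println(a.equals(b));                 // prints true
        System.out.println(a.hashCode() == b.hashCode()); // prints true
    }
}
```

You would then map each `Tuple2<byte[], obj>` to `Tuple2<ByteArrayKey, obj>` before calling reduceByKey, and unwrap afterwards if needed.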
RE: ReduceByKey with a byte array as the key
Makes sense – I suspect what you suggested should work. However, I think the overhead between this and using `String` would be similar enough to warrant just using `String`.

Mark
RE: ReduceByKey with a byte array as the key
Be careful shoving arbitrary binary data into a String: invalid UTF characters can cause significant computational overhead in my experience.
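Beyond the overhead concern, a String round-trip can silently corrupt the key. A small demonstration of the pitfall, assuming the conversion goes through `new String(bytes, charset)`: UTF-8 decoding replaces invalid byte sequences with U+FFFD, so two distinct byte arrays can collapse to the same key, whereas ISO-8859-1 maps all 256 byte values one-to-one and round-trips losslessly.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class CharsetRoundTrip {
    public static void main(String[] args) {
        // 0xFF and 0xFE are never valid in UTF-8.
        byte[] raw = {(byte) 0xFF, (byte) 0xFE, 0x41};

        // UTF-8 decode replaces the invalid bytes with U+FFFD,
        // so the round trip does NOT reproduce the original bytes.
        byte[] viaUtf8 = new String(raw, StandardCharsets.UTF_8)
                .getBytes(StandardCharsets.UTF_8);

        // ISO-8859-1 maps every byte value to a code point 1:1,
        // so the round trip is lossless.
        byte[] viaLatin1 = new String(raw, StandardCharsets.ISO_8859_1)
                .getBytes(StandardCharsets.ISO_8859_1);

        System.out.println(Arrays.equals(raw, viaUtf8));   // prints false
        System.out.println(Arrays.equals(raw, viaLatin1)); // prints true
    }
}
```

So if the String approach is used at all, an explicit single-byte charset like ISO-8859-1 (rather than UTF-8) would be the safer choice for arbitrary binary keys.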