RE: ReduceByKey with a byte array as the key

Aaron Davidson Thu, 11 Jun 2015 10:13:59 -0700

Be careful shoving arbitrary binary data into a string; in my experience,
invalid UTF characters can cause significant computational overhead.
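
Separately from the overhead mentioned above, decoding arbitrary bytes as UTF-8 is also lossy, which matters when the resulting String is used as a key. The following is a minimal sketch, not something from this thread: the class and method names (ByteKeyStrings, toLatin1Key, fromLatin1Key, toUtf8Key) are hypothetical. It assumes only the documented JDK behavior that `new String(byte[], Charset)` replaces malformed input with the charset's replacement string, and that ISO-8859-1 maps every byte value to exactly one char.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper; none of these names appear in the thread.
final class ByteKeyStrings {
    private ByteKeyStrings() {}

    // ISO-8859-1 maps each byte 0x00..0xFF to one char, so distinct
    // byte arrays always produce distinct strings.
    static String toLatin1Key(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    // Inverse of toLatin1Key: recovers the original bytes exactly.
    static byte[] fromLatin1Key(String key) {
        return key.getBytes(StandardCharsets.ISO_8859_1);
    }

    // UTF-8 decoding replaces malformed sequences with U+FFFD, so
    // different invalid inputs can collapse into the same string.
    static String toUtf8Key(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }
}
```

With this sketch, the single-byte arrays {0xFF} and {0xFE} decode to the same UTF-8 string but to different ISO-8859-1 strings, so only the latter is safe as a grouping key for arbitrary binary data.
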
On Jun 11, 2015 10:09 AM, "Mark Tse" <mark....@d2l.com> wrote:


>  Makes sense – I suspect what you suggested should work.
>
>
>
> However, I think the overhead of this approach and of using `String` would
> be similar enough to warrant just using `String`.
>
>
>
> Mark
>
>
>
> *From:* Sonal Goyal [mailto:sonalgoy...@gmail.com]
> *Sent:* June-11-15 12:58 PM
> *To:* Mark Tse
> *Cc:* user@spark.apache.org
> *Subject:* Re: ReduceByKey with a byte array as the key
>
>
>
> I think if you wrap the byte[] in an object and implement the equals and
> hashCode methods, you may be able to do this. There will be the overhead of
> an extra object, but conceptually it should work unless I am missing
> something.
>
>
> Best Regards,
> Sonal
> Founder, Nube Technologies <http://www.nubetech.co>
>
> Check out Reifier at Spark Summit 2015
> <https://spark-summit.org/2015/events/real-time-fuzzy-matching-with-spark-and-elastic-search/>
>
>
>
>
>
> On Thu, Jun 11, 2015 at 9:27 PM, Mark Tse <mark....@d2l.com> wrote:
>
> I would like to work with RDD pairs of Tuple2<byte[], obj>, but byte[]s
> with the same contents are treated as different values because arrays are
> compared by reference.
>
>
>
> I didn't see any way to pass in a custom comparer. I could convert the
> byte[] into a String with an explicit charset, but I'm wondering if there's
> a more efficient way.
>
>
>
> Also posted on SO: http://stackoverflow.com/q/30785615/2687324
>
>
>
> Thanks,
>
> Mark
>
>
>
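
The wrapper approach suggested in this thread (wrap the byte[] in an object that implements equals and hashCode) can be sketched roughly as follows. This is an illustration under assumptions, not code posted by anyone in the thread: the class name ByteArrayKey and its toBytes method are hypothetical, and the class is made Serializable on the assumption that keys must be serialized when they are shuffled.

```java
import java.io.Serializable;
import java.util.Arrays;

// Hypothetical wrapper; the name ByteArrayKey is not from the thread.
final class ByteArrayKey implements Serializable {
    private static final long serialVersionUID = 1L;

    private final byte[] bytes;

    ByteArrayKey(byte[] bytes) {
        // Defensive copy so later changes to the caller's array
        // cannot alter the key after it has been hashed.
        this.bytes = bytes.clone();
    }

    // Returns a copy of the wrapped bytes.
    byte[] toBytes() {
        return bytes.clone();
    }

    @Override
    public boolean equals(Object other) {
        // Compare by content rather than by array reference.
        return other instanceof ByteArrayKey
                && Arrays.equals(bytes, ((ByteArrayKey) other).bytes);
    }

    @Override
    public int hashCode() {
        // Content-based hash, consistent with equals.
        return Arrays.hashCode(bytes);
    }
}
```

In a Spark job, one would presumably map each Tuple2<byte[], obj> to a Tuple2<ByteArrayKey, obj> before calling reduceByKey, and call toBytes() afterwards if the raw array is needed again. The defensive copies cost an extra allocation per key; dropping them is possible if the arrays are never mutated.
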
