ReduceByKey with a byte array as the key
I would like to work with RDD pairs of Tuple2&lt;byte[], obj&gt;, but byte[]s with the same contents are considered different keys because their reference values are different. I didn't see any way to pass in a custom comparer. I could convert the byte[] into a String with an explicit charset, but I'm wondering if there's a more efficient way.

Also posted on SO: http://stackoverflow.com/q/30785615/2687324

Thanks,
Mark
Re: ReduceByKey with a byte array as the key
I think if you wrap the byte[] into an object and implement equals and hashCode methods, you may be able to do this. There will be the overhead of an extra object, but conceptually it should work unless I am missing something.

Best Regards,
Sonal
Founder, Nube Technologies
http://www.nubetech.co
Check out Reifier at Spark Summit 2015: https://spark-summit.org/2015/events/real-time-fuzzy-matching-with-spark-and-elastic-search/
http://in.linkedin.com/in/sonalgoyal
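The wrapper Sonal describes might look something like the sketch below. The class name `ByteArrayKey` is illustrative; the key points are delegating to `Arrays.equals`/`Arrays.hashCode` for content-based comparison, and implementing `Serializable` so the key can be shuffled by Spark.

```java
import java.io.Serializable;
import java.util.Arrays;

// Hypothetical wrapper giving a byte[] value-based equality semantics,
// so it can serve as a key in reduceByKey and similar operations.
public final class ByteArrayKey implements Serializable {
    private final byte[] bytes;

    public ByteArrayKey(byte[] bytes) {
        this.bytes = bytes;
    }

    public byte[] bytes() {
        return bytes;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof ByteArrayKey)) return false;
        // Compare contents, not references.
        return Arrays.equals(bytes, ((ByteArrayKey) o).bytes);
    }

    @Override
    public int hashCode() {
        // Content-based hash, consistent with equals above.
        return Arrays.hashCode(bytes);
    }

    public static void main(String[] args) {
        ByteArrayKey a = new ByteArrayKey(new byte[]{1, 2, 3});
        ByteArrayKey b = new ByteArrayKey(new byte[]{1, 2, 3});
        System.out.println(a.equals(b));                 // prints true
        System.out.println(a.hashCode() == b.hashCode()); // prints true
    }
}
```

You would then map each `Tuple2<byte[], obj>` to `Tuple2<ByteArrayKey, obj>` before calling reduceByKey, and unwrap afterwards if needed.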
RE: ReduceByKey with a byte array as the key
Makes sense – I suspect what you suggested should work. However, I think the overhead between this and using `String` would be similar enough to warrant just using `String`.

Mark
RE: ReduceByKey with a byte array as the key
Be careful shoving arbitrary binary data into a String: invalid UTF characters can cause significant computational overhead in my experience.
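Beyond the overhead concern, a String round-trip can silently corrupt the key. A small demonstration of the pitfall, assuming the conversion goes through `new String(bytes, charset)`: UTF-8 decoding replaces invalid byte sequences with U+FFFD, so two distinct byte arrays can collapse to the same key, whereas ISO-8859-1 maps all 256 byte values one-to-one and round-trips losslessly.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class CharsetRoundTrip {
    public static void main(String[] args) {
        // 0xFF and 0xFE are never valid in UTF-8.
        byte[] raw = {(byte) 0xFF, (byte) 0xFE, 0x41};

        // UTF-8 decode replaces the invalid bytes with U+FFFD,
        // so the round trip does NOT reproduce the original bytes.
        byte[] viaUtf8 = new String(raw, StandardCharsets.UTF_8)
                .getBytes(StandardCharsets.UTF_8);

        // ISO-8859-1 maps every byte value to a code point 1:1,
        // so the round trip is lossless.
        byte[] viaLatin1 = new String(raw, StandardCharsets.ISO_8859_1)
                .getBytes(StandardCharsets.ISO_8859_1);

        System.out.println(Arrays.equals(raw, viaUtf8));   // prints false
        System.out.println(Arrays.equals(raw, viaLatin1)); // prints true
    }
}
```

So if the String approach is used at all, an explicit single-byte charset like ISO-8859-1 (rather than UTF-8) would be the safer choice for arbitrary binary keys.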