Hi,

I'm playing with Cassandra 0.7.0 and Hadoop, developing simple MapReduce
tasks. While developing a really simple MR task, I found that the
combination of Hadoop's object reuse optimization and the Cassandra
ColumnFamilyRecordWriter queue produces wrong keys in the calls to
batch_mutate(). The problem is in the reduce part: the storage behind
the key parameter is reused. For example, when storing URLs I get:

(1) http://119.cz/index.php/vypalovaci-mechaniky-a-vypalovani-disk/120-jak-zjistit-verzi-firmwaru-vypalovaky-ve-windows-vista
(2) http://11superstars.xf.cz/index.php?page=12y-a-vypalovani-disk/120-jak-zjistit-verzi-firmwaru-vypalovaky-ve-windows-vista
(3) http://12kmenu.unas.cz/18-6-2011-(Isachar).htmlvypalovani-disk/120-jak-zjistit-verzi-firmwaru-vypalovaky-ve-windows-vista

You can see that part of URL (1) repeats in URL (2) and URL (3).
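
To illustrate what I think happens, here is a minimal, self-contained
sketch (the record writer's queue effectively holds a view like this
rather than a copy; the URLs are made up):

import java.nio.ByteBuffer;
import java.nio.charset.Charset;

import org.apache.hadoop.io.Text;

// Simplified model of the bug: the queue ends up holding a view over
// the reduce key's storage instead of a private copy.
public class ReusedKeyDemo {
    public static void main(String[] args) throws Exception {
        Text key = new Text("http://119.cz/some-very-long-url-with-a-long-tail");

        // What effectively sits in the mutation queue: a ByteBuffer
        // that wraps the Text's backing array.
        ByteBuffer queued = ByteBuffer.wrap(key.getBytes(), 0, key.getLength());

        // The next reduce() call puts a shorter key into the very same
        // array (Text keeps its backing array when it is big enough,
        // which is also what happens when the framework deserializes
        // the next key).
        byte[] next = "http://short.example/".getBytes("UTF-8");
        key.set(next, 0, next.length);

        // By the time the queue is flushed, the "key" is the new URL
        // followed by the tail of the old one -- mixed, as shown above.
        System.out.println(Charset.forName("UTF-8").decode(queued));
    }
}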

I've changed my reduce method to clone the key before calling
context.write(), but I think the cloning should happen inside the
Cassandra ColumnFamilyRecordWriter: as a user I shouldn't have to care
how it is implemented internally, I just write values to it. With
FileOutputFormat, for example, I don't need to clone the key when
writing to it.
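
For reference, the workaround in my reducer looks roughly like this
(the reducer signature and the mutation-building helper follow the
0.7 word-count example and are illustrative, not my real job):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.Collections;
import java.util.List;

import org.apache.cassandra.thrift.Column;
import org.apache.cassandra.thrift.ColumnOrSuperColumn;
import org.apache.cassandra.thrift.Mutation;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class UrlReducer
        extends Reducer<Text, IntWritable, ByteBuffer, List<Mutation>> {

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values)
            sum += v.get();

        // The workaround: copy the key bytes into a buffer we own,
        // because Hadoop will overwrite the Text's storage on the next
        // reduce() call while ColumnFamilyRecordWriter still holds it.
        ByteBuffer rowKey = ByteBuffer.allocate(key.getLength());
        rowKey.put(key.getBytes(), 0, key.getLength());
        rowKey.flip();

        context.write(rowKey, Collections.singletonList(countMutation(sum)));
    }

    // Builds a single "count" column mutation; the column name and
    // layout here are illustrative only.
    private static Mutation countMutation(int count) throws IOException {
        Column col = new Column();
        col.setName(ByteBuffer.wrap("count".getBytes("UTF-8")));
        col.setValue(ByteBuffer.wrap(Integer.toString(count).getBytes("UTF-8")));
        col.setTimestamp(System.currentTimeMillis());
        ColumnOrSuperColumn cosc = new ColumnOrSuperColumn();
        cosc.setColumn(col);
        Mutation m = new Mutation();
        m.setColumn_or_supercolumn(cosc);
        return m;
    }
}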

I'd like to know your opinion.

Best regards,
Patrik
