Re: GSoC: Avro Serialization over HBase

Mihai Soloi Mon, 11 Jun 2012 11:02:12 -0700

On 11.06.2012 20:49, Ioan Eugen Stan wrote:

Hi Mihai,


After a quick look...

2012/6/11 Mihai Soloi <mihai.so...@gmail.com>:

Hello Eugen and everybody on the list,

I've completed my exams and have also done some work on the project. Lately
I've been reading up on the HBase API and Avro API specifications[1] so that
I can get to know them better.

If you need to store Avro objects (basically, arrays of bytes) in HBase,
then you need to store a schema with the data, for example in the header of
the file, so that you can still read the data later if the schema changes
radically over time. Of course, Avro does support some extension and
modification of its schemas: if you look at my test code[0] you'll see that
I was able to extend an existing schema and show that it remains backward
compatible. I followed Boris Lublinsky's article[4] on using Avro to get
more familiar with it.

Great. It's nice to experiment.

I've run into a situation in which I want to store my data in HBase through
Avro (because of its lower memory use, structured format and HBase
integration). I see that the "org.apache.hadoop.hbase.avro" package contains
classes such as AvroServer, which starts up a server through which all sorts
of clients can interact with the data store, as well as generated classes
(e.g. AColumnValues, APut, AGet, etc.). As far as I can tell, these classes
are used to translate requests to the server into HBase Puts and Gets, with
the help of AvroUtils, but I don't know if this is the way to go.

AvroServer is deprecated in 0.94 and scheduled to be removed in 0.96
(https://issues.apache.org/jira/browse/HBASE-5948). AvroServer
provides an RPC service that uses Avro instead of Writables.

Serialization = saving an object to disk/file/network so that it can be
loaded back into memory in the same form (deserialization). We need to
serialize/deserialize a Lucene index into HBase in an efficient way
(we care about indexing speed, search speed and how much disk/RAM it's
going to cost us).

Please read:
http://stackoverflow.com/questions/2486721/what-is-a-data-serialization-system
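
To make the definition above concrete, here is a minimal sketch of a
serialize/deserialize round trip for one postings entry (a term plus the ids
of the documents that contain it). It deliberately uses only java.io rather
than Avro, and the class PostingsRoundTrip and its byte layout are
hypothetical illustrations, not anything taken from Lucene, HBase or the
project code. The resulting byte array is the kind of value that could be
stored in a single HBase cell.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;

// Hypothetical helper for illustration only; not part of Lucene, HBase or Avro.
class PostingsRoundTrip {

    // Serialize a term and the ids of the documents containing it.
    static byte[] serialize(String term, int[] docIds) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(term);          // 2-byte length prefix + modified UTF-8
            out.writeInt(docIds.length); // number of document ids that follow
            for (int docId : docIds) {
                out.writeInt(docId);     // 4 bytes per document id
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Deserialize only the term from the stored bytes.
    static String readTerm(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            return in.readUTF();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Deserialize the document ids, in the order they were written.
    static int[] readDocIds(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            in.readUTF(); // skip over the term
            int[] docIds = new int[in.readInt()];
            for (int i = 0; i < docIds.length; i++) {
                docIds[i] = in.readInt();
            }
            return docIds;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] cell = serialize("hbase", new int[] {3, 17, 42});
        // prints: hbase -> [3, 17, 42] (23 bytes)
        System.out.println(readTerm(cell) + " -> " + Arrays.toString(readDocIds(cell))
                + " (" + cell.length + " bytes)");
    }
}
```

A fixed layout like this is compact, but the reader has to know the layout
in advance; that is the gap a schema-based format such as Avro is meant to
fill.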

Another thing I've been considering is using Sam Pullara's HAvroBase
implementation[2] and code on github[3]. Sam proposes storing only a hash
code of the schema with the data and storing the schemas themselves
separately. HAvroBase is much more than I would need, as it also supports
MySQL, MongoDB, etc., so I could use only the storage part for the Lucene
IndexWriter.
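
The idea of keeping only a hash of the schema next to each value can be
sketched roughly as follows. This is not HAvroBase's actual code or storage
layout: the class SchemaFingerprintStore, the choice of MD5 and the
in-memory map standing in for a separate schema table are all assumptions
made for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; names and layout are assumptions, not HAvroBase's API.
class SchemaFingerprintStore {

    static final int FINGERPRINT_LENGTH = 16; // size of an MD5 digest in bytes

    // Stand-in for a separate schema table: hex fingerprint -> schema JSON.
    private final Map<String, String> schemas = new HashMap<>();

    // Hash the schema text; MD5 is an arbitrary choice for this sketch.
    static byte[] fingerprint(String schemaJson) {
        try {
            return MessageDigest.getInstance("MD5")
                    .digest(schemaJson.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Register the schema once and return fingerprint + payload for storage.
    byte[] wrap(String schemaJson, byte[] payload) {
        byte[] fp = fingerprint(schemaJson);
        schemas.put(hex(fp), schemaJson);
        return ByteBuffer.allocate(fp.length + payload.length)
                .put(fp)
                .put(payload)
                .array();
    }

    // Look up the writer's schema from the fingerprint prefix of a stored value.
    String schemaOf(byte[] stored) {
        return schemas.get(hex(Arrays.copyOfRange(stored, 0, FINGERPRINT_LENGTH)));
    }

    // Strip the fingerprint prefix and return the serialized payload.
    static byte[] payloadOf(byte[] stored) {
        return Arrays.copyOfRange(stored, FINGERPRINT_LENGTH, stored.length);
    }

    private static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        SchemaFingerprintStore store = new SchemaFingerprintStore();
        String schema = "{\"type\":\"string\"}";
        byte[] stored = store.wrap(schema, new byte[] {1, 2, 3});
        System.out.println(stored.length + " bytes stored, schema: " + store.schemaOf(stored));
    }
}
```

Each stored value pays a fixed 16-byte overhead instead of carrying the full
schema text, at the cost of one extra lookup when reading.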

I think HAvroBase does a bit more than what we need. It's a bit
generic and I think we can do without adding it as a dependency. The
Lucene index format is not likely to change that much.

Another way to go is to assume that the object schemas will never change
and just store the data as it is. This is dangerous because if there is a
change, we would have to change code instead of a simple JSON schema.

The way Lucene stores the postings list is pretty standard and will
probably not change that much. I think using Avro is enough.

[0]
http://code.google.com/a/apache-extras.org/p/mailbox-lucene-index-hbase/source/browse/LuceneTest/src/test/java/org/apache/james/mailbox/lucene/avro/AvroInheritanceTest.java
[1] http://avro.apache.org/docs/current/spec.html
[2]
http://www.javarants.com/2010/06/30/havrobase-a-searchable-evolvable-entity-store-on-top-of-hbase-and-solr/
[3] https://github.com/spullara/havrobase
[4]
http://www.infoq.com/articles/ApacheAvro;jsessionid=6A801F1882512F455322B572F4B69E24



