True.
What do you intend to store in Avro format (these bytes being retrieved by any means on the RPC side)?
Thx, Eric

On 06/12/2012 02:14 PM, Ioan Eugen Stan wrote:
Hi,

From what I know, the Avro deprecation concerns RPC communication: the
Put/Delete/etc. operations were serialized with Avro instead of the
usual Writables. Regardless of which serialization the RPC sub-system
uses, the data stored by the operations (Put/Get/Delete) is treated as
a byte array. If we store Avro objects as binary blobs in HBase then we
have no issues.
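To make the binary-blob idea concrete, here is a hypothetical Avro schema for a mailbox message (the namespace and field names are made up for illustration); a record serialized from such a schema is what would land in the HBase cell as an opaque byte array, untouched by whatever the RPC layer uses:

```json
{
  "namespace": "org.apache.james.mailbox.store",
  "type": "record",
  "name": "Message",
  "fields": [
    {"name": "mailboxId", "type": "long"},
    {"name": "uid", "type": "long"},
    {"name": "flags", "type": {"type": "array", "items": "string"}},
    {"name": "body", "type": "bytes"}
  ]
}
```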

Cheers,

2012/6/12 Mihai Soloi <mihai.so...@gmail.com>:
On 12.06.2012 11:30, Eric Charles wrote:

Hi Mihai,

Glad to hear your exams are over (I hope they went fine) :)

Hi Eric,

Thanks, they went very well, I got high marks.


As Ioan said, Avro serialization in HBase will be deprecated in favor of
Protobuf (if I understand correctly...).


I think Avro could be swapped for Protobuf rather easily, as they both do
basically the same thing, except that Avro uses JSON schemas and can be
used from any other language, which is of no value to the project.


I also like Avro because it gives you a serialization & storage format in
one box, but is this what we want? The key point here is rather effective
access to the persisted data.


If the data is passed through Avro we'll have it serialized, and
deserialization is basically handled by Avro, but we'll always have to
interact with the schemas. In Protobuf we have the objects compiled into our
classes; from what I gather it's mostly useful for RPC, but Avro also has
its protocol support, where by using the avro-maven-plugin you can generate
your own classes with which to interact. I can't say I'm an expert in either,
but I fancy Avro.
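For reference, a minimal avro-maven-plugin configuration that generates the Java classes from .avsc schemas at build time (the source and output directories are hypothetical, to be adapted to the project layout):

```xml
<plugin>
  <groupId>org.apache.avro</groupId>
  <artifactId>avro-maven-plugin</artifactId>
  <executions>
    <execution>
      <phase>generate-sources</phase>
      <goals>
        <goal>schema</goal>
      </goals>
      <configuration>
        <sourceDirectory>${project.basedir}/src/main/avro/</sourceDirectory>
        <outputDirectory>${project.basedir}/target/generated-sources/</outputDirectory>
      </configuration>
    </execution>
  </executions>
</plugin>
```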



There have been a few attempts so far to marry HBase and Lucene (see [1],
[2], [3] and [4] for example; see also [5] for a more recent article).

Thank you for the GitHub links, I will look thoroughly through the
projects. I was already aware of HBasene and Solandra (formerly Lucandra);
they have similar approaches.

The questions I am wondering:

1. Will you focus on a 'generic' solution (reusable outside James), or on
a very specific one tuned/optimized only for James mailbox needs?

I was thinking of writing generic code so that it could perhaps be used
outside of James, but the data format would be specific to James mailbox
needs, so the answer in the end is that it will be tuned for James.


2. What strategy will you take (custom Directory or custom
IndexReader/Writer, usage of Coprocessor or not...)?

I was thinking that a custom Directory was the way to go, but I soon
realized that it's not as simple as it sounds, and that overriding the
higher-level classes IndexReader and IndexWriter would be more appropriate
(as in article [5]). By bypassing the Directory I would have to make use of
HBase Coprocessors. As far as I can tell, a RegionObserver would be
employed to track operations frequently performed on the data for the
Lucene queries, together with Endpoints.
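A rough sketch of the RegionObserver idea, assuming the HBase 0.92 coprocessor API (the hook signature is from that framework; the LuceneIndexer class and its index() method are hypothetical placeholders, not existing code):

```java
// Sketch only: requires the HBase 0.92 coprocessor API on the classpath.
public class MailboxIndexObserver extends BaseRegionObserver {

  @Override
  public void postPut(ObserverContext<RegionCoprocessorEnvironment> ctx,
                      Put put, WALEdit edit, boolean writeToWAL)
      throws IOException {
    // Feed every stored message to the Lucene index as it is written,
    // so the index stays in sync with the mailbox table.
    LuceneIndexer.index(put.getRow(), put.getFamilyMap());
  }
}
```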



[1] https://github.com/akkumar/hbasene
[2] https://github.com/thkoch2001/lucehbase
[3] https://github.com/jasonrutherglen/HBASE-SEARCH
[4] https://github.com/jasonrutherglen/LUCENE-FOR-HBASE
[5] http://www.infoq.com/articles/LuceneHbase


---------------------------------------------------------------------
To unsubscribe, e-mail: server-dev-unsubscr...@james.apache.org
For additional commands, e-mail: server-dev-h...@james.apache.org





--
eric | http://about.echarles.net | @echarles
