Consider talking to Doug Cutting. He is exploring a variant of JSON, and I am sure he would welcome your help. Specifically, he is looking at an encoding scheme that is easy to read, does not repeat key names in every record, and supports file splits.
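
To make those three properties concrete, here is a rough Python sketch of one way such a file could look. It is not Doug's design, which I have not seen: the header line, the `#SYNC` marker, and the names `write_records` and `read_split` are all assumptions made up for illustration. Field names are written once in a header, each record is a JSON array on its own line, and periodic marker lines let a reader that starts at an arbitrary byte offset find the next block boundary.

```python
import io
import json

SYNC = b"#SYNC"  # hypothetical marker; a record line always starts with "["


def write_records(out, fields, records, sync_every=2):
    """Write the field names once, then one JSON array per record."""
    out.write(json.dumps(fields).encode("utf-8") + b"\n")
    for i, rec in enumerate(records):
        if i % sync_every == 0:
            out.write(SYNC + b"\n")  # start of a new block
        out.write(json.dumps([rec[f] for f in fields]).encode("utf-8") + b"\n")


def read_split(f, start, end):
    """Yield records from blocks whose sync marker begins in [start, end)."""
    f.seek(0)
    fields = json.loads(f.readline())
    header_end = f.tell()
    if start <= header_end:
        f.seek(header_end)
    else:
        f.seek(start - 1)
        f.readline()  # discard the line that straddles `start`
    in_block = False
    while True:
        pos = f.tell()
        line = f.readline()
        if not line:
            break
        if line.rstrip(b"\n") == SYNC:
            if pos >= end:
                break  # this block belongs to the next split
            in_block = True
        elif in_block:
            yield dict(zip(fields, json.loads(line)))


# Two readers covering adjacent byte ranges see every record exactly once.
records = [{"id": i, "name": "n%d" % i} for i in range(5)]
buf = io.BytesIO()
write_records(buf, ["id", "name"], records)
size = len(buf.getvalue())
first_half = list(read_split(buf, 0, size // 2))
second_half = list(read_split(buf, size // 2, size))
```

The file stays readable in a text editor, and appending is just writing more lines. A reader skips records until it sees its first marker, and reads past its end offset until the next marker, so each block is handled by exactly one split.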

C
On Nov 1, 2008, at 8:20 PM, Zhou, Yunqing wrote:

An embedded database cannot handle large-scale data and is not very efficient.
I have about 1 billion records.
These records have to be passed through several modules.
What I mean is a data exchange format similar to XML, but more flexible and
more efficient.

On Sun, Nov 2, 2008 at 10:49 AM, lamfeeling <[EMAIL PROTECTED]> wrote:

Have you considered an embedded database? Berkeley DB is written in C and has
interfaces for many languages.

On 2008-11-02 10:15:22, "Zhou, Yunqing" <[EMAIL PROTECTED]> wrote:
The project I am working on has many modules written in different languages
(several modules are Hadoop jobs).
So I'd like to use a common record-based data file format for data
exchange.
XML is not efficient for appending new records.
SequenceFile does not seem to have an API for languages other than Java.
The Hadoop API for Protocol Buffers seems to be under development.
Any recommendations?

Thanks

