Re: Can anyone recommend me a inter-language data file format?

2008-11-03 Thread Chris Dyer
I've been using protocol buffers to serialize the data and then base64-encoding the result so that I can treat it as text. This obviously isn't optimal, but I'm assuming it is only a short-term solution that won't be necessary once non-Java clients become first-class citizens of the
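
A minimal sketch of the base64 workaround described above, assuming each record has already been serialized to bytes. The byte strings below are stand-ins, not real Protocol Buffers output, and the helper names `encode_records` and `decode_lines` are hypothetical. Because base64 output never contains a newline or tab, each record can travel as one line of text.

```python
import base64

def encode_records(records):
    """Return one base64 text line per binary record."""
    return [base64.b64encode(r).decode("ascii") for r in records]

def decode_lines(lines):
    """Recover the original binary records from base64 text lines."""
    return [base64.b64decode(line) for line in lines]

# Stand-in payloads; in practice these would be serialized messages.
records = [b"\x08\x96\x01", b"line\nbreak\tinside"]
lines = encode_records(records)

# No encoded line contains a record or field separator.
assert all("\n" not in line and "\t" not in line for line in lines)
assert decode_lines(lines) == records
```

The cost is size: base64 emits four characters for every three input bytes, which is one reason this is only a stopgap.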

Re: Can anyone recommend me a inter-language data file format?

2008-11-03 Thread Pete Wyckoff
Protocol buffers, thrift?
On 11/3/08 4:07 AM, "Steve Loughran" <[EMAIL PROTECTED]> wrote:
Zhou, Yunqing wrote:
> embedded database cannot handle large-scale data, not very efficient
> I have about 1 billion records.
> these records should be passed through some modules.
> I mean a data exchange

Re: Can anyone recommend me a inter-language data file format?

2008-11-03 Thread Steve Loughran
Zhou, Yunqing wrote:
> embedded database cannot handle large-scale data, not very efficient
> I have about 1 billion records.
> these records should be passed through some modules.
> I mean a data exchange format similar to XML but more flexible and efficient.
JSON
CSV
erlang-style records (name,value

Re: Can anyone recommend me a inter-language data file format?

2008-11-02 Thread Alex Loddengaard
Protocol Buffers are tricky in Hadoop. See here: Basically, Protocol Buffers aren't self-delimiting, and because the stream passed to the deserializer has trailing bits, the Protocol Buffer fails to deserialize. I've been lagging badly on HADOOP
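
One common way around a format that is not self-delimiting is to prefix every serialized record with its length, so the reader knows exactly how many bytes to hand to the deserializer. The sketch below assumes a 4-byte big-endian length prefix. That choice, and the names `write_framed` and `read_framed`, are illustrative and are not taken from the thread or from any Hadoop or Protocol Buffers API.

```python
import io
import struct

def write_framed(stream, payload):
    """Write a 4-byte big-endian length, then the payload bytes."""
    stream.write(struct.pack(">I", len(payload)))
    stream.write(payload)

def read_framed(stream):
    """Yield each payload in turn, reading exactly the bytes it owns."""
    while True:
        header = stream.read(4)
        if not header:
            return  # clean end of stream
        if len(header) < 4:
            raise EOFError("truncated length prefix")
        (length,) = struct.unpack(">I", header)
        payload = stream.read(length)
        if len(payload) < length:
            raise EOFError("truncated record")
        yield payload

buf = io.BytesIO()
for record in (b"first", b"", b"third record"):
    write_framed(buf, record)
buf.seek(0)
assert list(read_framed(buf)) == [b"first", b"", b"third record"]
```

Each payload handed to the deserializer then contains no trailing bytes from the next record.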

Re: Can anyone recommend me a inter-language data file format?

2008-11-02 Thread Zhou, Yunqing
I finally decided to use Protocol Buffers. But there is a problem: when Hadoop handles a file larger than the block size, the file is split. How can I determine the boundaries of a sequence of protocol buffer records? I was thinking of using Hadoop's SequenceFile as a container, but it doesn't have a C++ API
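
One possible answer to the boundary question is to borrow the idea behind SequenceFile's sync markers: write a fixed marker every few records, and have a reader that starts at an arbitrary split offset scan forward to the next marker before parsing. The sketch below is an assumption-laden illustration, not SequenceFile's actual on-disk layout. The 16-byte marker, the 4-byte length prefix, and the names `write_with_sync` and `read_split` are all hypothetical.

```python
import io
import os
import struct

# Hypothetical per-file marker. A real container would store it in a file
# header; random bytes make an accidental match inside a payload very unlikely.
SYNC = os.urandom(16)

def write_with_sync(stream, records, sync=SYNC, every=2):
    """Write length-prefixed records, with a sync marker before every `every` records."""
    for i, rec in enumerate(records):
        if i % every == 0:
            stream.write(sync)
        stream.write(struct.pack(">I", len(rec)))
        stream.write(rec)

def read_split(data, start, end, sync=SYNC):
    """Yield the records of every block whose sync marker begins in [start, end)."""
    pos = data.find(sync, start)
    while pos != -1 and pos < end:
        pos += len(sync)
        # Read records until the next marker or the end of the data.
        while pos < len(data) and not data.startswith(sync, pos):
            (length,) = struct.unpack_from(">I", data, pos)
            pos += 4
            yield data[pos:pos + length]
            pos += length
        if pos >= len(data):
            pos = -1

records = [b"a", b"bb", b"ccc", b"dddd", b"eeeee"]
buf = io.BytesIO()
write_with_sync(buf, records)
data = buf.getvalue()

# Two readers on either side of an arbitrary cut see every record exactly once.
cut = len(data) // 2
assert list(read_split(data, 0, cut)) + list(read_split(data, cut, len(data))) == records
```

A block that starts in one split is read to its end by that split's reader, even when it runs past the split boundary, so no record is dropped or read twice.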

Re: Can anyone recommend me a inter-language data file format?

2008-11-01 Thread Bryan Duxbury
Agree, we use Thrift at Rapleaf for this purpose. It's trivial to make a ThriftWritable if you want to be crafty, but you can also just use byte[]s and do the serialization and deserialization yourself. -Bryan On Nov 1, 2008, at 8:01 PM, Alex Loddengaard wrote: Take a look at Thrift:

Re: Can anyone recommend me a inter-language data file format?

2008-11-01 Thread Chris Collins
Consider talking to Doug Cutting. He is playing with the idea of a variant of JSON, and I am sure he would love your help. Specifically, he is looking at a coding scheme that is easy to read, does not duplicate key names per record, and supports file splits. C On Nov 1, 2008, at 8:20 PM, Zhou,

Re: Can anyone recommend me a inter-language data file format?

2008-11-01 Thread Zhou, Yunqing
Can Thrift be easily used in Hadoop? A lot of things would have to be written: input/output formats, writables, a split method, etc.
On Sun, Nov 2, 2008 at 11:01 AM, Alex Loddengaard <[EMAIL PROTECTED]> wrote:
> Take a look at Thrift:
>
>
> Alex
>
> On Sat, Nov 1, 200

Re: Can anyone recommend me a inter-language data file format?

2008-11-01 Thread Zhou, Yunqing
An embedded database cannot handle large-scale data and is not very efficient. I have about 1 billion records. These records should be passed through some modules. I mean a data exchange format similar to XML but more flexible and efficient. On Sun, Nov 2, 2008 at 10:49 AM, lamfeeling <[EMAIL PROTECTED]> wr

Re: Can anyone recommend me a inter-language data file format?

2008-11-01 Thread Chris Collins
Sleepycat has a Java edition: http://www.oracle.com/technology/products/berkeley-db/index.html It has an "interesting" open source license. If you don't need to ship it on an install disk, you're probably good to go with that too. You could also consider Derby. C On Nov 1, 2008, at 7:49 PM, lam

Re: Can anyone recommend me a inter-language data file format?

2008-11-01 Thread Alex Loddengaard
Take a look at Thrift:
Alex
On Sat, Nov 1, 2008 at 7:15 PM, Zhou, Yunqing <[EMAIL PROTECTED]> wrote:
> The project I focused on has many modules written in different languages
> (several modules are hadoop jobs).
> So I'd like to utilize a common record b

Can anyone recommend me a inter-language data file format?

2008-11-01 Thread Zhou, Yunqing
The project I am working on has many modules written in different languages (several modules are Hadoop jobs). So I'd like to use a common record-based data file format for data exchange. XML is not efficient for appending new records. SequenceFile does not seem to have an API for languages other than Ja