Re: [protobuf] ProtocolBuffer + compression in hadoop?

2010-02-19 Thread Yang
Thanks, parseDelimitedFrom()/writeDelimitedTo() is exactly what I needed. I see that the LimitedStream underneath restricts reading from the original stream, so we do not need to re-use the stream.
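A minimal sketch of what this ends up looking like, assuming a hypothetical generated message class named Record: parseDelimitedFrom() can be handed the same InputStream over and over, reads exactly one size-prefixed message per call, and returns null once the stream is exhausted.

    import java.io.FileInputStream;
    import java.io.InputStream;

    public class DelimitedReadSketch {
      public static void main(String[] args) throws Exception {
        InputStream in = new FileInputStream(args[0]);
        try {
          Record rec;
          // Each call consumes one varint length prefix plus that many
          // message bytes, so no manual boundary handling is needed.
          while ((rec = Record.parseDelimitedFrom(in)) != null) {
            System.out.println(rec);
          }
        } finally {
          in.close();
        }
      }
    }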

Re: [protobuf] ProtocolBuffer + compression in hadoop?

2010-02-19 Thread Kenton Varda
If the underlying stream does not provide its own boundaries then you need to prefix the protocol message with a size. Hacking an end-of-record "feature" into the protobuf code is probably not a good idea. We already provide parseDelimitedFrom()/writeDelimitedTo(), which prefix the message with a size.
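The matching write side, sketched against the same hypothetical Record class as above: writeDelimitedTo() writes the message's serialized size as a varint and then the message bytes, which is exactly the size prefix being described here.

    import java.io.FileOutputStream;
    import java.io.OutputStream;
    import java.util.List;

    public class DelimitedWriteSketch {
      static void writeAll(List<Record> records, String path) throws Exception {
        OutputStream out = new FileOutputStream(path);
        try {
          for (Record rec : records) {
            rec.writeDelimitedTo(out);  // varint size, then the message bytes
          }
        } finally {
          out.close();
        }
      }
    }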

Re: [protobuf] ProtocolBuffer + compression in hadoop?

2010-02-19 Thread Yang
For your last comment: yes, the end-of-record indicator was another hack I put in. But both of your options above ultimately require the underlying stream to provide exact record boundaries; in my last email I pointed out that this may or may not be a valid requirement for the underlying InputStream.

Re: [protobuf] ProtocolBuffer + compression in hadoop?

2010-02-19 Thread Kenton Varda
Two options: 1) Do not use parseFrom(InputStream); use parseFrom(byte[]). Read the byte array from the stream yourself, so you can make sure to read only the correct number of bytes. 2) Create a FilterInputStream subclass which limits reading to some number of bytes, and wrap your InputStream in it before parsing.
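A sketch of option 2, using only standard java.io; the class name and usage are illustrative, not code from the thread.

    import java.io.FilterInputStream;
    import java.io.IOException;
    import java.io.InputStream;

    /** Refuses to read past a fixed number of bytes of the wrapped stream. */
    public class BoundedInputStream extends FilterInputStream {
      private int remaining;

      public BoundedInputStream(InputStream in, int limit) {
        super(in);
        this.remaining = limit;
      }

      @Override
      public int read() throws IOException {
        if (remaining <= 0) {
          return -1;                       // report end-of-stream at the limit
        }
        int b = super.read();
        if (b >= 0) {
          remaining--;
        }
        return b;
      }

      @Override
      public int read(byte[] buf, int off, int len) throws IOException {
        if (remaining <= 0) {
          return -1;
        }
        int n = super.read(buf, off, Math.min(len, remaining));
        if (n > 0) {
          remaining -= n;
        }
        return n;
      }
    }

The parser then sees end-of-stream exactly at the record boundary, e.g. Record.parseFrom(new BoundedInputStream(rawIn, recordSize)) for a hypothetical Record message whose size is known up front.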

Re: [protobuf] ProtocolBuffer + compression in hadoop?

2010-02-19 Thread Yang
I found the issue; this has the same root cause as a previous issue I reported on this forum. Basically, I think PB assumes that it stops only where the provided stream has ended, and otherwise it keeps on reading. In the last issue the buffer was too long and it read in further junk, so I put an end-of-record indicator in.
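This behavior is easy to reproduce in isolation. A sketch, again with a hypothetical Record message type: two messages written back to back with writeTo() have no boundary between them, so parseFrom(InputStream) reads until end-of-stream and the second message is merged into the first.

    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;

    public class MergeSketch {
      static Record demonstrate(Record first, Record second) throws Exception {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        first.writeTo(buf);
        second.writeTo(buf);
        // Parses the whole concatenation as if it were one message:
        // singular fields from 'second' overwrite those from 'first',
        // repeated fields are appended.
        return Record.parseFrom(new ByteArrayInputStream(buf.toByteArray()));
      }
    }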

Re: [protobuf] ProtocolBuffer + compression in hadoop?

2010-02-18 Thread Christopher Smith
Is this a case of needing to delimit the input? I'm not familiar with SplitterInputStream, but I'm wondering if it does the right thing for this to work. --Chris

Fwd: [protobuf] ProtocolBuffer + compression in hadoop?

2010-02-18 Thread Yang
By the way, I tried other input file formats and they worked (TFile and RCFile specifically); only SequenceFile had issues.

Re: [protobuf] ProtocolBuffer + compression in hadoop?

2010-02-18 Thread Kenton Varda
Please reply-all so the mailing list stays CC'd. I don't know anything about the libraries you are using, so I can't really help you further. Maybe someone else can. On Thu, Feb 18, 2010 at 12:46 PM, Yang wrote: > Thanks Kenton, I thought about the same; what I did was that I used a splitt…

Re: [protobuf] ProtocolBuffer + compression in hadoop?

2010-02-18 Thread Kenton Varda
You should verify that the bytes that come out of the InputStream really are the exact same bytes that were written by the serializer to the OutputStream originally. You could do this by computing a checksum at both ends and printing it, then inspecting visually. You'll probably find that the bytes do not match.
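A sketch of the suggested check, using java.util.zip.CRC32; the payloads here are stand-ins for the byte arrays handed to the OutputStream on the write side and handed back by the InputStream on the read side.

    import java.util.zip.CRC32;

    public class ChecksumSketch {
      // Print this value on both sides of the pipeline and compare by eye.
      static long checksum(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data);
        return crc.getValue();
      }

      public static void main(String[] args) {
        byte[] written = "example payload".getBytes();   // bytes given to the OutputStream
        byte[] readBack = "example payload".getBytes();  // bytes produced by the InputStream
        System.err.println("write-side crc=" + checksum(written));
        System.err.println("read-side  crc=" + checksum(readBack));
      }
    }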

[protobuf] ProtocolBuffer + compression in hadoop?

2010-02-18 Thread Yang
I tried to use protocol buffers in Hadoop. So far it works fine with SequenceFile, after I hooked it up with a simple wrapper, but after I put a compressor into the SequenceFile it fails: it reads all the messages and yet still wants to advance the read pointer, and then readTag() returns 0, so…
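The "simple wrapper" is not shown in the thread; the sketch below is one hypothetical shape it could take as a Hadoop Writable, with the serialized message length-prefixed so readFields() consumes exactly one record and never asks the value stream for more bytes than were written. Record is again a placeholder generated message class.

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.Writable;

    public class RecordWritable implements Writable {
      private Record record;

      public Record get() { return record; }
      public void set(Record record) { this.record = record; }

      @Override
      public void write(DataOutput out) throws IOException {
        byte[] bytes = record.toByteArray();
        out.writeInt(bytes.length);   // explicit boundary for the reader
        out.write(bytes);
      }

      @Override
      public void readFields(DataInput in) throws IOException {
        int len = in.readInt();
        byte[] bytes = new byte[len];
        in.readFully(bytes);          // read exactly one record's worth
        record = Record.parseFrom(bytes);
      }
    }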