Protocol Buffers Vs. XML Fast Infoset

2009-04-03 Thread ShirishKul

Kenton,

I ran a comparison between *XML Fast Infoset* and *Protocol
Buffers* (although I'm not familiar with the internals of either).

I found that for typical data transferred across the wire, a payload
that takes 500KB as XML comes to about 300KB as PB binary and around
130KB as an XML Fast Infoset binary file.

Parsing and serialization timings are extremely good for Protocol
Buffers.

What accounts for the difference between XML Fast Infoset binary and
PB binary in terms of size, parsing speed, etc.?

Regards,
Shirish
--~--~-~--~~~---~--~~
You received this message because you are subscribed to the Google Groups 
"Protocol Buffers" group.
To post to this group, send email to protobuf@googlegroups.com
To unsubscribe from this group, send email to 
protobuf+unsubscr...@googlegroups.com
For more options, visit this group at 
http://groups.google.com/group/protobuf?hl=en
-~--~~~~--~~--~--~---



Re: Protocol Buffers Vs. XML Fast Infoset

2009-04-03 Thread Kenton Varda
On Fri, Apr 3, 2009 at 2:40 AM, ShirishKul  wrote:

> I found that for typical data transferred across the wire, a payload
> that takes 500KB as XML comes to about 300KB as PB binary and around
> 130KB as an XML Fast Infoset binary file.


What kind of data were you encoding?

I'm guessing you enabled some kind of compression for the FI encoding?  Note
that protocol buffers, while compact, do not actually apply any sort of
compression themselves.  For repetitive data or data containing a lot of
text strings, applying zlib compression to the encoded message can make it
much smaller.
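As a sketch of this suggestion: the snippet below uses a made-up repetitive byte string standing in for an encoded message (it is not output from any real .proto schema) to show how much zlib can shave off repetitive serialized data.

```python
import zlib

# Stand-in for an encoded protobuf message containing many repeated
# text values (hypothetical payload, not produced by a real schema).
payload = b"".join(b"\x0a\x0bhello world" for _ in range(1000))

compressed = zlib.compress(payload)
# The repetitive payload shrinks dramatically after compression.
print(len(payload), "->", len(compressed))
```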


> Parsing and serialization timings are extremely good for Protocol
> Buffers.


:)

(Don't forget to use optimize_for = SPEED if performance is important --
this will be the default in the next version.)
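For reference, the option goes at file scope in the .proto file; a minimal sketch (the message definition is hypothetical):

```proto
// Ask protoc to generate code optimized for parse/serialize speed.
option optimize_for = SPEED;

message Example {          // hypothetical message definition
  optional string name = 1;
  optional int32 value = 2;
}
```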

> What accounts for the difference between XML Fast Infoset binary and
> PB binary in terms of size, parsing speed, etc.?


I don't actually know much about FI.  My guess based on reading some
descriptions of FI is that PB is similar to FI's non-self-describing,
no-compression mode.  I would also guess that because XML is a much more
complicated format than protocol buffers, FI probably has more overhead when
encoding simple structured data, especially number-heavy data.  For
string-heavy data, though, XML works pretty well and so this overhead may
not be an issue in that case.
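Much of protobuf's advantage on number-heavy data comes from its base-128 varint integer encoding. The sketch below reimplements it in Python purely as an illustration; the real encoder lives inside the protobuf runtime.

```python
def encode_varint(n: int) -> bytes:
    """Base-128 varint as used by protobuf for integer fields:
    7 payload bits per byte, high bit set on all but the last byte."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)         # final byte
            return bytes(out)

# 150 encodes in two bytes, versus e.g. "<value>150</value>" in XML text.
print(encode_varint(150).hex())  # → '9601'
```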




Re: Protocol Buffers Vs. XML Fast Infoset

2009-04-06 Thread ShirishKul

Kenton,

I've *not* applied any kind of compression while using FI. I'm not
certain, but I believe no compression is applied by default when XML
is converted to an FI file.

FYI...

There is a free tool called "Noemax FI Converter"
[http://www.noemax.com/free_downloads/fi_converter.html] that converts
XML files to FI files and vice versa, with *compression* as an
additional option. Using "Noemax FI Viewer"
[http://www.noemax.com/free_downloads/fi_viewer.html], an FI file can
be viewed in text format.

Regards,
Shirish

On Apr 3, 10:08 pm, Kenton Varda  wrote:
> On Fri, Apr 3, 2009 at 2:40 AM, ShirishKul  wrote:
> > I found that for typical data transferred across the wire, a payload
> > that takes 500KB as XML comes to about 300KB as PB binary and around
> > 130KB as an XML Fast Infoset binary file.
>
> What kind of data were you encoding?
>
> I'm guessing you enabled some kind of compression for the FI encoding?  Note
> that protocol buffers, while compact, do not actually apply any sort of
> compression themselves.  For repetitive data or data containing a lot of
> text strings, applying zlib compression to the encoded message can make it
> much smaller.
>
> > Parsing and serialization timings are extremely good for Protocol
> > Buffers.
>
> :)
>
> (Don't forget to use optimize_for = SPEED if performance is important --
> this will be the default in the next version.)
>
> > What accounts for the difference between XML Fast Infoset binary and
> > PB binary in terms of size, parsing speed, etc.?
>
> I don't actually know much about FI.  My guess based on reading some
> descriptions of FI is that PB is similar to FI's non-self-describing,
> no-compression mode.  I would also guess that because XML is a much more
> complicated format than protocol buffers, FI probably has more overhead when
> encoding simple structured data, especially number-heavy data.  For
> string-heavy data, though, XML works pretty well and so this overhead may
> not be an issue in that case.



Re: Protocol Buffers Vs. XML Fast Infoset

2009-04-07 Thread Kenton Varda
On Mon, Apr 6, 2009 at 1:53 AM, ShirishKul  wrote:

> I've *not* applied any kind of compression while using the FI.


Then how did your FI data get so small?  Protocol Buffers do not leave much
room to further reduce the size without some sort of compression.  Can you
provide example files?




Re: Protocol Buffers Vs. XML Fast Infoset

2009-04-07 Thread ShirishKul

Kenton,

> > I've *not* applied any kind of compression while using the FI.
>
> Then how did your FI data get so small?  Protocol Buffers do not leave much
> room to further reduce the size without some sort of compression.  Can you
> provide example files?

I do not have a sample file to share with you, but I think FI
handles repetitive attribute values.

Regards,
Shirish



Re: Protocol Buffers Vs. XML Fast Infoset

2009-04-07 Thread Jon Skeet

On Apr 3, 10:40 am, ShirishKul  wrote:
> I ran a comparison between *XML Fast Infoset* and *Protocol
> Buffers* (although I'm not familiar with the internals of either).
>
> I found that for typical data transferred across the wire, a payload
> that takes 500KB as XML comes to about 300KB as PB binary and around
> 130KB as an XML Fast Infoset binary file.

Just going back to these numbers, a less-than-50% benefit for going
from XML to PB is surprisingly bad.

Do you have a sample file with non-confidential data that we could
look at?

Jon





Re: Protocol Buffers Vs. XML Fast Infoset

2009-04-08 Thread Alex

On Apr 3, 8:08 pm, Kenton Varda  wrote:
> I'm guessing you enabled some kind of compression for the FI encoding?  Note
> that protocol buffers, while compact, do not actually apply any sort of
> compression themselves.  For repetitive data or data containing a lot of
> text strings, applying zlib compression to the encoded message can make it
> much smaller.

The Fast Infoset standard does not specify the use of GZIP or any
other compression algorithm within an FI document. Compression may
optionally be applied *after* the encoding, the same as with text XML
or any other type of document. However, FI does include a simple
vocabulary mechanism that reduces redundancy and which can be very
effective when there are repeating values within a document.
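That vocabulary mechanism can be sketched roughly as a string-interning table: the first occurrence of a value is written literally and added to the table, and later occurrences are written as a small index. The function below is a toy illustration of the idea, not the actual FI wire format.

```python
def vocab_encode(values):
    """Toy sketch of a vocabulary table: a literal on first sight,
    a small integer index on every repeat (not the real FI format)."""
    table, out = {}, []
    for v in values:
        if v in table:
            out.append(("ref", table[v]))  # repeat: emit index only
        else:
            table[v] = len(table)
            out.append(("lit", v))         # first occurrence: emit literal
    return out

encoded = vocab_encode(["red", "green", "red", "red", "green"])
# → [('lit', 'red'), ('lit', 'green'), ('ref', 0), ('ref', 0), ('ref', 1)]
```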

> I don't actually know much about FI.  My guess based on reading some
> descriptions of FI is that PB is similar to FI's non-self-describing,
> no-compression mode.

FI documents generated by FI Converter are always self-describing. The
FI standard does specify the use of an external vocabulary but AFAIK
this feature is rarely used. So anyone can decode an FI document back
to its original XML Infoset without any loss of information -- and
without requiring access to a schema or any other knowledge of the
original dataset.

> I would also guess that because XML is a much more
> complicated format than protocol buffers, FI probably has more overhead when
> encoding simple structured data, especially number-heavy data.  For
> string-heavy data, though, XML works pretty well and so this overhead may
> not be an issue in that case.

Yes, processing-wise protocol buffers is faster than both the text XML
and the FI encodings (and I would expect the same to be true for any
other XML encoding). That's because protocol buffers has a simpler job
to do than encoding XML. If processing performance is the major
concern and XML interoperability and self-description are of no
importance, then protocol buffers is a good option to consider.

Alexander



Re: Protocol Buffers Vs. XML Fast Infoset

2009-04-08 Thread Kenton Varda
On Tue, Apr 7, 2009 at 10:15 PM, ShirishKul  wrote:

> I do not have a sample file to share with you, but I think FI
> handles repetitive attribute values.


OK, well, I call that "compression".  Try gzipping the final protobuf and FI
documents and comparing the compressed sizes.  The protobuf will probably
compress better, so I'd expect the final results to be roughly even.
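The suggested experiment can be sketched as follows. The two payloads are hypothetical stand-ins for the same data in a verbose encoding and a dense one, not real FI or protobuf output; the point is that the redundant payload leaves far more for gzip to remove.

```python
import gzip

# Hypothetical stand-ins for one dataset in two encodings: a verbose
# one full of repeated tags, and an already-dense binary-style one.
verbose = b"<item code='AB'/>" * 500   # redundant, compresses very well
dense = bytes(range(256)) * 4          # little redundancy left to remove

for name, doc in [("verbose", verbose), ("dense", dense)]:
    z = gzip.compress(doc)
    print(name, len(doc), "->", len(z))
```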




Re: Protocol Buffers Vs. XML Fast Infoset

2009-04-10 Thread Alexander Philippou

On Apr 8, 10:15 pm, Kenton Varda  wrote:
> OK, well, I call that "compression".  Try gzipping the final protobuf and FI
> documents and comparing the compressed sizes.  The protobuf will probably
> compress better, so I'd expect the final results to be roughly even.

The redundancy elimination mechanism of FI is actually a vocabulary
and it works differently than compression algorithms do. FI documents
are good candidates for compression irrespective of whether a
vocabulary is used or not. We've done a few tests with medium/large-
sized documents and protobuf wasn't more compact than FI.

Using FI as a WCF message encoding will typically yield better
compactness than FI Converter reports. FI Converter has no awareness
of the data types used in the original document being converted, so
everything is encoded as literals. When FI is used as a WCF message
encoding, however, it is data-type aware and chooses the
representation most appropriate for achieving the highest compactness
for each value. FI supports binary, literal, and restricted-alphabet
representations; when binary is used, FI can downgrade to a smaller
range; and the vocabulary can be employed to eliminate repetitions of
the same values. I hope this explains how it is possible that Shirish
saw better compactness with FI. That said, I accept that there are
circumstances, especially with very small messages, in which protobuf
might have the upper hand.

Alexander



Re: Protocol Buffers Vs. XML Fast Infoset

2009-04-10 Thread Kenton Varda
On Fri, Apr 10, 2009 at 5:24 AM, Alexander Philippou
<alexander.philip...@gmail.com> wrote:

> The redundancy elimination mechanism of FI is actually a vocabulary
> and it works differently than compression algorithms do.


I think we define "compression" differently.  In my book, "redundancy
elimination" and "compression" are pretty much synonymous.  It sounds like
you are using a more specific definition (LZW?).


> FI documents
> are good candidates for compression irrespective of whether a
> vocabulary is used or not. We've done a few tests with medium/large-
> sized documents and protobuf wasn't more compact than FI.


Sure, but FI wasn't smaller than protobuf either, was it?  I would expect
that after applying some sort of LZW compression to *both* documents, they'd
come out roughly the same size.  (FI would probably have some overhead for
self-description but for large documents that wouldn't matter.)

Without the LZW applied, perhaps FI is smaller due to its "redundancy
elimination" -- I still don't know enough about FI to really understand how
it works.  However, I suspect protobuf will be much faster to parse and
encode, by virtue of being simpler.




Re: Protocol Buffers Vs. XML Fast Infoset

2009-04-13 Thread Alexander Philippou

On Apr 10, 10:19 pm, Kenton Varda  wrote:
> I think we define "compression" differently.  In my book, "redundancy
> elimination" and "compression" are pretty much synonymous.  It sounds like
> you are using a more specific definition (LZW?).

If that were true, then string interning would also be classified as
compression ;-)

What you are actually referring to is "compaction", not "compression".
Compaction reduces the amount of data used to represent a given amount
of information. For example, an XML encoder can perform compaction by
eliminating unnecessary redundancy, removing irrelevancy, or using a
special representation such as a restricted alphabet; all of these are
part of the encoder's work. Compression does not reduce the amount of
data used to represent a given amount of information as compaction
does; it reduces the space taken by that data. Unlike an XML encoder,
a compressor cannot create a representation of any information; it can
only be fed an existing representation, and its output is the same
representation packed into a denser format.

Fast Infoset is a compact encoding of the XML Infoset. GZIP is a
compressed data format. The binary XML community uses the term
"compactness" when considering the size of a representation of the XML
Infoset; the term "compression" is used when GZIP or another
compression format is applied to further reduce the size of a binary
XML representation.
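A small illustration of the distinction (neither line is real FI or GZIP framing): packing decimal digits two per byte is compaction, because the encoder understands the content well enough to pick a restricted representation; zlib is compression, because it only sees opaque bytes and packs the existing representation more densely.

```python
import zlib

digits = "31415926535897932384626433832795" * 8  # digit-only text

# Compaction: re-encode using knowledge of the content -- two decimal
# digits per byte instead of one ASCII byte each (restricted alphabet).
packed = bytes(int(digits[i:i + 2]) for i in range(0, len(digits), 2))

# Compression: pack the *existing* representation into a denser
# format without understanding it.
zipped = zlib.compress(digits.encode())

print(len(digits), len(packed), len(zipped))
```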

> Sure, but FI wasn't smaller than protobuf either, was it?

In the few tests that we performed FI was smaller than protobuf, but
not by a large margin. However, both formats have the potential of
being considerably more compact than the other under different
circumstances; for example, protobuf with small datasets, FI with
medium/large datasets containing repeating values.

> I would expect
> that after applying some sort of LZW compression to *both* documents, they'd
> come out roughly the same size.  (FI would probably have some overhead for
> self-description but for large documents that wouldn't matter.)

In the same tests as those mentioned above, using GZIP compression on
Fast Infoset and protobuf documents resulted in "roughly the same
size" of compressed docs.

> Without the LZW applied, perhaps FI is smaller due to its "redundancy
> elimination" -- I still don't know enough about FI to really understand how
> it works.  However, I suspect protobuf will be much faster to parse and
> encode, by virtue of being simpler.

Yes, protobuf is much faster, I stated so in an earlier post.

Alexander