[
https://issues.apache.org/jira/browse/THRIFT-110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12637959#action_12637959
]
Noble Paul commented on THRIFT-110:
-----------------------------------
bq. Out of curiosity, do you have a sense of length and frequency of duplicated
strings in your dataset?
When we write out a resultset of Lucene documents, the field names are likely
to repeat from row to row while the values differ. Unlike an RDBMS, the columns
are not fixed in Lucene; a document can have arbitrary fields. It is not
possible to prepopulate the string table because the number of documents may be
huge and they are read from disk just in time.
So the name is written as an EXTERN_STRING, which we know is very likely to
repeat. Making every string an EXTERN_STRING defeats the purpose: we would end
up storing too many strings in memory, which is memory bloat.
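To make the idea concrete, here is a minimal sketch of how an extern string table could work on the writer side. It is an illustration only, not the NamedListCodec code or a proposed Thrift API: the class name ExternStringWriter, its methods, and the use of a fixed 4-byte index are all assumptions made for this example (a compact format would presumably use a variable-length integer instead).

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of an extern string table; not an actual Thrift or Solr class. */
public class ExternStringWriter {
    // Maps each extern string already written to its 1-based index.
    private final Map<String, Integer> table = new HashMap<>();
    private final DataOutputStream out;

    public ExternStringWriter(DataOutputStream out) {
        this.out = out;
    }

    /** Write a string that is expected to repeat, e.g. a field name. */
    public void writeExternString(String s) throws IOException {
        Integer idx = table.get(s);
        if (idx == null) {
            // First occurrence: index 0 signals "new string follows".
            out.writeInt(0);
            out.writeUTF(s);
            table.put(s, table.size() + 1);
        } else {
            // Repeat occurrence: only the index is written.
            out.writeInt(idx);
        }
    }

    /** Write a string that is not expected to repeat, e.g. a field value. */
    public void writePlainString(String s) throws IOException {
        // Plain strings never enter the table, so memory use stays bounded
        // by the number of distinct extern strings.
        out.writeUTF(s);
    }

    public int tableSize() {
        return table.size();
    }
}
```

The table is built while writing, so nothing has to be prepopulated. A reader would keep the mirror-image list and append to it whenever it sees index 0. Only strings the caller marks as extern occupy memory, which is why values are written with the plain method.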
bq. wonder if it would make sense to implement the core of this functionality in
an abstract protocol and then fork it into DenseCompact and SparseCompact
concrete protocols...
I think having too many protocols will cause more problems than it solves. We
already have a working protocol that everyone uses and is more or less fine
with. Let us look for the most compact solution and finalize it. The new
protocol is not compatible with the old one anyway, so even the API does not
have to be compatible. For the new protocol we can have new rules, new types
(no negative ids, etc.), and a new compiler.
> A more compact format
> ----------------------
>
> Key: THRIFT-110
> URL: https://issues.apache.org/jira/browse/THRIFT-110
> Project: Thrift
> Issue Type: Improvement
> Reporter: Noble Paul
> Attachments: compact_proto_spec.txt
>
>
> Thrift is not as compact in writing out data as, say, protobuf. It does
> not have the concept of variable-length integers or various other possible
> optimizations. In Solr we use a lot of such optimizations to make a
> very compact payload. Thrift has a lot in common with that format.
> It is all done in a single class:
> http://svn.apache.org/viewvc/lucene/solr/trunk/src/java/org/apache/solr/common/util/NamedListCodec.java?revision=685640&view=markup
> The other optimizations include writing type/value in the same byte, very fast
> writes of Strings, externalizable strings, etc.
> We could use a Thrift format for non-Java clients, and I would like to see it
> as compact as the current Java version.