[ 
https://issues.apache.org/jira/browse/THRIFT-110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12637959#action_12637959
 ] 

Noble Paul commented on THRIFT-110:
-----------------------------------

bq.Out of curiosity, do you have a sense of length and frequency of duplicated 
strings in your dataset?

When we write out a result set of Lucene documents, the field names are 
repeated in every row while the values differ. Unlike in an RDBMS, the columns 
are not fixed: a Lucene document can have arbitrary fields. It is not possible 
to prepopulate the string table because the number of documents may be huge 
and they are read from disk just in time. 
So each name, which we know is very likely to repeat, is written as an 
EXTERN_STRING. Making every string an EXTERN_STRING defeats the purpose: we 
would end up holding too many strings in memory, which is memory bloat.
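
To make the trade-off concrete, here is a minimal sketch of the idea, not the 
actual NamedListCodec code or anything from compact_proto_spec.txt. The class 
name, the tag values, the fixed 4-byte lengths and the "index 0 means a new 
string follows" convention are all assumptions made for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical writer: plain strings are always written in full, extern
// strings are written in full once and referenced by index afterwards.
final class ExternStringSketch {
    // Assumed tag values, chosen only for this sketch.
    static final int STRING = 1;
    static final int EXTERN_STRING = 2;

    private final ByteArrayOutputStream out = new ByteArrayOutputStream();
    // Table of strings seen so far; it grows only for extern strings,
    // which is why values should not be written through it.
    private final Map<String, Integer> externTable = new HashMap<>();

    private void writeInt(int v) {
        out.write(v >>> 24);
        out.write(v >>> 16);
        out.write(v >>> 8);
        out.write(v);
    }

    private void writeUtf8(String s) {
        byte[] b = s.getBytes(StandardCharsets.UTF_8);
        writeInt(b.length);
        out.write(b, 0, b.length);
    }

    // Values: tag + length + bytes, nothing is remembered.
    void writeString(String s) {
        out.write(STRING);
        writeUtf8(s);
    }

    // Names: the first occurrence writes index 0 followed by the string,
    // later occurrences write only the 1-based table index.
    void writeExternString(String s) {
        out.write(EXTERN_STRING);
        Integer idx = externTable.get(s);
        if (idx != null) {
            writeInt(idx);
            return;
        }
        writeInt(0);
        writeUtf8(s);
        externTable.put(s, externTable.size() + 1);
    }

    int size() {
        return out.size();
    }

    int tableSize() {
        return externTable.size();
    }
}
```

With this layout a repeated field name costs a tag plus an index instead of 
the full string, and the in-memory table holds one entry per distinct name 
rather than one per value.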

bq.wonder if it would make sense to implement the core of this functionality in 
an abstract protocol and then fork it into DenseCompact and SpareCompact 
concrete protocols............

I think a proposal to create many protocols will lead to more problems than it 
solves. We already have a working protocol that everyone is using and is more 
or less fine with. Let us look at the most compact solution and finalize it. 
The new protocol is not compatible with the old one anyway, so even the API 
does not have to be compatible. For the new protocol we can have new rules and 
new types (no negative ids, etc.) and a new compiler.
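
The issue description names variable-length integers as one optimization 
Thrift lacks. As a rough illustration of how a rule such as "no negative ids" 
could pair with that, here is a sketch of an unsigned varint (7 data bits per 
byte, high bit as a continuation flag). The class and method names are 
hypothetical, and this is an assumed encoding, not the one in the attached 
spec.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical helper: unsigned variable-length integers, least
// significant 7-bit group first.
final class VarIntSketch {
    // Encodes a non-negative int in 1 to 5 bytes.
    static byte[] writeVarInt(int value) {
        if (value < 0) {
            // Assumed rule for this sketch: ids are never negative,
            // so no sign handling is needed.
            throw new IllegalArgumentException("negative ids are not allowed");
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80); // more bytes follow
            value >>>= 7;
        }
        out.write(value); // last byte, continuation bit clear
        return out.toByteArray();
    }

    // Decodes a value produced by writeVarInt.
    static int readVarInt(byte[] buf) {
        int result = 0;
        int shift = 0;
        for (byte b : buf) {
            result |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                return result;
            }
            shift += 7;
        }
        throw new IllegalArgumentException("truncated varint");
    }
}
```

Under this scheme ids below 128 take a single byte, where a fixed-width 
field would always take the full width.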

> A more compact format 
> ----------------------
>
>                 Key: THRIFT-110
>                 URL: https://issues.apache.org/jira/browse/THRIFT-110
>             Project: Thrift
>          Issue Type: Improvement
>            Reporter: Noble Paul
>         Attachments: compact_proto_spec.txt
>
>
> Thrift is not as compact in writing out data as (say) protobuf. It does 
> not have the concept of variable-length integers or various other possible 
> optimizations. In Solr we use a lot of such optimizations to make a 
> very compact payload. Thrift has a lot in common with that format.
> It is all done in a single class:
> http://svn.apache.org/viewvc/lucene/solr/trunk/src/java/org/apache/solr/common/util/NamedListCodec.java?revision=685640&view=markup
> The other optimizations include writing type/value in the same byte, very 
> fast writes of Strings, externalizable strings, etc. 
> We could use a Thrift format for non-Java clients, and I would like to see it 
> as compact as the current Java version.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
