[ https://issues.apache.org/jira/browse/SOLR-810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Noble Paul updated SOLR-810:
----------------------------

    Description: 
For storage purposes javabin can be quite inefficient if we write one document 
at a time, because the field names are written again for each document. 

javabin can be as efficient as a format like Thrift or Protocol Buffers if we 
do not pay the price of a string per name. We can achieve this with a new 
type, KNOWN_STRING. 

KNOWN_STRING is like an EXTERN_STRING, except that the names come from a 
preconfigured map of index -> string. The known string list can probably carry 
a version. The client must use a version of the known string list that is at 
least as new as the server's, so that it can resolve every index the server 
writes. 

An example looks like:
{code}
1:responseHeader
2:QTime
3:status
{code}

A newer version of the string list can add a new string at a new index, but it 
must never change the index of an existing string. This is similar to an IDL 
file in Thrift or Protocol Buffers, but without any of those complexities.
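
The append-only rule above can be checked mechanically. The following is a 
minimal sketch, assuming each version of the list is held as a map of index -> 
string; the class and method names are hypothetical and are not part of 
javabin.

```java
import java.util.Map;

/** Hypothetical helper for this sketch only; not part of javabin. */
final class KnownStringListCheck {

    /**
     * Returns true if every index defined in the older list maps to the
     * same string in the newer list. The newer list may add new indexes.
     */
    static boolean isCompatibleUpgrade(Map<Integer, String> older,
                                       Map<Integer, String> newer) {
        for (Map.Entry<Integer, String> entry : older.entrySet()) {
            String inNewer = newer.get(entry.getKey());
            // A missing or changed entry breaks readers of old data.
            if (!entry.getValue().equals(inNewer)) {
                return false;
            }
        }
        return true;
    }
}
```

A check like this could run wherever a new list version is loaded, so that an 
accidental renumbering is rejected before any data is written with it.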

So when an EXTERN_STRING is about to be written, the writer first looks it up 
in the KNOWN_STRING map. If it is present, it is written as a KNOWN_STRING 
instead of an EXTERN_STRING, and the value written is the index.
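
The write-side lookup described above could look roughly like this. It is only 
a sketch: the class name, the method names and the tag values are assumptions 
made for illustration and do not reflect the actual javabin codec or its tag 
bytes.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of a known string table; not part of javabin. */
final class KnownStringTable {
    // Illustrative tag values chosen for this sketch only.
    static final int TAG_EXTERN_STRING = 1;
    static final int TAG_KNOWN_STRING = 2;

    private final Map<String, Integer> indexByName = new HashMap<>();
    private final Map<Integer, String> nameByIndex = new HashMap<>();

    /** Binds a name to an index; an index can never be rebound to another name. */
    void register(int index, String name) {
        String existing = nameByIndex.get(index);
        if (existing != null && !existing.equals(name)) {
            throw new IllegalArgumentException(
                    "index " + index + " is already bound to " + existing);
        }
        nameByIndex.put(index, name);
        indexByName.put(name, index);
    }

    /** Returns {tag, index}; the index is -1 when the name is not a known string. */
    int[] encodeName(String name) {
        Integer index = indexByName.get(name);
        return index != null
                ? new int[] {TAG_KNOWN_STRING, index}
                : new int[] {TAG_EXTERN_STRING, -1};
    }

    /** Resolves an index on the read side; fails if the reader's list is too old. */
    String decodeKnown(int index) {
        String name = nameByIndex.get(index);
        if (name == null) {
            throw new IllegalStateException("unknown string index " + index);
        }
        return name;
    }
}
```

With the three example entries registered, a name such as QTime would be 
encoded as the KNOWN_STRING tag plus index 2, while any other field name would 
fall back to the existing EXTERN_STRING handling.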

Another addition could be a zip string type. This is useful when javabin is 
used for storing data. In storage, the performance cost of 
serialization/deserialization may not be as important as the space itself. 
This type may also have a minimum size below which strings are not compressed. 
Only large strings (say > 2KB?) may need to be compressed.
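
A size-gated zip string might be sketched as below, using the JDK's 
java.util.zip.Deflater and Inflater. The class name, the method names and the 
2048-byte threshold are assumptions for illustration; the proposal only 
suggests a threshold of roughly 2KB and does not fix a compression scheme.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

/** Hypothetical sketch of a zip string type; not part of javabin. */
final class ZipStringSketch {
    // Assumed threshold, taken from the "say > 2KB?" suggestion.
    static final int MIN_COMPRESS_BYTES = 2048;

    /** Only strings whose encoded form exceeds the threshold are compressed. */
    static boolean shouldCompress(byte[] utf8) {
        return utf8.length > MIN_COMPRESS_BYTES;
    }

    static byte[] compress(byte[] utf8) {
        Deflater deflater = new Deflater();
        deflater.setInput(utf8);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!deflater.finished()) {
            int n = deflater.deflate(buf);
            out.write(buf, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    static byte[] decompress(byte[] compressed) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                // No progress and nothing left to read means truncated input.
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalArgumentException("truncated or invalid input");
                }
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("invalid compressed data", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }
}
```

A writer would call shouldCompress on the encoded bytes and emit either the 
zip string type or a plain string, so short values never pay the compression 
cost.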



  was:
javabin can be as efficient as a format like Thrift or Protocol Buffers if we 
do not pay the price of a string per name. We can achieve this with a new 
type, KNOWN_STRING. 

KNOWN_STRING is like an EXTERN_STRING, except that the names come from a 
preconfigured map of index -> string. The known string list can probably carry 
a version. The client must use a version of the known string list that is at 
least as new as the server's, so that it can resolve every index the server 
writes. 

An example looks like:
{code}
1:responseHeader
2:QTime
3:status
{code}

A newer version of the string list can add a new string at a new index, but it 
must never change the index of an existing string. This is similar to an IDL 
file in Thrift or Protocol Buffers, but without any of those complexities.

So when an EXTERN_STRING is about to be written, the writer first looks it up 
in the KNOWN_STRING map. If it is present, it is written as a KNOWN_STRING 
instead of an EXTERN_STRING, and the value written is the index.




> changes for javabin format
> --------------------------
>
>                 Key: SOLR-810
>                 URL: https://issues.apache.org/jira/browse/SOLR-810
>             Project: Solr
>          Issue Type: Improvement
>            Reporter: Noble Paul

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
