[ 
https://issues.apache.org/jira/browse/LUCENE-2886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12990548#comment-12990548
 ] 

Renaud Delbru commented on LUCENE-2886:
---------------------------------------

{quote}
So if we can pack long streams of 1s with
freqs and positions I think this is probably useful for a lot of people.
{quote}
Yes, if the overhead is minimal, it might not be an issue in certain cases.

{quote}
Additionally for the .doc, i see its smaller in the AFOR-3 case too. Is
your "Ent" basically a measure of doc deltas? I'm confused exactly
what it is 
{quote}

Yes, Ent is just a delta representation of the entity id (which can be 
considered the document id). I only changed the name of the concept because 
SIREn principally manipulates entities rather than documents. In my case, an 
entity is just a set of attribute-value pairs, similar to a document in 
Lucene.
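
A minimal sketch of what "delta representation" means here, assuming the 
entity ids of a posting list are sorted in increasing order. The class and 
method names are hypothetical and are not taken from SIREn or Lucene:

```java
import java.util.Arrays;

// Hypothetical illustration only; not SIREn or Lucene code.
final class EntityDeltaSketch {

    // Replace each sorted entity id by its gap to the previous id.
    // The first gap is taken relative to 0.
    static int[] toDeltas(int[] sortedIds) {
        int[] deltas = new int[sortedIds.length];
        int previous = 0;
        for (int i = 0; i < sortedIds.length; i++) {
            deltas[i] = sortedIds[i] - previous;
            previous = sortedIds[i];
        }
        return deltas;
    }

    // Rebuild the original ids with a running sum over the gaps.
    static int[] fromDeltas(int[] deltas) {
        int[] ids = new int[deltas.length];
        int current = 0;
        for (int i = 0; i < deltas.length; i++) {
            current += deltas[i];
            ids[i] = current;
        }
        return ids;
    }

    public static void main(String[] args) {
        int[] deltas = toDeltas(new int[] {3, 4, 5, 9});
        System.out.println(Arrays.toString(deltas));             // [3, 1, 1, 4]
        System.out.println(Arrays.toString(fromDeltas(deltas))); // [3, 4, 5, 9]
    }
}
```

Consecutive ids turn into gaps of 1, which is what the "Ent" figures measure.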

{quote}
Because I would think if you take e.g. Geonames, the place
names in the dataset are not in random order but actually "batched" by
country for example, so you would have long streams of docdelta=1 for
country=Germany's postings. 
{quote}
I checked, and the Geonames dataset was sorted alphabetically by URL:
http://sws.geonames.org/1/
http://sws.geonames.org/10/
...
as were the dbpedia and sindice datasets.

So, yes, this might have (good) consequences on the docdelta list for certain 
datasets such as geonames, and especially when indexing semi-structured data, 
since the schema of the data in one dataset is generally identical across 
entities/documents. It is therefore likely that we will see long runs of 1 
for certain terms or schema terms.
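
To illustrate why long runs of 1 matter, here is a rough sketch under a 
simplifying assumption: a frame of deltas is bit-packed with a fixed width 
chosen from its largest value. This is not the AFOR implementation from the 
attached tarball, and the class and method names are hypothetical:

```java
// Hypothetical illustration only; not the AFOR codec itself.
final class DeltaRunSketch {

    // Length of the longest consecutive run of deltas equal to 1.
    static int longestRunOfOnes(int[] deltas) {
        int best = 0;
        int current = 0;
        for (int delta : deltas) {
            current = (delta == 1) ? current + 1 : 0;
            best = Math.max(best, current);
        }
        return best;
    }

    // Bits per value when a whole frame is packed with one fixed width,
    // i.e. the width needed by the largest value (at least 1 bit).
    static int bitsPerValue(int[] frame) {
        int max = 0;
        for (int value : frame) {
            max = Math.max(max, value);
        }
        return Math.max(1, 32 - Integer.numberOfLeadingZeros(max));
    }

    public static void main(String[] args) {
        int[] deltas = {3, 1, 1, 4, 1, 1, 1, 2};
        System.out.println(longestRunOfOnes(deltas));                // 3
        System.out.println(bitsPerValue(new int[] {1, 1, 1, 1}));    // 1
        System.out.println(bitsPerValue(new int[] {1, 1, 1, 500}));  // 9
    }
}
```

A frame made only of 1s needs a single bit per value, while one large gap in 
the same frame raises the width for every value in it.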

> Adaptive Frame Of Reference 
> ----------------------------
>
>                 Key: LUCENE-2886
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2886
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Codecs
>            Reporter: Renaud Delbru
>             Fix For: 4.0
>
>         Attachments: LUCENE-2886_simple64.patch, 
> LUCENE-2886_simple64_varint.patch, lucene-afor.tar.gz
>
>
> We could test the implementation of the Adaptive Frame Of Reference [1] on 
> the lucene-4.0 branch.
> I am providing the source code of its implementation. Some work needs to be 
> done, as this implementation is working on the old lucene-1458 branch. 
> I will attach a tarball containing a running version (with tests) of the AFOR 
> implementation, as well as the implementations of PFOR and of Simple64 
> (simple family codec working on 64bits word) that has been used in the 
> experiments in [1].
> [1] http://www.deri.ie/fileadmin/documents/deri-tr-afor.pdf

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
