[jira] [Created] (LUCENE-4018) Make accessible subenums in MappingMultiDocsEnum

2012-04-24 Thread Renaud Delbru (JIRA)
Renaud Delbru created LUCENE-4018:
-

 Summary: Make accessible subenums in MappingMultiDocsEnum
 Key: LUCENE-4018
 URL: https://issues.apache.org/jira/browse/LUCENE-4018
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/codecs
Affects Versions: 4.0
Reporter: Renaud Delbru
 Fix For: 4.0
 Attachments: LUCENE-4018.patch

The #merge method of the PostingsConsumer receives MappingMultiDocsEnum and 
MappingMultiDocsAndPositionsEnum as postings enums. In certain cases (with 
specific postings formats), the #merge method needs to be overridden, and the 
underlying DocsEnums wrapped by the MappingMultiDocsEnum need to be accessed.

The MappingMultiDocsEnum class should provide a #getSubs method, similar to the 
one in MultiDocsEnum.
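
For illustration, the proposed accessor would look roughly like this, mirroring 
MultiDocsEnum#getSubs() (the subs/numSubs fields are the assumed existing 
internals, other members are omitted; this is a sketch, not the attached patch):

{code}
/** Expose the wrapped sub-enums, as MultiDocsEnum#getSubs() already does. */
public MultiDocsEnum.EnumWithSlice[] getSubs() {
  return subs;
}

/** Number of valid entries in the array returned by getSubs(). */
public int getNumSubs() {
  return numSubs;
}
{code}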

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-4018) Make accessible subenums in MappingMultiDocsEnum

2012-04-24 Thread Renaud Delbru (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Renaud Delbru updated LUCENE-4018:
--

Attachment: LUCENE-4018.patch

> Make accessible subenums in MappingMultiDocsEnum
> 
>
> Key: LUCENE-4018
> URL: https://issues.apache.org/jira/browse/LUCENE-4018
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: core/codecs
>Affects Versions: 4.0
>Reporter: Renaud Delbru
>  Labels: codec, flex, merge
> Fix For: 4.0
>
> Attachments: LUCENE-4018.patch
>
>
> The #merge method of the PostingsConsumer receives MappingMultiDocsEnum and 
> MappingMultiDocsAndPositionsEnum as postings enums. In certain cases (with 
> specific postings formats), the #merge method needs to be overridden, and 
> the underlying DocsEnums wrapped by the MappingMultiDocsEnum need to be 
> accessed.
> The MappingMultiDocsEnum class should provide a #getSubs method, similar to 
> the one in MultiDocsEnum.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (LUCENE-4046) Allows IOException in DocsEnum#freq()

2012-05-10 Thread Renaud Delbru (JIRA)
Renaud Delbru created LUCENE-4046:
-

 Summary: Allows IOException in DocsEnum#freq()
 Key: LUCENE-4046
 URL: https://issues.apache.org/jira/browse/LUCENE-4046
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/index
Reporter: Renaud Delbru
 Fix For: 4.0


Currently, DocsEnum#freq() is not declared to throw IOException. This is 
problematic if somebody wants to implement a codec that lazily loads 
frequencies: the frequency is read and decoded only when #freq() is called, 
which means calling IndexInput's read methods, and those can throw IOException.

The current workaround is to catch the IOException inside freq() and swallow 
it, which is neither clean nor a good solution.
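
To make this concrete, a rough sketch of the signature change and of the 
workaround a lazy-loading implementation is forced into today (the 
decodeFreq/freqIn members are hypothetical, not from any patch):

{code}
// Current signature:   public abstract int freq();
// Proposed signature:  public abstract int freq() throws IOException;

// Workaround a lazy-loading DocsEnum has to use today:
@Override
public int freq() {
  try {
    return decodeFreq(freqIn);  // reads from an IndexInput, which may throw IOException
  } catch (IOException e) {
    // the exception is silently swallowed -- not very nice
    return 1;
  }
}
{code}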

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4046) Allows IOException in DocsEnum#freq()

2012-05-13 Thread Renaud Delbru (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13274333#comment-13274333
 ] 

Renaud Delbru commented on LUCENE-4046:
---

Ok, I'll try to provide a patch in the coming weeks.

> Allows IOException in DocsEnum#freq()
> -
>
> Key: LUCENE-4046
> URL: https://issues.apache.org/jira/browse/LUCENE-4046
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Renaud Delbru
>Assignee: Simon Willnauer
>  Labels: codec, index
> Fix For: 4.0
>
>
> Currently, DocsEnum#freq() is not declared to throw IOException. This is 
> problematic if somebody wants to implement a codec that lazily loads 
> frequencies: the frequency is read and decoded only when #freq() is called, 
> which means calling IndexInput's read methods, and those can throw IOException.
> The current workaround is to catch the IOException inside freq() and swallow 
> it, which is neither clean nor a good solution.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4046) Allows IOException in DocsEnum#freq()

2012-05-20 Thread Renaud Delbru (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13279712#comment-13279712
 ] 

Renaud Delbru commented on LUCENE-4046:
---

Great, thanks Simon.

> Allows IOException in DocsEnum#freq()
> -
>
> Key: LUCENE-4046
> URL: https://issues.apache.org/jira/browse/LUCENE-4046
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Renaud Delbru
>Assignee: Simon Willnauer
>  Labels: codec, index
> Fix For: 4.0
>
> Attachments: LUCENE-4046.patch
>
>
> Currently, DocsEnum#freq() is not declared to throw IOException. This is 
> problematic if somebody wants to implement a codec that lazily loads 
> frequencies: the frequency is read and decoded only when #freq() is called, 
> which means calling IndexInput's read methods, and those can throw IOException.
> The current workaround is to catch the IOException inside freq() and swallow 
> it, which is neither clean nor a good solution.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Comment Edited] (LUCENE-4055) Refactor SegmentInfo / FieldInfo to make them extensible

2012-05-27 Thread Renaud Delbru (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13284153#comment-13284153
 ] 

Renaud Delbru edited comment on LUCENE-4055 at 5/27/12 1:07 PM:


Does this patch allow FreqProxTermsWriterPerField to be dependent on the 
Codec/Field? I have a use case where I need my own FreqProxTermsWriterPerField 
for certain fields.

I am asking this because I see in your patch the following line in 
FreqProxTermsWriter.java:

{code}
final FreqProxTermsWriterPerField fieldWriter = allFields.get(fieldNumber);
{code}

which seems to retrieve a particular FreqProxTermsWriterPerField for each field 
type.

  was (Author: renaud.delbru):
Does this patch allow FreqProxTermsWriterPerField to be dependent on the 
Codec/Field? I have a use case where I need my own FreqProxTermsWriterPerField 
for certain fields.

I am asking this because I see in your patch the following line:

{code}
final FreqProxTermsWriterPerField fieldWriter = allFields.get(fieldNumber);
{code}

which seems to retrieve a particular FreqProxTermsWriterPerField for each field 
type.
  
> Refactor SegmentInfo / FieldInfo to make them extensible
> 
>
> Key: LUCENE-4055
> URL: https://issues.apache.org/jira/browse/LUCENE-4055
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: core/codecs
>Reporter: Andrzej Bialecki 
>Assignee: Robert Muir
> Fix For: 4.0
>
> Attachments: LUCENE-4055.patch
>
>
> After LUCENE-4050 is done the resulting SegmentInfo / FieldInfo classes 
> should be made abstract so that they can be extended by Codec-s.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4055) Refactor SegmentInfo / FieldInfo to make them extensible

2012-05-27 Thread Renaud Delbru (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13284153#comment-13284153
 ] 

Renaud Delbru commented on LUCENE-4055:
---

Does this patch allow FreqProxTermsWriterPerField to be dependent on the 
Codec/Field? I have a use case where I need my own FreqProxTermsWriterPerField 
for certain fields.

I am asking this because I see in your patch the following line:

{code}
final FreqProxTermsWriterPerField fieldWriter = allFields.get(fieldNumber);
{code}

which seems to retrieve a particular FreqProxTermsWriterPerField for each field 
type.

> Refactor SegmentInfo / FieldInfo to make them extensible
> 
>
> Key: LUCENE-4055
> URL: https://issues.apache.org/jira/browse/LUCENE-4055
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: core/codecs
>Reporter: Andrzej Bialecki 
>Assignee: Robert Muir
> Fix For: 4.0
>
> Attachments: LUCENE-4055.patch
>
>
> After LUCENE-4050 is done the resulting SegmentInfo / FieldInfo classes 
> should be made abstract so that they can be extended by Codec-s.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Comment Edited] (LUCENE-4055) Refactor SegmentInfo / FieldInfo to make them extensible

2012-05-27 Thread Renaud Delbru (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13284153#comment-13284153
 ] 

Renaud Delbru edited comment on LUCENE-4055 at 5/27/12 1:44 PM:


Does this patch allow TermsHashConsumerPerField to be dependent on the 
Codec/Field? I have a use case where I need my own TermsHashConsumerPerField 
for certain fields.

At the moment, the only extension of TermsHashConsumer is FreqProxTermsWriter, 
which creates a FreqProxTermsWriterPerField for every field (see 
FreqProxTermsWriter#addField()).

A possible solution to my problem would be a PostingsTermsWriter (an extension 
of TermsHashConsumer) that creates a specific PostingsTermsWriterPerField (an 
extension of TermsHashConsumerPerField) depending on the codec information 
found in the FieldInfo object.

  was (Author: renaud.delbru):
Does this patch allow FreqProxTermsWriterPerField to be dependent on the 
Codec/Field? I have a use case where I need my own FreqProxTermsWriterPerField 
for certain fields.

I am asking this because I see in your patch the following line in 
FreqProxTermsWriter.java:

{code}
final FreqProxTermsWriterPerField fieldWriter = allFields.get(fieldNumber);
{code}

which seems to retrieve a particular FreqProxTermsWriterPerField for each field 
type.
  
> Refactor SegmentInfo / FieldInfo to make them extensible
> 
>
> Key: LUCENE-4055
> URL: https://issues.apache.org/jira/browse/LUCENE-4055
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: core/codecs
>Reporter: Andrzej Bialecki 
>Assignee: Robert Muir
> Fix For: 4.0
>
> Attachments: LUCENE-4055.patch
>
>
> After LUCENE-4050 is done the resulting SegmentInfo / FieldInfo classes 
> should be made abstract so that they can be extended by Codec-s.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4055) Refactor SegmentInfo / FieldInfo to make them extensible

2012-05-28 Thread Renaud Delbru (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13284533#comment-13284533
 ] 

Renaud Delbru commented on LUCENE-4055:
---

Hi Robert,

sorry if it seemed a bit out of context, but I am trying to understand how to 
do this properly.

Indeed, I can create my own indexing chain which includes my TermsHashConsumer 
customisation. However, I would still need the codec metadata for every field. 
But from what you told me, it seems that this codec-specific metadata could now 
be added to FieldInfo. Is that correct?

> Refactor SegmentInfo / FieldInfo to make them extensible
> 
>
> Key: LUCENE-4055
> URL: https://issues.apache.org/jira/browse/LUCENE-4055
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: core/codecs
>Reporter: Andrzej Bialecki 
>Assignee: Robert Muir
> Fix For: 4.0
>
> Attachments: LUCENE-4055.patch
>
>
> After LUCENE-4050 is done the resulting SegmentInfo / FieldInfo classes 
> should be made abstract so that they can be extended by Codec-s.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4055) Refactor SegmentInfo / FieldInfo to make them extensible

2012-05-28 Thread Renaud Delbru (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13284544#comment-13284544
 ] 

Renaud Delbru commented on LUCENE-4055:
---

{quote}
What codec metadata?
{quote}

Metadata that indicates which codec is used for a particular field.

Let's say I want a specific TermsHashConsumerPerField depending on the codec 
used by a field. For example, for field A and field B, which use the Lucene40 
codec, we need to use the FreqProxTermsWriterPerField. And for field C, which 
uses my own specific codec, I need to use the MyOwnTermsWriterPerField.

My current understanding is that the only way to do this is to customise the 
IndexingChain with a new TermsHashConsumer that overrides the method 
TermsHashConsumer#addField(TermsHashPerField termsHashPerField, FieldInfo 
fieldInfo). This addField method will be able to instantiate the correct 
TermsHashConsumerPerField if and only if there is codec metadata in the 
FieldInfo parameter. That's why I am interested in using a customised FieldInfo 
to store codec-related metadata about a field.

Or is there a better way to get codec-related information about a field in the 
IndexingChain?
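
For concreteness, the kind of override I have in mind would look roughly like 
this (PostingsTermsWriter, MyOwnTermsWriterPerField and the two helper methods 
are hypothetical; only the addField hook is the existing API, and the other 
TermsHashConsumer methods are omitted):

{code}
class PostingsTermsWriter extends TermsHashConsumer {
  @Override
  TermsHashConsumerPerField addField(TermsHashPerField termsHashPerField, FieldInfo fieldInfo) {
    if (usesMyPostingsFormat(fieldInfo)) {
      // field C above: codec metadata on the FieldInfo selects our own per-field consumer
      return new MyOwnTermsWriterPerField(termsHashPerField, fieldInfo);
    }
    // fields A and B above: fall back to the stock FreqProxTermsWriterPerField
    return newFreqProxPerField(termsHashPerField, fieldInfo);
  }
}
{code}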


> Refactor SegmentInfo / FieldInfo to make them extensible
> 
>
> Key: LUCENE-4055
> URL: https://issues.apache.org/jira/browse/LUCENE-4055
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: core/codecs
>Reporter: Andrzej Bialecki 
>Assignee: Robert Muir
> Fix For: 4.0
>
> Attachments: LUCENE-4055.patch
>
>
> After LUCENE-4050 is done the resulting SegmentInfo / FieldInfo classes 
> should be made abstract so that they can be extended by Codec-s.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4055) Refactor SegmentInfo / FieldInfo to make them extensible

2012-05-28 Thread Renaud Delbru (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13284551#comment-13284551
 ] 

Renaud Delbru commented on LUCENE-4055:
---

Sorry, I meant PostingsFormat instead of Codec.

> Refactor SegmentInfo / FieldInfo to make them extensible
> 
>
> Key: LUCENE-4055
> URL: https://issues.apache.org/jira/browse/LUCENE-4055
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: core/codecs
>Reporter: Andrzej Bialecki 
>Assignee: Robert Muir
> Fix For: 4.0
>
> Attachments: LUCENE-4055.patch
>
>
> After LUCENE-4050 is done the resulting SegmentInfo / FieldInfo classes 
> should be made abstract so that they can be extended by Codec-s.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (LUCENE-4591) Make StoredFieldsFormat more configurable

2012-12-06 Thread Renaud Delbru (JIRA)
Renaud Delbru created LUCENE-4591:
-

 Summary: Make StoredFieldsFormat more configurable
 Key: LUCENE-4591
 URL: https://issues.apache.org/jira/browse/LUCENE-4591
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/codecs
Affects Versions: 4.1
Reporter: Renaud Delbru
 Fix For: 4.1


The current StoredFieldsFormat implementations are written with the assumption 
that only one StoredFieldsFormat is used by the index.
We would like to be able to configure a StoredFieldsFormat per field, similarly 
to the PostingsFormat.
There are a few issues that need to be solved to allow that:
1) allow configuring a segment suffix for the StoredFieldsFormat
2) implement the SPI interface in StoredFieldsFormat
3) create a PerFieldStoredFieldsFormat

We propose to start with 1), by modifying the signatures of 
StoredFieldsFormat#fieldsReader and StoredFieldsFormat#fieldsWriter so that 
they take SegmentReadState and SegmentWriteState instead of the current set of 
parameters.

Let us know what you think about this idea. If it is of interest, we can 
contribute a first patch for 1).
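
As a sketch of what the signature change in 1) amounts to (the current 4.x 
signatures are shown as comments for contrast; the exact patch may differ):

{code}
public abstract class StoredFieldsFormat {
  // Current signatures:
  //   public abstract StoredFieldsReader fieldsReader(Directory directory, SegmentInfo si,
  //       FieldInfos fn, IOContext context) throws IOException;
  //   public abstract StoredFieldsWriter fieldsWriter(Directory directory, SegmentInfo si,
  //       IOContext context) throws IOException;

  // Proposed: pass the read/write states so that implementations can see the
  // segment suffix and the other per-segment information.
  public abstract StoredFieldsReader fieldsReader(SegmentReadState state) throws IOException;
  public abstract StoredFieldsWriter fieldsWriter(SegmentWriteState state) throws IOException;
}
{code}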

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4591) Make StoredFieldsFormat more configurable

2012-12-06 Thread Renaud Delbru (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13511335#comment-13511335
 ] 

Renaud Delbru commented on LUCENE-4591:
---

Hi Adrien, yes, I understand the problem. While it is true that, in the extreme 
case, people could configure a different StoredFieldsFormat for each field, 
which would lead to a large increase in disk seeks, here we would like to use 
the CompressingStoredFieldsFormat for all the standard fields but have a 
different mechanism for specific fields.

We would like to store certain fields that require a different type of data 
structure than the one currently supported, i.e., a document is not a simple 
list of fields but a more complex data structure.

We could solve the problem by copying and modifying the current 
CompressingStoredFieldsWriter and CompressingStoredFieldsReader so that they 
can decide what type of encoding to use based on the field info. However, this 
is kind of hacky, and we would have to keep our copy in sync with the original 
implementation. The only way we could find is a per-field approach.

> Make StoredFieldsFormat more configurable
> -
>
> Key: LUCENE-4591
> URL: https://issues.apache.org/jira/browse/LUCENE-4591
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Affects Versions: 4.1
>Reporter: Renaud Delbru
> Fix For: 4.1
>
>
> The current StoredFieldsFormat implementations are written with the assumption 
> that only one StoredFieldsFormat is used by the index.
> We would like to be able to configure a StoredFieldsFormat per field, 
> similarly to the PostingsFormat.
> There are a few issues that need to be solved to allow that:
> 1) allow configuring a segment suffix for the StoredFieldsFormat
> 2) implement the SPI interface in StoredFieldsFormat
> 3) create a PerFieldStoredFieldsFormat
> We propose to start with 1), by modifying the signatures of 
> StoredFieldsFormat#fieldsReader and StoredFieldsFormat#fieldsWriter so that 
> they take SegmentReadState and SegmentWriteState instead of the current set 
> of parameters.
> Let us know what you think about this idea. If it is of interest, we can 
> contribute a first patch for 1).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4591) Make StoredFieldsFormat more configurable

2012-12-06 Thread Renaud Delbru (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13511425#comment-13511425
 ] 

Renaud Delbru commented on LUCENE-4591:
---

{quote}
How would you do that? As far as I know, the only way to access stored fields 
is through the StoredFieldVisitor API
{quote}

Yes, but we can pass our own implementation of StoredFieldVisitor with our own 
specific information about what to retrieve.

So, from Adrien's and Robert's feedback, it looks like you do not really want 
to see a per-field StoredFieldsFormat mechanism. That's fine and understandable.
Could we instead try to open/extend the current code base so that we are freer 
to extend it on our side? For example, opening CompressingStoredFieldsWriter 
and CompressingStoredFieldsReader, as well as making them configurable with a 
segment suffix? That would greatly help.

> Make StoredFieldsFormat more configurable
> -
>
> Key: LUCENE-4591
> URL: https://issues.apache.org/jira/browse/LUCENE-4591
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Affects Versions: 4.1
>Reporter: Renaud Delbru
> Fix For: 4.1
>
>
> The current StoredFieldsFormat implementations are written with the assumption 
> that only one StoredFieldsFormat is used by the index.
> We would like to be able to configure a StoredFieldsFormat per field, 
> similarly to the PostingsFormat.
> There are a few issues that need to be solved to allow that:
> 1) allow configuring a segment suffix for the StoredFieldsFormat
> 2) implement the SPI interface in StoredFieldsFormat
> 3) create a PerFieldStoredFieldsFormat
> We propose to start with 1), by modifying the signatures of 
> StoredFieldsFormat#fieldsReader and StoredFieldsFormat#fieldsWriter so that 
> they take SegmentReadState and SegmentWriteState instead of the current set 
> of parameters.
> Let us know what you think about this idea. If it is of interest, we can 
> contribute a first patch for 1).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4591) Make StoredFieldsFormat more configurable

2012-12-06 Thread Renaud Delbru (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13512043#comment-13512043
 ] 

Renaud Delbru commented on LUCENE-4591:
---

Given that properly supporting segment suffixes looks more difficult than 
expected (unless Robert comes up with a patch), the remaining solution for us, 
in order to avoid copying most of the compressing package, would be to open up 
CompressingStoredFieldsWriter and CompressingStoredFieldsReader, i.e., make 
them public and non-final (see patch). I am not sure that would be acceptable 
to you, but it would help expert-level extension.

> Make StoredFieldsFormat more configurable
> -
>
> Key: LUCENE-4591
> URL: https://issues.apache.org/jira/browse/LUCENE-4591
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Affects Versions: 4.1
>Reporter: Renaud Delbru
> Fix For: 4.1
>
>
> The current StoredFieldsFormat implementations are written with the assumption 
> that only one StoredFieldsFormat is used by the index.
> We would like to be able to configure a StoredFieldsFormat per field, 
> similarly to the PostingsFormat.
> There are a few issues that need to be solved to allow that:
> 1) allow configuring a segment suffix for the StoredFieldsFormat
> 2) implement the SPI interface in StoredFieldsFormat
> 3) create a PerFieldStoredFieldsFormat
> We propose to start with 1), by modifying the signatures of 
> StoredFieldsFormat#fieldsReader and StoredFieldsFormat#fieldsWriter so that 
> they take SegmentReadState and SegmentWriteState instead of the current set 
> of parameters.
> Let us know what you think about this idea. If it is of interest, we can 
> contribute a first patch for 1).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-4591) Make StoredFieldsFormat more configurable

2012-12-06 Thread Renaud Delbru (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Renaud Delbru updated LUCENE-4591:
--

Attachment: LUCENE-4591.patch

> Make StoredFieldsFormat more configurable
> -
>
> Key: LUCENE-4591
> URL: https://issues.apache.org/jira/browse/LUCENE-4591
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Affects Versions: 4.1
>Reporter: Renaud Delbru
> Fix For: 4.1
>
> Attachments: LUCENE-4591.patch
>
>
> The current StoredFieldsFormat implementations are written with the assumption 
> that only one StoredFieldsFormat is used by the index.
> We would like to be able to configure a StoredFieldsFormat per field, 
> similarly to the PostingsFormat.
> There are a few issues that need to be solved to allow that:
> 1) allow configuring a segment suffix for the StoredFieldsFormat
> 2) implement the SPI interface in StoredFieldsFormat
> 3) create a PerFieldStoredFieldsFormat
> We propose to start with 1), by modifying the signatures of 
> StoredFieldsFormat#fieldsReader and StoredFieldsFormat#fieldsWriter so that 
> they take SegmentReadState and SegmentWriteState instead of the current set 
> of parameters.
> Let us know what you think about this idea. If it is of interest, we can 
> contribute a first patch for 1).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4591) Make StoredFieldsFormat more configurable

2012-12-07 Thread Renaud Delbru (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13526575#comment-13526575
 ] 

Renaud Delbru commented on LUCENE-4591:
---

To be able to subclass and extend them. If that is not acceptable to you, then 
the solution left to us is to copy-paste the original code and extend it on our 
side (and close this issue).

> Make StoredFieldsFormat more configurable
> -
>
> Key: LUCENE-4591
> URL: https://issues.apache.org/jira/browse/LUCENE-4591
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Affects Versions: 4.1
>Reporter: Renaud Delbru
> Fix For: 4.1
>
> Attachments: LUCENE-4591.patch
>
>
> The current StoredFieldsFormat implementations are written with the assumption 
> that only one StoredFieldsFormat is used by the index.
> We would like to be able to configure a StoredFieldsFormat per field, 
> similarly to the PostingsFormat.
> There are a few issues that need to be solved to allow that:
> 1) allow configuring a segment suffix for the StoredFieldsFormat
> 2) implement the SPI interface in StoredFieldsFormat
> 3) create a PerFieldStoredFieldsFormat
> We propose to start with 1), by modifying the signatures of 
> StoredFieldsFormat#fieldsReader and StoredFieldsFormat#fieldsWriter so that 
> they take SegmentReadState and SegmentWriteState instead of the current set 
> of parameters.
> Let us know what you think about this idea. If it is of interest, we can 
> contribute a first patch for 1).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4591) Make StoredFieldsFormat more configurable

2012-12-08 Thread Renaud Delbru (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13527261#comment-13527261
 ] 

Renaud Delbru commented on LUCENE-4591:
---

We followed a similar approach (see attached files: PerFieldStoredFieldsFormat, 
PerFieldStoredFieldsWriter and PerFieldStoredFieldsReader).
The issue is that the secondary StoredFieldsReader/Writer we are using is, for 
the moment, a wrapper around an instance of CompressingStoredFieldsReader/Writer 
(using a wrapper was another way to extend CompressingStoredFieldsReader/Writer). 
The wrapper implements our encoding logic and uses the underlying 
CompressingStoredFieldsWriter to write our data as a binary block. The problem 
with this approach is that, since we cannot configure the segment suffix of the 
CompressingStoredFieldsWriter, the two StoredFieldsFormats try to write to 
files that have identical names.
Since we are using CompressingStoredFieldsReader/Writer as the underlying 
mechanism to write the stored fields, why not use just one instance to store 
both the default Lucene fields and our specific fields? The reasons are that it 
was simpler for our first implementation to leverage 
CompressingStoredFieldsReader/Writer (as a temporary solution), and that we 
would like to keep things (code and segment files) more isolated from each 
other.
As said previously, we could simply copy-paste the compressing codec on our 
side to solve the problem, but I thought that by raising the issue we might 
find a more appropriate solution.
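
To show where the file-name clash comes from (illustrative values; stored-fields 
file names are derived from the segment name plus an optional segment suffix):

{code}
String fdt = IndexFileNames.segmentFileName("_0", "", "fdt");           // "_0.fdt"
String fdx = IndexFileNames.segmentFileName("_0", "", "fdx");           // "_0.fdx"
// Both the default CompressingStoredFieldsWriter and our wrapper currently use an
// empty suffix, so they both try to create "_0.fdt"/"_0.fdx" in the same segment.
// With a per-format suffix the clash disappears:
String ours = IndexFileNames.segmentFileName("_0", "MyFormat", "fdt");  // "_0_MyFormat.fdt"
{code}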


> Make StoredFieldsFormat more configurable
> -
>
> Key: LUCENE-4591
> URL: https://issues.apache.org/jira/browse/LUCENE-4591
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Affects Versions: 4.1
>Reporter: Renaud Delbru
> Fix For: 4.1
>
> Attachments: LUCENE-4591.patch
>
>
> The current StoredFieldsFormat implementations are written with the assumption 
> that only one StoredFieldsFormat is used by the index.
> We would like to be able to configure a StoredFieldsFormat per field, 
> similarly to the PostingsFormat.
> There are a few issues that need to be solved to allow that:
> 1) allow configuring a segment suffix for the StoredFieldsFormat
> 2) implement the SPI interface in StoredFieldsFormat
> 3) create a PerFieldStoredFieldsFormat
> We propose to start with 1), by modifying the signatures of 
> StoredFieldsFormat#fieldsReader and StoredFieldsFormat#fieldsWriter so that 
> they take SegmentReadState and SegmentWriteState instead of the current set 
> of parameters.
> Let us know what you think about this idea. If it is of interest, we can 
> contribute a first patch for 1).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-4591) Make StoredFieldsFormat more configurable

2012-12-08 Thread Renaud Delbru (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Renaud Delbru updated LUCENE-4591:
--

Attachment: PerFieldStoredFieldsWriter.java
PerFieldStoredFieldsReader.java
PerFieldStoredFieldsFormat.java

> Make StoredFieldsFormat more configurable
> -
>
> Key: LUCENE-4591
> URL: https://issues.apache.org/jira/browse/LUCENE-4591
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Affects Versions: 4.1
>Reporter: Renaud Delbru
> Fix For: 4.1
>
> Attachments: LUCENE-4591.patch, PerFieldStoredFieldsFormat.java, 
> PerFieldStoredFieldsReader.java, PerFieldStoredFieldsWriter.java
>
>
> The current StoredFieldsFormat implementations are written with the assumption 
> that only one StoredFieldsFormat is used by the index.
> We would like to be able to configure a StoredFieldsFormat per field, 
> similarly to the PostingsFormat.
> There are a few issues that need to be solved to allow that:
> 1) allow configuring a segment suffix for the StoredFieldsFormat
> 2) implement the SPI interface in StoredFieldsFormat
> 3) create a PerFieldStoredFieldsFormat
> We propose to start with 1), by modifying the signatures of 
> StoredFieldsFormat#fieldsReader and StoredFieldsFormat#fieldsWriter so that 
> they take SegmentReadState and SegmentWriteState instead of the current set 
> of parameters.
> Let us know what you think about this idea. If it is of interest, we can 
> contribute a first patch for 1).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-4591) Make StoredFieldsFormat more configurable

2012-12-09 Thread Renaud Delbru (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Renaud Delbru updated LUCENE-4591:
--

Attachment: LUCENE-4591.patch

Another solution would be to extend the constructors of 
CompressingStoredFieldsWriter/Reader to accept a segment suffix as a parameter. 
See the new patch attached. This might also pave the way for Robert's work on 
making codec components respect the segment suffix.
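
Roughly what the extended constructor would look like (illustrative only, not 
the actual patch; the parameter order and the rest of the constructor body are 
assumptions):

{code}
public CompressingStoredFieldsWriter(Directory directory, SegmentInfo si, String segmentSuffix,
    IOContext context, String formatName, CompressionMode compressionMode, int chunkSize)
    throws IOException {
  // the suffix would be threaded into the file names, e.g. "_0_MyFormat.fdt"
  // instead of "_0.fdt" when segmentSuffix = "MyFormat":
  String fieldsFile = IndexFileNames.segmentFileName(si.name, segmentSuffix, "fdt");
  // ... rest of the existing constructor unchanged ...
}
{code}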

> Make StoredFieldsFormat more configurable
> -
>
> Key: LUCENE-4591
> URL: https://issues.apache.org/jira/browse/LUCENE-4591
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Affects Versions: 4.1
>Reporter: Renaud Delbru
> Fix For: 4.1
>
> Attachments: LUCENE-4591.patch, LUCENE-4591.patch, 
> PerFieldStoredFieldsFormat.java, PerFieldStoredFieldsReader.java, 
> PerFieldStoredFieldsWriter.java
>
>
> The current StoredFieldsFormat implementations are written with the assumption 
> that only one StoredFieldsFormat is used by the index.
> We would like to be able to configure a StoredFieldsFormat per field, 
> similarly to the PostingsFormat.
> There are a few issues that need to be solved to allow that:
> 1) allow configuring a segment suffix for the StoredFieldsFormat
> 2) implement the SPI interface in StoredFieldsFormat
> 3) create a PerFieldStoredFieldsFormat
> We propose to start with 1), by modifying the signatures of 
> StoredFieldsFormat#fieldsReader and StoredFieldsFormat#fieldsWriter so that 
> they take SegmentReadState and SegmentWriteState instead of the current set 
> of parameters.
> Let us know what you think about this idea. If it is of interest, we can 
> contribute a first patch for 1).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4591) Make StoredFieldsFormat more configurable

2012-12-10 Thread Renaud Delbru (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13527894#comment-13527894
 ] 

Renaud Delbru commented on LUCENE-4591:
---

That is fine with me. Thanks for your help, Adrien.

> Make StoredFieldsFormat more configurable
> -
>
> Key: LUCENE-4591
> URL: https://issues.apache.org/jira/browse/LUCENE-4591
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Affects Versions: 4.1
>Reporter: Renaud Delbru
> Fix For: 4.1
>
> Attachments: LUCENE-4591.patch, LUCENE-4591.patch, LUCENE-4591.patch, 
> PerFieldStoredFieldsFormat.java, PerFieldStoredFieldsReader.java, 
> PerFieldStoredFieldsWriter.java
>
>
> The current StoredFieldsFormat implementations are written with the assumption 
> that only one StoredFieldsFormat is used by the index.
> We would like to be able to configure a StoredFieldsFormat per field, 
> similarly to the PostingsFormat.
> There are a few issues that need to be solved to allow that:
> 1) allow configuring a segment suffix for the StoredFieldsFormat
> 2) implement the SPI interface in StoredFieldsFormat
> 3) create a PerFieldStoredFieldsFormat
> We propose to start with 1), by modifying the signatures of 
> StoredFieldsFormat#fieldsReader and StoredFieldsFormat#fieldsWriter so that 
> they take SegmentReadState and SegmentWriteState instead of the current set 
> of parameters.
> Let us know what you think about this idea. If it is of interest, we can 
> contribute a first patch for 1).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (LUCENE-4613) CompressingStoredFieldsWriter ignores the segment suffix if writing aborted

2012-12-10 Thread Renaud Delbru (JIRA)
Renaud Delbru created LUCENE-4613:
-

 Summary: CompressingStoredFieldsWriter ignores the segment suffix 
if writing aborted
 Key: LUCENE-4613
 URL: https://issues.apache.org/jira/browse/LUCENE-4613
 Project: Lucene - Core
  Issue Type: Bug
  Components: core/codecs
Affects Versions: 4.1
Reporter: Renaud Delbru
 Fix For: 4.1


If the writing is aborted, CompressingStoredFieldsWriter does not remove 
partially-written files as the segment suffix is not taken into consideration.
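
In code terms, the faulty pattern is roughly the following (a sketch, not the 
actual source):

{code}
@Override
public void abort() {
  IOUtils.closeWhileHandlingException(this);
  // bug: the names are built with an empty suffix, so the suffixed
  // partially-written files are never matched and are left behind
  IOUtils.deleteFilesIgnoringExceptions(directory,
      IndexFileNames.segmentFileName(segment, "", "fdt"),
      IndexFileNames.segmentFileName(segment, "", "fdx"));
  // fix: build the names with the configured segmentSuffix instead of ""
}
{code}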

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-4613) CompressingStoredFieldsWriter ignores the segment suffix if writing aborted

2012-12-10 Thread Renaud Delbru (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Renaud Delbru updated LUCENE-4613:
--

Attachment: LUCENE-4613.patch

Fix bug introduced by LUCENE-4591

> CompressingStoredFieldsWriter ignores the segment suffix if writing aborted
> ---
>
> Key: LUCENE-4613
> URL: https://issues.apache.org/jira/browse/LUCENE-4613
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/codecs
>Affects Versions: 4.1
>Reporter: Renaud Delbru
> Fix For: 4.1
>
> Attachments: LUCENE-4613.patch
>
>
> If the writing is aborted, CompressingStoredFieldsWriter does not remove 
> partially-written files as the segment suffix is not taken into consideration.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4613) CompressingStoredFieldsWriter ignores the segment suffix if writing aborted

2012-12-11 Thread Renaud Delbru (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13528847#comment-13528847
 ] 

Renaud Delbru commented on LUCENE-4613:
---

Ok, I'll upload something today.

> CompressingStoredFieldsWriter ignores the segment suffix if writing aborted
> ---
>
> Key: LUCENE-4613
> URL: https://issues.apache.org/jira/browse/LUCENE-4613
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/codecs
>Affects Versions: 4.1
>Reporter: Renaud Delbru
> Fix For: 4.1
>
> Attachments: LUCENE-4613.patch
>
>
> If the writing is aborted, CompressingStoredFieldsWriter does not remove 
> partially-written files as the segment suffix is not taken into consideration.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-4613) CompressingStoredFieldsWriter ignores the segment suffix if writing aborted

2012-12-11 Thread Renaud Delbru (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Renaud Delbru updated LUCENE-4613:
--

Attachment: LUCENE-4613.patch

New patch with a unit test that checks that partially written files are removed 
if writing aborts.
I had to modify the API of CompressingStoredFieldsFormat a bit to make the test 
possible. Also, CompressingCodec now always adds a segment suffix. We might be 
able to improve this by randomly adding a segment suffix or not.
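
The core of the check is roughly the following (sketch only; codec and 
segmentInfo stand for the CompressingCodec instance and a SegmentInfo set up by 
the test, and the surrounding test scaffolding is omitted):

{code}
Directory dir = newDirectory();
StoredFieldsWriter w = codec.storedFieldsFormat()
    .fieldsWriter(dir, segmentInfo, IOContext.DEFAULT);
w.abort();                               // simulate an aborted write
assertEquals(0, dir.listAll().length);   // no leftover .fdt/.fdx files expected
dir.close();
{code}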

> CompressingStoredFieldsWriter ignores the segment suffix if writing aborted
> ---
>
> Key: LUCENE-4613
> URL: https://issues.apache.org/jira/browse/LUCENE-4613
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/codecs
>Affects Versions: 4.1
>Reporter: Renaud Delbru
> Fix For: 4.1
>
> Attachments: LUCENE-4613.patch, LUCENE-4613.patch
>
>
> If the writing is aborted, CompressingStoredFieldsWriter does not remove 
> partially-written files as the segment suffix is not taken into consideration.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Comment Edited] (LUCENE-4613) CompressingStoredFieldsWriter ignores the segment suffix if writing aborted

2012-12-11 Thread Renaud Delbru (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13528896#comment-13528896
 ] 

Renaud Delbru edited comment on LUCENE-4613 at 12/11/12 11:47 AM:
--

New patch with a unit test that checks that partially written files are removed 
if writing aborts.
I had to modify the API of CompressingStoredFieldsFormat a bit to make the test 
possible. Also, CompressingCodec now always adds a segment suffix. We might be 
able to improve this by randomly adding a segment suffix or not.
The ant tests have been executed and did not report any errors.

  was (Author: renaud.delbru):
New patch with a unit test that checks that partially written files are removed 
if writing aborts.
I had to modify the API of CompressingStoredFieldsFormat a bit to make the test 
possible. Also, CompressingCodec now always adds a segment suffix. We might be 
able to improve this by randomly adding a segment suffix or not.
  
> CompressingStoredFieldsWriter ignores the segment suffix if writing aborted
> ---
>
> Key: LUCENE-4613
> URL: https://issues.apache.org/jira/browse/LUCENE-4613
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/codecs
>Affects Versions: 4.1
>Reporter: Renaud Delbru
> Fix For: 4.1
>
> Attachments: LUCENE-4613.patch, LUCENE-4613.patch
>
>
> If the writing is aborted, CompressingStoredFieldsWriter does not remove 
> partially-written files as the segment suffix is not taken into consideration.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4613) CompressingStoredFieldsWriter ignores the segment suffix if writing aborted

2012-12-11 Thread Renaud Delbru (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13529065#comment-13529065
 ] 

Renaud Delbru commented on LUCENE-4613:
---

{quote}
Yes I think it'd be even better. You can use 
_TestUtil#randomSimpleString(Random, int maxLength).
{quote}

In fact, using random segment suffix is very difficult to achieve, as we cannot 
ensure that the codec instantiated through SPI will get the same segment suffix 
as the codec instantiated in the {{TestCompressingStoredFieldsFormat#setUp()}, 
and therefore the {{_TestUtil#checkIndex(Directory dir)}} will fail. The 
segment suffix must be deterministic, e.g., similar to the current solution 
using the formatName as segment suffix.
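
To make the SPI point concrete (sketch only; the random suffix value is 
illustrative):

{code}
// The segment only records the codec *name*; CheckIndex re-creates the codec
// through SPI, so it gets a fresh instance whose segment suffix cannot be made
// to match a suffix that was chosen randomly at write time.
Codec writeTime = CompressingCodec.randomInstance(random);  // e.g. suffix "a3xk"
Codec readTime  = Codec.forName(writeTime.getName());       // SPI instance, different suffix
{code}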

> CompressingStoredFieldsWriter ignores the segment suffix if writing aborted
> ---
>
> Key: LUCENE-4613
> URL: https://issues.apache.org/jira/browse/LUCENE-4613
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/codecs
>Affects Versions: 4.1
>Reporter: Renaud Delbru
> Fix For: 4.1
>
> Attachments: LUCENE-4613.patch, LUCENE-4613.patch
>
>
> If the writing is aborted, CompressingStoredFieldsWriter does not remove 
> partially-written files as the segment suffix is not taken into consideration.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4613) CompressingStoredFieldsWriter ignores the segment suffix if writing aborted

2012-12-11 Thread Renaud Delbru (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13529163#comment-13529163
 ] 

Renaud Delbru commented on LUCENE-4613:
---

Maybe one solution would be to not change the behavior of 
{{CompressingCodec#randomInstance(Random random)}}, and introduce a new method 
{{CompressingCodec#randomInstance(Random random, boolean randomSegmentSuffix)}} 
that is used in {{TestCompressingStoredFieldsFormat#setUp()}}. This will keep 
backward compatibility with the other unit tests.
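
Roughly what I mean (createInstance(...) is a hypothetical stand-in for the 
existing random selection of compression mode and chunk size):

{code}
public static CompressingCodec randomInstance(Random random) {
  // existing entry point keeps its behaviour: deterministic (empty) segment suffix
  return randomInstance(random, false);
}

public static CompressingCodec randomInstance(Random random, boolean randomSegmentSuffix) {
  String suffix = randomSegmentSuffix ? _TestUtil.randomSimpleString(random, 10) : "";
  return createInstance(random, suffix);
}
{code}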

> CompressingStoredFieldsWriter ignores the segment suffix if writing aborted
> ---
>
> Key: LUCENE-4613
> URL: https://issues.apache.org/jira/browse/LUCENE-4613
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/codecs
>Affects Versions: 4.1
>Reporter: Renaud Delbru
> Fix For: 4.1
>
> Attachments: LUCENE-4613.patch, LUCENE-4613.patch
>
>
> If the writing is aborted, CompressingStoredFieldsWriter does not remove 
> partially-written files as the segment suffix is not taken into consideration.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Comment Edited] (LUCENE-4613) CompressingStoredFieldsWriter ignores the segment suffix if writing aborted

2012-12-11 Thread Renaud Delbru (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13529065#comment-13529065
 ] 

Renaud Delbru edited comment on LUCENE-4613 at 12/11/12 5:52 PM:
-

{quote}
Yes I think it'd be even better. You can use 
_TestUtil#randomSimpleString(Random, int maxLength).
{quote}

In fact, using random segment suffix is very difficult to achieve, as we cannot 
ensure that the codec instantiated through SPI will get the same segment suffix 
as the codec instantiated in the {{TestCompressingStoredFieldsFormat#setUp()}}, 
and therefore the {{_TestUtil#checkIndex(Directory dir)}} will fail. The 
segment suffix must be deterministic, e.g., similar to the current solution 
using the formatName as segment suffix.

  was (Author: renaud.delbru):
{quote}
Yes I think it'd be even better. You can use 
_TestUtil#randomSimpleString(Random, int maxLength).
{quote}

In fact, using random segment suffix is very difficult to achieve, as we cannot 
ensure that the codec instantiated through SPI will get the same segment suffix 
as the codec instantiated in the {{TestCompressingStoredFieldsFormat#setUp()}, 
and therefore the {{_TestUtil#checkIndex(Directory dir)}} will fail. The 
segment suffix must be deterministic, e.g., similar to the current solution 
using the formatName as segment suffix.
  
> CompressingStoredFieldsWriter ignores the segment suffix if writing aborted
> ---
>
> Key: LUCENE-4613
> URL: https://issues.apache.org/jira/browse/LUCENE-4613
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/codecs
>Affects Versions: 4.1
>Reporter: Renaud Delbru
> Fix For: 4.1
>
> Attachments: LUCENE-4613.patch, LUCENE-4613.patch
>
>
> If the writing is aborted, CompressingStoredFieldsWriter does not remove 
> partially-written files as the segment suffix is not taken into consideration.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-4613) CompressingStoredFieldsWriter ignores the segment suffix if writing aborted

2012-12-12 Thread Renaud Delbru (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Renaud Delbru updated LUCENE-4613:
--

Attachment: LUCENE-4613.patch

A first refactoring that tries to keep backward compatibility of 
{{CompressingCodec#randomInstance(Random random)}}. Let me know if this is good 
enough. The tests pass, as well as the specific TestIndexFileDeleter test case 
you previously reported.

> CompressingStoredFieldsWriter ignores the segment suffix if writing aborted
> ---
>
> Key: LUCENE-4613
> URL: https://issues.apache.org/jira/browse/LUCENE-4613
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/codecs
>Affects Versions: 4.1
>Reporter: Renaud Delbru
> Fix For: 4.1
>
> Attachments: LUCENE-4613.patch, LUCENE-4613.patch, LUCENE-4613.patch
>
>
> If the writing is aborted, CompressingStoredFieldsWriter does not remove 
> partially-written files as the segment suffix is not taken into consideration.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (LUCENE-4642) TokenizerFactory should provide a create method with a given AttributeSource

2012-12-20 Thread Renaud Delbru (JIRA)
Renaud Delbru created LUCENE-4642:
-

 Summary: TokenizerFactory should provide a create method with a 
given AttributeSource
 Key: LUCENE-4642
 URL: https://issues.apache.org/jira/browse/LUCENE-4642
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/analysis
Affects Versions: 4.1
Reporter: Renaud Delbru
 Fix For: 4.1


All tokenizer implementations have a constructor that takes a given 
AttributeSource as a parameter (LUCENE-1826). However, TokenizerFactory does 
not provide an API to create tokenizers with a given AttributeSource.

Side note: There are still a lot of tokenizers that do not provide constructors 
that take AttributeSource and AttributeFactory.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-4642) TokenizerFactory should provide a create method with a given AttributeSource

2012-12-20 Thread Renaud Delbru (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Renaud Delbru updated LUCENE-4642:
--

Attachment: LUCENE-4642.patch

Patch adding #create(AttributeSource source, Reader reader) to the 
TokenizerFactory class and to all its subclasses.

Given that a lot of tokenizers do not have constructors that take a given 
AttributeSource, I have implemented the new create method in their respective 
factories to throw an UnsupportedOperationException.
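
Roughly, the pattern looks like this (illustrative, not the exact patch; 
factories whose tokenizer has no AttributeSource constructor would instead 
throw UnsupportedOperationException from the new method):

{code}
public abstract class TokenizerFactory extends AbstractAnalysisFactory {
  /** Existing factory method. */
  public abstract Tokenizer create(Reader input);

  /** Proposed: create the Tokenizer on top of a caller-supplied AttributeSource. */
  public abstract Tokenizer create(AttributeSource source, Reader input);
}

// Example for a tokenizer that already has the matching constructor:
public class WhitespaceTokenizerFactory extends TokenizerFactory {
  @Override
  public Tokenizer create(Reader input) {
    return new WhitespaceTokenizer(luceneMatchVersion, input);
  }

  @Override
  public Tokenizer create(AttributeSource source, Reader input) {
    return new WhitespaceTokenizer(luceneMatchVersion, source, input);
  }
}
{code}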

> TokenizerFactory should provide a create method with a given AttributeSource
> 
>
> Key: LUCENE-4642
> URL: https://issues.apache.org/jira/browse/LUCENE-4642
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Affects Versions: 4.1
>Reporter: Renaud Delbru
>  Labels: analysis, attribute, tokenizer
> Fix For: 4.1
>
> Attachments: LUCENE-4642.patch
>
>
> All tokenizer implementations have a constructor that takes a given 
> AttributeSource as parameter (LUCENE-1826). However, the TokenizerFactory 
> does not provide an API to create tokenizers with a given AttributeSource.
> Side note: There are still a lot of tokenizers that do not provide 
> constructors that take AttributeSource and AttributeFactory.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4642) TokenizerFactory should provide a create method with a given AttributeSource

2013-01-02 Thread Renaud Delbru (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13542144#comment-13542144
 ] 

Renaud Delbru commented on LUCENE-4642:
---

Hi,

Any plans to commit this patch? Or is there additional work to do beforehand?

Thanks

> TokenizerFactory should provide a create method with a given AttributeSource
> 
>
> Key: LUCENE-4642
> URL: https://issues.apache.org/jira/browse/LUCENE-4642
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Affects Versions: 4.1
>Reporter: Renaud Delbru
>  Labels: analysis, attribute, tokenizer
> Fix For: 4.2, 5.0
>
> Attachments: LUCENE-4642.patch
>
>
> All tokenizer implementations have a constructor that takes a given 
> AttributeSource as parameter (LUCENE-1826). However, the TokenizerFactory 
> does not provide an API to create tokenizers with a given AttributeSource.
> Side note: There are still a lot of tokenizers that do not provide 
> constructors that take AttributeSource and AttributeFactory.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4642) TokenizerFactory should provide a create method with a given AttributeSource

2013-01-16 Thread Renaud Delbru (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13555146#comment-13555146
 ] 

Renaud Delbru commented on LUCENE-4642:
---

Could someone from the team tell us whether this patch may be considered for 
inclusion at some point? We currently need it in our project, so this is 
blocking our development. Thanks.

> TokenizerFactory should provide a create method with a given AttributeSource
> 
>
> Key: LUCENE-4642
> URL: https://issues.apache.org/jira/browse/LUCENE-4642
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Affects Versions: 4.1
>Reporter: Renaud Delbru
>  Labels: analysis, attribute, tokenizer
> Fix For: 4.2, 5.0
>
> Attachments: LUCENE-4642.patch
>
>
> All tokenizer implementations have a constructor that takes a given 
> AttributeSource as parameter (LUCENE-1826). However, the TokenizerFactory 
> does not provide an API to create tokenizers with a given AttributeSource.
> Side note: There are still a lot of tokenizers that do not provide 
> constructors that take AttributeSource and AttributeFactory.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Created: (SOLR-2146) Custom SchemaField object

2010-10-09 Thread Renaud Delbru (JIRA)
Custom SchemaField object
-

 Key: SOLR-2146
 URL: https://issues.apache.org/jira/browse/SOLR-2146
 Project: Solr
  Issue Type: New Feature
  Components: Schema and Analysis
Reporter: Renaud Delbru
Priority: Minor
 Fix For: 3.1


There are some use cases that require extending SchemaField objects with 
"attributes" or "properties".
For example, I would like to be able to assign a specific "term mapping file" 
to each of my fields. Each field name would have an associated "mapping file" 
that I can access at query time through the IndexSchema object.

The FieldType object already enables the addition of attributes. However, these 
attributes are "local" to a field type, not to a field definition. Multiple 
fields can share the same field type, which is not suitable for our use cases. 
One possible solution would be to create one field type per field definition, 
but this is more of a dirty hack: it means duplicating field types, making them 
more difficult to maintain.

References to mailing list discussion:
http://www.mail-archive.com/solr-u...@lucene.apache.org/msg40436.html
http://www.mail-archive.com/solr-u...@lucene.apache.org/msg40585.html
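
To illustrate the kind of usage we have in mind (hypothetical API: nothing like
a per-field getProperty exists on SchemaField today, this only sketches the
feature request):

{code:java}
import org.apache.solr.schema.IndexSchema;
import org.apache.solr.schema.SchemaField;

class TermMappingLookup {
  // At query time, look up the "term mapping file" declared for a field.
  String mappingFileFor(IndexSchema schema, String fieldName) {
    SchemaField field = schema.getField(fieldName);   // existing API
    return field.getProperty("termMappingFile");      // hypothetical API
  }
}
{code}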

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Created: (LUCENE-2886) Adaptive Frame Of Reference

2011-01-24 Thread Renaud Delbru (JIRA)
Adaptive Frame Of Reference 


 Key: LUCENE-2886
 URL: https://issues.apache.org/jira/browse/LUCENE-2886
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Codecs
Reporter: Renaud Delbru
 Fix For: 4.0


We could test the implementation of the Adaptive Frame Of Reference [1] on the 
lucene-4.0 branch.
I am providing the source code of its implementation. Some work needs to be 
done, as this implementation works against the old lucene-1458 branch. 
I will attach a tarball containing a running version (with tests) of the AFOR 
implementation, as well as the implementations of PFOR and of Simple64 (a 
Simple-family codec working on 64-bit words) that have been used in the 
experiments in [1].

[1] http://www.deri.ie/fileadmin/documents/deri-tr-afor.pdf

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2886) Adaptive Frame Of Reference

2011-01-24 Thread Renaud Delbru (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Renaud Delbru updated LUCENE-2886:
--

Attachment: lucene-afor.tar.gz

tarball containing a maven project with source code and unit tests for:
- AFOR1
- AFOR2
- FOR
- PFOR Non Compulsive
- Simple64
- a basic tool for debugging IntBlock codecs.

It also includes the lucene-1458 snapshot dependencies that are necessary to 
compile the code and run the tests.

> Adaptive Frame Of Reference 
> 
>
> Key: LUCENE-2886
> URL: https://issues.apache.org/jira/browse/LUCENE-2886
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Codecs
>Reporter: Renaud Delbru
> Fix For: 4.0
>
> Attachments: lucene-afor.tar.gz
>
>
> We could test the implementation of the Adaptive Frame Of Reference [1] on 
> the lucene-4.0 branch.
> I am providing the source code of its implementation. Some work needs to be 
> done, as this implementation is working on the old lucene-1458 branch. 
> I will attach a tarball containing a running version (with tests) of the AFOR 
> implementation, as well as the implementations of PFOR and of Simple64 
> (simple family codec working on 64bits word) that has been used in the 
> experiments in [1].
> [1] http://www.deri.ie/fileadmin/documents/deri-tr-afor.pdf

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2886) Adaptive Frame Of Reference

2011-02-04 Thread Renaud Delbru (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12990509#comment-12990509
 ] 

Renaud Delbru commented on LUCENE-2886:
---

Hi Michael, Robert,
great to hear that the code is useful; looking forward to seeing some benchmarks.
I think the VarIntBlock approach is a good idea. Concerning the two unused 
"frame" codes, it will not cost too much to add them. This might be useful for 
the frequency inverted lists. However, I am not sure they will be used that 
much. In our experiments, we had a version of AFOR allowing frames of 8, 16 and 
32 integers with allOnes and allZeros. The gain was very minimal, in the order 
of a 0.x% index size reduction, because these cases occurred very rarely. But 
this is still better than nothing. However, in the case of Simple64, we are not 
talking about small frames (up to 32 integers), but frames of 120 to 240 
integers. Therefore, I expect to see a drop in the probability of encountering 
120 or 240 consecutive ones. Maybe we can use them for more clever 
configurations such as
- interleaved sequences of 1-bit and 2-bit integers
- interleaved sequences of 2-bit and 3-bit integers
or something like this.
The best approach would be to run some tests to see which new configurations 
make sense, like how many times an allOnes config is selected, or other 
configs, and choose which ones to add.
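
As a side note, checking whether a frame qualifies for such an allOnes
configuration is cheap; a small illustration (not taken from the attached code):

{code:java}
// Illustration only: a frame of deltas can use a dedicated "allOnes"
// configuration when every value is 1, so no payload bits are needed.
final class FrameConfigSketch {
  static boolean isAllOnes(int[] deltas, int offset, int frameSize) {
    for (int i = offset; i < offset + frameSize; i++) {
      if (deltas[i] != 1) {
        return false;
      }
    }
    return true;
  }
}
{code}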

> Adaptive Frame Of Reference 
> 
>
> Key: LUCENE-2886
> URL: https://issues.apache.org/jira/browse/LUCENE-2886
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Codecs
>Reporter: Renaud Delbru
> Fix For: 4.0
>
> Attachments: LUCENE-2886_simple64.patch, 
> LUCENE-2886_simple64_varint.patch, lucene-afor.tar.gz
>
>
> We could test the implementation of the Adaptive Frame Of Reference [1] on 
> the lucene-4.0 branch.
> I am providing the source code of its implementation. Some work needs to be 
> done, as this implementation is working on the old lucene-1458 branch. 
> I will attach a tarball containing a running version (with tests) of the AFOR 
> implementation, as well as the implementations of PFOR and of Simple64 
> (simple family codec working on 64bits word) that has been used in the 
> experiments in [1].
> [1] http://www.deri.ie/fileadmin/documents/deri-tr-afor.pdf

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Issue Comment Edited: (LUCENE-2886) Adaptive Frame Of Reference

2011-02-04 Thread Renaud Delbru (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12990509#comment-12990509
 ] 

Renaud Delbru edited comment on LUCENE-2886 at 2/4/11 10:42 AM:


Hi Michael, Robert,
great to hear that the code is useful; looking forward to seeing some benchmarks.
I think the VarIntBlock approach is a good idea. Concerning the two unused 
"frame" codes, it will not cost too much to add them. This might be useful for 
the frequency inverted lists. However, I am not sure they will be used that 
much. In our experiments, we had a version of AFOR allowing frames of 8, 16 and 
32 integers with allOnes and allZeros. The gain was very minimal, in the order 
of a 0.x% index size reduction, because these cases occurred very rarely. But 
this is still better than nothing. However, in the case of Simple64, we are not 
talking about small frames (up to 32 integers), but frames of 120 to 240 
integers. Therefore, I expect to see a drop in the probability of encountering 
120 or 240 consecutive ones. Maybe we can use them for more clever 
configurations such as
- interleaved sequences of 1-bit and 2-bit integers
- interleaved sequences of 2-bit and 3-bit integers
or something like this.
The best approach would be to run some tests to see which new configurations 
make sense, like how many times an allOnes config is selected, or other 
configs, and choose which ones to add. But this can be a tedious task with only 
a limited benefit.

  was (Author: renaud.delbru):
Hi Michael, Robert,
great to hear that the code is useful; looking forward to seeing some benchmarks.
I think the VarIntBlock approach is a good idea. Concerning the two unused 
"frame" codes, it will not cost too much to add them. This might be useful for 
the frequency inverted lists. However, I am not sure they will be used that 
much. In our experiments, we had a version of AFOR allowing frames of 8, 16 and 
32 integers with allOnes and allZeros. The gain was very minimal, in the order 
of a 0.x% index size reduction, because these cases occurred very rarely. But 
this is still better than nothing. However, in the case of Simple64, we are not 
talking about small frames (up to 32 integers), but frames of 120 to 240 
integers. Therefore, I expect to see a drop in the probability of encountering 
120 or 240 consecutive ones. Maybe we can use them for more clever 
configurations such as
- interleaved sequences of 1-bit and 2-bit integers
- interleaved sequences of 2-bit and 3-bit integers
or something like this.
The best approach would be to run some tests to see which new configurations 
make sense, like how many times an allOnes config is selected, or other 
configs, and choose which ones to add.
  
> Adaptive Frame Of Reference 
> 
>
> Key: LUCENE-2886
> URL: https://issues.apache.org/jira/browse/LUCENE-2886
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Codecs
>Reporter: Renaud Delbru
> Fix For: 4.0
>
> Attachments: LUCENE-2886_simple64.patch, 
> LUCENE-2886_simple64_varint.patch, lucene-afor.tar.gz
>
>
> We could test the implementation of the Adaptive Frame Of Reference [1] on 
> the lucene-4.0 branch.
> I am providing the source code of its implementation. Some work needs to be 
> done, as this implementation is working on the old lucene-1458 branch. 
> I will attach a tarball containing a running version (with tests) of the AFOR 
> implementation, as well as the implementations of PFOR and of Simple64 
> (simple family codec working on 64bits word) that has been used in the 
> experiments in [1].
> [1] http://www.deri.ie/fileadmin/documents/deri-tr-afor.pdf

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2886) Adaptive Frame Of Reference

2011-02-04 Thread Renaud Delbru (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12990535#comment-12990535
 ] 

Renaud Delbru commented on LUCENE-2886:
---

{quote}
In the case of 240 1's, i was surprised to see this selector was used over 2% 
of the time
for the gov collection's doc file?
{quote}
Our results were obtained on the wikipedia dataset and the blogs dataset. I 
don't know what our selection rate was; I was just referring to the gain in 
overall compression rate.

{quote}
But still, for the all 1's case I'm not actually thinking about unstructured 
text so much...
in this case I am thinking about metadata fields and more structured data?
{quote}

Yes, this makes sense. In the context of SIREn (a kind of simple XML-node-based 
inverted index), which is meant for indexing semi-structured data, the 
difference was more observable (mainly on the frequency and position files, as 
well as the other structural node files).
This might also be useful on the document id file for very common terms (maybe 
for certain types of facets, with a small number of values covering a large 
portion of the document collection).

> Adaptive Frame Of Reference 
> 
>
> Key: LUCENE-2886
> URL: https://issues.apache.org/jira/browse/LUCENE-2886
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Codecs
>Reporter: Renaud Delbru
> Fix For: 4.0
>
> Attachments: LUCENE-2886_simple64.patch, 
> LUCENE-2886_simple64_varint.patch, lucene-afor.tar.gz
>
>
> We could test the implementation of the Adaptive Frame Of Reference [1] on 
> the lucene-4.0 branch.
> I am providing the source code of its implementation. Some work needs to be 
> done, as this implementation is working on the old lucene-1458 branch. 
> I will attach a tarball containing a running version (with tests) of the AFOR 
> implementation, as well as the implementations of PFOR and of Simple64 
> (simple family codec working on 64bits word) that has been used in the 
> experiments in [1].
> [1] http://www.deri.ie/fileadmin/documents/deri-tr-afor.pdf

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2886) Adaptive Frame Of Reference

2011-02-04 Thread Renaud Delbru (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12990538#comment-12990538
 ] 

Renaud Delbru commented on LUCENE-2886:
---

Just an additional comment on semi-structured data indexing. AFOR-2 and AFOR-3 
(AFOR-3 refers to AFOR-2 with a special code for allOnes frames) were able to 
beat Rice on two datasets, and S-64 on one (but they were very close to Rice on 
the others):

DBpedia dataset: (structured version of wikipedia)

||Method||Ent||Frq||Att||Val||Pos||Total||
|AFOR-1|0.246|0.043|0.141|0.065|0.180|0.816|
|AFOR-2|0.229|0.039|0.132|0.059|0.167|0.758|
|AFOR-3|0.229|0.031|0.131|0.054|0.159|0.736|
|FOR|0.315|0.061|0.170|0.117|0.216|1.049|
|PFOR|0.317|0.044|0.155|0.070|0.205|0.946|
|Rice|0.240|0.029|0.115|0.057|0.152|0.708|
|S-64|0.249|0.041|0.133|0.062|0.171|0.791|
|VByte|0.264|0.162|0.222|0.222|0.245|1.335|

Geonames Dataset: 

||Method||Ent||Frq||Att||Val||Pos||Total||
|AFOR-1|0.129|0.023|0.058|0.025|0.025|0.318|
|AFOR-2|0.123|0.023|0.057|0.024|0.024|0.307|
|AFOR-3|0.114|0.006|0.056|0.016|0.008|0.256|
|FOR|0.150|0.021|0.065|0.025|0.023|0.349|
|PFOR|0.154|0.019|0.057|0.022|0.023|0.332|
|Rice|0.133|0.019|0.063|0.029|0.021|0.327|
|S-64|0.147|0.021|0.058|0.023|0.023|0.329|
|VByte|0.264|0.162|0.222|0.222|0.245|1.335|

Sindice Dataset: A very heterogeneous dataset containing hundreds of thousands 
of web datasets

||Method||Ent||Frq||Att||Val||Pos||Total||
|AFOR-1|2.578|0.395|0.942|0.665|1.014|6.537|
|AFOR-2|2.361|0.380|0.908|0.619|0.906|6.082|
|AFOR-3|2.297|0.176|0.876|0.530|0.722|5.475|
|FOR|3.506|0.506|1.121|0.916|1.440|8.611|
|PFOR|3.221|0.374|1.153|0.795|1.227|7.924|
|Rice|2.721|0.314|0.958|0.714|0.941|6.605|
|S-64|2.581|0.370|0.917|0.621|0.908|6.313|
|VByte|3.287|2.106|2.411|2.430|2.488|15.132|

Here, Ent refers to entity id (similar to doc id), Att and Val are structural 
node ids.

> Adaptive Frame Of Reference 
> 
>
> Key: LUCENE-2886
> URL: https://issues.apache.org/jira/browse/LUCENE-2886
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Codecs
>Reporter: Renaud Delbru
> Fix For: 4.0
>
> Attachments: LUCENE-2886_simple64.patch, 
> LUCENE-2886_simple64_varint.patch, lucene-afor.tar.gz
>
>
> We could test the implementation of the Adaptive Frame Of Reference [1] on 
> the lucene-4.0 branch.
> I am providing the source code of its implementation. Some work needs to be 
> done, as this implementation is working on the old lucene-1458 branch. 
> I will attach a tarball containing a running version (with tests) of the AFOR 
> implementation, as well as the implementations of PFOR and of Simple64 
> (simple family codec working on 64bits word) that has been used in the 
> experiments in [1].
> [1] http://www.deri.ie/fileadmin/documents/deri-tr-afor.pdf

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Issue Comment Edited: (LUCENE-2886) Adaptive Frame Of Reference

2011-02-04 Thread Renaud Delbru (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12990538#comment-12990538
 ] 

Renaud Delbru edited comment on LUCENE-2886 at 2/4/11 12:05 PM:


Just an additional comment on semi-structured data indexing. AFOR-2 and AFOR-3 
(AFOR-3 refers to AFOR-2 with a special code for allOnes frames) were able to 
beat Rice on two datasets, and S-64 on one (but they were very close to Rice on 
the others):

DBpedia dataset: (structured version of wikipedia)

||Method||Ent||Frq||Att||Val||Pos||Total||
|AFOR-1|0.246|0.043|0.141|0.065|0.180|0.816|
|AFOR-2|0.229|0.039|0.132|0.059|0.167|0.758|
|AFOR-3|0.229|0.031|0.131|0.054|0.159|0.736|
|FOR|0.315|0.061|0.170|0.117|0.216|1.049|
|PFOR|0.317|0.044|0.155|0.070|0.205|0.946|
|Rice|0.240|0.029|0.115|0.057|0.152|0.708|
|S-64|0.249|0.041|0.133|0.062|0.171|0.791|
|VByte|0.264|0.162|0.222|0.222|0.245|1.335|

Geonames Dataset: 

||Method||Ent||Frq||Att||Val||Pos||Total||
|AFOR-1|0.129|0.023|0.058|0.025|0.025|0.318|
|AFOR-2|0.123|0.023|0.057|0.024|0.024|0.307|
|AFOR-3|0.114|0.006|0.056|0.016|0.008|0.256|
|FOR|0.150|0.021|0.065|0.025|0.023|0.349|
|PFOR|0.154|0.019|0.057|0.022|0.023|0.332|
|Rice|0.133|0.019|0.063|0.029|0.021|0.327|
|S-64|0.147|0.021|0.058|0.023|0.023|0.329|
|VByte|0.216|0.142|0.143|0.143|0.143|0.929|

Sindice Dataset: A very heterogeneous dataset containing hundreds of thousands 
of web datasets

||Method||Ent||Frq||Att||Val||Pos||Total||
|AFOR-1|2.578|0.395|0.942|0.665|1.014|6.537|
|AFOR-2|2.361|0.380|0.908|0.619|0.906|6.082|
|AFOR-3|2.297|0.176|0.876|0.530|0.722|5.475|
|FOR|3.506|0.506|1.121|0.916|1.440|8.611|
|PFOR|3.221|0.374|1.153|0.795|1.227|7.924|
|Rice|2.721|0.314|0.958|0.714|0.941|6.605|
|S-64|2.581|0.370|0.917|0.621|0.908|6.313|
|VByte|3.287|2.106|2.411|2.430|2.488|15.132|

Here, Ent refers to entity id (similar to doc id), Att and Val are structural 
node ids.

  was (Author: renaud.delbru):
Just an additional comment on semi-structured data indexing. AFOR-2 and AFOR-3 
(AFOR-3 refers to AFOR-2 with a special code for allOnes frames) were able to 
beat Rice on two datasets, and S-64 on one (but they were very close to Rice on 
the others):

DBpedia dataset: (structured version of wikipedia)

||Method||Ent||Frq||Att||Val||Pos||Total||
|AFOR-1|0.246|0.043|0.141|0.065|0.180|0.816|
|AFOR-2|0.229|0.039|0.132|0.059|0.167|0.758|
|AFOR-3|0.229|0.031|0.131|0.054|0.159|0.736|
|FOR|0.315|0.061|0.170|0.117|0.216|1.049|
|PFOR|0.317|0.044|0.155|0.070|0.205|0.946|
|Rice|0.240|0.029|0.115|0.057|0.152|0.708|
|S-64|0.249|0.041|0.133|0.062|0.171|0.791|
|VByte|0.264|0.162|0.222|0.222|0.245|1.335|

Geonames Dataset: 

||Method||Ent||Frq||Att||Val||Pos||Total||
|AFOR-1|0.129|0.023|0.058|0.025|0.025|0.318|
|AFOR-2|0.123|0.023|0.057|0.024|0.024|0.307|
|AFOR-3|0.114|0.006|0.056|0.016|0.008|0.256|
|FOR|0.150|0.021|0.065|0.025|0.023|0.349|
|PFOR|0.154|0.019|0.057|0.022|0.023|0.332|
|Rice|0.133|0.019|0.063|0.029|0.021|0.327|
|S-64|0.147|0.021|0.058|0.023|0.023|0.329|
|VByte|0.264|0.162|0.222|0.222|0.245|1.335|

Sindice Dataset: A very heterogeneous dataset containing hundreds of thousands 
of web datasets

||Method||Ent||Frq||Att||Val||Pos||Total||
|AFOR-1|2.578|0.395|0.942|0.665|1.014|6.537|
|AFOR-2|2.361|0.380|0.908|0.619|0.906|6.082|
|AFOR-3|2.297|0.176|0.876|0.530|0.722|5.475|
|FOR|3.506|0.506|1.121|0.916|1.440|8.611|
|PFOR|3.221|0.374|1.153|0.795|1.227|7.924|
|Rice|2.721|0.314|0.958|0.714|0.941|6.605|
|S-64|2.581|0.370|0.917|0.621|0.908|6.313|
|VByte|3.287|2.106|2.411|2.430|2.488|15.132|

Here, Ent refers to entity id (similar to doc id), Att and Val are structural 
node ids.
  
> Adaptive Frame Of Reference 
> 
>
> Key: LUCENE-2886
> URL: https://issues.apache.org/jira/browse/LUCENE-2886
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Codecs
>Reporter: Renaud Delbru
> Fix For: 4.0
>
> Attachments: LUCENE-2886_simple64.patch, 
> LUCENE-2886_simple64_varint.patch, lucene-afor.tar.gz
>
>
> We could test the implementation of the Adaptive Frame Of Reference [1] on 
> the lucene-4.0 branch.
> I am providing the source code of its implementation. Some work needs to be 
> done, as this implementation is working on the old lucene-1458 branch. 
> I will attach a tarball containing a running version (with tests) of the AFOR 
> implementation, as well as the implementations of PFOR and of Simple64 
> (simple family codec working on 64bits word) that has been used in the 
> experiments in [1].
> [1] http://www.deri.ie/fileadmin/documents/deri-tr-afor.pdf

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene

[jira] Commented: (LUCENE-2886) Adaptive Frame Of Reference

2011-02-04 Thread Renaud Delbru (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12990548#comment-12990548
 ] 

Renaud Delbru commented on LUCENE-2886:
---

{quote}
So if we can pack long streams of 1s with
freqs and positions I think this is probably useful for a lot of people.
{quote}
Yes, if the overhead is minimal, it might not be an issue in certain cases.

{quote}
Additionally for the .doc, i see its smaller in the AFOR-3 case too. Is
your "Ent" basically a measure of doc deltas? I'm confused exactly
what it is 
{quote}

Yes, Ent is just a delta representation of the id of the entity (which can be 
considered the document id). It is just that I have changed the name of the 
concept, as SIREn principally manipulates entities and not documents. In my 
case, an entity is just a set of attribute-value pairs, similar to a document 
in Lucene.

{quote}
Because I would think if you take e.g. Geonames, the place
names in the dataset are not in random order but actually "batched" by
country for example, so you would have long streams of docdelta=1 for
country=Germany's postings. 
{quote}
I checked, and the Geonames dataset was alphabetically sorted by URL names:
http://sws.geonames.org/1/
http://sws.geonames.org/10/
...
as were dbpedia and sindice.

So, yes, this might have (good) consequences on the docdelta list for certain 
datasets such as geonames, and especially when indexing semi-structured data, 
as the schema of the data in one dataset is generally identical across 
entities/documents. Therefore, it is likely that there will be long runs of 1 
for certain terms or schema terms.

> Adaptive Frame Of Reference 
> 
>
> Key: LUCENE-2886
> URL: https://issues.apache.org/jira/browse/LUCENE-2886
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Codecs
>Reporter: Renaud Delbru
> Fix For: 4.0
>
> Attachments: LUCENE-2886_simple64.patch, 
> LUCENE-2886_simple64_varint.patch, lucene-afor.tar.gz
>
>
> We could test the implementation of the Adaptive Frame Of Reference [1] on 
> the lucene-4.0 branch.
> I am providing the source code of its implementation. Some work needs to be 
> done, as this implementation is working on the old lucene-1458 branch. 
> I will attach a tarball containing a running version (with tests) of the AFOR 
> implementation, as well as the implementations of PFOR and of Simple64 
> (simple family codec working on 64bits word) that has been used in the 
> experiments in [1].
> [1] http://www.deri.ie/fileadmin/documents/deri-tr-afor.pdf

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2886) Adaptive Frame Of Reference

2011-02-04 Thread Renaud Delbru (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12990568#comment-12990568
 ] 

Renaud Delbru commented on LUCENE-2886:
---

Hi Michael, 
the first results are not that impressive. 
* Could you tell me what BulkVInt is? Is it the simple VInt codec implemented 
on top of the Bulk branch?
* What is the difference between '+united +states' and '+nebraska +states'? Is 
nebraska a low-frequency term?

> Adaptive Frame Of Reference 
> 
>
> Key: LUCENE-2886
> URL: https://issues.apache.org/jira/browse/LUCENE-2886
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Codecs
>Reporter: Renaud Delbru
> Fix For: 4.0
>
> Attachments: LUCENE-2886_simple64.patch, 
> LUCENE-2886_simple64_varint.patch, lucene-afor.tar.gz
>
>
> We could test the implementation of the Adaptive Frame Of Reference [1] on 
> the lucene-4.0 branch.
> I am providing the source code of its implementation. Some work needs to be 
> done, as this implementation is working on the old lucene-1458 branch. 
> I will attach a tarball containing a running version (with tests) of the AFOR 
> implementation, as well as the implementations of PFOR and of Simple64 
> (simple family codec working on 64bits word) that has been used in the 
> experiments in [1].
> [1] http://www.deri.ie/fileadmin/documents/deri-tr-afor.pdf

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2886) Adaptive Frame Of Reference

2011-02-04 Thread Renaud Delbru (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12990611#comment-12990611
 ] 

Renaud Delbru commented on LUCENE-2886:
---

{quote}
The BulkVInt codec is VInt implemented as a FixedIntBlock codec.
{quote}

Yes, I saw the code; it is similar to the VInt implementation we used in our 
experiments.

{quote}
previously various codecs
looked much faster than Vint but a lot of the reason for this is due to the way 
Vint
was implemented...
{quote}

This is odd, because we observed the contrary (on the lucene-1458 branch). The 
standard codec was faster than any other codec by an order of magnitude. We 
discovered that this was due to the IntBlock interface implementation, which:
- copied the buffer byte array twice (once from the disk to the buffer, then 
again from the buffer to the IntBlock codec);
- had to perform extra work to check each of the buffers (the IntBlock buffer 
and the IndexInput buffer).
But this might have been improved since then. Michael told me he worked on a 
new version of the IntBlock interface which was more performant.

{quote}
So, if we 'group' the long values so we are e.g. reading say N long values
at once in a single internal 'block', I think we might get more efficiency
via the I/O system, and also less overhead from the bulkpostings apis.
{quote}

If I understand correctly, this is similar to increasing the boundaries of the 
variable block size. Indeed, performing a block read for each Simple64 long 
word (Simple64 frame) incurs some non-negligible overhead, and it might be 
better to read more than one per block read.
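
A rough sketch of the grouping idea (illustration only, using plain java.io
rather than the bulkpostings APIs):

{code:java}
import java.io.DataInput;
import java.io.IOException;

// Illustration: fetch N Simple64 words (N * 8 bytes) in a single read, then
// decode the 64-bit words from the in-memory buffer instead of issuing one
// block read per word.
final class GroupedSimple64ReadSketch {
  static long[] readGroup(DataInput in, int groupSize) throws IOException {
    byte[] buffer = new byte[groupSize * 8];
    in.readFully(buffer);                     // one read for the whole group
    long[] words = new long[groupSize];
    for (int i = 0; i < groupSize; i++) {
      long word = 0;
      for (int j = 0; j < 8; j++) {
        word = (word << 8) | (buffer[i * 8 + j] & 0xFFL);
      }
      words[i] = word;                        // decoded in memory
    }
    return words;
  }
}
{code}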

> Adaptive Frame Of Reference 
> 
>
> Key: LUCENE-2886
> URL: https://issues.apache.org/jira/browse/LUCENE-2886
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Codecs
>Reporter: Renaud Delbru
> Fix For: 4.0
>
> Attachments: LUCENE-2886_simple64.patch, 
> LUCENE-2886_simple64_varint.patch, lucene-afor.tar.gz
>
>
> We could test the implementation of the Adaptive Frame Of Reference [1] on 
> the lucene-4.0 branch.
> I am providing the source code of its implementation. Some work needs to be 
> done, as this implementation is working on the old lucene-1458 branch. 
> I will attach a tarball containing a running version (with tests) of the AFOR 
> implementation, as well as the implementations of PFOR and of Simple64 
> (simple family codec working on 64bits word) that has been used in the 
> experiments in [1].
> [1] http://www.deri.ie/fileadmin/documents/deri-tr-afor.pdf

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Issue Comment Edited: (LUCENE-2905) Sep codec writes insane amounts of skip data

2011-02-04 Thread Renaud Delbru (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12990649#comment-12990649
 ] 

Renaud Delbru edited comment on LUCENE-2905 at 2/4/11 5:53 PM:
---

Hi Robert,

it is good to see we are going in the same direction ;o). Here is a short paper 
about an extension of skip lists for block-based inverted files [1], which was 
accepted at the European Conference on Information Retrieval. I shared this 
paper in a previous discussion with Michael. Maybe you will find some ideas 
inside that are worth keeping in mind.

Also, the main problem of the current skip list implementation when it is 
applied to the Sep codec is, in my opinion, the fact that we have to store in 
each skip list entry the fp of each of the sep files. The more files you have 
(in the SIREn case, we had 5 files), the bigger the skip list data structure 
gets.
However, if you think about it, the main goal of the skip list is to skip doc 
ids, not freqs or positions. We are storing the fps of the freq and pos files 
only because we need to synchronise the position of the "inverted file pointer" 
on each file.
Also, this incurs some overhead when you only need to answer
- a pure boolean query: we only need to scan the doc file, but we are still 
reading and decoding the fp pointers of the freq and pos files;
- an extended boolean query: we only need to scan the doc and freq files, but 
we are still reading and decoding the fp pointers of the pos file;

An idea I had a few months ago (but never found the time to implement and test 
it) was to change the way the skip list data structure is created. The idea was 
to store the pointer to the doc file in the skip entry, and nothing else. The 
other pointers (to the freq file and position file) are instead stored in the 
block headers.
When using the skip list, you traverse it until you find the skip point of 
interest, then decode the associated skip entry and get the doc file pointer. 
The doc file pointer indicates the beginning of the block that contains the 
identifiers you are looking for.
After reading the block from disk and loading it into memory, you can decode 
its header, which contains a pointer to the associated block in the frequency 
file. Similarly, in the block header of the frequency file, you have a pointer 
to the associated block in the pos file. In fact, you can picture this as a 
linked list of blocks. The skip list provides only the first pointer to the doc 
file block; pointers to subsequent blocks are included in the block headers.
On one hand, this considerably reduces the size of the skip list, since most of 
the information is "exported" and encoded into block headers. On the other 
hand, I am not sure it reduces the size of the index, as it just moves the data 
from the skip list to the inverted file. In addition, I think it makes 
delta-encoding of the fps for the freq and pos files impossible. But there 
might be other optimisations possible with this data model.

[1] http://dl.dropbox.com/u/1278798/ecir2011-skipblock.pdf

  was (Author: renaud.delbru):
Hi Robert,

it is good to see we are going in the same direction ;o). Here is a short paper 
about an extension of skip list for block based inverted file [1] which was 
accepted at the European Conference of Information Retrieval. I shared this 
paper in a previous discussion with Michael. Maybe you will find some ideas 
inside that are worth keeping in mind.

Also, the main problem of the current skip list implementation when it is 
applied on Sep codec is in my opinion the fact that we have to store for each 
skip list entry the fp of each of the sep file. And more you have files (in the 
SIREn case, we had 5 files), more the skip list data structure get bigger.
However, if you think of it, the main goal of the skip list is to skip doc id, 
not freq or pos. We are storing the fps of the freq and pos files only because 
we need to synchronise the position of the "inverted file pointer" on each 
file. 
Also, this occurs some overhead when you only need to answer
- pure boolean query: we only need to scan the doc file, but we are still 
reading and decoding the fp pointers of the freq and pos file;
- extended boolean query: we only need to scan the doc and freq file, but we 
are still reading and decoding the fp pointers of the pos file;
An idea I had a few months ago (but never found the time to implement it and 
test it) was to change the way the skip list data structure is created. The 
idea was to store the pointer to the doc file in the skip entry, and nothing 
else. The other pointers (to the freq file and position file) are in fact 
stored into the block header.
When using the skip list, you will traverse the skip list until you find the 
skip point of interest, then decode the associated skip entry and get the doc 
file pointer. The doc fil

[jira] Commented: (LUCENE-2905) Sep codec writes insane amounts of skip data

2011-02-04 Thread Renaud Delbru (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12990649#comment-12990649
 ] 

Renaud Delbru commented on LUCENE-2905:
---

Hi Robert,

it is good to see we are going in the same direction ;o). Here is a short paper 
about an extension of skip lists for block-based inverted files [1], which was 
accepted at the European Conference on Information Retrieval. I shared this 
paper in a previous discussion with Michael. Maybe you will find some ideas 
inside that are worth keeping in mind.

Also, the main problem of the current skip list implementation when it is 
applied to the Sep codec is, in my opinion, the fact that we have to store in 
each skip list entry the fp of each of the sep files. The more files you have 
(in the SIREn case, we had 5 files), the bigger the skip list data structure 
gets.
However, if you think about it, the main goal of the skip list is to skip doc 
ids, not freqs or positions. We are storing the fps of the freq and pos files 
only because we need to synchronise the position of the "inverted file pointer" 
on each file.
Also, this incurs some overhead when you only need to answer
- a pure boolean query: we only need to scan the doc file, but we are still 
reading and decoding the fp pointers of the freq and pos files;
- an extended boolean query: we only need to scan the doc and freq files, but 
we are still reading and decoding the fp pointers of the pos file;
An idea I had a few months ago (but never found the time to implement and test 
it) was to change the way the skip list data structure is created. The idea was 
to store the pointer to the doc file in the skip entry, and nothing else. The 
other pointers (to the freq file and position file) are instead stored in the 
block headers.
When using the skip list, you traverse it until you find the skip point of 
interest, then decode the associated skip entry and get the doc file pointer. 
The doc file pointer indicates the beginning of the block that contains the 
identifiers you are looking for.
After reading the block from disk and loading it into memory, you can decode 
its header, which contains a pointer to the associated block in the frequency 
file. Similarly, in the block header of the frequency file, you have a pointer 
to the associated block in the pos file. In fact, you can picture this as a 
linked list of blocks. The skip list provides only the first pointer to the doc 
file block; pointers to subsequent blocks are included in the block headers.
On one hand, this considerably reduces the size of the skip list, since most of 
the information is "exported" and encoded into block headers. On the other 
hand, I am not sure it reduces the size of the index, as it just moves the data 
from the skip list to the inverted file. In addition, I think it makes 
delta-encoding of the fps for the freq and pos files impossible. But there 
might be other optimisations possible with this data model.

[1] http://dl.dropbox.com/u/1278798/ecir2011-skipblock.pdf
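
A schematic sketch of the layout described above (names are illustrative only, 
not real classes):

{code:java}
// The skip list only knows where a doc block starts; freq and pos blocks are
// reached by following pointers stored in the block headers themselves,
// forming a linked chain of blocks across the files.
final class SkipEntrySketch {
  long docBlockFP;    // file pointer to a block in the doc file (only field)
}

final class DocBlockHeaderSketch {
  long freqBlockFP;   // pointer to the matching block in the freq file
  // ... followed by the encoded doc deltas of this block
}

final class FreqBlockHeaderSketch {
  long posBlockFP;    // pointer to the matching block in the pos file
  // ... followed by the encoded frequencies of this block
}
{code}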

> Sep codec writes insane amounts of skip data
> 
>
> Key: LUCENE-2905
> URL: https://issues.apache.org/jira/browse/LUCENE-2905
> Project: Lucene - Java
>  Issue Type: Bug
>Reporter: Robert Muir
> Fix For: Bulk Postings branch
>
>
> Currently, even if we use better compression algorithms via Fixed or Variable 
> Intblock
> encodings, we have problems with both performance and index size versus 
> StandardCodec.
> Consider the following numbers:
> {noformat}
> standard:
> frq: 1,862,174,204 bytes
> prx: 1,146,898,936 bytes
> tib: 541,128,354 bytes
> complete index: 4,321,032,720 bytes
> bulkvint:
> doc: 1,297,215,588 bytes
> frq: 725,060,776 bytes
> pos: 1,163,335,609 bytes
> tib: 729,019,637 bytes
> complete index: 5,180,088,695 bytes
> simple64:
> doc: 1,260,869,240 bytes
> frq: 234,491,576 bytes
> pos: 1,055,024,224 bytes
> skp: 473,293,042 bytes
> tib: 725,928,817 bytes
> complete index: 4,520,488,986 bytes
> {noformat}
> I think there are several reasons for this:
> * Splitting into separate files (e.g. postings into .doc + .freq). 
> * Having to store both a relative delta to the block start, and an offset 
> into the block.
> * In a lot of cases various numbers involved are larger than they should be: 
> e.g. they are file pointer deltas, but blocksize is fixed...
> Here are some ideas (some are probably stupid) of things we could do to try 
> to fix this:
> Is Sep really necessary? Instead should we make an alternative to Sep, 
> Interleaved? that interleaves doc and freq blocks (doc,freq,doc,freq) into 
> one file? the concrete impl could implement skipBlock() for when they only 
> want docdeltas: e.g. for Simple64 blocks on disk are fixed size so it could 
> just skip N bytes. Fixed Int Block codecs like PF

[jira] [Commented] (LUCENE-4642) TokenizerFactory should provide a create method with a given AttributeSource

2013-01-21 Thread Renaud Delbru (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13558832#comment-13558832
 ] 

Renaud Delbru commented on LUCENE-4642:
---

{quote}
Personally: I think we should remove Tokenizer(AttributeSource): it bloats the 
APIs and causes ctor explosion.
{quote}

Why not the contrary instead? I.e., remove Tokenizer(AttributeFactory) and keep 
Tokenizer(AttributeSource), since AttributeFactory is a nested class of 
AttributeSource? Limiting the API to only AttributeFactory will restrict it 
unnecessarily imho.

Our use case is to be able to create "advanced token streams", where one 
"parent token stream" can have multiple "child token streams"; the parent token 
stream shares its attribute source with the child token streams for performance 
reasons. Emulating this behaviour by copying the attributes from stream to 
stream is really inefficient (our throughput is divided by at least 3).
A more concrete use case is the ability to create "specific token streams" for 
a particular "token type". For example, our parent tokenizer tokenizes a string 
into a list of tokens, each one having a specific type. Then each token is 
processed downstream by "child token streams". The child token stream that 
processes a token depends on the token type attribute.
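
To make the pattern concrete, here is a minimal sketch (hypothetical class
names; it relies on the TokenStream(AttributeSource) constructor, and on
tokenizers keeping their AttributeSource constructors):

{code:java}
import java.io.Reader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;
import org.apache.lucene.util.AttributeSource;
import org.apache.lucene.util.Version;

// Hypothetical child stream: constructed with the parent's AttributeSource,
// so both streams see the exact same attribute instances and no per-token
// attribute copy is needed.
final class ChildTokenStream extends TokenStream {
  ChildTokenStream(AttributeSource parent) {
    super(parent); // share the attributes instead of copying them
  }

  @Override
  public boolean incrementToken() {
    // would process the attributes already populated by the parent
    return false;
  }
}

final class ParentChildWiringSketch {
  void wire(Reader reader) {
    Tokenizer parent = new StandardTokenizer(Version.LUCENE_41, reader);
    TypeAttribute type = parent.addAttribute(TypeAttribute.class);
    // The child is bound to the very same AttributeSource as the parent;
    // the parent can route tokens to it based on the type attribute.
    ChildTokenStream child = new ChildTokenStream(parent);
  }
}
{code}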

> TokenizerFactory should provide a create method with a given AttributeSource
> 
>
> Key: LUCENE-4642
> URL: https://issues.apache.org/jira/browse/LUCENE-4642
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Affects Versions: 4.1
>Reporter: Renaud Delbru
>Assignee: Steve Rowe
>  Labels: analysis, attribute, tokenizer
> Fix For: 4.2, 5.0
>
> Attachments: LUCENE-4642.patch, LUCENE-4642.patch
>
>
> All tokenizer implementations have a constructor that takes a given 
> AttributeSource as parameter (LUCENE-1826). However, the TokenizerFactory 
> does not provide an API to create tokenizers with a given AttributeSource.
> Side note: There are still a lot of tokenizers that do not provide 
> constructors that take AttributeSource and AttributeFactory.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4642) TokenizerFactory should provide a create method with a given AttributeSource

2013-01-21 Thread Renaud Delbru (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13558850#comment-13558850
 ] 

Renaud Delbru commented on LUCENE-4642:
---

{quote}
Because its totally unrelated.
{quote}

Well, I think the user could simply create a new AttributeSource with a given 
AttributeFactory to emulate Tokenizer(AttributeFactory)? But that might add 
some burden on the user side.

> TokenizerFactory should provide a create method with a given AttributeSource
> 
>
> Key: LUCENE-4642
> URL: https://issues.apache.org/jira/browse/LUCENE-4642
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Affects Versions: 4.1
>Reporter: Renaud Delbru
>Assignee: Steve Rowe
>  Labels: analysis, attribute, tokenizer
> Fix For: 4.2, 5.0
>
> Attachments: LUCENE-4642.patch, LUCENE-4642.patch
>
>
> All tokenizer implementations have a constructor that takes a given 
> AttributeSource as parameter (LUCENE-1826). However, the TokenizerFactory 
> does not provide an API to create tokenizers with a given AttributeSource.
> Side note: There are still a lot of tokenizers that do not provide 
> constructors that take AttributeSource and AttributeFactory.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4642) TokenizerFactory should provide a create method with a given AttributeSource

2013-01-25 Thread Renaud Delbru (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13562740#comment-13562740
 ] 

Renaud Delbru commented on LUCENE-4642:
---

Hi, 

are there still some open questions on this issue that block the patch from 
being committed?

> TokenizerFactory should provide a create method with a given AttributeSource
> 
>
> Key: LUCENE-4642
> URL: https://issues.apache.org/jira/browse/LUCENE-4642
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Affects Versions: 4.1
>Reporter: Renaud Delbru
>Assignee: Steve Rowe
>  Labels: analysis, attribute, tokenizer
> Fix For: 4.2, 5.0
>
> Attachments: LUCENE-4642.patch, LUCENE-4642.patch
>
>
> All tokenizer implementations have a constructor that takes a given 
> AttributeSource as parameter (LUCENE-1826). However, the TokenizerFactory 
> does not provide an API to create tokenizers with a given AttributeSource.
> Side note: There are still a lot of tokenizers that do not provide 
> constructors that take AttributeSource and AttributeFactory.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4642) TokenizerFactory should provide a create method with a given AttributeSource

2013-01-25 Thread Renaud Delbru (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13562853#comment-13562853
 ] 

Renaud Delbru commented on LUCENE-4642:
---

@steve:

{quote}
have you looked at TeeSinkTokenFilter
{quote}

Yes, and from my current understanding, it is similar to our current 
implementation. The problem with this approach is that the exchange of 
attributes is performed using the AttributeSource.State API with 
AttributeSource#captureState and AttributeSource#restoreState, which copies the 
values of all attribute implementations that the state contains; this is very 
inefficient as it has to copy arrays and other objects (e.g., char term 
arrays, etc.) for every single token.
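
For context, this is the per-token pattern we are trying to avoid
(captureState/restoreState are existing AttributeSource methods; the sketch
only illustrates the copy cost):

{code:java}
import org.apache.lucene.util.AttributeSource;

// Sketch of the state-copying pattern: forwarding a token from one stream to
// another via State requires copying all attribute values, for every token.
final class StateCopySketch {
  static void forwardToken(AttributeSource source, AttributeSource sink) {
    AttributeSource.State state = source.captureState(); // copies term, offsets, type, ...
    sink.restoreState(state);                            // copies them again into the sink
  }
}
{code}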

@robert:

Concerning the problem of UOEs, Steve's new patch reduces the number of UOEs to 
only one, which is much more reasonable than my first approach. I have looked 
at the current state of the Lucene trunk, and there are already a lot of UOEs 
in many places. So, I would suggest that this problem may not be a blocking one 
(but I might be wrong).

Concerning the problem of constructor explosion, maybe we can find a consensus. 
Your proposition of removing Tokenizer(AttributeSource) cannot work for us, as 
we need it to share the same AttributeSource across multiple streams. However, 
as I proposed, removing Tokenizer(AttributeFactory) could work, as it could be 
emulated by using Tokenizer(AttributeSource).



> TokenizerFactory should provide a create method with a given AttributeSource
> 
>
> Key: LUCENE-4642
> URL: https://issues.apache.org/jira/browse/LUCENE-4642
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Affects Versions: 4.1
>Reporter: Renaud Delbru
>Assignee: Steve Rowe
>  Labels: analysis, attribute, tokenizer
> Fix For: 4.2, 5.0
>
> Attachments: LUCENE-4642.patch, LUCENE-4642.patch
>
>
> All tokenizer implementations have a constructor that takes a given 
> AttributeSource as parameter (LUCENE-1826). However, the TokenizerFactory 
> does not provide an API to create tokenizers with a given AttributeSource.
> Side note: There are still a lot of tokenizers that do not provide 
> constructors that take AttributeSource and AttributeFactory.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4642) TokenizerFactory should provide a create method with a given AttributeSource

2013-01-27 Thread Renaud Delbru (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13563784#comment-13563784
 ] 

Renaud Delbru commented on LUCENE-4642:
---

Hi Robert,

I understand your point of view. One possible alternative for simplifying the 
API would be to refactor the AttributeSource/AttributeFactory constructors into 
setters. After a quick look, this looks compatible with the existing tokenizers 
and tokenizer factories. 
Setting the AttributeSource/AttributeFactory on a tokenizer would be 
transparent (i.e., subclasses would not have to explicitly create a 
constructor), and specific extensions could still be implemented by subclasses 
(e.g., NumericTokenStream could override the setAttributeFactory method to wrap 
a given factory with NumericAttributeFactory).
For the tokenizer factories, we could then implement a create method with an 
AttributeSource/AttributeFactory parameter, which would call the abstract 
create method and then call setAttributeSource/setAttributeFactory on the newly 
created tokenizer.

What do you think? Did I miss something in my reasoning that could break 
something?
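
A minimal sketch of what I mean (hypothetical setter, not an existing Lucene
method):

{code:java}
import java.io.Reader;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.util.AttributeSource.AttributeFactory;

// Sketch of the setter-based idea; setAttributeFactory does not exist today,
// it only illustrates the proposal.
abstract class SetterBasedTokenizerFactorySketch {

  /** Existing-style abstract factory method. */
  public abstract Tokenizer create(Reader input);

  /** Proposed convenience method: create the tokenizer, then inject the factory. */
  public Tokenizer create(AttributeFactory factory, Reader input) {
    Tokenizer tokenizer = create(input);
    // tokenizer.setAttributeFactory(factory); // hypothetical setter discussed above
    return tokenizer;
  }
}
{code}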

> TokenizerFactory should provide a create method with a given AttributeSource
> 
>
> Key: LUCENE-4642
> URL: https://issues.apache.org/jira/browse/LUCENE-4642
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Affects Versions: 4.1
>Reporter: Renaud Delbru
>Assignee: Steve Rowe
>  Labels: analysis, attribute, tokenizer
> Fix For: 4.2, 5.0
>
> Attachments: LUCENE-4642.patch, LUCENE-4642.patch
>
>
> All tokenizer implementations have a constructor that takes a given 
> AttributeSource as parameter (LUCENE-1826). However, the TokenizerFactory 
> does not provide an API to create tokenizers with a given AttributeSource.
> Side note: There are still a lot of tokenizers that do not provide 
> constructors that take AttributeSource and AttributeFactory.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4642) TokenizerFactory should provide a create method with a given AttributeSource

2013-01-28 Thread Renaud Delbru (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13564422#comment-13564422
 ] 

Renaud Delbru commented on LUCENE-4642:
---

Great, I think that AttributeFactory hack could work for us. Would you agree to 
add a TokenizerFactory.create(AttributeFactory) method? I could prepare a patch 
for that.

> TokenizerFactory should provide a create method with a given AttributeSource
> 
>
> Key: LUCENE-4642
> URL: https://issues.apache.org/jira/browse/LUCENE-4642
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Affects Versions: 4.1
>Reporter: Renaud Delbru
>Assignee: Steve Rowe
>  Labels: analysis, attribute, tokenizer
> Fix For: 4.2, 5.0
>
> Attachments: LUCENE-4642.patch, LUCENE-4642.patch, 
> TrieTokenizerFactory.java.patch
>
>
> All tokenizer implementations have a constructor that takes a given 
> AttributeSource as parameter (LUCENE-1826). However, the TokenizerFactory 
> does not provide an API to create tokenizers with a given AttributeSource.
> Side note: There are still a lot of tokenizers that do not provide 
> constructors that take AttributeSource and AttributeFactory.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-4642) TokenizerFactory should provide a create method with a given AttributeSource

2013-02-03 Thread Renaud Delbru (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Renaud Delbru updated LUCENE-4642:
--

Attachment: LUCENE-4642.patch

> TokenizerFactory should provide a create method with a given AttributeSource
> 
>
> Key: LUCENE-4642
> URL: https://issues.apache.org/jira/browse/LUCENE-4642
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Affects Versions: 4.1
>Reporter: Renaud Delbru
>Assignee: Steve Rowe
>  Labels: analysis, attribute, tokenizer
> Fix For: 4.2, 5.0
>
> Attachments: LUCENE-4642.patch, LUCENE-4642.patch, LUCENE-4642.patch, 
> TrieTokenizerFactory.java.patch
>
>
> All tokenizer implementations have a constructor that takes a given 
> AttributeSource as parameter (LUCENE-1826). However, the TokenizerFactory 
> does not provide an API to create tokenizers with a given AttributeSource.
> Side note: There are still a lot of tokenizers that do not provide 
> constructors that take AttributeSource and AttributeFactory.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4642) TokenizerFactory should provide a create method with a given AttributeSource

2013-02-03 Thread Renaud Delbru (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13569883#comment-13569883
 ] 

Renaud Delbru commented on LUCENE-4642:
---

Hi,

I have submitted a patch which integrates:
- the patch from Uwe
- the removal of the Tokenizer(AttributeSource) constructor
- the addition of a TokenizerFactory.create(AttributeFactory) method
- some of the changes from the previous patch from Steve (e.g., the 
TokenizerFactory.create method throws UOE by default)

All test suites are passing.
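
As a concrete illustration of the pieces listed above, a hypothetical factory 
could end up looking roughly like this (FooTokenizer and FooTokenizerFactory 
are made-up names; this is a sketch of the intent, not code from the patch):

{code:java}
import java.io.Reader;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.util.AttributeSource.AttributeFactory;

// Hypothetical factory for a tokenizer that exposes an (AttributeFactory, Reader)
// constructor. Factories whose tokenizers do not take an AttributeFactory would
// simply not override create(AttributeFactory, Reader) and inherit the default
// behavior (throwing UnsupportedOperationException, as described above).
public class FooTokenizerFactory extends TokenizerFactory {

  @Override
  public Tokenizer create(Reader input) {
    return new FooTokenizer(input);
  }

  @Override
  public Tokenizer create(AttributeFactory factory, Reader input) {
    return new FooTokenizer(factory, input);
  }
}
{code}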

> TokenizerFactory should provide a create method with a given AttributeSource
> 
>
> Key: LUCENE-4642
> URL: https://issues.apache.org/jira/browse/LUCENE-4642
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Affects Versions: 4.1
>Reporter: Renaud Delbru
>Assignee: Steve Rowe
>  Labels: analysis, attribute, tokenizer
> Fix For: 4.2, 5.0
>
> Attachments: LUCENE-4642.patch, LUCENE-4642.patch, LUCENE-4642.patch, 
> TrieTokenizerFactory.java.patch
>
>
> All tokenizer implementations have a constructor that takes a given 
> AttributeSource as parameter (LUCENE-1826). However, the TokenizerFactory 
> does not provide an API to create tokenizers with a given AttributeSource.
> Side note: There are still a lot of tokenizers that do not provide 
> constructors that take AttributeSource and AttributeFactory.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4642) Add create(AttributeFactory) to TokenizerFactory and subclasses with ctors taking AttributeFactory, and remove Tokenizer's and subclasses' ctors taking AttributeSource

2013-03-20 Thread Renaud Delbru (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13607809#comment-13607809
 ] 

Renaud Delbru commented on LUCENE-4642:
---

Thanks for committing this, Steve and Robert. That's great.

> Add create(AttributeFactory) to TokenizerFactory and subclasses with ctors 
> taking AttributeFactory, and remove Tokenizer's and subclasses' ctors taking 
> AttributeSource
> ---
>
> Key: LUCENE-4642
> URL: https://issues.apache.org/jira/browse/LUCENE-4642
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Affects Versions: 4.1
>Reporter: Renaud Delbru
>Assignee: Steve Rowe
>  Labels: analysis, attribute, tokenizer
> Fix For: 5.0, 4.3
>
> Attachments: LUCENE-4642.patch, LUCENE-4642.patch, LUCENE-4642.patch, 
> LUCENE-4642.patch, 
> LUCENE-4642-single-create-method-on-TokenizerFactory-subclasses.patch, 
> LUCENE-4642-single-create-method-on-TokenizerFactory-subclasses.patch, 
> TrieTokenizerFactory.java.patch
>
>
> All tokenizer implementations have a constructor that takes a given 
> AttributeSource as parameter (LUCENE-1826).  These should be removed.
> TokenizerFactory does not provide an API to create tokenizers with a given 
> AttributeFactory, but quite a few tokenizers have constructors that take an 
> AttributeFactory.  TokenizerFactory should add a create(AttributeFactory) 
> method, as should subclasses for tokenizers with AttributeFactory accepting 
> ctors.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (LUCENE-4919) IntsRef, BytesRef and CharsRef returns incorrect hashcode when filled with 0

2013-04-09 Thread Renaud Delbru (JIRA)
Renaud Delbru created LUCENE-4919:
-

 Summary: IntsRef, BytesRef and CharsRef returns incorrect hashcode 
when filled with 0
 Key: LUCENE-4919
 URL: https://issues.apache.org/jira/browse/LUCENE-4919
 Project: Lucene - Core
  Issue Type: Bug
  Components: core/other
Affects Versions: 4.2
Reporter: Renaud Delbru
 Fix For: 4.3


IntsRef, BytesRef and CharsRef implementation does not follow the java 
Arrays.hashCode implementation, and returns incorrect hashcode when filled with 
0. 
For example, an IntsRef with { 0 } will return the same hashcode than an 
IntsRef with { 0, 0 }.
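
To make the collision concrete, the following minimal sketch (not the Lucene 
code itself) contrasts a multiplicative hash whose accumulator starts at 0, 
which is the assumption behind this report, with java.util.Arrays.hashCode, 
which starts at 1:

{code:java}
import java.util.Arrays;

public class ZeroFillHashDemo {

  // Sketch of a *Ref-style hash: the accumulator starts at 0, so a run of
  // zero elements never changes the result, regardless of its length.
  static int refStyleHash(int[] a) {
    int result = 0;
    for (int v : a) {
      result = 31 * result + v;
    }
    return result;
  }

  public static void main(String[] args) {
    int[] one = { 0 };
    int[] two = { 0, 0 };
    // Both calls print 0: { 0 } and { 0, 0 } collide.
    System.out.println(refStyleHash(one) + " " + refStyleHash(two));
    // Arrays.hashCode starts at 1, so the arrays hash differently (31 vs 961).
    System.out.println(Arrays.hashCode(one) + " " + Arrays.hashCode(two));
  }
}
{code}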

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-4919) IntsRef, BytesRef and CharsRef returns incorrect hashcode when filled with 0

2013-04-09 Thread Renaud Delbru (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Renaud Delbru updated LUCENE-4919:
--

Description: 
IntsRef, BytesRef and CharsRef implementation does not follow the java 
Arrays.hashCode implementation, and returns incorrect hashcode when filled with 
0. 
For example, an IntsRef with \{ 0 \} will return the same hashcode than an 
IntsRef with \{ 0, 0 \}.

  was:
IntsRef, BytesRef and CharsRef implementation does not follow the java 
Arrays.hashCode implementation, and returns incorrect hashcode when filled with 
0. 
For example, an IntsRef with { 0 } will return the same hashcode than an 
IntsRef with { 0, 0 }.


> IntsRef, BytesRef and CharsRef returns incorrect hashcode when filled with 0
> 
>
> Key: LUCENE-4919
> URL: https://issues.apache.org/jira/browse/LUCENE-4919
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/other
>Affects Versions: 4.2
>Reporter: Renaud Delbru
> Fix For: 4.3
>
>
> IntsRef, BytesRef and CharsRef implementation does not follow the java 
> Arrays.hashCode implementation, and returns incorrect hashcode when filled 
> with 0. 
> For example, an IntsRef with \{ 0 \} will return the same hashcode than an 
> IntsRef with \{ 0, 0 \}.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-4919) IntsRef, BytesRef and CharsRef returns incorrect hashcode when filled with 0

2013-04-09 Thread Renaud Delbru (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Renaud Delbru updated LUCENE-4919:
--

Attachment: LUCENE-4919.patch

Here is a patch for IntsRef, BytesRef and CharsRef, including unit tests. The 
new hashcode implementation is identical to the one found in Arrays.hashCode.
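
For readers who do not open the patch, an implementation that matches 
Arrays.hashCode over the valid slice would look roughly like the sketch below 
(ints, offset and length are the usual IntsRef fields; this is a sketch of the 
idea, not the attached patch itself):

{code:java}
// Sketch of an IntsRef-style hashCode aligned with java.util.Arrays.hashCode.
@Override
public int hashCode() {
  int result = 1;                    // Arrays.hashCode starts at 1, not 0
  final int end = offset + length;
  for (int i = offset; i < end; i++) {
    result = 31 * result + ints[i];  // a zero element now still changes the hash
  }
  return result;
}
{code}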

> IntsRef, BytesRef and CharsRef returns incorrect hashcode when filled with 0
> 
>
> Key: LUCENE-4919
> URL: https://issues.apache.org/jira/browse/LUCENE-4919
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/other
>Affects Versions: 4.2
>Reporter: Renaud Delbru
> Fix For: 4.3
>
> Attachments: LUCENE-4919.patch
>
>
> IntsRef, BytesRef and CharsRef implementation does not follow the java 
> Arrays.hashCode implementation, and returns incorrect hashcode when filled 
> with 0. 
> For example, an IntsRef with \{ 0 \} will return the same hashcode than an 
> IntsRef with \{ 0, 0 \}.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-4919) IntsRef, BytesRef and CharsRef return incorrect hashcode when filled with 0

2013-04-09 Thread Renaud Delbru (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Renaud Delbru updated LUCENE-4919:
--

Description: 
IntsRef, BytesRef and CharsRef implementation do not follow the java 
Arrays.hashCode implementation, and return incorrect hashcode when filled with 
0. 
For example, an IntsRef with \{ 0 \} will return the same hashcode than an 
IntsRef with \{ 0, 0 \}.

  was:
IntsRef, BytesRef and CharsRef implementation does not follow the java 
Arrays.hashCode implementation, and returns incorrect hashcode when filled with 
0. 
For example, an IntsRef with \{ 0 \} will return the same hashcode than an 
IntsRef with \{ 0, 0 \}.

Summary: IntsRef, BytesRef and CharsRef return incorrect hashcode when 
filled with 0  (was: IntsRef, BytesRef and CharsRef returns incorrect hashcode 
when filled with 0)

> IntsRef, BytesRef and CharsRef return incorrect hashcode when filled with 0
> ---
>
> Key: LUCENE-4919
> URL: https://issues.apache.org/jira/browse/LUCENE-4919
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/other
>Affects Versions: 4.2
>Reporter: Renaud Delbru
> Fix For: 4.3
>
> Attachments: LUCENE-4919.patch
>
>
> IntsRef, BytesRef and CharsRef implementation do not follow the java 
> Arrays.hashCode implementation, and return incorrect hashcode when filled 
> with 0. 
> For example, an IntsRef with \{ 0 \} will return the same hashcode than an 
> IntsRef with \{ 0, 0 \}.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4919) IntsRef, BytesRef and CharsRef return incorrect hashcode when filled with 0

2013-04-09 Thread Renaud Delbru (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13626454#comment-13626454
 ] 

Renaud Delbru commented on LUCENE-4919:
---

Hi Robert,

From my understanding, this applies only to BytesRef (even if this behavior 
sounds dangerous to me). However, why do IntsRef and CharsRef follow the same 
behavior?

> IntsRef, BytesRef and CharsRef return incorrect hashcode when filled with 0
> ---
>
> Key: LUCENE-4919
> URL: https://issues.apache.org/jira/browse/LUCENE-4919
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/other
>Affects Versions: 4.2
>Reporter: Renaud Delbru
> Fix For: 4.3
>
> Attachments: LUCENE-4919.patch
>
>
> IntsRef, BytesRef and CharsRef implementation do not follow the java 
> Arrays.hashCode implementation, and return incorrect hashcode when filled 
> with 0. 
> For example, an IntsRef with \{ 0 \} will return the same hashcode than an 
> IntsRef with \{ 0, 0 \}.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4919) IntsRef, BytesRef and CharsRef return incorrect hashcode when filled with 0

2013-04-09 Thread Renaud Delbru (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13626458#comment-13626458
 ] 

Renaud Delbru commented on LUCENE-4919:
---

I see that BytesRef is used widely, in contexts quite different from the 
TermsHash context. This hashcode behavior might cause unexpected problems, as I 
am sure most users of BytesRef are unaware of it.

> IntsRef, BytesRef and CharsRef return incorrect hashcode when filled with 0
> ---
>
> Key: LUCENE-4919
> URL: https://issues.apache.org/jira/browse/LUCENE-4919
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/other
>Affects Versions: 4.2
>Reporter: Renaud Delbru
> Fix For: 4.3
>
> Attachments: LUCENE-4919.patch
>
>
> IntsRef, BytesRef and CharsRef implementation do not follow the java 
> Arrays.hashCode implementation, and return incorrect hashcode when filled 
> with 0. 
> For example, an IntsRef with \{ 0 \} will return the same hashcode than an 
> IntsRef with \{ 0, 0 \}.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4919) IntsRef, BytesRef and CharsRef return incorrect hashcode when filled with 0

2013-04-09 Thread Renaud Delbru (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13626471#comment-13626471
 ] 

Renaud Delbru commented on LUCENE-4919:
---

Ok, I understand, Robert. That sounds like a big task. I can try to make a first 
pass over it in the next few days if you think it is worth it (personally, I 
would feel more reassured knowing that the hashcode follows a more common 
behavior).

> IntsRef, BytesRef and CharsRef return incorrect hashcode when filled with 0
> ---
>
> Key: LUCENE-4919
> URL: https://issues.apache.org/jira/browse/LUCENE-4919
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/other
>Affects Versions: 4.2
>Reporter: Renaud Delbru
> Fix For: 4.3
>
> Attachments: LUCENE-4919.patch
>
>
> IntsRef, BytesRef and CharsRef implementation do not follow the java 
> Arrays.hashCode implementation, and return incorrect hashcode when filled 
> with 0. 
> For example, an IntsRef with \{ 0 \} will return the same hashcode than an 
> IntsRef with \{ 0, 0 \}.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4919) IntsRef, BytesRef and CharsRef return incorrect hashcode when filled with 0

2013-04-09 Thread Renaud Delbru (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13626477#comment-13626477
 ] 

Renaud Delbru commented on LUCENE-4919:
---

@Simon: I discovered the issue when using IntsRef: during query processing, I 
am streaming arrays of integers using IntsRef. I was relying on the hashCode to 
compute a unique identifier for the content of a particular IntsRef until I 
started to see unexpected results in my unit tests. Then I saw that the same 
behaviour is found in the other *Ref classes.
I could live without it and bypass the problem by changing my implementation 
(and computing my own hash code myself). But I thought this behaviour is not 
very clear to users, could be potentially dangerous, and was therefore worth 
sharing with you.
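
For completeness, a hedged sketch of the kind of workaround mentioned above, 
computing a length-sensitive hash over the valid slice instead of relying on 
IntsRef.hashCode() (IntsRefHashUtil is a made-up helper name; the field names 
follow the public IntsRef layout):

{code:java}
import java.util.Arrays;
import org.apache.lucene.util.IntsRef;

public final class IntsRefHashUtil {

  // Hash only the valid slice [offset, offset + length) with Arrays.hashCode
  // semantics, so all-zero slices of different lengths get different hashes.
  public static int sliceHash(IntsRef ref) {
    return Arrays.hashCode(
        Arrays.copyOfRange(ref.ints, ref.offset, ref.offset + ref.length));
  }
}
{code}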

> IntsRef, BytesRef and CharsRef return incorrect hashcode when filled with 0
> ---
>
> Key: LUCENE-4919
> URL: https://issues.apache.org/jira/browse/LUCENE-4919
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/other
>Affects Versions: 4.2
>Reporter: Renaud Delbru
> Fix For: 4.3
>
> Attachments: LUCENE-4919.patch
>
>
> IntsRef, BytesRef and CharsRef implementation do not follow the java 
> Arrays.hashCode implementation, and return incorrect hashcode when filled 
> with 0. 
> For example, an IntsRef with \{ 0 \} will return the same hashcode than an 
> IntsRef with \{ 0, 0 \}.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4919) IntsRef, BytesRef and CharsRef return incorrect hashcode when filled with 0

2013-04-09 Thread Renaud Delbru (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13626480#comment-13626480
 ] 

Renaud Delbru commented on LUCENE-4919:
---

Maybe a simpler solution would be to clearly state this behavior in the javadoc 
of all the affected methods.

> IntsRef, BytesRef and CharsRef return incorrect hashcode when filled with 0
> ---
>
> Key: LUCENE-4919
> URL: https://issues.apache.org/jira/browse/LUCENE-4919
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/other
>Affects Versions: 4.2
>Reporter: Renaud Delbru
> Fix For: 4.3
>
> Attachments: LUCENE-4919.patch
>
>
> IntsRef, BytesRef and CharsRef implementation do not follow the java 
> Arrays.hashCode implementation, and return incorrect hashcode when filled 
> with 0. 
> For example, an IntsRef with \{ 0 \} will return the same hashcode than an 
> IntsRef with \{ 0, 0 \}.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4919) IntsRef, BytesRef and CharsRef return incorrect hashcode when filled with 0

2013-04-09 Thread Renaud Delbru (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13626486#comment-13626486
 ] 

Renaud Delbru commented on LUCENE-4919:
---

I agree with you, Dawid, but this particular behaviour increases the chance of 
getting the same hash for certain types of input. Anyway, I think the general 
decision is not to change their hashCode behaviour ;o), and I am fine with it. 
Feel free to close the issue.
Thanks, and sorry for the distraction.

> IntsRef, BytesRef and CharsRef return incorrect hashcode when filled with 0
> ---
>
> Key: LUCENE-4919
> URL: https://issues.apache.org/jira/browse/LUCENE-4919
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/other
>Affects Versions: 4.2
>Reporter: Renaud Delbru
> Fix For: 4.3
>
> Attachments: LUCENE-4919.patch
>
>
> IntsRef, BytesRef and CharsRef implementation do not follow the java 
> Arrays.hashCode implementation, and return incorrect hashcode when filled 
> with 0. 
> For example, an IntsRef with \{ 0 \} will return the same hashcode than an 
> IntsRef with \{ 0, 0 \}.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Closed] (LUCENE-4919) IntsRef, BytesRef and CharsRef return incorrect hashcode when filled with 0

2013-04-09 Thread Renaud Delbru (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Renaud Delbru closed LUCENE-4919.
-

Resolution: Not A Problem

> IntsRef, BytesRef and CharsRef return incorrect hashcode when filled with 0
> ---
>
> Key: LUCENE-4919
> URL: https://issues.apache.org/jira/browse/LUCENE-4919
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/other
>Affects Versions: 4.2
>Reporter: Renaud Delbru
> Fix For: 4.3
>
> Attachments: LUCENE-4919.patch
>
>
> IntsRef, BytesRef and CharsRef implementation do not follow the java 
> Arrays.hashCode implementation, and return incorrect hashcode when filled 
> with 0. 
> For example, an IntsRef with \{ 0 \} will return the same hashcode than an 
> IntsRef with \{ 0, 0 \}.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4642) TokenizerFactory should provide a create method with a given AttributeSource

2013-02-14 Thread Renaud Delbru (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13578331#comment-13578331
 ] 

Renaud Delbru commented on LUCENE-4642:
---

Hi, would this patch be considered for inclusion at some point in time? Thanks.

> TokenizerFactory should provide a create method with a given AttributeSource
> 
>
> Key: LUCENE-4642
> URL: https://issues.apache.org/jira/browse/LUCENE-4642
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Affects Versions: 4.1
>Reporter: Renaud Delbru
>Assignee: Steve Rowe
>  Labels: analysis, attribute, tokenizer
> Fix For: 4.2, 5.0
>
> Attachments: LUCENE-4642.patch, LUCENE-4642.patch, LUCENE-4642.patch, 
> TrieTokenizerFactory.java.patch
>
>
> All tokenizer implementations have a constructor that takes a given 
> AttributeSource as parameter (LUCENE-1826). However, the TokenizerFactory 
> does not provide an API to create tokenizers with a given AttributeSource.
> Side note: There are still a lot of tokenizers that do not provide 
> constructors that take AttributeSource and AttributeFactory.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4642) TokenizerFactory should provide a create method with a given AttributeSource

2013-02-26 Thread Renaud Delbru (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13587269#comment-13587269
 ] 

Renaud Delbru commented on LUCENE-4642:
---

Hi, any updates on the patch? Thanks.

> TokenizerFactory should provide a create method with a given AttributeSource
> 
>
> Key: LUCENE-4642
> URL: https://issues.apache.org/jira/browse/LUCENE-4642
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Affects Versions: 4.1
>Reporter: Renaud Delbru
>Assignee: Steve Rowe
>  Labels: analysis, attribute, tokenizer
> Fix For: 4.2, 5.0
>
> Attachments: LUCENE-4642.patch, LUCENE-4642.patch, LUCENE-4642.patch, 
> TrieTokenizerFactory.java.patch
>
>
> All tokenizer implementations have a constructor that takes a given 
> AttributeSource as parameter (LUCENE-1826). However, the TokenizerFactory 
> does not provide an API to create tokenizers with a given AttributeSource.
> Side note: There are still a lot of tokenizers that do not provide 
> constructors that take AttributeSource and AttributeFactory.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4642) TokenizerFactory should provide a create method with a given AttributeSource

2013-03-11 Thread Renaud Delbru (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13598688#comment-13598688
 ] 

Renaud Delbru commented on LUCENE-4642:
---

Hi Steve, I imagine things have been busy these past few days with the 4.2 
release. Would you need help finalising this patch? Thanks.

> TokenizerFactory should provide a create method with a given AttributeSource
> 
>
> Key: LUCENE-4642
> URL: https://issues.apache.org/jira/browse/LUCENE-4642
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Affects Versions: 4.1
>Reporter: Renaud Delbru
>Assignee: Steve Rowe
>  Labels: analysis, attribute, tokenizer
> Fix For: 4.2, 5.0
>
> Attachments: LUCENE-4642.patch, LUCENE-4642.patch, LUCENE-4642.patch, 
> TrieTokenizerFactory.java.patch
>
>
> All tokenizer implementations have a constructor that takes a given 
> AttributeSource as parameter (LUCENE-1826). However, the TokenizerFactory 
> does not provide an API to create tokenizers with a given AttributeSource.
> Side note: There are still a lot of tokenizers that do not provide 
> constructors that take AttributeSource and AttributeFactory.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org