[jira] [Commented] (SOLR-1535) Pre-analyzed field type

2012-04-19 Thread Andrzej Bialecki (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13257311#comment-13257311
 ] 

Andrzej Bialecki  commented on SOLR-1535:
-

The latest patch implements the requested improvements. If there are no objections 
I'd like to commit it shortly, and track further improvements as separate 
issues.

> Pre-analyzed field type
> ---
>
> Key: SOLR-1535
> URL: https://issues.apache.org/jira/browse/SOLR-1535
> Project: Solr
>  Issue Type: New Feature
>Affects Versions: 1.5
>Reporter: Andrzej Bialecki 
> Fix For: 4.0
>
> Attachments: SOLR-1535.patch, SOLR-1535.patch, SOLR-1535.patch, 
> preanalyzed.patch, preanalyzed.patch
>
>
> PreAnalyzedFieldType provides functionality to index (and optionally store) 
> content that was already processed and split into tokens using some external 
> processing chain. This implementation defines a serialization format for 
> sending tokens with any currently supported Attributes (e.g. type, posIncr, 
> payload, ...). This data is de-serialized into a regular TokenStream that is 
> returned in Field.tokenStreamValue() and thus added to the index as index 
> terms; optionally, a stored part is returned in Field.stringValue() and is 
> then added as the stored value of the field.
> This field type is useful for integrating Solr with existing text-processing 
> pipelines, such as third-party NLP systems.
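
For illustration, a minimal sketch, assuming plain Lucene 4.x analysis APIs, of the mechanism this field type builds on: a TokenStream that replays tokens produced by an external pipeline. The ExternalToken holder and all names below are hypothetical, not part of the patch.
{code}
import java.util.Iterator;
import java.util.List;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

// Hypothetical holder for a token produced by an external NLP pipeline.
class ExternalToken {
  final String term;
  final int start, end, posIncr;
  ExternalToken(String term, int start, int end, int posIncr) {
    this.term = term; this.start = start; this.end = end; this.posIncr = posIncr;
  }
}

// Replays pre-computed tokens; such a stream can be handed to a Field so the
// tokens are indexed without re-analysis.
final class PreTokenizedStream extends TokenStream {
  private final Iterator<ExternalToken> it;
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);
  private final PositionIncrementAttribute posIncrAtt = addAttribute(PositionIncrementAttribute.class);

  PreTokenizedStream(List<ExternalToken> tokens) { this.it = tokens.iterator(); }

  @Override
  public boolean incrementToken() {
    if (!it.hasNext()) return false;
    clearAttributes();
    ExternalToken t = it.next();
    termAtt.setEmpty().append(t.term);
    offsetAtt.setOffset(t.start, t.end);
    posIncrAtt.setPositionIncrement(t.posIncr);
    return true;
  }
}
{code}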

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-3139) StreamingUpdateSolrServer doesn't send UpdateRequest.getParams()

2012-04-13 Thread Andrzej Bialecki (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13253370#comment-13253370
 ] 

Andrzej Bialecki  commented on SOLR-3139:
-

+1 looks good to me.

> StreamingUpdateSolrServer doesn't send UpdateRequest.getParams()
> 
>
> Key: SOLR-3139
> URL: https://issues.apache.org/jira/browse/SOLR-3139
> Project: Solr
>  Issue Type: Bug
>  Components: clients - java
>Reporter: Andrzej Bialecki 
>Assignee: Sami Siren
> Fix For: 4.0
>
> Attachments: SOLR-3139.patch
>
>
> CommonsHttpSolrServer properly encodes the request's SolrParams depending on 
> GET or POST. However, StreamingUpdateSolrServer only looks at the params to 
> determine whether they contain optimize/commit ops, and otherwise discards 
> them.
> This is unexpected - it should properly encode and send SolrParams per 
> request, just as CommonsHttpSolrServer does. Currently this bug 
> prevents one from e.g. selecting a different update chain per request when 
> using the streaming server.
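
For reference, the blocked use case looks roughly like this in SolrJ (a sketch; "mychain" is a hypothetical update chain name defined in solrconfig.xml):
{code}
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.request.UpdateRequest;
import org.apache.solr.common.SolrInputDocument;

// Attach per-request params to an update request.
void updateWithChain(SolrServer server, SolrInputDocument doc) throws Exception {
  UpdateRequest req = new UpdateRequest();
  req.add(doc);
  req.setParam("update.chain", "mychain"); // silently discarded by StreamingUpdateSolrServer before the fix
  req.process(server);                     // works as expected with CommonsHttpSolrServer
}
{code}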

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-1535) Pre-analyzed field type

2012-04-13 Thread Andrzej Bialecki (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13253363#comment-13253363
 ] 

Andrzej Bialecki  commented on SOLR-1535:
-

Nice idea about a pluggable format... Hmm. This should then be specified in the 
field type definition, I think, and not in a preamble of the data itself (the 
UTF BOM mess comes to mind). I can implement the JSON version, and the current 
"simple" format, each with a version attrib.

New patch coming soon.

> Pre-analyzed field type
> ---
>
> Key: SOLR-1535
> URL: https://issues.apache.org/jira/browse/SOLR-1535
> Project: Solr
>  Issue Type: New Feature
>Affects Versions: 1.5
>Reporter: Andrzej Bialecki 
> Fix For: 4.0
>
> Attachments: SOLR-1535.patch, SOLR-1535.patch, preanalyzed.patch, 
> preanalyzed.patch
>
>
> PreAnalyzedFieldType provides a functionality to index (and optionally store) 
> content that was already processed and split into tokens using some external 
> processing chain. This implementation defines a serialization format for 
> sending tokens with any currently supported Attributes (eg. type, posIncr, 
> payload, ...). This data is de-serialized into a regular TokenStream that is 
> returned in Field.tokenStreamValue() and thus added to the index as index 
> terms, and optionally a stored part that is returned in Field.stringValue() 
> and is then added as a stored value of the field.
> This field type is useful for integrating Solr with existing text-processing 
> pipelines, such as third-party NLP systems.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3934) Residual IDF calculation in the pruning package is wrong

2012-03-29 Thread Andrzej Bialecki (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13241475#comment-13241475
 ] 

Andrzej Bialecki  commented on LUCENE-3934:
---

Eh, it's even worse - the 
[paper|http://www.dc.fi.udc.es/~barreiro/publications/blanco_barreiro_ecir2007.pdf]
 that we used as a reference is buggy itself :) or at least misleading.

Formula 1, which supposedly gives the Robertson-Sparck-Jones normalization of idf, 
should really read (according to 
[its authors|http://terrierteam.dcs.gla.ac.uk/publications/rtlo_DIRpaper.pdf]):
{code}
IDF = log ( ((D - df) + 0.5) / (df + 0.5) )

  or: IDF = - log ( (df + 0.5) / ((D - df) + 0.5) )
{code}
As it's presented in the Blanco-Barreiro paper it would be invalid (for some 
values the argument to log() would be negative).

At this point I wasn't sure about Formula 2 in Blanco-Barreiro, because 
going by the definition it should be the difference between the observed IDF - 
that is, the one calculated in Formula 1 - and an expected estimate 
based on a Poisson model, denoted expIDF, whereas Formula 2 seemed 
different... After searching the literature for a while I found 
[another paper|http://www.cstr.ed.ac.uk/downloads/publications/2007/48920155.pdf] 
by Murray-Renals, where a formula for RIDF is presented clearly enough 
for math-challenged people like me:
{code}
expIDF = - log ( 1 - e^(-totalFreq/D) )
RIDF = IDF - expIDF
{code}
So, to summarize, Formula 2 in the Blanco-Barreiro paper should look 
something like this:
{code}
RIDF = log(((D - df) + 0.5) / (df + 0.5)) + log( 1 - e^(-totalFreq/D) )

   or: RIDF = -log((df + 0.5) / ((D - df) + 0.5)) + log( 1 - e^(-totalFreq/D) )

{code}
Now, comparing with the original formula from the Blanco-Barreiro paper we can 
clearly see that it is similar, but differs in the way it calculates IDF:
{code}
RIDF = - log(df/D) + log(1 - e^(-totalFreq/D))   (Formula 2)
{code}
This means that even though they mention the Robertson-Sparck-Jones 
normalization, they don't actually use it (and neither do Murray and Renals in 
their paper).

Bottom line: I think Formula 2 is correct, and our code has to be fixed. 
A patch is coming shortly - I need to write a unit test first.
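
A direct transcription of the corrected formula, as a sketch for the unit test (not the final patch):
{code}
// D = total number of docs, df = document frequency of the term,
// totalFreq = total number of occurrences of the term in the corpus.
static double ridf(long D, long df, long totalFreq) {
  double idf = -Math.log((double) df / D);                          // Formula 2's idf term
  double expIdf = -Math.log(1 - Math.exp(-(double) totalFreq / D)); // Poisson-expected idf
  return idf - expIdf;
}
{code}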

> Residual IDF calculation in the pruning package is wrong
> 
>
> Key: LUCENE-3934
> URL: https://issues.apache.org/jira/browse/LUCENE-3934
> Project: Lucene - Java
>  Issue Type: Bug
>Affects Versions: 3.5, 3.6
>Reporter: Andrzej Bialecki 
>Assignee: Andrzej Bialecki 
>
> As discussed on the mailing list 
> (http://markmail.org/message/cwnyfqmet3wophec) there seems to be a bug in 
> both the formula and in the way RIDF is calculated. The formula is missing a 
> minus, but also the calculation uses local (in-document) term frequency 
> instead of the total term frequency (sum of all term occurrences in a corpus).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3854) Non-tokenized fields become tokenized when a document is deleted and added back

2012-03-06 Thread Andrzej Bialecki (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13223470#comment-13223470
 ] 

Andrzej Bialecki  commented on LUCENE-3854:
---

bq. There isn't a better way to attack that problem in 4.0, is there?

Not yet - LUCENE-3837 is still in early stages.

> Non-tokenized fields become tokenized when a document is deleted and added 
> back
> ---
>
> Key: LUCENE-3854
> URL: https://issues.apache.org/jira/browse/LUCENE-3854
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: core/index
>Affects Versions: 4.0
>Reporter: Benson Margulies
>
> https://github.com/bimargulies/lucene-4-update-case is a JUnit test case that 
> seems to show a problem with the current trunk. It creates a document with a 
> Field typed as StringField.TYPE_STORED and a value with a "-" in it. A 
> TermQuery can find the value, initially, since the field is not tokenized.
> Then, the case reads the Document back out through a reader. In the copy of 
> the Document that gets read out, the Field now has the tokenized bit turned 
> on. 
> Next, the case deletes and adds the Document. The 'tokenized' bit is 
> respected, so now the field gets tokenized, and the result is that the query 
> on the term with the - in it no longer works.
> So I think that the defect here is in the code that reconstructs the Document 
> when read from the index, and which turns on the tokenized bit.
> I have an ICLA on file so you can take this code from github, but if you 
> prefer I can also attach it here.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3854) Non-tokenized fields become tokenized when a document is deleted and added back

2012-03-06 Thread Andrzej Bialecki (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13223361#comment-13223361
 ] 

Andrzej Bialecki  commented on LUCENE-3854:
---

+1, though separate classes for input / output documents would be better. Solr 
uses SolrInputDocument for input and SolrDocument for output, and obviously 
they are not interchangeable.

> Non-tokenized fields become tokenized when a document is deleted and added 
> back
> ---
>
> Key: LUCENE-3854
> URL: https://issues.apache.org/jira/browse/LUCENE-3854
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: core/index
>Affects Versions: 4.0
>Reporter: Benson Margulies
>
> https://github.com/bimargulies/lucene-4-update-case is a JUnit test case that 
> seems to show a problem with the current trunk. It creates a document with a 
> Field typed as StringField.TYPE_STORED and a value with a "-" in it. A 
> TermQuery can find the value, initially, since the field is not tokenized.
> Then, the case reads the Document back out through a reader. In the copy of 
> the Document that gets read out, the Field now has the tokenized bit turned 
> on. 
> Next, the case deletes and adds the Document. The 'tokenized' bit is 
> respected, so now the field gets tokenized, and the result is that the query 
> on the term with the - in it no longer works.
> So I think that the defect here is in the code that reconstructs the Document 
> when read from the index, and which turns on the tokenized bit.
> I have an ICLA on file so you can take this code from github, but if you 
> prefer I can also attach it here.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3854) Non-tokenized fields become tokenized when a document is deleted and added back

2012-03-06 Thread Andrzej Bialecki (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13223309#comment-13223309
 ] 

Andrzej Bialecki  commented on LUCENE-3854:
---

I suspect the problem lies in DocumentStoredFieldVisitor.stringField(...). It 
uses FieldInfo to populate the FieldType of the retrieved field, and there is no 
information there about tokenization (so it assumes true by default). AFAIK 
the information about tokenization is lost once the document is indexed, so 
it's not possible to retrieve it back, hence the use of a default value.
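
A sketch of the pitfall, assuming the trunk APIs of the time - the workaround is to rebuild the field with its original type instead of round-tripping the retrieved copy:
{code}
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

void reAdd(IndexReader reader, IndexWriter writer, int docId) throws Exception {
  Document stored = reader.document(docId); // reconstructed copy: 'tokenized' defaults to true
  // writer.updateDocument(term, stored);   // BAD: the "-" value would now get tokenized
  Document fresh = new Document();          // rebuild with the original field type instead
  fresh.add(new StringField("id", stored.get("id"), Field.Store.YES));
  writer.updateDocument(new Term("id", stored.get("id")), fresh);
}
{code}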

> Non-tokenized fields become tokenized when a document is deleted and added 
> back
> ---
>
> Key: LUCENE-3854
> URL: https://issues.apache.org/jira/browse/LUCENE-3854
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: core/index
>Affects Versions: 4.0
>Reporter: Benson Margulies
>
> https://github.com/bimargulies/lucene-4-update-case is a JUnit test case that 
> seems to show a problem with the current trunk. It creates a document with a 
> Field typed as StringField.TYPE_STORED and a value with a "-" in it. A 
> TermQuery can find the value, initially, since the field is not tokenized.
> Then, the case reads the Document back out through a reader. In the copy of 
> the Document that gets read out, the Field now has the tokenized bit turned 
> on. 
> Next, the case deletes and adds the Document. The 'tokenized' bit is 
> respected, so now the field gets tokenized, and the result is that the query 
> on the term with the - in it no longer works.
> So I think that the defect here is in the code that reconstructs the Document 
> when read from the index, and which turns on the tokenized bit.
> I have an ICLA on file so you can take this code from github, but if you 
> prefer I can also attach it here.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3837) A modest proposal for updateable fields

2012-03-01 Thread Andrzej Bialecki (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13220420#comment-13220420
 ] 

Andrzej Bialecki  commented on LUCENE-3837:
---

bq. I guess that your progress is better than no progress at all.
That's my perspective too, and it's reflected in the title of this issue... I 
remember your description and in fact my proposal is somewhat similar. It does 
not use PQs, but indeed it merges updates on the fly, at the cost of keeping a 
static map of primary->secondary ids and random seeking in the secondary index 
to retrieve matching data. Please check the description above. And then once a 
segment merge is executed the overlay data will be integrated into the main 
data, because the merge process will pull in this mix of new and old without 
being aware of it - it will be hidden by Codec's read formats. Codec 
abstractions are great for this kind of manipulation.
bq. One comment – when the updates are collapsed, they may not just simply 
'replace' what exists before them.
Right, old data will be returned if not overlaid by new data, meaning that e.g. 
old stored field values will be returned for all fields except the 
updated field, and for that field the data from the overlay will be returned.
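
A sketch of that read path (every name here is hypothetical - this is the design under discussion, not committed code):
{code}
import java.io.IOException;
import java.util.Map;
import org.apache.lucene.index.IndexReader;

class OverlayResolver {
  IndexReader primaryReader, overlayReader;
  Map<Integer, Integer> idMap; // primary docId -> overlay docId

  String storedValue(int primaryDoc, String field) throws IOException {
    Integer overlayDoc = idMap.get(primaryDoc);
    if (overlayDoc != null) {
      String updated = overlayReader.document(overlayDoc).get(field);
      if (updated != null) return updated; // field was updated: overlay wins
    }
    return primaryReader.document(primaryDoc).get(field); // otherwise the old data
  }
}
{code}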

> A modest proposal for updateable fields
> ---
>
> Key: LUCENE-3837
> URL: https://issues.apache.org/jira/browse/LUCENE-3837
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: core/index
>Affects Versions: 4.0
>Reporter: Andrzej Bialecki 
>
> I'd like to propose a simple design for implementing updateable fields in 
> Lucene. This design has some limitations, so I'm not claiming it will be 
> appropriate for every use case, and it's obvious it has some performance 
> consequences, but at least it's a start...
> This proposal uses a concept of "overlays" or "stacked updates", where the 
> original data is not removed but instead it's overlaid with the new data. I 
> propose to reuse as much of the existing APIs as possible, and represent 
> updates as an IndexReader. Updates to documents in a specific segment would 
> be collected in an "overlay" index specific to that segment, i.e. there would 
> be as many overlay indexes as there are segments in the primary index. 
> A field update would be represented as a new document in the overlay index. 
> The document would consist of just the updated fields, plus a field that 
> records the id in the primary segment of the document affected by the update. 
> These updates would be processed as usual via secondary IndexWriter-s, as 
> many as there are primary segments, so the same analysis chains would be 
> used, the same field types, etc.
> On opening a segment with updates the SegmentReader (see also LUCENE-3836) 
> would check for the presence of the "overlay" index, and if so it would open 
> it first (as an AtomicReader? or it would open individual codec format 
> readers? perhaps it should load the whole thing into memory?), and it would 
> construct an in-memory map between the primary's docId-s and the overlay's 
> docId-s. And finally it would wrap the original format readers with "overlay 
> readers", initialized also with the id map.
> Now, when consumers of the 4D API would ask for specific data, the "overlay 
> readers" would first re-map the primary's docId to the overlay's docId, and 
> check whether overlay data exists for that docId and this type of data (e.g. 
> postings, stored fields, vectors) and return this data instead of the 
> original. Otherwise they would return the original data.
> One obvious performance issue with this approach is that the sequential 
> access to primary data would translate into random access to the overlay 
> data. This could be solved by sorting the overlay index so that at least the 
> overlay ids increase monotonically as primary ids do.
> Updates to the primary index would be handled as usual, i.e. segment merges 
> (since the segments with updates would pretend to have no overlays) would just 
> work as usual, only the overlay index would have to be deleted once the 
> primary segment is deleted after merge.
> Updates to the existing documents that already had some fields updated would 
> be again handled as usual, only underneath they would open an IndexWriter on 
> the overlay index for a specific segment.
> That's the broad idea. Feel free to pipe in - I started some coding at the 
> codec level but got stuck using the approach in LUCENE-3836. The approach 
> that uses a modified SegmentReader seems more promising.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (LUCENE-3837) A modest proposal for updateable fields

2012-03-01 Thread Andrzej Bialecki (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13220315#comment-13220315
 ] 

Andrzej Bialecki  commented on LUCENE-3837:
---

bq. Could we use the actual docID (ie same docID as the base segment)?
Updates may arrive out of order, so the updates will naturally get different 
internal IDs (also, if you wanted to use the same ids they would have gaps). I 
don't know if various parts of Lucene can handle out of order ids coming from 
iterators? If we wanted to match the ids early then we would have to sort them, 
a la IndexSorter, on every flush and on every merge, which seems too costly. 
So, a re-mapping structure seems like a decent compromise. Yes, it could be 
large - we could put artificial limits on the number of updates before we do a 
merge.
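
A sketch of one possible remap structure (hypothetical, just to show the cost involved): keep the primary ids sorted so lookup is a binary search, while the overlay ids stay in arrival order.
{code}
import java.util.Arrays;

class IdMap {
  private final int[] primaryIds; // sorted ascending
  private final int[] overlayIds; // parallel to primaryIds

  IdMap(int[] primaryIds, int[] overlayIds) {
    this.primaryIds = primaryIds;
    this.overlayIds = overlayIds;
  }

  /** Returns the overlay docId for a primary docId, or -1 if no update exists. */
  int overlayId(int primaryId) {
    int idx = Arrays.binarySearch(primaryIds, primaryId);
    return idx >= 0 ? overlayIds[idx] : -1;
  }
}
{code}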

bq. Also, can't we directly write the stacked segments ourselves? (Ie, within a 
single IW).
I don't know, it didn't seem likely to me - AFAIK IW operates on a single 
segment before flushing it? And updates could refer to docs outside the current 
segment.

> A modest proposal for updateable fields
> ---
>
> Key: LUCENE-3837
> URL: https://issues.apache.org/jira/browse/LUCENE-3837
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: core/index
>Affects Versions: 4.0
>Reporter: Andrzej Bialecki 
>
> I'd like to propose a simple design for implementing updateable fields in 
> Lucene. This design has some limitations, so I'm not claiming it will be 
> appropriate for every use case, and it's obvious it has some performance 
> consequences, but at least it's a start...
> This proposal uses a concept of "overlays" or "stacked updates", where the 
> original data is not removed but instead it's overlaid with the new data. I 
> propose to reuse as much of the existing APIs as possible, and represent 
> updates as an IndexReader. Updates to documents in a specific segment would 
> be collected in an "overlay" index specific to that segment, i.e. there would 
> be as many overlay indexes as there are segments in the primary index. 
> A field update would be represented as a new document in the overlay index. 
> The document would consist of just the updated fields, plus a field that 
> records the id in the primary segment of the document affected by the update. 
> These updates would be processed as usual via secondary IndexWriter-s, as 
> many as there are primary segments, so the same analysis chains would be 
> used, the same field types, etc.
> On opening a segment with updates the SegmentReader (see also LUCENE-3836) 
> would check for the presence of the "overlay" index, and if so it would open 
> it first (as an AtomicReader? or it would open individual codec format 
> readers? perhaps it should load the whole thing into memory?), and it would 
> construct an in-memory map between the primary's docId-s and the overlay's 
> docId-s. And finally it would wrap the original format readers with "overlay 
> readers", initialized also with the id map.
> Now, when consumers of the 4D API would ask for specific data, the "overlay 
> readers" would first re-map the primary's docId to the overlay's docId, and 
> check whether overlay data exists for that docId and this type of data (e.g. 
> postings, stored fields, vectors) and return this data instead of the 
> original. Otherwise they would return the original data.
> One obvious performance issue with this approach is that the sequential 
> access to primary data would translate into random access to the overlay 
> data. This could be solved by sorting the overlay index so that at least the 
> overlay ids increase monotonically as primary ids do.
> Updates to the primary index would be handled as usual, i.e. segment merges 
> (since the segments with updates would pretend to have no overlays) would just 
> work as usual, only the overlay index would have to be deleted once the 
> primary segment is deleted after merge.
> Updates to the existing documents that already had some fields updated would 
> be again handled as usual, only underneath they would open an IndexWriter on 
> the overlay index for a specific segment.
> That's the broad idea. Feel free to pipe in - I started some coding at the 
> codec level but got stuck using the approach in LUCENE-3836. The approach 
> that uses a modified SegmentReader seems more promising.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-3837) A modest proposal for updateable fields

2012-03-01 Thread Andrzej Bialecki (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13220300#comment-13220300
 ] 

Andrzej Bialecki  commented on LUCENE-3837:
---

That was my point - we should be able to come up with estimates that yield 
"slightly wrong yet consistent" stats. I don't know the details of the new 
similarities, so it's up to you Robert to come up with suggestions :)

> A modest proposal for updateable fields
> ---
>
> Key: LUCENE-3837
> URL: https://issues.apache.org/jira/browse/LUCENE-3837
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: core/index
>Affects Versions: 4.0
>Reporter: Andrzej Bialecki 
>
> I'd like to propose a simple design for implementing updateable fields in 
> Lucene. This design has some limitations, so I'm not claiming it will be 
> appropriate for every use case, and it's obvious it has some performance 
> consequences, but at least it's a start...
> This proposal uses a concept of "overlays" or "stacked updates", where the 
> original data is not removed but instead it's overlaid with the new data. I 
> propose to reuse as much of the existing APIs as possible, and represent 
> updates as an IndexReader. Updates to documents in a specific segment would 
> be collected in an "overlay" index specific to that segment, i.e. there would 
> be as many overlay indexes as there are segments in the primary index. 
> A field update would be represented as a new document in the overlay index. 
> The document would consist of just the updated fields, plus a field that 
> records the id in the primary segment of the document affected by the update. 
> These updates would be processed as usual via secondary IndexWriter-s, as 
> many as there are primary segments, so the same analysis chains would be 
> used, the same field types, etc.
> On opening a segment with updates the SegmentReader (see also LUCENE-3836) 
> would check for the presence of the "overlay" index, and if so it would open 
> it first (as an AtomicReader? or it would open individual codec format 
> readers? perhaps it should load the whole thing into memory?), and it would 
> construct an in-memory map between the primary's docId-s and the overlay's 
> docId-s. And finally it would wrap the original format readers with "overlay 
> readers", initialized also with the id map.
> Now, when consumers of the 4D API would ask for specific data, the "overlay 
> readers" would first re-map the primary's docId to the overlay's docId, and 
> check whether overlay data exists for that docId and this type of data (e.g. 
> postings, stored fields, vectors) and return this data instead of the 
> original. Otherwise they would return the original data.
> One obvious performance issue with this approach is that the sequential 
> access to primary data would translate into random access to the overlay 
> data. This could be solved by sorting the overlay index so that at least the 
> overlay ids increase monotonically as primary ids do.
> Updates to the primary index would be handled as usual, i.e. segment merges 
> (since the segments with updates would pretend to have no overlays) would just 
> work as usual, only the overlay index would have to be deleted once the 
> primary segment is deleted after merge.
> Updates to the existing documents that already had some fields updated would 
> be again handled as usual, only underneath they would open an IndexWriter on 
> the overlay index for a specific segment.
> That's the broad idea. Feel free to pipe in - I started some coding at the 
> codec level but got stuck using the approach in LUCENE-3836. The approach 
> that uses a modified SegmentReader seems more promising.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3837) A modest proposal for updateable fields

2012-03-01 Thread Andrzej Bialecki (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13220194#comment-13220194
 ] 

Andrzej Bialecki  commented on LUCENE-3837:
---

Re 1: I don't think it's such a big deal - we already return approximate stats 
(too-high counts) in the presence of deletes. I think we should go all the way, at 
least initially, and ignore stats from an overlay completely, unless the data 
is present only in the overlay - e.g. for terms not present in the main index.

Re 2: I think that if getArray() is supported then on the first call we have to 
roll in all updates to the main array created from the primary.
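
A sketch of that roll-in (hypothetical names): start from the primary segment's values and overwrite the updated slots once, on the first getArray() call.
{code}
import java.util.Map;

// primaryValues is indexed by primary docId; updates maps docId -> new value.
static int[] rollIn(int[] primaryValues, Map<Integer, Integer> updates) {
  int[] merged = primaryValues.clone();
  for (Map.Entry<Integer, Integer> e : updates.entrySet()) {
    merged[e.getKey()] = e.getValue();
  }
  return merged;
}
{code}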

> A modest proposal for updateable fields
> ---
>
> Key: LUCENE-3837
> URL: https://issues.apache.org/jira/browse/LUCENE-3837
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: core/index
>Affects Versions: 4.0
>Reporter: Andrzej Bialecki 
>
> I'd like to propose a simple design for implementing updateable fields in 
> Lucene. This design has some limitations, so I'm not claiming it will be 
> appropriate for every use case, and it's obvious it has some performance 
> consequences, but at least it's a start...
> This proposal uses a concept of "overlays" or "stacked updates", where the 
> original data is not removed but instead it's overlaid with the new data. I 
> propose to reuse as much of the existing APIs as possible, and represent 
> updates as an IndexReader. Updates to documents in a specific segment would 
> be collected in an "overlay" index specific to that segment, i.e. there would 
> be as many overlay indexes as there are segments in the primary index. 
> A field update would be represented as a new document in the overlay index. 
> The document would consist of just the updated fields, plus a field that 
> records the id in the primary segment of the document affected by the update. 
> These updates would be processed as usual via secondary IndexWriter-s, as 
> many as there are primary segments, so the same analysis chains would be 
> used, the same field types, etc.
> On opening a segment with updates the SegmentReader (see also LUCENE-3836) 
> would check for the presence of the "overlay" index, and if so it would open 
> it first (as an AtomicReader? or it would open individual codec format 
> readers? perhaps it should load the whole thing into memory?), and it would 
> construct an in-memory map between the primary's docId-s and the overlay's 
> docId-s. And finally it would wrap the original format readers with "overlay 
> readers", initialized also with the id map.
> Now, when consumers of the 4D API would ask for specific data, the "overlay 
> readers" would first re-map the primary's docId to the overlay's docId, and 
> check whether overlay data exists for that docId and this type of data (e.g. 
> postings, stored fields, vectors) and return this data instead of the 
> original. Otherwise they would return the original data.
> One obvious performance issue with this approach is that the sequential 
> access to primary data would translate into random access to the overlay 
> data. This could be solved by sorting the overlay index so that at least the 
> overlay ids increase monotonically as primary ids do.
> Updates to the primary index would be handled as usual, i.e. segment merges 
> (since the segments with updates would pretend to have no overlays) would just 
> work as usual, only the overlay index would have to be deleted once the 
> primary segment is deleted after merge.
> Updates to the existing documents that already had some fields updated would 
> be again handled as usual, only underneath they would open an IndexWriter on 
> the overlay index for a specific segment.
> That's the broad idea. Feel free to pipe in - I started some coding at the 
> codec level but got stuck using the approach in LUCENE-3836. The approach 
> that uses a modified SegmentReader seems more promising.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3836) Most Codec.*Format().*Reader() methods should use SegmentReadState

2012-03-01 Thread Andrzej Bialecki (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13220093#comment-13220093
 ] 

Andrzej Bialecki  commented on LUCENE-3836:
---

I think this could work, too - I would instantiate the "overlay" data in 
SegmentReader, and then I'd create the "overlay" codec's format readers in 
SegmentReader, using the original format readers plus the overlay data. I'll 
try this approach ... I'll create a separate issue to discuss this.

Let's close this as won't fix for now.

> Most Codec.*Format().*Reader() methods should use SegmentReadState
> --
>
> Key: LUCENE-3836
> URL: https://issues.apache.org/jira/browse/LUCENE-3836
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: core/codecs
>Reporter: Andrzej Bialecki 
>Assignee: Andrzej Bialecki 
> Fix For: 4.0
>
> Attachments: LUCENE-3836.patch
>
>
> Codec formats API for opening readers is inconsistent - sometimes it uses 
> SegmentReadState, in other cases it uses individual arguments that are 
> already available via SegmentReadState. This complicates extending the API, 
> e.g. if additional per-segment state would need to be passed to the readers.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3836) Most Codec.*Format().*Reader() methods should use SegmentReadState

2012-03-01 Thread Andrzej Bialecki (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13220071#comment-13220071
 ] 

Andrzej Bialecki  commented on LUCENE-3836:
---

I hear you... SegmentWriteState is bad, I agree. But the argument about 
SegmentWriteState is not really applicable to SegmentReadState - write state is 
mutable and can change under your feet whereas SegmentReadState is immutable, 
created once in SegmentReader and used only for initialization of format 
readers. On the other hand, if we insist that we always pass individual 
arguments around then providing some additional segment-global context to 
format readers requires changing method signatures (adding arguments).

The background for this issue is that I started looking at updateable fields, 
where updates are put in a segment (or reader) of its own and they provide an 
"overlay" for the main segment, with a special codec magic to pull and remap 
data from the "overlay" as the main data is accessed. However, in order to do 
that I need to provide this data when format readers are initialized. I can't 
do this when creating a Codec instance (Codec is automatically instantiated) or 
when creating Codec.*Format(), because format instances are usually shared as 
well.

So the idea I had in mind was to use SegmentReadState uniformly, and put this 
overlay data in SegmentReadState so that it's passed down to formats during 
format readers' creation. I'm open to other ideas... :)
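
Purely to illustrate the proposal (no such field exists in trunk - everything below is hypothetical):
{code}
import org.apache.lucene.index.FieldInfos;
import org.apache.lucene.index.SegmentInfo;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.IOContext;

class OverlayData { /* hypothetical: id map plus overlay readers for a segment */ }

// A read-state sketch with one proposed addition: per-segment overlay data
// that format readers would receive at construction time.
class SegmentReadStateSketch {
  final Directory dir;
  final SegmentInfo segmentInfo;
  final FieldInfos fieldInfos;
  final IOContext context;
  final OverlayData overlay; // proposed addition, passed down to format readers

  SegmentReadStateSketch(Directory dir, SegmentInfo si, FieldInfos fi,
                         IOContext ctx, OverlayData overlay) {
    this.dir = dir; this.segmentInfo = si; this.fieldInfos = fi;
    this.context = ctx; this.overlay = overlay;
  }
}
{code}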

> Most Codec.*Format().*Reader() methods should use SegmentReadState
> --
>
> Key: LUCENE-3836
> URL: https://issues.apache.org/jira/browse/LUCENE-3836
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: core/codecs
>Reporter: Andrzej Bialecki 
>Assignee: Andrzej Bialecki 
> Fix For: 4.0
>
> Attachments: LUCENE-3836.patch
>
>
> Codec formats API for opening readers is inconsistent - sometimes it uses 
> SegmentReadState, in other cases it uses individual arguments that are 
> already available via SegmentReadState. This complicates extending the API, 
> e.g. if additional per-segment state would need to be passed to the readers.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3788) Separate getting Directory from IndexReader from its concrete subclasses

2012-02-15 Thread Andrzej Bialecki (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13208407#comment-13208407
 ] 

Andrzej Bialecki  commented on LUCENE-3788:
---

Cool. I really appreciate your help, Uwe. Let's close this as Won't Fix - at 
least now we have a JIRA that explains why this is a bad idea ;)

> Separate getting Directory from IndexReader from its concrete subclasses
> 
>
> Key: LUCENE-3788
> URL: https://issues.apache.org/jira/browse/LUCENE-3788
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Andrzej Bialecki 
> Attachments: LUCENE-3788.patch
>
>
> Currently only subclasses of DirectoryReader expose the underlying Directory 
> via public final directory(). IMHO this aspect should be separated from 
> DirectoryReader so that other IndexReader implementations could expose any 
> underlying Directory if they wanted to. Specifically, I have a use case where 
> I'd like to expose a synthetic Directory view of resources used for 
> ParallelCompositeReader.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3788) Separate getting Directory from IndexReader from its concrete subclasses

2012-02-15 Thread Andrzej Bialecki (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13208374#comment-13208374
 ] 

Andrzej Bialecki  commented on LUCENE-3788:
---

Hi Uwe,

bq. It would be good if you would communicate with Robert about that 
Lucid-internal problem
bq. this issue was created explicitly for Solr and your company.

This is not helpful... this was IMHO a legitimate doubt about the use of trunk 
APIs, and if there's possible confusion about how to use the new IR-s then 
probably I'm not the only one confused - and it's irrelevant whether I'm working 
today for Lucid or for any other company.

bq. but it's easy to solve, I can also help.

And this is helpful. :) Thanks, let's close this issue.

> Separate getting Directory from IndexReader from its concrete subclasses
> 
>
> Key: LUCENE-3788
> URL: https://issues.apache.org/jira/browse/LUCENE-3788
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Andrzej Bialecki 
> Attachments: LUCENE-3788.patch
>
>
> Currently only subclasses of DirectoryReader expose the underlying Directory 
> via public final directory(). IMHO this aspect should be separated from 
> DirectoryReader so that other IndexReader implementations could expose any 
> underlying Directory if they wanted to. Specifically, I have a use case where 
> I'd like to expose a synthetic Directory view of resources used for 
> ParallelCompositeReader.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-2632) FilteringCodec, TeeCodec, TeeDirectory

2012-02-14 Thread Andrzej Bialecki (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13207686#comment-13207686
 ] 

Andrzej Bialecki  commented on LUCENE-2632:
---

bq. the TODO/etc in term vectors makes me wish our codec consumer APIs for 
Fields/TermVectors were more consistent...

Also, the handling of segments.gen and compound files that bypasses the codec 
actually forced me to implement TeeDirectory.

Re. synchronization - yes, many of these should be removed. I synced everything 
for now to narrow down the source of merge problems. TeeCodec.files() - well 
spotted, this should be fixed.

> FilteringCodec, TeeCodec, TeeDirectory
> --
>
> Key: LUCENE-2632
> URL: https://issues.apache.org/jira/browse/LUCENE-2632
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: core/index
>Affects Versions: 4.0
>Reporter: Andrzej Bialecki 
> Attachments: LUCENE-2632.patch, LUCENE-2632.patch
>
>
> This issue adds two new Codec implementations:
> * TeeCodec: there have been attempts in the past to implement parallel 
> writing to multiple indexes so that they are all synchronized. This was 
> however complicated due to the complexity of IndexWriter/SegmentMerger logic. 
> The solution presented here offers similar functionality but works on a 
> different level - as the name suggests, the TeeCodec duplicates index data 
> into multiple output Directories.
> * TeeDirectory (used also in TeeCodec) is a simple abstraction to perform 
> Directory operations on several directories in parallel (effectively 
> mirroring their data). Optionally it's possible to specify a set of suffixes 
> of files that should be mirrored so that non-matching files are skipped.
> * FilteringCodec is related in a remote way to the ideas of index pruning 
> presented in LUCENE-1812 and the concept of tiered search. Since we can use 
> TeeCodec to write to multiple output Directories in a synchronized way, we 
> could also filter out or modify some of the data that is being written. The 
> FilteringCodec provides this functionality, so that you can use it like this:
> {code}
> IndexWriter --> TeeCodec
>  |  |
>  |  +--> StandardCodec --> Directory1
>  +--> FilteringCodec --> StandardCodec --> Directory2
> {code}
> The end result of this chain is two indexes that are kept in sync - one is 
> the full regular index, and the other one is a filtered index.
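
For illustration, the write-mirroring at the heart of TeeDirectory can be sketched as an IndexOutput that forwards every write to two delegates (a conceptual sketch, not the attached patch):
{code}
import java.io.IOException;
import org.apache.lucene.store.IndexOutput;

class TeeIndexOutput extends IndexOutput {
  private final IndexOutput a, b;

  TeeIndexOutput(IndexOutput a, IndexOutput b) { this.a = a; this.b = b; }

  @Override public void writeByte(byte v) throws IOException { a.writeByte(v); b.writeByte(v); }
  @Override public void writeBytes(byte[] buf, int off, int len) throws IOException {
    a.writeBytes(buf, off, len);
    b.writeBytes(buf, off, len);
  }
  @Override public void flush() throws IOException { a.flush(); b.flush(); }
  @Override public void close() throws IOException { a.close(); b.close(); }
  @Override public long getFilePointer() { return a.getFilePointer(); }
  @Override public void seek(long pos) throws IOException { a.seek(pos); b.seek(pos); }
  @Override public long length() throws IOException { return a.length(); }
}
{code}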

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-2632) FilteringCodec, TeeCodec, TeeDirectory

2012-02-14 Thread Andrzej Bialecki (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13207613#comment-13207613
 ] 

Andrzej Bialecki  commented on LUCENE-2632:
---

That's one potential application - I'm sure there are a dozen other things you can 
do with this (e.g. hot backup).

Merges don't work well in this patch yet, which is a problem ;) and also the 
handling of compound files is hackish (I didn't even try to handle the main CFS 
ones, but I added a workaround for the nrm.cfs / cfe files). I'd appreciate 
review and help in debugging/fixing.

> FilteringCodec, TeeCodec, TeeDirectory
> --
>
> Key: LUCENE-2632
> URL: https://issues.apache.org/jira/browse/LUCENE-2632
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: core/index
>Affects Versions: 4.0
>Reporter: Andrzej Bialecki 
> Attachments: LUCENE-2632.patch, LUCENE-2632.patch
>
>
> This issue adds two new Codec implementations:
> * TeeCodec: there have been attempts in the past to implement parallel 
> writing to multiple indexes so that they are all synchronized. This was 
> however complicated due to the complexity of IndexWriter/SegmentMerger logic. 
> The solution presented here offers similar functionality but works on a 
> different level - as the name suggests, the TeeCodec duplicates index data 
> into multiple output Directories.
> * TeeDirectory (used also in TeeCodec) is a simple abstraction to perform 
> Directory operations on several directories in parallel (effectively 
> mirroring their data). Optionally it's possible to specify a set of suffixes 
> of files that should be mirrored so that non-matching files are skipped.
> * FilteringCodec is related in a remote way to the ideas of index pruning 
> presented in LUCENE-1812 and the concept of tiered search. Since we can use 
> TeeCodec to write to multiple output Directories in a synchronized way, we 
> could also filter out or modify some of the data that is being written. The 
> FilteringCodec provides this functionality, so that you can use it like this:
> {code}
> IndexWriter --> TeeCodec
>  |  |
>  |  +--> StandardCodec --> Directory1
>  +--> FilteringCodec --> StandardCodec --> Directory2
> {code}
> The end result of this chain is two indexes that are kept in sync - one is 
> the full regular index, and the other one is a filtered index.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-2593) A new core admin action 'split' for splitting index

2012-02-13 Thread Andrzej Bialecki (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13207452#comment-13207452
 ] 

Andrzej Bialecki  commented on SOLR-2593:
-

Jason, see LUCENE-2632 for a possible way to implement this at the Lucene 
level. Splitting into arbitrary parts has so far required multiple passes over 
the input data; using the approach of tee/filter codecs it's possible to do this 
in one pass.

> A new core admin action 'split' for splitting index
> ---
>
> Key: SOLR-2593
> URL: https://issues.apache.org/jira/browse/SOLR-2593
> Project: Solr
>  Issue Type: New Feature
>Reporter: Noble Paul
> Fix For: 4.0
>
>
> If an index is too large/hot it would be desirable to split it out to another 
> core.
> This core may eventually be replicated out to another host.
> There can be multiple strategies 
> * random split of x or x% 
> * fq="user:johndoe"
> example:
> action=split&split=20percent&newcore=my_new_index
> or
> action=split&fq=user:johndoe&newcore=john_doe_index

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-1535) Pre-analyzed field type

2012-01-31 Thread Andrzej Bialecki (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13197390#comment-13197390
 ] 

Andrzej Bialecki  commented on SOLR-1535:
-

Avro adds yet another dependency, which would make sense if Solr used Avro 
instead of JavaBin - but that's a separate discussion that merits a separate 
JIRA issue... as it isn't used now, I'd rather avoid putting an additional burden 
on clients just for the sake of this patch.

JSON could be a nice alternative, if only it supported binary data natively (it 
doesn't, one has to use base64 - however, it's [not as awful as you might 
think|http://stackoverflow.com/questions/1443158/binary-data-in-json-string-something-better-than-base64]).
 I wanted to avoid complex formats like XML - too much boilerplate for such 
small bits of data. So the current custom serialization tried to strike a 
balance between simplicity, flexibility and low overhead.

Serialization of terms was also discussed in SOLR-1632 - e.g. this patch 
doesn't serialize binary terms properly.

> Pre-analyzed field type
> ---
>
> Key: SOLR-1535
> URL: https://issues.apache.org/jira/browse/SOLR-1535
> Project: Solr
>  Issue Type: New Feature
>Affects Versions: 1.5
>Reporter: Andrzej Bialecki 
> Fix For: 3.6, 4.0
>
> Attachments: SOLR-1535.patch, preanalyzed.patch, preanalyzed.patch
>
>
> PreAnalyzedFieldType provides functionality to index (and optionally store) 
> content that was already processed and split into tokens using some external 
> processing chain. This implementation defines a serialization format for 
> sending tokens with any currently supported Attributes (e.g. type, posIncr, 
> payload, ...). This data is de-serialized into a regular TokenStream that is 
> returned in Field.tokenStreamValue() and thus added to the index as index 
> terms; optionally, a stored part is returned in Field.stringValue() and is 
> then added as the stored value of the field.
> This field type is useful for integrating Solr with existing text-processing 
> pipelines, such as third-party NLP systems.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-1812) Static index pruning by in-document term frequency (Carmel pruning)

2012-01-30 Thread Andrzej Bialecki (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13196385#comment-13196385
 ] 

Andrzej Bialecki  commented on LUCENE-1812:
---

Excellent, thanks for seeing this through!

> Static index pruning by in-document term frequency (Carmel pruning)
> ---
>
> Key: LUCENE-1812
> URL: https://issues.apache.org/jira/browse/LUCENE-1812
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: modules/other
>Reporter: Andrzej Bialecki 
>Assignee: Doron Cohen
> Fix For: 3.6, 4.0
>
> Attachments: pruning.patch, pruning.patch, pruning.patch, 
> pruning.patch, pruning.patch, pruning.patch
>
>
> This module provides tools to produce a subset of input indexes by removing 
> postings data for those terms where their in-document frequency is below a 
> specified threshold. The net effect of this processing is a much smaller 
> index that for common types of queries returns nearly identical top-N results 
> as compared with the original index, but with increased performance. 
> Optionally, stored values and term vectors can also be removed. This 
> functionality is largely independent, so it can be used without term pruning 
> (when term freq. threshold is set to 1).
> As the threshold value increases, the total size of the index decreases, 
> search performance increases, and recall decreases (i.e. search quality 
> deteriorates). NOTE: especially phrase recall deteriorates significantly at 
> higher threshold values. 
> Primary purpose of this class is to produce small first-tier indexes that fit 
> completely in RAM, and store these indexes using 
> IndexWriter.addIndexes(IndexReader[]). Usually the performance of this class 
> will not be sufficient to use the resulting index view for on-the-fly pruning 
> and searching. 
> NOTE: If the input index is optimized (i.e. doesn't contain deletions) then 
> the index produced via IndexWriter.addIndexes(IndexReader[]) will preserve 
> internal document id-s so that they are in sync with the original index. This 
> means that all other auxiliary information not necessary for first-tier 
> processing, such as some stored fields, can also be removed, to be quickly 
> retrieved on-demand from the original index using the same internal document 
> id. 
> Threshold values can be specified globally (for terms in all fields) using 
> the defaultThreshold parameter, and can be overridden using per-field or per-term 
> values supplied in a thresholds map. Keys in this map are either field names, 
> or terms in field:text format. The precedence of these values is the 
> following: first a per-term threshold is used if present, then per-field 
> threshold if present, and finally the default threshold.
> A command-line tool (PruningTool) is provided for convenience. At this moment 
> it doesn't support all of the functionality available through the API.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-1632) Distributed IDF

2012-01-27 Thread Andrzej Bialecki (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13194915#comment-13194915
 ] 

Andrzej Bialecki  commented on SOLR-1632:
-

bq. hopefully we avoid requesting term stats for these?
There is no provision for this yet in the current patch.

> Distributed IDF
> ---
>
> Key: SOLR-1632
> URL: https://issues.apache.org/jira/browse/SOLR-1632
> Project: Solr
>  Issue Type: New Feature
>  Components: search
>Affects Versions: 1.5
>Reporter: Andrzej Bialecki 
> Attachments: 3x_SOLR-1632_doesntwork.patch, SOLR-1632.patch, 
> SOLR-1632.patch, SOLR-1632.patch, distrib-2.patch, distrib.patch
>
>
> Distributed IDF is a valuable enhancement for distributed search across 
> non-uniform shards. This issue tracks the proposed implementation of an API 
> to support this functionality in Solr.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-1632) Distributed IDF

2012-01-27 Thread Andrzej Bialecki (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13194907#comment-13194907
 ] 

Andrzej Bialecki  commented on SOLR-1632:
-

\x or %xx escaping could be ok, I guess - it's safe, and in most cases it's 
still readable, unlike base64.
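
For illustration, a minimal %xx escaper along those lines might look like 
this (a sketch only, not code from any patch here):

{code}
// Escape raw term bytes into printable ASCII so they survive inside
// SolrParams, while staying mostly human-readable (unlike base64).
public class PercentEscape {
  public static String escape(byte[] bytes) {
    StringBuilder sb = new StringBuilder(bytes.length);
    for (byte b : bytes) {
      int c = b & 0xff;
      if (c > 0x20 && c < 0x7f && c != '%') {
        sb.append((char) c);                   // printable ASCII passes through
      } else {
        sb.append(String.format("%%%02x", c)); // everything else becomes %xx
      }
    }
    return sb.toString();
  }
}
{code}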

> Distributed IDF
> ---
>
> Key: SOLR-1632
> URL: https://issues.apache.org/jira/browse/SOLR-1632
> Project: Solr
>  Issue Type: New Feature
>  Components: search
>Affects Versions: 1.5
>Reporter: Andrzej Bialecki 
> Attachments: 3x_SOLR-1632_doesntwork.patch, SOLR-1632.patch, 
> SOLR-1632.patch, SOLR-1632.patch, distrib-2.patch, distrib.patch
>
>
> Distributed IDF is a valuable enhancement for distributed search across 
> non-uniform shards. This issue tracks the proposed implementation of an API 
> to support this functionality in Solr.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-1632) Distributed IDF

2012-01-27 Thread Andrzej Bialecki (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13194842#comment-13194842
 ] 

Andrzej Bialecki  commented on SOLR-1632:
-

Hmm, indeed...  I must have switched to toString() for debugging (it's easier 
to eyeball an ASCII string than a base64 string ;) ). This should use base64 
throughout. I'll prepare a patch shortly.

(BTW, I'm aware that passing around blobs of base64 inside SolrParams is ugly. 
I'm open to suggestions on how to handle this better.)
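
For the record, the round trip itself is trivial; java.util.Base64 is used 
below purely for brevity of the sketch, not because it's what the patch uses:

{code}
import java.util.Base64;

// Opaque stats blob <-> SolrParams string value; sketch only.
public class StatsBlobCodec {
  static String encode(byte[] statsBlob) {
    return Base64.getEncoder().encodeToString(statsBlob);
  }

  static byte[] decode(String paramValue) {
    return Base64.getDecoder().decode(paramValue);
  }
}
{code}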

> Distributed IDF
> ---
>
> Key: SOLR-1632
> URL: https://issues.apache.org/jira/browse/SOLR-1632
> Project: Solr
>  Issue Type: New Feature
>  Components: search
>Affects Versions: 1.5
>Reporter: Andrzej Bialecki 
> Attachments: 3x_SOLR-1632_doesntwork.patch, SOLR-1632.patch, 
> SOLR-1632.patch, SOLR-1632.patch, distrib-2.patch, distrib.patch
>
>
> Distributed IDF is a valuable enhancement for distributed search across 
> non-uniform shards. This issue tracks the proposed implementation of an API 
> to support this functionality in Solr.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3722) make similarities/term/collectionstats take long (for > 2B docs)

2012-01-24 Thread Andrzej Bialecki (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13192790#comment-13192790
 ] 

Andrzej Bialecki  commented on LUCENE-3722:
---

bq. maybe we should not have add() here then at all and let the consumer take 
care of this themselves.

Fair point, I'd rather remove it.

bq. i think it must not be treated as 0

OK, it makes sense in local (multi-reader) situations, but in a distributed 
scenario it may still be acceptable to lose just a part of the statistics from 
one shard while keeping the stats from the other shards.

Which would be yet another reason to drop the add() methods :)

> make similarities/term/collectionstats take long (for > 2B docs)
> 
>
> Key: LUCENE-3722
> URL: https://issues.apache.org/jira/browse/LUCENE-3722
> Project: Lucene - Java
>  Issue Type: Improvement
>Affects Versions: 4.0
>Reporter: Robert Muir
> Attachments: LUCENE-3722.patch, LUCENE-3722.patch
>
>
> As noted by Yonik and Andrzej on SOLR-1632, this would be useful for 
> distributed scoring.
> we can also add a sugar method add() to both of these to make it easier to 
> sum.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3722) make similarities/term/collectionstats take long (for > 2B docs)

2012-01-24 Thread Andrzej Bialecki (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13192774#comment-13192774
 ] 

Andrzej Bialecki  commented on LUCENE-3722:
---

I think the add() methods as implemented in this patch are of limited 
usefulness... The reason I created a pair of similar classes in SOLR-1632 was 
to avoid creating new objects by allowing modification of int/long members in 
place, which is useful when aggregating partial stats. So I think a more useful 
implementation of add() could look like this:

{code}
public void add(CollectionStatistics other) {
  assert this.field.equals(other.field);
  this.maxDoc += other.maxDoc;
  this.docCount += other.docCount;
  this.sumTotalTermFreq += other.sumTotalTermFreq;
  this.sumDocFreq += other.sumDocFreq;
}
{code}

Regarding the handling of -1: this code doesn't handle the case when only one 
of the stats is -1, which may happen in a distributed scenario. I think in 
that case a -1 value should be treated as 0, i.e. it should not "reset" the 
accumulated value from the other shards to -1, right?
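
Something like this is what I have in mind (illustrative only):

{code}
// Treat -1 (statistic not available from one shard) as 0 when
// accumulating, so it doesn't reset the total gathered so far.
private static long addStat(long accumulated, long fromShard) {
  return accumulated + (fromShard == -1 ? 0 : fromShard);
}
{code}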

> make similarities/term/collectionstats take long (for > 2B docs)
> 
>
> Key: LUCENE-3722
> URL: https://issues.apache.org/jira/browse/LUCENE-3722
> Project: Lucene - Java
>  Issue Type: Improvement
>Affects Versions: 4.0
>Reporter: Robert Muir
> Attachments: LUCENE-3722.patch, LUCENE-3722.patch
>
>
> As noted by Yonik and Andrzej on SOLR-1632, this would be useful for 
> distributed scoring.
> we can also add a sugar method add() to both of these to make it easier to 
> sum.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-1632) Distributed IDF

2012-01-24 Thread Andrzej Bialecki (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13192407#comment-13192407
 ] 

Andrzej Bialecki  commented on SOLR-1632:
-

Yeah, I was curious about this too. However, this is how CollectionStatistics 
is defined in Lucene, so it's something that we'd have to change in Lucene too.

> Distributed IDF
> ---
>
> Key: SOLR-1632
> URL: https://issues.apache.org/jira/browse/SOLR-1632
> Project: Solr
>  Issue Type: New Feature
>  Components: search
>Affects Versions: 1.5
>Reporter: Andrzej Bialecki 
> Attachments: SOLR-1632.patch, SOLR-1632.patch, distrib-2.patch, 
> distrib.patch
>
>
> Distributed IDF is a valuable enhancement for distributed search across 
> non-uniform shards. This issue tracks the proposed implementation of an API 
> to support this functionality in Solr.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-1632) Distributed IDF

2012-01-24 Thread Andrzej Bialecki (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13192374#comment-13192374
 ] 

Andrzej Bialecki  commented on SOLR-1632:
-

bq. Is this something that can be added to branch_3x?

Not without porting - Lucene / Solr APIs have changed significantly, and this 
patch uses some low-level APIs that are different between trunk and 3x.

> Distributed IDF
> ---
>
> Key: SOLR-1632
> URL: https://issues.apache.org/jira/browse/SOLR-1632
> Project: Solr
>  Issue Type: New Feature
>  Components: search
>Affects Versions: 1.5
>Reporter: Andrzej Bialecki 
> Attachments: SOLR-1632.patch, SOLR-1632.patch, distrib-2.patch, 
> distrib.patch
>
>
> Distributed IDF is a valuable enhancement for distributed search across 
> non-uniform shards. This issue tracks the proposed implementation of an API 
> to support this functionality in Solr.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3617) Graduate appendingcodec from contrib/misc

2011-12-02 Thread Andrzej Bialecki (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13162067#comment-13162067
 ] 

Andrzej Bialecki  commented on LUCENE-3617:
---

+1. It would be nice to be Hadoop-friendly out of the box.

> Graduate appendingcodec from contrib/misc
> -
>
> Key: LUCENE-3617
> URL: https://issues.apache.org/jira/browse/LUCENE-3617
> Project: Lucene - Java
>  Issue Type: Task
>Affects Versions: 4.0
>Reporter: Robert Muir
> Attachments: LUCENE-3617.patch
>
>
> * All tests pass with this codec (at least once, maybe we don't test that 
> two-phase commit stuff very well!)
> * It doesn't require special client side configuration anymore to work (just 
> set it on indexwriter and go)
> * it now works with the compound file format.
> I don't think it needs to live in contrib anymore.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-2621) Extend Codec to handle also stored fields and term vectors

2011-11-08 Thread Andrzej Bialecki (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13146634#comment-13146634
 ] 

Andrzej Bialecki  commented on LUCENE-2621:
---

+1 looks good (as far as I can tell!), even SegmentMerger starts making sense 
now :)

> Extend Codec to handle also stored fields and term vectors
> --
>
> Key: LUCENE-2621
> URL: https://issues.apache.org/jira/browse/LUCENE-2621
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: core/index
>Affects Versions: 4.0
>Reporter: Andrzej Bialecki 
>Assignee: Robert Muir
>  Labels: gsoc2011, lucene-gsoc-11, mentor
> Attachments: LUCENE-2621.patch, LUCENE-2621_rote.patch
>
>
> Currently Codec API handles only writing/reading of term-related data, while 
> stored fields data and term frequency vector data writing/reading is handled 
> elsewhere.
> I propose to extend the Codec API to handle this data as well.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3560) add extra safety to concrete codec implementations

2011-11-05 Thread Andrzej Bialecki (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13144714#comment-13144714
 ] 

Andrzej Bialecki  commented on LUCENE-3560:
---

Let's not go too far either... we want people to write and contribute new 
codecs; if we make the API too onerous to use, we risk a lot of copy&paste-ing 
(e.g. I'd like to extend an existing codec to add one file to files() - 
bummer, now I have to reimplement the whole codec; see the sketch below). So 
let's leave extensibility where it's clear that stuff can be extended with no 
harm (or "no harm if you read the instructions").
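
For example, the kind of one-liner extension I have in mind - a rough sketch 
only, the exact files() signature in trunk may differ and the Lucene imports 
are omitted:

{code}
import java.io.IOException;
import java.util.Set;

// Sketch: extend an existing codec just to add one auxiliary file to
// files(). If files() were final, this whole codec would have to be
// reimplemented from scratch instead.
public class MyCodec extends Lucene40Codec {
  @Override
  public void files(SegmentInfo info, Set<String> files) throws IOException {
    super.files(info, files);
    files.add(IndexFileNames.segmentFileName(info.name, "", "aux"));
  }
}
{code}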

> add extra safety to concrete codec implementations
> --
>
> Key: LUCENE-3560
> URL: https://issues.apache.org/jira/browse/LUCENE-3560
> Project: Lucene - Java
>  Issue Type: Improvement
>Affects Versions: 4.0
>Reporter: Robert Muir
> Attachments: LUCENE-3560.patch
>
>
> In LUCENE-3490, we reorganized the codec model, and a key part of this is 
> that Codecs are "safer"
> and don't rely upon client-side configuration: IndexReader doesn't take Codec 
> or anything of that 
> nature, only IndexWriter.
> Instead for "read" all codecs are initialized from the classpath via a no-arg 
> ctor from Java's 
> Service Provider Mechanism.
> So, although Codecs can still take parameters in the constructors, be 
> subclassable, etc (for passing
> to IndexWriter), this enforces that they must write any configuration 
> information they need into
> the index, so that we don't have a flimsy API.
> I think we should go even further, for additional safety. Any methods on our 
> concrete codecs that
> are not intended to be subclassed should be final, and we should add 
> assertions to verify this.
> For example, SimpleText's files() implementation should be final. If you want 
> to make an extension
> of simpletext that has additional files, then this is a different index 
> format and should have a
> different name!
> Note: This doesn't stop extensibility, only stupid mistakes. 
> For example, this means that Lucene40Codec's postingsFormat() implementation 
> is final, even though 
> it offers a configurable "hook" (getPostingsFormatForField) for you to 
> specify per-field postings 
> formats (which it writes into a .per file into the index, so that it knows 
> how to read each field).
> {code}
> private final PostingsFormat postingsFormat = new PerFieldPostingsFormat() {
>   @Override
>   public PostingsFormat getPostingsFormatForField(String field) {
> return Lucene40Codec.this.getPostingsFormatForField(field);
>   }
> };
> ...
> @Override
> public final PostingsFormat postingsFormat() {
>   return postingsFormat;
> }
> ...
>   /** Returns the postings format that should be used for writing 
>*  new segments of field.
>*  
>*  The default implementation always returns "Lucene40"
>*/
>   public PostingsFormat getPostingsFormatForField(String field) {
> return defaultFormat;
>   }
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3490) Restructure codec hierarchy

2011-11-03 Thread Andrzej Bialecki (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13143402#comment-13143402
 ] 

Andrzej Bialecki  commented on LUCENE-3490:
---

Awesome work, guys! +1 to merging it to trunk.

> Restructure codec hierarchy
> ---
>
> Key: LUCENE-3490
> URL: https://issues.apache.org/jira/browse/LUCENE-3490
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Robert Muir
> Fix For: 4.0
>
> Attachments: LUCENE-3490_SPI.patch
>
>
> Spinoff of LUCENE-2621. (Hoping we can do some of the renaming etc here in a 
> rote way to make progress).
> Currently Codec.java only represents a portion of the index, but there are 
> other parts of the index 
> (stored fields, term vectors, fieldinfos, ...) that we want under codec 
> control. There is also some 
> inconsistency about what a Codec is currently, for example Memory and Pulsing 
> are really just 
> PostingsFormats, you might just apply them to a specific field. On the other 
> hand, PreFlex actually
> is a Codec: it represents the Lucene 3.x index format (just not all parts 
> yet). I imagine we would
> like SimpleText to be the same way.
> So, I propose restructuring the classes so that we have something like:
> * CodecProvider <-- dead, replaced by java ServiceProvider mechanism. All 
> indexes are 'readable' if codecs are in classpath.
> * Codec <-- represents the index format (PostingsFormat + FieldsFormat + ...)
> * PostingsFormat: this is what Codec controls today, and Codec will return 
> one of these for a field.
> * FieldsFormat: Stored Fields + Term Vectors + FieldInfos?
> I think for PreFlex, it doesn't make sense to expose its PostingsFormat as a 
> 'public' class, because preflex
> can never be per-field so there is no use in allowing you to configure 
> PreFlex for a specific field.
> Similarly, I think in the future we should do the same thing for SimpleText. 
> Nobody needs SimpleText for production; it should
> just be a Codec where we try to make as much of the index as plain text and 
> simple as possible for debugging/learning/etc.
> So we don't need to expose its PostingsFormat. On the other hand, I don't 
> think we need Pulsing or Memory codecs,
> because it's pretty silly to make your entire index use one of their 
> PostingsFormats. To parallel with analysis:
> PostingsFormat is like Tokenizer and Codec is like Analyzer, and we don't 
> need Analyzers to "show off" every Tokenizer.
> we can also move the baked in PerFieldCodecWrapper out (it would basically be 
> PerFieldPostingsFormat). Privately it would
> write the ids to the file like it does today. in the future, all 3.x hairy 
> backwards code would move to PreflexCodec. 
> SimpleTextCodec would get a plain text fieldinfos impl, etc.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-2621) Extend Codec to handle also stored fields and term vectors

2011-10-05 Thread Andrzej Bialecki (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13120916#comment-13120916
 ] 

Andrzej Bialecki  commented on LUCENE-2621:
---

IMHO you could merge the branch back into trunk; we are allowed to experiment, 
and the current patch already improves the API and shows what to do next.

> Extend Codec to handle also stored fields and term vectors
> --
>
> Key: LUCENE-2621
> URL: https://issues.apache.org/jira/browse/LUCENE-2621
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: core/index
>Affects Versions: 4.0
>Reporter: Andrzej Bialecki 
>Assignee: Robert Muir
>  Labels: gsoc2011, lucene-gsoc-11, mentor
> Attachments: LUCENE-2621_rote.patch
>
>
> Currently Codec API handles only writing/reading of term-related data, while 
> stored fields data and term frequency vector data writing/reading is handled 
> elsewhere.
> I propose to extend the Codec API to handle this data as well.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3433) Random access non RAM resident IndexDocValues (CSF)

2011-10-05 Thread Andrzej Bialecki (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13120914#comment-13120914
 ] 

Andrzej Bialecki  commented on LUCENE-3433:
---

Random access is also important if we want to use DocValues instead of norms.

> Random access non RAM resident IndexDocValues (CSF)
> ---
>
> Key: LUCENE-3433
> URL: https://issues.apache.org/jira/browse/LUCENE-3433
> Project: Lucene - Java
>  Issue Type: New Feature
>Affects Versions: 4.0
>Reporter: Yonik Seeley
> Fix For: 4.0
>
> Attachments: LUCENE-3433.patch
>
>
> There should be a way to get specific IndexDocValues by going through the 
> Directory rather than loading all of the values into memory.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-2621) Extend Codec to handle also stored fields and term vectors

2011-10-05 Thread Andrzej Bialecki (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13120888#comment-13120888
 ] 

Andrzej Bialecki  commented on LUCENE-2621:
---

CodecProvider -> Codec and Codec -> FieldCodec makes sense to me. This way the 
Codec would be responsible for all global index parts (seginfos, fieldinfos), 
and it would provide an API to manage per-field data (FieldCodec), stored 
fields (FieldsWriter/Reader, perhaps a class to tie these two together), and 
term vectors (TermVectorsWriter/Reader, again grouped in a class).

Re. the current patch: this looks like a great start. I discovered a problem 
lurking in SegmentMerger.mergeFields: setMatchingSegmentReaders checks only 
fieldInfos for compatibility, but it doesn't check codec (and 
fieldWriter/fieldReader) compatibility. It then happily uses the 
matchingSegmentReader directly, which results in raw documents encoded with 
one codec being sent as a raw stream to a fieldWriter from another codec (see 
the sketch below).

Also, SegmentInfo.files() is messy; it should be populated from the codecs - 
as soon as I changed the extension of the stored fields file, things exploded 
because TieredMergePolicy couldn't find the .fdx file reported by files().
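
The kind of check I'd expect in setMatchingSegmentReaders is roughly this 
(a fragment; the names are illustrative, not the actual trunk code, and 
"fieldsFormatName()" in particular is a hypothetical accessor):

{code}
// Sketch only: enable the raw bulk-copy path only when both the
// FieldInfos *and* the stored-fields codec of the incoming segment
// match the segment we are merging into.
boolean sameFieldInfos = mergeFieldInfos.equals(segmentReader.getFieldInfos());
boolean sameFieldsCodec =
    mergeCodec.fieldsFormatName().equals(segmentReader.getCodec().fieldsFormatName());
if (sameFieldInfos && sameFieldsCodec) {
  matchingSegmentReaders[i] = segmentReader; // safe to copy raw documents
} // otherwise fall back to re-encoding documents field by field
{code}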

> Extend Codec to handle also stored fields and term vectors
> --
>
> Key: LUCENE-2621
> URL: https://issues.apache.org/jira/browse/LUCENE-2621
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: core/index
>Affects Versions: 4.0
>Reporter: Andrzej Bialecki 
>Assignee: Robert Muir
>  Labels: gsoc2011, lucene-gsoc-11, mentor
> Attachments: LUCENE-2621_rote.patch
>
>
> Currently Codec API handles only writing/reading of term-related data, while 
> stored fields data and term frequency vector data writing/reading is handled 
> elsewhere.
> I propose to extend the Codec API to handle this data as well.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-2809) searcher leases

2011-10-04 Thread Andrzej Bialecki (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13120164#comment-13120164
 ] 

Andrzej Bialecki  commented on SOLR-2809:
-

Multiple leases could lead to searchers piling up ... perhaps it's less 
expensive to pass around searcher versions and fail+retry distributed requests 
if the version reported by a node disagrees with the version obtained during 
earlier phases (sketched below)? I guess it depends on the rate of change, and 
consequently on the likelihood of failure.
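
A rough sketch of the version check, as a fragment - all of these names are 
hypothetical, none of them exist in Solr today:

{code}
// Phase 1 records each replica's searcher version; later phases verify it.
long expected = phase1Response.getSearcherVersion();
ShardResponse rsp = shard.submit(phase2Request);
if (rsp.getSearcherVersion() != expected) {
  // The replica reopened its searcher between phases -- retry the whole
  // distributed request rather than mixing two views of the index.
  throw new RetryDistributedRequestException();
}
{code}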

> searcher leases
> ---
>
> Key: SOLR-2809
> URL: https://issues.apache.org/jira/browse/SOLR-2809
> Project: Solr
>  Issue Type: New Feature
>Reporter: Yonik Seeley
>
> Leases/reservations on searcher instances would give us the ability to use 
> the same searcher across phases of a distributed search, or for clients to 
> send multiple requests and have them hit a consistent/unchanging view of the 
> index. The latter requires something extra to ensure that the load balancer 
> contacts the same replicas as before.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org