[jira] Resolved: (LUCENE-486) Core Test should not have dependencies on the Demo code

2009-10-05 Thread Michael Busch (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Busch resolved LUCENE-486.
--

Resolution: Fixed

Committed revision 822139.

> Core Test should not have dependencies on the Demo code
> ---
>
> Key: LUCENE-486
> URL: https://issues.apache.org/jira/browse/LUCENE-486
> Project: Lucene - Java
>  Issue Type: Test
>  Components: Build
>Affects Versions: 1.4
>Reporter: Grant Ingersoll
>Assignee: Michael Busch
>Priority: Trivial
> Fix For: 3.0
>
> Attachments: FileDocument.java, lucene-486.patch, testdoc.patch
>
>
> The TestDoc.java Test file has a dependency on the Demo FileDocument code.  
> Some of us don't keep the Demo code around after downloading, so this breaks 
> the build.
> Patch will be along shortly

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1711) Field meta-data

2009-10-05 Thread Michael Busch (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Busch updated LUCENE-1711:
--

Fix Version/s: (was: 3.0)
   3.1

> Field meta-data
> ---
>
> Key: LUCENE-1711
> URL: https://issues.apache.org/jira/browse/LUCENE-1711
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: 2.4.1
>Reporter: Jason Rutherglen
>Priority: Minor
> Fix For: 3.1
>
>   Original Estimate: 96h
>  Remaining Estimate: 96h
>
> Allow user defined meta-data per Field. This would be stored by
> FieldInfos.write. Not sure about how to merge different values.
> The actual typed value should be a Map available
> from Field. 
> The functionality can be used for a variety of purposes
> including trie, schemas, CSF, etc.
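The per-field metadata idea above could be sketched as follows. This is an illustrative stand-in, not Lucene API: the class name `FieldMeta` and its methods are invented, and the last-writer-wins `mergeFrom` is just one possible answer to the open merge question in the description.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of user-defined per-field metadata: a String-keyed
// map attached to a field, persisted alongside FieldInfos. All names here
// are illustrative, not part of Lucene.
class FieldMeta {
    private final String fieldName;
    private final Map<String, String> meta = new HashMap<>();

    FieldMeta(String fieldName) { this.fieldName = fieldName; }

    void put(String key, String value) { meta.put(key, value); }

    String get(String key) { return meta.get(key); }

    String fieldName() { return fieldName; }

    // One possible merge policy for segment merges ("not sure about how to
    // merge different values" above): last writer wins.
    void mergeFrom(FieldMeta other) { meta.putAll(other.meta); }
}
```

Under this sketch, a trie field could carry e.g. `("precisionStep", "4")`, and a schema layer could read it back without a side channel.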




[jira] Updated: (LUCENE-965) Implement a state-of-the-art retrieval function in Lucene

2009-10-05 Thread Michael Busch (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Busch updated LUCENE-965:
-

Fix Version/s: (was: 3.0)
   3.1

> Implement a state-of-the-art retrieval function in Lucene
> -
>
> Key: LUCENE-965
> URL: https://issues.apache.org/jira/browse/LUCENE-965
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.2
>Reporter: Hui Fang
> Fix For: 3.1
>
> Attachments: axiomaticFunction.patch
>
>
> We implemented the axiomatic retrieval function, which is a state-of-the-art 
> retrieval function, to 
> replace the default similarity function in Lucene. We compared the 
> performance of these two functions and reported the results at 
> http://sifaka.cs.uiuc.edu/hfang/lucene/Lucene_exp.pdf. 
> The report shows that the performance of the axiomatic retrieval function is 
> much better than the default function. The axiomatic retrieval function is 
> able to find more relevant documents and users can see more relevant 
> documents in the top-ranked documents. Incorporating such a state-of-the-art 
> retrieval function could improve the search performance of all the 
> applications which were built upon Lucene. 
> Most changes related to the implementation are made in AXSimilarity, 
> TermScorer and TermQuery.java.  However, many test cases are hand coded to 
> test whether the implementation of the default function is correct. Thus, I 
> also made the modification to many test files to make the new retrieval 
> function pass those cases. In fact, we found that some old test cases are not 
> reasonable. For example, in the testQueries02 of TestBoolean2.java, 
> the query is "+w3 xx", and we have two documents "w1 xx w2 yy w3" and "w1 w3 
> xx w2 yy w3". 
> The second document should be more relevant than the first one, because it 
> has more 
> occurrences of the query term "w3". But the original test case would require 
> us to rank 
> the first document higher than the second one, which is not reasonable. 
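The attached patch is not reproduced here; as a standalone illustration, one member of the axiomatic family (the F2-EXP form) can be sketched as below. The parameter values s = 0.5 and k = 0.35 are assumed defaults, and the method names are invented for this sketch.

```java
// Illustrative, standalone sketch of an F2-EXP-style axiomatic scoring
// formula; not the actual AXSimilarity patch code.
class AxiomaticScore {
    // Score contribution of one query term for one document:
    //   c(t,q) * (N / df)^k * c(t,d) / (c(t,d) + s + s * |d| / avdl)
    static double termScore(double queryTf, double docTf, double docLen,
                            double avgDocLen, double numDocs, double docFreq) {
        double s = 0.5;   // length-normalization parameter (assumed default)
        double k = 0.35;  // idf exponent (assumed default)
        double idf = Math.pow(numDocs / docFreq, k);
        double tfNorm = docTf / (docTf + s + s * docLen / avgDocLen);
        return queryTf * idf * tfNorm;
    }
}
```

Note how this formula matches the "+w3 xx" argument above: with everything else equal, the document with more occurrences of "w3" gets the higher score, because tfNorm grows monotonically in docTf.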




[jira] Updated: (LUCENE-1888) Provide Option to Store Payloads on the Term Vector

2009-10-05 Thread Michael Busch (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Busch updated LUCENE-1888:
--

Fix Version/s: (was: 3.0)

> Provide Option to Store Payloads on the Term Vector
> ---
>
> Key: LUCENE-1888
> URL: https://issues.apache.org/jira/browse/LUCENE-1888
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Grant Ingersoll
>Assignee: Grant Ingersoll
> Fix For: 3.1
>
>
> Would be nice to have the option to access the payloads in a document-centric 
> way by adding them to the Term Vectors.  Naturally, this makes the Term 
> Vectors bigger, but it may be just what one needs.




[jira] Updated: (LUCENE-1458) Further steps towards flexible indexing

2009-10-05 Thread Mark Miller (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Miller updated LUCENE-1458:


Attachment: LUCENE-1458.patch

eh - even if you have moved on, if I'm going to put up a patch, might as well 
do it right - here is another:

* removed a boatload of unused imports
* removed DefaultSkipListWriter/Reader - I accidentally put them back in
* removed an unused field or two (not all)
* parameterized LegacySegmentMergeQueue.java
* Fixed the double read I mentioned in previous comment in IndexWriter
* TermRef defines an equals (that throws UOE) but not hashCode - early stuff I 
guess, but odd since no class extends it. Added a hashCode that throws UOE 
anyway.
* fixed bug in TermRangeTermsEnum: lowerTermRef = new TermRef(lowerTermText); 
to lowerTermRef = new TermRef(this.lowerTermText);
* Fixed Remote contrib test to work with TermRef for fieldcache parser (since 
you don't include contrib in the tar)
* Missed a StringBuffer to StringBuilder in MultiTermQuery.toString
* had missed removing deprecated IndexReader.open(final Directory directory) 
and deprecated IndexReader.open(final IndexCommit commit)
* Parameterized some stuff in ParallelReader that made sense - what the heck
* added a nocommit or two on unread fields with a comment that made it look 
like they were/will be used
* Looks like SegmentTermPositions.java may have been screwy in the last patch - 
ensured it's now a deleted file - same with TermInfosWriter.java
* You left getEnum(IndexReader reader) in the MultiTerm queries, but not in 
PrefixQuery - just checkin'.
* Missed removing listAll from FileSwitchDirectory - gone
* cleaned up some white space nothings in the patch
* I guess TestBackwardsCompatibility.java has been removed from trunk or 
something? Kept it here for now.
* Looks like I missed merging in a change to 
TestIndexWriter.java#assertNoUnreferencedFiles - done
* double-checked my merge work

core and contrib tests pass
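The TermRef equals/hashCode item above touches a general Java contract: if a class deliberately refuses equality comparison, it should refuse hashing too, so it cannot be silently misused as a hash-map key. A stand-in sketch (not the actual TermRef from the patch):

```java
// Stand-in for a TermRef-like value holder that is not meant to be
// compared or hashed: both equals and hashCode throw UOE together,
// keeping the equals/hashCode contract consistent.
class TermRefSketch {
    final byte[] bytes;

    TermRefSketch(byte[] bytes) { this.bytes = bytes; }

    @Override
    public boolean equals(Object other) {
        throw new UnsupportedOperationException("TermRefSketch is not comparable");
    }

    @Override
    public int hashCode() {
        throw new UnsupportedOperationException("TermRefSketch is not hashable");
    }
}
```

Defining only one of the pair invites subtle bugs: a HashMap would call the inherited identity hashCode, put the object in a bucket, and only blow up later (or never) on equals.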




> Further steps towards flexible indexing
> ---
>
> Key: LUCENE-1458
> URL: https://issues.apache.org/jira/browse/LUCENE-1458
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Index
>Affects Versions: 2.9
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Attachments: LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.tar.bz2, 
> LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, 
> LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2
>
>
> I attached a very rough checkpoint of my current patch, to get early
> feedback.  All tests pass, though back compat tests don't pass due to
> changes to package-private APIs plus certain bugs in tests that
> happened to work (e.g. call TermPositions.nextPosition() too many times,
> which the new API asserts against).
> [Aside: I think, when we commit changes to package-private APIs such
> that back-compat tests don't pass, we could go back, make a branch on
> the back-compat tag, commit changes to the tests to use the new
> package private APIs on that branch, then fix nightly build to use the
> tip of that branch?]
> There's still plenty to do before this is committable! This is a
> rather large change:
>   * Switches to a new more efficient terms dict format.  This still
> uses tii/tis files, but the tii only stores term & long offset
> (not a TermInfo).  At seek points, tis encodes term & freq/prox
> offsets absolutely instead of with deltas.  Also, tis/tii
> are structured by field, so we don't have to record field number
> in every term.
> .
> On first 1 M docs of Wikipedia, tii file is 36% smaller (0.99 MB
> -> 0.64 MB) and tis file is 9% smaller (75.5 MB -> 68.5 MB).
> .
> RAM usage when loading terms dict index is significantly less
> since we only load an array of offsets and an array of String (no
> more TermInfo array).  It should be faster to init too.
> .
> This part is basically done.
>   * Introduces modular reader codec that strongly decouples terms dict
> from docs/positions readers.  EG there is no more TermInfo used
> when reading the new format.
> .
> There's nice symmetry now between reading & writing in the codec
> chain -- the current docs/prox format is captured in:
> {code}
> FormatPostingsTermsDictWriter/Reader
> FormatPostingsDocsWriter/Reader (.frq file) and
> FormatPostingsPositionsWriter/Reader (.prx file).
> {code}
> This part is basically done.
>   * Introduces a

[jira] Updated: (LUCENE-486) Core Test should not have dependencies on the Demo code

2009-10-05 Thread Michael Busch (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Busch updated LUCENE-486:
-

Attachment: lucene-486.patch

Will have to commit the change in TestDoc also to the compatibility tests 
branch.

> Core Test should not have dependencies on the Demo code
> ---
>
> Key: LUCENE-486
> URL: https://issues.apache.org/jira/browse/LUCENE-486
> Project: Lucene - Java
>  Issue Type: Test
>  Components: Build
>Affects Versions: 1.4
>Reporter: Grant Ingersoll
>Assignee: Michael Busch
>Priority: Trivial
> Fix For: 3.0
>
> Attachments: FileDocument.java, lucene-486.patch, testdoc.patch
>
>
> The TestDoc.java Test file has a dependency on the Demo FileDocument code.  
> Some of us don't keep the Demo code around after downloading, so this breaks 
> the build.
> Patch will be along shortly




[jira] Updated: (LUCENE-1458) Further steps towards flexible indexing

2009-10-05 Thread Mark Miller (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Miller updated LUCENE-1458:


Attachment: LUCENE-1458.patch

bq. I haven't "svn up"d to all the recent deprecation removals / generics 
additions. Kinda dreading doing so :)

Come on old man, stop clinging to emacs ;) I've got a meditation technique for 
that :)

Sounds like some annoyance, and I think I made a comment there - and I'm a man 
of my word... or child of my word - take your pick.

To trunk. Since you likely have moved on, don't worry - this was good practice 
- I'll do it again sometime if you'd like. I may have mis-merged something 
small. I went fairly quickly (I think it took like 30 or 40 min - was hoping to 
do it faster, but eh - sometimes I like to grind).

I didn't really look at the code, but some stuff I noticed:

Java 6 usage in pfor (Arrays.copyOf)

skip list stuff in codecs still has the index package - not sure what is going 
on there - changed them

in IndexWriter: 
+  // Mark: read twice?
   segmentInfos.read(directory);
+segmentInfos.read(directory, codecs);

Core tests pass, but I didn't wait for contrib or back compat.

> Further steps towards flexible indexing
> ---
>
> Key: LUCENE-1458
> URL: https://issues.apache.org/jira/browse/LUCENE-1458
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Index
>Affects Versions: 2.9
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Attachments: LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.patch, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, 
> LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, 
> LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2
>
>
> I attached a very rough checkpoint of my current patch, to get early
> feedback.  All tests pass, though back compat tests don't pass due to
> changes to package-private APIs plus certain bugs in tests that
> happened to work (e.g. call TermPositions.nextPosition() too many times,
> which the new API asserts against).
> [Aside: I think, when we commit changes to package-private APIs such
> that back-compat tests don't pass, we could go back, make a branch on
> the back-compat tag, commit changes to the tests to use the new
> package private APIs on that branch, then fix nightly build to use the
> tip of that branch?]
> There's still plenty to do before this is committable! This is a
> rather large change:
>   * Switches to a new more efficient terms dict format.  This still
> uses tii/tis files, but the tii only stores term & long offset
> (not a TermInfo).  At seek points, tis encodes term & freq/prox
> offsets absolutely instead of with deltas.  Also, tis/tii
> are structured by field, so we don't have to record field number
> in every term.
> .
> On first 1 M docs of Wikipedia, tii file is 36% smaller (0.99 MB
> -> 0.64 MB) and tis file is 9% smaller (75.5 MB -> 68.5 MB).
> .
> RAM usage when loading terms dict index is significantly less
> since we only load an array of offsets and an array of String (no
> more TermInfo array).  It should be faster to init too.
> .
> This part is basically done.
>   * Introduces modular reader codec that strongly decouples terms dict
> from docs/positions readers.  EG there is no more TermInfo used
> when reading the new format.
> .
> There's nice symmetry now between reading & writing in the codec
> chain -- the current docs/prox format is captured in:
> {code}
> FormatPostingsTermsDictWriter/Reader
> FormatPostingsDocsWriter/Reader (.frq file) and
> FormatPostingsPositionsWriter/Reader (.prx file).
> {code}
> This part is basically done.
>   * Introduces a new "flex" API for iterating through the fields,
> terms, docs and positions:
> {code}
> FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum
> {code}
> This replaces TermEnum/Docs/Positions.  SegmentReader emulates the
> old API on top of the new API to keep back-compat.
> 
> Next steps:
>   * Plug in new codecs (pulsing, pfor) to exercise the modularity /
> fix any hidden assumptions.
>   * Expose new API out of IndexReader, deprecate old API but emulate
> old API on top of new one, switch all core/contrib users to the
> new API.
>   * Maybe switch to AttributeSources as the base class for TermsEnum,
> DocsEnum, PostingsEnum -- this would give readers API flexibility
> (not just index-file-format flexibility).  EG if someone wanted

[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing

2009-10-05 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12762406#action_12762406
 ] 

Grant Ingersoll commented on LUCENE-1458:
-

I haven't followed too closely (even though it is one of my favorite issues) 
but I figured while Yonik was throwing out ideas, I'd add that one of the 
obvious use cases for flexible indexing is altering scoring.  One of the common 
statistics one needs to implement some more advanced scoring approaches is the 
average document length.  Is this patch far enough along that I could take a 
look at it and think about how one might do this?

> Further steps towards flexible indexing
> ---
>
> Key: LUCENE-1458
> URL: https://issues.apache.org/jira/browse/LUCENE-1458
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Index
>Affects Versions: 2.9
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Attachments: LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, 
> LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, 
> LUCENE-1458.tar.bz2
>
>
> I attached a very rough checkpoint of my current patch, to get early
> feedback.  All tests pass, though back compat tests don't pass due to
> changes to package-private APIs plus certain bugs in tests that
> happened to work (e.g. call TermPositions.nextPosition() too many times,
> which the new API asserts against).
> [Aside: I think, when we commit changes to package-private APIs such
> that back-compat tests don't pass, we could go back, make a branch on
> the back-compat tag, commit changes to the tests to use the new
> package private APIs on that branch, then fix nightly build to use the
> tip of that branch?]
> There's still plenty to do before this is committable! This is a
> rather large change:
>   * Switches to a new more efficient terms dict format.  This still
> uses tii/tis files, but the tii only stores term & long offset
> (not a TermInfo).  At seek points, tis encodes term & freq/prox
> offsets absolutely instead of with deltas.  Also, tis/tii
> are structured by field, so we don't have to record field number
> in every term.
> .
> On first 1 M docs of Wikipedia, tii file is 36% smaller (0.99 MB
> -> 0.64 MB) and tis file is 9% smaller (75.5 MB -> 68.5 MB).
> .
> RAM usage when loading terms dict index is significantly less
> since we only load an array of offsets and an array of String (no
> more TermInfo array).  It should be faster to init too.
> .
> This part is basically done.
>   * Introduces modular reader codec that strongly decouples terms dict
> from docs/positions readers.  EG there is no more TermInfo used
> when reading the new format.
> .
> There's nice symmetry now between reading & writing in the codec
> chain -- the current docs/prox format is captured in:
> {code}
> FormatPostingsTermsDictWriter/Reader
> FormatPostingsDocsWriter/Reader (.frq file) and
> FormatPostingsPositionsWriter/Reader (.prx file).
> {code}
> This part is basically done.
>   * Introduces a new "flex" API for iterating through the fields,
> terms, docs and positions:
> {code}
> FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum
> {code}
> This replaces TermEnum/Docs/Positions.  SegmentReader emulates the
> old API on top of the new API to keep back-compat.
> 
> Next steps:
>   * Plug in new codecs (pulsing, pfor) to exercise the modularity /
> fix any hidden assumptions.
>   * Expose new API out of IndexReader, deprecate old API but emulate
> old API on top of new one, switch all core/contrib users to the
> new API.
>   * Maybe switch to AttributeSources as the base class for TermsEnum,
> DocsEnum, PostingsEnum -- this would give readers API flexibility
> (not just index-file-format flexibility).  EG if someone wanted
> to store payload at the term-doc level instead of
> term-doc-position level, you could just add a new attribute.
>   * Test performance & iterate.
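The nested "flex" enumeration chain described above (fields -> terms -> docs -> positions) can be mimicked with plain sorted maps. This is a toy stand-in to show the shape of the iteration, not the patch's FieldProducer/TermsEnum API; all names below are illustrative.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model of the flex chain: a field maps to a sorted terms dict, each
// term maps to its postings (doc ids). Walking it mirrors
// FieldProducer -> TermsEnum -> DocsEnum.
class FlexSketch {
    final Map<String, TreeMap<String, List<Integer>>> index = new TreeMap<>();

    void add(String field, String term, List<Integer> docs) {
        index.computeIfAbsent(field, f -> new TreeMap<>()).put(term, docs);
    }

    // Walk the full chain and count (field, term, doc) postings.
    int countPostings() {
        int n = 0;
        for (Map.Entry<String, TreeMap<String, List<Integer>>> field : index.entrySet()) {
            for (Map.Entry<String, List<Integer>> term : field.getValue().entrySet()) {
                n += term.getValue().size();
            }
        }
        return n;
    }
}
```

The point of the real API is that each arrow in the chain is a separately pluggable enum, so a codec can swap the docs/positions layer without touching the terms dict.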




[jira] Commented: (LUCENE-1636) TokenFilters with a null value in the constructor fail

2009-10-05 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12762387#action_12762387
 ] 

Uwe Schindler commented on LUCENE-1636:
---

This change is also noted in the backwards compatibility section of Lucene 2.9.

The assignment of the input field in the ctor is redundant, as the super ctor 
already does it, so this is a problem of the third-party software that used the 
API in an undocumented way. I am sorry for your problems, but the author of 
lucene-ja should provide a fix. If you have the source code available, it is a 
less serious problem; if it is closed source, you have to ask the author to 
fix it soon.

> TokenFilters with a null value in the constructor fail
> --
>
> Key: LUCENE-1636
> URL: https://issues.apache.org/jira/browse/LUCENE-1636
> Project: Lucene - Java
>  Issue Type: Bug
>Affects Versions: 2.9
>Reporter: Wouter Heijke
>Assignee: Michael McCandless
> Fix For: 2.9
>
> Attachments: LUCENE-1636.patch
>
>
> While migrating from 2.4.x to 2.9-dev I found a lot of failing unittests.
> One problem is with TokenFilters that do a super(null) in the constructor.
> I fixed it by changing the constructor to super(new EmptyTokenStream())
> This will cause problems and frustration to others while migrating to 2.9.
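The failure mode and the `EmptyTokenStream` workaround can be shown with a standalone analogue. The classes below are stand-ins for TokenStream/TokenFilter, not Lucene's actual code; the null check in the constructor illustrates why `super(null)` stops working once the base class touches its input up front.

```java
// Standalone analogue of the 2.9 behavior change: a filter base class
// that validates its input in the ctor, plus an EmptyStream placeholder
// mirroring the EmptyTokenStream fix from the description.
class NullFilterSketch {
    static class Stream {
        String next() { return null; } // null signals end of stream
    }

    static class Filter extends Stream {
        final Stream input;

        Filter(Stream input) {
            if (input == null) {
                throw new IllegalArgumentException("input stream must not be null");
            }
            this.input = input;
        }

        @Override String next() { return input.next(); }
    }

    // Safe do-nothing input for filters that used to pass null.
    static class EmptyStream extends Stream { }
}
```

A filter built with `new Filter(null)` fails immediately, while `new Filter(new EmptyStream())` behaves like the old null-input case: it simply yields no tokens.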




[jira] Commented: (LUCENE-1946) Remove deprecated TokenStream API

2009-10-05 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12762386#action_12762386
 ] 

Michael Busch commented on LUCENE-1946:
---

{quote}
even the UIMA people were very happy about the new API, because it fits better 
to the UIMA architecture
{quote}

Yeah, I had the feeling they like it too (had lunch with them a couple weeks 
ago when I was working from Boeblingen).

I'm personally ok with removing the old API; just thought Grant and others 
mentioned concerns about removing it too soon, because lots of users have their 
own TokenStreams.

> Remove deprecated TokenStream API
> -
>
> Key: LUCENE-1946
> URL: https://issues.apache.org/jira/browse/LUCENE-1946
> Project: Lucene - Java
>  Issue Type: Task
>  Components: Analysis, contrib/analyzers
>Reporter: Uwe Schindler
>Assignee: Uwe Schindler
> Fix For: 3.0
>
> Attachments: LUCENE-1946.patch, LUCENE-1946.patch
>
>
> I looked into clover analysis: It seems to be no longer used since I removed 
> the tests yesterday - I am happy!




[jira] Commented: (LUCENE-1946) Remove deprecated TokenStream API

2009-10-05 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12762375#action_12762375
 ] 

Uwe Schindler commented on LUCENE-1946:
---

I already announced (as did Grant in his webinar) that Lucene 3.0 will no 
longer contain the old TokenStream API. Nobody had problems with that (even the 
UIMA people were very happy about the new API, because it fits the UIMA 
architecture better [they also have things like Attributes, just named 
differently] - there was a conference about UIMA in Potsdam).

I am keeping this patch pending (only the generics changes are committed), but 
all backwards-compatibility tests were removed in my deprecation cleanup, so 
there are no tests for the old API in trunk anymore.

> Remove deprecated TokenStream API
> -
>
> Key: LUCENE-1946
> URL: https://issues.apache.org/jira/browse/LUCENE-1946
> Project: Lucene - Java
>  Issue Type: Task
>  Components: Analysis, contrib/analyzers
>Reporter: Uwe Schindler
>Assignee: Uwe Schindler
> Fix For: 3.0
>
> Attachments: LUCENE-1946.patch, LUCENE-1946.patch
>
>
> I looked into clover analysis: It seems to be no longer used since I removed 
> the tests yesterday - I am happy!




[jira] Commented: (LUCENE-1946) Remove deprecated TokenStream API

2009-10-05 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12762330#action_12762330
 ] 

Michael Busch commented on LUCENE-1946:
---

In case the change in our backwards-compatibility policy happens (see 
LUCENE-1698) we could think about removing the old TokenStream API in 3.1, 
considering how central this API is. 

> Remove deprecated TokenStream API
> -
>
> Key: LUCENE-1946
> URL: https://issues.apache.org/jira/browse/LUCENE-1946
> Project: Lucene - Java
>  Issue Type: Task
>  Components: Analysis, contrib/analyzers
>Reporter: Uwe Schindler
>Assignee: Uwe Schindler
> Fix For: 3.0
>
> Attachments: LUCENE-1946.patch, LUCENE-1946.patch
>
>
> I looked into clover analysis: It seems to be no longer used since I removed 
> the tests yesterday - I am happy!




[jira] Commented: (LUCENE-1698) Change backwards-compatibility policy

2009-10-05 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12762329#action_12762329
 ] 

Michael Busch commented on LUCENE-1698:
---

Now that 2.9 is out and 3.0 is close, I'd like to get back to this one to get 
to a conclusion.

We had several informal +1s on java-dev and a -0 here from Hoss and Mark. So no 
-1 yet.

I think the final decision should be made with an official vote on java-dev. 
How does everyone feel about this? Shall we have a vote right away? I think it 
might be a good idea to get some feedback about this proposal on java-user 
first? Or we could even wait a month and bring it up in Oakland at the Lucene 
BOF?

> Change backwards-compatibility policy
> -
>
> Key: LUCENE-1698
> URL: https://issues.apache.org/jira/browse/LUCENE-1698
> Project: Lucene - Java
>  Issue Type: Task
>Reporter: Michael Busch
>Assignee: Michael Busch
>Priority: Minor
> Fix For: 3.0
>
>
> These proposed changes might still change slightly:
> I'll call X.Y -> X+1.0 a 'major release', X.Y -> X.Y+1 a
> 'minor release' and X.Y.Z -> X.Y.Z+1 a 'bugfix release'. (we can later
> use different names; just for convenience here...)
> 1. The file format backwards-compatiblity policy will remain unchanged;
>i.e. Lucene X.Y supports reading all indexes written with Lucene
>X-1.Y. That means Lucene 4.0 will not have to be able to read 2.x
>indexes.
> 2. Deprecated public and protected APIs can be removed if they have
>been released in at least one major or minor release. E.g. an 3.1
>API can be released as deprecated in 3.2 and removed in 3.3 or 4.0
>(if 4.0 comes after 3.2).
> 3. No public or protected APIs are changed in a bugfix release; except
>if a severe bug can't be changed otherwise.
> 4. Each release will have release notes with a new section
>"Incompatible changes", which lists, as the name says, all changes that
>break backwards compatibility. The list should also have information
>about how to convert to the new API. I think the eclipse releases
>have such a release notes section. Furthermore, the Deprecation tag 
>comment will state the minimum version when this API is to be removed,  
> e.g.
>@deprecated See #fooBar().  Will be removed in 3.3 
>or
>@deprecated See #fooBar().  Will be removed in 3.3 or later.
> I'd suggest treating a runtime change like an API change (unless it's fixing 
> a bug, of course), i.e. giving a warning, providing a switch, and switching 
> the default behavior only after a major or minor release that carried the 
> warning/switch has been out. 
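Item 4's deprecation-tag style could look like this in code. The class and method names are invented for illustration; only the javadoc convention (naming the earliest release the API may disappear in) comes from the proposal.

```java
// Illustration of the proposed deprecation style: the javadoc points to
// the replacement and states the minimum removal version. Names are
// hypothetical, not real Lucene API.
class PolicySketch {
    /**
     * @deprecated See {@link #fooBar()}. Will be removed in 3.3 or later.
     */
    @Deprecated
    int foo() { return fooBar(); }

    /** Replacement API; deprecated callers delegate here to stay consistent. */
    int fooBar() { return 42; }
}
```

Having the deprecated method delegate to its replacement keeps the two behaviorally identical during the overlap releases, so removal later is purely mechanical.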




[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing

2009-10-05 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12762292#action_12762292
 ] 

Yonik Seeley commented on LUCENE-1458:
--

bq. I agree this would be useful. I did have ord() in early iterations of the 
TermsEnum API, but it wasn't fully implemented and I stripped it when I 
switched to "just finish it already" mode

A "complete" implementation seems hard (i.e. across multiple segments also)... 
but it still seems useful even if it's only at the segment level.  So perhaps 
just on SegmentTermEnum, and users would have to cast to access?

Exposing the term index array (i.e. every 128th term) with an expert, 
subject-to-change warning would at least let people implement variants 
themselves.

bq. you'd also presumably need seek(int ord)

Yep.
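At the segment level the ord()/seek(int ord) idea is straightforward, since a segment's terms dict is sorted: the ordinal is just the position in term order. A minimal sketch (an illustrative stand-in, not the patch's SegmentTermEnum):

```java
import java.util.Arrays;

// Segment-level sketch of ord()/seek: with the segment's terms held in
// sorted order, ord is the array index, seek-by-ord is array access, and
// seek-by-term is a binary search.
class OrdTermsSketch {
    final String[] sortedTerms;
    int ord = -1; // current position; -1 = unpositioned

    OrdTermsSketch(String[] sortedTerms) { this.sortedTerms = sortedTerms; }

    int ord() { return ord; }

    // Position by ordinal and return the term there.
    String seek(int targetOrd) {
        ord = targetOrd;
        return sortedTerms[ord];
    }

    // Position by term; returns the ord (>= 0) if the term exists.
    int seek(String term) {
        ord = Arrays.binarySearch(sortedTerms, term);
        return ord;
    }
}
```

The hard part Yonik alludes to is the multi-segment case: global ords would have to be remapped across segments and merges, which is why exposing this only on the segment-level enum (behind a cast) is the pragmatic first step.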

> Further steps towards flexible indexing
> ---
>
> Key: LUCENE-1458
> URL: https://issues.apache.org/jira/browse/LUCENE-1458
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Index
>Affects Versions: 2.9
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Attachments: LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, 
> LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, 
> LUCENE-1458.tar.bz2
>
>
> I attached a very rough checkpoint of my current patch, to get early
> feedback.  All tests pass, though back compat tests don't pass due to
> changes to package-private APIs plus certain bugs in tests that
> happened to work (eg call TermPostions.nextPosition() too many times,
> which the new API asserts against).
> [Aside: I think, when we commit changes to package-private APIs such
> that back-compat tests don't pass, we could go back, make a branch on
> the back-compat tag, commit changes to the tests to use the new
> package private APIs on that branch, then fix nightly build to use the
> tip of that branch?]
> There's still plenty to do before this is committable! This is a
> rather large change:
>   * Switches to a new more efficient terms dict format.  This still
> uses tii/tis files, but the tii only stores term & long offset
> (not a TermInfo).  At seek points, tis encodes term & freq/prox
> offsets absolutely instead of with deltas.  Also, tis/tii
> are structured by field, so we don't have to record field number
> in every term.
> .
> On first 1 M docs of Wikipedia, tii file is 36% smaller (0.99 MB
> -> 0.64 MB) and tis file is 9% smaller (75.5 MB -> 68.5 MB).
> .
> RAM usage when loading terms dict index is significantly less
> since we only load an array of offsets and an array of String (no
> more TermInfo array).  It should be faster to init too.
> .
> This part is basically done.
>   * Introduces modular reader codec that strongly decouples terms dict
> from docs/positions readers.  EG there is no more TermInfo used
> when reading the new format.
> .
> There's nice symmetry now between reading & writing in the codec
> chain -- the current docs/prox format is captured in:
> {code}
> FormatPostingsTermsDictWriter/Reader
> FormatPostingsDocsWriter/Reader (.frq file) and
> FormatPostingsPositionsWriter/Reader (.prx file).
> {code}
> This part is basically done.
>   * Introduces a new "flex" API for iterating through the fields,
> terms, docs and positions:
> {code}
> FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum
> {code}
> This replaces TermEnum/Docs/Positions.  SegmentReader emulates the
> old API on top of the new API to keep back-compat.
> 
> Next steps:
>   * Plug in new codecs (pulsing, pfor) to exercise the modularity /
> fix any hidden assumptions.
>   * Expose new API out of IndexReader, deprecate old API but emulate
> old API on top of new one, switch all core/contrib users to the
> new API.
>   * Maybe switch to AttributeSources as the base class for TermsEnum,
> DocsEnum, PostingsEnum -- this would give readers API flexibility
> (not just index-file-format flexibility).  EG if someone wanted
> to store payload at the term-doc level instead of
> term-doc-position level, you could just add a new attribute.
>   * Test performance & iterate.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.o

[jira] Commented: (LUCENE-1910) Extension to MoreLikeThis to use tag information

2009-10-05 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12762290#action_12762290
 ] 

Mark Harwood commented on LUCENE-1910:
--

> 2 minutes to create a query based on 10,000 documents?

Unfortunately, I can't see this being generally useful until the performance is 
improved dramatically.


> Extension to MoreLikeThis to use tag information
> 
>
> Key: LUCENE-1910
> URL: https://issues.apache.org/jira/browse/LUCENE-1910
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/*
>Reporter: Thomas D'Silva
>Priority: Minor
> Attachments: LUCENE-1910.patch
>
>
> I would like to contribute a class based on the MoreLikeThis class in
> contrib/queries that generates a query based on the tags associated
> with a document. The class assumes that documents are tagged with a
> set of tags (which are stored in the index in a seperate Field). The
> class determines the top document terms associated with a given tag
> using the information gain metric.
> While generating a MoreLikeThis query for a document the tags
> associated with document are used to determine the terms in the query.
> This class is useful for finding similar documents to a document that
> does not have many relevant terms but was tagged.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1947) Snowball package contains BSD licensed code with ASL header

2009-10-05 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12762284#action_12762284
 ] 

Mark Miller commented on LUCENE-1947:
-

We should actually add a comment in the files about the BSD license as well - 
to keep this from being a recurring theme.

> Snowball package contains BSD licensed code with ASL header
> ---
>
> Key: LUCENE-1947
> URL: https://issues.apache.org/jira/browse/LUCENE-1947
> Project: Lucene - Java
>  Issue Type: Task
>  Components: contrib/analyzers
>Affects Versions: 2.9
>Reporter: Karl Wettin
>Assignee: Karl Wettin
> Fix For: 3.0
>
> Attachments: LUCENE-1947.patch
>
>
> All classes in org.tartarus.snowball (but not in org.tartarus.snowball.ext) 
> have for some reason been given an ASL header. These classes are licensed 
> under BSD, so the ASL header should be removed. I suppose this is a mistake, 
> possibly due to the ASL header automation tool.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing

2009-10-05 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12762282#action_12762282
 ] 

Michael McCandless commented on LUCENE-1458:


bq. 1) How many terms in a field?

Actually I've already added this one (Terms.getUniqueTermCount), but I
didn't punch it through to IndexReader.  I'll do that.  The standard
codec (new "default" codec when writing segments) already records this
per field, so it's trivial to expose.

However, some impls may throw UOE (eg a composite IndexReader).

bq. 2) Convert back and forth between a term number and a term.

I agree this would be useful.  I did have ord() in early iterations of
the TermsEnum API, but it wasn't fully implemented and I stripped it
when I switched to "just finish it already" mode :) We could think
about adding it back, though you'd also presumably need seek(int ord)
as well?  (And docFreq(String field, int ord) sugar exposed in
IndexReader?).


> Further steps towards flexible indexing
> ---
>
> Key: LUCENE-1458
> URL: https://issues.apache.org/jira/browse/LUCENE-1458
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Index
>Affects Versions: 2.9
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Attachments: LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, 
> LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, 
> LUCENE-1458.tar.bz2
>
>
> I attached a very rough checkpoint of my current patch, to get early
> feedback.  All tests pass, though back compat tests don't pass due to
> changes to package-private APIs plus certain bugs in tests that
> happened to work (eg call TermPostions.nextPosition() too many times,
> which the new API asserts against).
> [Aside: I think, when we commit changes to package-private APIs such
> that back-compat tests don't pass, we could go back, make a branch on
> the back-compat tag, commit changes to the tests to use the new
> package private APIs on that branch, then fix nightly build to use the
> tip of that branch?]
> There's still plenty to do before this is committable! This is a
> rather large change:
>   * Switches to a new more efficient terms dict format.  This still
> uses tii/tis files, but the tii only stores term & long offset
> (not a TermInfo).  At seek points, tis encodes term & freq/prox
> offsets absolutely instead of with deltas.  Also, tis/tii
> are structured by field, so we don't have to record field number
> in every term.
> .
> On first 1 M docs of Wikipedia, tii file is 36% smaller (0.99 MB
> -> 0.64 MB) and tis file is 9% smaller (75.5 MB -> 68.5 MB).
> .
> RAM usage when loading terms dict index is significantly less
> since we only load an array of offsets and an array of String (no
> more TermInfo array).  It should be faster to init too.
> .
> This part is basically done.
>   * Introduces modular reader codec that strongly decouples terms dict
> from docs/positions readers.  EG there is no more TermInfo used
> when reading the new format.
> .
> There's nice symmetry now between reading & writing in the codec
> chain -- the current docs/prox format is captured in:
> {code}
> FormatPostingsTermsDictWriter/Reader
> FormatPostingsDocsWriter/Reader (.frq file) and
> FormatPostingsPositionsWriter/Reader (.prx file).
> {code}
> This part is basically done.
>   * Introduces a new "flex" API for iterating through the fields,
> terms, docs and positions:
> {code}
> FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum
> {code}
> This replaces TermEnum/Docs/Positions.  SegmentReader emulates the
> old API on top of the new API to keep back-compat.
> 
> Next steps:
>   * Plug in new codecs (pulsing, pfor) to exercise the modularity /
> fix any hidden assumptions.
>   * Expose new API out of IndexReader, deprecate old API but emulate
> old API on top of new one, switch all core/contrib users to the
> new API.
>   * Maybe switch to AttributeSources as the base class for TermsEnum,
> DocsEnum, PostingsEnum -- this would give readers API flexibility
> (not just index-file-format flexibility).  EG if someone wanted
> to store payload at the term-doc level instead of
> term-doc-position level, you could just add a new attribute.
>   * Test performance & iterate.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (LUCENE-1636) TokenFilters with a null value in the constructor fail

2009-10-05 Thread Mark Bennett (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12762281#action_12762281
 ] 

Mark Bennett commented on LUCENE-1636:
--

(trouble posting, forgive if duplicate)
This change also broke the Japanese morphological SEN / Lucene integration code 
in lucene-ja.  Since Solr 1.4 is based on Lucene 2.9, this will also 
effectively break SEN for Solr users who upgrade to 1.4.

I'm not complaining.  Reading the above comments, the change was probably the 
"right" thing to do.  I've contacted the author of lucene-ja, and I hope to 
work on a rewrite to address this.

I would be interested in any comments you folks might have about the lucene-ja 
code.

Class org.apache.lucene.analysis.ja.POSFilter
Extends org.apache.lucene.analysis.TokenFilter

Offending code in lucene-ja's POSFilter:

    public POSFilter(TokenStream in, Hashtable posTable) {
        super(in);
        input = in; // << this is a member field of parent TokenFilter
        table = posTable;
    }

This is done in several classes.
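For comparison, here is a minimal sketch of the corrected pattern, using stand-in classes rather than Lucene's real TokenStream/TokenFilter (whose 2.9 versions differ in details): since super(in) already stores the delegate, the subclass no longer touches input. POSFilterFixed is a hypothetical rewrite, not lucene-ja's actual code.

```java
import java.util.Map;

// Minimal stand-ins to show the fix; in this sketch the delegate is final
// and null-checked, roughly mirroring why super(null) and reassigning
// "input" both break against the 2.9-style base class.
class TokenStream {}

class TokenFilter extends TokenStream {
    // Assigned once by the superclass constructor; subclasses must not
    // reassign it.
    protected final TokenStream input;

    TokenFilter(TokenStream in) {
        if (in == null) {
            throw new IllegalArgumentException("input must not be null");
        }
        this.input = in;
    }
}

public class POSFilterFixed extends TokenFilter {
    private final Map<String, String> table;

    // Corrected constructor: super(in) already stores the delegate, so the
    // old "input = in;" line is simply dropped.
    public POSFilterFixed(TokenStream in, Map<String, String> table) {
        super(in);
        this.table = table;
    }

    public boolean knows(String pos) {
        return table.containsKey(pos);
    }
}
```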

> TokenFilters with a null value in the constructor fail
> --
>
> Key: LUCENE-1636
> URL: https://issues.apache.org/jira/browse/LUCENE-1636
> Project: Lucene - Java
>  Issue Type: Bug
>Affects Versions: 2.9
>Reporter: Wouter Heijke
>Assignee: Michael McCandless
> Fix For: 2.9
>
> Attachments: LUCENE-1636.patch
>
>
> While migrating from 2.4.x to 2.9-dev I found a lot of failing unittests.
> One problem is with TokenFilters that do a super(null) in the constructor.
> I fixed it by changing the constructor to super(new EmptyTokenStream())
> This will cause problems and frustration to others while migrating to 2.9.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Lucene 2.9 and deprecated IR.open() methods

2009-10-05 Thread Michael Busch

I think we shouldn't discuss too many different things here.
To begin I'd just like to introduce the IndexConfig class, that will 
hold the parameters we currently pass to the different IndexWriter 
constructors.


If we later need to create different IndexWriter impls we can introduce 
a factory.


If we want to change some IW settings to be mandatory on IW 
instantiation we can move those parameters from IW to the Config class 
then.


If we see in the future the need to pass arguments to the different flex 
index consumers, we can add an AttributeSource or Properties hashmap to 
the config class, or maybe directly to the IndexingChain class. I don't 
really think the IndexWriter needs this flexibility right now and it 
seems like Mike hasn't seen the need thus far while working on 
LUCENE-1458 either.


Adding the Config class and deprecating all other IW constructors will 
not prevent us from doing any of the other things in the future IMO and 
is already a great start to simplify things. So let's do that first and 
discuss the other points separately when the need arises.
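A rough sketch of what such a config class could look like; the class name, setters, and defaults here are assumptions for illustration, not a committed API:

```java
// Hypothetical sketch of the proposed IndexConfig: plain typed setters that
// collect the parameters currently spread over the IndexWriter constructors.
// Names and default values are guesses for illustration only.
public class IndexConfig {
    private double ramBufferSizeMB = 16.0;
    private int maxFieldLength = 10000;
    private boolean useCompoundFile = true;

    public IndexConfig setRAMBufferSizeMB(double mb) {
        this.ramBufferSizeMB = mb;
        return this; // chainable, keeps call sites compact
    }

    public IndexConfig setMaxFieldLength(int n) {
        this.maxFieldLength = n;
        return this;
    }

    public IndexConfig setUseCompoundFile(boolean v) {
        this.useCompoundFile = v;
        return this;
    }

    public double getRAMBufferSizeMB() { return ramBufferSizeMB; }
    public int getMaxFieldLength() { return maxFieldLength; }
    public boolean getUseCompoundFile() { return useCompoundFile; }
}
```

An IndexWriter constructor could then take just a Directory plus this one object, and a factory over it remains possible later without changing call sites.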


 Michael

On 10/5/09 5:40 AM, Uwe Schindler wrote:

Hi Mike,

   

I think AS is overkill for conveying configuration of IW/IR?

Suddenly, instead of:

   cfg.setRAMBufferSizeMB(128.0)

I'd have to do something like this?


cfg.addAttribute(IndexWriterRAMBufferSizeAttribute.class).setRAMBufferSize
(128.0)

It's too cumbersome, I think, for something that ought to be simple.
I'd prefer a dedicated config class with strongly typed setters
exposed.  Of all the "pure syntax" options so far I'd still prefer the
traditional "config object with setters".
 

From this point of view, it's also overkill for TokenStream. But as AS was
also designed for flexible indexing it would fit very well into this area.

The new query parser is a good example pro attributes. An argument
against atts is the fact that even Michael Busch didn't promote them from
the beginning of this discussion :-) (maybe he also needs one night longer
to think about it).

Good points for AS are e.g. the type safety, the simplicity to extend, and
the built-in defaults (you do not need to check for the existence of
attributes; just add them at the point where you want to use them, like in
your example - maybe with nicer and shorter names). With generics, AS is as
simple to use as plain get/setters.
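As a sketch of how typed, default-carrying attributes can work, here is a simplified Class-keyed stand-in (not Lucene's actual AttributeSource): addAttribute() creates a default-constructed instance on first request, which is exactly the "built-in defaults" property, and the Class token gives compile-time type safety.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified stand-in for the AttributeSource-as-config idea. The attribute
// class name below is an invented example, not a real Lucene attribute.
public class ConfigSource {
    private final Map<Class<?>, Object> attrs = new HashMap<>();

    // Returns the existing attribute instance, creating a default-constructed
    // one the first time it is requested (so callers never see null).
    public <T> T addAttribute(Class<T> clazz) {
        Object a = attrs.get(clazz);
        if (a == null) {
            try {
                a = clazz.getDeclaredConstructor().newInstance();
            } catch (ReflectiveOperationException e) {
                throw new IllegalArgumentException(clazz + " needs a no-arg constructor", e);
            }
            attrs.put(clazz, a);
        }
        return clazz.cast(a);
    }

    // Example attribute with a built-in default.
    public static class RAMBufferAttribute {
        private double sizeMB = 16.0;
        public void setRAMBufferSizeMB(double mb) { sizeMB = mb; }
        public double getRAMBufferSizeMB() { return sizeMB; }
    }
}
```

A caller would write cfg.addAttribute(ConfigSource.RAMBufferAttribute.class).setRAMBufferSizeMB(128.0), and any component reading the same attribute class later sees the same instance.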

   

Also, I don't think we should roll this out for all Lucene classes.  I
think most classes do just fine accepting args to their ctor.  EG
TermQuery simply takes Term to its ctor.

I do agree IW should not be in the business of brokering changes to
the settings of its sub-components (eg mergeFactor, maxMergeDocs).
You really should make such changes directly via your merge policy.
 

AttributeSource would also help us with e.g. the possibility for later
changes to various attributes. If some of the attributes are fixed after
construction of IW/IR, just throw IllegalStateExceptions.

   

Finally, I'm not convinced we should lock down all settings after
classes are created.  (I'm not convinced we shouldn't, either).

A merge policy has no trouble changing its mergeFactor,
maxMergeDocs/Size.  IW has no trouble changing the its RAM buffer
size, maxFieldLength, or useCompoundFile.  Sure there are some things
that cannot (or would be very tricky to) change, eg deletion policy.
But then analyzer isn't changeable today, but could be.

But, then, I can also see it'd simplify our code to not have to deal
w/ such changes, reduce chance of subtle bugs, and it seems minor to
go and re-open your IndexWriter if you need to make a settings change?
(Hmm except in an NRT setting, because the reader pool would be reset;
really we need to get the reader pool separated from the IW instance).

Mike

On Mon, Oct 5, 2009 at 4:38 AM, Uwe Schindler  wrote:
 

See my second mail. The recently introduced Attributes and AttributeSource
would solve this. Each component just defines its attribute interface and
impl class and you pass in an AttributeSource as configuration. Then you
can do:

AttributeSource cfg = new AttributeSource();

ComponentAttribute compCfg = cfg.addAttribute(ComponentAttribute.class);
compCfg.setMergeScheduler(FooScheduler.class);

MergeBarAttribute mergeCfg = cfg.addAttribute(MergeBarAttribute.class);
mergeCfg.setWhateverProp(1234);
...
IndexWriter iw = new IndexWriter(dir, cfg);

(this is just brainstorming, not yet thoroughly thought about).

This approach suggests IW creates its components, and while doing so
provides them your AS instance.
I personally prefer creating all these components myself, configuring
them (at the moment of creation) and passing them to IW in one way or
another.
This requires way less code, you don't have to invent elaborate
schemes of passing through your custom per-component settings and
selecting which exact component 

Re: [jira] Created: (LUCENE-1948) Deprecating InstantiatedIndexWriter

2009-10-05 Thread DM Smith

On 10/05/2009 12:22 PM, Karl Wettin (JIRA) wrote:

Deprecating InstantiatedIndexWriter
---

  Key: LUCENE-1948
  URL: https://issues.apache.org/jira/browse/LUCENE-1948
  Project: Lucene - Java
   Issue Type: Task
   Components: contrib/*
 Affects Versions: 2.9
 Reporter: Karl Wettin
 Assignee: Karl Wettin
  Fix For: 3.0


http://markmail.org/message/j6ip266fpzuaibf7

I suppose that should have been suggested before 2.9 rather than
after...
   
There will be a 2.9.1 for bug fixes. Consider adding the deprecation as 
a bug fix.


-- DM

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Issue Comment Edited: (LUCENE-1947) Snowball package contains BSD licensed code with ASL header

2009-10-05 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12762253#action_12762253
 ] 

Mark Miller edited comment on LUCENE-1947 at 10/5/09 9:48 AM:
--

bq. I suppose this a misstake or possible due to the ASL header automation tool.

Yes. If BSD is an approved license, it would be nice if RAT would recognize it 
and then we could just add it to these files. But RAT doesn't appear to. It 
just looks for "http://dojotoolkit.org/community/licensing.shtml" and "TMF854 
Version 1.0 - Copyright TeleManagement Forum" - which it considers modified 
BSD. Weak.

Anyway, NOTICE should also state the license for Snowball along with the 
copyright as well. (reads weird - i know the copyright is there with a link - 
but it should state the license as well)

  was (Author: markrmil...@gmail.com):
bq. I suppose this a misstake or possible due to the ASL header automation 
tool.

Yes. If BSD is an approved license, it would be nice if RAT would recognize it 
and then we could just add it to these files. But RAT doesn't appear to. It 
just looks for "http://dojotoolkit.org/community/licensing.shtml" and "TMF854 
Version 1.0 - Copyright TeleManagement Forum" - which it considers modified 
BSD. Weak.

Anyway, NOTICE should also state the license for Snowball along with the 
copyright as well.
  
> Snowball package contains BSD licensed code with ASL header
> ---
>
> Key: LUCENE-1947
> URL: https://issues.apache.org/jira/browse/LUCENE-1947
> Project: Lucene - Java
>  Issue Type: Task
>  Components: contrib/analyzers
>Affects Versions: 2.9
>Reporter: Karl Wettin
>Assignee: Karl Wettin
> Fix For: 3.0
>
> Attachments: LUCENE-1947.patch
>
>
> All classes in org.tartarus.snowball (but not in org.tartarus.snowball.ext) 
> have for some reason been given an ASL header. These classes are licensed 
> under BSD, so the ASL header should be removed. I suppose this is a mistake, 
> possibly due to the ASL header automation tool.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1947) Snowball package contains BSD licensed code with ASL header

2009-10-05 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12762253#action_12762253
 ] 

Mark Miller commented on LUCENE-1947:
-

bq. I suppose this a misstake or possible due to the ASL header automation tool.

Yes. If BSD is an approved license, it would be nice if RAT would recognize it 
and then we could just add it to these files. But RAT doesn't appear to. It 
just looks for "http://dojotoolkit.org/community/licensing.shtml" and "TMF854 
Version 1.0 - Copyright TeleManagement Forum" - which it considers modified 
BSD. Weak.

Anyway, NOTICE should also state the license for Snowball along with the 
copyright as well.

> Snowball package contains BSD licensed code with ASL header
> ---
>
> Key: LUCENE-1947
> URL: https://issues.apache.org/jira/browse/LUCENE-1947
> Project: Lucene - Java
>  Issue Type: Task
>  Components: contrib/analyzers
>Affects Versions: 2.9
>Reporter: Karl Wettin
>Assignee: Karl Wettin
> Fix For: 3.0
>
> Attachments: LUCENE-1947.patch
>
>
> All classes in org.tartarus.snowball (but not in org.tartarus.snowball.ext) 
> have for some reason been given an ASL header. These classes are licensed 
> under BSD, so the ASL header should be removed. I suppose this is a mistake, 
> possibly due to the ASL header automation tool.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1948) Deprecating InstantiatedIndexWriter

2009-10-05 Thread Karl Wettin (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wettin updated LUCENE-1948:


Attachment: LUCENE-1948.patch

> Deprecating InstantiatedIndexWriter
> ---
>
> Key: LUCENE-1948
> URL: https://issues.apache.org/jira/browse/LUCENE-1948
> Project: Lucene - Java
>  Issue Type: Task
>  Components: contrib/*
>Affects Versions: 2.9
>Reporter: Karl Wettin
>Assignee: Karl Wettin
> Fix For: 3.0
>
> Attachments: LUCENE-1948.patch
>
>
> http://markmail.org/message/j6ip266fpzuaibf7
> I suppose that should have been suggested before 2.9 rather than  
> after...
> There are at least three reasons why I want to do this:
> The code is based on the behaviour of the Directory IndexWriter as of
> 2.3, and I have not touched it since then. If there are changes in the
> future, one will have to keep IIW in sync, something that's easy to
> forget.
> There is no locking, which will cause concurrent modification
> exceptions when accessing the index via searcher/reader while
> committing.
> It uses the old token stream API, so it would have to be upgraded in
> case it should stay.
> The java- and package-level docs have, since it was committed, been
> suggesting that one should treat II as if it were immutable due to the
> locklessness. My suggestion is that we make it immutable for real.
> Since II is meant for small corpora, there is very little time lost by
> using the constructor that builds the index from an IndexReader. I.e.
> rather than using InstantiatedIndexWriter, one would have to use a
> Directory and an IndexWriter and then pass an IndexReader to a new
> InstantiatedIndex.
> Any objections?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Created: (LUCENE-1948) Deprecating InstantiatedIndexWriter

2009-10-05 Thread Karl Wettin (JIRA)
Deprecating InstantiatedIndexWriter
---

 Key: LUCENE-1948
 URL: https://issues.apache.org/jira/browse/LUCENE-1948
 Project: Lucene - Java
  Issue Type: Task
  Components: contrib/*
Affects Versions: 2.9
Reporter: Karl Wettin
Assignee: Karl Wettin
 Fix For: 3.0


http://markmail.org/message/j6ip266fpzuaibf7

I suppose that should have been suggested before 2.9 rather than  
after...

There are at least three reasons why I want to do this:

The code is based on the behaviour of the Directory IndexWriter as of
2.3, and I have not touched it since then. If there are changes in the
future, one will have to keep IIW in sync, something that's easy to
forget.
There is no locking, which will cause concurrent modification
exceptions when accessing the index via searcher/reader while
committing.
It uses the old token stream API, so it would have to be upgraded in
case it should stay.

The java- and package-level docs have, since it was committed, been
suggesting that one should treat II as if it were immutable due to the
locklessness. My suggestion is that we make it immutable for real.

Since II is meant for small corpora, there is very little time lost by
using the constructor that builds the index from an IndexReader. I.e.
rather than using InstantiatedIndexWriter, one would have to use a
Directory and an IndexWriter and then pass an IndexReader to a new
InstantiatedIndex.

Any objections?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1947) Snowball package contains BSD licensed code with ASL header

2009-10-05 Thread Karl Wettin (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wettin updated LUCENE-1947:


Attachment: LUCENE-1947.patch

> Snowball package contains BSD licensed code with ASL header
> ---
>
> Key: LUCENE-1947
> URL: https://issues.apache.org/jira/browse/LUCENE-1947
> Project: Lucene - Java
>  Issue Type: Task
>  Components: contrib/analyzers
>Affects Versions: 2.9
>Reporter: Karl Wettin
>Assignee: Karl Wettin
> Fix For: 3.0
>
> Attachments: LUCENE-1947.patch
>
>
> All classes in org.tartarus.snowball (but not in org.tartarus.snowball.ext) 
> have for some reason been given an ASL header. These classes are licensed 
> under BSD, so the ASL header should be removed. I suppose this is a mistake, 
> possibly due to the ASL header automation tool.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Created: (LUCENE-1947) Snowball package contains BSD licensed code with ASL header

2009-10-05 Thread Karl Wettin (JIRA)
Snowball package contains BSD licensed code with ASL header
---

 Key: LUCENE-1947
 URL: https://issues.apache.org/jira/browse/LUCENE-1947
 Project: Lucene - Java
  Issue Type: Task
  Components: contrib/analyzers
Affects Versions: 2.9
Reporter: Karl Wettin
Assignee: Karl Wettin
 Fix For: 3.0


All classes in org.tartarus.snowball (but not in org.tartarus.snowball.ext) have 
for some reason been given an ASL header. These classes are licensed under BSD, 
so the ASL header should be removed. I suppose this is a mistake, possibly due 
to the ASL header automation tool.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Closed: (LUCENE-1939) IndexOutOfBoundsException at ShingleMatrixFilter's Iterator#hasNext method

2009-10-05 Thread Karl Wettin (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wettin closed LUCENE-1939.
---

   Resolution: Fixed
Fix Version/s: 3.0

Committed in 821888.

Thanks Patrick!

(I'll consider the other stuff mentioned in the issue later this week, and if 
manageable then as a new issue.)

> IndexOutOfBoundsException at ShingleMatrixFilter's Iterator#hasNext method
> --
>
> Key: LUCENE-1939
> URL: https://issues.apache.org/jira/browse/LUCENE-1939
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: contrib/analyzers
>Affects Versions: 2.9
>Reporter: Patrick Jungermann
>Assignee: Karl Wettin
> Fix For: 3.0
>
> Attachments: ShingleMatrixFilter_IndexOutOfBoundsException.patch
>
>
> I tried to use the ShingleMatrixFilter within Solr. To test the functionality 
> etc., I first used the built-in field analysis view. The filter was configured 
> to be used only at query-time analysis, with "_" as the spacer character and a 
> min. and max. shingle size of 2. The generation of shingles for query 
> strings with this filter seems to work in that view, but when turning on the 
> highlighting of indexed terms that match the query terms, the exception 
> was thrown. Also, each time I tried to query the index the exception was 
> immediately thrown.
> Stacktrace:
> {code}
> java.lang.IndexOutOfBoundsException: Index: 1, Size: 1
>   at java.util.ArrayList.RangeCheck(Unknown Source)
>   at java.util.ArrayList.get(Unknown Source)
>   at 
> org.apache.lucene.analysis.shingle.ShingleMatrixFilter$Matrix$1.hasNext(ShingleMatrixFilter.java:729)
>   at 
> org.apache.lucene.analysis.shingle.ShingleMatrixFilter.next(ShingleMatrixFilter.java:380)
>   at org.apache.lucene.analysis.StopFilter.next(StopFilter.java:120)
>   at org.apache.lucene.analysis.TokenStream.next(TokenStream.java:47)
>   ...
> {code}
> Within the hasNext method, the {{s-1}}-th column of the ArrayList 
> {{columns}} is requested, but this entry does not exist within {{columns}}.
> I created a patch that checks whether {{columns}} contains enough entries.
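The shape of such a guard can be sketched independently of the filter; the class below is a toy model of the patched bounds check, not ShingleMatrixFilter's actual code.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of the patched hasNext(): before touching
// columns.get(s - 1), verify that the index actually exists, so callers get
// "no more elements" instead of an IndexOutOfBoundsException.
public class ColumnsDemo {
    static boolean hasColumn(List<List<String>> columns, int s) {
        // Guard added by the patch: the index must be in range before get().
        return s - 1 >= 0 && s - 1 < columns.size();
    }

    public static void main(String[] args) {
        List<List<String>> columns = new ArrayList<>();
        columns.add(List.of("hello", "world"));
        System.out.println(hasColumn(columns, 1)); // true: column 0 exists
        System.out.println(hasColumn(columns, 2)); // false, no exception
    }
}
```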

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing

2009-10-05 Thread John Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12762224#action_12762224
 ] 

John Wang commented on LUCENE-1458:
---

Hi Yonik:

 These are indeed useful features. LUCENE-1922 addresses 1), perhaps, we 
can add 2) to the same issue to track?

Thanks

-John

> Further steps towards flexible indexing
> ---
>
> Key: LUCENE-1458
> URL: https://issues.apache.org/jira/browse/LUCENE-1458
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Index
>Affects Versions: 2.9
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Attachments: LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, 
> LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, 
> LUCENE-1458.tar.bz2
>
>
> I attached a very rough checkpoint of my current patch, to get early
> feedback.  All tests pass, though back compat tests don't pass due to
> changes to package-private APIs plus certain bugs in tests that
> happened to work (eg call TermPositions.nextPosition() too many times,
> which the new API asserts against).
> [Aside: I think, when we commit changes to package-private APIs such
> that back-compat tests don't pass, we could go back, make a branch on
> the back-compat tag, commit changes to the tests to use the new
> package private APIs on that branch, then fix nightly build to use the
> tip of that branch?]
> There's still plenty to do before this is committable! This is a
> rather large change:
>   * Switches to a new more efficient terms dict format.  This still
> uses tii/tis files, but the tii only stores term & long offset
> (not a TermInfo).  At seek points, tis encodes term & freq/prox
> offsets absolutely instead of with deltas.  Also, tis/tii
> are structured by field, so we don't have to record field number
> in every term.
> .
> On first 1 M docs of Wikipedia, tii file is 36% smaller (0.99 MB
> -> 0.64 MB) and tis file is 9% smaller (75.5 MB -> 68.5 MB).
> .
> RAM usage when loading terms dict index is significantly less
> since we only load an array of offsets and an array of String (no
> more TermInfo array).  It should be faster to init too.
> .
> This part is basically done.
>   * Introduces modular reader codec that strongly decouples terms dict
> from docs/positions readers.  EG there is no more TermInfo used
> when reading the new format.
> .
> There's nice symmetry now between reading & writing in the codec
> chain -- the current docs/prox format is captured in:
> {code}
> FormatPostingsTermsDictWriter/Reader
> FormatPostingsDocsWriter/Reader (.frq file) and
> FormatPostingsPositionsWriter/Reader (.prx file).
> {code}
> This part is basically done.
>   * Introduces a new "flex" API for iterating through the fields,
> terms, docs and positions:
> {code}
> FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum
> {code}
> This replaces TermEnum/Docs/Positions.  SegmentReader emulates the
> old API on top of the new API to keep back-compat.
> 
> Next steps:
>   * Plug in new codecs (pulsing, pfor) to exercise the modularity /
> fix any hidden assumptions.
>   * Expose new API out of IndexReader, deprecate old API but emulate
> old API on top of new one, switch all core/contrib users to the
> new API.
>   * Maybe switch to AttributeSources as the base class for TermsEnum,
> DocsEnum, PostingsEnum -- this would give readers API flexibility
> (not just index-file-format flexibility).  EG if someone wanted
> to store payload at the term-doc level instead of
> term-doc-position level, you could just add a new attribute.
>   * Test performance & iterate.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing

2009-10-05 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12762203#action_12762203
 ] 

Yonik Seeley commented on LUCENE-1458:
--

Sounding cool!  I haven't had time to look at the code too much... but I just 
wanted to mention two features I've had in the back of my mind for a while that 
seem to have multiple use cases.

1) How many terms in a field?
- If the tii/TermInfos were exposed, this could be estimated.
- Perhaps this could just be stored in FieldInfos... should be easy to track 
during indexing?
- MultiTermQuery could also use this to switch impls

2) Convert back and forth between a term number and a term.
Solr has code to do this... stores every 128th term in memory as an index, and 
uses that to convert back and forth.  This is very much like the internals of 
TermInfos... would be nice to expose some of that.
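Solr's actual code differs; purely as a toy sketch of the "every 128th term" idea (all names here are made up), the sparse in-memory index plus a binary search is enough to map both directions:

```java
import java.util.ArrayList;
import java.util.List;

// Hedged sketch of a sampled term index: keep every 128th term of a
// sorted term list in memory and use it to map between a term and its
// ordinal. Not Solr's implementation; the "on-disk" dictionary is just
// a List here for simplicity.
public class SampledTermIndex {
    static final int INTERVAL = 128;
    private final List<String> allTerms;                 // sorted; stands in for the on-disk dict
    private final List<String> sampled = new ArrayList<>();

    public SampledTermIndex(List<String> sortedTerms) {
        this.allTerms = sortedTerms;
        for (int i = 0; i < sortedTerms.size(); i += INTERVAL) {
            sampled.add(sortedTerms.get(i));             // every 128th term kept in memory
        }
    }

    // ord -> term (direct here, since the sketch keeps everything in memory).
    public String term(int ord) { return allTerms.get(ord); }

    // term -> ord: binary search the sparse index for the block, then scan.
    public int ord(String term) {
        int block = binarySearchBlock(term);
        for (int i = block * INTERVAL; i < allTerms.size(); i++) {
            if (allTerms.get(i).equals(term)) return i;
        }
        return -1;                                       // term not in the dictionary
    }

    // Largest sampled entry <= term, i.e. the block the term would live in.
    private int binarySearchBlock(String term) {
        int lo = 0, hi = sampled.size() - 1, ans = 0;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (sampled.get(mid).compareTo(term) <= 0) { ans = mid; lo = mid + 1; }
            else hi = mid - 1;
        }
        return ans;
    }
}
```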

> Further steps towards flexible indexing
> ---
>
> Key: LUCENE-1458
> URL: https://issues.apache.org/jira/browse/LUCENE-1458
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Index
>Affects Versions: 2.9
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Attachments: LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, 
> LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, 
> LUCENE-1458.tar.bz2
>
>
> I attached a very rough checkpoint of my current patch, to get early
> feedback.  All tests pass, though back compat tests don't pass due to
> changes to package-private APIs plus certain bugs in tests that
> happened to work (eg call TermPostions.nextPosition() too many times,
> which the new API asserts against).
> [Aside: I think, when we commit changes to package-private APIs such
> that back-compat tests don't pass, we could go back, make a branch on
> the back-compat tag, commit changes to the tests to use the new
> package private APIs on that branch, then fix nightly build to use the
> tip of that branch?o]
> There's still plenty to do before this is committable! This is a
> rather large change:
>   * Switches to a new more efficient terms dict format.  This still
> uses tii/tis files, but the tii only stores term & long offset
> (not a TermInfo).  At seek points, tis encodes term & freq/prox
> offsets absolutely instead of with deltas delta.  Also, tis/tii
> are structured by field, so we don't have to record field number
> in every term.
> .
> On first 1 M docs of Wikipedia, tii file is 36% smaller (0.99 MB
> -> 0.64 MB) and tis file is 9% smaller (75.5 MB -> 68.5 MB).
> .
> RAM usage when loading terms dict index is significantly less
> since we only load an array of offsets and an array of String (no
> more TermInfo array).  It should be faster to init too.
> .
> This part is basically done.
>   * Introduces modular reader codec that strongly decouples terms dict
> from docs/positions readers.  EG there is no more TermInfo used
> when reading the new format.
> .
> There's nice symmetry now between reading & writing in the codec
> chain -- the current docs/prox format is captured in:
> {code}
> FormatPostingsTermsDictWriter/Reader
> FormatPostingsDocsWriter/Reader (.frq file) and
> FormatPostingsPositionsWriter/Reader (.prx file).
> {code}
> This part is basically done.
>   * Introduces a new "flex" API for iterating through the fields,
> terms, docs and positions:
> {code}
> FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum
> {code}
> This replaces TermEnum/Docs/Positions.  SegmentReader emulates the
> old API on top of the new API to keep back-compat.
> 
> Next steps:
>   * Plug in new codecs (pulsing, pfor) to exercise the modularity /
> fix any hidden assumptions.
>   * Expose new API out of IndexReader, deprecate old API but emulate
> old API on top of new one, switch all core/contrib users to the
> new API.
>   * Maybe switch to AttributeSources as the base class for TermsEnum,
> DocsEnum, PostingsEnum -- this would give readers API flexibility
> (not just index-file-format flexibility).  EG if someone wanted
> to store payload at the term-doc level instead of
> term-doc-position level, you could just add a new attribute.
>   * Test performance & iterate.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---

Re: Lucene 2.9 and deprecated IR.open() methods

2009-10-05 Thread Earwin Burrfoot
> I think AS is overkill for conveying configuration of IW/IR?
Agree.

> It's too cumbersome, I think, for something that ought to be simple.
> I'd prefer a dedicated config class with strongly typed setters
> exposed.  Of all the "pure syntax" options so far I'd still prefer the
> traditional "config object with setters".
Builders are visually cleaner. But well, it's just my aesthetic preference.

> Also, I don't think we should roll this out for all Lucene classes.  I
> think most classes do just fine accepting args to their ctor.  EG
> TermQuery simply takes Term to its ctor.
It's obvious.

> I do agree IW should not be in the business of brokering changes to
> the settings of its sub-components (eg mergeFactor, maxMergeDocs).
> You really should make such changes directly via your merge policy.
Aaaand, you shouldn't do such changes after construction :)

> But, then, I can also see it'd simplify our code to not have to deal
> w/ such changes, reduce chance of subtle bugs, and it seems minor to
> go and re-open your IndexWriter if you need to make a settings change?
> (Hmm except in an NRT setting, because the reader pool would be reset;
> really we need to get the reader pool separated from the IW instance).
Even if recreating IW is costly, you don't change settings that often, do you?

Mark:
> Agreed we need to deal with this - *but* I personally think it gets tricky.
> Users should be able to flip compound on/off easily without dealing with
> a mergepolicy IMO. And advanced users that set a mergepolicy shouldn't
> have to deal with losing a compound setting they set with IW after
> setting a new mergepolicy. Can't I have it both ways :)
I don't understand why on earth the compound setting is a property of MergePolicy.
The question of which segments to merge is really orthogonal to the
way you store those segments on disk.

-- 
Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Lucene 2.9 and deprecated IR.open() methods

2009-10-05 Thread Mark Miller
Michael McCandless wrote:
> I think AS is overkill for conveying configuration of IW/IR?
>
> Suddenly, instead of:
>
>   cfg.setRAMBufferSizeMB(128.0)
>
> I'd have to do something like this?
>
>   
> cfg.addAttribute(IndexWriterRAMBufferSizeAttribute.class).setRAMBufferSize(128.0)
>
> It's too cumbersome, I think, for something that ought to be simple.
> I'd prefer a dedicated config class with strongly typed setters
> exposed.  Of all the "pure syntax" options so far I'd still prefer the
> traditional "config object with setters".
>   
+1

> Also, I don't think we should roll this out for all Lucene classes.  I
> think most classes do just fine accepting args to their ctor.  EG
> TermQuery simply takes Term to its ctor.
>   
+1
> I do agree IW should not be in the business of brokering changes to
> the settings of its sub-components (eg mergeFactor, maxMergeDocs).
> You really should make such changes directly via your merge policy.
>   
Agreed we need to deal with this - *but* I personally think it gets tricky.
Users should be able to flip compound on/off easily without dealing with
a mergepolicy IMO. And advanced users that set a mergepolicy shouldn't
have to deal with losing a compound setting they set with IW after
setting a new mergepolicy. Can't I have it both ways :)
> Finally, I'm not convinced we should lock down all settings after
> classes are created.  (I'm not convinced we shouldn't, either).
>
> A merge policy has no trouble changing its mergeFactor,
> maxMergeDocs/Size.  IW has no trouble changing its RAM buffer
> size, maxFieldLength, or useCompoundFile.  Sure there are some things
> that cannot (or would be very tricky to) change, eg deletion policy.
> But then analyzer isn't changeable today, but could be.
>
> But, then, I can also see it'd simplify our code to not have to deal
> w/ such changes, reduce chance of subtle bugs, and it seems minor to
> go and re-open your IndexWriter if you need to make a settings change?
> (Hmm except in an NRT setting, because the reader pool would be reset;
> really we need to get the reader pool separated from the IW instance).
>
> Mike
>
> On Mon, Oct 5, 2009 at 4:38 AM, Uwe Schindler  wrote:
> >> > See my second mail. The recently introduced Attributes and
> >> AttributeSource
> >> > would solve this. Each component just defines its attribute interface
> >> and
> >> > impl class and you pass in an AttributeSource as configuration. Then you
> >> can
> >> > do:
> >> >
> >> > AttributeSource cfg = new AttributeSource();
> >> >
> >> > ComponentAttribute compCfg = cfg.addAttribute(ComponentAttribute.class);
> >> > compCfg.setMergeScheduler(FooScheduler.class);
> >> >
> >> > MergeBarAttribute mergeCfg = cfg.addAttribute(MergeBarAttribute.class);
> >> > mergeCfg.setWhateverProp(1234);
> >> > ...
> >> > IndexWriter iw = new IndexWriter(dir, cfg);
> >> >
> >> > (this is just brainstorming not yet thoroughly thought about).
> >>
> >> This approach suggests IW creates its components, and while doing so
> >> provides them your AS instance.
> >> I personally prefer creating all these components myself, configuring
> >> them (at the moment of creation) and passing them to IW in one way or
> >> another.
> >> This requires way less code, you don't have to invent elaborate
> >> schemes of passing through your custom per-component settings and
> >> selecting which exact component types IW should use, you don't risk
> >> construct/postConstruct/postpostConstruct-style things.
> >
> > Not really. That was just brainstorming. But you can pass also instances
> > instead of class names through attributesource. AttributeSurce only provides
> > type safety for the various configuration settings (which are interfaces).
> > But you could also create an attribute that gets the pointer to the
> > component. So "compCfg.setMergeScheduler(FooScheduler.class);" could also be
> > compConfig.addComponent(new FooScheduler(...));
> >
> > The AttributeSource approach has one other good thing:
> > If you want to use the default settings for one attribute, you do not have
> > to add it to the AS (or you can forget it). With the properties approach,
> > you have to hardcode the parameter defaults and validation everywhere. As
> > the consumer of an AttributeSource gets the attribute also by an
> > addAttribute-call (see current indexing code consuming TokenStreams), this
> > call would add the missing attribute with its default settings defined by
> > the implementation class. So in the above example, if you do not want to
> > provide the "whateverProp", leave the whole MergeBarAttribute out. The
> > consumer (IW) would just call addAttribute(MergeBarAttribute.class), because
> > it needs the attribute to configure itself. AS would add this attribute with
> > default settings.
> >
> > Uwe
> >
> >
> > -
> > To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: java-dev-h...@lucene.apache.org

RE: Lucene 2.9 and deprecated IR.open() methods

2009-10-05 Thread Uwe Schindler
Hi Mike,

> I think AS is overkill for conveying configuration of IW/IR?
> 
> Suddenly, instead of:
> 
>   cfg.setRAMBufferSizeMB(128.0)
> 
> I'd have to do something like this?
> 
> 
> cfg.addAttribute(IndexWriterRAMBufferSizeAttribute.class).setRAMBufferSize
> (128.0)
> 
> It's too cumbersome, I think, for something that ought to be simple.
> I'd prefer a dedicated config class with strongly typed setters
> exposed.  Of all the "pure syntax" options so far I'd still prefer the
> traditional "config object with setters".

From this point of view, it's also overkill for TokenStream. But as AS was
also designed for flexible indexing, it would fit very well into this area.

The new query parser is a good example in favor of attributes. An argument
against atts is the fact that even Michael Busch didn't promote them from
the beginning of this discussion :-) (maybe he also needs one more night
to think about it).

Good points for AS are e.g. the type safety, the ease of extension, and the
built-in defaults (you do not need to check for the existence of attributes;
just add them at the point where you want to use them, like in your example -
maybe with nicer and shorter names). With generics, AS is as simple to use
as plain getters/setters.
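To make that concrete, here is a tiny self-contained mock of the attribute idea - this is NOT the real Lucene AttributeSource API, and the attribute name and its default are invented - but it shows how addAttribute gives both type safety and built-in defaults:

```java
import java.util.HashMap;
import java.util.Map;

// Minimal mock of the attribute-as-config idea. Not the real
// org.apache.lucene.util.AttributeSource; names and defaults are made up.
public class ConfigSource {
    private final Map<Class<?>, Object> atts = new HashMap<>();

    // Generics give type safety with no casts at the call site; a missing
    // attribute is created on demand with its built-in defaults.
    public <T> T addAttribute(Class<T> clazz) {
        Object att = atts.get(clazz);
        if (att == null) {
            try {
                att = clazz.getDeclaredConstructor().newInstance();
            } catch (ReflectiveOperationException e) {
                throw new IllegalArgumentException(clazz.getName(), e);
            }
            atts.put(clazz, att);
        }
        return clazz.cast(att);
    }

    // Hypothetical attribute; 16.0 stands in for some built-in default.
    public static class RamBufferAttribute {
        private double mb = 16.0;
        public void setRAMBufferSizeMB(double mb) { this.mb = mb; }
        public double getRAMBufferSizeMB() { return mb; }
    }
}
```

A consumer that calls `cfg.addAttribute(ConfigSource.RamBufferAttribute.class)` when the user never set anything simply sees the default - the "built-in defaults" point above.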

> Also, I don't think we should roll this out for all Lucene classes.  I
> think most classes do just fine accepting args to their ctor.  EG
> TermQuery simply takes Term to its ctor.
> 
> I do agree IW should not be in the business of brokering changes to
> the settings of its sub-components (eg mergeFactor, maxMergeDocs).
> You really should make such changes directly via your merge policy.

AttributeSource would also give us e.g. the possibility of later changes
to various attributes. If some of the attributes are fixed after
construction of IW/IR, just throw IllegalStateException.

> Finally, I'm not convinced we should lock down all settings after
> classes are created.  (I'm not convinced we shouldn't, either).
> 
> A merge policy has no trouble changing its mergeFactor,
> maxMergeDocs/Size.  IW has no trouble changing its RAM buffer
> size, maxFieldLength, or useCompoundFile.  Sure there are some things
> that cannot (or would be very tricky to) change, eg deletion policy.
> But then analyzer isn't changeable today, but could be.
> 
> But, then, I can also see it'd simplify our code to not have to deal
> w/ such changes, reduce chance of subtle bugs, and it seems minor to
> go and re-open your IndexWriter if you need to make a settings change?
> (Hmm except in an NRT setting, because the reader pool would be reset;
> really we need to get the reader pool separated from the IW instance).
> 
> Mike
> 
> On Mon, Oct 5, 2009 at 4:38 AM, Uwe Schindler  wrote:
> >> > See my second mail. The recently introduced Attributes and
> >> AttributeSource
> >> > would solve this. Each component just defines its attribute interface
> >> and
> >> > impl class and you pass in an AttributeSource as configuration. Then you
> >> can
> >> > do:
> >> >
> >> > AttributeSource cfg = new AttributeSource();
> >> >
> >> > ComponentAttribute compCfg = cfg.addAttribute(ComponentAttribute.class);
> >> > compCfg.setMergeScheduler(FooScheduler.class);
> >> >
> >> > MergeBarAttribute mergeCfg = cfg.addAttribute(MergeBarAttribute.class);
> >> > mergeCfg.setWhateverProp(1234);
> >> > ...
> >> > IndexWriter iw = new IndexWriter(dir, cfg);
> >> >
> >> > (this is just brainstorming not yet thoroughly thought about).
> >>
> >> This approach suggests IW creates its components, and while doing so
> >> provides them your AS instance.
> >> I personally prefer creating all these components myself, configuring
> >> them (at the moment of creation) and passing them to IW in one way or
> >> another.
> >> This requires way less code, you don't have to invent elaborate
> >> schemes of passing through your custom per-component settings and
> >> selecting which exact component types IW should use, you don't risk
> >> construct/postConstruct/postpostConstruct-style things.
> >
> >
> > Not really. That was just brainstorming. But you can pass also instances
> > instead of class names through attributesource. AttributeSurce only provides
> > type safety for the various configuration settings (which are interfaces).
> > But you could also create an attribute that gets the pointer to the
> > component. So "compCfg.setMergeScheduler(FooScheduler.class);" could also be
> > compConfig.addComponent(new FooScheduler(...));
> >
> > The AttributeSource approach has one other good thing:
> > If you want to use the default settings for one attribute, you do not have
> > to add it to the AS (or you can forget it). With the properties approach,
> > you have to hardcode the parameter defaults and validation everywhere. As
> > the consumer of an AttributeSource gets the attribute also by an
> > addAttribute-call (see current indexing code consuming TokenStreams), this
> > call would add the missing attribute with its default settings defined by
> > the implementation class.

[jira] Updated: (LUCENE-1458) Further steps towards flexible indexing

2009-10-05 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-1458:
---

Attachment: LUCENE-1458.tar.bz2
LUCENE-1458-back-compat.patch

New patch attached.  All tests pass.  The changes are mostly cutting
many things over to the flex API.  Still many nocommits to address,
but I'm getting closer!

I haven't "svn up"'d to pick up all the recent deprecation removals /
generics additions.  Kinda dreading doing so :) I think I'll wait
until all deprecations are gone and then bite the bullet...

Cutting over all the MultiTermQuery subclasses was nice: all the places
where we get a TermEnum and iterate, checking whether .field() is still
our field, are now cleaner, because with the flex API the TermsEnum you
get is already scoped to your requested field.
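Roughly, the shape of that cleanup looks like this (hand-mocked classes, not the actual patch or the real Lucene enums - just enough structure to contrast the two loop shapes):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hand-mocked sketch; NOT the real Lucene classes.
public class FlexApiSketch {

    // Old style: one enum over (field, term) pairs for the whole index,
    // so every caller re-checks the field on each step (simplified; the
    // real loop breaks once field() moves past the requested field).
    public static List<String> oldStyle(List<String[]> fieldTermPairs, String field) {
        List<String> terms = new ArrayList<>();
        for (String[] pair : fieldTermPairs) {
            if (!pair[0].equals(field)) continue; // the check the flex API removes
            terms.add(pair[1]);
        }
        return terms;
    }

    // Flex style: ask for the per-field terms up front; the loop body no
    // longer needs to look at fields at all.
    public static List<String> flexStyle(Map<String, List<String>> termsByField, String field) {
        List<String> terms = termsByField.get(field);
        return terms == null ? new ArrayList<String>() : terms;
    }
}
```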


> Further steps towards flexible indexing
> ---
>
> Key: LUCENE-1458
> URL: https://issues.apache.org/jira/browse/LUCENE-1458
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Index
>Affects Versions: 2.9
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Attachments: LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, 
> LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, 
> LUCENE-1458.tar.bz2
>
>
> I attached a very rough checkpoint of my current patch, to get early
> feedback.  All tests pass, though back compat tests don't pass due to
> changes to package-private APIs plus certain bugs in tests that
> happened to work (eg call TermPositions.nextPosition() too many times,
> which the new API asserts against).
> [Aside: I think, when we commit changes to package-private APIs such
> that back-compat tests don't pass, we could go back, make a branch on
> the back-compat tag, commit changes to the tests to use the new
> package private APIs on that branch, then fix nightly build to use the
> tip of that branch?]
> There's still plenty to do before this is committable! This is a
> rather large change:
>   * Switches to a new more efficient terms dict format.  This still
> uses tii/tis files, but the tii only stores term & long offset
> (not a TermInfo).  At seek points, tis encodes term & freq/prox
> offsets absolutely instead of with deltas.  Also, tis/tii
> are structured by field, so we don't have to record field number
> in every term.
> .
> On first 1 M docs of Wikipedia, tii file is 36% smaller (0.99 MB
> -> 0.64 MB) and tis file is 9% smaller (75.5 MB -> 68.5 MB).
> .
> RAM usage when loading terms dict index is significantly less
> since we only load an array of offsets and an array of String (no
> more TermInfo array).  It should be faster to init too.
> .
> This part is basically done.
>   * Introduces modular reader codec that strongly decouples terms dict
> from docs/positions readers.  EG there is no more TermInfo used
> when reading the new format.
> .
> There's nice symmetry now between reading & writing in the codec
> chain -- the current docs/prox format is captured in:
> {code}
> FormatPostingsTermsDictWriter/Reader
> FormatPostingsDocsWriter/Reader (.frq file) and
> FormatPostingsPositionsWriter/Reader (.prx file).
> {code}
> This part is basically done.
>   * Introduces a new "flex" API for iterating through the fields,
> terms, docs and positions:
> {code}
> FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum
> {code}
> This replaces TermEnum/Docs/Positions.  SegmentReader emulates the
> old API on top of the new API to keep back-compat.
> 
> Next steps:
>   * Plug in new codecs (pulsing, pfor) to exercise the modularity /
> fix any hidden assumptions.
>   * Expose new API out of IndexReader, deprecate old API but emulate
> old API on top of new one, switch all core/contrib users to the
> new API.
>   * Maybe switch to AttributeSources as the base class for TermsEnum,
> DocsEnum, PostingsEnum -- this would give readers API flexibility
> (not just index-file-format flexibility).  EG if someone wanted
> to store payload at the term-doc level instead of
> term-doc-position level, you could just add a new attribute.
>   * Test performance & iterate.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org

Re: Lucene 2.9 and deprecated IR.open() methods

2009-10-05 Thread Michael McCandless
I think AS is overkill for conveying configuration of IW/IR?

Suddenly, instead of:

  cfg.setRAMBufferSizeMB(128.0)

I'd have to do something like this?

  
cfg.addAttribute(IndexWriterRAMBufferSizeAttribute.class).setRAMBufferSize(128.0)

It's too cumbersome, I think, for something that ought to be simple.
I'd prefer a dedicated config class with strongly typed setters
exposed.  Of all the "pure syntax" options so far I'd still prefer the
traditional "config object with setters".

Also, I don't think we should roll this out for all Lucene classes.  I
think most classes do just fine accepting args to their ctor.  EG
TermQuery simply takes Term to its ctor.

I do agree IW should not be in the business of brokering changes to
the settings of its sub-components (eg mergeFactor, maxMergeDocs).
You really should make such changes directly via your merge policy.

Finally, I'm not convinced we should lock down all settings after
classes are created.  (I'm not convinced we shouldn't, either).

A merge policy has no trouble changing its mergeFactor,
maxMergeDocs/Size.  IW has no trouble changing its RAM buffer
size, maxFieldLength, or useCompoundFile.  Sure there are some things
that cannot (or would be very tricky to) change, eg deletion policy.
But then analyzer isn't changeable today, but could be.

But, then, I can also see it'd simplify our code to not have to deal
w/ such changes, reduce chance of subtle bugs, and it seems minor to
go and re-open your IndexWriter if you need to make a settings change?
(Hmm except in an NRT setting, because the reader pool would be reset;
really we need to get the reader pool separated from the IW instance).

Mike

On Mon, Oct 5, 2009 at 4:38 AM, Uwe Schindler  wrote:
>> > See my second mail. The recently introduced Attributes and
>> AttributeSource
>> > would solve this. Each component just defines its attribute interface
>> and
>> > impl class and you pass in an AttributeSource as configuration. Then you
>> can
>> > do:
>> >
>> > AttributeSource cfg = new AttributeSource();
>> >
>> > ComponentAttribute compCfg = cfg.addAttribute(ComponentAttribute.class);
>> > compCfg.setMergeScheduler(FooScheduler.class);
>> >
>> > MergeBarAttribute mergeCfg = cfg.addAttribute(MergeBarAttribute.class);
>> > mergeCfg.setWhateverProp(1234);
>> > ...
>> > IndexWriter iw = new IndexWriter(dir, cfg);
>> >
>> > (this is just brainstorming not yet thoroughly thought about).
>>
>> This approach suggests IW creates its components, and while doing so
>> provides them your AS instance.
>> I personally prefer creating all these components myself, configuring
>> them (at the moment of creation) and passing them to IW in one way or
>> another.
>> This requires way less code, you don't have to invent elaborate
>> schemes of passing through your custom per-component settings and
>> selecting which exact component types IW should use, you don't risk
>> construct/postConstruct/postpostConstruct-style things.
>
>
> Not really. That was just brainstorming. But you can pass also instances
> instead of class names through attributesource. AttributeSurce only provides
> type safety for the various configuration settings (which are interfaces).
> But you could also create an attribute that gets the pointer to the
> component. So "compCfg.setMergeScheduler(FooScheduler.class);" could also be
> compConfig.addComponent(new FooScheduler(...));
>
> The AttributeSource approach has one other good thing:
> If you want to use the default settings for one attribute, you do not have
> to add it to the AS (or you can forget it). With the properties approach,
> you have to hardcode the parameter defaults and validation everywhere. As
> the consumer of an AttributeSource gets the attribute also by an
> addAttribute-call (see current indexing code consuming TokenStreams), this
> call would add the missing attribute with its default settings defined by
> the implementation class. So in the above example, if you do not want to
> provide the "whateverProp", leave the whole MergeBarAttribute out. The
> consumer (IW) would just call addAttribute(MergeBarAttribute.class), because
> it needs the attribute to configure itself. AS would add this attribute with
> default settings.
>
> Uwe
>
>
> -
> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-dev-h...@lucene.apache.org
>
>

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1257) Port to Java5

2009-10-05 Thread Karl Wettin (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12762131#action_12762131
 ] 

Karl Wettin commented on LUCENE-1257:
-

bq. err... looks like perhaps its only hit once though and then reused.. maybe 
not so nasty. My first time looking at this code, so I'm sure you can clear it 
up ...

Mark, are you referring to the reflection in Among? Those are pretty tough to 
get rid of.

I think we should replace the StringBuffers in the stemmers if nobody else 
minds, but we should do that in another issue. I also found ASL headers in 
some of the classes; I suppose they were added automatically at some point. 
These classes are all BSD.

> Port to Java5
> -
>
> Key: LUCENE-1257
> URL: https://issues.apache.org/jira/browse/LUCENE-1257
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Analysis, Examples, Index, Other, Query/Scoring, 
> QueryParser, Search, Store, Term Vectors
>Affects Versions: 2.3.1
>Reporter: Cédric Champeau
>Assignee: Uwe Schindler
>Priority: Minor
> Fix For: 3.0
>
> Attachments: instantiated_fieldable.patch, java5.patch, 
> LUCENE-1257-Document.patch, LUCENE-1257-StringBuffer.patch, 
> LUCENE-1257-StringBuffer.patch, LUCENE-1257-StringBuffer.patch, 
> LUCENE-1257_messages.patch, lucene1257surround1.patch, 
> lucene1257surround1.patch, shinglematrixfilter_generified.patch
>
>
> For my needs I've updated Lucene so that it uses Java 5 constructs. I know 
> Java 5 migration had been planned for 2.1 someday in the past, but don't know 
> when it is planned now. This patch against the trunk includes :
> - most obvious generics usage (there are tons of usages of sets, ... Those 
> which are commonly used have been generified)
> - PriorityQueue generification
> - replacement of indexed for loops with for each constructs
> - removal of unnecessary unboxing
> The code is to my opinion much more readable with those features (you 
> actually *know* what is stored in collections reading the code, without the 
> need to lookup for field definitions everytime) and it simplifies many 
> algorithms.
> Note that this patch also includes an interface for the Query class. This has 
> been done for my company's needs for building custom Query classes which add 
> some behaviour to the base Lucene queries. It prevents multiple unnnecessary 
> casts. I know this introduction is not wanted by the team, but it really 
> makes our developments easier to maintain. If you don't want to use this, 
> replace all /Queriable/ calls with standard /Query/.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



RE: Lucene 2.9 and deprecated IR.open() methods

2009-10-05 Thread Uwe Schindler
> > See my second mail. The recently introduced Attributes and
> AttributeSource
> > would solve this. Each component just defines its attribute interface
> and
> > impl class and you pass in an AttributeSource as configuration. Then you
> can
> > do:
> >
> > AttributeSource cfg = new AttributeSource();
> >
> > ComponentAttribute compCfg = cfg.addAttribute(ComponentAttribute.class);
> > compCfg.setMergeScheduler(FooScheduler.class);
> >
> > MergeBarAttribute mergeCfg = cfg.addAttribute(MergeBarAttribute.class);
> > mergeCfg.setWhateverProp(1234);
> > ...
> > IndexWriter iw = new IndexWriter(dir, cfg);
> >
> > (this is just brainstorming not yet thoroughly thought about).
> 
> This approach suggests IW creates its components, and while doing so
> provides them your AS instance.
> I personally prefer creating all these components myself, configuring
> them (at the moment of creation) and passing them to IW in one way or
> another.
> This requires way less code, you don't have to invent elaborate
> schemes of passing through your custom per-component settings and
> selecting which exact component types IW should use, you don't risk
> construct/postConstruct/postpostConstruct-style things.


Not really, that was just brainstorming. You can also pass instances
instead of class names through the AttributeSource. AttributeSource only
provides type safety for the various configuration settings (which are
interfaces), but you could also create an attribute that holds a reference to
the component itself. So "compCfg.setMergeScheduler(FooScheduler.class);"
could just as well be "compCfg.addComponent(new FooScheduler(...));".

The AttributeSource approach has another advantage:
If you want the default settings for an attribute, you simply do not add it
to the AS (or you can forget it). With the properties approach, you would
have to hardcode the parameter defaults and validation everywhere. Because
the consumer of an AttributeSource also gets the attribute via an
addAttribute call (see the current indexing code consuming TokenStreams),
that call would add the missing attribute with the default settings defined
by the implementation class. So in the example above, if you do not want to
set "whateverProp", leave the whole MergeBarAttribute out. The consumer (IW)
would just call addAttribute(MergeBarAttribute.class), because it needs the
attribute to configure itself, and the AS would add this attribute with
default settings.
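The fill-in-defaults behaviour described above can be sketched with a small,
self-contained stand-in (the MergeBarAttribute name comes from the
brainstorming in this thread; the AttributeSource class here is a simplified
toy, not Lucene's actual API, which derives the impl class by naming
convention rather than taking it as a parameter):

```java
import java.util.HashMap;
import java.util.Map;

// Toy stand-in for Lucene's AttributeSource: at most one instance per
// attribute interface; addAttribute creates missing attributes with defaults.
class AttributeSource {
    private final Map<Class<?>, Object> attributes = new HashMap<>();

    @SuppressWarnings("unchecked")
    <A> A addAttribute(Class<A> iface, Class<? extends A> impl) {
        // If the user never configured this attribute, instantiate the impl
        // class, which carries the implementation-defined defaults.
        return (A) attributes.computeIfAbsent(iface, k -> {
            try {
                return impl.getDeclaredConstructor().newInstance();
            } catch (ReflectiveOperationException e) {
                throw new IllegalArgumentException(e);
            }
        });
    }
}

// Hypothetical attribute: interface plus impl class holding the defaults.
interface MergeBarAttribute {
    int getWhateverProp();
    void setWhateverProp(int v);
}

class MergeBarAttributeImpl implements MergeBarAttribute {
    private int whateverProp = 10; // implementation-defined default
    public int getWhateverProp() { return whateverProp; }
    public void setWhateverProp(int v) { whateverProp = v; }
}

public class AttributeConfigSketch {
    public static void main(String[] args) {
        // User leaves MergeBarAttribute out of the configuration entirely...
        AttributeSource cfg = new AttributeSource();

        // ...so the consumer's own addAttribute call materializes it
        // with its default settings.
        MergeBarAttribute merge =
            cfg.addAttribute(MergeBarAttribute.class, MergeBarAttributeImpl.class);
        System.out.println(merge.getWhateverProp()); // prints the default, 10
    }
}
```

Because both the configuring code and the consumer go through the same
addAttribute call, "not configured" and "configured with defaults" become
indistinguishable to the consumer, which is exactly the point made above.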

Uwe


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Lucene 2.9 and deprecated IR.open() methods

2009-10-05 Thread Earwin Burrfoot
On Mon, Oct 5, 2009 at 12:01, Uwe Schindler  wrote:
> Hi Marvin,
>
>> > Property names are always String, values any type (therefore
>> Map).
>> > With Java 5, integer props and so on are no "bad syntax" problem because
>> of
>> > autoboxing (no need to pass new Integer() or Integer.valueOf()).
>>
>> Argument validation gets to be a headache when you pass around complex
>> data
>> structures.  It's doable, but messy and hard to grok.  Going through
>> dedicated
>> methods is cleaner and safer.
>>
> > Another good thing is that implementors of e.g. XML config files, like in
> > Solr, can simply pass all elements in config to this map.
>>
>> I go back and forth on this.  At some point, the volume of data becomes
>> overwhelming and it becomes easier to swap in the name of a class where
>> most
>> of the data can reside in nice, reliable, structured code.
>
> See my second mail. The recently introduced Attributes and AttributeSource
> would solve this. Each component just defines its attribute interface and
> impl class and you pass in an AttributeSource as configuration. Then you can
> do:
>
> AttributeSource cfg = new AttributeSource();
>
> ComponentAttribute compCfg = cfg.addAttribute(ComponentAttribute.class);
> compCfg.setMergeScheduler(FooScheduler.class);
>
> MergeBarAttribute mergeCfg = cfg.addAttribute(MergeBarAttribute.class);
> mergeCfg.setWhateverProp(1234);
> ...
> IndexWriter iw = new IndexWriter(dir, cfg);
>
> (this is just brainstorming not yet thoroughly thought about).

This approach suggests IW creates its components, and while doing so
provides them your AS instance.
I personally prefer creating all these components myself, configuring
them (at the moment of creation) and passing them to IW in one way or
another.
This requires way less code, you don't have to invent elaborate
schemes of passing through your custom per-component settings and
selecting which exact component types IW should use, you don't risk
construct/postConstruct/postpostConstruct-style things.
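The caller-constructs-everything style described here can be sketched roughly
as follows (FooScheduler is the hypothetical component name from this thread,
and HypotheticalWriter is an illustrative stand-in, not a real Lucene class):

```java
// Hypothetical component, fully configured by its constructor.
final class FooScheduler {
    private final int maxThreads;
    FooScheduler(int maxThreads) { this.maxThreads = maxThreads; }
    int maxThreads() { return maxThreads; }
}

// All collaborators are final and supplied up front, so no settings ever
// need to be propagated down into subcomponents after construction.
final class HypotheticalWriter {
    private final FooScheduler scheduler;
    HypotheticalWriter(FooScheduler scheduler) { this.scheduler = scheduler; }
    FooScheduler scheduler() { return scheduler; }
}

public class ConstructorInjectionSketch {
    public static void main(String[] args) {
        // The caller builds and configures each component itself...
        FooScheduler scheduler = new FooScheduler(4);
        // ...and hands the finished instances to the writer.
        HypotheticalWriter writer = new HypotheticalWriter(scheduler);
        System.out.println(writer.scheduler().maxThreads());
    }
}
```

This sidesteps both the class-name-selection problem and the
post-construction notification problem, at the cost of a wider constructor.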

-- 
Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



RE: Lucene 2.9 and deprecated IR.open() methods

2009-10-05 Thread Uwe Schindler
Hi Marvin,

> > Property names are always String, values any type (therefore
> Map).
> > With Java 5, integer props and so on are no "bad syntax" problem because
> of
> > autoboxing (no need to pass new Integer() or Integer.valueOf()).
> 
> Argument validation gets to be a headache when you pass around complex
> data
> structures.  It's doable, but messy and hard to grok.  Going through
> dedicated
> methods is cleaner and safer.
> 
> > Another good thing is that implementors of e.g. XML config files, like in
> > Solr, can simply pass all elements in config to this map.
> 
> I go back and forth on this.  At some point, the volume of data becomes
> overwhelming and it becomes easier to swap in the name of a class where
> most
> of the data can reside in nice, reliable, structured code.

See my second mail. The recently introduced Attributes and AttributeSource
would solve this. Each component just defines its attribute interface and
impl class and you pass in an AttributeSource as configuration. Then you can
do:

AttributeSource cfg = new AttributeSource();

ComponentAttribute compCfg = cfg.addAttribute(ComponentAttribute.class);
compCfg.setMergeScheduler(FooScheduler.class);

MergeBarAttribute mergeCfg = cfg.addAttribute(MergeBarAttribute.class);
mergeCfg.setWhateverProp(1234);
...
IndexWriter iw = new IndexWriter(dir, cfg);

(this is just brainstorming not yet thoroughly thought about).

Uwe



-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Lucene 2.9 and deprecated IR.open() methods

2009-10-05 Thread Marvin Humphrey
On Mon, Oct 05, 2009 at 08:27:20AM +0200, Uwe Schindler wrote:

> Pass a Properties or Map to the ctor/open. The keys are predefined
> constants. Maybe our previous idea of an IndexConfiguration class is a
> subclass of HashMap with all the constants and some easy-to-use
> setter methods for very often-used settings (like index dir) and some
> reasonable defaults.

Interesting.  The design we worked out for Lucy's Segment class (prototype in
KS devel branch) uses hash/array/string data to store arbitrary metadata on
behalf of segment components, written as JSON to seg_NNN/segmeta.json.  In
that case, though, each component is responsible for generating and consuming
its own data.  That's different from having the user supply data via such a
format.

I still think you're going to want an extensible builder class.

> This allows us to pass these properties to any flex indexing component
> without need to modify/extend it to support the additional properties. The
> flexible indexing component just defines its own property names (e.g. as
> URNs, URLs, using its class name as prefix,...). 

But how do you determine what the flex indexing components *are*?  In theory,
you can pass class names and sufficient arguments to build them up via your
big ball of data, but then you're essentially creating a new language, with
all the headaches that entails. 

In KS, Indexer/IndexReader configuration is divided between three classes.

  * Schema: field definitions.
  * Architecture: Settings that never change for the life of the index.
  * IndexManager: Settings that can change per index/search session.

Schema isn't worth discussing -- Lucy will have it, Lucene won't, end of
story.  Architecture and IndexManager, though, are fairly close to what's
being discussed.

Architecture is responsible for e.g. determining which pluggable components
get registered.  It's the builder class.

IndexManager is where things like merging and locking policies reside.

> Property names are always String, values any type (therefore Map).
> With Java 5, integer props and so on are no "bad syntax" problem because of
> autoboxing (no need to pass new Integer() or Integer.valueOf()).

Argument validation gets to be a headache when you pass around complex data
structures.  It's doable, but messy and hard to grok.  Going through dedicated
methods is cleaner and safer.

> Another good thing is that implementors of e.g. XML config files, like in
> Solr, can simply pass all elements in config to this map.

I go back and forth on this.  At some point, the volume of data becomes
overwhelming and it becomes easier to swap in the name of a class where most
of the data can reside in nice, reliable, structured code.

Marvin Humphrey


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



RE: Lucene 2.9 and deprecated IR.open() methods

2009-10-05 Thread Uwe Schindler
> > On Sun, Oct 04, 2009 at 05:53:14AM -0400, Michael McCandless wrote:
> >
> > >   1 Do we prevent config settings from changing after creating an
> > > IW/IR?
> >
> > Any settings conveyed via a settings object ought to be final if you
> want
> > pluggable index components.  Otherwise, you need some nightmarish
> > notification
> > system to propagate settings down into your subcomponents, which may or
> > may
> > not be prepared to handle the value modifications.
> 
> +1, this is an argument in my opinion for final members/settings.
> 
> By the way, there is a third possibility for passing configuration
> settings:
> The idea is to enable passing settings to IR/IW and its flexible indexing
> components by the same technique as JAXP does (please don't hit me!):
> Pass a Properties or Map to the ctor/open. The keys are
> predefined
> constants. Maybe our previous idea of an IndexConfiguration class is a
> subclass of HashMap with all the constants and some easy-to-use
> setter methods for very often-used settings (like index dir) and some
> reasonable defaults.
> 
> This allows us to pass these properties to any flex indexing component
> without need to modify/extend it to support the additional properties. The
> flexible indexing component just defines its own property names (e.g. as
> URNs, URLs, using its class name as prefix,...). Property names are always
> String, values any type (therefore Map). With Java 5, integer
> props and so on are no "bad syntax" problem because of autoboxing (no need
> to pass new Integer() or Integer.valueOf()).
> 
> Another good thing is that implementors of e.g. XML config files, like in
> Solr, can simply pass all elements in config to this map.
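The quoted Properties/Map idea might look roughly like this (the key name,
the URN-style prefix convention, and the default value are all made up for
illustration); it also shows the casting that motivates the type-safety
objection below:

```java
import java.util.HashMap;
import java.util.Map;

public class PropertiesConfigSketch {
    // Hypothetical property key, using the component class name as prefix.
    static final String MERGE_FACTOR = "org.example.FooMergePolicy.mergeFactor";

    public static void main(String[] args) {
        Map<String, Object> cfg = new HashMap<>();
        cfg.put(MERGE_FACTOR, 1234); // autoboxed int, no Integer.valueOf()

        // The consumer must cast and validate itself: the compiler cannot
        // check that the stored value is really an Integer.
        int mergeFactor = (Integer) cfg.getOrDefault(MERGE_FACTOR, 10);
        System.out.println(mergeFactor);
    }
}
```

An XML config layer could dump its elements straight into such a map, but
every consumer then repeats the cast-and-validate dance, which is the
trade-off discussed in this thread.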

Another option for extensibility, with the type safety that properties would
not have, would be Attributes: just pass an AttributeSource as the
configuration. The default index properties would be one attribute, and
custom extensions could define their own.

Uwe


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org