[jira] Commented: (LUCENE-2393) Utility to output total term frequency and df from a lucene index
[ https://issues.apache.org/jira/browse/LUCENE-2393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12857107#action_12857107 ]

Otis Gospodnetic commented on LUCENE-2393:

I think creating a small index with a couple of docs would be the way to go.

Utility to output total term frequency and df from a lucene index

Key: LUCENE-2393
URL: https://issues.apache.org/jira/browse/LUCENE-2393
Project: Lucene - Java
Issue Type: New Feature
Components: contrib/*
Reporter: Tom Burton-West
Priority: Trivial
Attachments: LUCENE-2393.patch

This is a command line utility that takes a field name, term, and index directory and outputs the document frequency for the term and the total number of occurrences of the term in the index (i.e. the sum of the tf of the term for each document). It is useful for estimating the size of the term's entry in the *prx files and the consequent disk I/O demands.

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
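For readers unfamiliar with the two statistics the utility reports, here is a minimal self-contained sketch of what it computes. The `Map` stands in for a Lucene postings list (in the real utility this data would come from the index reader, not an in-memory map); the class and method names are illustrative, not the patch's API.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of one term's postings: docId -> term frequency in that doc.
// df = number of documents containing the term; totalTf = sum of per-doc tf.
public class TermStats {
    public static int docFreq(Map<Integer, Integer> postings) {
        return postings.size();
    }

    public static int totalTermFreq(Map<Integer, Integer> postings) {
        int total = 0;
        for (int tf : postings.values()) total += tf;
        return total;
    }

    public static void main(String[] args) {
        Map<Integer, Integer> postings = new HashMap<>();
        postings.put(0, 3);  // term occurs 3 times in doc 0
        postings.put(4, 1);  // once in doc 4
        postings.put(7, 5);  // five times in doc 7
        // df=3 totalTf=9
        System.out.println("df=" + docFreq(postings)
                + " totalTf=" + totalTermFreq(postings));
    }
}
```

The total term frequency is what drives the size of the term's *prx (positions) entry, since every occurrence stores at least one position.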
[jira] Commented: (LUCENE-2127) Improved large result handling
[ https://issues.apache.org/jira/browse/LUCENE-2127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12797776#action_12797776 ]

Otis Gospodnetic commented on LUCENE-2127:

+1 for Aaron's patch in a separate issue, too.

Improved large result handling

Key: LUCENE-2127
URL: https://issues.apache.org/jira/browse/LUCENE-2127
Project: Lucene - Java
Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
Attachments: LUCENE-2127.patch, LUCENE-2127.patch

Per http://search.lucidimagination.com/search/document/350c54fc90d257ed/lots_of_results#fbb84bd297d15dd5, it would be nice to offer some other Collectors that are better at handling really large numbers of results. This could be implemented in a variety of ways via Collectors. For instance, we could have a raw collector that does no sorting and just returns the ScoreDocs, or we could do as Mike suggests and have Collectors with heuristics about memory tradeoffs that only heapify when appropriate.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
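The tradeoff the description mentions can be sketched without any Lucene code. Below is an illustrative comparison (class and method names are hypothetical, not the patch's API): a size-bounded min-heap keeps memory at O(k) when only the top-k hits matter, while "raw" collection gathers every hit unsorted at O(n) memory but with less work per hit.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

// Two collecting strategies for large result sets, sketched over plain scores.
public class CollectStrategies {
    // Top-k via a bounded min-heap: evict the smallest score once full.
    public static List<Float> topK(float[] scores, int k) {
        PriorityQueue<Float> heap = new PriorityQueue<>(); // min-heap
        for (float s : scores) {
            if (heap.size() < k) heap.add(s);
            else if (s > heap.peek()) { heap.poll(); heap.add(s); }
        }
        List<Float> out = new ArrayList<>(heap);
        Collections.sort(out, Collections.reverseOrder());
        return out;
    }

    // "Raw" collection: no sorting, no heap -- just gather everything.
    public static List<Float> collectAll(float[] scores) {
        List<Float> out = new ArrayList<>();
        for (float s : scores) out.add(s);
        return out;
    }

    public static void main(String[] args) {
        float[] scores = {0.2f, 0.9f, 0.5f, 0.7f, 0.1f};
        System.out.println(topK(scores, 2));           // [0.9, 0.7]
        System.out.println(collectAll(scores).size()); // 5
    }
}
```

A heuristic collector in the spirit of the comment would pick between these: heapify only when k is much smaller than the expected hit count.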
[jira] Commented: (LUCENE-1910) Extension to MoreLikeThis to use tag information
[ https://issues.apache.org/jira/browse/LUCENE-1910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12790889#action_12790889 ]

Otis Gospodnetic commented on LUCENE-1910:

* I'll second Mark's suggestion to extract the Information Gain piece of the patch into separate class(es), so we can reuse it in other places. It looks like it's currently an integral part of the MoreLikeThisUsingTags class. Would that be possible?
* I noticed the code needs the ASL (the Apache Software License) added.
* Also, could you please use the Lucene code format? (Eclipse/IntelliJ templates are at the bottom of http://wiki.apache.org/lucene-java/HowToContribute )

Extension to MoreLikeThis to use tag information

Key: LUCENE-1910
URL: https://issues.apache.org/jira/browse/LUCENE-1910
Project: Lucene - Java
Issue Type: New Feature
Components: contrib/*
Reporter: Thomas D'Silva
Priority: Minor
Attachments: LUCENE-1910.patch

I would like to contribute a class based on the MoreLikeThis class in contrib/queries that generates a query based on the tags associated with a document. The class assumes that documents are tagged with a set of tags (which are stored in the index in a separate Field). The class determines the top document terms associated with a given tag using the information gain metric. While generating a MoreLikeThis query for a document, the tags associated with the document are used to determine the terms in the query. This class is useful for finding documents similar to a document that does not have many relevant terms but was tagged.
[jira] Commented: (LUCENE-2091) Add BM25 Scoring to Lucene
[ https://issues.apache.org/jira/browse/LUCENE-2091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12785473#action_12785473 ]

Otis Gospodnetic commented on LUCENE-2091:

+1 for skipping BM25 and going straight to BM25F. I think the answer to Uwe's question about why this can't just be a different Similarity or some such is that BM25 requires some data that Lucene currently doesn't collect. That's why there were some of those static methods in the examples on the author's site. I *think* what I'm saying is correct. :)

Add BM25 Scoring to Lucene

Key: LUCENE-2091
URL: https://issues.apache.org/jira/browse/LUCENE-2091
Project: Lucene - Java
Issue Type: New Feature
Components: contrib/*
Reporter: Yuval Feinstein
Priority: Minor
Fix For: 3.1
Attachments: LUCENE-2091.patch, persianlucene.jpg
Original Estimate: 48h
Remaining Estimate: 48h

http://nlp.uned.es/~jperezi/Lucene-BM25/ describes an implementation of Okapi-BM25 scoring in the Lucene framework, as an alternative to the standard Lucene scoring (which is a version of mixed boolean/TF-IDF). I have refactored this a bit, added unit tests and improved the runtime somewhat. I would like to contribute the code to Lucene under contrib.
[jira] Commented: (LUCENE-2091) Add BM25 Scoring to Lucene
[ https://issues.apache.org/jira/browse/LUCENE-2091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12785690#action_12785690 ]

Otis Gospodnetic commented on LUCENE-2091:

Joaquin - could you please explain what you mean by "saturate the effect of frequency with k1"? Thanks.
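For readers following this thread: "saturation" refers to the standard BM25 term-frequency factor, in which the parameter k1 bounds how much repeated occurrences of a term can add to the score. The sketch below (not code from the patch) evaluates the textbook formula tf·(k1+1) / (tf + k1·(1 − b + b·dl/avgdl)); as tf grows, the factor flattens toward k1 + 1 instead of growing without bound.

```java
// BM25 tf saturation demo: b is the length-normalization parameter,
// dlRatio is the document length divided by the average document length.
public class Bm25Demo {
    public static double tfFactor(double tf, double k1, double b, double dlRatio) {
        return (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * dlRatio));
    }

    public static void main(String[] args) {
        double k1 = 1.2, b = 0.75;
        for (double tf : new double[] {1, 2, 5, 20, 100}) {
            // Climbs quickly at first, then flattens toward k1 + 1 = 2.2.
            System.out.printf("tf=%5.0f -> %.3f%n", tf, tfFactor(tf, k1, b, 1.0));
        }
    }
}
```

With k1 = 1.2 and an average-length document, tf = 1 gives exactly 1.0 and tf = 100 gives roughly 2.17, still below the k1 + 1 = 2.2 asymptote.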
[jira] Commented: (LUCENE-2091) Add BM25 Scoring to Lucene
[ https://issues.apache.org/jira/browse/LUCENE-2091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783530#action_12783530 ]

Otis Gospodnetic commented on LUCENE-2091:

Has anyone compared this particular BM25 impl. to Lucene's current quasi-VSM approach in terms of:
* any of the relevance eval methods
* indexing performance
* search performance
* ...

Also, this issue is marked as contrib/*. Should this not go straight to core, so more people actually use it and provide feedback? Who knows, there is a chance (ha!) BM25 might turn out better than the current approach, and become the default.
[jira] Issue Comment Edited: (LUCENE-2091) Add BM25 Scoring to Lucene
[ https://issues.apache.org/jira/browse/LUCENE-2091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783530#action_12783530 ]

Otis Gospodnetic edited comment on LUCENE-2091 at 11/30/09 4:21 AM:

Has anyone compared this particular BM25 impl. to Lucene's current quasi-VSM approach in terms of:
* any of the relevance eval methods
* indexing performance
* search performance
* ...

Aha, I found something: http://markmail.org/message/c2r4v7zj7mduzs5d

Also, this issue is marked as contrib/*. Should this not go straight to core, so more people actually use it and provide feedback? Who knows, there is a chance (ha!) BM25 might turn out better than the current approach, and become the default.

was (Author: otis):
Has anyone compared this particular BM25 impl. to Lucene's current quasi-VSM approach in terms of:
* any of the relevance eval methods
* indexing performance
* search performance
* ...

Also, this issue is marked as contrib/*. Should this not go straight to core, so more people actually use it and provide feedback? Who knows, there is a chance (ha!) BM25 might turn out better than the current approach, and become the default.
[jira] Resolved: (LUCENE-1491) EdgeNGramTokenFilter stops on tokens smaller than minimum gram size.
[ https://issues.apache.org/jira/browse/LUCENE-1491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Otis Gospodnetic resolved LUCENE-1491.

Resolution: Fixed

Thanks Todd & Co.

Sending CHANGES.txt
Sending analyzers/src/java/org/apache/lucene/analysis/ngram/EdgeNGramTokenFilter.java
Sending analyzers/src/java/org/apache/lucene/analysis/ngram/NGramTokenFilter.java
Sending analyzers/src/test/org/apache/lucene/analysis/ngram/EdgeNGramTokenFilterTest.java
Sending analyzers/src/test/org/apache/lucene/analysis/ngram/NGramTokenFilterTest.java
Transmitting file data .
Committed revision 794034.

EdgeNGramTokenFilter stops on tokens smaller than minimum gram size.

Key: LUCENE-1491
URL: https://issues.apache.org/jira/browse/LUCENE-1491
Project: Lucene - Java
Issue Type: Bug
Components: Analysis
Affects Versions: 2.4, 2.4.1, 2.9, 3.0
Reporter: Todd Feak
Assignee: Otis Gospodnetic
Fix For: 2.9
Attachments: LUCENE-1491.patch

If a token is encountered in the stream that is shorter in length than the min gram size, the filter will stop processing the token stream. Working up a unit test now, but it may be a few days before I can provide it. Wanted to get it in the system.
[jira] Commented: (LUCENE-1677) Remove GCJ IndexReader specializations
[ https://issues.apache.org/jira/browse/LUCENE-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12716428#action_12716428 ]

Otis Gospodnetic commented on LUCENE-1677:

In my ca. 10-year history of being around Lucene, I think I saw GCJ mentioned only about half a dozen times.

Remove GCJ IndexReader specializations

Key: LUCENE-1677
URL: https://issues.apache.org/jira/browse/LUCENE-1677
Project: Lucene - Java
Issue Type: Task
Reporter: Earwin Burrfoot
Fix For: 2.9

These specializations are outdated, unsupported, most probably pointless due to the speed of modern JVMs and, I bet, nobody uses them (Mike, you said you were going to ask people on java-user - did anybody reply that they need it?). While giving nothing, they make the SegmentReader instantiation code look real ugly. If nobody objects, I'm going to post a patch that removes these from Lucene.
[jira] Commented: (LUCENE-1491) EdgeNGramTokenFilter stops on tokens smaller than minimum gram size.
[ https://issues.apache.org/jira/browse/LUCENE-1491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12716053#action_12716053 ]

Otis Gospodnetic commented on LUCENE-1491:

I'm getting convinced to just drop ngrams < minNgram. If nobody complains by the end of the week, I'll commit.
[jira] Resolved: (LUCENE-1378) Remove remaining @author references
[ https://issues.apache.org/jira/browse/LUCENE-1378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Otis Gospodnetic resolved LUCENE-1378.

Resolution: Fixed

Done. Thank you Paul.

Sending src/java/org/apache/lucene/analysis/package.html
Sending src/java/org/apache/lucene/analysis/standard/package.html
Sending src/java/org/apache/lucene/index/package.html
Sending src/java/org/apache/lucene/queryParser/package.html
Sending src/java/org/apache/lucene/search/package.html
Sending src/java/org/apache/lucene/store/package.html
Sending src/java/org/apache/lucene/util/package.html
Sending src/test/org/apache/lucene/search/TestBooleanOr.java
Transmitting file data
Committed revision 781055.

Remove remaining @author references

Key: LUCENE-1378
URL: https://issues.apache.org/jira/browse/LUCENE-1378
Project: Lucene - Java
Issue Type: Task
Reporter: Otis Gospodnetic
Assignee: Otis Gospodnetic
Priority: Trivial
Fix For: 2.9
Attachments: LUCENE-1378.patch, LUCENE-1378.patch, LUCENE-1378b.patch, LUCENE-1378c.patch

$ find . -name \*.java | xargs grep '@author' | cut -d':' -f1 | xargs perl -pi -e 's/ \...@author.*//'
[jira] Resolved: (LUCENE-898) contrib/javascript is not packaged into releases
[ https://issues.apache.org/jira/browse/LUCENE-898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Otis Gospodnetic resolved LUCENE-898.

Resolution: Fixed

Done.

D javascript/queryEscaper/luceneQueryEscaper.js
D javascript/queryEscaper/testQueryEscaper.html
D javascript/queryEscaper
D javascript/queryConstructor/luceneQueryConstructor.js
D javascript/queryConstructor/luceneQueryConstructor.html
D javascript/queryConstructor/testQueryConstructor.html
D javascript/queryConstructor
D javascript/queryValidator/luceneQueryValidator.js
D javascript/queryValidator/testQueryValidator.html
D javascript/queryValidator
D javascript
Committed revision 781057.

contrib/javascript is not packaged into releases

Key: LUCENE-898
URL: https://issues.apache.org/jira/browse/LUCENE-898
Project: Lucene - Java
Issue Type: Bug
Components: Build
Reporter: Hoss Man
Assignee: Otis Gospodnetic
Priority: Trivial

The contrib/javascript directory is (apparently) a collection of javascript utilities for lucene, but it has no build files or any mechanism to package it, so it is excluded from releases.
[jira] Commented: (LUCENE-1491) EdgeNGramTokenFilter stops on tokens smaller than minimum gram size.
[ https://issues.apache.org/jira/browse/LUCENE-1491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12715551#action_12715551 ]

Otis Gospodnetic commented on LUCENE-1491:

I agree this is an improvement, but like Hoss I'm worried about silently skipping shorter-than-specified-min-ngram-size tokens. Perhaps we need a boolean keepSmaller somewhere, so we can explicitly control the behaviour?
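The behaviour under discussion can be sketched in a few lines. This is an illustrative model, not the patch's code: `frontEdgeNGrams` and the `keepSmaller` flag are hypothetical names, showing how a token shorter than minGram could either be passed through unchanged or dropped (the silent data loss Hoss and Otis are worried about).

```java
import java.util.ArrayList;
import java.util.List;

// Front-edge n-grams of one token, with explicit control over short tokens.
public class EdgeNGrams {
    public static List<String> frontEdgeNGrams(String token, int minGram, int maxGram,
                                               boolean keepSmaller) {
        List<String> grams = new ArrayList<>();
        if (token.length() < minGram) {
            if (keepSmaller) grams.add(token); // pass the short token through
            return grams;                      // otherwise drop it silently
        }
        for (int n = minGram; n <= Math.min(maxGram, token.length()); n++) {
            grams.add(token.substring(0, n));
        }
        return grams;
    }

    public static void main(String[] args) {
        System.out.println(frontEdgeNGrams("or", 4, 6, false));     // []
        System.out.println(frontEdgeNGrams("or", 4, 6, true));      // [or]
        System.out.println(frontEdgeNGrams("lucene", 4, 6, false)); // [luce, lucen, lucene]
    }
}
```

With keepSmaller=false and minGram=4, a token like "or" produces no output at all, which is exactly how short terms vanish from phrase queries.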
[jira] Commented: (LUCENE-1272) Support for boost factor in MoreLikeThis
[ https://issues.apache.org/jira/browse/LUCENE-1272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1271#action_1271 ]

Otis Gospodnetic commented on LUCENE-1272:

Jonathan, would it be possible for you to update this patch to work with the trunk, so I can apply it? Thanks!

Support for boost factor in MoreLikeThis

Key: LUCENE-1272
URL: https://issues.apache.org/jira/browse/LUCENE-1272
Project: Lucene - Java
Issue Type: New Feature
Components: contrib/*
Reporter: Jonathan Leibiusky
Assignee: Otis Gospodnetic
Priority: Minor
Fix For: 2.9
Attachments: morelikethis_boostfactor.patch

This is a patch I made to be able to boost the terms with a specific factor besides the relevancy returned by MoreLikeThis. This is helpful when having more than one MoreLikeThis in the query, so words in field A (i.e. Title) can be boosted more than words in field B (i.e. Description).
[jira] Commented: (LUCENE-1378) Remove remaining @author references
[ https://issues.apache.org/jira/browse/LUCENE-1378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12715578#action_12715578 ]

Otis Gospodnetic commented on LUCENE-1378:

I think a bunch of that xdocs stuff under site should/will really be removed with time, as some of it is out of date (e.g. benchmarks, contrib) and harder to maintain than Wiki pages.
[jira] Commented: (LUCENE-1491) EdgeNGramTokenFilter stops on tokens smaller than minimum gram size.
[ https://issues.apache.org/jira/browse/LUCENE-1491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12715661#action_12715661 ]

Otis Gospodnetic commented on LUCENE-1491:

I'm not 100% sure - I'm not using ngrams at the moment, so I have no place to test this out - but skipping shorter-than-minimal ngrams seems like it would result in silent data loss. Ah, here, an example: what would happen to "to be or not to be" if min=4 and we relied on ngrams to perform phrase queries? All of those terms would be dropped, so a search for "to be or not to be" would result in 0 hits. If the above is correct, I think this sounds like a bad thing that one wouldn't expect...
[jira] Commented: (LUCENE-1491) EdgeNGramTokenFilter stops on tokens smaller than minimum gram size.
[ https://issues.apache.org/jira/browse/LUCENE-1491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12715763#action_12715763 ]

Otis Gospodnetic commented on LUCENE-1491:

Karl - LUCENE-1306 - I agree, I think the existing edge and non-edge ngram stuff should be folded into LUCENE-1306 (or the other way around, if it's easier). But won't the question of what to do with the chunks shorter than the min ngram remain? Does adding that boolean hurt anything? (other than an if test for every ngram :) )
[jira] Commented: (LUCENE-1377) Add HTMLStripReader and WordDelimiterFilter from SOLR
[ https://issues.apache.org/jira/browse/LUCENE-1377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12715780#action_12715780 ]

Otis Gospodnetic commented on LUCENE-1377:

Could we make this even more generic and say that all basic tokenizers and filters that currently live in Solr should really move to Lucene?

Add HTMLStripReader and WordDelimiterFilter from SOLR

Key: LUCENE-1377
URL: https://issues.apache.org/jira/browse/LUCENE-1377
Project: Lucene - Java
Issue Type: Improvement
Components: Analysis
Affects Versions: 2.3.2
Reporter: Jason Rutherglen
Priority: Minor
Original Estimate: 24h
Remaining Estimate: 24h

SOLR has two classes, HTMLStripReader and WordDelimiterFilter, which are very useful for a wide variety of use cases. It would be good to place them into core Lucene.
[jira] Commented: (LUCENE-1629) contrib intelligent Analyzer for Chinese
[ https://issues.apache.org/jira/browse/LUCENE-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12714285#action_12714285 ]

Otis Gospodnetic commented on LUCENE-1629:

I just got to look at this code and I only scanned it quickly. Is all of the code really Chinese-specific? Would any of it be applicable to other languages, say Japanese or Korean? (assuming we have dictionaries in a suitable format)

contrib intelligent Analyzer for Chinese

Key: LUCENE-1629
URL: https://issues.apache.org/jira/browse/LUCENE-1629
Project: Lucene - Java
Issue Type: Improvement
Components: contrib/analyzers
Affects Versions: 2.4.1
Environment: for java 1.5 or higher, lucene 2.4.1
Reporter: Xiaoping Gao
Assignee: Michael McCandless
Fix For: 2.9
Attachments: analysis-data.zip, bigramdict.mem, build-resources-with-folder.patch, build-resources.patch, build-resources.patch, coredict.mem, LUCENE-1629-encoding-fix.patch, LUCENE-1629-java1.4.patch

I wrote an Analyzer for Apache Lucene for analyzing sentences in the Chinese language. It's called imdict-chinese-analyzer; the project on Google Code is here: http://code.google.com/p/imdict-chinese-analyzer/

In Chinese, 我是中国人 (I am Chinese) should be tokenized as 我 (I) 是 (am) 中国人 (Chinese), not 我 是中 国人. So the analyzer must handle each sentence properly, or there will be misunderstandings everywhere in the index constructed by Lucene, and the accuracy of the search engine will be affected seriously!

Although there are two analyzer packages in the Apache repository which can handle Chinese, ChineseAnalyzer and CJKAnalyzer, they take each character or every two adjoining characters as a single word; this is obviously not true in reality, and this strategy will also increase the index size and hurt performance badly.

The algorithm of imdict-chinese-analyzer is based on a Hidden Markov Model (HMM), so it can tokenize a Chinese sentence in a really intelligent way. Tokenization accuracy of this model is above 90% according to the paper "HHMM-based Chinese Lexical Analyzer ICTCLAS", while other analyzers' is about 60%.

As imdict-chinese-analyzer is really fast and intelligent, I want to contribute it to the Apache Lucene repository.
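The bigram strategy the description criticizes is easy to demonstrate. The sketch below (illustrative, not CJKAnalyzer's actual code) emits every two adjoining characters as a "word", the way CJK bigram tokenizers do; for 我是中国人 only 中国 and 国人 happen to be real words, whereas a dictionary/HMM-based segmenter would produce 我 / 是 / 中国人.

```java
import java.util.ArrayList;
import java.util.List;

// Naive CJK bigram tokenization: every pair of adjoining characters.
// (Works for BMP characters, which covers this example.)
public class CjkBigrams {
    public static List<String> bigrams(String text) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 1 < text.length(); i++) {
            out.add(text.substring(i, i + 2));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(bigrams("我是中国人")); // [我是, 是中, 中国, 国人]
    }
}
```

Note how a 5-character sentence yields 4 bigram "terms", which is also why the description says the bigram approach inflates the index.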
[jira] Commented: (LUCENE-1284) Set of Java classes that allow the Lucene search engine to use morphological information developed for the Apertium open-source machine translation platform (http://www
[ https://issues.apache.org/jira/browse/LUCENE-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12702192#action_12702192 ]

Otis Gospodnetic commented on LUCENE-1284:

Hm, I feel that because of these command-line, non-Java, GPLed tools it may not be possible (or will be very clunky) to integrate this with Lucene. What do others think?

Felipe, although Java equivalents of those command-line tools don't exist currently, do you think one could implement them in Java (and release them under the ASL)? I don't know what exactly is in those tools and what it would take to port them to Java. Thanks.

Set of Java classes that allow the Lucene search engine to use morphological information developed for the Apertium open-source machine translation platform (http://www.apertium.org)

Key: LUCENE-1284
URL: https://issues.apache.org/jira/browse/LUCENE-1284
Project: Lucene - Java
Issue Type: New Feature
Environment: New feature developed under GNU/Linux, but it should work on any other Java-compliant platform
Reporter: Felipe Sánchez Martínez
Assignee: Otis Gospodnetic
Attachments: apertium-morph.0.9.0.tgz

Set of Java classes that allow the Lucene search engine to use morphological information developed for the Apertium open-source machine translation platform (http://www.apertium.org). Morphological information is used to index new documents and to process smarter queries in which morphological attributes can be used to specify query terms. The tool makes use of morphological analyzers and dictionaries developed for the open-source machine translation platform Apertium (http://apertium.org) and, optionally, the part-of-speech taggers developed for it. Currently there are morphological dictionaries available for Spanish, Catalan, Galician, Portuguese, Aranese, Romanian, French and English. In addition, new dictionaries are being developed for Esperanto, Occitan, Basque, Swedish, Danish, Welsh, Polish and Italian, among others; we hope more language pairs will be added to the Apertium machine translation platform in the near future.
[jira] Commented: (LUCENE-1284) Set of Java classes that allow the Lucene search engine to use morphological information developed for the Apertium open-source machine translation platform (http://www
[ https://issues.apache.org/jira/browse/LUCENE-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12697952#action_12697952 ] Otis Gospodnetic commented on LUCENE-1284: -- Hi Felipe, OK, I looked at this some more. So the Java code you contributed is ASL and Apertium's tools (and data?) is GPL v2? The thing that puzzles me are the language pairs themselves. Why are they in pairs? Is that simply for the translation part of Apertium, and something that's ignored when you use the pair for Lucene and morphological analysis? If I'm interested in, say, French morphological analyzer, why do I need any other language? For French, I see: * br-fr * en-fr * fr-ca * fr-es If I'm interested in French, which of the 4 above is the right one to use? The one with the highest number of lemmata? I had a look at the Indexer and Searcher to get an idea about the usage. Those classes are really just for demonstration, right? Still, do you mind replacing the deprecated Hits object in the Searcher class? In the README you mention this: {quote} 2. The Spanish morphological dictionary must be preprocessed in advance to remove multiword expressions: $ java -classpath lucene-apertium-morph-2.4-dev.jar \ org.apache.lucene.apertium.tools.RemoveMultiWordsFromDix \ --dix apertium-es-ca.es.dix apertium-es-ca.es-nomw.dix {quote} Could you explain why the removal of multiword expressions is needed? Is that Spanish-specific or something one needs to do regardless of the language? Also: {quote} 4. Each file to be indexed must be preprocessed using the Apertium tools: $ cat file.txt | apertium-destxt | lt-proc -a es-ca-nomw.automorf.bin | apertium-tagger -g -f es-ca.prob file.pos.txt {quote} So these are a few command-line tools that end up marking up the input text with POS? (I seem to be missing some libraries and can't compile Apterium locally to check what that this marked up file looks like). 
But my main question here is whether there are Java equivalents of these command-line tools, so that one can easily use them from Java. Or is one forced to use Runtime.exec(...)? Thanks.

Set of Java classes that allow the Lucene search engine to use morphological information developed for the Apertium open-source machine translation platform (http://www.apertium.org) -- Key: LUCENE-1284 URL: https://issues.apache.org/jira/browse/LUCENE-1284 Project: Lucene - Java Issue Type: New Feature Environment: New feature developed under GNU/Linux, but it should work on any other Java-compliant platform Reporter: Felipe Sánchez Martínez Assignee: Otis Gospodnetic Attachments: apertium-morph.0.9.0.tgz

Set of Java classes that allow the Lucene search engine to use morphological information developed for the Apertium open-source machine translation platform (http://www.apertium.org). Morphological information is used to index new documents and to process smarter queries in which morphological attributes can be used to specify query terms. The tool makes use of morphological analyzers and dictionaries developed for the open-source machine translation platform Apertium (http://apertium.org) and, optionally, the part-of-speech taggers developed for it. Currently there are morphological dictionaries available for Spanish, Catalan, Galician, Portuguese, Aranese, Romanian, French and English. In addition, new dictionaries are being developed for Esperanto, Occitan, Basque, Swedish, Danish, Welsh, Polish and Italian, among others; we hope more language pairs will be added to the Apertium machine translation platform in the near future. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
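Since the thread asks whether these tools can be driven from Java at all, here is a minimal, hypothetical sketch that shells out to the README's pipeline via ProcessBuilder. The class and method names are invented, and the final redirect to file.pos.txt is an assumption (the quoted command appears to have lost its `>` in transit):

```java
import java.io.IOException;
import java.util.Arrays;

public class ApertiumPreprocess {
  // Builds the README's shell pipeline. The tool names and model files come
  // from the README quoted above; the trailing "> out" redirect is assumed.
  static String buildCommand(String in, String out) {
    return "cat " + in
        + " | apertium-destxt"
        + " | lt-proc -a es-ca-nomw.automorf.bin"
        + " | apertium-tagger -g -f es-ca.prob > " + out;
  }

  // Runs the pipeline through sh; assumes the Apertium tools are on the PATH.
  public static void preprocess(String in, String out)
      throws IOException, InterruptedException {
    Process p = new ProcessBuilder(Arrays.asList("sh", "-c", buildCommand(in, out)))
        .inheritIO().start();
    if (p.waitFor() != 0) {
      throw new IOException("Apertium pipeline failed for " + in);
    }
  }
}
```

Absent real Java bindings, something like this is essentially the Runtime.exec(...) route Otis mentions, just with ProcessBuilder's slightly safer API.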
[jira] Commented: (LUCENE-1284) Set of Java classes that allow the Lucene search engine to use morphological information developed for the Apertium open-source machine translation platform (http://www
[ https://issues.apache.org/jira/browse/LUCENE-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12697185#action_12697185 ] Otis Gospodnetic commented on LUCENE-1284: -- Felipe: I took another look at this. I spotted mentions of GPL, but it's not clear to me what's GPLed. We can't have GPL software in Apache, unfortunately. Could you please explain which pieces are GPLed and tell us if this is something that could be changed to ASL? Thanks.
[jira] Commented: (LUCENE-1284) Set of Java classes that allow the Lucene search engine to use morphological information developed for the Apertium open-source machine translation platform (http://www
[ https://issues.apache.org/jira/browse/LUCENE-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12697208#action_12697208 ] Otis Gospodnetic commented on LUCENE-1284: -- One more for Felipe. Is there a page on http://wiki.apertium.org/ that lists the definitive, up-to-date list of supported languages, perhaps with some kind of status indicator (e.g. whether anyone is actively working on the language) and the level of support? I see http://wiki.apertium.org/wiki/List_of_language_pairs and http://wiki.apertium.org/wiki/Language_and_pair_maintainer ...but I can't quite translate (no pun intended) those numbers into the level of support for a language. Could you please shed some light on this?
[jira] Commented: (LUCENE-1561) Maybe rename Field.omitTf, and strengthen the javadocs
[ https://issues.apache.org/jira/browse/LUCENE-1561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12688385#action_12688385 ] Otis Gospodnetic commented on LUCENE-1561: -- Might be good to keep a consistent name across Lucene/Solr. More info coming up in SOLR-1079. Maybe rename Field.omitTf, and strengthen the javadocs -- Key: LUCENE-1561 URL: https://issues.apache.org/jira/browse/LUCENE-1561 Project: Lucene - Java Issue Type: Improvement Components: Index Affects Versions: 2.4.1 Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 2.9 Attachments: LUCENE-1561.patch Spinoff from here: http://www.nabble.com/search-problem-when-indexed-using-Field.setOmitTf()-td22456141.html Maybe rename omitTf to something like omitTermPositions, and make it clear what queries will silently fail to work as a result. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1284) Set of Java classes that allow the Lucene search engine to use morphological information developed for the Apertium open-source machine translation platform (http://www
[ https://issues.apache.org/jira/browse/LUCENE-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12675604#action_12675604 ] Otis Gospodnetic commented on LUCENE-1284: -- Felipe - I'll have a look at this next week, thanks for the reminder!
[jira] Commented: (LUCENE-1519) Change Primitive Data Types from int to long in class SegmentMerger.java
[ https://issues.apache.org/jira/browse/LUCENE-1519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12663633#action_12663633 ] Otis Gospodnetic commented on LUCENE-1519: -- Deepak - could you please bring this up on the java-user mailing list instead and close this issue?

Change Primitive Data Types from int to long in class SegmentMerger.java Key: LUCENE-1519 URL: https://issues.apache.org/jira/browse/LUCENE-1519 Project: Lucene - Java Issue Type: Bug Affects Versions: 2.4 Environment: lucene 2.4.0, jdk1.6.0_03/07/11 Reporter: Deepak Original Estimate: 4h Remaining Estimate: 4h

Hi, we are getting an exception while optimizing: "mergeFields produced an invalid result: docCount is 385282378 but fdx file size is 3082259028; now aborting this merge to prevent index corruption". I have checked the code of class SegmentMerger.java and found this check:

if (4+docCount*8 != fdxFileLength)
  // This is most likely a bug in Sun JRE 1.6.0_04/_05;
  // we detect that the bug has struck, here, and
  // throw an exception to prevent the corruption from
  // entering the index. See LUCENE-1282 for details.
  throw new RuntimeException("mergeFields produced an invalid result: docCount is " + docCount + " but fdx file size is " + fdxFileLength + "; now aborting this merge to prevent index corruption");

In our case docCount is 385282378 and fdxFileLength is 3082259028. Even though 4+385282378*8 is equal to 3082259028, the check above fails because 3082259028 is out of int range, so 4+docCount*8 overflows when evaluated in int arithmetic. The type of variable docCount (or at least of this expression) needs to be changed to long. I have written a small test for this:

public class SegmentMergerTest {
  public static void main(String[] args) {
    int docCount = 385282378;
    long fdxFileLength = 3082259028L;
    if (4+docCount*8 != fdxFileLength)
      System.out.println("No Match " + (4+docCount*8));
    else
      System.out.println("Match " + (4+docCount*8));
  }
}

The test above prints "No Match", but if you change the data type of docCount to long, it prints "Match". Can you please advise us whether this issue will be fixed in the next release? Regards, Deepak
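The widening fix Deepak describes can be shown in isolation: casting docCount to long before the multiplication makes the whole expression evaluate in 64-bit arithmetic. A minimal sketch, assuming nothing about SegmentMerger itself (class name invented):

```java
public class FdxLengthCheck {
  // Reproduces the int overflow from the report and the widened fix.
  public static void main(String[] args) {
    int docCount = 385282378;
    long fdxFileLength = 3082259028L;

    // Buggy: docCount*8 is computed in 32-bit int arithmetic and wraps
    // before being widened to long for the comparison.
    long buggy = 4 + docCount * 8;

    // Fixed: cast first, so the multiplication happens in long arithmetic.
    long fixed = 4 + (long) docCount * 8;

    System.out.println(buggy == fdxFileLength);  // false
    System.out.println(fixed == fdxFileLength);  // true
  }
}
```

The same one-character-class fix (a cast at the comparison site) avoids changing the type of docCount everywhere else in the class.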
[jira] Commented: (LUCENE-1513) fastss fuzzyquery
[ https://issues.apache.org/jira/browse/LUCENE-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12661302#action_12661302 ] Otis Gospodnetic commented on LUCENE-1513: -- I feel like I missed some FastSS discussion on the list; was there one? I took a quick look at the paper and the code. Is the following the general idea:
# index fuzzy/misspelled terms in addition to the normal terms (= larger index, slower indexing). How much fuzziness one wants to allow or handle is decided at index time.
# rewrite the query to include variations/misspellings of each term and use that to search (= more clauses, slower than normal search, but faster than the normal fuzzy query whose speed depends on the number of indexed terms)?
Quick code comments:
* Need to add ASL
* Need to replace tabs with 2 spaces and fix formatting in FuzzyHitCollector
* No @author
* Unit test if possible
* Should FastSSwC not be able to take a variable K?
* Should variables named after types (e.g. set in public static String getNeighborhoodString(Set<String> set) { ) be renamed, so they describe what's in them instead? (easier to understand API?)

fastss fuzzyquery - Key: LUCENE-1513 URL: https://issues.apache.org/jira/browse/LUCENE-1513 Project: Lucene - Java Issue Type: New Feature Components: contrib/* Reporter: Robert Muir Priority: Minor Attachments: fastSSfuzzy.zip

Code for doing fuzzy queries with the FastSSwC algorithm. FuzzyIndexer: given a Lucene field, it enumerates all terms and creates an auxiliary offline index for fuzzy queries. FastFuzzyQuery: similar to fuzzy query except it queries the auxiliary index to retrieve a candidate list. This list is then verified with the Levenshtein algorithm. Sorry, but the code is a bit messy... what I'm actually using is very different from this so it's pretty much untested. But at least you can see what's going on or fix it up.
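For context on the approach being discussed: the FastSS family indexes each term's "deletion neighborhood" (every variant obtained by deleting up to k characters) and finds fuzzy candidates via neighborhood overlap, which are then verified with Levenshtein, exactly the verification step Robert describes. A hypothetical sketch of the k=1 neighborhood (not code from the attached patch):

```java
import java.util.LinkedHashSet;
import java.util.Set;

public class DeletionNeighborhood {
  // All strings reachable from `term` by deleting at most one character.
  // Two terms within edit distance 1 have overlapping k=1 neighborhoods,
  // so intersecting neighborhoods yields the candidate list to verify.
  public static Set<String> k1(String term) {
    Set<String> out = new LinkedHashSet<>();
    out.add(term);
    for (int i = 0; i < term.length(); i++) {
      out.add(term.substring(0, i) + term.substring(i + 1));
    }
    return out;
  }

  public static void main(String[] args) {
    System.out.println(k1("cat"));  // [cat, at, ct, ca]
  }
}
```

Indexing these neighborhoods alongside the normal terms is what makes the index larger and indexing slower, the trade-off noted in point 1 of the comment.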
[jira] Commented: (LUCENE-1511) Improve Java packages (remove shared/split packages, refactore naming scheme)
[ https://issues.apache.org/jira/browse/LUCENE-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12660808#action_12660808 ] Otis Gospodnetic commented on LUCENE-1511: -- Perhaps this should have been brought up on java-dev first... How does one deal with package-private classes/methods then?

Improve Java packages (remove shared/split packages, refactore naming scheme) - Key: LUCENE-1511 URL: https://issues.apache.org/jira/browse/LUCENE-1511 Project: Lucene - Java Issue Type: Wish Components: contrib/*, Search Affects Versions: 2.4 Reporter: Gunnar Wagenknecht

I recently prepared Lucene OSGi bundles for the Eclipse Orbit repository. During the preparation I discovered that some packages (e.g. org.apache.lucene.search) are shared between different JARs, i.e. the package is in Lucene Core and in a contrib lib. While this is perfectly fine, it just makes OSGi packaging more complex, and complexity also has a higher potential for errors. Thus, my wish for Lucene 3.0 would be to rename some packages. For example, all contribs/extensions could be moved into their own package namespace. (Apologies if this has been reported elsewhere. I did a search in JIRA but did not find a similar issue.)
[jira] Updated: (LUCENE-1491) EdgeNGramTokenFilter stops on tokens smaller than minimum gram size.
[ https://issues.apache.org/jira/browse/LUCENE-1491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic updated LUCENE-1491: - Lucene Fields: [New, Patch Available] (was: [New]) Fix Version/s: 2.9 Assignee: Otis Gospodnetic

EdgeNGramTokenFilter stops on tokens smaller than minimum gram size. Key: LUCENE-1491 URL: https://issues.apache.org/jira/browse/LUCENE-1491 Project: Lucene - Java Issue Type: Bug Components: Analysis Affects Versions: 2.4, 2.4.1, 2.9, 3.0 Reporter: Todd Feak Assignee: Otis Gospodnetic Fix For: 2.9 Attachments: LUCENE-1491.patch

If a token is encountered in the stream that is shorter in length than the min gram size, the filter will stop processing the token stream. Working up a unit test now, but it may be a few days before I can provide it. Wanted to get it in the system.
[jira] Commented: (LUCENE-1487) FieldCacheTermsFilter
[ https://issues.apache.org/jira/browse/LUCENE-1487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12655517#action_12655517 ] Otis Gospodnetic commented on LUCENE-1487: -- Would it be possible to reformat to use Lucene code style and add a bit of javadoc/unit test? Eclipse and IDEA styles are at the bottom of http://wiki.apache.org/lucene-java/HowToContribute FieldCacheTermsFilter - Key: LUCENE-1487 URL: https://issues.apache.org/jira/browse/LUCENE-1487 Project: Lucene - Java Issue Type: New Feature Components: Search Affects Versions: 2.4 Reporter: Tim Sturge Fix For: 2.9 Attachments: FieldCacheTermsFilter.java This is a companion to FieldCacheRangeFilter except it operates on a set of terms rather than a range. It works best when the set is comparatively large or the terms are comparatively common. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Updated: (LUCENE-1380) Patch for ShingleFilter.enablePositions (or PositionFilter)
[ https://issues.apache.org/jira/browse/LUCENE-1380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic updated LUCENE-1380: - Lucene Fields: [New, Patch Available] (was: [Patch Available, New]) Fix Version/s: 2.4.1 Patch for ShingleFilter.enablePositions (or PositionFilter) --- Key: LUCENE-1380 URL: https://issues.apache.org/jira/browse/LUCENE-1380 Project: Lucene - Java Issue Type: Improvement Components: contrib/analyzers Reporter: Mck SembWever Priority: Trivial Fix For: 2.4.1 Attachments: LUCENE-1380-PositionFilter.patch, LUCENE-1380-PositionFilter.patch, LUCENE-1380-PositionFilter.patch, LUCENE-1380.patch, LUCENE-1380.patch Make it possible for *all* words and shingles to be placed at the same position, that is for _all_ shingles (and unigrams if included) to be treated as synonyms of each other. Today the shingles generated are synonyms only to the first term in the shingle. For example the query abcd efgh ijkl results in: (abcd abcd efgh abcd efgh ijkl) (efgh efgh ijkl) (ijkl) where abcd efgh and abcd efgh ijkl are synonyms of abcd, and efgh ijkl is a synonym of efgh. There exists no way today to alter which token a particular shingle is a synonym for. This patch takes the first step in making it possible to make all shingles (and unigrams if included) synonyms of each other. See http://comments.gmane.org/gmane.comp.jakarta.lucene.user/34746 for mailing list thread. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
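For readers unfamiliar with how tokens become "synonyms" of each other: a position increment of 0 places a token at the same position as the previous token, and that is the mechanism a PositionFilter uses to stack all shingles (and unigrams) at one position. A hypothetical, Lucene-free sketch of the bookkeeping (names invented):

```java
import java.util.ArrayList;
import java.util.List;

public class PositionIncrementDemo {
  // A token plus its position increment, as in an analysis token stream.
  record Token(String text, int posIncr) {}

  // Resolves increments into absolute positions: increment 0 stacks a
  // token on the previous position, making the two tokens synonyms.
  static List<String> positions(List<Token> tokens) {
    List<String> out = new ArrayList<>();
    int pos = -1;
    for (Token t : tokens) {
      pos += t.posIncr();
      out.add(pos + ":" + t.text());
    }
    return out;
  }

  public static void main(String[] args) {
    // All shingles forced to the same position, as the patch proposes.
    List<Token> all = List.of(
        new Token("abcd", 1),
        new Token("abcd efgh", 0),
        new Token("abcd efgh ijkl", 0),
        new Token("efgh", 0),
        new Token("efgh ijkl", 0),
        new Token("ijkl", 0));
    System.out.println(positions(all));
  }
}
```

In today's ShingleFilter output only the shingles starting at a term get increment 0 relative to that term; the patch makes every token after the first carry increment 0, so everything lands at one position.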
[jira] Commented: (LUCENE-1026) Provide a simple way to concurrently access a Lucene index from multiple threads
[ https://issues.apache.org/jira/browse/LUCENE-1026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12655525#action_12655525 ] Otis Gospodnetic commented on LUCENE-1026: -- My impression was that this didn't stick, so I'd drop it. Provide a simple way to concurrently access a Lucene index from multiple threads Key: LUCENE-1026 URL: https://issues.apache.org/jira/browse/LUCENE-1026 Project: Lucene - Java Issue Type: New Feature Components: Index, Search Reporter: Mark Miller Priority: Minor Attachments: DefaultIndexAccessor.java, DefaultMultiIndexAccessor.java, IndexAccessor-02.04.2008.zip, IndexAccessor-02.07.2008.zip, IndexAccessor-02.28.2008.zip, IndexAccessor-05.27.2008.zip, IndexAccessor-1.26.2008.zip, IndexAccessor-2.15.2008.zip, IndexAccessor.04032008.zip, IndexAccessor.java, IndexAccessor.zip, IndexAccessorFactory.java, MultiIndexAccessor.java, shai-IndexAccessor-2.zip, shai-IndexAccessor.zip, shai-IndexAccessor3.zip, SimpleSearchServer.java, StopWatch.java, TestIndexAccessor.java For building interactive indexes accessed through a network/internet (multiple threads). This builds upon the LuceneIndexAccessor patch. That patch was not very newbie friendly and did not properly handle MultiSearchers (or at the least made it easy to get into trouble). This patch simplifies things and provides out of the box support for sharing the IndexAccessors across threads. There is also a simple test class and example SearchServer to get you started. Future revisions will be zipped. Works pretty solid as is, but could use the ability to warm new Searchers. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries
[ https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12653900#action_12653900 ] Otis Gospodnetic commented on LUCENE-855: - Hi Matt! :) Tim, want to benchmark the two? (since you already benchmarked 1461, you should be able to plug in Matt's thing and see how it compares)

MemoryCachedRangeFilter to boost performance of Range queries - Key: LUCENE-855 URL: https://issues.apache.org/jira/browse/LUCENE-855 Project: Lucene - Java Issue Type: Improvement Components: Search Affects Versions: 2.1 Reporter: Andy Liu Attachments: contrib-filters.tar.gz, FieldCacheRangeFilter.patch, FieldCacheRangeFilter.patch, FieldCacheRangeFilter.patch, FieldCacheRangeFilter.patch, FieldCacheRangeFilter.patch, FieldCacheRangeFilter.patch, FieldCacheRangeFilter_Lucene_2.3.0.patch, MemoryCachedRangeFilter.patch, MemoryCachedRangeFilter_1.4.patch, TestRangeFilterPerformanceComparison.java, TestRangeFilterPerformanceComparison.java

Currently RangeFilter uses TermEnum and TermDocs to find documents that fall within the specified range. This requires iterating through every single term in the index and can get rather slow for large document sets. MemoryCachedRangeFilter reads all docId, value pairs of a given field, sorts by value, and stores them in a SortedFieldCache. During bits(), binary searches are used to find the start and end indices of the lower and upper bound values. The BitSet is populated with all the docId values that fall between the start and end indices. TestMemoryCachedRangeFilterPerformance creates a 100K RAMDirectory-backed index with random date values within a 5-year range. Executing bits() 1000 times on a standard RangeQuery using random date intervals took 63904ms. Using MemoryCachedRangeFilter, it took 876ms. The performance increase is less dramatic when a field has fewer unique terms or the index has fewer documents.

Currently MemoryCachedRangeFilter only works with numeric values (values are stored in a long[] array), but it can easily be changed to support Strings. A side benefit of storing the values as longs is that there's no longer a need to make the values lexicographically comparable, i.e. padding numeric values with zeros. The downside of using MemoryCachedRangeFilter is that it has a fairly significant memory requirement, so it's designed for situations where range filter performance is critical and memory consumption is not an issue. The memory requirements are: (sizeof(int) + sizeof(long)) * numDocs. MemoryCachedRangeFilter also requires a warmup step which can take a while on large datasets (it took 40s on a 3M document corpus). Warmup can be called explicitly, or it is automatically called the first time MemoryCachedRangeFilter is applied to a given field. In summary, MemoryCachedRangeFilter can be useful when:
- Performance is critical
- Memory is not an issue
- The field contains many unique numeric values
- The index contains a large number of documents
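The bits() strategy described above, sort (docId, value) pairs by value once at warmup, then binary-search the lower bound and set a bit for every docId up to the upper bound, can be sketched independently of Lucene. A simplified illustration (class name and representation invented, not the attached patch):

```java
import java.util.Arrays;
import java.util.BitSet;

public class SortedValueRangeFilter {
  private final long[] values; // field values, sorted ascending
  private final int[] docIds;  // docIds[i] is the doc holding values[i]

  // pairs[i] = {docId, value}; sorting happens once, at warmup time.
  public SortedValueRangeFilter(long[][] pairs) {
    long[][] sorted = pairs.clone();
    Arrays.sort(sorted, (a, b) -> Long.compare(a[1], b[1]));
    values = new long[sorted.length];
    docIds = new int[sorted.length];
    for (int i = 0; i < sorted.length; i++) {
      docIds[i] = (int) sorted[i][0];
      values[i] = sorted[i][1];
    }
  }

  // Set a bit for every doc whose value lies in [lower, upper].
  public BitSet bits(long lower, long upper) {
    BitSet result = new BitSet();
    for (int i = lowerBound(lower); i < values.length && values[i] <= upper; i++) {
      result.set(docIds[i]);
    }
    return result;
  }

  // Binary search: first index whose value is >= key.
  private int lowerBound(long key) {
    int lo = 0, hi = values.length;
    while (lo < hi) {
      int mid = (lo + hi) >>> 1;
      if (values[mid] < key) lo = mid + 1; else hi = mid;
    }
    return lo;
  }
}
```

The (sizeof(int) + sizeof(long)) * numDocs memory figure quoted above corresponds to exactly these two parallel arrays.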
[jira] Commented: (LUCENE-1461) Cached filter for a single term field
[ https://issues.apache.org/jira/browse/LUCENE-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12653361#action_12653361 ] Otis Gospodnetic commented on LUCENE-1461: -- Is this related to LUCENE-855? The same? Aha, I see Paul asked the reverse question in LUCENE-855 already... Tim?

Cached filter for a single term field - Key: LUCENE-1461 URL: https://issues.apache.org/jira/browse/LUCENE-1461 Project: Lucene - Java Issue Type: New Feature Reporter: Tim Sturge Assignee: Michael McCandless Fix For: 2.9 Attachments: DisjointMultiFilter.java, FieldCacheRangeFilter.patch, LUCENE-1461.patch, LUCENE-1461a.patch, LUCENE-1461b.patch, LUCENE-1461c.patch, RangeMultiFilter.java, RangeMultiFilter.java, TermMultiFilter.java, TestFieldCacheRangeFilter.patch

These classes implement inexpensive range filtering over a field containing a single term. They do this by building an integer array of term numbers (storing the term-number mapping in a TreeMap) and then implementing a fast integer-comparison-based DocIdSetIterator. This code is currently being used to do age range filtering, but could also be used to do other date filtering or in any application where there need to be multiple filters based on the same single-term field. I have an untested implementation of single term filtering and have considered but not yet implemented term set filtering (useful for location-based searches) as well. The code here is fairly rough; it works but lacks javadocs and toString() and hashCode() methods etc. I'm posting it here to discover if there is other interest in this feature; I don't mind fixing it up but would hate to go to the effort if it's not going to make it into Lucene.
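The term-ordinal idea described above, map each document's single term to an int via a sorted map, then answer any range filter with one int comparison per document, might look like the following. This is a hypothetical sketch: the names are invented, and the real patch exposes a DocIdSetIterator rather than building a BitSet.

```java
import java.util.BitSet;
import java.util.Map;
import java.util.TreeMap;

public class TermOrdinalRangeFilter {
  private final int[] ords;                         // ords[doc] = ordinal of the doc's single term
  private final TreeMap<String, Integer> termToOrd; // term -> ordinal, ordinals follow term order

  // terms[doc] is the single term that document `doc` holds in the field.
  public TermOrdinalRangeFilter(String[] terms) {
    termToOrd = new TreeMap<>();
    for (String t : terms) termToOrd.put(t, 0);
    int ord = 0;
    for (String t : termToOrd.keySet()) termToOrd.put(t, ord++);
    ords = new int[terms.length];
    for (int doc = 0; doc < terms.length; doc++) ords[doc] = termToOrd.get(terms[doc]);
  }

  // Docs whose term falls in [lowerTerm, upperTerm]: one int comparison per doc,
  // so many different filters can reuse the same cached ordinal array.
  public BitSet range(String lowerTerm, String upperTerm) {
    BitSet bits = new BitSet();
    Map.Entry<String, Integer> lo = termToOrd.ceilingEntry(lowerTerm);
    Map.Entry<String, Integer> hi = termToOrd.floorEntry(upperTerm);
    if (lo == null || hi == null || lo.getValue() > hi.getValue()) return bits;
    int loOrd = lo.getValue(), hiOrd = hi.getValue();
    for (int doc = 0; doc < ords.length; doc++) {
      if (ords[doc] >= loOrd && ords[doc] <= hiOrd) bits.set(doc);
    }
    return bits;
  }
}
```

Because ordinals follow term order, a term range becomes an ordinal range; the TreeMap is only consulted twice per filter, never per document.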
[jira] Assigned: (LUCENE-689) NullPointerException thrown by equals method in SpanOrQuery
[ https://issues.apache.org/jira/browse/LUCENE-689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic reassigned LUCENE-689: --- Assignee: Otis Gospodnetic (was: Steven Parkes)

NullPointerException thrown by equals method in SpanOrQuery --- Key: LUCENE-689 URL: https://issues.apache.org/jira/browse/LUCENE-689 Project: Lucene - Java Issue Type: Bug Components: Search Affects Versions: 2.1 Environment: Java 1.5.0_09, RHEL 3 Linux, Tomcat 5.0.28 Reporter: Michael Goddard Assignee: Otis Gospodnetic Priority: Minor Attachments: LUCENE-689.txt

Part of our code utilizes the equals method in SpanOrQuery and, in certain cases (details to follow, if necessary), a NullPointerException gets thrown as a result of the String field being null. After applying the following patch, the problem disappeared:

Index: src/java/org/apache/lucene/search/spans/SpanOrQuery.java
===
--- src/java/org/apache/lucene/search/spans/SpanOrQuery.java (revision 465065)
+++ src/java/org/apache/lucene/search/spans/SpanOrQuery.java (working copy)
@@ -121,7 +121,8 @@
     final SpanOrQuery that = (SpanOrQuery) o;
     if (!clauses.equals(that.clauses)) return false;
-    if (!field.equals(that.field)) return false;
+    if (field != null && !field.equals(that.field)) return false;
+    if (field == null && that.field != null) return false;
     return getBoost() == that.getBoost();
   }
[jira] Reopened: (LUCENE-1378) Remove remaining @author references
[ https://issues.apache.org/jira/browse/LUCENE-1378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic reopened LUCENE-1378: -- Lucene Fields: [New, Patch Available] (was: [Patch Available, New]) Remove remaining @author references --- Key: LUCENE-1378 URL: https://issues.apache.org/jira/browse/LUCENE-1378 Project: Lucene - Java Issue Type: Task Reporter: Otis Gospodnetic Assignee: Otis Gospodnetic Priority: Trivial Fix For: 2.4 Attachments: LUCENE-1378.patch, LUCENE-1378.patch $ find . -name \*.java | xargs grep '@author' | cut -d':' -f1 | xargs perl -pi -e 's/ [EMAIL PROTECTED]//' -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-1439) Inconsistent API
[ https://issues.apache.org/jira/browse/LUCENE-1439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12647672#action_12647672 ] Otis Gospodnetic commented on LUCENE-1439: -- Wiki may be more suitable for that. Note that it may be better to work on getting more of the pending patches reviewed and tested, so they can be committed faster. That way we can then proceed to making API changes that won't break existing/pending patches.

Inconsistent API - Key: LUCENE-1439 URL: https://issues.apache.org/jira/browse/LUCENE-1439 Project: Lucene - Java Issue Type: Bug Affects Versions: 3.0 Environment: any Reporter: Ivan.S Priority: Minor

The API of Lucene is totally inconsistent:
1) There are a lot of containers which don't implement an interface indicating this fact (for pre-java-1.5 Lucene it could be Collection, for post-java-1.5 Lucene it could be the more general Iterable). Example: IndexSearcher: int maxDoc() and doc(int i)
2) There are a lot of classes having non-final publicly accessible fields.
3) Some methods which return values are named something(), others are named getSomething(). The best one is Fieldable: without get: String stringValue(), Reader readerValue(), byte[] binaryValue(), ... with get: byte[] getBinaryValue(), int getBinaryLength(), ...
[jira] Issue Comment Edited: (LUCENE-524) Current implementation of fuzzy and wildcard queries inappropriately implemented as Boolean query rewrites
[ https://issues.apache.org/jira/browse/LUCENE-524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12647181#action_12647181 ] otis edited comment on LUCENE-524 at 11/13/08 7:06 AM: --- Based on the description, yes. Doesn't this also sound a lot like that old Mark H's LUCENE-329 issue? was (Author: otis): Based on the description, yes. Doesn't this also sound a lot like that old Mark H's issue that you commented on earlier? Current implementation of fuzzy and wildcard queries inappropriately implemented as Boolean query rewrites -- Key: LUCENE-524 URL: https://issues.apache.org/jira/browse/LUCENE-524 Project: Lucene - Java Issue Type: Improvement Components: Search Affects Versions: 1.9 Reporter: Randy Puttick Priority: Minor Attachments: MultiTermQuery.java, MultiTermScorer.java The implementation of MultiTermQuery in terms of BooleanQuery introduces several problems: 1) Collisions with maximum clause limit on boolean queries which throws an exception. This is most problematic because it is difficult to ascertain in advance how many terms a fuzzy query or wildcard query might involve. 2) The boolean disjunctive scoring is not appropriate for either fuzzy or wildcard queries. In effect the score is divided by the number of terms in the query which has nothing to do with the relevancy of the results. 3) Performance of disjunctive boolean queries for large term sets is quite sub-optimal -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Updated: (LUCENE-1417) Allowing for distance measures that incorporate frequency/popularity for SuggestWord comparison
[ https://issues.apache.org/jira/browse/LUCENE-1417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic updated LUCENE-1417: - Priority: Minor (was: Major) I agree with Grant. I like the idea of having a pluggable distance metric, for example. Allowing for distance measures that incorporate frequency/popularity for SuggestWord comparison --- Key: LUCENE-1417 URL: https://issues.apache.org/jira/browse/LUCENE-1417 Project: Lucene - Java Issue Type: Improvement Components: contrib/spellchecker Affects Versions: 2.4 Reporter: Jason Rennie Priority: Minor Original Estimate: 4h Remaining Estimate: 4h Spelling suggestions are currently ordered first by a string edit distance measure, then by popularity/frequency. This limits the ability of popularity/frequency to affect suggestions. I think it would be better for the distance measure to accept popularity/frequency as an argument and provide a distance/score that incorporates any popularity/frequency considerations. I.e. change StringDistance.getDistance to accept an additional argument: frequency of the potential suggestion. The new SuggestWord.compareTo function would only order by score. We could achieve the existing behavior by adding a small inverse frequency value to the distances. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Updated: (LUCENE-1296) Allow use of compact DocIdSet in CachingWrapperFilter
[ https://issues.apache.org/jira/browse/LUCENE-1296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic updated LUCENE-1296: - Lucene Fields: [New, Patch Available] (was: [Patch Available, New]) Fix Version/s: 2.9 Allow use of compact DocIdSet in CachingWrapperFilter - Key: LUCENE-1296 URL: https://issues.apache.org/jira/browse/LUCENE-1296 Project: Lucene - Java Issue Type: New Feature Components: Search Reporter: Paul Elschot Assignee: Michael McCandless Priority: Minor Fix For: 2.9 Attachments: cachedFilter20080529.patch, cachedFilter20080605.patch Extends CachingWrapperFilter with a protected method to determine the DocIdSet to be cached. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Resolved: (LUCENE-518) document field lengths count analyzer synonym overlays
[ https://issues.apache.org/jira/browse/LUCENE-518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic resolved LUCENE-518. - Resolution: Fixed I think LUCENE-1420 fixed this. document field lengths count analyzer synonym overlays -- Key: LUCENE-518 URL: https://issues.apache.org/jira/browse/LUCENE-518 Project: Lucene - Java Issue Type: Bug Components: Index Affects Versions: 1.9 Environment: N/A Reporter: Randy Puttick Priority: Minor Using a synonym expansion analyzer to add tokens with zero offset from the substituted token should not extend the length of the field in the document (for scoring purposes) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
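The fix described above hinges on counting positions rather than tokens. The following standalone sketch (not Lucene's actual scoring code; names are illustrative) shows why a synonym emitted with a position increment of zero should not change the field length used for length normalization:

```java
// Hypothetical sketch, not Lucene's implementation: field length for
// scoring counted as the sum of position increments, so zero-increment
// synonym tokens stacked on the previous token add no length.
public class FieldLengthSketch {
    // Each token carries the position increment its analyzer assigned:
    // 1 for a normal token, 0 for a synonym overlaid on the previous one.
    public static int fieldLength(int[] positionIncrements) {
        int length = 0;
        for (int inc : positionIncrements) {
            length += inc; // synonyms (inc == 0) contribute nothing
        }
        return length;
    }

    public static void main(String[] args) {
        // "big apartment" with synonym "flat" stacked on "apartment"
        int[] increments = {1, 1, 0};
        System.out.println(fieldLength(increments)); // prints 2, not 3
    }
}
```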
[jira] Updated: (LUCENE-1413) Creating PlainTextDictionary with UTF8 files
[ https://issues.apache.org/jira/browse/LUCENE-1413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic updated LUCENE-1413: - Lucene Fields: [New, Patch Available] (was: [Patch Available, New]) Fix Version/s: (was: 2.3.3) 2.9 Issue Type: Improvement (was: New Feature) Creating PlainTextDictionary with UTF8 files Key: LUCENE-1413 URL: https://issues.apache.org/jira/browse/LUCENE-1413 Project: Lucene - Java Issue Type: Improvement Components: contrib/spellchecker Affects Versions: 2.3.2 Environment: All platform / operating systems Reporter: YourSoft Fix For: 2.9 Generating indexes from text files is good, but the current code can't read UTF-8 files. This can easily be fixed by adding the following constructor to PlainTextDictionary.java: public PlainTextDictionary(InputStream dictFile, String fileEncoding) throws UnsupportedEncodingException { in = new BufferedReader(new InputStreamReader(dictFile, fileEncoding)); } -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
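The proposed constructor boils down to wrapping the stream in an InputStreamReader with an explicit encoding rather than the platform default. A minimal self-contained sketch of the same idea (class and method names here are illustrative, not from the patch):

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the proposed change: read dictionary words
// through an InputStreamReader with an explicit character encoding.
public class Utf8DictionarySketch {
    public static List<String> readWords(InputStream dictFile, String fileEncoding) throws IOException {
        List<String> words = new ArrayList<>();
        // The explicit encoding is the whole point: without it, the
        // platform default charset would mangle non-ASCII entries.
        BufferedReader in = new BufferedReader(new InputStreamReader(dictFile, fileEncoding));
        String line;
        while ((line = in.readLine()) != null) {
            words.add(line);
        }
        return words;
    }

    public static void main(String[] args) throws IOException {
        byte[] utf8 = "caf\u00e9\nna\u00efve\n".getBytes(StandardCharsets.UTF_8);
        List<String> words = readWords(new ByteArrayInputStream(utf8), "UTF-8");
        System.out.println(words); // [café, naïve]
    }
}
```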
[jira] Commented: (LUCENE-1306) CombinedNGramTokenFilter
[ https://issues.apache.org/jira/browse/LUCENE-1306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12646971#action_12646971 ] Otis Gospodnetic commented on LUCENE-1306: -- Could/should this not be folded into the existing Ngram code in contrib? CombinedNGramTokenFilter Key: LUCENE-1306 URL: https://issues.apache.org/jira/browse/LUCENE-1306 Project: Lucene - Java Issue Type: New Feature Components: contrib/analyzers Reporter: Karl Wettin Assignee: Karl Wettin Priority: Trivial Attachments: LUCENE-1306.txt, LUCENE-1306.txt Alternative NGram filter that produce tokens with composite prefix and suffix markers. {code:java} ts = new WhitespaceTokenizer(new StringReader("hello")); ts = new CombinedNGramTokenFilter(ts, 2, 2); assertNext(ts, "^h"); assertNext(ts, "he"); assertNext(ts, "el"); assertNext(ts, "ll"); assertNext(ts, "lo"); assertNext(ts, "o$"); assertNull(ts.next()); {code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
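The expected tokens in the test above can be reproduced with a plain standalone sketch of the idea (not the attached filter's implementation): mark the token with '^' and '$' before slicing it into n-grams, so edge grams carry their position:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the combined n-gram idea: prefix the token
// with '^' and suffix it with '$' before taking bigrams, so grams that
// touch the word boundary are distinguishable from interior grams.
public class CombinedNGramSketch {
    public static List<String> bigrams(String token) {
        String marked = "^" + token + "$";
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + 2 <= marked.length(); i++) {
            grams.add(marked.substring(i, i + 2));
        }
        return grams;
    }

    public static void main(String[] args) {
        System.out.println(bigrams("hello")); // [^h, he, el, ll, lo, o$]
    }
}
```

The output matches the token sequence asserted in the issue's test snippet.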
[jira] Resolved: (LUCENE-548) Sort bug using ParallelMultiSearcher
[ https://issues.apache.org/jira/browse/LUCENE-548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic resolved LUCENE-548. - Resolution: Won't Fix I agree, doesn't seem worth fixing. Explicit STRING is recommended. Sort bug using ParallelMultiSearcher Key: LUCENE-548 URL: https://issues.apache.org/jira/browse/LUCENE-548 Project: Lucene - Java Issue Type: Bug Components: Search Affects Versions: 1.9 Environment: Linux FC2 Java 1.4.9 Reporter: dan Priority: Minor Output: java.lang.ClassCastException: java.lang.String at org.apache.lucene.search.FieldDocSortedHitQueue.lessThan(FieldDocSortedHitQueue.java:119) at org.apache.lucene.util.PriorityQueue.insert(PriorityQueue.java:61) at org.apache.lucene.search.MultiSearcherThread.run(ParallelMultiSearcher.java:271) Input: - This only occurs when searching more than one index using ParallelMultiSearcher - I use the signature new Sort("date", true) - The values in "date" are strings in the form "20060419" - The call to getType in FieldDocSortedHitQueue misinterprets the value as an INT, then the exception is thrown Available workaround - I use the signature new Sort(new SortField("date", SortField.STRING, true)) and the problem goes away. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
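The recommended workaround (an explicit STRING sort) is safe because dates formatted as yyyyMMdd sort chronologically under plain lexicographic comparison, so no type auto-detection is needed. A small illustration (not Lucene code):

```java
import java.util.Arrays;

// Illustration of why an explicit STRING sort works for yyyyMMdd
// values: lexicographic order equals chronological order for this
// fixed-width format, so the auto-detected INT type is unnecessary.
public class DateStringSortSketch {
    public static String[] sorted(String[] dates) {
        String[] copy = dates.clone();
        Arrays.sort(copy); // plain lexicographic String ordering
        return copy;
    }

    public static void main(String[] args) {
        String[] dates = {"20060419", "20051231", "20060101"};
        System.out.println(Arrays.toString(sorted(dates))); // [20051231, 20060101, 20060419]
    }
}
```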
[jira] Resolved: (LUCENE-711) BooleanWeight should size the weights Vector correctly
[ https://issues.apache.org/jira/browse/LUCENE-711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic resolved LUCENE-711. - Resolution: Fixed Assignee: Otis Gospodnetic Sending src/java/org/apache/lucene/search/BooleanQuery.java Transmitting file data . Committed revision 713634. BooleanWeight should size the weights Vector correctly -- Key: LUCENE-711 URL: https://issues.apache.org/jira/browse/LUCENE-711 Project: Lucene - Java Issue Type: Improvement Components: Search Affects Versions: 1.9, 2.0.0, 2.1 Reporter: paul constantinides Assignee: Otis Gospodnetic Priority: Minor Attachments: LUCENE-711.patch, vector_sizing.patch The weights field on BooleanWeight uses a Vector that will always be sized exactly the same as the outer class' clauses Vector, therefore can be sized correctly in the constructor. This is a trivial memory saving enhancement. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
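The enhancement is simply passing the known clause count as the Vector's initial capacity so it never over-allocates or grows. A minimal sketch with plain java.util.Vector (the real change lives inside BooleanWeight's constructor):

```java
import java.util.Vector;

// Sketch of the sizing enhancement: when the final element count is
// known up front, constructing the Vector with that capacity avoids
// the default capacity (10) and any intermediate resizes.
public class PresizedVectorSketch {
    public static Vector<String> presized(int clauseCount) {
        // capacity == clauseCount, so subsequent add() calls never
        // trigger a backing-array reallocation
        return new Vector<String>(clauseCount);
    }

    public static void main(String[] args) {
        Vector<String> weights = presized(3);
        System.out.println(weights.capacity()); // 3
    }
}
```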
[jira] Commented: (LUCENE-524) Current implementation of fuzzy and wildcard queries inappropriately implemented as Boolean query rewrites
[ https://issues.apache.org/jira/browse/LUCENE-524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12647181#action_12647181 ] Otis Gospodnetic commented on LUCENE-524: - Based on the description, yes. Doesn't this also sound a lot like that old Mark H's issue that you commented on earlier? Current implementation of fuzzy and wildcard queries inappropriately implemented as Boolean query rewrites -- Key: LUCENE-524 URL: https://issues.apache.org/jira/browse/LUCENE-524 Project: Lucene - Java Issue Type: Improvement Components: Search Affects Versions: 1.9 Reporter: Randy Puttick Priority: Minor Attachments: MultiTermQuery.java, MultiTermScorer.java The implementation of MultiTermQuery in terms of BooleanQuery introduces several problems: 1) Collisions with maximum clause limit on boolean queries which throws an exception. This is most problematic because it is difficult to ascertain in advance how many terms a fuzzy query or wildcard query might involve. 2) The boolean disjunctive scoring is not appropriate for either fuzzy or wildcard queries. In effect the score is divided by the number of terms in the query which has nothing to do with the relevancy of the results. 3) Performance of disjunctive boolean queries for large term sets is quite sub-optimal -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-38) RangeQuery without lower term and inclusive=false skips blank fields
[ https://issues.apache.org/jira/browse/LUCENE-38?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12647185#action_12647185 ] Otis Gospodnetic commented on LUCENE-38: This thing is 6+ years old and I don't recall this being mentioned on the list in the last half a decade. I'll leave you the Won't Fix pleasure, Mark. RangeQuery without lower term and inclusive=false skips blank fields Key: LUCENE-38 URL: https://issues.apache.org/jira/browse/LUCENE-38 Project: Lucene - Java Issue Type: Bug Components: Search Affects Versions: unspecified Environment: Operating System: other Platform: Other Reporter: Otis Gospodnetic Assignee: Lucene Developers Priority: Minor Attachments: TestRangeQuery.patch This was reported by James Ricci [EMAIL PROTECTED] at: http://nagoya.apache.org/eyebrowse/[EMAIL PROTECTED]msgNo=1835 When you create a ranged query and omit the lower term, my expectation would be that I would find everything less than the upper term. Now if I pass false for the inclusive term, then I would expect that I would find all terms less than the upper term excluding the upper term itself. What is happening in the case of lower_term=null, upper_term=x, inclusive=false is that empty strings are being excluded because inclusive is set false, and the implementation of RangedQuery creates a default lower term of Term(fieldName, ""). Since it's not inclusive, it excludes "". This isn't what I intended, and I don't think it's what most people would imagine RangedQuery would do in the case I've mentioned. I equate lower=null, upper=x, inclusive=false to Field < x. lower=null, upper=x, inclusive=true would be Field <= x. In both cases, the only difference should be whether or not Field = x is true for the query. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
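The reported behavior is easy to reproduce with plain string comparison (this is a standalone illustration, not Lucene's RangeQuery): when the missing lower term defaults to the empty string and the range is exclusive, the empty string itself no longer matches, so documents with blank fields silently drop out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Standalone illustration of the bug's mechanics: an exclusive range
// whose lower bound defaults to "" excludes "" itself, i.e. blank fields.
public class RangeBoundSketch {
    public static List<String> inRange(List<String> terms, String lower, String upper, boolean inclusive) {
        List<String> hits = new ArrayList<>();
        for (String t : terms) {
            int lo = t.compareTo(lower);
            int hi = t.compareTo(upper);
            boolean lowerOk = inclusive ? lo >= 0 : lo > 0;
            boolean upperOk = inclusive ? hi <= 0 : hi < 0;
            if (lowerOk && upperOk) hits.add(t);
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("", "apple", "x", "zebra");
        // lower=null mapped to "", inclusive=false: the blank field "" is skipped
        System.out.println(inRange(terms, "", "x", false)); // [apple]
        System.out.println(inRange(terms, "", "x", true));  // [, apple, x]
    }
}
```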
[jira] Resolved: (LUCENE-1180) Syns2Index fails
[ https://issues.apache.org/jira/browse/LUCENE-1180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic resolved LUCENE-1180. -- Resolution: Fixed It looks like Mike fixed this 2 months ago: r694222 | mikemccand | 2008-09-11 08:11:03 -0400 (Thu, 11 Sep 2008) | 1 line fix wordnet's Syns2Index to not fiddle with mergeFactor maxBuffereDocs (the latter was hitting an exception) Syns2Index fails Key: LUCENE-1180 URL: https://issues.apache.org/jira/browse/LUCENE-1180 Project: Lucene - Java Issue Type: Bug Components: contrib/* Affects Versions: 2.3 Reporter: Jeffrey Yang Assignee: Otis Gospodnetic Priority: Minor Attachments: syns2index_fix_2.3.patch, syns2index_fix_2.4-dev.patch Original Estimate: 1h Remaining Estimate: 1h Running Syns2Index fails with a java.lang.IllegalArgumentException: maxBufferedDocs must at least be 2 when enabled exception. at org.apache.lucene.index.IndexWriter.setMaxBufferedDocs(IndexWriter.java:883) at org.apache.lucene.wordnet.Syns2Index.index(Syns2Index.java:249) at org.apache.lucene.wordnet.Syns2Index.main(Syns2Index.java:208) The code is here // blindly up these parameters for speed writer.setMergeFactor( writer.getMergeFactor() * 2); writer.setMaxBufferedDocs( writer.getMaxBufferedDocs() * 2); It looks like getMaxBufferedDocs used to return 10, and now it returns -1, not sure when that started happening. My suggestion would be to just remove these three lines. Since speed has already improved vastly, there isn't a need to speed things up. To run this, Syns2Index requires two args. The first is the location of the wn_s.pl file, and the second is the directory to create the index in. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
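The failure mode is a sentinel value: getMaxBufferedDocs() began returning -1 (disabled in favor of RAM-based flushing), and blindly doubling that produced a value setMaxBufferedDocs rejects. The committed fix simply removed the tuning lines; a defensive alternative would guard on the sentinel, as in this hypothetical sketch:

```java
// Hypothetical sketch (not the committed fix, which removed the tuning
// lines entirely): only scale a tuning parameter when it is actually
// enabled, leaving the "disabled" sentinel (-1) untouched.
public class SentinelGuardSketch {
    public static int doubledIfEnabled(int maxBufferedDocs) {
        // -1 means the setting is disabled; doubling it would yield -2,
        // which a setter like setMaxBufferedDocs rejects with
        // IllegalArgumentException
        return maxBufferedDocs > 0 ? maxBufferedDocs * 2 : maxBufferedDocs;
    }

    public static void main(String[] args) {
        System.out.println(doubledIfEnabled(10)); // 20
        System.out.println(doubledIfEnabled(-1)); // -1 (left alone)
    }
}
```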
[jira] Resolved: (LUCENE-896) Let users set Similarity for MoreLikeThis
[ https://issues.apache.org/jira/browse/LUCENE-896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic resolved LUCENE-896. - Resolution: Fixed Lucene Fields: [New, Patch Available] (was: [Patch Available, New]) Actually, my copy of MLT already takes Similarity in ctor and has set/getSimilarity, so no patch is needed. You want/need that isNoise method protected? Let users set Similarity for MoreLikeThis - Key: LUCENE-896 URL: https://issues.apache.org/jira/browse/LUCENE-896 Project: Lucene - Java Issue Type: Improvement Components: Other Reporter: Ryan McKinley Assignee: Otis Gospodnetic Priority: Minor Attachments: LUCENE-896-MoreLikeThisSimilarity.patch Let users set Similarity used for MoreLikeThis For discussion, see: http://www.nabble.com/MoreLikeThis-API-changes--tf3838535.html -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Updated: (LUCENE-1272) Support for boost factor in MoreLikeThis
[ https://issues.apache.org/jira/browse/LUCENE-1272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic updated LUCENE-1272: - Priority: Minor (was: Major) Lucene Fields: [New, Patch Available] (was: [Patch Available, New]) Fix Version/s: 2.9 Assignee: Otis Gospodnetic I don't see any harm in this, I'll make the change later this week. Support for boost factor in MoreLikeThis Key: LUCENE-1272 URL: https://issues.apache.org/jira/browse/LUCENE-1272 Project: Lucene - Java Issue Type: New Feature Components: contrib/* Reporter: Jonathan Leibiusky Assignee: Otis Gospodnetic Priority: Minor Fix For: 2.9 Attachments: morelikethis_boostfactor.patch This is a patch I made to be able to boost the terms with a specific factor beside the relevancy returned by MoreLikeThis. This is helpful when having more then 1 MoreLikeThis in the query, so words in the field A (i.e. Title) can be boosted more than words in the field B (i.e. Description). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
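The mechanics of the patch amount to multiplying each interesting term's relevance score by a per-field boost factor. A standalone sketch of that idea (names here are made up for illustration, not the patch's actual API):

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch: scale the per-term relevance scores by a boost
// factor, so terms drawn from one field (e.g. Title) can outweigh
// terms drawn from another (e.g. Description) in the combined query.
public class BoostFactorSketch {
    public static Map<String, Float> applyBoost(Map<String, Float> termScores, float boostFactor) {
        Map<String, Float> boosted = new HashMap<>();
        for (Map.Entry<String, Float> e : termScores.entrySet()) {
            boosted.put(e.getKey(), e.getValue() * boostFactor);
        }
        return boosted;
    }

    public static void main(String[] args) {
        Map<String, Float> titleTerms = new HashMap<>();
        titleTerms.put("lucene", 0.5f);
        // boost Title terms 3x relative to unboosted Description terms
        System.out.println(applyBoost(titleTerms, 3.0f).get("lucene")); // 1.5
    }
}
```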
[jira] Updated: (LUCENE-1424) Change all multi-term querys so that they extend MultiTermQuery and allow for a constant score mode
[ https://issues.apache.org/jira/browse/LUCENE-1424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic updated LUCENE-1424: - Summary: Change all multi-term querys so that they extend MultiTermQuery and allow for a constant score mode (was: Change all mutli term querys so that they extend MultiTermQuery and allow for a constant score mode) Change all multi-term querys so that they extend MultiTermQuery and allow for a constant score mode --- Key: LUCENE-1424 URL: https://issues.apache.org/jira/browse/LUCENE-1424 Project: Lucene - Java Issue Type: New Feature Reporter: Mark Miller Assignee: Michael McCandless Priority: Minor Attachments: LUCENE-1424.patch, LUCENE-1424.patch, LUCENE-1424.patch, LUCENE-1424.patch, LUCENE-1424.patch, LUCENE-1424.patch, LUCENE-1424.patch, LUCENE-1424.patch, LUCENE-1424.patch, LUCENE-1424.patch, LUCENE-1424.patch, LUCENE-1424.patch Cleans up a bunch of code duplication, closer to how things should be - design wise, gives us constant score for all the multi term queries, and allows us at least the option of highlighting the constant score queries without much further work. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-1410) PFOR implementation
[ https://issues.apache.org/jira/browse/LUCENE-1410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12636686#action_12636686 ] Otis Gospodnetic commented on LUCENE-1410: -- For people not intimately familiar with PFOR (like me), I found the following to be helpful: http://cis.poly.edu/cs912/indexcomp.pdf PFOR implementation --- Key: LUCENE-1410 URL: https://issues.apache.org/jira/browse/LUCENE-1410 Project: Lucene - Java Issue Type: New Feature Components: Other Reporter: Paul Elschot Priority: Minor Attachments: LUCENE-1410b.patch, TestPFor2.java Original Estimate: 21840h Remaining Estimate: 21840h Implementation of Patched Frame of Reference. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
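For readers new to the technique, the core of frame-of-reference coding is storing one base value and small offsets from it. The sketch below is deliberately simplified: it is plain FOR with no bit packing and no "patched" exception mechanism (PFOR's distinguishing feature, which patches outliers that don't fit the chosen bit width), so it only illustrates the base-plus-offsets idea, not the attached implementation.

```java
import java.util.Arrays;

// Greatly simplified frame-of-reference sketch: store the minimum once
// and keep only offsets from it. Real PFOR additionally bit-packs the
// offsets at a fixed width and stores exceptions for outliers.
public class ForSketch {
    public static int[] encode(int[] values) {
        int min = Arrays.stream(values).min().orElse(0);
        int[] out = new int[values.length + 1];
        out[0] = min; // the frame of reference
        for (int i = 0; i < values.length; i++) {
            out[i + 1] = values[i] - min; // small non-negative offsets
        }
        return out;
    }

    public static int[] decode(int[] encoded) {
        int min = encoded[0];
        int[] values = new int[encoded.length - 1];
        for (int i = 0; i < values.length; i++) {
            values[i] = encoded[i + 1] + min;
        }
        return values;
    }

    public static void main(String[] args) {
        int[] docIds = {1000, 1003, 1001, 1007};
        // offsets 0, 3, 1, 7 would fit in 3 bits each after packing
        System.out.println(Arrays.toString(decode(encode(docIds))));
    }
}
```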
[jira] Commented: (LUCENE-1409) read past EOF
[ https://issues.apache.org/jira/browse/LUCENE-1409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12636068#action_12636068 ] Otis Gospodnetic commented on LUCENE-1409: -- Since Lucene 2.4 is about to be released, if I were you I would get Lucene from trunk, build the jar, and replace your 2.3.2 version. If that eliminates this error, could you please close this issue? read past EOF - Key: LUCENE-1409 URL: https://issues.apache.org/jira/browse/LUCENE-1409 Project: Lucene - Java Issue Type: Bug Affects Versions: 2.3.2 Environment: jdk 1.5.0_08 Reporter: Adam Łączyński I create an index with a lot of documents (~500 000). While adding documents, a "read past EOF" error occurred. It occurs after a random number of indexed documents. I used Lucene with the Compass framework, but I think that is not important. Here is a link to the Compass forum where the problem was reported: http://forum.compass-project.org/thread.jspa?threadID=215641tstart=0 java.io.IOException: read past EOF at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:146) at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:38) at org.apache.lucene.store.IndexInput.readVInt(IndexInput.java:76) at org.apache.lucene.index.FieldInfos.read(FieldInfos.java:304) at org.apache.lucene.index.FieldInfos.init(FieldInfos.java:59) at org.apache.lucene.index.SegmentReader.initialize(SegmentReader.java:298) at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:262) at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:197) at org.apache.lucene.index.MultiSegmentReader.init(MultiSegmentReader.java:109) at org.apache.lucene.index.MultiSegmentReader.doReopen(MultiSegmentReader.java:203) at org.apache.lucene.index.DirectoryIndexReader$2.doBody(DirectoryIndexReader.java:98) at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:636) at 
org.apache.lucene.index.DirectoryIndexReader.reopen(DirectoryIndexReader.java:92) at org.compass.core.lucene.engine.manager.DefaultLuceneSearchEngineIndexManager.internalRefreshCache(DefaultLuceneSearchEngineIndexManager.java:368) at org.compass.core.lucene.engine.manager.DefaultLuceneSearchEngineIndexManager.refreshCache(DefaultLuceneSearchEngineIndexManager.java:358) at org.compass.core.lucene.engine.transaction.readcommitted.ReadCommittedTransaction$CommitCallable.call(ReadCommittedTransaction.java:422) at org.compass.core.transaction.context.TransactionalCallable$1.doInTransaction(TransactionalCallable.java:44) at org.compass.core.impl.DefaultCompass$CompassTransactionContext.execute(DefaultCompass.java:342) at org.compass.core.transaction.context.TransactionalCallable.call(TransactionalCallable.java:41) at org.compass.core.executor.DefaultExecutorManager.invokeAllWithLimit(DefaultExecutorManager.java:104) at org.compass.core.executor.DefaultExecutorManager.invokeAllWithLimitBailOnException(DefaultExecutorManager.java:73) at org.compass.core.lucene.engine.transaction.readcommitted.ReadCommittedTransaction.doCommit(ReadCommittedTransaction.java:142) at org.compass.core.lucene.engine.transaction.AbstractTransaction.commit(AbstractTransaction.java:98) at org.compass.core.lucene.engine.LuceneSearchEngine.commit(LuceneSearchEngine.java:172) at org.compass.core.transaction.LocalTransaction.doCommit(LocalTransaction.java:97) at org.compass.core.transaction.AbstractTransaction.commit(AbstractTransaction.java:46) at org.compass.core.CompassTemplate.execute(CompassTemplate.java:131) at org.compass.core.CompassTemplate.execute(CompassTemplate.java:112) at asl.simplesearch.compass.CompassService.createCall(Unknown Source) at asl.util.IndexCreator.createIndex(Unknown Source) at asl.util.IndexCreator.start(Unknown Source) at asl.util.IndexCreatorTestCase.main(IndexCreatorTestCase.java:20) -- This message is automatically generated by JIRA. 
- You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Updated: (LUCENE-1390) add ISOLatinAccentFilter and deprecate ISOLatin1AccentFilter
[ https://issues.apache.org/jira/browse/LUCENE-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic updated LUCENE-1390: - Priority: Minor (was: Major) Lucene Fields: [New, Patch Available] (was: [Patch Available, New]) Fix Version/s: 2.9 add ISOLatinAccentFilter and deprecate ISOLatin1AccentFilter Key: LUCENE-1390 URL: https://issues.apache.org/jira/browse/LUCENE-1390 Project: Lucene - Java Issue Type: Improvement Components: Analysis Environment: any Reporter: Andi Vajda Priority: Minor Fix For: 2.9 Attachments: ISOLatinAccentFilter.java The ISOLatin1AccentFilter is removing accents from accented characters in the ISO Latin 1 character set. It does what it does and there is no bug with it. It would be nicer, though, if there was a more comprehensive version of this code that included not just ISO-Latin-1 (ISO-8859-1) but the entire Latin 1 and Latin Extended A unicode blocks. See: http://en.wikipedia.org/wiki/Latin-1_Supplement_unicode_block See: http://en.wikipedia.org/wiki/Latin_Extended-A_unicode_block That way, all languages using roman characters are covered. A new class, ISOLatinAccentFilter is attached. It is intended to supercede ISOLatin1AccentFilter which should get deprecated. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
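One broader-coverage approach to the same goal (this is a sketch of an alternative technique, not the attached ISOLatinAccentFilter's code) is Unicode decomposition: NFD splits accented characters into base letter plus combining marks, and stripping the combining-marks block then handles Latin-1 Supplement and much of Latin Extended-A in one pass. Note it does not cover ligature-style mappings such as "æ" to "ae", which a table-driven filter can.

```java
import java.text.Normalizer;

// Alternative accent-stripping sketch using java.text.Normalizer:
// decompose to NFD, then delete the combining diacritical marks the
// decomposition leaves behind.
public class AccentStripSketch {
    public static String stripAccents(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        // U+0300..U+036F: the combining marks split off the base letters
        return decomposed.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }

    public static void main(String[] args) {
        // é and è are Latin-1 Supplement; ő (U+0151) is Latin Extended-A
        System.out.println(stripAccents("\u00e9l\u00e8ve \u0151")); // eleve o
    }
}
```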
[jira] Commented: (LUCENE-112) [PATCH] Add an IndexReader implementation that frees resources when idle and refreshes itself when stale
[ https://issues.apache.org/jira/browse/LUCENE-112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12630265#action_12630265 ] Otis Gospodnetic commented on LUCENE-112: - +1 for closing it. Half a decade ago [PATCH] Add an IndexReader implementation that frees resources when idle and refreshes itself when stale Key: LUCENE-112 URL: https://issues.apache.org/jira/browse/LUCENE-112 Project: Lucene - Java Issue Type: Improvement Components: Index Affects Versions: CVS Nightly - Specify date in submission Environment: Operating System: All Platform: All Reporter: Eric Isakson Priority: Minor Attachments: IdleTimeoutRefreshingIndexReader.html, IdleTimeoutRefreshingIndexReader.java Here is a little something I worked on this weekend that I wanted to contribute back as I think others might find it very useful. I extended IndexReader and added support for configuring an idle timeout and refresh interval. It uses a monitoring thread to watch for the reader going idle. When the reader goes idle it is closed. When the index is read again it is re-opened. It uses another thread to periodically check when the reader needs to be refreshed due to a change to the index. When the reader is stale, it closes the reader and reopens the index. It is actually delegating all the work to another IndexReader implementation and just handling the threading and synchronization. When it closes a reader, it delegates the close to another thread that waits a bit (configurable how long) before actually closing the reader it was delegating to. This gives any consumers of the original reader a chance to finish up their last action on the reader. 
This implementation sacrifices a little bit of speed since there is a bit more synchronization to deal with and the delegation model puts extra calls on the stack, but it should spare long-running applications that have idle periods or frequently changing indices from having to open and close readers all the time or hold open unused resources. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
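The idle-timeout mechanism described above can be sketched minimally (this is not the attached IdleTimeoutRefreshingIndexReader; time is passed in explicitly here so the sketch is testable without real threads): the wrapper records the last access time, a monitor check closes the underlying resource once the idle window elapses, and the next access reopens it lazily.

```java
// Minimal sketch of the idle-timeout pattern: track last access,
// close on the monitor's tick when idle too long, reopen on next use.
public class IdleCloseSketch {
    private final long idleTimeoutMillis;
    private long lastAccessMillis;
    private boolean open = true;

    public IdleCloseSketch(long idleTimeoutMillis, long nowMillis) {
        this.idleTimeoutMillis = idleTimeoutMillis;
        this.lastAccessMillis = nowMillis;
    }

    public void access(long nowMillis) { // a read hits the wrapper
        if (!open) open = true;          // reopen lazily on demand
        lastAccessMillis = nowMillis;
    }

    public void monitorTick(long nowMillis) { // the monitor thread's check
        if (open && nowMillis - lastAccessMillis >= idleTimeoutMillis) {
            open = false;                 // close the idle resource
        }
    }

    public boolean isOpen() { return open; }

    public static void main(String[] args) {
        IdleCloseSketch r = new IdleCloseSketch(1000, 0);
        r.monitorTick(500);              // still within the idle window
        System.out.println(r.isOpen()); // true
        r.monitorTick(1500);             // idle too long: closed
        System.out.println(r.isOpen()); // false
        r.access(1600);                  // next read reopens
        System.out.println(r.isOpen()); // true
    }
}
```

In the real contribution the monitor runs on its own thread and the close is deferred briefly so in-flight readers can finish.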
[jira] Issue Comment Edited: (LUCENE-1381) Hanging while indexing/digesting on multiple threads
[ https://issues.apache.org/jira/browse/LUCENE-1381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12630270#action_12630270 ] otis edited comment on LUCENE-1381 at 9/11/08 10:47 AM: David, why not bring this up on java-user list first? Are your 4 IndexWriters writing to the same index? Is this really a Lucene problem? (I don't see any mentions of Lucene in those traces) was (Author: otis): David, why not bring this up on java-user list first? Are your 4 IndexWriters writing to the same index? Hanging while indexing/digesting on multiple threads Key: LUCENE-1381 URL: https://issues.apache.org/jira/browse/LUCENE-1381 Project: Lucene - Java Issue Type: Bug Components: Analysis Affects Versions: 2.3.2 Environment: Java HotSpot(TM) 64-Bit Server VM (1.5.0_16-b02 mixed mode) on 2.6.9-78.0.1.ELsmp #1 SMP x86_64 x86_64 x86_64 GNU/Linux Reporter: David Fertig With several older lucene projects already running and stable, I have recently written a multi-threading indexer using the 2.3.2 release. My volume is in the millions of documents indexed daily and I have been stress testing for a while now. My current setup has 3 JVMs, each running 6 threads indexing different documents, with 1 IndexWriter per JVM. For stability testing, the indexer shuts down and exits every 5-10 minutes, and a new JVM is started again for a clean restart. At this rate, I have noticed a rare, but eventually consistent internal hang/deadlock in all indexer threads while parsing documents. My 'manager' thread is alive and regularly polling the indexer threads and displaying their state variables, but the indexer threads themselves appear not to be making progress while using up nearly 100% of available CPU. Memory usage is relatively low and stable at 481m out of 2048m available. 
Most stack traces look like the following, and stay in this state even after repeated inspections (pressing CTRL-\ in the active JVM window): -- Full thread dump Java HotSpot(TM) 64-Bit Server VM (1.5.0_16-b02 mixed mode): Thread-6 prio=1 tid=0x002b25750920 nid=0x34f6 runnable [0x41465000..0x41465db0] at java.util.WeakHashMap.eq(WeakHashMap.java:254) at java.util.WeakHashMap.get(WeakHashMap.java:345) at org.apache.commons.beanutils.MethodUtils.getMatchingAccessibleMethod(MethodUtils.java:530) at org.apache.commons.beanutils.MethodUtils.invokeMethod(MethodUtils.java:209) at org.apache.commons.digester.CallMethodRule.end(CallMethodRule.java:625) at org.apache.commons.digester.Rule.end(Rule.java:230) at org.apache.commons.digester.Digester.endElement(Digester.java:1130) at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.endElement(AbstractSAXParser.java:633) at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(XMLDocumentFragmentScannerImpl.java:1241) at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(XMLDocumentFragmentScannerImpl.java:1685) at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:368) at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:834) at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:764) at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:148) at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1242) at org.apache.commons.digester.Digester.parse(Digester.java:1685) ...
Thread-5 prio=1 tid=0x002b25754eb0 nid=0x34f5 runnable [0x41364000..0x41364d30] at java.lang.String.equals(String.java:858) at org.apache.commons.beanutils.MethodUtils$MethodDescriptor.equals(MethodUtils.java:833) at java.util.WeakHashMap.eq(WeakHashMap.java:254) at java.util.WeakHashMap.get(WeakHashMap.java:345) at org.apache.commons.beanutils.MethodUtils.getMatchingAccessibleMethod(MethodUtils.java:530) at org.apache.commons.beanutils.MethodUtils.invokeMethod(MethodUtils.java:209) at org.apache.commons.digester.CallMethodRule.end(CallMethodRule.java:625) at org.apache.commons.digester.Rule.end(Rule.java:230) at org.apache.commons.digester.Digester.endElement(Digester.java:1130) at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.endElement(AbstractSAXParser.java:633) at
[jira] Commented: (LUCENE-1381) Hanging while indexing/digesting on multiple threads
[ https://issues.apache.org/jira/browse/LUCENE-1381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12630270#action_12630270 ] Otis Gospodnetic commented on LUCENE-1381: -- David, why not bring this up on java-user list first? Are your 4 IndexWriters writing to the same index?
[jira] Resolved: (LUCENE-1381) Hanging while indexing/digesting on multiple threads
[ https://issues.apache.org/jira/browse/LUCENE-1381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic resolved LUCENE-1381. -- Resolution: Invalid This is a new piece of code and the stack trace doesn't show Lucene, so I'm marking this as Invalid for now.
[jira] Commented: (LUCENE-1378) Remove remaining @author references
[ https://issues.apache.org/jira/browse/LUCENE-1378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12629631#action_12629631 ] Otis Gospodnetic commented on LUCENE-1378: -- Eh, rusty perl $ find . -name \*.java | xargs grep '@author' | cut -d':' -f1 | xargs perl -pi -e 's/\* @author.*//' Doesn't work -- that \* in front of @author doesn't cut it. Remove remaining @author references --- Key: LUCENE-1378 URL: https://issues.apache.org/jira/browse/LUCENE-1378 Project: Lucene - Java Issue Type: Task Reporter: Otis Gospodnetic Assignee: Otis Gospodnetic Priority: Trivial Fix For: 2.4 Attachments: LUCENE-1378.patch $ find . -name \*.java | xargs grep '@author' | cut -d':' -f1 | xargs perl -pi -e 's/ [EMAIL PROTECTED]//'
[jira] Created: (LUCENE-1378) Remove remaining @author references
Remove remaining @author references --- Key: LUCENE-1378 URL: https://issues.apache.org/jira/browse/LUCENE-1378 Project: Lucene - Java Issue Type: Task Reporter: Otis Gospodnetic Priority: Trivial Fix For: 2.4 Attachments: LUCENE-1378.patch $ find . -name \*.java | xargs grep '@author' | cut -d':' -f1 | xargs perl -pi -e 's/ [EMAIL PROTECTED]//'
[jira] Updated: (LUCENE-1378) Remove remaining @author references
[ https://issues.apache.org/jira/browse/LUCENE-1378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic updated LUCENE-1378: - Attachment: LUCENE-1378.patch Remove remaining @author references --- Key: LUCENE-1378 URL: https://issues.apache.org/jira/browse/LUCENE-1378 Project: Lucene - Java Issue Type: Task Reporter: Otis Gospodnetic Priority: Trivial Fix For: 2.4 Attachments: LUCENE-1378.patch $ find . -name \*.java | xargs grep '@author' | cut -d':' -f1 | xargs perl -pi -e 's/ [EMAIL PROTECTED]//'
[jira] Commented: (LUCENE-1131) Add numDeletedDocs to IndexReader
[ https://issues.apache.org/jira/browse/LUCENE-1131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12628355#action_12628355 ] Otis Gospodnetic commented on LUCENE-1131: -- I think so - applies and compiles. Add numDeletedDocs to IndexReader - Key: LUCENE-1131 URL: https://issues.apache.org/jira/browse/LUCENE-1131 Project: Lucene - Java Issue Type: New Feature Reporter: Shai Erera Assignee: Otis Gospodnetic Priority: Minor Fix For: 2.4 Attachments: LUCENE-1131.patch Add numDeletedDocs to IndexReader. Basically, the implementation is as simple as doing: public int numDeletedDocs() { return deletedDocs == null ? 0 : deletedDocs.count(); } in SegmentReader. Patch to follow to include in all IndexReader extensions.
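The SegmentReader-level count above can be expressed equivalently from numbers any reader already exposes; a minimal sketch (class and method names hypothetical, not from the patch):

```java
// Sketch: the deleted-doc count is the gap between the doc id bound
// (maxDoc) and the number of live documents (numDocs). When there are
// no deletions the gap is zero, matching deletedDocs == null above.
public class NumDeletedDocsSketch {
    public static int numDeletedDocs(int maxDoc, int numDocs) {
        return maxDoc - numDocs;
    }

    public static void main(String[] args) {
        System.out.println(numDeletedDocs(100, 97)); // 3 docs deleted
    }
}
```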
[jira] Commented: (LUCENE-1366) Rename Field.Index.UN_TOKENIZED/TOKENIZED/NO_NORMS
[ https://issues.apache.org/jira/browse/LUCENE-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12626654#action_12626654 ] Otis Gospodnetic commented on LUCENE-1366: -- I like the name choices - they read nicely, are easy to understand, and match what actually happens. Rename Field.Index.UN_TOKENIZED/TOKENIZED/NO_NORMS -- Key: LUCENE-1366 URL: https://issues.apache.org/jira/browse/LUCENE-1366 Project: Lucene - Java Issue Type: Improvement Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2 Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Fix For: 2.4 Attachments: LUCENE-1366.patch There is confusion about these current Field options and I think we should rename them, deprecating the old names in 2.4/2.9 and removing them in 3.0. How about this: {code} TOKENIZED -> ANALYZED UN_TOKENIZED -> NOT_ANALYZED NO_NORMS -> NOT_ANALYZED_NO_NORMS {code} Should we also add ANALYZED_NO_NORMS? Spinoff from here: http://mail-archives.apache.org/mod_mbox/lucene-java-user/200808.mbox/%3C48a3076a.2679420a.1c53.a5c4%40mx.google.com%3E
[jira] Assigned: (LUCENE-1360) A Similarity class which has unique length norms for numTerms &lt;= 10
[ https://issues.apache.org/jira/browse/LUCENE-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic reassigned LUCENE-1360: Assignee: Otis Gospodnetic A Similarity class which has unique length norms for numTerms &lt;= 10 --- Key: LUCENE-1360 URL: https://issues.apache.org/jira/browse/LUCENE-1360 Project: Lucene - Java Issue Type: Improvement Reporter: Sean Timm Assignee: Otis Gospodnetic Priority: Trivial Attachments: ShortFieldNormSimilarity.java A Similarity class which extends DefaultSimilarity and simply overrides lengthNorm. lengthNorm is implemented as a lookup for numTerms &lt;= 10, else as {{1/sqrt(numTerms)}}. This avoids term counts below 11 mapping to the same lengthNorm after being stored as a single byte in the index. This is useful if your search is only on short fields such as titles or product descriptions. See mailing list discussion: http://www.nabble.com/How-to-boost-the-score-higher-in-case-user-query-matches-entire-field-value-than-just-some-words-within-a-field-td19079221.html
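The attached ShortFieldNormSimilarity.java is not reproduced in this thread; the following is only a sketch of the idea under stated assumptions (the lookup values are hypothetical, chosen to stay distinct after Lucene's 8-bit norm encoding):

```java
// Sketch of a lengthNorm with a lookup table for short fields
// (numTerms <= 10), falling back to 1/sqrt(numTerms) otherwise.
public class ShortFieldNormSketch {
    // Hypothetical, strictly decreasing norms for lengths 0..10 so that
    // each short length keeps a distinct value when quantized to a byte.
    private static final float[] NORMS = {
        0.0f, 1.0f, 0.875f, 0.75f, 0.625f, 0.5f,
        0.4375f, 0.375f, 0.3125f, 0.25f, 0.1875f
    };

    public static float lengthNorm(int numTerms) {
        if (numTerms >= 0 && numTerms <= 10) {
            return NORMS[numTerms];
        }
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 12; n++) {
            System.out.println(n + " terms -> norm " + lengthNorm(n));
        }
    }
}
```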
[jira] Closed: (LUCENE-1275) Expose Document Number
[ https://issues.apache.org/jira/browse/LUCENE-1275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic closed LUCENE-1275. Resolution: Invalid Lucene Fields: [New, Patch Available] (was: [Patch Available, New]) Expose Document Number -- Key: LUCENE-1275 URL: https://issues.apache.org/jira/browse/LUCENE-1275 Project: Lucene - Java Issue Type: New Feature Components: Index, Store Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.4, 2.9, 3.0 Environment: All Reporter: Hasan Diwan Priority: Minor Attachments: lucene.pat Lucene maintains an internal document number, which this patch exposes using a mutator/accessor pair of methods. The field is set on document addition. This creates a unique way to refer to a document for editing and updating individual documents.
[jira] Commented: (LUCENE-1219) support array/offset/length setters for Field with binary data
[ https://issues.apache.org/jira/browse/LUCENE-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12623433#action_12623433 ] Otis Gospodnetic commented on LUCENE-1219: -- Eks Dev: out of curiosity, did you ever measure the before/after performance difference? If so, what numbers did you see? support array/offset/length setters for Field with binary data --- Key: LUCENE-1219 URL: https://issues.apache.org/jira/browse/LUCENE-1219 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Eks Dev Assignee: Michael McCandless Priority: Minor Fix For: 2.4 Attachments: LUCENE-1219.extended.patch, LUCENE-1219.extended.patch, LUCENE-1219.patch, LUCENE-1219.patch, LUCENE-1219.patch, LUCENE-1219.patch, LUCENE-1219.take2.patch, LUCENE-1219.take3.patch Currently the Field/Fieldable interface supports only compact, zero-based byte arrays. This forces end users to create and copy the content of new objects before passing them to Lucene, as such fields are often of variable size. Depending on the use case, this can bring a far from negligible performance improvement. This approach extends the Fieldable interface with 3 new methods: getOffset(), getLength(), and getBinaryValue() (the last only returns a reference to the array).
[jira] Updated: (LUCENE-1359) FrenchAnalyzer's tokenStream method does not honour the contract of Analyzer
[ https://issues.apache.org/jira/browse/LUCENE-1359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic updated LUCENE-1359: - Summary: FrenchAnalyzer's tokenStream method does not honour the contract of Analyzer (was: FrenchAnalyzer's tokenStream method does not honour the contact of Analyzer) FrenchAnalyzer's tokenStream method does not honour the contract of Analyzer Key: LUCENE-1359 URL: https://issues.apache.org/jira/browse/LUCENE-1359 Project: Lucene - Java Issue Type: Bug Components: Analysis Affects Versions: 2.2 Reporter: Andrew Lynch In {{Analyzer}}: {code} /** Creates a TokenStream which tokenizes all the text in the provided Reader. Default implementation forwards to tokenStream(Reader) for compatibility with older versions. Override to allow Analyzer to choose strategy based on document and/or field. Must be able to handle null field name for backward compatibility. */ public abstract TokenStream tokenStream(String fieldName, Reader reader); {code} and in {{FrenchAnalyzer}} {code} public final TokenStream tokenStream(String fieldName, Reader reader) { if (fieldName == null) throw new IllegalArgumentException("fieldName must not be null"); if (reader == null) throw new IllegalArgumentException("reader must not be null"); {code}
[jira] Updated: (LUCENE-1124) short circuit FuzzyQuery.rewrite when input token length is small compared to minSimilarity
[ https://issues.apache.org/jira/browse/LUCENE-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic updated LUCENE-1124: - Summary: short circuit FuzzyQuery.rewrite when input token length is small compared to minSimilarity (was: short circuit FuzzyQuery.rewrite when input okenlengh is small compared to minSimilarity) short circuit FuzzyQuery.rewrite when input token length is small compared to minSimilarity --- Key: LUCENE-1124 URL: https://issues.apache.org/jira/browse/LUCENE-1124 Project: Lucene - Java Issue Type: Improvement Components: Query/Scoring Reporter: Hoss Man Attachments: LUCENE-1124.patch, LUCENE-1124.patch I found this (unreplied to) email floating around in my Lucene folder from during the holidays... {noformat} From: Timo Nentwig To: java-dev Subject: Fuzzy makes no sense for short tokens Date: Mon, 31 Dec 2007 16:01:11 +0100 Message-Id: [EMAIL PROTECTED] Hi! It generally makes no sense to search fuzzy for short tokens, because changing even only a single character of course already results in a high edit distance. So it actually only makes sense in this case: if (token.length() > 1f / (1f - minSimilarity)) E.g. changing one character in a 3-letter token (foo) results in a similarity of 0.6. And if minSimilarity (which is by default 0.5 :-) is higher, we can save all the expensive rewrite() logic. {noformat} I don't know much about FuzzyQueries, but this reasoning seems sound ... FuzzyQuery.rewrite should be able to completely skip all TermEnumeration in the event that the input token is shorter than some simple math on the minSimilarity. (I'm not smart enough to be certain that the math above is right, however ... it's been a while since I looked at Levenshtein distances ... tests needed)
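The proposed guard can be sketched as a standalone helper (names hypothetical, not Lucene's actual rewrite code; boundary behavior depends on whether minSimilarity is an exclusive threshold):

```java
// Sketch: decide whether a fuzzy rewrite can be skipped entirely.
// When token.length() * (1 - minSimilarity) < 1, even a single edit
// drops the similarity below minSimilarity, so only an exact match
// can qualify and the expensive term enumeration is unnecessary.
public class FuzzyShortCircuit {
    public static boolean canSkipRewrite(String term, float minSimilarity) {
        return term.length() < 1f / (1f - minSimilarity);
    }

    public static void main(String[] args) {
        // With minSimilarity = 0.5 the threshold is 2: a 1-char token can
        // be short-circuited, a 3-char token still needs the rewrite.
        System.out.println(canSkipRewrite("f", 0.5f));
        System.out.println(canSkipRewrite("foo", 0.5f));
    }
}
```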
[jira] Commented: (LUCENE-1358) Deadlock for some Query objects in the equals method (f.ex. PhraseQuery) in a concurrent environment
[ https://issues.apache.org/jira/browse/LUCENE-1358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12623437#action_12623437 ] Otis Gospodnetic commented on LUCENE-1358: -- It sounds like you are simply demonstrating an old bug, right? If so, then we can close this issue, since LUCENE-1346 fixed the bug you described (I didn't verify that). Deadlock for some Query objects in the equals method (f.ex. PhraseQuery) in a concurrent environment Key: LUCENE-1358 URL: https://issues.apache.org/jira/browse/LUCENE-1358 Project: Lucene - Java Issue Type: Bug Components: Other Affects Versions: 2.3.2 Reporter: Torbjørn Køhler Priority: Minor Attachments: TestDeadLock.java Original Estimate: 0h Remaining Estimate: 0h Some Query objects in Lucene 2.3.2 (and previous versions) have internal variables using Vector. These variables are used during the call to the equals method. In a concurrent environment a deadlock might occur. The attached code example shows this happening in Lucene 2.3.2, but the patch in LUCENE-1346 fixes this issue (though that doesn't seem to be the intention of that patch according to the description :-)
[jira] Commented: (LUCENE-1275) Expose Document Number
[ https://issues.apache.org/jira/browse/LUCENE-1275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12622903#action_12622903 ] Otis Gospodnetic commented on LUCENE-1275: -- Hasan, please see Hoss' and my comments above. Expose Document Number -- Key: LUCENE-1275 URL: https://issues.apache.org/jira/browse/LUCENE-1275 Project: Lucene - Java Issue Type: New Feature Components: Index, Store Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.4, 2.9, 3.0 Environment: All Reporter: Hasan Diwan Priority: Minor Attachments: lucene.pat Lucene maintains an internal document number, which this patch exposes using a mutator/accessor pair of methods. The field is set on document addition. This creates a unique way to refer to a document for editing and updating individual documents.
[jira] Commented: (LUCENE-1308) Remove String.intern() from Field.java to increase performance and lower contention
[ https://issues.apache.org/jira/browse/LUCENE-1308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12605832#action_12605832 ] Otis Gospodnetic commented on LUCENE-1308: -- Rene, can you provide a patch along with unit tests? Have you or can you run contrib/benchmarks and include your before-the-changes and after-the-changes results here, so we can see what difference this change makes? Thanks. Remove String.intern() from Field.java to increase performance and lower contention --- Key: LUCENE-1308 URL: https://issues.apache.org/jira/browse/LUCENE-1308 Project: Lucene - Java Issue Type: Improvement Affects Versions: 2.3.2 Reporter: Rene S Right now, document.Field is interning all field names. While this makes sense because it lowers the overall memory consumption, the String method intern() is known to be difficult to handle: 1) it is a native call and therefore slower than anything on the Java level; 2) the String pool is part of the perm space and not of the general heap, so its size is more restricted and needs extra VM params to be managed; 3) some VMs show GC problems with strings in the string pool. The suggested solution is a WeakHashMap instead, which takes care of unifying the String instances while keeping the pool in the heap space and releasing a String when it is no longer needed. For extra performance in a concurrent environment, a ConcurrentHashMap-like implementation of a weak hashmap is recommended, because we mostly read from the pool. We saw a 10% improvement in throughput and response time of our application, and the application is not only doing searches (we read a lot of documents from the results). So a single measurement test case could show even more improvement in single and concurrent usage.
The cache: /** Cache to replace the expensive String.intern() call with a Java-level version */ private final static Map<String, WeakReference<String>> unifiedStringsCache = Collections.synchronizedMap(new WeakHashMap<String, WeakReference<String>>(109)); The access to it, instead of this.name = name.intern(): // unify the strings, but do not use the expensive String.intern() version // which is not weak enough, uses the perm space and is a native call String unifiedName = null; WeakReference<String> ref = unifiedStringsCache.get(name); if (ref != null) { unifiedName = ref.get(); } if (unifiedName == null) { unifiedStringsCache.put(name, new WeakReference<String>(name)); unifiedName = name; } this.name = unifiedName; I guess it is sufficient to have mostly all field names interned, so I skipped the additional synchronization around the access and take the risk that only 99.99% :) of all field names are interned.
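The snippet above can be assembled into a small self-contained class; this is a sketch of the described technique (class name hypothetical), not the actual Field.java patch:

```java
import java.lang.ref.WeakReference;
import java.util.Collections;
import java.util.Map;
import java.util.WeakHashMap;

// Sketch of the weak interning cache described above: unify equal String
// instances on the heap instead of calling the native String.intern().
public class WeakInterner {
    private static final Map<String, WeakReference<String>> CACHE =
        Collections.synchronizedMap(
            new WeakHashMap<String, WeakReference<String>>(109));

    public static String intern(String s) {
        WeakReference<String> ref = CACHE.get(s);
        String unified = (ref != null) ? ref.get() : null;
        if (unified == null) {
            // Key and value both refer to the same string only weakly, so
            // the entry can be collected once no caller holds the string.
            CACHE.put(s, new WeakReference<String>(s));
            unified = s;
        }
        return unified;
    }

    public static void main(String[] args) {
        String a = new String("title");
        String b = new String("title");
        // Distinct objects, equal content: both map to one canonical instance.
        System.out.println(intern(a) == intern(b));
    }
}
```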
[jira] Updated: (LUCENE-1142) Updated Snowball package
[ https://issues.apache.org/jira/browse/LUCENE-1142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic updated LUCENE-1142: - Fix Version/s: 2.4 Updated Snowball package Key: LUCENE-1142 URL: https://issues.apache.org/jira/browse/LUCENE-1142 Project: Lucene - Java Issue Type: Improvement Components: Analysis Reporter: Karl Wettin Priority: Minor Fix For: 2.4 Attachments: snowball.tartarus.txt Updated Snowball contrib package: * New org.tartarus.snowball java package with SnowballProgram patched to be abstract, to avoid using reflection. * Introduces Hungarian, Turkish and Romanian stemmers. * Introduces the constructor SnowballFilter(SnowballProgram). It is possible that some changes have been made to some of the stemmer algorithms between this patch and the current SVN trunk of Lucene; an index might thus not be compatible with the new stemmers! The API is backward compatible and the tests pass.
[jira] Assigned: (LUCENE-1180) Syns2Index fails
[ https://issues.apache.org/jira/browse/LUCENE-1180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic reassigned LUCENE-1180: Assignee: Otis Gospodnetic Syns2Index fails Key: LUCENE-1180 URL: https://issues.apache.org/jira/browse/LUCENE-1180 Project: Lucene - Java Issue Type: Bug Components: contrib/* Affects Versions: 2.3 Reporter: Jeffrey Yang Assignee: Otis Gospodnetic Priority: Minor Attachments: syns2index_fix_2.3.patch, syns2index_fix_2.4-dev.patch Original Estimate: 1h Remaining Estimate: 1h Running Syns2Index fails with a java.lang.IllegalArgumentException: "maxBufferedDocs must at least be 2 when enabled". at org.apache.lucene.index.IndexWriter.setMaxBufferedDocs(IndexWriter.java:883) at org.apache.lucene.wordnet.Syns2Index.index(Syns2Index.java:249) at org.apache.lucene.wordnet.Syns2Index.main(Syns2Index.java:208) The code is here: // blindly up these parameters for speed writer.setMergeFactor( writer.getMergeFactor() * 2); writer.setMaxBufferedDocs( writer.getMaxBufferedDocs() * 2); It looks like getMaxBufferedDocs() used to return 10, and now it returns -1; not sure when that started happening. My suggestion would be to just remove these three lines. Since speed has already improved vastly, there isn't a need to speed things up. To run it, Syns2Index requires two args: the first is the location of the wn_s.pl file, and the second is the directory to create the index in.
[jira] Created: (LUCENE-1307) Remove Contributions page
Remove Contributions page - Key: LUCENE-1307 URL: https://issues.apache.org/jira/browse/LUCENE-1307 Project: Lucene - Java Issue Type: Improvement Components: Website Reporter: Otis Gospodnetic Priority: Minor On Fri, May 16, 2008 at 10:06 PM, Otis Gospodnetic [EMAIL PROTECTED] wrote: Hola, Does anyone think the Contributions page should be removed? http://lucene.apache.org/java/2_3_2/contributions.html It looks so outdated that I think it may give newcomers a bad impression of Lucene ("What, this is it for contributions?"). The only really valuable piece there is Luke, but Luke must be mentioned in a dozen places on the Wiki anyway. Should we remove the Contributions page? Yonik and Grant gave their +1s. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-1306) CombinedNGramTokenFilter
[ https://issues.apache.org/jira/browse/LUCENE-1306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12605120#action_12605120 ] Otis Gospodnetic commented on LUCENE-1306: -- Should there be a way for the client of this class to specify the prefix and suffix char? Is having, for example, ^h as the first bi-gram token really the right thing to do? Would ^he make more sense? I know that makes it 3 characters long, but it's 2 chars from the input string. Not sure, so I'm asking. Is this primarily to distinguish between the edge and inner n-grams? If so, would it make more sense to just make use of the Token type variable instead? CombinedNGramTokenFilter Key: LUCENE-1306 URL: https://issues.apache.org/jira/browse/LUCENE-1306 Project: Lucene - Java Issue Type: New Feature Components: contrib/analyzers Reporter: Karl Wettin Assignee: Karl Wettin Priority: Trivial Attachments: LUCENE-1306.txt Alternative NGram filter that produces tokens with composite prefix and suffix markers. {code:java} ts = new WhitespaceTokenizer(new StringReader("hello")); ts = new CombinedNGramTokenFilter(ts, 2, 2); assertNext(ts, "^h"); assertNext(ts, "he"); assertNext(ts, "el"); assertNext(ts, "ll"); assertNext(ts, "lo"); assertNext(ts, "o$"); assertNull(ts.next()); {code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
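The token stream quoted in the issue can be reproduced by a minimal standalone sketch (hypothetical code, not the attached patch): pad the term with the '^' and '$' markers, then emit every n-gram of the padded string.

```java
import java.util.ArrayList;
import java.util.List;

// Standalone sketch of the behavior shown in the issue's test: "hello" with
// n=2 yields ^h, he, el, ll, lo, o$ (edge grams carry the boundary markers).
public class CombinedNGramDemo {
    public static List<String> ngrams(String term, int n) {
        String padded = "^" + term + "$";
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= padded.length(); i++) {
            out.add(padded.substring(i, i + n));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(ngrams("hello", 2)); // [^h, he, el, ll, lo, o$]
    }
}
```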
[jira] Updated: (LUCENE-1297) Allow other string distance measures in spellchecker
[ https://issues.apache.org/jira/browse/LUCENE-1297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic updated LUCENE-1297: - Attachment: LUCENE-1297.patch Attaching a new version (only added ASL 2.0 to StringDistance + typo fix) Question (why - what does it do?) about this TRStringDistance change: -return p[n]; +return 1.0f - ((float) p[n] / Math.min(other.length(), sa.length)); Allow other string distance measures in spellchecker Key: LUCENE-1297 URL: https://issues.apache.org/jira/browse/LUCENE-1297 Project: Lucene - Java Issue Type: New Feature Components: contrib/spellchecker Affects Versions: 2.4 Environment: n/a Reporter: Thomas Morton Assignee: Otis Gospodnetic Priority: Minor Fix For: 2.4 Attachments: LUCENE-1297.patch, LUCENE-1297.patch Updated spelling code to allow for other string distance measures to be used. Created StringDistance interface. Modified existing Levenshtein distance measure to implement interface (and renamed class). Verified that change to Levenshtein distance didn't impact runtime performance. Implemented Jaro/Winkler distance metric. Modified SpellChecker to take distance measure in constructor or in set method and to use interface when calling. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
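To the "what does it do?" question: p[n] holds the raw Levenshtein edit distance, so the new return value rescales it into a similarity-like score. A hypothetical illustration (not the Lucene source; the distances are assumed, not computed here):

```java
// Sketch of the quoted diff's arithmetic. The raw distance between "kitten"
// (6 chars) and "sitting" (7 chars) is 3; the diff divides by the *minimum*
// of the two lengths, which can push the score below 0 for very different
// strings, whereas a max-length denominator would keep it in [0, 1].
public class NormalizationDemo {
    public static float similarity(int distance, int lenA, int lenB) {
        return 1.0f - (float) distance / Math.min(lenA, lenB);
    }

    public static void main(String[] args) {
        System.out.println(similarity(3, 6, 7)); // 1 - 3/6 = 0.5
        // distance 3 between "a" and "xyz": 1 - 3/1 = -2.0 (negative!)
        System.out.println(similarity(3, 1, 3));
    }
}
```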
[jira] Commented: (LUCENE-1297) Allow other string distance measures in spellchecker
[ https://issues.apache.org/jira/browse/LUCENE-1297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12604538#action_12604538 ] Otis Gospodnetic commented on LUCENE-1297: -- Tom, I agree with Grant and I'll assume you'll update the patch. As for that TRStringDistance -> LevensteinDistance rename, I'll just commit it as is once the patch is fully ready, and then I'll rename classes in a separate commit. Allow other string distance measures in spellchecker Key: LUCENE-1297 URL: https://issues.apache.org/jira/browse/LUCENE-1297 Project: Lucene - Java Issue Type: New Feature Components: contrib/spellchecker Affects Versions: 2.4 Environment: n/a Reporter: Thomas Morton Assignee: Otis Gospodnetic Priority: Minor Fix For: 2.4 Attachments: string_distance3.patch Updated spelling code to allow for other string distance measures to be used. Created StringDistance interface. Modified existing Levenshtein distance measure to implement interface (and renamed class). Verified that change to Levenshtein distance didn't impact runtime performance. Implemented Jaro/Winkler distance metric. Modified SpellChecker to take distance measure in constructor or in set method and to use interface when calling. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Resolved: (LUCENE-1178) Hits does not use MultiSearcher's createWeight
[ https://issues.apache.org/jira/browse/LUCENE-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic resolved LUCENE-1178. -- Resolution: Won't Fix With Hits getting deprecated, I think it doesn't make sense to pursue this. If anyone disagrees, we can reopen. Hits does not use MultiSearcher's createWeight -- Key: LUCENE-1178 URL: https://issues.apache.org/jira/browse/LUCENE-1178 Project: Lucene - Java Issue Type: Bug Components: Search Affects Versions: 2.3 Reporter: Israel Tsadok Assignee: Otis Gospodnetic Attachments: hits.diff I am developing a distributed index, using MultiSearcher and RemoteSearcher. When investigating some performance issues, I noticed that there is a lot of back-and-forth traffic between the servers during the weight calculation. Although MultiSearcher has a method called createWeight that minimizes the calls to the sub-searchers, this method never actually gets called when I call search(query). From what I can tell, this is fixable by changing in Hits.java the line: weight = q.weight(s); to: weight = s.createWeight(q); -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-1297) Allow other string distance measures in spellchecker
[ https://issues.apache.org/jira/browse/LUCENE-1297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12604121#action_12604121 ] Otis Gospodnetic commented on LUCENE-1297: -- Tom, note the bit about naming patches and reusing patch names on the HowToContribute wiki page. I see JaroWinklerDistance.java doesn't have ASL on top. Oh, there is something funky about this patch. You created a new class (LevenshteinDistance), but your patch shows it as an edit of TRStringDistance. It should show it as a brand new file. Could you please provide a clean patch? This is why the patch fails to apply. Thanks. Allow other string distance measures in spellchecker Key: LUCENE-1297 URL: https://issues.apache.org/jira/browse/LUCENE-1297 Project: Lucene - Java Issue Type: New Feature Components: contrib/spellchecker Affects Versions: 2.4 Environment: n/a Reporter: Thomas Morton Assignee: Otis Gospodnetic Priority: Minor Fix For: 2.4 Attachments: string_distance.patch2 Updated spelling code to allow for other string distance measures to be used. Created StringDistance interface. Modified existing Levenshtein distance measure to implement interface (and renamed class). Verified that change to Levenshtein distance didn't impact runtime performance. Implemented Jaro/Winkler distance metric. Modified SpellChecker to take distance measure in constructor or in set method and to use interface when calling. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-1297) Allow other string distance measures in spellchecker
[ https://issues.apache.org/jira/browse/LUCENE-1297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12602151#action_12602151 ] Otis Gospodnetic commented on LUCENE-1297: -- Thomas - any chance you can write a simple unit test that exercises JaroWinkler? Allow other string distance measures in spellchecker Key: LUCENE-1297 URL: https://issues.apache.org/jira/browse/LUCENE-1297 Project: Lucene - Java Issue Type: New Feature Components: contrib/spellchecker Affects Versions: 2.4 Environment: n/a Reporter: Thomas Morton Assignee: Otis Gospodnetic Priority: Minor Fix For: 2.4 Attachments: string_distance.patch Updated spelling code to allow for other string distance measures to be used. Created StringDistance interface. Modified existing Levenshtein distance measure to implement interface (and renamed class). Verified that change to Levenshtein distance didn't impact runtime performance. Implemented Jaro/Winkler distance metric. Modified SpellChecker to take distance measure in constructor or in set method and to use interface when calling. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-1297) Allow other string distance measures in spellchecker
[ https://issues.apache.org/jira/browse/LUCENE-1297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12601335#action_12601335 ] Otis Gospodnetic commented on LUCENE-1297: -- You read my mind, Thomas. Would it be appropriate to add and try Jaccard index and Dice coefficient, too, then? Allow other string distance measures in spellchecker Key: LUCENE-1297 URL: https://issues.apache.org/jira/browse/LUCENE-1297 Project: Lucene - Java Issue Type: New Feature Components: contrib/spellchecker Affects Versions: 2.4 Environment: n/a Reporter: Thomas Morton Assignee: Otis Gospodnetic Priority: Minor Fix For: 2.4 Attachments: string_distance.patch Updated spelling code to allow for other string distance measures to be used. Created StringDistance interface. Modified existing Levenshtein distance measure to implement interface (and renamed class). Verified that change to Levenshtein distance didn't impact runtime performance. Implemented Jaro/Winkler distance metric. Modified SpellChecker to take distance measure in constructor or in set method and to use interface when calling. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
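For the two measures mentioned above, a minimal sketch (hypothetical code, not a Lucene StringDistance implementation) computed over character bigrams, one common choice for string similarity:

```java
import java.util.HashSet;
import java.util.Set;

// Jaccard index = |X ∩ Y| / |X ∪ Y|; Dice coefficient = 2|X ∩ Y| / (|X| + |Y|),
// where X and Y are the bigram sets of the two strings.
public class SetOverlapDemo {
    static Set<String> bigrams(String s) {
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + 2 <= s.length(); i++) grams.add(s.substring(i, i + 2));
        return grams;
    }

    public static double jaccard(String a, String b) {
        Set<String> x = bigrams(a), y = bigrams(b);
        Set<String> inter = new HashSet<>(x);
        inter.retainAll(y);
        Set<String> union = new HashSet<>(x);
        union.addAll(y);
        return union.isEmpty() ? 1.0 : (double) inter.size() / union.size();
    }

    public static double dice(String a, String b) {
        Set<String> x = bigrams(a), y = bigrams(b);
        Set<String> inter = new HashSet<>(x);
        inter.retainAll(y);
        int total = x.size() + y.size();
        return total == 0 ? 1.0 : 2.0 * inter.size() / total;
    }

    public static void main(String[] args) {
        // "night" vs "nacht": bigram sets share only "ht" (1 of 7 in the union)
        System.out.println(jaccard("night", "nacht"));
        System.out.println(dice("night", "nacht")); // 2*1/(4+4) = 0.25
    }
}
```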
[jira] Commented: (LUCENE-1295) Make retrieveTerms(int docNum) public in MoreLikeThis
[ https://issues.apache.org/jira/browse/LUCENE-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12601340#action_12601340 ] Otis Gospodnetic commented on LUCENE-1295: -- I think cosmetic changes are OK if: * they are not mixed with functional changes * there are no patches for the cleaned-up class(es) in JIRA In this case I see only a couple of MLT issues, all of which look like we can take care of them quickly, and then somebody can clean up a little if we feel like it. Anyhow... Make retrieveTerms(int docNum) public in MoreLikeThis - Key: LUCENE-1295 URL: https://issues.apache.org/jira/browse/LUCENE-1295 Project: Lucene - Java Issue Type: Improvement Components: contrib/* Reporter: Grant Ingersoll Assignee: Grant Ingersoll Priority: Trivial Attachments: LUCENE-1295.patch It would be useful if {code} private PriorityQueue retrieveTerms(int docNum) throws IOException { {code} were public, since it is similar in use to {code} public PriorityQueue retrieveTerms(Reader r) throws IOException { {code} It also seems useful to add {code} public String [] retrieveInterestingTerms(int docNum) throws IOException{ {code} to mirror the one that works on Reader. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Assigned: (LUCENE-725) NovelAnalyzer - wraps your choice of Lucene Analyzer and filters out all boilerplate text
[ https://issues.apache.org/jira/browse/LUCENE-725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic reassigned LUCENE-725: --- Assignee: Otis Gospodnetic NovelAnalyzer - wraps your choice of Lucene Analyzer and filters out all boilerplate text --- Key: LUCENE-725 URL: https://issues.apache.org/jira/browse/LUCENE-725 Project: Lucene - Java Issue Type: New Feature Components: Analysis Reporter: Mark Harwood Assignee: Otis Gospodnetic Priority: Minor Attachments: NovelAnalyzer.java, NovelAnalyzer.java This is a class I have found to be useful for analyzing small (in the hundreds) collections of documents and removing any duplicate content such as standard disclaimers or repeated text in an exchange of emails. This has applications in sampling query results to identify key phrases, improving speed-reading of results with similar content (eg email threads/forum messages) or just removing duplicated noise from a search index. To be more generally useful it needs to scale to millions of documents - in which case an alternative implementation is required. See the notes in the Javadocs for this class for more discussion on this -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-1295) Make retrieveTerms(int docNum) public in MoreLikeThis
[ https://issues.apache.org/jira/browse/LUCENE-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12600679#action_12600679 ] Otis Gospodnetic commented on LUCENE-1295: -- Perque no. I see MLT is full of tabs, should you feel like fixing the formatting. Make retrieveTerms(int docNum) public in MoreLikeThis - Key: LUCENE-1295 URL: https://issues.apache.org/jira/browse/LUCENE-1295 Project: Lucene - Java Issue Type: Improvement Components: contrib/* Reporter: Grant Ingersoll Assignee: Grant Ingersoll Priority: Trivial Attachments: LUCENE-1295.patch It would be useful if {code} private PriorityQueue retrieveTerms(int docNum) throws IOException { {code} were public, since it is similar in use to {code} public PriorityQueue retrieveTerms(Reader r) throws IOException { {code} It also seems useful to add {code} public String [] retrieveInterestingTerms(int docNum) throws IOException{ {code} to mirror the one that works on Reader. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Assigned: (LUCENE-1178) Hits does not use MultiSearcher's createWeight
[ https://issues.apache.org/jira/browse/LUCENE-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic reassigned LUCENE-1178: Assignee: Otis Gospodnetic Hits does not use MultiSearcher's createWeight -- Key: LUCENE-1178 URL: https://issues.apache.org/jira/browse/LUCENE-1178 Project: Lucene - Java Issue Type: Bug Components: Search Affects Versions: 2.3 Reporter: Israel Tsadok Assignee: Otis Gospodnetic Attachments: hits.diff I am developing a distributed index, using MultiSearcher and RemoteSearcher. When investigating some performance issues, I noticed that there is a lot of back-and-forth traffic between the servers during the weight calculation. Although MultiSearcher has a method called createWeight that minimizes the calls to the sub-searchers, this method never actually gets called when I call search(query). From what I can tell, this is fixable by changing in Hits.java the line: weight = q.weight(s); to: weight = s.createWeight(q); -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-954) Toggle score normalization in Hits
[ https://issues.apache.org/jira/browse/LUCENE-954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12600366#action_12600366 ] Otis Gospodnetic commented on LUCENE-954: - I suppose there is now suddenly no need to work on Hits. I'll resolve this as Won't Fix in a few days, unless somebody has some more thoughts on this. Toggle score normalization in Hits -- Key: LUCENE-954 URL: https://issues.apache.org/jira/browse/LUCENE-954 Project: Lucene - Java Issue Type: Improvement Components: Search Affects Versions: 2.2, 2.3, 2.3.1, 2.4 Environment: any Reporter: Christian Kohlschütter Assignee: Otis Gospodnetic Fix For: 2.4 Attachments: hits-scoreNorm.patch, LUCENE-954.patch The current implementation of the Hits class sometimes performs score normalization. In particular, whenever the top-ranked score is bigger than 1.0, it is normalized to a maximum of 1.0. In this case, Hits may return different score results than TopDocs-based methods. In my scenario (a federated search system), Hits delivered just plain wrong results. I was merging results from several sources, all having homogeneous statistics (similar to MultiSearcher, but over the Internet using HTTP/XML-based protocols). Sometimes, some of the sources had a top-score greater than 1, so I ended up with garbled results. I suggest to add a switch to enable/disable this score-normalization at runtime. My patch (attached) has an additional performance benefit, since score normalization now occurs only when Hits#score() is called, not when creating the Hits result list. Whenever scores are not required, you save one multiplication per retrieved hit (i.e., at least 100 multiplications with the current implementation of Hits). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Resolved: (LUCENE-691) Bob Carpenter's FuzzyTermEnum refactoring
[ https://issues.apache.org/jira/browse/LUCENE-691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic resolved LUCENE-691. - Resolution: Duplicate Assignee: Otis Gospodnetic The patch for Bob's change suggestions is in LUCENE-1183, so this issue is redundant. Bob Carpenter's FuzzyTermEnum refactoring - Key: LUCENE-691 URL: https://issues.apache.org/jira/browse/LUCENE-691 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Otis Gospodnetic Assignee: Otis Gospodnetic Priority: Minor I'll just paste Bob's complete email here. I refactored the org.apache.lucene.search.FuzzyTermEnum edit distance implementation. It now only uses a single pair of arrays, and those never get resized. That required turning the order of text/target around in the loops. You'll see that with the pair of arrays method, they get re-used hand-over-hand, and are assigned to local variables in the tight loops. I removed the calculation of min distance and replaced it with a boolean -- the min wasn't needed, only the test vs. the max. I also flipped some variables around so there's one less addition in the very inner loop and the arrays are now looping in the ordinary way (starting at 0 with a comparison). I also commented out the redundant definition of the public close() [which just called super.close() and had none of its own doc.] I also just compute the max distance each time rather than fiddling with an array -- it's just a little arithmetic done once per term, but that could be put back. I also rewrote min(int,int,int) to get rid of intermediate assignments. Is there a lib for this kind of thing? An intermediate refactoring that does the hand-over-hand with the existing array and resizing strategy is in FuzzyTermEnum.intermed.java. 
I ran the unit tests as follows on both versions (my hat's off to the build.xml author(s) -- this all just worked out of the box and was really easy to follow the first time through): C:\java\lucene-2.0.0>ant -Djunit.includes= -Dtestcase=TestFuzzyQuery test Buildfile: build.xml javacc-uptodate-check: javacc-notice: init: common.compile-core: [javac] Compiling 1 source file to C:\java\lucene-2.0.0\build\classes\java compile-core: compile-demo: common.compile-test: compile-test: test: [junit] Testsuite: org.apache.lucene.search.TestFuzzyQuery [junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 0.453 sec BUILD SUCCESSFUL Total time: 2 seconds Does anyone have regression/performance test harnesses? The unit tests were pretty minimal (which is a good thing!). It'd be nice to test the min impl (ternary vs. if/then) and the assumption about not allocating an array of max distances. All told, the refactored version should be a modest speed improvement, mainly from unfolding the arrays so they're local one-dimensional refs. I don't know what the protocol is for one-off contributions. I'm happy with the Apache license, so that shouldn't be a problem. I also don't know whether you use tabs or spaces -- I untabified the final version and used your two-space format in emacs. - Bob Carpenter package org.apache.lucene.search; /** * Copyright 2004 The Apache Software Foundation * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. * You may obtain a copy of the License at * * http://www.apache.org/licenses/LICENSE-2.0 * * Unless required by applicable law or agreed to in writing, software * distributed under the License is distributed on an "AS IS" BASIS, * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. * See the License for the specific language governing permissions and * limitations under the License. 
*/ import org.apache.lucene.index.IndexReader; import org.apache.lucene.index.Term; import java.io.IOException; /** Subclass of FilteredTermEnum for enumerating all terms that are similar * to the specified filter term. * * <p>Term enumerations are always ordered by Term.compareTo(). Each term in * the enumeration is greater than all that precede it. */ public final class FuzzyTermEnum extends FilteredTermEnum { /* This should be somewhere around the average long word. * If it is longer, we waste time and space. If it is shorter, we waste a * little bit of time growing the array as we encounter longer words. */ private static final int TYPICAL_LONGEST_WORD_IN_INDEX = 19; /* Allows us to save the time required to create a new array * every time similarity is called. These are slices that * will be reused during dynamic programming hand-over-hand * style. */ private final int[] d0; private final int[] d1; private
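The hand-over-hand pattern Bob describes can be sketched on its own, without the Lucene scaffolding (a minimal illustration, not the attached refactoring): two rows are reused by swapping references each iteration instead of allocating a full distance matrix.

```java
// Two-row Levenshtein distance: prev holds row i-1, curr is filled as row i,
// then the references are swapped hand-over-hand. Memory is O(min row),
// never resized, matching the single-pair-of-arrays idea described above.
public class TwoRowLevenshtein {
    public static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j; // empty prefix of a
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1,  // insertion
                                            prev[j] + 1),     // deletion
                                   prev[j - 1] + cost);       // substitution
            }
            int[] tmp = prev; prev = curr; curr = tmp; // hand-over-hand swap
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("kitten", "sitting")); // 3
    }
}
```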
[jira] Commented: (LUCENE-1293) Tweaks to PhraseQuery.explain()
[ https://issues.apache.org/jira/browse/LUCENE-1293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12599557#action_12599557 ] Otis Gospodnetic commented on LUCENE-1293: -- Itamar - could you explain, in plain English, why the above is better? (sorry, I'm not terribly familiar with PhraseQuery's explain(), so I can't tell why this reordering makes the explain output better). Also, if you have more changes to make, please go ahead and put them in a patch. Thanks! Tweaks to PhraseQuery.explain() --- Key: LUCENE-1293 URL: https://issues.apache.org/jira/browse/LUCENE-1293 Project: Lucene - Java Issue Type: Improvement Components: Search Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.4 Reporter: Itamar Syn-Hershko Priority: Minor Fix For: 2.3.2, 2.4 The explain() function in PhraseQuery.java is very clumsy and could use many optimizations. Perhaps it is only because it is intended for use while debugging? Here's an example: {noformat} result.addDetail(fieldExpl); // combine them result.setValue(queryExpl.getValue() * fieldExpl.getValue()); if (queryExpl.getValue() == 1.0f) return fieldExpl; return result; } {noformat} Can easily be tweaked and become: {noformat} if (queryExpl.getValue() == 1.0f) { return fieldExpl; } result.addDetail(fieldExpl); // combine them result.setValue(queryExpl.getValue() * fieldExpl.getValue()); return result; } {noformat} And that's really just for a start... Itamar. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Resolved: (LUCENE-1152) SpellChecker does not work properly on calling indexDictionary after clearIndex
[ https://issues.apache.org/jira/browse/LUCENE-1152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic resolved LUCENE-1152. -- Resolution: Fixed Thank you for the patch! Committed revision 659013. SpellChecker does not work properly on calling indexDictionary after clearIndex --- Key: LUCENE-1152 URL: https://issues.apache.org/jira/browse/LUCENE-1152 Project: Lucene - Java Issue Type: Bug Components: contrib/spellchecker Affects Versions: 2.3 Reporter: Naveen Belkale Assignee: Otis Gospodnetic Priority: Minor Attachments: spellchecker.diff, spellchecker.diff We have to call clearIndex and indexDictionary to rebuild the dictionary from scratch. The call to SpellChecker clearIndex() function does not reset the searcher. Hence, when we call indexDictionary after that, many entries that are already in the stale searcher will not be indexed. Also, I see that IndexReader reader is used for the sole purpose of obtaining the docFreq of a given term in exist() function. This functionality can also be obtained by using just the searcher by calling searcher.docFreq. Thus, can we do away completely with the reader and the code associated with it, like if (IndexReader.isLocked(spellIndex)){ IndexReader.unlock(spellIndex); } and the reader related code in finalize? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-1183) TRStringDistance uses way too much memory (with patch)
[ https://issues.apache.org/jira/browse/LUCENE-1183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12598918#action_12598918 ] Otis Gospodnetic commented on LUCENE-1183: -- Committed the TRStringDistance patch -- thank you! Committed revision 659016. I'll leave the FuzzyTermEnum patch for a later date. Is there anything in Bob's FuzzyTermEnum that is not in this patch? Anything that you'd want to add, Cédrik? TRStringDistance uses way too much memory (with patch) -- Key: LUCENE-1183 URL: https://issues.apache.org/jira/browse/LUCENE-1183 Project: Lucene - Java Issue Type: Improvement Components: contrib/* Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3 Reporter: Cédrik LIME Assignee: Otis Gospodnetic Priority: Minor Attachments: FuzzyTermEnum.patch, TRStringDistance.java, TRStringDistance.patch Original Estimate: 0.17h Remaining Estimate: 0.17h The implementation of TRStringDistance is based on version 2.1 of org.apache.commons.lang.StringUtils#getLevenshteinDistance(String, String), which uses an un-optimized implementation of the Levenshtein Distance algorithm (it uses way too much memory). Please see Bug 38911 (http://issues.apache.org/bugzilla/show_bug.cgi?id=38911) for more information. The commons-lang implementation has been heavily optimized as of version 2.2 (3x speed-up). I have ported the new implementation to TRStringDistance. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Resolved: (LUCENE-1046) Dead code in SpellChecker.java (branch never executes)
[ https://issues.apache.org/jira/browse/LUCENE-1046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic resolved LUCENE-1046. -- Resolution: Fixed Lucene Fields: [New, Patch Available] (was: [Patch Available, New]) Danke schön. Committed revision 659019. Dead code in SpellChecker.java (branch never executes) -- Key: LUCENE-1046 URL: https://issues.apache.org/jira/browse/LUCENE-1046 Project: Lucene - Java Issue Type: Bug Components: contrib/spellchecker Affects Versions: 2.2 Reporter: Joe Assignee: Otis Gospodnetic Priority: Minor Attachments: LUCENE-1046.diff SpellChecker contains the following lines of code: final int goalFreq = (morePopular && ir != null) ? ir.docFreq(new Term(field, word)) : 0; // if the word exists in the real index and we don't care for word frequency, return the word itself if (!morePopular && goalFreq > 0) { return new String[] { word }; } The branch will never execute: the only way for goalFreq to be greater than zero is if morePopular is true, but if morePopular is true, the expression in the if statement evaluates to false. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
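The dead branch can be demonstrated exhaustively with a tiny stand-in (sketch only, not the SpellChecker source; `wordInIndex` stands in for `ir != null` plus a non-zero docFreq):

```java
// goalFreq can only be non-zero when morePopular is true, so the condition
// (!morePopular && goalFreq > 0) is false for every input combination.
public class DeadBranchDemo {
    static boolean branchTaken(boolean morePopular, boolean wordInIndex) {
        int goalFreq = (morePopular && wordInIndex) ? 1 : 0; // stand-in for ir.docFreq(...)
        return !morePopular && goalFreq > 0;
    }

    public static void main(String[] args) {
        for (boolean mp : new boolean[] {false, true})
            for (boolean in : new boolean[] {false, true})
                System.out.println(branchTaken(mp, in)); // always false
    }
}
```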
[jira] Updated: (LUCENE-852) spellchecker: make hard-coded values configurable
[ https://issues.apache.org/jira/browse/LUCENE-852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic updated LUCENE-852: Attachment: LUCENE-852.patch spellchecker: make hard-coded values configurable - Key: LUCENE-852 URL: https://issues.apache.org/jira/browse/LUCENE-852 Project: Lucene - Java Issue Type: Improvement Components: contrib/* Reporter: karin Assignee: Otis Gospodnetic Priority: Minor Attachments: LUCENE-852.patch, LUCENE-852.patch the class org.apache.lucene.search.spell.SpellChecker uses the following hard-coded values in its method indexDictionary: writer.setMergeFactor(300); writer.setMaxBufferedDocs(150); this poses problems when the spellcheck index is created on systems with certain limits, i.e. in unix environments where the ulimit settings are restricted for the user (http://www.gossamer-threads.com/lists/lucene/java-dev/47428#47428). there are several ways to circumvent this: 1. add another indexDictionary method with additional parameters: public void indexDictionary (Dictionary dict, int mergeFactor, int maxBufferedDocs) throws IOException 2. add setter methods for mergeFactor and maxBufferedDocs (see code in http://www.gossamer-threads.com/lists/lucene/java-dev/47428#47428 ) 3. Make SpellChecker subclassing easier as suggested by Chris Hostetter (see reply http://www.gossamer-threads.com/lists/lucene/java-dev/47463#47463) thanx, karin -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
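Option 2 above (setters) could look like the following sketch (hypothetical class and names, no Lucene dependency; the defaults are the hard-coded values quoted in the report):

```java
// Expose the two tuning values through setters, keeping the old hard-coded
// numbers as defaults so existing behavior is unchanged unless overridden.
public class SpellIndexConfig {
    private int mergeFactor = 300;     // previously hard-coded in indexDictionary
    private int maxBufferedDocs = 150; // previously hard-coded in indexDictionary

    public void setMergeFactor(int mergeFactor) { this.mergeFactor = mergeFactor; }
    public void setMaxBufferedDocs(int maxBufferedDocs) { this.maxBufferedDocs = maxBufferedDocs; }
    public int getMergeFactor() { return mergeFactor; }
    public int getMaxBufferedDocs() { return maxBufferedDocs; }
}
```

A restricted environment (e.g. a low ulimit on open files) would then lower mergeFactor before indexing instead of patching the source.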
[jira] Commented: (LUCENE-898) contrib/javascript is not packaged into releases
[ https://issues.apache.org/jira/browse/LUCENE-898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12598924#action_12598924 ] Otis Gospodnetic commented on LUCENE-898: - I'll take care of this in a few days...it looks like nobody will miss it. contrib/javascript is not packaged into releases Key: LUCENE-898 URL: https://issues.apache.org/jira/browse/LUCENE-898 Project: Lucene - Java Issue Type: Bug Components: Build Reporter: Hoss Man Assignee: Otis Gospodnetic Priority: Trivial The contrib/javascript directory is (apparently) a collection of JavaScript utilities for Lucene, but it has no build files or any mechanism to package it, so it is excluded from releases. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Resolved: (LUCENE-852) spellchecker: make hard-coded values configurable
[ https://issues.apache.org/jira/browse/LUCENE-852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic resolved LUCENE-852. - Resolution: Fixed Lucene Fields: [New, Patch Available] (was: [Patch Available, New]) Thanks for the patch, Otis. Committed revision 659021. spellchecker: make hard-coded values configurable - Key: LUCENE-852 URL: https://issues.apache.org/jira/browse/LUCENE-852 Project: Lucene - Java Issue Type: Improvement Components: contrib/* Reporter: karin Assignee: Otis Gospodnetic Priority: Minor Attachments: LUCENE-852.patch, LUCENE-852.patch The class org.apache.lucene.search.spell.SpellChecker uses the following hard-coded values in its method indexDictionary: writer.setMergeFactor(300); writer.setMaxBufferedDocs(150); This poses problems when the spellcheck index is created on systems with certain limits, e.g. in Unix environments where the ulimit settings are restricted for the user (http://www.gossamer-threads.com/lists/lucene/java-dev/47428#47428). There are several ways to circumvent this: 1. add another indexDictionary method with additional parameters: public void indexDictionary(Dictionary dict, int mergeFactor, int maxBufferedDocs) throws IOException 2. add setter methods for mergeFactor and maxBufferedDocs (see code in http://www.gossamer-threads.com/lists/lucene/java-dev/47428#47428 ) 3. Make SpellChecker subclassing easier, as suggested by Chris Hostetter (see reply http://www.gossamer-threads.com/lists/lucene/java-dev/47463#47463) thanx, karin -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-1285) WeightedSpanTermExtractor incorrectly treats the same terms occurring in different query types
[ https://issues.apache.org/jira/browse/LUCENE-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12598419#action_12598419 ] Otis Gospodnetic commented on LUCENE-1285: -- Mark, are you done with this/would you like to commit this? Or should I? (Asking because of SOLR-553) WeightedSpanTermExtractor incorrectly treats the same terms occurring in different query types -- Key: LUCENE-1285 URL: https://issues.apache.org/jira/browse/LUCENE-1285 Project: Lucene - Java Issue Type: Bug Components: contrib/highlighter Affects Versions: 2.4 Reporter: Andrzej Bialecki Fix For: 2.4 Attachments: highlighter-test.patch, highlighter.patch Given a BooleanQuery with multiple clauses, if a term occurs both in a Span / Phrase query and in a TermQuery, the results of term extraction are unpredictable and depend on the order of clauses. Consequently, the results of highlighting are incorrect. Example text: t1 t2 t3 t4 t2 Example query: t2 t3 t1 t2 Current highlighting: [t1 t2] [t3] t4 t2 Correct highlighting: [t1 t2] [t3] t4 [t2] The problem comes from the fact that we keep a Map<termText, WeightedSpanTerm>, and if the same term occurs in a Phrase or Span query the resulting WeightedSpanTerm will have positionSensitive=true, whereas terms added from a TermQuery have positionSensitive=false. The end result for this particular term will depend on the order in which the clauses are processed. My fix is to use a subclass of Map whose put() always sets the result to the most lax setting, i.e. if we already have a term with positionSensitive=true and we try to put() a term with positionSensitive=false, we set the result to positionSensitive=false, as it will match both cases. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
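The Map-subclass fix described above can be sketched in plain Java. This is a hedged illustration of the idea, not the attached highlighter.patch: the tiny `Term` class stands in for Lucene's WeightedSpanTerm, and `put()` merges duplicate entries toward the laxer setting (positionSensitive=false) so the outcome no longer depends on clause order.

```java
import java.util.HashMap;

// Stand-in for Lucene's WeightedSpanTerm: just the text and the
// position-sensitivity flag relevant to the bug.
class Term {
    final String text;
    boolean positionSensitive;
    Term(String text, boolean positionSensitive) {
        this.text = text;
        this.positionSensitive = positionSensitive;
    }
}

// Map whose put() always resolves duplicates to the most lax setting:
// if either occurrence of a term was position-insensitive, the merged
// entry must be too, since it has to match both cases.
class PositionCheckingMap extends HashMap<String, Term> {
    @Override
    public Term put(String key, Term value) {
        Term prev = super.put(key, value);
        if (prev != null && !prev.positionSensitive) {
            value.positionSensitive = false; // laxer setting wins
        }
        return prev;
    }
}
```

With this map, inserting ("t2", positionSensitive=true) from the phrase clause and ("t2", positionSensitive=false) from the term clause yields positionSensitive=false regardless of which clause is processed first.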
[jira] Updated: (LUCENE-112) [PATCH] Add an IndexReader implementation that frees resources when idle and refreshes itself when stale
[ https://issues.apache.org/jira/browse/LUCENE-112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic updated LUCENE-112: Assignee: (was: Eric Isakson) [PATCH] Add an IndexReader implementation that frees resources when idle and refreshes itself when stale Key: LUCENE-112 URL: https://issues.apache.org/jira/browse/LUCENE-112 Project: Lucene - Java Issue Type: Improvement Components: Index Affects Versions: CVS Nightly - Specify date in submission Environment: Operating System: All Platform: All Reporter: Eric Isakson Priority: Minor Attachments: IdleTimeoutRefreshingIndexReader.html, IdleTimeoutRefreshingIndexReader.java Here is a little something I worked on this weekend that I wanted to contribute back as I think others might find it very useful. I extended IndexReader and added support for configuring an idle timeout and refresh interval. It uses a monitoring thread to watch for the reader going idle. When the reader goes idle, it is closed. When the index is read again, it is re-opened. It uses another thread to periodically check when the reader needs to be refreshed due to a change to the index. When the reader is stale, it closes the reader and reopens the index. It is actually delegating all the work to another IndexReader implementation and just handling the threading and synchronization. When it closes a reader, it delegates the close to another thread that waits a bit (configurable how long) before actually closing the reader it was delegating to. This gives any consumers of the original reader a chance to finish up their last action on the reader. This implementation sacrifices a little bit of speed, since there is a bit more synchronization to deal with and the delegation model puts extra calls on the stack, but it should spare long-running applications that have idle periods or frequently changing indices from having to open and close readers all the time or hold open unused resources.
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
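The idle-close behavior described above can be sketched deterministically. This is an illustration only, not the attached IdleTimeoutRefreshingIndexReader: the monitoring thread is replaced by an explicit `checkIdle(now)` call so the state transitions are easy to follow, and the delegated IndexReader is reduced to an open/closed flag.

```java
// Deterministic sketch of the idle-timeout idea: reads mark activity
// and (re)open the reader; a periodic check closes it once it has been
// idle for at least the configured timeout. The real contribution does
// this from a monitor thread and delegates to a full IndexReader.
class IdleClosingReaderSketch {
    private final long idleTimeoutMillis;
    private long lastAccessMillis;
    private boolean open = false;

    IdleClosingReaderSketch(long idleTimeoutMillis) {
        this.idleTimeoutMillis = idleTimeoutMillis;
    }

    // Every read re-opens the underlying reader if needed and records activity.
    void read(long nowMillis) {
        open = true;
        lastAccessMillis = nowMillis;
    }

    // Stand-in for the monitor thread's periodic check: close when idle.
    void checkIdle(long nowMillis) {
        if (open && nowMillis - lastAccessMillis >= idleTimeoutMillis) {
            open = false; // real code would close the delegate reader here
        }
    }

    boolean isOpen() { return open; }
}
```

The delayed-close refinement (waiting a configurable grace period before actually closing, so in-flight consumers can finish) would add a second timestamp and check, but the open/idle/close cycle above is the core of it.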
[jira] Assigned: (LUCENE-1285) WeightedSpanTermExtractor incorrectly treats the same terms occurring in different query types
[ https://issues.apache.org/jira/browse/LUCENE-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic reassigned LUCENE-1285: Assignee: Otis Gospodnetic WeightedSpanTermExtractor incorrectly treats the same terms occurring in different query types -- Key: LUCENE-1285 URL: https://issues.apache.org/jira/browse/LUCENE-1285 Project: Lucene - Java Issue Type: Bug Components: contrib/highlighter Affects Versions: 2.4 Reporter: Andrzej Bialecki Assignee: Otis Gospodnetic Fix For: 2.4 Attachments: highlighter-test.patch, highlighter.patch Given a BooleanQuery with multiple clauses, if a term occurs both in a Span / Phrase query and in a TermQuery, the results of term extraction are unpredictable and depend on the order of clauses. Consequently, the results of highlighting are incorrect. Example text: t1 t2 t3 t4 t2 Example query: t2 t3 t1 t2 Current highlighting: [t1 t2] [t3] t4 t2 Correct highlighting: [t1 t2] [t3] t4 [t2] The problem comes from the fact that we keep a Map<termText, WeightedSpanTerm>, and if the same term occurs in a Phrase or Span query the resulting WeightedSpanTerm will have positionSensitive=true, whereas terms added from a TermQuery have positionSensitive=false. The end result for this particular term will depend on the order in which the clauses are processed. My fix is to use a subclass of Map whose put() always sets the result to the most lax setting, i.e. if we already have a term with positionSensitive=true and we try to put() a term with positionSensitive=false, we set the result to positionSensitive=false, as it will match both cases. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-1284) Set of Java classes that allow the Lucene search engine to use morphological information developed for the Apertium open-source machine translation platform (http://www
[ https://issues.apache.org/jira/browse/LUCENE-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12597979#action_12597979 ] Otis Gospodnetic commented on LUCENE-1284: -- Thanks, I'll have a look later this week. Note that if you always use the same file name for attachments, JIRA will manage them for you and you won't have to delete old ones. Use a name such as LUCENE-1284.patch or LUCENE-1284.tgz or some such. Set of Java classes that allow the Lucene search engine to use morphological information developed for the Apertium open-source machine translation platform (http://www.apertium.org) -- Key: LUCENE-1284 URL: https://issues.apache.org/jira/browse/LUCENE-1284 Project: Lucene - Java Issue Type: New Feature Environment: New feature developed under GNU/Linux, but it should work on any other Java-compliant platform Reporter: Felipe Sánchez Martínez Assignee: Otis Gospodnetic Attachments: apertium-morph.2008-05-19.tgz Set of Java classes that allow the Lucene search engine to use morphological information developed for the Apertium open-source machine translation platform (http://www.apertium.org). Morphological information is used to index new documents and to process smarter queries in which morphological attributes can be used to specify query terms. The tool makes use of morphological analyzers and dictionaries developed for the open-source machine translation platform Apertium (http://apertium.org) and, optionally, the part-of-speech taggers developed for it. Currently there are morphological dictionaries available for Spanish, Catalan, Galician, Portuguese, Aranese, Romanian, French and English. In addition, new dictionaries are being developed for Esperanto, Occitan, Basque, Swedish, Danish, Welsh, Polish and Italian, among others; we hope more language pairs will be added to the Apertium machine translation platform in the near future. -- This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-1290) Deprecate Hits
[ https://issues.apache.org/jira/browse/LUCENE-1290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12598170#action_12598170 ] Otis Gospodnetic commented on LUCENE-1290: -- I'm actually feeling -1-ish about this. I don't think Hits are hurting those who are truly concerned about performance. Those who want performance have other API options. But Hits is so nice and simple, and that must be valuable to a large portion of Lucene users (think CD searches, site searches, desktop search apps, etc., not massive distributed searches and such). Why can't we let Hits live? If we are concerned about its performance, we can easily javadoc and Wiki that. Deprecate Hits -- Key: LUCENE-1290 URL: https://issues.apache.org/jira/browse/LUCENE-1290 Project: Lucene - Java Issue Type: Task Components: Search Reporter: Michael Busch Assignee: Michael Busch Priority: Minor Fix For: 2.4 Attachments: lucene-1290.patch The Hits class has several drawbacks as pointed out in LUCENE-954. The other search APIs that use TopDocCollector and TopDocs should be used instead. This patch: - deprecates org/apache/lucene/search/Hits, Hit, and HitIterator, as well as the Searcher.search( * ) methods which return a Hits object. - removes all references to Hits from the core and uses TopDocs and ScoreDoc instead - changes the demo SearchFiles: adds the two modes 'paging search' and 'streaming search', each of which demonstrates a different way of using the search APIs. The former uses TopDocs and a TopDocCollector, the latter a custom HitCollector implementation. - updates the online tutorial that describes the demo. All tests pass. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
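The two demo modes mentioned in the patch description can be contrasted with a small stand-in sketch. This uses hypothetical minimal types, not Lucene's real Searcher/HitCollector API: "streaming" pushes every hit to a callback as it is collected, never buffering the full result set, while "paging" gathers just the first page into a list.

```java
import java.util.ArrayList;
import java.util.List;

// Callback in the spirit of a custom HitCollector: one call per hit.
interface HitCallback {
    void collect(int doc, float score);
}

class SearchModesSketch {
    // Pretend result stream; in Lucene these would come from the index.
    static final float[] SCORES = { 0.3f, 0.9f, 0.1f };

    // Streaming mode: every hit goes straight to the callback, so memory
    // use is constant no matter how many hits there are.
    static void streamingSearch(HitCallback callback) {
        for (int doc = 0; doc < SCORES.length; doc++) {
            callback.collect(doc, SCORES[doc]);
        }
    }

    // Paging mode: buffer only the first pageSize hits, analogous to
    // asking a TopDocCollector for the top n and rendering one page.
    static List<Integer> pagingSearch(int pageSize) {
        List<Integer> page = new ArrayList<>();
        streamingSearch((doc, score) -> {
            if (page.size() < pageSize) {
                page.add(doc);
            }
        });
        return page;
    }
}
```

The point of the distinction is the one made in the comment thread: simple apps can keep a convenient buffered view, while performance-sensitive callers stream hits without materializing them all.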