Re: [jira] Commented: (LUCENE-1173) index corruption autoCommit=false

2008-02-12 Thread Michael McCandless


Michael Busch wrote:


OK, I suggest that we should wait a couple of days before we cut 2.3.1
in case there are more problems. We should backport the patches and
commit them to the 2.3 branch. I'll then create a 2.3.1 tag at the end
of this week, build release artifacts, and call a vote. Sounds good?


Sounds good.  Thanks Michael!

Mike

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1168) TermVectors index files can become corrupt when autoCommit=false

2008-02-12 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12568048#action_12568048
 ] 

Michael McCandless commented on LUCENE-1168:


Backported to 2.3 branch.

> TermVectors index files can become corrupt when autoCommit=false
> 
>
> Key: LUCENE-1168
> URL: https://issues.apache.org/jira/browse/LUCENE-1168
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
>Affects Versions: 2.3
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: 2.4
>
> Attachments: LUCENE-1168.patch
>
>
> Spinoff from this thread:
>   http://www.gossamer-threads.com/lists/lucene/java-dev/55951
> There are actually 2 separate cases here, both only happening when
> autoCommit=false:
>   * First issue was caused by LUCENE-843 (sigh): if you add a bunch of
> docs with no term vectors, such that 1 or more flushes happen;
> then you add docs that do have term vectors, the tvx file will not
> have enough entries (= corruption).
>   * Second issue was caused by bulk merging of term vectors
> (LUCENE-1120 -- only in trunk) and bulk merging of stored fields
> (LUCENE-1043, in 2.3), and only shows when autoCommit=false, and,
> the bulk merging optimization runs.  In this case, the code that
> reads the rawDocs tries to read too far in the tvx/fdx files (it's
> not really index corruption but rather a bug in the rawDocs
> reading).
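
For the first case, a rough repro sketch (assuming the 2.3-era API with the
autoCommit constructor flag; this is not the actual test from the patch):

import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.RAMDirectory;

public class TermVectorCorruptionRepro {
  public static void main(String[] args) throws Exception {
    RAMDirectory dir = new RAMDirectory();
    // autoCommit=false is the mode where the corruption shows up
    IndexWriter writer = new IndexWriter(dir, false, new WhitespaceAnalyzer());
    writer.setMaxBufferedDocs(2); // force one or more intermediate flushes

    // 1) a batch of docs with no term vectors, enough to trigger >= 1 flush
    for (int i = 0; i < 10; i++) {
      Document doc = new Document();
      doc.add(new Field("content", "no vectors here " + i,
                        Field.Store.NO, Field.Index.TOKENIZED));
      writer.addDocument(doc);
    }

    // 2) now a doc that does have term vectors; the tvx file ends up short
    Document doc = new Document();
    doc.add(new Field("content", "with vectors",
                      Field.Store.NO, Field.Index.TOKENIZED,
                      Field.TermVector.YES));
    writer.addDocument(doc);

    writer.close(); // closing commits; reading the term vectors then fails
  }
}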

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Index with payloads needed

2008-02-12 Thread Andrzej Bialecki

Hi all,

I'm testing the payloads support in Luke, and I need a small index with 
payloads - if you happen to have one, please contact me off the list. 
Thank you!


--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Index with payloads needed

2008-02-12 Thread Andrzej Bialecki

Grant Ingersoll wrote:
The contrib/analyzer module has several TokenFilters that create 
Payloads using the offset or type information from a Token.  See 
o.a.l.analysis.payloads.


That should be sufficient for your testing in that it adds payloads to 
tokens based on readily available Token information.


Great, thanks - that's exactly what I needed.


--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-997) Add search timeout support to Lucene

2008-02-12 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12568292#action_12568292
 ] 

Yonik Seeley commented on LUCENE-997:
-

> My preference would be for core o.a.l.search.
+1

> Add search timeout support to Lucene
> 
>
> Key: LUCENE-997
> URL: https://issues.apache.org/jira/browse/LUCENE-997
> Project: Lucene - Java
>  Issue Type: New Feature
>Reporter: Sean Timm
>Priority: Minor
> Attachments: HitCollectorTimeoutDecorator.java, 
> LuceneTimeoutTest.java, LuceneTimeoutTest.java, MyHitCollector.java, 
> timeout.patch, timeout.patch, timeout.patch, timeout.patch, timeout.patch, 
> timeout.patch, timeout.patch, timeout.patch, TimerThreadTest.java
>
>
> This patch is based on Nutch-308. 
> This patch adds support for a maximum search time limit. After this time is 
> exceeded, the search thread is stopped, partial results (if any) are returned 
> and the total number of results is estimated.
> This patch tries to minimize the overhead related to time-keeping by using a 
> version of safe unsynchronized timer.
> This was also discussed in an e-mail thread.
> http://www.nabble.com/search-timeout-tf3410206.html#a9501029
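
For illustration only (this is not the attached patch, which uses a timer
thread; the class and names here are made up), the decorator idea amounts to
wrapping an existing HitCollector and bailing out once a wall-clock budget is
spent:

import org.apache.lucene.search.HitCollector;

/** Illustrative only: stops collecting once the time budget is exhausted. */
public class TimeoutHitCollector extends HitCollector {
  private final HitCollector wrapped;
  private final long deadline; // absolute time in ms

  public TimeoutHitCollector(HitCollector wrapped, long budgetMillis) {
    this.wrapped = wrapped;
    this.deadline = System.currentTimeMillis() + budgetMillis;
  }

  public void collect(int doc, float score) {
    if (System.currentTimeMillis() > deadline) {
      // signal the caller; partial results gathered so far remain in 'wrapped'
      throw new RuntimeException("search time limit exceeded");
    }
    wrapped.collect(doc, score);
  }
}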

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-1176) TermVectors corruption case when autoCommit=false

2008-02-12 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-1176:
---

Attachment: LUCENE-1176.take2.patch

Attached patch fixes the corruption case.  It happens if you first add docs 
with no term-vector enabled fields, and then later add at least one doc with 
term vectors.

All tests pass.  I will commit shortly & backport to 2.3.

> TermVectors corruption case when autoCommit=false
> -
>
> Key: LUCENE-1176
> URL: https://issues.apache.org/jira/browse/LUCENE-1176
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
>Affects Versions: 2.4
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: 2.4
>
> Attachments: LUCENE-1176.patch, LUCENE-1176.take2.patch
>
>
> I took Yonik's awesome test case (TestStressIndexing2) and extended it to 
> also compare term vectors, and, it's failing.
> I still need to track down why, but it seems likely a separate issue.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Usefulness of Similarity.queryNorm()

2008-02-12 Thread Marvin Humphrey


On Feb 12, 2008, at 9:08 AM, Marvin Humphrey wrote:

What would the consequences be of eliminating  
Similarity.queryNorm()?  I cargo-culted that method when porting,  
but now I'm going through and trying to refactor for simplicity's  
sake.  If I can zap it, I'd like to.


I infer from the deafening silence that few people if any care about  
queryNorm, or have even contemplated what it's there for.


Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[ANN] Luke 0.8.1 released

2008-02-12 Thread Andrzej Bialecki

Hi all,

I decided to make a quick update to the previous release and to address 
some issues related to the way you can work with TermVectors and Payloads.


As usual, you can get the binaries and sources here:

http://www.getopt.org/luke


New features and improvements:
--
* When editing document fields it's now possible to specify TermVectors
  with offsets and/or positions.

* Added ability to show term vector positions and offsets, if available.
  It's also possible to copy this list to the clipboard.

* Added ability to show term positions within a document, and display
  term payloads if available, using one of several pre-defined payload
  decoders. It's also possible to copy this list to the clipboard.

* It's now possible to view the full content of a stored field using
  various content decoders (hex, date / time, number, utf8, arrays of
  int or float).

* Layout of "Browse by Term" panel is changed so that it better reflects
  the available navigation.

Bug fixes:
--
* Check added to prevent adding new documents when no index is open.

* Wrong class was used in IndexGate to represent deletable files, which
  caused a ClassCastException.

* Some query types may have been skipped when displaying Explanation.


Have fun!

--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Usefulness of Similarity.queryNorm()

2008-02-12 Thread Grant Ingersoll

:-)

I don't know a lot about it, but my understanding has always been that  
comparing across queries is difficult at best, so that would argue for  
removing it, but I haven't done any research into it.  I think it has  
been in Lucene for a good long time, so it may be that the history of  
why it is in there is forgotten.  Also, do you have a sense of its
cost in terms of performance?


-Grant

On Feb 12, 2008, at 7:08 PM, Marvin Humphrey wrote:



On Feb 12, 2008, at 9:08 AM, Marvin Humphrey wrote:

What would the consequences be of eliminating  
Similarity.queryNorm()?  I cargo-culted that method when porting,  
but now I'm going through and trying to refactor for simplicity's  
sake.  If I can zap it, I'd like to.


I infer from the deafening silence that few people if any care about  
queryNorm, or have even contemplated what it's there for.


Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]





-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Usefulness of Similarity.queryNorm()

2008-02-12 Thread Marvin Humphrey


On Feb 12, 2008, at 5:04 PM, Grant Ingersoll wrote:

I don't know a lot about it, but my understanding has always been  
that comparing across queries is difficult at best, so that would  
argue for removing it, but I haven't done any research into it.  I  
think it has been in Lucene for a good long time, so it may be that  
the history of why it is in there is forgotten.


It's called once per Query during Query.weight(Searcher):

  /** Expert: Constructs and initializes a Weight for a top-level query. */
  public Weight weight(Searcher searcher) throws IOException {
    Query query = searcher.rewrite(this);
    Weight weight = query.createWeight(searcher);
    float sum = weight.sumOfSquaredWeights();
    float norm = getSimilarity(searcher).queryNorm(sum); // <--- HERE
    weight.normalize(norm);
    return weight;
  }

It looks like Lucene actually *does* propagate the normalized sum-of- 
squared-weights into all sub-queries.  That call to  
weight.normalize(norm) right before the end uses the value generated  
by queryNorm(); BooleanWeight.normalize() (for example) propagates the  
modified value:


public void normalize(float norm) {
  norm *= getBoost(); // incorporate boost
  for (int i = 0 ; i < weights.size(); i++) {
    Weight w = (Weight)weights.elementAt(i);
    // normalize all clauses, (even if prohibited in case of side affects)
    w.normalize(norm);
  }
}

It's the *same* coefficient for all sub-clauses, so it shouldn't
affect rankings, BUT... relative rankings *will* be affected if some
inner clauses have custom boost values.


It seems to me, conceptually, like code that claims to perform  
"normalization" shouldn't be able to affect rankings.  However,  
because of this side effect of incorporating boost at the  
normalization stage, it can.


I think.

This code is really hard to follow. :(


Also, do you have a sense of its cost in terms of performance?


Nil.

It's only called once per Query and all it does by default is damp the  
weighting coefficient:


  multiplier = 1 / sqrt(multiplier)

If I reckon right, zapping it means that e.g. complex BooleanWeight  
objects which return a high value for sumOfSquaredWeights() will  
produce scores which are high, maybe startlingly high to some users.   
My guess is that the default implementation was chosen to complement  
the sum-of-squared-weights algo.
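
For reference, the default implementation in DefaultSimilarity is (as far as
I can tell) exactly that damping:

  // DefaultSimilarity: dampens the weight by 1/sqrt(sumOfSquaredWeights)
  public float queryNorm(float sumOfSquaredWeights) {
    return (float)(1.0 / Math.sqrt(sumOfSquaredWeights));
  }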


I'm not sure I care whether the scoring range expands.  Normalizing  
scores manually is cake, if people want to do that.


Heck, I'd love to eliminate ALL the automatic normalization code... if  
only I could figure out what all the hidden side effects are.  :(


My goal is to de-voodoofy the Query-Weight-Scorer compilation phase so  
that it's easier to write Query subclasses, and I'm happy to sacrifice  
consistency of scoring range if it'll help simplify things.


Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Build failed in Hudson: Lucene-trunk #375

2008-02-12 Thread Apache Hudson Server
See http://hudson.zones.apache.org/hudson/job/Lucene-trunk/375/changes

Changes:

[mikemccand] LUCENE-1176: fix corruption case when adding docs with no term 
vectors followed by docs with term vectors

[doronc] LUCENE-997: Add search timeout (partial) support.

[mikemccand] LUCENE-1175: add missing synchronization

--
[...truncated 6250 lines...]
[junit] 
[junit] Testsuite: org.apache.lucene.search.TestBooleanPrefixQuery
[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.492 sec
[junit] 
[junit] Testsuite: org.apache.lucene.search.TestBooleanQuery
[junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 0.366 sec
[junit] 
[junit] Testsuite: org.apache.lucene.search.TestBooleanScorer
[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.502 sec
[junit] 
[junit] Testsuite: org.apache.lucene.search.TestCachingWrapperFilter
[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.45 sec
[junit] 
[junit] Testsuite: org.apache.lucene.search.TestComplexExplanations
[junit] Tests run: 22, Failures: 0, Errors: 0, Time elapsed: 1.47 sec
[junit] 
[junit] Testsuite: 
org.apache.lucene.search.TestComplexExplanationsOfNonMatches
[junit] Tests run: 22, Failures: 0, Errors: 0, Time elapsed: 0.737 sec
[junit] 
[junit] Testsuite: org.apache.lucene.search.TestConstantScoreRangeQuery
[junit] Tests run: 7, Failures: 0, Errors: 0, Time elapsed: 0.995 sec
[junit] 
[junit] Testsuite: org.apache.lucene.search.TestCustomSearcherSort
[junit] Tests run: 3, Failures: 0, Errors: 0, Time elapsed: 1.033 sec
[junit] 
[junit] Testsuite: org.apache.lucene.search.TestDateFilter
[junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 0.52 sec
[junit] 
[junit] Testsuite: org.apache.lucene.search.TestDateSort
[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.542 sec
[junit] 
[junit] Testsuite: org.apache.lucene.search.TestDisjunctionMaxQuery
[junit] Tests run: 10, Failures: 0, Errors: 0, Time elapsed: 0.825 sec
[junit] 
[junit] Testsuite: org.apache.lucene.search.TestDocBoost
[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.482 sec
[junit] 
[junit] Testsuite: org.apache.lucene.search.TestExplanations
[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.478 sec
[junit] 
[junit] Testsuite: org.apache.lucene.search.TestExtendedFieldCache
[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.786 sec
[junit] 
[junit] Testsuite: org.apache.lucene.search.TestFilteredQuery
[junit] Tests run: 3, Failures: 0, Errors: 0, Time elapsed: 0.662 sec
[junit] 
[junit] Testsuite: org.apache.lucene.search.TestFilteredSearch
[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.516 sec
[junit] 
[junit] Testsuite: org.apache.lucene.search.TestFuzzyQuery
[junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 0.526 sec
[junit] 
[junit] Testsuite: org.apache.lucene.search.TestMatchAllDocsQuery
[junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 0.519 sec
[junit] 
[junit] Testsuite: org.apache.lucene.search.TestMultiPhraseQuery
[junit] Tests run: 3, Failures: 0, Errors: 0, Time elapsed: 0.543 sec
[junit] 
[junit] Testsuite: org.apache.lucene.search.TestMultiSearcher
[junit] Tests run: 4, Failures: 0, Errors: 0, Time elapsed: 0.647 sec
[junit] 
[junit] Testsuite: org.apache.lucene.search.TestMultiSearcherRanking
[junit] Tests run: 7, Failures: 0, Errors: 0, Time elapsed: 0.765 sec
[junit] 
[junit] Testsuite: org.apache.lucene.search.TestMultiThreadTermVectors
[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 1.198 sec
[junit] 
[junit] Testsuite: org.apache.lucene.search.TestNot
[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.517 sec
[junit] 
[junit] Testsuite: org.apache.lucene.search.TestParallelMultiSearcher
[junit] Tests run: 4, Failures: 0, Errors: 0, Time elapsed: 0.658 sec
[junit] 
[junit] Testsuite: org.apache.lucene.search.TestPhrasePrefixQuery
[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.498 sec
[junit] 
[junit] Testsuite: org.apache.lucene.search.TestPhraseQuery
[junit] Tests run: 13, Failures: 0, Errors: 0, Time elapsed: 0.972 sec
[junit] 
[junit] Testsuite: org.apache.lucene.search.TestPositionIncrement
[junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 0.544 sec
[junit] 
[junit] Testsuite: org.apache.lucene.search.TestPrefixFilter
[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.502 sec
[junit] 
[junit] Testsuite: org.apache.lucene.search.TestPrefixQuery
[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.524 sec
[junit] 
[junit] Testsuite: org.apache.lucene.search.TestQueryTermVector
[junit] Tests ru

[jira] Commented: (LUCENE-1175) occasional MergeException while indexing

2008-02-12 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12568316#action_12568316
 ] 

Michael McCandless commented on LUCENE-1175:


{quote}
FYI, I wasn't able to reproduce on the 2.3 branch.
{quote}
OK that's good :) I can't either.

> occasional MergeException while indexing
> 
>
> Key: LUCENE-1175
> URL: https://issues.apache.org/jira/browse/LUCENE-1175
> Project: Lucene - Java
>  Issue Type: Bug
>Affects Versions: 2.3
>Reporter: Yonik Seeley
> Attachments: LUCENE-1175.patch
>
>
> TestStressIndexing2.testMultiConfig occasionally hits merge exceptions

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1175) occasional MergeException while indexing

2008-02-12 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12568302#action_12568302
 ] 

Yonik Seeley commented on LUCENE-1175:
--

FYI, I wasn't able to reproduce on the 2.3 branch.

> occasional MergeException while indexing
> 
>
> Key: LUCENE-1175
> URL: https://issues.apache.org/jira/browse/LUCENE-1175
> Project: Lucene - Java
>  Issue Type: Bug
>Affects Versions: 2.3
>Reporter: Yonik Seeley
> Attachments: LUCENE-1175.patch
>
>
> TestStressIndexing2.testMultiConfig occasionally hits merge exceptions

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Index with payloads needed

2008-02-12 Thread Grant Ingersoll
The contrib/analyzer module has several TokenFilters that create  
Payloads using the offset or type information from a Token.  See  
o.a.l.analysis.payloads.


That should be sufficient for your testing in that it adds payloads to  
tokens based on readily available Token information.
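
A minimal sketch of wiring one of them into an Analyzer (assuming the filter
wraps a TokenStream; double-check the exact class name in contrib):

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.payloads.TypeAsPayloadTokenFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.RAMDirectory;

// Analyzer that stores each token's type string as its payload
class PayloadAnalyzer extends Analyzer {
  public TokenStream tokenStream(String fieldName, Reader reader) {
    return new TypeAsPayloadTokenFilter(new StandardTokenizer(reader));
  }
}

public class TinyPayloadIndex {
  public static void main(String[] args) throws Exception {
    RAMDirectory dir = new RAMDirectory();
    IndexWriter writer = new IndexWriter(dir, new PayloadAnalyzer(), true);
    Document doc = new Document();
    doc.add(new Field("body", "Blair does not resign on 2008-02-12",
                      Field.Store.YES, Field.Index.TOKENIZED));
    writer.addDocument(doc);
    writer.close(); // 'dir' now holds a tiny index whose postings carry payloads
  }
}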


-Grant

On Feb 12, 2008, at 1:15 PM, Andrzej Bialecki wrote:


Hi all,

I'm testing the payloads support in Luke, and I need a small index  
with payloads - if you happen to have one, please contact me off the  
list. Thank you!


--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]





-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-997) Add search timeout support to Lucene

2008-02-12 Thread Timo Nentwig (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12568286#action_12568286
 ] 

Timo Nentwig commented on LUCENE-997:
-

I agree, core.

> Add search timeout support to Lucene
> 
>
> Key: LUCENE-997
> URL: https://issues.apache.org/jira/browse/LUCENE-997
> Project: Lucene - Java
>  Issue Type: New Feature
>Reporter: Sean Timm
>Priority: Minor
> Attachments: HitCollectorTimeoutDecorator.java, 
> LuceneTimeoutTest.java, LuceneTimeoutTest.java, MyHitCollector.java, 
> timeout.patch, timeout.patch, timeout.patch, timeout.patch, timeout.patch, 
> timeout.patch, timeout.patch, timeout.patch, TimerThreadTest.java
>
>
> This patch is based on Nutch-308. 
> This patch adds support for a maximum search time limit. After this time is 
> exceeded, the search thread is stopped, partial results (if any) are returned 
> and the total number of results is estimated.
> This patch tries to minimize the overhead related to time-keeping by using a 
> version of safe unsynchronized timer.
> This was also discussed in an e-mail thread.
> http://www.nabble.com/search-timeout-tf3410206.html#a9501029

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-997) Add search timeout support to Lucene

2008-02-12 Thread Sean Timm (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12568280#action_12568280
 ] 

Sean Timm commented on LUCENE-997:
--

"If there are no more major concerns I think this is now ready to go in, 
question is where to - under core o.a.l.search or under contrib (query or 
misc)."

My preference would be for core o.a.l.search.

> Add search timeout support to Lucene
> 
>
> Key: LUCENE-997
> URL: https://issues.apache.org/jira/browse/LUCENE-997
> Project: Lucene - Java
>  Issue Type: New Feature
>Reporter: Sean Timm
>Priority: Minor
> Attachments: HitCollectorTimeoutDecorator.java, 
> LuceneTimeoutTest.java, LuceneTimeoutTest.java, MyHitCollector.java, 
> timeout.patch, timeout.patch, timeout.patch, timeout.patch, timeout.patch, 
> timeout.patch, timeout.patch, timeout.patch, TimerThreadTest.java
>
>
> This patch is based on Nutch-308. 
> This patch adds support for a maximum search time limit. After this time is 
> exceeded, the search thread is stopped, partial results (if any) are returned 
> and the total number of results is estimated.
> This patch tries to minimize the overhead related to time-keeping by using a 
> version of safe unsynchronized timer.
> This was also discussed in an e-mail thread.
> http://www.nabble.com/search-timeout-tf3410206.html#a9501029

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Resolved: (LUCENE-997) Add search timeout support to Lucene

2008-02-12 Thread Doron Cohen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doron Cohen resolved LUCENE-997.


   Resolution: Fixed
Lucene Fields: [Patch Available]  (was: [New, Patch Available])

Committed (under core o.a.l.search).
Thanks Sean!

> Add search timeout support to Lucene
> 
>
> Key: LUCENE-997
> URL: https://issues.apache.org/jira/browse/LUCENE-997
> Project: Lucene - Java
>  Issue Type: New Feature
>Reporter: Sean Timm
>Assignee: Doron Cohen
>Priority: Minor
> Attachments: HitCollectorTimeoutDecorator.java, 
> LuceneTimeoutTest.java, LuceneTimeoutTest.java, MyHitCollector.java, 
> timeout.patch, timeout.patch, timeout.patch, timeout.patch, timeout.patch, 
> timeout.patch, timeout.patch, timeout.patch, TimerThreadTest.java
>
>
> This patch is based on Nutch-308. 
> This patch adds support for a maximum search time limit. After this time is 
> exceeded, the search thread is stopped, partial results (if any) are returned 
> and the total number of results is estimated.
> This patch tries to minimize the overhead related to time-keeping by using a 
> version of safe unsynchronized timer.
> This was also discussed in an e-mail thread.
> http://www.nabble.com/search-timeout-tf3410206.html#a9501029

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Assigned: (LUCENE-997) Add search timeout support to Lucene

2008-02-12 Thread Doron Cohen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doron Cohen reassigned LUCENE-997:
--

Assignee: Doron Cohen

> Add search timeout support to Lucene
> 
>
> Key: LUCENE-997
> URL: https://issues.apache.org/jira/browse/LUCENE-997
> Project: Lucene - Java
>  Issue Type: New Feature
>Reporter: Sean Timm
>Assignee: Doron Cohen
>Priority: Minor
> Attachments: HitCollectorTimeoutDecorator.java, 
> LuceneTimeoutTest.java, LuceneTimeoutTest.java, MyHitCollector.java, 
> timeout.patch, timeout.patch, timeout.patch, timeout.patch, timeout.patch, 
> timeout.patch, timeout.patch, timeout.patch, TimerThreadTest.java
>
>
> This patch is based on Nutch-308. 
> This patch adds support for a maximum search time limit. After this time is 
> exceeded, the search thread is stopped, partial results (if any) are returned 
> and the total number of results is estimated.
> This patch tries to minimize the overhead related to time-keeping by using a 
> version of safe unsynchronized timer.
> This was also discussed in an e-mail thread.
> http://www.nabble.com/search-timeout-tf3410206.html#a9501029

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Usefulness of Similarity.queryNorm()

2008-02-12 Thread Marvin Humphrey

Greets,

What would the consequences be of eliminating Similarity.queryNorm()?   
I cargo-culted that method when porting, but now I'm going through and  
trying to refactor for simplicity's sake.  If I can zap it, I'd like to.


First, the theoretical angle:

According to the Similarity docs, queryNorm() doesn't impact document  
ranking, since it is applied as a multiplier to the scores of all  
matching docs.  I don't see how it's all that useful, then.


How important is it that discrete queries, or queries against  
different indexes, produce scores within a "comparable" range?  It  
seems to me that if you need that, you can always perform  
normalization after the search completes by setting the top score to  
1.0 and increasing/decreasing other scores proportionately.  Are there  
any cases where that solution wouldn't be adequate?
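
Something along these lines against a TopDocs, for example (untested sketch):

import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;

public class ScoreNormalizer {
  /** Rescale scores so the best hit is 1.0; relative order is unchanged. */
  public static void normalize(TopDocs topDocs) {
    ScoreDoc[] hits = topDocs.scoreDocs;
    if (hits.length == 0) return;
    float top = hits[0].score;   // results come back sorted by score
    if (top <= 0.0f) return;
    for (int i = 0; i < hits.length; i++) {
      hits[i].score /= top;
    }
  }
}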


Second, the implementation angle:

Is it really true that document ranking is unaffected by queryNorm()?   
It seems to me that when multi-level boolean queries are normalized,  
clauses having different IDFs would end up with different  
multipliers.  Maybe I'm wrong -- it's always hard to wrap your head  
around recursion -- but is the assertion that ranking is unaffected a  
documentation glitch?


Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Lingustically-enhanced indexing for Lucene

2008-02-12 Thread Grant Ingersoll


On Feb 12, 2008, at 9:47 AM, [EMAIL PROTECTED] wrote:




The best way to do this is to create a patch and attach it to a JIRA
issue.  http://wiki.apache.org/lucene-java/HowToContribute has the
details.


Ok, I will read it. Thanks


Sounds like an interesting project.  What are the licensing terms for
Apertium?  On a side note, you might be interested in Mahout

(http://lucene.apache.org/mahout)

Apertium is licensed under the GNU GPL license version 2.


OK, this means that the Jars cannot be included in contrib.  The
way to handle this is to have the build script download them for the
user.  See the contrib/db module for how it handles the Berkeley
database.


-Grant

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: [jira] Commented: (LUCENE-1173) index corruption autoCommit=false

2008-02-12 Thread Michael Busch
Grant Ingersoll wrote:
> 
> 
> I'd suggest at least a week, as it sounds like we need to put this
> through the wringer a bit more.
> 

I agree! Shall we add a news item to the website where we list these
known issues and announce that there will be a 2.3.1 release in approx.
1-2 weeks?

-Michael

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Lingustically-enhanced indexing for Lucene

2008-02-12 Thread fsanchez

> The best way to do this is to create a patch and attach it to a JIRA  
> issue.  http://wiki.apache.org/lucene-java/HowToContribute has the  
> details.

Ok, I will read it. Thanks
> 
> Sounds like an interesting project.  What are the licensing terms for  
> Apertium?  On a side note, you might be interested in Mahout
(http://lucene.apache.org/mahout)

Apertium is licensed under the GNU GPL license version 2.

regards
--
Felipe

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Usefulness of Similarity.queryNorm()

2008-02-12 Thread Paul Elschot
On Wednesday 13 February 2008 04:48:31, Marvin Humphrey wrote:
...
> 
> 
> Heck, I'd love to eliminate ALL the automatic normalization code... if  
> only I could figure out what all the hidden side effects are.  :(
> 
> My goal is to de-voodoofy the Query-Weight-Scorer compilation phase so  
> that it's easier to write Query subclasses, and I'm happy to sacrifice  
> consistency of scoring range if it'll help simplify things.

For consistency of scoring ranges on the leaf side of the scorer tree
LUCENE-293 might be of interest.

Regards,
Paul Elschot

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: [jira] Commented: (LUCENE-1173) index corruption autoCommit=false

2008-02-12 Thread Grant Ingersoll


On Feb 11, 2008, at 6:49 PM, Michael Busch wrote:


Yonik Seeley (JIRA) wrote:
   [ https://issues.apache.org/jira/browse/LUCENE-1173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12567878#action_12567878 ]


Yonik Seeley commented on LUCENE-1173:
--

Hold up a bit... my random testing may have hit another bug
testMultiConfig hit an error at some point when I cranked up the  
iterations... I'm trying to reproduce.




OK, I suggest that we should wait a couple of days before we cut 2.3.1
in case there are more problems. We should backport the patches and
commit them to the 2.3 branch. I'll then create a 2.3.1 tag at the end
of this week, build release artifacts, and call a vote. Sounds good?


I'd suggest at least a week, as it sounds like we need to put this  
through the wringer a bit more.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1175) occasional MergeException while indexing

2008-02-12 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12568145#action_12568145
 ] 

Yonik Seeley commented on LUCENE-1175:
--

I'll try on Lucene 2.3 soon.
I had assumed that the second exception had the same root cause as the first.

I also switched to FSDirectory and let it run overnight... no exceptions with 
that.

> occasional MergeException while indexing
> 
>
> Key: LUCENE-1175
> URL: https://issues.apache.org/jira/browse/LUCENE-1175
> Project: Lucene - Java
>  Issue Type: Bug
>Affects Versions: 2.3
>Reporter: Yonik Seeley
> Attachments: LUCENE-1175.patch
>
>
> TestStressIndexing2.testMultiConfig occasionally hits merge exceptions

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-1177) IW.optimize() can do too many merges at the very end

2008-02-12 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-1177:
---

Attachment: LUCENE-1177.patch

Attached patch.  Will commit shortly to 2.3.

> IW.optimize() can do too many merges at the very end
> 
>
> Key: LUCENE-1177
> URL: https://issues.apache.org/jira/browse/LUCENE-1177
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
>Affects Versions: 2.3
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.4
>
> Attachments: LUCENE-1177.patch
>
>
> This was fixed on trunk in LUCENE-1044 but I'd like to separately
> backport it to 2.3.
> With ConcurrentMergeScheduler there is a bug, only when CFS is on,
> whereby after the final merge of an optimize has finished and while
> it's building its CFS, the merge policy may incorrectly ask for
> another merge to collapse that segment into a compound file.  The net
> effect is that optimize can spend many extra iterations unnecessarily
> merging a single segment to collapse it to a compound file.
> I believe the case is rare (hard to hit), and maybe only if you have
> multiple threads calling optimize at once (the TestThreadedOptimize
> test can hit it), but it's a low-risk fix so I plan to commit to 2.3
> shortly.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Created: (LUCENE-1177) IW.optimize() can do too many merges at the very end

2008-02-12 Thread Michael McCandless (JIRA)
IW.optimize() can do too many merges at the very end


 Key: LUCENE-1177
 URL: https://issues.apache.org/jira/browse/LUCENE-1177
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Affects Versions: 2.3
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.4


This was fixed on trunk in LUCENE-1044 but I'd like to separately
backport it to 2.3.

With ConcurrentMergeScheduler there is a bug, only when CFS is on,
whereby after the final merge of an optimize has finished and while
it's building its CFS, the merge policy may incorrectly ask for
another merge to collapse that segment into a compound file.  The net
effect is that optimize can spend many extra iterations unnecessarily
merging a single segment to collapse it to a compound file.

I believe the case is rare (hard to hit), and maybe only if you have
multiple threads calling optimize at once (the TestThreadedOptimize
test can hit it), but it's a low-risk fix so I plan to commit to 2.3
shortly.
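
The kind of usage that can tickle it looks roughly like this (a sketch, not
the actual TestThreadedOptimize code):

import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.index.ConcurrentMergeScheduler;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.RAMDirectory;

public class ThreadedOptimizeSketch {
  public static void main(String[] args) throws Exception {
    RAMDirectory dir = new RAMDirectory();
    final IndexWriter writer = new IndexWriter(dir, new WhitespaceAnalyzer(), true);
    writer.setMergeScheduler(new ConcurrentMergeScheduler());
    writer.setUseCompoundFile(true); // the bug only shows with CFS on

    // ... add a bunch of documents here ...

    Thread[] threads = new Thread[4];
    for (int i = 0; i < threads.length; i++) {
      threads[i] = new Thread() {
        public void run() {
          try {
            writer.optimize(); // concurrent optimize calls are what hit the extra merges
          } catch (Exception e) {
            e.printStackTrace();
          }
        }
      };
      threads[i].start();
    }
    for (int i = 0; i < threads.length; i++) threads[i].join();
    writer.close();
  }
}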



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-1175) occasional MergeException while indexing

2008-02-12 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-1175:
---

Attachment: LUCENE-1175.patch

Yonik, are you able to repro this on 2.3?  I can't.

And on trunk, I can only repro your first exception.  The issue with that one 
is missing synchronization from the changes from LUCENE-1044.  I'm attaching 
the patch that fixes the first exception in my testing.

Any ideas on how to get that 2nd exception to happen would be most welcome!

> occasional MergeException while indexing
> 
>
> Key: LUCENE-1175
> URL: https://issues.apache.org/jira/browse/LUCENE-1175
> Project: Lucene - Java
>  Issue Type: Bug
>Affects Versions: 2.3
>Reporter: Yonik Seeley
> Attachments: LUCENE-1175.patch
>
>
> TestStressIndexing2.testMultiConfig occasionally hits merge exceptions

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-1176) TermVectors corruption case when autoCommit=false

2008-02-12 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-1176:
---

Attachment: LUCENE-1176.patch

Attached patch that extends TestStressIndexing2 to also check term vectors.

> TermVectors corruption case when autoCommit=false
> -
>
> Key: LUCENE-1176
> URL: https://issues.apache.org/jira/browse/LUCENE-1176
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
>Affects Versions: 2.4
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: 2.4
>
> Attachments: LUCENE-1176.patch
>
>
> I took Yonik's awesome test case (TestStressIndexing2) and extended it to 
> also compare term vectors, and, it's failing.
> I still need to track down why, but it seems likely a separate issue.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Created: (LUCENE-1176) TermVectors corruption case when autoCommit=false

2008-02-12 Thread Michael McCandless (JIRA)
TermVectors corruption case when autoCommit=false
-

 Key: LUCENE-1176
 URL: https://issues.apache.org/jira/browse/LUCENE-1176
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Affects Versions: 2.4
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 2.4


I took Yonik's awesome test case (TestStressIndexing2) and extended it to also 
compare term vectors, and, it's failing.

I still need to track down why, but it seems likely a separate issue.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Lingustically-enhanced indexing for Lucene

2008-02-12 Thread Grant Ingersoll
The best way to do this is to create a patch and attach it to a JIRA  
issue.  http://wiki.apache.org/lucene-java/HowToContribute has the  
details.


Sounds like an interesting project.  What are the licensing terms for  
Apertium?  On a side note, you might be interested in Mahout (http://lucene.apache.org/mahout 
)


Cheers,
Grant

On Feb 12, 2008, at 6:25 AM, [EMAIL PROTECTED] wrote:


The Transducens Group (http://transducens.dlsi.ua.es) at University
of Alicante (http://www.ua.es) has developed a tool that
allows the Lucene search engine to use morphological information
while indexing and then process smarter queries in which
morphological attributes can be used to specify query terms.

To that end, the tool makes use of morphological analyzers and
dictionaries developed for the open-source machine translation  
platform

Apertium (http://apertium.org) and, optionally, the part-of-speech
taggers developed for it. Currently there are morphological
dictionaries available for Spanish, Catalan, Galician, Portuguese,
Aranese, Romanian, French and English. In addition new dictionaries
are being developed for Esperanto, Occitan, Basque, Swedish, Danish,
Welsh, Polish and Italian, among others; we hope more language pairs
to be added to the Apertium machine translation platform in the
near future.

We are interested in releasing this tool as open source and we think
that the best way to do that would be to integrate it into Lucene's
contrib folder, like other third-party tools. Who is responsible
for that? To whom should we address this petition?

Thank you very much.

== How it works ==

Indexing documents through this new framework involves the following
steps:

1. The texts to index must be analyzed using the morphological  
analyzer

and (optionally) the part-of-speech taggers of the Apertium machine
translation platform. Apertium supports files in plain text, rtf, odt,
sxw, html and doc.

2. Indexing the documents, as usual, by using a Lucene's analyzer
developed ad-hoc so as to properly interpret the documents previously
analyzed.

During indexing, the following morphological information is obtained  
for

each word: superficial form (the word as it appears in a non-analyzed
text), its lemma and relevant morphological information such as
part-of-speech and verb tense (if appropriate). The following example
illustrates which information is stored in the index for the following
English phrase "Blair does not resign":

* "Blair"
  - Superficial form: blair
  - Lemma: blair
  - Morphological information: np.ant (noun of a person)

* "does"
  - Superficial form: does
  - Lemma: do
  - Morphological information: vbdo.pri (auxiliary verb, present tense)

* "not"
  - Superficial form: no
  - Lemma: no
  - Morphological information: adv (adverb)

* "resign"
  - Superficial form: resign
  - Lemma: resign
  - Morphological information: vblex.inf (verb, infinitive tense)

To search, the language accepted by the query parser can be applied,
provided that a WhitespaceAnalyzer is used. In the query one can  
specify

information of different nature, to that end the following prefixes
are used:
- "sf:" for the superficial form (eg "sf:resign")
- "lem:" for the lema (eg "lem:resign")
- "tags:" for the morphological information (eg "tags:vblex.inf")

The following example illustrates the type of queries that can be used
to search for a specific document:

- Query: "tags:np.loc lem:airline sf:with lem:destination tags:np.loc"

This query searches for documents in which there is an airline or more
flying from anywhere to elsewhere, for example "Argentine airlines  
with

destination Madrid" or "British airlines with destination New York"


--
Felipe Sánchez Martínez <[EMAIL PROTECTED]>
Departamento de Lenguajes y Sistemas Informáticos
Universidad de Alicante, E-03071 Alicante (Spain)
Tel.: +34 965 903 400, ext: 2038 Fax: +34 965 909 326
http://www.dlsi.ua.es/~fsanchez


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



--
Grant Ingersoll
http://lucene.grantingersoll.com
http://www.lucenebootcamp.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ





-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Creating a index scheduler with Java.

2008-02-12 Thread Grant Ingersoll

Hi,

While this question is best asked on the java-user mailing list, I  
would have a look at the OpenSymphony Quartz Java scheduler.  Just  
search for Quartz Java.
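
For example, with the Quartz 1.x API something along these lines (untested
sketch; IndexJob is a placeholder class you would write to open an
IndexWriter and do the indexing):

import org.quartz.CronTrigger;
import org.quartz.Job;
import org.quartz.JobDetail;
import org.quartz.JobExecutionContext;
import org.quartz.JobExecutionException;
import org.quartz.Scheduler;
import org.quartz.impl.StdSchedulerFactory;

// Your Lucene indexing work goes here (open IndexWriter, add documents, close).
class IndexJob implements Job {
  public void execute(JobExecutionContext context) throws JobExecutionException {
    System.out.println("running Lucene indexing job ...");
  }
}

public class IndexScheduler {
  public static void main(String[] args) throws Exception {
    Scheduler scheduler = new StdSchedulerFactory().getScheduler();

    JobDetail job = new JobDetail("nightly-index", "lucene", IndexJob.class);

    // run every night at 2:00 AM
    CronTrigger trigger = new CronTrigger("nightly-trigger", "lucene", "0 0 2 * * ?");

    scheduler.scheduleJob(job, trigger);
    scheduler.start();
  }
}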


-Grant
On Feb 12, 2008, at 5:47 AM, galford23 wrote:



Hi all,

I am trying to do some scheduling / cron job for lucene indexing.
I am very new to Lucene and Java.
Can I get some advices on how can I achieve it? Books or url link or
technology required . I have been searching the web for quite some  
time but
just cannot get the correct result.. maybe even some help on my key  
word

search in google is welcome as well.
I will be using netbean for development. my application Will be  
running on

centos5 and mysql database.


Attach below is an screen shot of the expected development result.
http://www.nabble.com/file/p15430825/Index%2BScreen%2Bshot.jpg



Please help.:confused:
--
View this message in context: 
http://www.nabble.com/Creating-a-index-scheduler-with-Java.-tp15430825p15430825.html
Sent from the Lucene - Java Developer mailing list archive at  
Nabble.com.



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]





-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1163) CharArraySet.contains(char[] text, int off, int len) does not work

2008-02-12 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12568056#action_12568056
 ] 

Michael McCandless commented on LUCENE-1163:


Backported to 2.3

> CharArraySet.contains(char[] text, int off, int len) does not work
> --
>
> Key: LUCENE-1163
> URL: https://issues.apache.org/jira/browse/LUCENE-1163
> Project: Lucene - Java
>  Issue Type: Bug
>Affects Versions: 2.3
>Reporter: Thomas Peuss
>Assignee: Michael McCandless
> Fix For: 2.4
>
> Attachments: CharArraySetShowBug.java, LUCENE-1163.patch
>
>
> I try to use the CharArraySet for a filter I am writing. I heavily use 
> char-arrays in my code to speed up things. I stumbled upon a bug in 
> CharArraySet while doing that.
> The method _public boolean contains(char[] text, int off, int len)_ seems not 
> to work.
> When I do 
> {code}
> if (set.contains(buffer, offset, length)) {
>   ...
> }
> {code}
> my code fails.
> But when I do
> {code}
> if (set.contains(new String(buffer, offset, length))) {
>...
> }
> {code}
> everything works as expected.
> Both variants should behave the same. I attach a small piece of code to show 
> the problem.
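
A self-contained version of that comparison might look like this (sketch;
the CharArraySet constructor signature is assumed):

import org.apache.lucene.analysis.CharArraySet;

public class CharArraySetContainsSketch {
  public static void main(String[] args) {
    CharArraySet set = new CharArraySet(10, false); // startSize, ignoreCase
    set.add("lucene");

    char[] buffer = "xxlucenexx".toCharArray();
    int offset = 2, length = 6; // the slice "lucene"

    // reported as returning false before the fix
    System.out.println(set.contains(buffer, offset, length));
    // works as expected
    System.out.println(set.contains(new String(buffer, offset, length)));
  }
}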

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1173) index corruption autoCommit=false

2008-02-12 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12568053#action_12568053
 ] 

Michael McCandless commented on LUCENE-1173:



Backported to 2.3.

{quote}
Patch looks good (heh... a one liner!)
{quote}
Yeah the worst ones always seem to be one-liner fixes.  Sigh.

{quote}
Hold up a bit... my random testing may have hit another bug
testMultiConfig hit an error at some point when I cranked up the iterations... 
I'm trying to reproduce. 
{quote}
I'll go dig on that one next.

> index corruption autoCommit=false
> -
>
> Key: LUCENE-1173
> URL: https://issues.apache.org/jira/browse/LUCENE-1173
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
>Affects Versions: 2.3
>Reporter: Yonik Seeley
>Assignee: Michael McCandless
>Priority: Critical
> Attachments: indexstress.patch, indexstress.patch, LUCENE-1173.patch
>
>
> In both Lucene 2.3 and trunk, the index becomes corrupted when 
> autoCommit=false

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-1166) A tokenfilter to decompose compound words

2008-02-12 Thread Thomas Peuss (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Peuss updated LUCENE-1166:
-

Attachment: CompoundTokenFilter.patch

Changes:
* added unittest
* minor tweaks for getting the encoding of the XML files right

> A tokenfilter to decompose compound words
> -
>
> Key: LUCENE-1166
> URL: https://issues.apache.org/jira/browse/LUCENE-1166
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Analysis
>Reporter: Thomas Peuss
> Attachments: CompoundTokenFilter.patch, CompoundTokenFilter.patch, 
> de.xml, hyphenation.dtd
>
>
> A tokenfilter to decompose compound words you find in many germanic languages 
> (like German, Swedish, ...) into single tokens.
> An example: Donaudampfschiff would be decomposed to Donau, dampf, schiff so 
> that you can find the word even when you only enter "Schiff".
> I use the hyphenation code from the Apache XML project FOP 
> (http://xmlgraphics.apache.org/fop/) to do the first step of decomposition. 
> Currently I use the FOP jars directly. I only use a handful of classes from 
> the FOP project.
> My question now:
> Would it be OK to copy these classes over to the Lucene project (renaming the 
> packages of course) or should I stick with the dependency on the FOP jars? 
> The FOP code uses the ASF V2 license as well.
> What do you think?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1163) CharArraySet.contains(char[] text, int off, int len) does not work

2008-02-12 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12568050#action_12568050
 ] 

Michael McCandless commented on LUCENE-1163:


I'll port this one to 2.3.1 as well.

> CharArraySet.contains(char[] text, int off, int len) does not work
> --
>
> Key: LUCENE-1163
> URL: https://issues.apache.org/jira/browse/LUCENE-1163
> Project: Lucene - Java
>  Issue Type: Bug
>Affects Versions: 2.3
>Reporter: Thomas Peuss
>Assignee: Michael McCandless
> Fix For: 2.4
>
> Attachments: CharArraySetShowBug.java, LUCENE-1163.patch
>
>
> I try to use the CharArraySet for a filter I am writing. I heavily use 
> char-arrays in my code to speed up things. I stumbled upon a bug in 
> CharArraySet while doing that.
> The method _public boolean contains(char[] text, int off, int len)_ seems not 
> to work.
> When I do 
> {code}
> if (set.contains(buffer, offset, length)) {
>   ...
> }
> {code}
> my code fails.
> But when I do
> {code}
> if (set.contains(new String(buffer, offset, length))) {
>...
> }
> {code}
> everything works as expected.
> Both variants should behave the same. I attach a small piece of code to show 
> the problem.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Lingustically-enhanced indexing for Lucene

2008-02-12 Thread fsanchez
The Transducens Group (http://transducens.dlsi.ua.es) at University 
of Alicante (http://www.ua.es) has developed a tool that
allows the Lucene search engine to use morphological information
while indexing and then process smarter queries in which
morphological attributes can be used to specify query terms.

To that end, the tool makes use of morphological analyzers and
dictionaries developed for the open-source machine translation platform
Apertium (http://apertium.org) and, optionally, the part-of-speech
taggers developed for it. Currently there are morphological 
dictionaries available for Spanish, Catalan, Galician, Portuguese, 
Aranese, Romanian, French and English. In addition new dictionaries 
are being developed for Esperanto, Occitan, Basque, Swedish, Danish, 
Welsh, Polish and Italian, among others; we hope more language pairs
will be added to the Apertium machine translation platform in the
near future.

We are interested in releasing this tool as open source and we think
that the best way to do that would be to integrate it into Lucene's
contrib folder, like other third-party tools. Who is responsible
for that? To whom should we address this petition?

Thank you very much.

== How it works ==

Indexing documents through this new framework involves the following
steps: 

1. The texts to index must be analyzed using the morphological analyzer
and (optionally) the part-of-speech taggers of the Apertium machine
translation platform. Apertium supports files in plain text, rtf, odt,
sxw, html and doc.

2. Index the documents, as usual, using a Lucene analyzer developed
ad hoc to properly interpret the previously analyzed documents.

During indexing, the following morphological information is obtained for
each word: superficial form (the word as it appears in a non-analyzed
text), its lemma and relevant morphological information such as
part-of-speech and verb tense (if appropriate). The following example
illustrates which information is stored in the index for the following
English phrase "Blair does not resign":

* "Blair" 
   - Superficial form: blair
   - Lemma: blair
   - Morphological information: np.ant (noun of a person) 

* "does" 
   - Superficial form: does
   - Lemma: do
   - Morphological information: vbdo.pri (auxiliary verb, present tense) 

* "not" 
   - Superficial form: no 
   - Lemma: no 
   - Morphological information: adv (adverb) 

* "resign" 
   - Superficial form: resign 
   - Lemma: resign 
   - Morphological information: vblex.inf (verb, infinitive tense) 

To search, the language accepted by the query parser can be applied,
provided that a WhitespaceAnalyzer is used. In the query one can specify
information of different nature, to that end the following prefixes
are used:
- "sf:" for the superficial form (eg "sf:resign") 
- "lem:" for the lema (eg "lem:resign") 
- "tags:" for the morphological information (eg "tags:vblex.inf") 

The following example illustrates the type of queries that can be used
to search for a specific document: 

- Query: "tags:np.loc lem:airline sf:with lem:destination tags:np.loc" 

This query searches for documents in which one or more airlines fly
from anywhere to elsewhere, for example "Argentine airlines with
destination Madrid" or "British airlines with destination New York".


-- 
Felipe Sánchez Martínez <[EMAIL PROTECTED]>
Departamento de Lenguajes y Sistemas Informáticos
Universidad de Alicante, E-03071 Alicante (Spain)
Tel.: +34 965 903 400, ext: 2038 Fax: +34 965 909 326
http://www.dlsi.ua.es/~fsanchez


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Creating a index scheduler with Java.

2008-02-12 Thread galford23

Hi all,

I am trying to do some scheduling / cron job for Lucene indexing.
I am very new to Lucene and Java.
Can I get some advice on how I can achieve it? Books, URL links, or the
technology required. I have been searching the web for quite some time but
just cannot get the correct result... maybe even some help on my keyword
search in Google would be welcome as well.
I will be using NetBeans for development. My application will be running on
CentOS 5 with a MySQL database.
 

Attached below is a screen shot of the expected development result.
http://www.nabble.com/file/p15430825/Index%2BScreen%2Bshot.jpg 



Please help.:confused:
-- 
View this message in context: 
http://www.nabble.com/Creating-a-index-scheduler-with-Java.-tp15430825p15430825.html
Sent from the Lucene - Java Developer mailing list archive at Nabble.com.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]