[jira] Commented: (LUCENE-1161) Punctuation handling in StandardTokenizer (and WikipediaTokenizer)

2008-03-13 Thread Hiroaki Kawai (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12578179#action_12578179
 ] 

Hiroaki Kawai commented on LUCENE-1161:
---

I think WhitespaceTokenizer + WordDelimiterFilter + StandardFilter might work...
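For what it's worth, a minimal sketch of that chain against the 2.x analysis API. WordDelimiterFilter is Solr's class, and the constructor flags below are assumptions, not the exact signature:

{code:java}
import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.solr.analysis.WordDelimiterFilter;

public class PunctuationAnalyzer extends Analyzer {
  public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream ts = new WhitespaceTokenizer(reader);
    // generateWordParts=1, generateNumberParts=1, catenateWords=1,
    // catenateNumbers=1, catenateAll=0 (assumed flags): "e-bay" would yield
    // the parts "e" and "bay" plus the collapsed "ebay"
    ts = new WordDelimiterFilter(ts, 1, 1, 1, 1, 0);
    ts = new StandardFilter(ts);
    return ts;
  }
}
{code}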

 Punctuation handling in StandardTokenizer (and WikipediaTokenizer)
 --

 Key: LUCENE-1161
 URL: https://issues.apache.org/jira/browse/LUCENE-1161
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Analysis
Reporter: Grant Ingersoll
Priority: Minor

 It would be useful, in the StandardTokenizer, to be able to have more control 
 over how in-word punctuation is handled.  For instance, it is not always 
 desirable to split on dashes or other punctuation.  In other cases, one may 
 want to output the split tokens plus a collapsed version of the token that 
 removes the punctuation.
 For example, Solr's WordDelimiterFilter provides some nice capabilities here, 
 but it can't do its job when using the StandardTokenizer because the 
 StandardTokenizer already makes the decision on how to handle it without 
 giving the user any choice.
 I think, in JFlex, we can have a back-compatible way of letting users make 
 decisions about punctuation that occurs inside of a token, such as "e-bay" or 
 "i-pod", thus allowing for matches on iPod and eBay.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Created: (LUCENE-1227) NGramTokenizer to handle more than 1024 chars

2008-03-13 Thread Hiroaki Kawai (JIRA)
NGramTokenizer to handle more than 1024 chars
-

 Key: LUCENE-1227
 URL: https://issues.apache.org/jira/browse/LUCENE-1227
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/*
Reporter: Hiroaki Kawai
 Attachments: NGramTokenizer.patch

The current NGramTokenizer can't handle a character stream longer than 1024 
characters. This is too short for non-whitespace-separated languages.

I created a patch for this issue.
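For context, a rough sketch of how the limit shows up with the contrib tokenizer (2.x TokenStream API; the 1024 figure is the tokenizer's internal read buffer):

{code:java}
import java.io.StringReader;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.ngram.NGramTokenizer;

StringBuffer sb = new StringBuffer();
for (int i = 0; i < 2048; i++) sb.append('x');   // stream longer than 1024 chars
NGramTokenizer tok = new NGramTokenizer(new StringReader(sb.toString()), 1, 2);
int count = 0;
for (Token t = tok.next(); t != null; t = tok.next()) count++;
System.out.println(count);  // before the patch, grams past char 1024 are lost
{code}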

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1227) NGramTokenizer to handle more than 1024 chars

2008-03-13 Thread Hiroaki Kawai (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12578182#action_12578182
 ] 

Hiroaki Kawai commented on LUCENE-1227:
---

LUCENE-1227's NGramTokenizer.patch will also fix 
https://issues.apache.org/jira/browse/LUCENE-1225 :-)

 NGramTokenizer to handle more than 1024 chars
 -

 Key: LUCENE-1227
 URL: https://issues.apache.org/jira/browse/LUCENE-1227
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/*
Reporter: Hiroaki Kawai
 Attachments: NGramTokenizer.patch


 The current NGramTokenizer can't handle a character stream longer than 
 1024 characters. This is too short for non-whitespace-separated languages.
 I created a patch for this issue.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Build failed in Hudson: Lucene-trunk #399

2008-03-13 Thread Michael McCandless


Looks like Oracle finally shut off the old download location from  
Sleepycat.  It's been moved to a new location.  I'll commit a fix  
shortly.


Mike

On Mar 12, 2008, at 10:12 PM, Apache Hudson Server wrote:


See http://hudson.zones.apache.org/hudson/job/Lucene-trunk/399/changes

Changes:

[mikemccand] LUCENE-1223: fix lazy field loading to not allow  
string field to be loaded as binary, nor vice versa


[mikemccand] LUCENE-1214: preserve original exception in  
SegmentInfos write & commit


[mikemccand] LUCENE-1212: factor DocumentsWriter into separate  
source files


--
[...truncated 2190 lines...]
javacc-notice:

jflex-uptodate-check:

jflex-notice:

common.init:

compile-demo:

javacc-uptodate-check:

javacc-notice:

jflex-uptodate-check:

jflex-notice:

init:

clover.setup:

clover.info:

clover:

common.compile-core:

compile-core:

compile-demo:
[mkdir] Created dir: http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/classes/demo
[javac] Compiling 17 source files to http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/classes/demo


compile-highlighter:
 [echo] Building highlighter...

javacc-uptodate-check:

javacc-notice:

jflex-uptodate-check:

jflex-notice:

common.init:

build-lucene:

init:

clover.setup:

clover.info:

clover:

compile-core:
[mkdir] Created dir: http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/highlighter/classes/java
[javac] Compiling 18 source files to http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/highlighter/classes/java
[javac] Note: http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/contrib/highlighter/src/java/org/apache/lucene/search/highlight/QueryScorer.java uses or overrides a deprecated API.

[javac] Note: Recompile with -deprecation for details.

compile:

check-files:

init:

clover.setup:

clover.info:

clover:

compile-core:
[mkdir] Created dir: http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/benchmark/classes/java
[javac] Compiling 89 source files to http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/benchmark/classes/java

[javac] Note: Some input files use or override a deprecated API.
[javac] Note: Recompile with -deprecation for details.

jar-core:
 [exec] Execute failed: java.io.IOException: svnversion: not found
  [jar] Building jar: http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/benchmark/lucene-benchmark-2.4-SNAPSHOT.jar


jar:

compile-test:
 [echo] Building benchmark...

javacc-uptodate-check:

javacc-notice:

jflex-uptodate-check:

jflex-notice:

common.init:

compile-demo:

javacc-uptodate-check:

javacc-notice:

jflex-uptodate-check:

jflex-notice:

init:

clover.setup:

clover.info:

clover:

common.compile-core:

compile-core:

compile-demo:

compile-highlighter:
 [echo] Building highlighter...

javacc-uptodate-check:

javacc-notice:

jflex-uptodate-check:

jflex-notice:

common.init:

build-lucene:

init:

clover.setup:

clover.info:

clover:

compile-core:

compile:

check-files:

init:

clover.setup:

clover.info:

clover:

compile-core:

common.compile-test:
[mkdir] Created dir: http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/benchmark/classes/test
[javac] Compiling 5 source files to http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/benchmark/classes/test
 [copy] Copying 2 files to http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/benchmark/classes/test


build-artifacts-and-tests:

bdb:
 [echo] Building bdb...

javacc-uptodate-check:

javacc-notice:

jflex-uptodate-check:

jflex-notice:

common.init:

build-lucene:

contrib-build.init:

get-db-jar:
[mkdir] Created dir: http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/contrib/db/bdb/lib
  [get] Getting: http://downloads.osafoundation.org/db/db-4.3.29.jar
  [get] To: http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/contrib/db/bdb/lib/db-4.3.29.jar


check-and-get-db-jar:

init:

clover.setup:

clover.info:

clover:

compile-core:
[mkdir] Created dir: http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/db/bdb/classes/java
[javac] Compiling 7 source files to http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/db/bdb/classes/java
[javac] Note: http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/contrib/db/bdb/src/java/org/apache/lucene/store/db/DbDirectory.java uses or overrides a deprecated API.

[javac] Note: Recompile with -deprecation for details.

jar-core:
 [exec] Execute failed: java.io.IOException: svnversion: not found
  [jar] Building jar: http://hudson.zones.apache.org/hudson/job/ 

[jira] Commented: (LUCENE-1029) Illegal character replacements in ISOLatin1AccentFilter

2008-03-13 Thread Hiroaki Kawai (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12578195#action_12578195
 ] 

Hiroaki Kawai commented on LUCENE-1029:
---

I'd like to comment that we have another tool for this. :-)

java.text.Collator can collate text, and the instance is based on a Locale, 
wow! So, if we use this collator, you might get better query results, i.e., 
less search noise, so that German ä might hit with ae.

I'd like to submit a patch later.
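A minimal sketch of the idea (plain JDK; whether ä also collates equal to ae depends on the Locale's rules):

{code:java}
import java.text.Collator;
import java.util.Locale;

Collator collator = Collator.getInstance(Locale.GERMAN);
collator.setStrength(Collator.PRIMARY);          // ignore accents and case
System.out.println(collator.compare("ä", "a"));  // 0: equal at PRIMARY strength
// a CollationKey could then serve as the indexed form of the term
{code}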

 Illegal character replacements in ISOLatin1AccentFilter
 ---

 Key: LUCENE-1029
 URL: https://issues.apache.org/jira/browse/LUCENE-1029
 Project: Lucene - Java
  Issue Type: Bug
  Components: Analysis
Affects Versions: 2.2
Reporter: Marko Asplund
 Attachments: ISOLatin1AccentFilter-javadoc.patch


 The ISOLatin1AccentFilter class is responsible for replacing accented 
 characters in the ISO Latin 1 character set by their unaccented equivalents.
 Some of the replacements performed for Scandinavian characters (used e.g. in 
 the Finnish, Swedish, Danish languages etc.) are illegal. The Scandinavian 
 characters are different from the accented characters used e.g. in Latin-based 
 languages such as French in that these characters (ä, ö, å) represent 
 entirely independent sounds in the language and therefore cannot be 
 represented with any other sound without change of meaning. It is therefore 
 illegal to replace these characters with any other character.
 This means for example that you can't change the Finnish word "sää" (weather) 
 to "saa" (will have) because these are two entirely different words with 
 different meanings. The same applies to the other Scandinavian languages as well.
 There's no connection between the sounds represented by ä and a, ö and o, or å 
 and a. 
 In addition to the three characters mentioned above, Danish and Norwegian use 
 other special characters such as ø and æ. It should be checked whether the 
 replacement is legal for these characters.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1219) support array/offset/length setters for Field with binary data

2008-03-13 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12578199#action_12578199
 ] 

Michael McCandless commented on LUCENE-1219:


Alas, I'm not really happy with introducing this API at the
AbstractField level and not in Fieldable.  It's awkward that we've
deprecated binaryValue() in AbstractField and not in Fieldable.  But,
I think it's our only way forward with this issue without breaking
backwards compatibility.

In 3.0 I'd like to at least promote this API up into Fieldable, but
even that is somewhat messy because I think in 3.0 we would then
deprecate binaryValue() and move these 3 new methods up from
AbstractField.

What I'd really like to do in 3.0 is change Fieldable to not be an
interface, but an abstract base class instead.

Question: could we simply move forward without Fieldable?  Ie,
deprecate Fieldable right now and state that the migration path is
"you should subclass from AbstractField"?  I would leave "implements
Fieldable" in AbstractField now, but remove it in 3.0.  As far as I
can tell, all uses of Fieldable in Lucene are also using
AbstractField.

I guess I don't really understand the need for Fieldable.  In fact I
also don't really understand why we even needed to add AbstractField.
Why couldn't FieldForMerge and LazyField subclass Field?  It's
somewhat awkward now because we have newly added APIs to Field, like
setValue(*), which probably should have been added to Fieldable.


 support array/offset/length setters for Field with binary data
 ---

 Key: LUCENE-1219
 URL: https://issues.apache.org/jira/browse/LUCENE-1219
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Eks Dev
Assignee: Michael McCandless
Priority: Minor
 Attachments: LUCENE-1219.patch, LUCENE-1219.patch, LUCENE-1219.patch, 
 LUCENE-1219.patch, LUCENE-1219.take2.patch


 Currently the Field/Fieldable interface supports only compact, zero-based byte 
 arrays. This forces end users to create and copy the content into new objects 
 before passing them to Lucene, as such fields are often of variable size. 
 Depending on the use case, this can bring a far from negligible performance 
 improvement. 
 This approach extends the Fieldable interface with 3 new methods: 
 getOffset(), getLength(), and getBinaryValue() (the latter only returns a 
 reference to the array).
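A hedged sketch of the reuse pattern the issue describes; fillBuffer() and the setter name are illustrative, not necessarily the committed API:

{code:java}
byte[] buffer = new byte[4096];           // one buffer, reused across documents
int len = fillBuffer(buffer);             // hypothetical serializer filling buffer
Field f = new Field("payload", buffer, Field.Store.YES);
f.setValue(buffer, 0, len);               // assumed setter: array + offset + length
// a consumer would then read buffer[getOffset() .. getOffset()+getLength())
// through getBinaryValue(), with no intermediate copy
{code}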


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: an API for synonym in Lucene-core

2008-03-13 Thread Mathieu Lecarme

I'll slice my contrib into small parts.

Synonyms
1) Synonym (a Token + a weight; see the sketch below)
2) Synonym provider from the OO.o thesaurus
3) SynonymTokenFilter
4) Query expander which applies a filter (and a boost) on each of its TermQuery
5) a Synonym filter for the query expander
6) to be efficient, Synonyms can be excluded if they don't exist in the index
7) Stemming can be used as a dynamic Synonym
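A rough sketch of the storage abstraction behind points 1-2 (the names are illustrative, not the actual contrib classes):

public class Synonym {
    public final String term;
    public final float weight;
    public Synonym(String term, float weight) {
        this.term = term;
        this.weight = weight;
    }
}

public interface SynonymProvider {
    /** map a term to its similar terms, each carrying a weight */
    java.util.List /* of Synonym */ getSynonyms(String term);
}

The SynonymTokenFilter would then stack each provided synonym at the same position (positionIncrement=0) as the original token.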

Spell checking, or the "did you mean?" pattern
1) The main concept is in the SpellCheck contrib, but in a 
non-extensible implementation
2) In some languages, like French, homophony is very important in 
misspelling; there is more than one way to write a word
3) Homophony rules are provided by Aspell in a neutral language (just 
like Snowball for stemming); I implemented a translator to build Java 
classes from Aspell files (it's the same format in the Aspell lineage: 
MySpell and Hunspell, which are used in the OO.o and Firefox families)

https://issues.apache.org/jira/browse/LUCENE-956

Storing information about words found in an index
1) It's the Dictionary used in the SpellCheck contrib, in a more open way: 
a lexicon. It's a plain old Lucene index; each word becomes a Document, and 
Fields store computed information like size, n-gram tokens and homophony. 
Everything uses filters taken from TokenFilter, so code duplication is avoided.
2) This information may be out of sync with the index, in order not to 
slow the indexing process, so some information needs to be checked 
lazily (does this synonym already exist in the index?), and lexicon 
correction can be done on the fly (if the synonym doesn't exist, write 
it in the lexicon for the next time). There is some work here to find 
the best and fastest way to keep information synchronized between index 
and lexicon (hard links, a log for nightly replay, complete iteration over 
the index to find deleted and new entries ...)

3) Similar (more than only Synonym) and Near (misspelled) words use the Lexicon.
https://issues.apache.org/jira/browse/LUCENE-1190

Extending it
1) The Lexicon can be used to store Nouns, i.e. words that work better 
together, like "New York", "Apple II" or "Alexander the Great". 
Extracting nouns from a thesaurus is very hard, but the Wikipedia people have 
done part of the work; article titles can be a good start for building a 
noun list. And it works in many languages.
Nouns can be used as an intuitive PhraseQuery, or as a suggestion for 
refining results.


Implementing it well in Lucene
The SpellCheck and WordNet contribs do a part of it, but in a specific and 
non-extensible way. I think it's better when the foundation is vetted by a 
Lucene maintainer, and afterwards a contrib is built on top of this foundation.


M.


Otis Gospodnetic wrote:

Grant, I think Mathieu is hinting at his JIRA contribution (I looked at it 
briefly the other day, but haven't had the chance to really understand it).

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message 
From: Mathieu Lecarme [EMAIL PROTECTED]
To: java-dev@lucene.apache.org
Sent: Wednesday, March 12, 2008 5:47:40 AM
Subject: an API for synonym in Lucene-core

Why doesn't Lucene have a clean synonym API?
The WordNet contrib is not an answer; it provides an interface for its own 
needs, and most of the world doesn't speak English.
Compass provides a tool, just like Solr. Lucene is the framework for 
applications like Solr, Nutch or Compass, so why not backport low-level 
features of these projects?
A synonym API should provide a TokenFilter, an abstract storage that 
maps a token to similar tokens with weights, and tools for expanding queries.
The OpenOffice dictionary project can provide data in different 
languages, with compatible licenses, I presume.


M.

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]





-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


  



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-1029) Illegal character replacements in ISOLatin1AccentFilter

2008-03-13 Thread Hiroaki Kawai (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hiroaki Kawai updated LUCENE-1029:
--

Attachment: ISOLatin1AccentFilter-by-Collator.patch

Wrote a patch that uses java.text.Collator. 

 Illegal character replacements in ISOLatin1AccentFilter
 ---

 Key: LUCENE-1029
 URL: https://issues.apache.org/jira/browse/LUCENE-1029
 Project: Lucene - Java
  Issue Type: Bug
  Components: Analysis
Affects Versions: 2.2
Reporter: Marko Asplund
 Attachments: ISOLatin1AccentFilter-by-Collator.patch, 
 ISOLatin1AccentFilter-javadoc.patch


 The ISOLatin1AccentFilter class is responsible for replacing accented 
 characters in the ISO Latin 1 character set by their unaccented equivalents.
 Some of the replacements performed for Scandinavian characters (used e.g. in 
 the Finnish, Swedish, Danish languages etc.) are illegal. The Scandinavian 
 characters are different from the accented characters used e.g. in Latin-based 
 languages such as French in that these characters (ä, ö, å) represent 
 entirely independent sounds in the language and therefore cannot be 
 represented with any other sound without change of meaning. It is therefore 
 illegal to replace these characters with any other character.
 This means for example that you can't change the Finnish word "sää" (weather) 
 to "saa" (will have) because these are two entirely different words with 
 different meanings. The same applies to the other Scandinavian languages as well.
 There's no connection between the sounds represented by ä and a, ö and o, or å 
 and a. 
 In addition to the three characters mentioned above, Danish and Norwegian use 
 other special characters such as ø and æ. It should be checked whether the 
 replacement is legal for these characters.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-1224) NGramTokenFilter creates bad TokenStream

2008-03-13 Thread Hiroaki Kawai (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hiroaki Kawai updated LUCENE-1224:
--

Attachment: NGramTokenFilter.patch

Modified to set the right start/end offset values in the Token properties.

 NGramTokenFilter creates bad TokenStream
 

 Key: LUCENE-1224
 URL: https://issues.apache.org/jira/browse/LUCENE-1224
 Project: Lucene - Java
  Issue Type: Bug
  Components: contrib/*
Reporter: Hiroaki Kawai
Priority: Critical
 Attachments: NGramTokenFilter.patch, NGramTokenFilter.patch


 With the current trunk NGramTokenFilter(min=2,max=4), I index the string 
 "abcdef" into an index, but I can't query it with "abc". If I query with "ab", 
 I can get a hit result.
 The reason is that the NGramTokenFilter generates a badly ordered TokenStream. 
 Queries are based on the Token order in the TokenStream, in that how stemming 
 or phrases should be analyzed is based on the order (Token.positionIncrement).
 With the current filter, the query string "abc" is tokenized to: "ab" "bc" "abc", 
 meaning: query a string that has "ab" "bc" "abc" in this order.
 The expected filter will generate: "ab" "abc"(positionIncrement=0) "bc", 
 meaning: query a string that has "(ab|abc) bc" in this order.
 I'd like to submit a patch for this issue. :-)
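To make the expectation concrete, a small sketch against the 2.x API (next()/termText() as they existed then):

{code:java}
import java.io.StringReader;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.analysis.ngram.NGramTokenFilter;

TokenStream ts = new NGramTokenFilter(
    new WhitespaceTokenizer(new StringReader("abc")), 2, 3);
for (Token t = ts.next(); t != null; t = ts.next()) {
  System.out.println(t.termText() + " +" + t.getPositionIncrement());
}
// desired output per this issue:
//   ab  +1
//   abc +0   (stacked on "ab")
//   bc  +1
{code}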

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Assigned: (LUCENE-1224) NGramTokenFilter creates bad TokenStream

2008-03-13 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll reassigned LUCENE-1224:
---

Assignee: Grant Ingersoll

 NGramTokenFilter creates bad TokenStream
 

 Key: LUCENE-1224
 URL: https://issues.apache.org/jira/browse/LUCENE-1224
 Project: Lucene - Java
  Issue Type: Bug
  Components: contrib/*
Reporter: Hiroaki Kawai
Assignee: Grant Ingersoll
Priority: Critical
 Attachments: NGramTokenFilter.patch, NGramTokenFilter.patch


 With the current trunk NGramTokenFilter(min=2,max=4), I index the string 
 "abcdef" into an index, but I can't query it with "abc". If I query with "ab", 
 I can get a hit result.
 The reason is that the NGramTokenFilter generates a badly ordered TokenStream. 
 Queries are based on the Token order in the TokenStream, in that how stemming 
 or phrases should be analyzed is based on the order (Token.positionIncrement).
 With the current filter, the query string "abc" is tokenized to: "ab" "bc" "abc", 
 meaning: query a string that has "ab" "bc" "abc" in this order.
 The expected filter will generate: "ab" "abc"(positionIncrement=0) "bc", 
 meaning: query a string that has "(ab|abc) bc" in this order.
 I'd like to submit a patch for this issue. :-)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-1227) NGramTokenizer to handle more than 1024 chars

2008-03-13 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll updated LUCENE-1227:


 Priority: Minor  (was: Major)
Lucene Fields: [New, Patch Available]  (was: [Patch Available, New])

 NGramTokenizer to handle more than 1024 chars
 -

 Key: LUCENE-1227
 URL: https://issues.apache.org/jira/browse/LUCENE-1227
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/*
Reporter: Hiroaki Kawai
Assignee: Grant Ingersoll
Priority: Minor
 Attachments: NGramTokenizer.patch


 The current NGramTokenizer can't handle a character stream longer than 
 1024 characters. This is too short for non-whitespace-separated languages.
 I created a patch for this issue.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Assigned: (LUCENE-1227) NGramTokenizer to handle more than 1024 chars

2008-03-13 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll reassigned LUCENE-1227:
---

Assignee: Grant Ingersoll

 NGramTokenizer to handle more than 1024 chars
 -

 Key: LUCENE-1227
 URL: https://issues.apache.org/jira/browse/LUCENE-1227
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/*
Reporter: Hiroaki Kawai
Assignee: Grant Ingersoll
 Attachments: NGramTokenizer.patch


 The current NGramTokenizer can't handle a character stream longer than 
 1024 characters. This is too short for non-whitespace-separated languages.
 I created a patch for this issue.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Resolved: (LUCENE-550) InstantiatedIndex - faster but memory consuming index

2008-03-13 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll resolved LUCENE-550.


Resolution: Fixed

Committed revision 636745.  Thanks Karl!

 InstantiatedIndex - faster but memory consuming index
 -

 Key: LUCENE-550
 URL: https://issues.apache.org/jira/browse/LUCENE-550
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Store
Affects Versions: 2.0.0
Reporter: Karl Wettin
Assignee: Grant Ingersoll
 Attachments: BinarySearchUtils.Apache.java, classdiagram.png, 
 HitCollectionBench.jpg, LUCENE-550.patch, LUCENE-550.patch, LUCENE-550.patch, 
 LUCENE-550_20071021_no_core_changes.txt, test-reports.zip


 Represented as a coupled graph of class instances, this all-in-memory index 
 store implementation delivers search results up to 100 times faster than 
 the file-centric RAMDirectory, at the cost of greater RAM consumption.
 Performance seems to be a little bit better than log2(n) (binary search). No 
 real data on that, just my eyes.
 Populated with a single document, InstantiatedIndex is almost, but not quite, 
 as fast as MemoryIndex.
 At 20,000 documents of 10-50 characters, InstantiatedIndex outperforms 
 RAMDirectory some 30x,
 15x at 100 documents of 2,000 characters,
 and is on par with RAMDirectory at 10,000 documents of 2,000 characters.
 Mileage may vary depending on term saturation.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-1227) NGramTokenizer to handle more than 1024 chars

2008-03-13 Thread Hiroaki Kawai (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hiroaki Kawai updated LUCENE-1227:
--

Attachment: NGramTokenizer.patch

Bugfix: I made a mistake in the char array addressing.

 NGramTokenizer to handle more than 1024 chars
 -

 Key: LUCENE-1227
 URL: https://issues.apache.org/jira/browse/LUCENE-1227
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/*
Reporter: Hiroaki Kawai
Assignee: Grant Ingersoll
Priority: Minor
 Attachments: NGramTokenizer.patch, NGramTokenizer.patch


 The current NGramTokenizer can't handle a character stream longer than 
 1024 characters. This is too short for non-whitespace-separated languages.
 I created a patch for this issue.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Should Document.getFieldables really return null

2008-03-13 Thread Stefan Trcek
Hello

The 'Document.getFieldables(String name)' is documented to return 'null'  
in some cases (and really does, see the code below). However, this imposes 
a penalty on the client, as code like this

Document doc = hits.doc(i);
for (Fieldable f: doc.getFieldables("somefield")) {
    System.out.println(f.stringValue());
}

is wrong (no check for 'null'). For the client code it would be better 
if 'Document.getFieldables(String)' returned 'new Fieldable[0]' 
instead (no NullPointerException).

If you needn't distinguish between null-ed arrays and arrays of zero 
length (do you?), I suggest never returning 'null', but returning an array 
of size zero. If you don't trust the just-in-time compiler (concerning 
performance), you may even define

private final static Fieldable[] EMPTY = new Fieldable[0];

and return 'EMPTY' at the (*) line. Same with

   public final Field[] getFields(String name) {
   public final String[] getValues(String name) {
   public final byte[][] getBinaryValues(String name) {
   public final byte[] getBinaryValue(String name) {

and maybe others.

Stefan

--- org.apache.lucene.document.Document.java -
   public Fieldable[] getFieldables(String name) {
     List result = new ArrayList();
     for (int i = 0; i < fields.size(); i++) {
       Fieldable field = (Fieldable)fields.get(i);
       if (field.name().equals(name)) {
         result.add(field);
       }
     }

     if (result.size() == 0)
(*)    return null;

     return (Fieldable[])result.toArray(new Fieldable[result.size()]);
   }
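For reference, the suggested variant with the EMPTY constant (a sketch of the proposal above, not committed code):

   private final static Fieldable[] EMPTY = new Fieldable[0];

   public Fieldable[] getFieldables(String name) {
     List result = new ArrayList();
     for (int i = 0; i < fields.size(); i++) {
       Fieldable field = (Fieldable)fields.get(i);
       if (field.name().equals(name)) {
         result.add(field);
       }
     }
     // never null: callers can iterate without a null check, and
     // toArray(EMPTY) returns the shared EMPTY array when there are no hits
     return (Fieldable[])result.toArray(EMPTY);
   }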


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: an API for synonym in Lucene-core

2008-03-13 Thread J. Delgado
Mathieu,

Have you thought about incorporating a standard format for thesauri, and
thus for query/index expansion? Here is the recommendation from NISO:
http://www.niso.org/committees/MT-info.html

Beyond synonyms, having the capability to specify the use of BT (broader
terms, or hypernyms) or NT (narrower terms, or hyponyms) is very useful to
provide more general or specific context to the query.

There are other tricks, such as weighting terms from a thesaurus based on the
number of occurrences in the index, as well as extracting potential
used-for terms by looking at patterns such as a word followed by a
parenthesis with a small number of tokens (i.e. "term (alternate term)").
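A quick sketch of that pattern-mining trick (the regex and limits are illustrative):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

Pattern p = Pattern.compile("(\\w+(?: \\w+)?)\\s*\\(([^()]{1,40})\\)");
Matcher m = p.matcher("hypertension (high blood pressure) is common");
while (m.find()) {
    // candidate pair: term and its alternate form
    System.out.println(m.group(1) + " / " + m.group(2));
}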

J.D.


On Thu, Mar 13, 2008 at 2:52 AM, Mathieu Lecarme [EMAIL PROTECTED]
wrote:

 I'll slice my contrib into small parts.

 Synonyms
 1) Synonym (a Token + a weight)
 2) Synonym provider from the OO.o thesaurus
 3) SynonymTokenFilter
 4) Query expander which applies a filter (and a boost) on each of its
 TermQuery
 5) a Synonym filter for the query expander
 6) to be efficient, Synonyms can be excluded if they don't exist in the index
 7) Stemming can be used as a dynamic Synonym

 Spell checking, or the "did you mean?" pattern
 1) The main concept is in the SpellCheck contrib, but in a
 non-extensible implementation
 2) In some languages, like French, homophony is very important in
 misspelling; there is more than one way to write a word
 3) Homophony rules are provided by Aspell in a neutral language (just
 like Snowball for stemming); I implemented a translator to build Java
 classes from Aspell files (it's the same format in the Aspell lineage:
 MySpell and Hunspell, which are used in the OO.o and Firefox families)
 https://issues.apache.org/jira/browse/LUCENE-956

 Storing information about words found in an index
 1) It's the Dictionary used in the SpellCheck contrib, in a more open way:
 a lexicon. It's a plain old Lucene index; each word becomes a Document, and
 Fields store computed information like size, n-gram tokens and homophony.
 Everything uses filters taken from TokenFilter, so code duplication is avoided.
 2) This information may be out of sync with the index, in order not to
 slow the indexing process, so some information needs to be checked
 lazily (does this synonym already exist in the index?), and lexicon
 correction can be done on the fly (if the synonym doesn't exist, write
 it in the lexicon for the next time). There is some work here to find
 the best and fastest way to keep information synchronized between index
 and lexicon (hard links, a log for nightly replay, complete iteration over
 the index to find deleted and new entries ...)
 3) Similar (more than only Synonym) and Near (misspelled) words use the
 Lexicon.
 https://issues.apache.org/jira/browse/LUCENE-1190

 Extending it
 1) The Lexicon can be used to store Nouns, i.e. words that work better
 together, like "New York", "Apple II" or "Alexander the Great".
 Extracting nouns from a thesaurus is very hard, but the Wikipedia people have
 done part of the work; article titles can be a good start for building a
 noun list. And it works in many languages.
 Nouns can be used as an intuitive PhraseQuery, or as a suggestion for
 refining results.

 Implementing it well in Lucene
 The SpellCheck and WordNet contribs do a part of it, but in a specific and
 non-extensible way. I think it's better when the foundation is vetted by a
 Lucene maintainer, and afterwards a contrib is built on top of this foundation.

 M.


 Otis Gospodnetic wrote:
  Grant, I think Mathieu is hinting at his JIRA contribution (I looked at
 it briefly the other day, but haven't had the chance to really understand
 it).
 
  Otis
  --
  Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
 
  - Original Message 
  From: Mathieu Lecarme [EMAIL PROTECTED]
  To: java-dev@lucene.apache.org
  Sent: Wednesday, March 12, 2008 5:47:40 AM
  Subject: an API for synonym in Lucene-core
 
  Why doesn't Lucene have a clean synonym API?
  The WordNet contrib is not an answer; it provides an interface for its own
  needs, and most of the world doesn't speak English.
  Compass provides a tool, just like Solr. Lucene is the framework for
  applications like Solr, Nutch or Compass, so why not backport low-level
  features of these projects?
  A synonym API should provide a TokenFilter, an abstract storage that
  maps a token to similar tokens with weights, and tools for expanding
  queries.
  The OpenOffice dictionary project can provide data in different
  languages, with compatible licenses, I presume.
 
  M.
 
  -
  To unsubscribe, e-mail: [EMAIL PROTECTED]
  For additional commands, e-mail: [EMAIL PROTECTED]
 
 
 
 
 
  -
  To unsubscribe, e-mail: [EMAIL PROTECTED]
  For additional commands, e-mail: [EMAIL PROTECTED]
 
 
 


 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL 

[jira] Created: (LUCENE-1228) IndexWriter.commit() does not update the index version

2008-03-13 Thread Doron Cohen (JIRA)
IndexWriter.commit() does not update the index version
---

 Key: LUCENE-1228
 URL: https://issues.apache.org/jira/browse/LUCENE-1228
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Affects Versions: 2.3.1, 2.3, 2.4
Reporter: Doron Cohen
Assignee: Doron Cohen


IndexWriter.commit() can update the index *version* and *generation*, but the 
update of *version* is lost.
As a result, added documents are not seen by IndexReader.reopen().
(There might be other side effects that I am not aware of.)
The fix is 1 line - also update the version in SegmentInfos.updateGeneration().
(Finding this line involved more lines though... :-) )
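A hedged sketch of the fix's shape (field names assumed from the description; not the exact committed diff):

{code:java}
// inside SegmentInfos
final void updateGeneration(SegmentInfos other) {
  lastGeneration = other.lastGeneration;
  generation = other.generation;
  version = other.version;   // the missing line: carry the version over too
}
{code}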


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1228) IndexWriter.commit() does not update the index version

2008-03-13 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12578472#action_12578472
 ] 

Michael McCandless commented on LUCENE-1228:


Good catch Doron, thanks!

This only affects trunk (2.4).

 IndexWriter.commit()  does not update the index version
 ---

 Key: LUCENE-1228
 URL: https://issues.apache.org/jira/browse/LUCENE-1228
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Affects Versions: 2.3, 2.3.1, 2.4
Reporter: Doron Cohen
Assignee: Doron Cohen
 Attachments: lucene-1228-commit-reopen.patch


 IndexWriter.commit() can update the index *version* and *generation*, but the 
 update of *version* is lost.
 As a result, added documents are not seen by IndexReader.reopen().
 (There might be other side effects that I am not aware of.)
 The fix is 1 line - also update the version in 
 SegmentInfos.updateGeneration().
 (Finding this line involved more lines though... :-) )

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-1228) IndexWriter.commit() does not update the index version

2008-03-13 Thread Doron Cohen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doron Cohen updated LUCENE-1228:


Lucene Fields: [New, Patch Available]  (was: [Patch Available, New])
Affects Version/s: (was: 2.3.1)
   (was: 2.3)

 IndexWriter.commit()  does not update the index version
 ---

 Key: LUCENE-1228
 URL: https://issues.apache.org/jira/browse/LUCENE-1228
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Affects Versions: 2.4
Reporter: Doron Cohen
Assignee: Doron Cohen
 Attachments: lucene-1228-commit-reopen.patch


 IndexWriter.commit() can update the index *version* and *generation*, but the 
 update of *version* is lost.
 As a result, added documents are not seen by IndexReader.reopen().
 (There might be other side effects that I am not aware of.)
 The fix is 1 line - also update the version in 
 SegmentInfos.updateGeneration().
 (Finding this line involved more lines though... :-) )

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1228) IndexWriter.commit() does not update the index version

2008-03-13 Thread Doron Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12578475#action_12578475
 ] 

Doron Cohen commented on LUCENE-1228:
-

Oh good, less migration to do.
Mmm.. so it is not related to Daniel's "Document ID shuffling under 2.3.x" 
thread on the user list.

 IndexWriter.commit()  does not update the index version
 ---

 Key: LUCENE-1228
 URL: https://issues.apache.org/jira/browse/LUCENE-1228
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Affects Versions: 2.4
Reporter: Doron Cohen
Assignee: Doron Cohen
 Attachments: lucene-1228-commit-reopen.patch


 IndexWriter.commit() can update the index *version* and *generation*, but the 
 update of *version* is lost.
 As a result, added documents are not seen by IndexReader.reopen().
 (There might be other side effects that I am not aware of.)
 The fix is 1 line - also update the version in 
 SegmentInfos.updateGeneration().
 (Finding this line involved more lines though... :-) )

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Resolved: (LUCENE-1226) IndexWriter.addIndexes(IndexReader[]) fails to create compound files

2008-03-13 Thread Michael Busch (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Busch resolved LUCENE-1226.
---

   Resolution: Fixed
Lucene Fields: [New, Patch Available]  (was: [Patch Available, New])

Committed to trunk & 2.3 branch.

 IndexWriter.addIndexes(IndexReader[]) fails to create compound files
 

 Key: LUCENE-1226
 URL: https://issues.apache.org/jira/browse/LUCENE-1226
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Affects Versions: 2.3, 2.3.1
Reporter: Michael Busch
Assignee: Michael Busch
Priority: Minor
 Fix For: 2.3.2, 2.4

 Attachments: lucene-1226.patch


 Even if no exception is thrown while writing the compound file at the end of 
 the addIndexes() call, the transaction is rolled back and the successfully 
 written cfs file is deleted. The fix is simple: there is just the 
 {code:java}
 success = true;
 {code}
 statement missing at the end of the try{} clause.
 All tests pass. I'll commit this soon to trunk and 2.3.2.
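The shape of the fix, sketched; the transaction method names are assumptions about IndexWriter's internals, not the exact source:

{code:java}
boolean success = false;
startTransaction();
try {
  // ... merge the readers and write the compound file ...
  success = true;                  // the statement that was missing
} finally {
  if (success) commitTransaction();
  else rollbackTransaction();      // previously hit even on success
}
{code}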

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1228) IndexWriter.commit() does not update the index version

2008-03-13 Thread Ning Li (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12578518#action_12578518
 ] 

Ning Li commented on LUCENE-1228:
-

Does SegmentInfos really need both version and generation? Is generation 
sufficient?

 IndexWriter.commit()  does not update the index version
 ---

 Key: LUCENE-1228
 URL: https://issues.apache.org/jira/browse/LUCENE-1228
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Affects Versions: 2.4
Reporter: Doron Cohen
Assignee: Doron Cohen
 Attachments: lucene-1228-commit-reopen.patch


 IndexWriter.commit() can update the index *version* and *generation*, but the 
 update of *version* is lost.
 As a result, added documents are not seen by IndexReader.reopen().
 (There might be other side effects that I am not aware of.)
 The fix is 1 line - also update the version in 
 SegmentInfos.updateGeneration().
 (Finding this line involved more lines though... :-) )

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Build failed in Hudson: Lucene-trunk #400

2008-03-13 Thread Apache Hudson Server
See http://hudson.zones.apache.org/hudson/job/Lucene-trunk/400/changes

Changes:

[buschmi] LUCENE-1226: Fixed IndexWriter.addIndexes(IndexReader[]) to commit 
successfully created compound files.

[gsingers] LUCENE-550: put the comment in the wrong spot

[gsingers] LUCENE-550:  Added RAMDirectory alternative as a contrib. Similar to 
MemoryIndex, but can hold more than one document

[mikemccand] download bdb zip from Oracle's servers

--
[...truncated 2478 lines...]
clover.info:

clover:

compile-core:
[mkdir] Created dir: 
http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/db/bdb-je/classes/java
 
[javac] Compiling 6 source files to 
http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/db/bdb-je/classes/java
 
[javac] Note: 
http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/contrib/db/bdb-je/src/java/org/apache/lucene/store/je/JEDirectory.java
  uses or overrides a deprecated API.
[javac] Note: Recompile with -deprecation for details.

jar-core:
 [exec] Execute failed: java.io.IOException: svnversion: not found
  [jar] Building jar: 
http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/db/bdb-je/lucene-bdb-je-2.4-SNAPSHOT.jar
 

default:

default:

compile-test:
 [echo] Building bdb...

javacc-uptodate-check:

javacc-notice:

jflex-uptodate-check:

jflex-notice:

common.init:

build-lucene:

contrib-build.init:

get-db-jar:

check-and-get-db-jar:

init:

compile-test:
 [echo] Building bdb...

javacc-uptodate-check:

javacc-notice:

jflex-uptodate-check:

jflex-notice:

common.init:

build-lucene:

contrib-build.init:

get-db-jar:

check-and-get-db-jar:

init:

clover.setup:

clover.info:

clover:

compile-core:

common.compile-test:
[mkdir] Created dir: 
http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/db/bdb/classes/test
 
[javac] Compiling 2 source files to 
http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/db/bdb/classes/test
 
 [echo] Building bdb-je...

javacc-uptodate-check:

javacc-notice:

jflex-uptodate-check:

jflex-notice:

common.init:

build-lucene:

contrib-build.init:

get-je-jar:

check-and-get-je-jar:

init:

compile-test:
 [echo] Building bdb-je...

javacc-uptodate-check:

javacc-notice:

jflex-uptodate-check:

jflex-notice:

common.init:

build-lucene:

contrib-build.init:

get-je-jar:

check-and-get-je-jar:

init:

clover.setup:

clover.info:

clover:

compile-core:

common.compile-test:
[mkdir] Created dir: 
http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/db/bdb-je/classes/test
 
[javac] Compiling 1 source file to 
http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/db/bdb-je/classes/test
 

build-artifacts-and-tests:
 [echo] Building highlighter...

javacc-uptodate-check:

javacc-notice:

jflex-uptodate-check:

jflex-notice:

common.init:

build-lucene:

init:

clover.setup:

clover.info:

clover:

compile-core:

jar-core:
 [exec] Execute failed: java.io.IOException: svnversion: not found
  [jar] Building jar: 
http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/highlighter/lucene-highlighter-2.4-SNAPSHOT.jar
 

jar:

compile-test:
 [echo] Building highlighter...

javacc-uptodate-check:

javacc-notice:

jflex-uptodate-check:

jflex-notice:

common.init:

build-lucene:

init:

clover.setup:

clover.info:

clover:

compile-core:

common.compile-test:
[mkdir] Created dir: 
http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/highlighter/classes/test
 
[javac] Compiling 1 source file to 
http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/highlighter/classes/test
 
[javac] Note: 
http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/contrib/highlighter/src/test/org/apache/lucene/search/highlight/HighlighterTest.java
  uses or overrides a deprecated API.
[javac] Note: Recompile with -deprecation for details.

build-artifacts-and-tests:
 [echo] Building instantiated...

javacc-uptodate-check:

javacc-notice:

jflex-uptodate-check:

jflex-notice:

common.init:

build-lucene:

init:

clover.setup:

clover.info:

clover:

compile-core:
[mkdir] Created dir: 
http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/instantiated/classes/java
 
[javac] Compiling 11 source files to 
http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/instantiated/classes/java
 
[javac] javac: invalid target release: 1.5
[javac] Usage: javac <options> <source files>
[javac] where possible options include:
[javac]   -g                            Generate all debugging info
[javac]   -g:none                       Generate no debugging info
[javac]   -g:{lines,vars,source}        Generate only some debugging info
[javac]   -nowarn  

[jira] Updated: (LUCENE-1229) NGramTokenFilter optimization in query phase

2008-03-13 Thread Hiroaki Kawai (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hiroaki Kawai updated LUCENE-1229:
--

Attachment: NGramTokenFilter.patch

Added a patch, NGramTokenFilter.patch. This patch includes LUCENE-1224.

 NGramTokenFilter optimization in query phase
 

 Key: LUCENE-1229
 URL: https://issues.apache.org/jira/browse/LUCENE-1229
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/*
Reporter: Hiroaki Kawai
 Attachments: NGramTokenFilter.patch


 I found that an NGramTokenFilter-ed token stream could be optimized in queries.
 A standard min=1,max=2 NGramTokenFilter will generate a token stream from 
 "abcde" as follows:
 "a" "ab" "b" "bc" "c" "cd" "d" "de" "e"
 When we index "abcde", we'll use all of the tokens.
 But when we query, we only need:
 "ab" "cd" "de"
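A sketch of the query-side reduction for min=1,max=2 (just the covering-gram selection; the names and loop are illustrative):

{code:java}
String s = "abcde";
java.util.List grams = new java.util.ArrayList();
for (int i = 0; i + 2 <= s.length(); i += 2)
  grams.add(s.substring(i, i + 2));          // "ab", "cd"
if (s.length() % 2 == 1)
  grams.add(s.substring(s.length() - 2));    // "de" covers the trailing char
System.out.println(grams);                   // [ab, cd, de]
{code}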

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Build failed in Hudson: Lucene-trunk #400

2008-03-13 Thread Chris Hostetter

: [gsingers] LUCENE-550:  Added RAMDirectory alternative as a contrib. 
: Similar to MemoryIndex, but can hold more than one document

...

: compile-core:
: [mkdir] Created dir: 
http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/instantiated/classes/java
 
: [javac] Compiling 11 source files to 
http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/instantiated/classes/java
 
: [javac] javac: invalid target release: 1.5

the problem here seems to be that hudson was configured to use the 1.4 
javac.  even though core lucene is required to be compatible with 1.4 
(currently) we've allowed contribs to be 1.5 for a while now ... but since 
contrib/gdata-server was retired, we apparently haven't had any 1.5 
contribs until now, so the setting was overlooked when we switched hudson 
servers.

(at least: i think that's what happened)

i've switched the config in the hudson GUI, and triggered a new build.



-Hoss


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1202) Clover setup currently has some problems

2008-03-13 Thread Hoss Man (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12578615#action_12578615
 ] 

Hoss Man commented on LUCENE-1202:
--

bq. Does the Clover license allow instrumenting non a.o packages, as in:


...per the README file I linked to, it allows instrumenting of those exact 
packages (hence the fileset - we'll probably never have org.w3c.* packages in 
our code base, but I include them for completeness).

 Clover setup currently has some problems
 

 Key: LUCENE-1202
 URL: https://issues.apache.org/jira/browse/LUCENE-1202
 Project: Lucene - Java
  Issue Type: Bug
Reporter: Hoss Man
 Attachments: LUCENE-1202.db-contrib-instrumentation.patch


 (tracking as a bug before it gets lost in email...
   
 http://www.nabble.com/Clover-reports-missing-from-hudson--to15510616.html#a15510616
 )
 The Clover setup for Lucene currently has some problems, 3 I think...
 1) instrumentation fails on contrib/db/ because it contains Java packages the 
 ASF Clover licence doesn't allow instrumentation of.  I have a patch for this.
 2) running instrumented contrib tests for other contribs produce strange 
 errors...
 {{monospaced}}
 [junit] Testsuite: org.apache.lucene.analysis.el.GreekAnalyzerTest
 [junit] Tests run: 1, Failures: 0, Errors: 1, Time elapsed: 0.126 sec
 [junit]
 [junit] - Standard Error -
 [junit] [CLOVER] FATAL ERROR: Clover could not be initialised. Are you 
 sure you have Clover
 in the runtime classpath? (class
 java.lang.NoClassDefFoundError:com_cenqua_clover/CloverVersionInfo)
 [junit] -  ---
 [junit] Testcase: 
 testAnalyzer(org.apache.lucene.analysis.el.GreekAnalyzerTest):Caused
 an ERROR
 [junit] com_cenqua_clover/g
 [junit] java.lang.NoClassDefFoundError: com_cenqua_clover/g
 [junit] at 
 org.apache.lucene.analysis.el.GreekAnalyzer.<init>(GreekAnalyzer.java:157)
 [junit] at
 org.apache.lucene.analysis.el.GreekAnalyzerTest.testAnalyzer(GreekAnalyzerTest.java:60)
 [junit]
 [junit]
 [junit] Test org.apache.lucene.analysis.el.GreekAnalyzerTest FAILED
 {{monospaced}}
 ...i'm not sure what's going on here.  the error seems to happen both when
 trying to run clover on just a single contrib, or when doing the full
 build ... i suspect there is an issue with the way the batchtests fork
 off, but I can't see why it would only happen to contribs (the regular
 tests fork as well)
 3) according to Grant...
 {{quote}}
 ...There is also a bit of a change on Hudson during the migration to the new 
 servers that needs to be ironed out. 
 {{quote}}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Created: (LUCENE-1230) Source release files missing the *.pom.template files

2008-03-13 Thread Michael Busch (JIRA)
Source release files missing the *.pom.template files
-

 Key: LUCENE-1230
 URL: https://issues.apache.org/jira/browse/LUCENE-1230
 Project: Lucene - Java
  Issue Type: Bug
  Components: Build
Affects Versions: 2.3.1, 2.3, 2.2
Reporter: Michael Busch
Assignee: Michael Busch
Priority: Minor
 Fix For: 2.3.2, 2.4


The source release files should contain the *.pom.template files, otherwise it 
is not possible to build the maven artifacts using ant 
generate-maven-artifacts from official release files.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]