[jira] Commented: (LUCENE-1157) Formatable changes log (CHANGES.txt is easy to edit but not so friendly to read by Lucene users)

2008-01-29 Thread Doron Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12563746#action_12563746
 ] 

Doron Cohen commented on LUCENE-1157:
-

{quote}
So, wouldn't it work to have Changes.html (and the stylesheets too) live in 
trunk/docs/ ?
{quote}
Yes I agree, they should move so that Grant's job copies them. But I would like
to make them part of the javadocs, so that there's no need to recompile with each
change and no need to check in Changes.html. I'll revert this and continue
tomorrow.


> Formatable changes log  (CHANGES.txt is easy to edit but not so friendly to 
> read by Lucene users)
> -
>
> Key: LUCENE-1157
> URL: https://issues.apache.org/jira/browse/LUCENE-1157
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Website
>Reporter: Doron Cohen
>Assignee: Doron Cohen
> Fix For: 2.4
>
> Attachments: lucene-1157-take2.patch, lucene-1157-take3.patch, 
> lucene-1157.patch
>
>
> Background in http://www.nabble.com/formatable-changes-log-tt15078749.html

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1157) Formatable changes log (CHANGES.txt is easy to edit but not so friendly to read by Lucene users)

2008-01-29 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12563743#action_12563743
 ] 

Steven Rowe commented on LUCENE-1157:
-

bq. there are unidentifiable characters in Changes.html. They are also in 
CHANGES.txt. I'm sure I read something about why they are added but cannot find 
it now.

The first three bytes of CHANGES.txt are a UTF-8 BOM (byte-order mark).  In
Unicode encodings built from multi-byte code units, e.g. UTF-16 and UTF-32, the
character U+FEFF is placed at the beginning of a stream to denote the
endianness of the character serialization.

UTF-8 is non-endian (invariant byte order given a character); the use of the
BOM in UTF-8, where it is serialized as the three bytes EF BB BF, is solely to
indicate that the encoding of the stream is UTF-8.

Microsoft's tools like to put BOMs at the beginnings of UTF-8 encoded files.
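
As a minimal sketch of the point above (not code from any of the attached patches), the BOM can be detected and skipped before CHANGES.txt is converted. The class and method names here are made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class Utf8Bom {
    // U+FEFF serialized as UTF-8 is always these three bytes.
    static final byte[] BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    /** True if the data begins with the UTF-8 encoding of U+FEFF. */
    static boolean startsWithBom(byte[] data) {
        return data.length >= 3
            && data[0] == BOM[0] && data[1] == BOM[1] && data[2] == BOM[2];
    }

    /** Returns the data without a leading BOM; unchanged if there is none. */
    static byte[] stripBom(byte[] data) {
        return startsWithBom(data) ? Arrays.copyOfRange(data, 3, data.length) : data;
    }

    /** Decodes UTF-8 bytes, dropping a leading BOM so it does not show up as a stray character. */
    static String decode(byte[] data) {
        return new String(stripBom(data), StandardCharsets.UTF_8);
    }
}
```

Calling `Utf8Bom.decode(bytes)` on the raw file contents would then yield text that starts with the first real character of the file.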




[jira] Commented: (LUCENE-1157) Formatable changes log (CHANGES.txt is easy to edit but not so friendly to read by Lucene users)

2008-01-29 Thread Doron Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12563725#action_12563725
 ] 

Doron Cohen commented on LUCENE-1157:
-

It seems that Changes.html should not be in svn at all.
Instead, it should have the same status as the javadocs - both are generated
documentation.
Instead of creating it as part of compile-core, I'll create it as part of
javadocs-core.
Instead of being created as part of committing, it would be created as part of
the nightly build, and copied to the site by Grant's scripts.
I'll go on with this tomorrow.




[jira] Commented: (LUCENE-1157) Formatable changes log (CHANGES.txt is easy to edit but not so friendly to read by Lucene users)

2008-01-29 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12563724#action_12563724
 ] 

Steven Rowe commented on LUCENE-1157:
-

According to http://wiki.apache.org/lucene-java/HowToUpdateTheWebsite , 
anything checked into trunk/docs/ will be automatically mirrored to the live 
website by a cron job running under Grant's account.

So, wouldn't it work to have Changes.html (and the stylesheets too) live in 
trunk/docs/ ?




[jira] Commented: (LUCENE-1157) Formatable changes log (CHANGES.txt is easy to edit but not so friendly to read by Lucene users)

2008-01-29 Thread Doron Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12563720#action_12563720
 ] 

Doron Cohen commented on LUCENE-1157:
-

OK, I checked in the creation of Changes.html from CHANGES.txt. Thanks, Steven!

The Web site update part seems trickier than I thought.
- Adding a link in the site to
  http://svn.apache.org/viewvc/lucene/java/trunk/Changes.html?view=co
  does not work so well, because of the way that page is served by ViewVC.
- Linking to http://svn.apache.org/repos/asf/lucene/java/trunk/Changes.html
  isn't working either, because svn returns the source of that file.
- In addition, there are unidentifiable characters in Changes.html. They are also
  in CHANGES.txt. I'm sure I read something about why they are added but cannot
  find it now.

Ideas?




Re: JBoss Cache as a store

2008-01-29 Thread Chris Hostetter

: Is there a set of tests in the Lucene sources I could use to test the
: "JBCDirectory", as I call it?  Perhaps some way I could change the "index
: store provider" and re-run some existing tests, and perhaps add some clustered
: tests specific to my plugin?

I think most of the existing tests have the Directory impl hardcoded in
them ... the best thing to do might be to refactor the existing tests so
Directory creation comes from an overridable function in a subclass...
come to think of it, Karl may have already done this as part of his
InstantiatedIndex patch (check jira) but i'm not sure ... the conversation
sounds familiar, but i think he was looking at facading the entire
IndexReader impl not just the directory, so any refactoring approach he
might have taken may not have gone far enough to work in this case.

It would certainly be nice if there was an easy way to run every test in
the test suite against an arbitrary Directory implementation.
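
A rough sketch of that refactoring idea, under heavy assumptions: `Store` is a made-up stand-in for a storage abstraction like Directory, and none of these classes exist in the Lucene test suite. The test logic is written once against the abstraction, and each implementation supplies only a factory method.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a storage abstraction such as Directory. */
interface Store {
    void write(String name, byte[] data);
    byte[] read(String name);
}

/** Trivial in-memory implementation, used only to exercise the test below. */
final class MapStore implements Store {
    private final Map<String, byte[]> files = new HashMap<>();
    public void write(String name, byte[] data) { files.put(name, data.clone()); }
    public byte[] read(String name) {
        byte[] data = files.get(name);
        return data == null ? null : data.clone();
    }
}

/** Test logic that never names a concrete implementation. */
abstract class StoreContractTest {
    /** Subclasses override this to choose the implementation under test. */
    protected abstract Store newStore();

    void testRoundTrip() {
        Store store = newStore();
        store.write("segments", new byte[] {1, 2, 3});
        byte[] back = store.read("segments");
        if (back == null || back.length != 3 || back[2] != 3) {
            throw new AssertionError("round trip failed");
        }
    }
}

/** Running the shared tests against another implementation is one small subclass. */
final class MapStoreTest extends StoreContractTest {
    @Override protected Store newStore() { return new MapStore(); }
}
```

Under this layout, a test class for another implementation would presumably only need to override `newStore()`.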

: Finally, regarding hosting, I am happy to contribute this to Lucene (alongside
: the JEDirectory, etc) but if licensing (JBoss Cache is LGPL, although the
: plugin code can be ASL if need be) or language levels (the plugin depends on
: JBoss Cache 2.x, which requires JDK 5) are a problem, then I'm happy to host
: the plugin externally.

contribs can already require 1.5 ... and soon the trunk will move to
1.5 so that's not really an issue, the licensing may be, but it depends on
how the integration with JBoss winds up working (ie: i don't know if
having the build scripts download JBoss at build time to compile against
it is allowed or not)




-Hoss





[jira] Commented: (LUCENE-997) Add search timeout support to Lucene

2008-01-29 Thread Paul Elschot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12563698#action_12563698
 ] 

Paul Elschot commented on LUCENE-997:
-

The idea of using System.currentTimeMillis() is to guard against misbehaviour of
the java wait() method and against unexpected delays caused by thread scheduling
when jumping back to the top of the loop around the wait() call.
One way to reduce such misbehaviour under heavy load is to increase the
scheduling priority of the timing thread, but I don't think that is necessary.

Also, System.currentTimeMillis() is obviously correct, whereas timeout +=
resolution is never more than an assumption about correct wait() behaviour.

Clock changes by NTP are normally so slow that they don't really matter for
query timeouts.
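
To make the difference concrete, here is a small sketch (not taken from timeout.patch) that tracks both values side by side. The class name, the injected `TimeSource`, and the use of Thread.sleep() in place of wait() are all assumptions made for illustration.

```java
final class CoarseClock {
    /** Injected so the sketch can be driven by a fake clock; real use would pass System::currentTimeMillis. */
    interface TimeSource { long millis(); }

    private final TimeSource source;
    private final long resolution;
    // Single writer (the timer thread), many readers: volatile keeps reads of the longs atomic and fresh.
    private volatile long assumed;   // advanced by "+= resolution"
    private volatile long observed;  // re-read from the clock

    CoarseClock(TimeSource source, long resolution) {
        this.source = source;
        this.resolution = resolution;
        this.assumed = source.millis();
        this.observed = this.assumed;
    }

    /** Called each time the timer thread wakes up. */
    void tick() {
        assumed += resolution;       // trusts that exactly one resolution elapsed
        observed = source.millis();  // makes no assumption about how long the sleep took
    }

    long assumed()  { return assumed; }
    long observed() { return observed; }

    /** Timer-thread body; sleep() stands in for the wait() call discussed above. */
    void runUntilInterrupted() {
        try {
            while (!Thread.currentThread().isInterrupted()) {
                Thread.sleep(resolution);
                tick();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

If every wake-up arrives late, `assumed()` falls further and further behind `observed()`, which is the drift the clock re-read is meant to avoid.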


> Add search timeout support to Lucene
> 
>
> Key: LUCENE-997
> URL: https://issues.apache.org/jira/browse/LUCENE-997
> Project: Lucene - Java
>  Issue Type: New Feature
>Reporter: Sean Timm
>Priority: Minor
> Attachments: HitCollectorTimeoutDecorator.java, 
> LuceneTimeoutTest.java, LuceneTimeoutTest.java, MyHitCollector.java, 
> timeout.patch, timeout.patch, timeout.patch, timeout.patch, 
> TimerThreadTest.java
>
>
> This patch is based on Nutch-308. 
> This patch adds support for a maximum search time limit. After this time is 
> exceeded, the search thread is stopped, partial results (if any) are returned 
> and the total number of results is estimated.
> This patch tries to minimize the overhead related to time-keeping by using a 
> version of safe unsynchronized timer.
> This was also discussed in an e-mail thread.
> http://www.nabble.com/search-timeout-tf3410206.html#a9501029




[jira] Resolved: (LUCENE-1158) DateTools UTC/GMT mismatch

2008-01-29 Thread Daniel Naber (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Naber resolved LUCENE-1158.
--

   Resolution: Fixed
Fix Version/s: 2.4
Lucene Fields: [New, Patch Available]  (was: [Patch Available, New])

Patch applied.
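
For what it's worth, the claim in the issue description that UTC and GMT do not differ in practice can be checked with plain java.util.TimeZone. This is only an illustration; the class name is made up and this is not the DateTools code.

```java
import java.util.TimeZone;

final class UtcVsGmt {
    /** True if the "UTC" and "GMT" zones report the same offset at the given instant and neither observes DST. */
    static boolean sameOffsetAt(long instantMillis) {
        TimeZone utc = TimeZone.getTimeZone("UTC");
        TimeZone gmt = TimeZone.getTimeZone("GMT");
        return utc.getOffset(instantMillis) == gmt.getOffset(instantMillis)
            && !utc.useDaylightTime()
            && !gmt.useDaylightTime();
    }
}
```

Both zones have a raw offset of zero, so formatting a date with either one should give the same text.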

> DateTools UTC/GMT mismatch
> --
>
> Key: LUCENE-1158
> URL: https://issues.apache.org/jira/browse/LUCENE-1158
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Javadocs
>Affects Versions: 2.3
>Reporter: Daniel Naber
>Priority: Minor
> Fix For: 2.4
>
> Attachments: datetools.diff
>
>
> Post from Antony Bowesman on java-user:
> -
> I just noticed that although the Javadocs for Lucene 2.2 state that the dates 
> for DateTools use UTC as a timezone, they are actually using GMT.
> Should either the Javadocs be corrected or the code corrected to use UTC 
> instead.
> -
> I'm attaching a patch that changes the javadoc and will commit it, unless 
> someone knows a reason the javadoc is correct and the code should be changed 
> to UTC. To my understanding, there's no significant difference between UTC 
> and GMT.




[jira] Commented: (LUCENE-997) Add search timeout support to Lucene

2008-01-29 Thread Sean Timm (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12563632#action_12563632
 ] 

Sean Timm commented on LUCENE-997:
--

I could go either way on the System.currentTimeMillis() versus TimerThread
issue.  I agree nanoTime is the correct implementation when using 1.5.

On Linux running NTP, it doesn't seem to matter much either way.  NTP
tries to perform smoothing and, when it can, makes clock changes slowly over a
longer period of time rather than abruptly, but YMMV if your system
is having clock issues.  On a really overloaded Windows box, the TimerThread
implementation will not behave well, as demonstrated above.  I can't speak to
any other platforms.
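
A minimal sketch of what a nanoTime-based check could look like on 1.5 or later, assuming nothing about the actual patch; the class name is invented. Elapsed time is computed by subtraction, which stays correct even if the nanoTime counter wraps around.

```java
import java.util.concurrent.TimeUnit;

final class NanoDeadline {
    private final long startNanos;
    private final long timeoutNanos;

    NanoDeadline(long startNanos, long timeout, TimeUnit unit) {
        this.startNanos = startNanos;
        this.timeoutNanos = unit.toNanos(timeout);
    }

    /** Starts timing from the current value of the monotonic clock. */
    static NanoDeadline startingNow(long timeout, TimeUnit unit) {
        return new NanoDeadline(System.nanoTime(), timeout, unit);
    }

    /** Compares elapsed time, never absolute values, so numeric overflow of the counter is harmless. */
    boolean expiredAt(long nowNanos) {
        return nowNanos - startNanos >= timeoutNanos;
    }

    boolean expired() {
        return expiredAt(System.nanoTime());
    }
}
```

Because System.nanoTime() is not tied to wall-clock time, adjustments made by NTP would not affect this check.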




[jira] Commented: (LUCENE-1151) Fix StandardAnalyzer to not mis-identify HOST as ACRONYM by default

2008-01-29 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12563625#action_12563625
 ] 

Grant Ingersoll commented on LUCENE-1151:
-

Here's the thread on JFlex for completeness, not that it affects this patch:
http://sourceforge.net/mailarchive/forum.php?thread_name=272037D7-6EA1-4D19-902F-B425A5309C2A%40apache.org&forum_name=jflex-users

> Fix StandardAnalyzer to not mis-identify HOST as ACRONYM by default
> ---
>
> Key: LUCENE-1151
> URL: https://issues.apache.org/jira/browse/LUCENE-1151
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Analysis
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.4
>
> Attachments: LUCENE-1151.patch
>
>
> Coming out of the discussion around back compatibility, it seems best to 
> default StandardAnalyzer to properly fix LUCENE-1068, while preserving the 
> ability to get the back-compatible behavior in the rare event that it's 
> desired.
> This just means changing the replaceInvalidAcronym = false with = true, and, 
> adding a clear entry to CHANGES.txt that this very slight non back compatible 
> change took place.
> Spinoff from here:
> http://www.gossamer-threads.com/lists/lucene/java-dev/57517#57517
> I'll commit that change in a day or two.




[jira] Updated: (LUCENE-997) Add search timeout support to Lucene

2008-01-29 Thread Sean Timm (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Timm updated LUCENE-997:
-

Attachment: timeout.patch

This is a minor update to timeout.patch which fixes the comment about updates
to 32-bit-sized variables being atomic and instead talks about volatile longs,
as pointed out by Andrzej.  It also computes the timeout moment up front to
save a subtraction on each document collection, as suggested by Paul.
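
The two changes can be pictured roughly as follows. This is a sketch only: `Hits`, `TimeExceeded` and the static `currentTime` field are hypothetical names, not the ones used in timeout.patch.

```java
final class TimeLimitedCollector {
    /** Hypothetical stand-in for the wrapped hit collector. */
    interface Hits { void collect(int doc, float score); }

    static final class TimeExceeded extends RuntimeException {
        TimeExceeded(String message) { super(message); }
    }

    // Written by a timer thread, read by search threads. A plain long may be
    // written as two 32-bit halves; declaring it volatile makes reads and writes atomic.
    static volatile long currentTime;

    private final Hits delegate;
    private final long timeoutMoment;  // computed once, up front

    TimeLimitedCollector(Hits delegate, long timeAllowed) {
        this.delegate = delegate;
        this.timeoutMoment = currentTime + timeAllowed;
    }

    void collect(int doc, float score) {
        // One comparison per document; no subtraction against a start time.
        if (currentTime > timeoutMoment) {
            throw new TimeExceeded("time allowed exceeded at doc " + doc);
        }
        delegate.collect(doc, score);
    }
}
```

A caller would catch the exception and return whatever hits were collected up to that point.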




Re: JBoss Cache as a store

2008-01-29 Thread mark harwood
Hi Manik,



>>> Is there a set of tests in the Lucene sources I could use to test the
>>> "JBCDirectory", as I call it?



You would probably need to adapt existing JUnit tests in
contrib/benchmark and src/test for performance and functionality
testing, respectively.

They use the existing RAMDirectory and FSDirectory Directory
implementations, so you'll need to change the test code to use your
JBCDirectory instead.



Cheers,

Mark



- Original Message 
From: Manik Surtani <[EMAIL PROTECTED]>
To: java-dev@lucene.apache.org
Sent: Tuesday, 29 January, 2008 3:38:17 PM
Subject: Re: JBoss Cache as a store

Bump.  Anyone?

On 24 Jan 2008, at 14:07, Manik Surtani wrote:

> Hi guys
>
> I've just written a plugin for Lucene to use JBoss Cache as an index
> store.  The benefits of something like this are:
>
> 1.  Faster access to indexes as they will be in memory
> 2.  Indexes replicated across a cluster of servers
> 3.  Indexes "persisted" in clustered memory - faster than
>     persistence to disk
>
> The implementation I have is pretty basic for now.
>
> Is there a set of tests in the Lucene sources I could use to test
> the "JBCDirectory", as I call it?  Perhaps some way I could
> change the "index store provider" and re-run some existing tests,
> and perhaps add some clustered tests specific to my plugin?
>
> Finally, regarding hosting, I am happy to contribute this to Lucene
> (alongside the JEDirectory, etc) but if licensing (JBoss Cache is
> LGPL, although the plugin code can be ASL if need be) or language
> levels (the plugin depends on JBoss Cache 2.x, which requires JDK 5)
> are a problem, then I'm happy to host the plugin externally.
>
> Cheers,
> --
> Manik Surtani
> Lead, JBoss Cache
> [EMAIL PROTECTED]

--
Manik Surtani
Lead, JBoss Cache
[EMAIL PROTECTED]










[jira] Created: (LUCENE-1161) Punctuation handling in StandardTokenizer (and WikipediaTokenizer)

2008-01-29 Thread Grant Ingersoll (JIRA)
Punctuation handling in StandardTokenizer (and WikipediaTokenizer)
--

        Key: LUCENE-1161
        URL: https://issues.apache.org/jira/browse/LUCENE-1161
    Project: Lucene - Java
 Issue Type: Improvement
 Components: Analysis
   Reporter: Grant Ingersoll
   Priority: Minor


It would be useful, in the StandardTokenizer, to be able to have more control
over how in-word punctuation is handled.  For instance, it is not always desirable
to split on dashes or other punctuation.  In other cases, one may want to
output the split tokens plus a collapsed version of the token that removes the
punctuation.

For example, Solr's WordDelimiterFilter provides some nice capabilities here,
but it can't do its job when using the StandardTokenizer because the
StandardTokenizer already makes the decision on how to handle it without giving
the user any choice.

I think, in JFlex, we can have a back-compatible way of letting users make
decisions about punctuation that occurs inside of a token, such as e-bay or
i-pod, thus allowing for matches on iPod and eBay.
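
To illustrate the "split tokens plus a collapsed version" idea, here is a standalone sketch in plain Java. It is not a JFlex grammar and not how StandardTokenizer or WordDelimiterFilter work internally; the class name is made up, and matching "iPod" is assumed to rely on lowercasing elsewhere in the analysis chain.

```java
import java.util.ArrayList;
import java.util.List;

final class InWordPunctuation {
    /**
     * Splits a token on runs of punctuation and, if that produced more than
     * one part, also emits the parts joined together with the punctuation removed.
     */
    static List<String> expand(String token) {
        List<String> out = new ArrayList<>();
        StringBuilder collapsed = new StringBuilder();
        for (String part : token.split("\\p{Punct}+")) {
            if (!part.isEmpty()) {   // a leading punctuation run yields an empty first part
                out.add(part);
                collapsed.append(part);
            }
        }
        if (out.size() > 1) {
            out.add(collapsed.toString());
        }
        return out;
    }
}
```

With this sketch, `expand("e-bay")` gives `e`, `bay` and `ebay`, while a token without punctuation comes back unchanged.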




Re: [jira] Commented: (LUCENE-1151) Fix StandardAnalyzer to not mis-identify HOST as ACRONYM by default

2008-01-29 Thread Grant Ingersoll


On Jan 29, 2008, at 12:10 PM, Michael McCandless (JIRA) wrote:

> [ https://issues.apache.org/jira/browse/LUCENE-1151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12563576#action_12563576 ]
>
> Michael McCandless commented on LUCENE-1151:
>
> Very good question ... I don't know.  It would be awesome (and,
> amazing) if JFlex enabled some kind of inheritance.

I asked on the JFlex user list (http://sourceforge.net/mailarchive/forum.php?forum_name=jflex-users)
but I don't see it in the docs anywhere.

> Since WikipediaTokenizer doesn't have the backwards compat
> requirement of StandardTokenizer, you can presumably just fix
> ACRONYM in WikipediaTokenizer to not match host names (ie hardwire
> to "true")?

Yes.





[jira] Commented: (LUCENE-1151) Fix StandardAnalyzer to not mis-identify HOST as ACRONYM by default

2008-01-29 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12563576#action_12563576
 ] 

Michael McCandless commented on LUCENE-1151:


Very good question ... I don't know.  It would be awesome (and, amazing) if 
JFlex enabled some kind of inheritance.

Since WikipediaTokenizer doesn't have the backwards compat requirement of 
StandardTokenizer, you can presumably just fix ACRONYM in WikipediaTokenizer to 
not match host names (ie hardwire to "true")?




Re: JBoss Cache as a store

2008-01-29 Thread Manik Surtani

Bump.  Anyone?


On 24 Jan 2008, at 14:07, Manik Surtani wrote:

> Hi guys
>
> I've just written a plugin for Lucene to use JBoss Cache as an index
> store.  The benefits of something like this are:
>
> 1.  Faster access to indexes as they will be in memory
> 2.  Indexes replicated across a cluster of servers
> 3.  Indexes "persisted" in clustered memory - faster than
>     persistence to disk
>
> The implementation I have is pretty basic for now.
>
> Is there a set of tests in the Lucene sources I could use to test
> the "JBCDirectory", as I call it?  Perhaps some way I could
> change the "index store provider" and re-run some existing tests,
> and perhaps add some clustered tests specific to my plugin?
>
> Finally, regarding hosting, I am happy to contribute this to Lucene
> (alongside the JEDirectory, etc) but if licensing (JBoss Cache is
> LGPL, although the plugin code can be ASL if need be) or language
> levels (the plugin depends on JBoss Cache 2.x, which requires JDK 5)
> are a problem, then I'm happy to host the plugin externally.
>
> Cheers,
> --
> Manik Surtani
> Lead, JBoss Cache
> [EMAIL PROTECTED]








--
Manik Surtani
Lead, JBoss Cache
[EMAIL PROTECTED]










[jira] Commented: (LUCENE-1151) Fix StandardAnalyzer to not mis-identify HOST as ACRONYM by default

2008-01-29 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12563548#action_12563548
 ] 

Grant Ingersoll commented on LUCENE-1151:
-

Not necessarily related, but can you think of a way that we can keep
WikipediaTokenizer and StandardTokenizer in sync for these kinds of things?  I
guess I need to go look in JFlex to see if there is a way to do inheritance.
Essentially, I want the WikiTokenizer to be StandardTokenizer plus handle the
Wiki syntax appropriately.




[jira] Updated: (LUCENE-1156) Wikipedia Document Generation Changes

2008-01-29 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll updated LUCENE-1156:


Attachment: LUCENE-1156.patch

This patch fixes the redirect problem and adds an option to discard image-only
documents (default is to keep them).

It does not strip the template pages, nor does it expand them.

Patch applies from contrib/benchmark
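
The behaviour described above might look roughly like the sketch below. This is an assumption-laden illustration, not the code in LUCENE-1156.patch: the class name, the "#redirect" body marker and the "Image:" title prefix are guesses at how such pages could be recognised.

```java
final class WikiPageFilter {
    private final boolean keepImageOnlyDocs;

    /** Default mirrors the description above: image-only documents are kept. */
    WikiPageFilter() { this(true); }

    WikiPageFilter(boolean keepImageOnlyDocs) {
        this.keepImageOnlyDocs = keepImageOnlyDocs;
    }

    /** Case-insensitive, so REDIRECT, redirect and Redirect are all caught. */
    static boolean isRedirect(String body) {
        String marker = "#redirect";  // assumed marker
        return body.trim().regionMatches(true, 0, marker, 0, marker.length());
    }

    /** Assumed convention: image pages have titles starting with "Image:". */
    static boolean isImageOnly(String title) {
        return title.startsWith("Image:");
    }

    /** Template pages are deliberately left alone, as in the patch description. */
    boolean accept(String title, String body) {
        if (isRedirect(body)) {
            return false;
        }
        if (!keepImageOnlyDocs && isImageOnly(title)) {
            return false;
        }
        return true;
    }
}
```

Keeping the default at "keep" means existing benchmark runs would see the same image documents as before unless the option is switched on.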

> Wikipedia Document Generation Changes
> -
>
> Key: LUCENE-1156
> URL: https://issues.apache.org/jira/browse/LUCENE-1156
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: contrib/benchmark, contrib/wikipedia
>Reporter: Grant Ingersoll
>Assignee: Grant Ingersoll
>Priority: Minor
> Attachments: LUCENE-1156.patch
>
>
> The EnwikiDocMaker currently produces a fair number of documents that are in
> the download, but are of dubious use in terms of both benchmarking and
> indexing.
> These issues are:
> # Redirects (it currently only handles REDIRECT and redirect, but there are
> documents marked as Redirect)
> # Template files appear to be useless.  These are marked by the term
> Template: at the beginning of the body.  See for example:
> http://en.wikipedia.org/wiki/Template:=)
> # Image-only pages, as in
> http://en.wikipedia.org/wiki/Image:Sciencefieldnewark.jpg.jpg  These are
> about as useful as the Redirects and Templates
> # Files pending deletion:  This one is a bit trickier to handle, but they are
> generally marked by "Wikipedia:Votes for deletion" or some variation of that
> depending on how far along the page is in being deleted
> I think I can implement this such that it is backward compatible, if there is
> such a need when it comes to the contrib/benchmark suite.




[jira] Updated: (LUCENE-1154) System Reqs page should be release specific

2008-01-29 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll updated LUCENE-1154:


Fix Version/s: 3.0

We can keep the existing one until 3.0 is released.

> System Reqs page should be release specific
> ---
>
> Key: LUCENE-1154
> URL: https://issues.apache.org/jira/browse/LUCENE-1154
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Website
>Reporter: Grant Ingersoll
>Assignee: Grant Ingersoll
>Priority: Trivial
> Fix For: 3.0
>
>
> The System Requirements page, currently under the Main->Resources section of
> the website, should be part of a given version's documentation, since it will
> be changing for a given release.
> I will "deprecate" the existing one, but leave it in place (with a message) to
> cover the existing releases that don't have this, but will also add it to the
> release docs for future releases.




[jira] Resolved: (LUCENE-1132) Highlighter Documentation updates

2008-01-29 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll resolved LUCENE-1132.
-

   Resolution: Fixed
Lucene Fields:   (was: [New])

Committed revision 616305.
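
As a quick illustration of why the bytes/chars wording matters, the sketch below counts both for the same string and splits text by char count. It is a naive stand-in with a made-up class name, not SimpleFragmenter's actual logic.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class CharsVsBytes {
    /** Size in UTF-16 chars, which is what String-based fragment sizes measure. */
    static int chars(String s) {
        return s.length();
    }

    /** Size of the same text once encoded as UTF-8 bytes. */
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    /** Naive fixed-size split, measured in chars. */
    static List<String> split(String text, int fragmentChars) {
        if (fragmentChars <= 0) {
            throw new IllegalArgumentException("fragmentChars must be positive");
        }
        List<String> out = new ArrayList<>();
        for (int i = 0; i < text.length(); i += fragmentChars) {
            out.add(text.substring(i, Math.min(text.length(), i + fragmentChars)));
        }
        return out;
    }
}
```

For pure ASCII text the two counts agree, which is probably why the wording went unnoticed; they diverge as soon as the text contains accented or non-Latin characters.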

> Highlighter Documentation updates
> -
>
> Key: LUCENE-1132
> URL: https://issues.apache.org/jira/browse/LUCENE-1132
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: contrib/*
>Reporter: Grant Ingersoll
>Assignee: Grant Ingersoll
>Priority: Trivial
> Attachments: LUCENE-1132.patch
>
>
> Various places in the Highlighter documentation refer to bytes (e.g. 
> SimpleFragmenter) when they should refer to chars.  See 
> http://www.gossamer-threads.com/lists/lucene/java-user/56986




[jira] Updated: (LUCENE-1151) Fix StandardAnalyzer to not mis-identify HOST as ACRONYM by default

2008-01-29 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-1151:
---

Attachment: LUCENE-1151.patch

Attached a patch that fixes the original bug (LUCENE-1068) by default, but offers 
a system property and a static method to keep the backwards-compatible yet buggy 
behavior.

I'll commit in a day or two.
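
A minimal sketch of the "fixed by default, opt out via system property or static method" pattern described above. The class name and property name are hypothetical and chosen only for illustration; they are not taken from the patch.

```java
// Hypothetical sketch: names here are illustrative, not Lucene's actual API.
final class AcronymDefaultsSketch {
    // Assumed property name, invented for this example.
    static final String PROPERTY = "example.replaceInvalidAcronym";

    // The fixed behavior is the default; only an explicit "false" in the
    // system property selects the old, back-compatible behavior.
    private static volatile boolean replaceInvalidAcronym =
        !"false".equalsIgnoreCase(System.getProperty(PROPERTY));

    static boolean getDefault() {
        return replaceInvalidAcronym;
    }

    // Static override for callers that cannot set a system property.
    static void setDefault(boolean value) {
        replaceInvalidAcronym = value;
    }
}
```

With this shape, the property is read once at class load, and the static setter can still change the default afterwards.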

> Fix StandardAnalyzer to not mis-identify HOST as ACRONYM by default
> ---
>
> Key: LUCENE-1151
> URL: https://issues.apache.org/jira/browse/LUCENE-1151
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Analysis
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.4
>
> Attachments: LUCENE-1151.patch
>
>
> Coming out of the discussion around back compatibility, it seems best to 
> default StandardAnalyzer to properly fix LUCENE-1068, while preserving the 
> ability to get the back-compatible behavior in the rare event that it's 
> desired.
> This just means changing replaceInvalidAcronym = false to = true, and 
> adding a clear entry to CHANGES.txt stating that this very slight 
> non-back-compatible change took place.
> Spinoff from here:
> http://www.gossamer-threads.com/lists/lucene/java-dev/57517#57517
> I'll commit that change in a day or two.




[jira] Resolved: (LUCENE-1150) The token types of the standard tokenizer is not accessible

2008-01-29 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved LUCENE-1150.


   Resolution: Fixed
Fix Version/s: 2.4

I just committed this.  Thanks for opening this Nicolas!

> The token types of the standard tokenizer is not accessible
> ---
>
> Key: LUCENE-1150
> URL: https://issues.apache.org/jira/browse/LUCENE-1150
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Analysis
>Affects Versions: 2.3
>Reporter: Nicolas Lalevée
>Assignee: Michael McCandless
> Fix For: 2.4
>
> Attachments: LUCENE-1150.patch, LUCENE-1150.take2.patch
>
>
> Because StandardTokenizerImpl is not public, these token types are not 
> accessible:
> {code:java}
> public static final int ALPHANUM    = 0;
> public static final int APOSTROPHE  = 1;
> public static final int ACRONYM     = 2;
> public static final int COMPANY     = 3;
> public static final int EMAIL       = 4;
> public static final int HOST        = 5;
> public static final int NUM         = 6;
> public static final int CJ          = 7;
> /**
>  * @deprecated this solves a bug where HOSTs that end with '.' are identified
>  * as ACRONYMs. It is deprecated and will be removed in the next
>  * release.
>  */
> public static final int ACRONYM_DEP = 8;
> public static final String [] TOKEN_TYPES = new String [] {
>     "<ALPHANUM>",
>     "<APOSTROPHE>",
>     "<ACRONYM>",
>     "<COMPANY>",
>     "<EMAIL>",
>     "<HOST>",
>     "<NUM>",
>     "<CJ>",
>     "<ACRONYM_DEP>"
> };
> {code}
> So no custom TokenFilter can be based on the token type. In fact, even the 
> StandardFilter cannot be written outside the 
> org.apache.lucene.analysis.standard package.
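
To illustrate the kind of filter the report has in mind, here is a self-contained sketch that drops tokens by type. TypedToken and TokenTypeFilterSketch are hypothetical stand-ins rather than Lucene classes, and the type labels are assumed values used only for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a token that carries a type label.
final class TypedToken {
    final String text;
    final String type;

    TypedToken(String text, String type) {
        this.text = text;
        this.type = type;
    }
}

final class TokenTypeFilterSketch {
    // Assumed type labels, used here only for illustration.
    static final String ALPHANUM = "<ALPHANUM>";
    static final String HOST = "<HOST>";

    // Returns the tokens whose type differs from the given type, in order.
    static List<TypedToken> dropType(List<TypedToken> tokens, String type) {
        List<TypedToken> kept = new ArrayList<>();
        for (TypedToken t : tokens) {
            if (!type.equals(t.type)) {
                kept.add(t);
            }
        }
        return kept;
    }
}
```

A filter like this needs the type constants to be visible to code outside the tokenizer's package, which is the accessibility problem the issue describes.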




[jira] Updated: (LUCENE-1160) MergeException from CMS threads should record the Directory

2008-01-29 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-1160:
---

Attachment: LUCENE-1160.patch

> MergeException from CMS threads should record the Directory
> ---
>
> Key: LUCENE-1160
> URL: https://issues.apache.org/jira/browse/LUCENE-1160
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: 2.3
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.4
>
> Attachments: LUCENE-1160.patch
>
>
> When you hit an unhandled exception in ConcurrentMergeScheduler, it
> throws a MergePolicy.MergeException, but there's no easy way to figure
> out which index caused this (if you have more than one).
> I plan to add the Directory to the MergeException.  I also made a few
> other small changes to ConcurrentMergeScheduler:
>   * Added handleMergeException method, which is called on exception,
> so that you can subclass ConcurrentMergeScheduler to do something
> when an exception occurs.
>   * Added getMergeThread() method so you can override how the threads
> are created (eg, if you want to make them in a different thread
> group, use a pool, change priorities, etc.).
>   * Added doMerge(...) to actually do this merge, so you can do
> something before starting and after finishing a merge.
>   * Changed private -> protected on a few attrs
> I plan to commit in a day or two.




[jira] Updated: (LUCENE-1160) MergeException from CMS threads should record the Directory

2008-01-29 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-1160:
---

Attachment: (was: LUCENE-1150.patch)

> MergeException from CMS threads should record the Directory
> ---
>
> Key: LUCENE-1160
> URL: https://issues.apache.org/jira/browse/LUCENE-1160
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: 2.3
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.4
>
> Attachments: LUCENE-1160.patch
>
>
> When you hit an unhandled exception in ConcurrentMergeScheduler, it
> throws a MergePolicy.MergeException, but there's no easy way to figure
> out which index caused this (if you have more than one).
> I plan to add the Directory to the MergeException.  I also made a few
> other small changes to ConcurrentMergeScheduler:
>   * Added handleMergeException method, which is called on exception,
> so that you can subclass ConcurrentMergeScheduler to do something
> when an exception occurs.
>   * Added getMergeThread() method so you can override how the threads
> are created (eg, if you want to make them in a different thread
> group, use a pool, change priorities, etc.).
>   * Added doMerge(...) to actually do this merge, so you can do
> something before starting and after finishing a merge.
>   * Changed private -> protected on a few attrs
> I plan to commit in a day or two.




[jira] Updated: (LUCENE-1160) MergeException from CMS threads should record the Directory

2008-01-29 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-1160:
---

Attachment: LUCENE-1150.patch

> MergeException from CMS threads should record the Directory
> ---
>
> Key: LUCENE-1160
> URL: https://issues.apache.org/jira/browse/LUCENE-1160
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: 2.3
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.4
>
> Attachments: LUCENE-1150.patch
>
>
> When you hit an unhandled exception in ConcurrentMergeScheduler, it
> throws a MergePolicy.MergeException, but there's no easy way to figure
> out which index caused this (if you have more than one).
> I plan to add the Directory to the MergeException.  I also made a few
> other small changes to ConcurrentMergeScheduler:
>   * Added handleMergeException method, which is called on exception,
> so that you can subclass ConcurrentMergeScheduler to do something
> when an exception occurs.
>   * Added getMergeThread() method so you can override how the threads
> are created (eg, if you want to make them in a different thread
> group, use a pool, change priorities, etc.).
>   * Added doMerge(...) to actually do this merge, so you can do
> something before starting and after finishing a merge.
>   * Changed private -> protected on a few attrs
> I plan to commit in a day or two.




[jira] Created: (LUCENE-1160) MergeException from CMS threads should record the Directory

2008-01-29 Thread Michael McCandless (JIRA)
MergeException from CMS threads should record the Directory
---

 Key: LUCENE-1160
 URL: https://issues.apache.org/jira/browse/LUCENE-1160
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: 2.3
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.4
 Attachments: LUCENE-1150.patch

When you hit an unhandled exception in ConcurrentMergeScheduler, it
throws a MergePolicy.MergeException, but there's no easy way to figure
out which index caused this (if you have more than one).

I plan to add the Directory to the MergeException.  I also made a few
other small changes to ConcurrentMergeScheduler:

  * Added handleMergeException method, which is called on exception,
so that you can subclass ConcurrentMergeScheduler to do something
when an exception occurs.

  * Added getMergeThread() method so you can override how the threads
are created (eg, if you want to make them in a different thread
group, use a pool, change priorities, etc.).

  * Added doMerge(...) to actually do this merge, so you can do
something before starting and after finishing a merge.

  * Changed private -> protected on a few attrs

I plan to commit in a day or two.
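
The hooks listed above can be sketched in isolation as follows. This is not the patch: the class names are hypothetical, a String stands in for Lucene's Directory, and the merge runs synchronously rather than on a merge thread.

```java
// Hypothetical exception that records which index directory failed.
final class DirMergeException extends RuntimeException {
    private final String directory; // stand-in for Lucene's Directory

    DirMergeException(Throwable cause, String directory) {
        super(cause);
        this.directory = directory;
    }

    String getDirectory() {
        return directory;
    }
}

// Hypothetical scheduler exposing overridable hooks around a merge.
class SketchMergeScheduler {
    private final String directory;

    SketchMergeScheduler(String directory) {
        this.directory = directory;
    }

    // Hook: subclasses can add work before and after the merge runs.
    protected void doMerge(Runnable merge) {
        merge.run();
    }

    // Hook: called when a merge throws; the default wraps the cause
    // together with the directory so callers can tell indexes apart.
    protected void handleMergeException(Throwable exc) {
        throw new DirMergeException(exc, directory);
    }

    final void merge(Runnable merge) {
        try {
            doMerge(merge);
        } catch (RuntimeException e) {
            handleMergeException(e);
        }
    }
}
```

A subclass could override handleMergeException to log and continue instead of rethrowing, which is the kind of customization the new method is meant to allow.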

