[jira] Commented: (LUCENE-888) Improve indexing performance by increasing internal buffer sizes

2007-05-25 Thread John Haxby (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12499044
 ] 

John Haxby commented on LUCENE-888:
---

> Net/net it's between 10-18% performance gain overall. It is
> interesting that the system with the "weakest" IO system (one drive on
> Windows XP vs RAID 0/5 on the others) has the best gains.

Actually, it's not that surprising.  Linux and BSD (MacOS) kernels work hard to 
do good I/O without the user having to do that much to take it into account.   
The improvement you're seeing in those systems is as much to do with the fact 
that you're dealing with complete file system block sizes (4x4k) and complete 
VM page sizes (4x4k).   You'd probably see similar gains just going from 1k to 
4k though: even "cp" benefits from using a 4k block size rather than 1k.  I'd 
guess that a 4k or 8k buffer would be best on Linux/MacOS and that you wouldn't 
see much difference going to 16k.  In fact, in the MacOS tests the big jump 
seems to be from 1k to 4k with smaller improvements thereafter.
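
One part of the buffer-size effect is easy to see in isolation: a bigger buffer means fewer calls reaching the layer underneath. The sketch below is only an illustration using java.io.BufferedOutputStream as a stand-in; it is not Lucene's BufferedIndexOutput, and the class and method names (BufferSizeDemo, CountingSink, underlyingWrites) are made up for the example. It counts calls only and says nothing about file system block or VM page alignment, which is the larger effect described above.

```java
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.OutputStream;

final class BufferSizeDemo {
    /** Sink that discards the data but counts how many write calls reach it. */
    static final class CountingSink extends OutputStream {
        long calls;
        @Override public void write(int b) { calls++; }
        @Override public void write(byte[] b, int off, int len) { calls++; }
    }

    /** Pushes totalBytes through a buffer in 16-byte records; returns the number of underlying writes. */
    static long underlyingWrites(int bufferSize, int totalBytes) {
        CountingSink sink = new CountingSink();
        byte[] record = new byte[16];
        try (OutputStream out = new BufferedOutputStream(sink, bufferSize)) {
            for (int written = 0; written < totalBytes; written += record.length) {
                out.write(record, 0, record.length);
            }
        } catch (IOException e) {
            throw new IllegalStateException(e); // the sink never throws; kept for completeness
        }
        return sink.calls;
    }

    public static void main(String[] args) {
        int oneMegabyte = 1 << 20;
        for (int size : new int[] {1024, 4096, 8192, 16384}) {
            System.out.println(size + " byte buffer: "
                    + underlyingWrites(size, oneMegabyte) + " underlying writes");
        }
    }
}
```

The number of underlying writes falls in proportion to the buffer size, so on this measure alone each doubling saves half as many calls as the one before it.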

I'm not that surprised by the WinXP changes: the I/O subsystem on a laptop is 
usually dire and anything that will cut down on the I/O is going to be a big 
help.  I would expect that the difference would be more dramatic with a FAT32 
file system than it would be with NTFS though.

> Improve indexing performance by increasing internal buffer sizes
> 
>
> Key: LUCENE-888
> URL: https://issues.apache.org/jira/browse/LUCENE-888
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: 2.1
>Reporter: Michael McCandless
> Assigned To: Michael McCandless
>Priority: Minor
>
> In working on LUCENE-843, I noticed that two buffer sizes have a
> substantial impact on overall indexing performance.
> First is BufferedIndexOutput.BUFFER_SIZE (also used by
> BufferedIndexInput).  Second is CompoundFileWriter's buffer used to
> actually build the compound file.  Both are now 1 KB (1024 bytes).
> I ran the same indexing test I'm using for LUCENE-843.  I'm indexing
> ~5,500 byte plain text docs derived from the Europarl corpus
> (English).  I index 200,000 docs with compound file enabled and term
> vector positions & offsets stored plus stored fields.  I flush
> documents at 16 MB RAM usage, and I set maxBufferedDocs carefully to
> not hit LUCENE-845.  The resulting index is 1.7 GB.  The index is not
> optimized in the end and I left mergeFactor @ 10.
> I ran the tests on a quad-core OS X 10 machine with 4-drive RAID 0 IO
> system.
> At 1 KB (current Lucene trunk) it takes 622 sec to build the index; if
> I increase both buffers to 8 KB it takes 554 sec to build the index,
> which is an 11% overall gain!
> I will run more tests to see if there is a natural knee in the curve
> (buffer size above which we don't really gain much more performance).
> I'm guessing we should leave BufferedIndexInput's default BUFFER_SIZE
> at 1024, at least for now.  During searching there can be quite a few
> of this class instantiated, and likely a larger buffer size for the
> freq/prox streams could actually hurt search performance for those
> searches that use skipping.
> The CompoundFileWriter buffer is created only briefly, so I think we
> can use a fairly large (32 KB?) buffer there.  And there should not be
> too many BufferedIndexOutputs alive at once so I think a large-ish
> buffer (16 KB?) should be OK.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: [att: pmc] [off topic] ezmlm and reply-to

2006-07-24 Thread John Haxby

Steven Rowe wrote:

If you do want to add a reply-to list header, put ``reply-to'' into
DIR/headerremove, and ``Reply-To: [EMAIL PROTECTED]'' into DIR/headeradd.


My guess, given that the ``Reply-To: [EMAIL PROTECTED]'' header is already 
inserted into the header, is that putting ``reply-to'' into 
DIR/headerremove will fix the problem for you.



I think that that is necessary because ...

Mailing-List: contact [EMAIL PROTECTED]; run by ezmlm
List-Id: 
Reply-To: java-dev@lucene.apache.org
Date: Sun, 23 Jul 2006 19:10:47 +0200
From: "Simon Willnauer" <[EMAIL PROTECTED]>
Reply-To: [EMAIL PROTECTED]
To: java-dev@lucene.apache.org
Subject: Gdata - opening/closing index
.. that's not a legal header.   RFC2822 says that you can only have one 
Reply-To: header.   If the mailing list manager isn't deleting the 
original then it really should be merging them (you can have more than 
one address).   The fact that some mailers choose the first Reply-To: 
and some choose the second (or last) is not the problem -- if you don't 
have a legal header then any interpretation is reasonable.
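
For illustration, merging could look something like the sketch below. It is not how ezmlm works; it assumes the header lines have already been unfolded (one field per line), it splits addresses naively on commas, and the class name ReplyToMerger and the example.org addresses are invented for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class ReplyToMerger {
    /** Collapses every Reply-To line into a single one, keeping all other lines in order. */
    static List<String> mergeReplyTo(List<String> headerLines) {
        Set<String> addresses = new LinkedHashSet<>(); // keeps first-seen order, drops duplicates
        List<String> result = new ArrayList<>();
        int firstIndex = -1; // where the first Reply-To line appeared
        for (String line : headerLines) {
            int colon = line.indexOf(':');
            String name = colon < 0 ? "" : line.substring(0, colon).trim();
            if (name.equalsIgnoreCase("Reply-To")) {
                if (firstIndex < 0) {
                    firstIndex = result.size();
                }
                // Naive split: a quoted display name containing a comma would break this.
                for (String address : line.substring(colon + 1).split(",")) {
                    if (!address.trim().isEmpty()) {
                        addresses.add(address.trim());
                    }
                }
            } else {
                result.add(line);
            }
        }
        if (firstIndex >= 0) {
            result.add(firstIndex, "Reply-To: " + String.join(", ", addresses));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> headers = new ArrayList<>();
        headers.add("Reply-To: list@example.org");
        headers.add("From: sender@example.org");
        headers.add("Reply-To: sender@example.org");
        mergeReplyTo(headers).forEach(System.out::println);
    }
}
```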


jch




Re: Results (Re: Survey: Lucene and Java 1.4 vs. 1.5)

2006-06-22 Thread John Haxby

DM Smith wrote:
> I simply meant that the change that is being made should be done in 
> such a way that one applying the patch can readily see what is being 
> changed. The most common case of unnecessary change is that of 
> whitespace. Changing indentation, changing the placement of curly 
> braces, reordering methods and variables and so forth are all 
> unnecessary.
>
> [snip]
> Such a change is most likely unnecessary.

Others, probably including me, would disagree.   Changes to make the 
source have a consistent style and a consistent layout are not 
uncommon.   Look through the Linux kernel change logs for "whitespace 
clean up" (or "white space" and "cleanup", spaces are optional :-)).   
The GNU glibc maintainers will reject patches that do not conform to the 
coding style for glibc -- and that includes stylistic choices like the 
ones you mentioned (that I cut in the interests of brevity).


Style may make no functional difference to the code but it does affect 
maintainability.  It may well also affect correctness.  You could 
declare all your variables as "Object" and simply cast to the right type 
to get the method you want.  There would be no functional difference 
(one could argue that eliminating run-time type checking is merely an 
optimisation) but would you seriously want to code this way?


Similarly, and I'm struggling to keep vaguely on-topic here, the Java 
1.5 iteration constructs are functionally no different to their 1.4 
equivalent.   But to dismiss the 1.5 changes as "syntactic sugar" or 
"fluff" is to denigrate their importance to the reliability and 
maintenance of software.   If you declared all your variables as 
"Object" would your code be more reliable, about the same or less?   
(That's a rhetorical question, I hope.)


jch




Re: Results (Re: Survey: Lucene and Java 1.4 vs. 1.5)

2006-06-22 Thread John Haxby

DM Smith wrote:
> Generally open source projects have a policy to change as little of 
> the file as possible, only changing what is necessary.

Hmmm.   Necessary by what criterion?   Necessary to make, say, Lucene 
exploit the new iterator constructs to avoid run-time type-checking?   
Necessary to make the code more readable?   Necessary to prevent use 
with Java 1.4? :-)


I'm not sure I've ever seen a policy expressed in that way -- patches 
generally should be clear, concise and do what they're intended to do, 
but that doesn't necessarily mean minimising the size of the patch and 
it doesn't necessarily mean keeping the source compatible with some old 
compiler or environment.


Indeed, to be hypothetical and not entirely off-topic, it is easy to 
imagine two patches that are better than one.  For the two patches, 
reorganise a class so that it exploits Java 1.5 features and the "real" 
patch that uses that new structure to cleanly and elegantly implement 
some new feature.   For the one patch, leave the code compatible with 
1.4, but the functional patch is now much larger, more complex and 
harder to verify.


It's possible that such a hypothetical patch (or pair of patches) is at 
the core of the question of this thread.  Does one embrace new language 
features because they provide some tangible benefit to the 
maintainability, functionality or complexity of the code?   If so, for 
how long?   Should Lucene development freeze at 1.4 until there's no 
working hardware that runs 1.4?  Would that also preclude changes to 
Lucene that make it work dramatically better on a machine with the 
current de-facto "standard" memory?   What happens when that's, say, 4Gb 
and some old hardware simply won't let you install that much memory?


Would it not be better to freeze application development that needs an 
old environment and simply back-port bug fixes and, where it makes 
sense, functionality to the version of Lucene that is used in that 
environment?


The approach taken by Red Hat with their Enterprise Linux series is that 
they'll support a version of the platform for several years, 
back-porting bug fixes, adding small, incremental functional changes and 
so on.  That means that this antique computer that happily runs RHEL3 
will be able to carry on running an OS and applications that work on 
that hardware until it finally gives up the ghost.


jch




Re: Results (Re: Survey: Lucene and Java 1.4 vs. 1.5)

2006-06-20 Thread John Haxby

Robert Engels wrote:

To set the record straight, I think the Lucene product and community are
fantastic. Period.
  

Ditto.

[snip]
After almost 2 years I now back the move. Why? Several reasons:

1. Sun is very slow, if at all to fix bugs in 1.4 (of which there are many).
For example, the current problems in Lucene regarding ThreadLocals. Although
this is not a bug per se, it is probably not intuitive or desired behavior.
The Lucene developers have been forced to both diagnose and create
workarounds "problems" already fixed in 1.5. The licensing of Java does not
allow for the easy fix bugs by non-Sun developers.
2. The type safe collections are far more efficient to program/debug with.
3. The standardized concurrent facilities can be of great benefit to
multithreaded programs.
4. It is what students graduating from college understand and use.
5. It is what the currently available books explain and use.
  
For my money (2) is the most important reason for moving to 1.5.   My 
early years (!) involved a lot of work with programming languages and 
finally having type-safe collections and the syntactic conventions that 
go along with those immediately struck me as a good step forward.   That 
first impression was borne out when I converted some of my java code to 
use the 1.5 type-safe constructs.   Not only was the code shorter and 
more understandable (aka more maintainable) but it brought to light some 
bugs that had lain there dormant for quite a while.
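
A small made-up example of the kind of dormant bug meant here (the class and method names are hypothetical, not taken from any real code base): with a raw, 1.4-style list, a stray non-String element is only noticed when the cast runs, whereas the 1.5 version cannot be handed such a list without the compiler objecting.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

final class TypeSafetyDemo {
    /** Java 1.4 style: raw list, explicit iterator, cast on every read. */
    @SuppressWarnings("rawtypes")
    static int totalLengthRaw(List names) {
        int total = 0;
        for (Iterator it = names.iterator(); it.hasNext();) {
            String name = (String) it.next(); // fails only at run time if a non-String slipped in
            total += name.length();
        }
        return total;
    }

    /** Java 1.5 style: the element type is checked by the compiler, no cast needed. */
    static int totalLength(List<String> names) {
        int total = 0;
        for (String name : names) {
            total += name.length();
        }
        return total;
    }

    public static void main(String[] args) {
        List<String> names = new ArrayList<>();
        names.add("lucene");
        names.add("java");
        System.out.println(totalLength(names));
        // names.add(Integer.valueOf(42)); // would not compile: the bug is caught before it can lie dormant
    }
}
```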


Lucene is well-tested and stable and written by people better at 
writing Java than me so it's unlikely that there'll be any bugs like 
that lurking in dark corners.  On the other hand, new code stands a 
better chance of being bug free precisely because of the improvements in 
the language and the improved type-safety.


(3) comes second for me though -- I'm a big fan of Doug Lea's 
util.concurrent classes and having them well integrated in Java 1.5 
makes them even better, but that's the operating system personality talking.


jch




Re: wildcard search with variable length

2006-02-22 Thread John Haxby

Doug Cutting wrote:


DM Smith wrote:

Personally, I don't want an either/or. I want a both/and. Modern unix 
shells provide both/and, albeit with different syntax.


I see this more as a feature request than an argument as to the 
usefulness or properness of either. Both are useful. Both are proper. 
Both are intuitive. Both are counterintuitive. It all depends on your 
"tradition".


+1

Doug


Doesn't the RegexQuery do this for you?

jch




Re: wildcard search with variable length

2006-02-22 Thread John Haxby

Andrzej Bialecki wrote:


Tiago Silveira wrote:

IMHO, using "cat cat?" or even "cat cat? cat??" is so simple that it doesn't
justify keeping the old, undocumented, arguably incorrect behavior.


I have a different view on this issue - IMHO treating "?" as "exactly 
one character" is counterintuitive for people familiar with the use of 
wildcards: in all popular regular expression languages, and also in 
DTD/XML world, a single "?" metacharacter means "zero or one", which 
is probably why the original behavior was introduced (or at least it 
was more compatible with the use of "?" in other contexts).


Ahh.   Well.   If "cat?" is a regular expression then it will match "ca" 
and "cat".   "cat??" is probably not a valid regular expression: the 
final ? means "one or zero occurrences of t?" which means that it too 
matches "ca" and "cat".   However, the javadoc defines "?" and its 
definition matches the shell glob definition and it's quite clear that 
WildcardQuery is not a RegexQuery just from the docs.
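
The regular-expression reading is easy to check with java.util.regex, which is used here only as one convenient engine and is not necessarily what any Lucene query class uses. In this particular engine "cat??" is accepted: the second "?" makes the first one reluctant, and as a whole-string match it still accepts exactly "ca" and "cat". Other regex dialects may treat "??" differently. The class name RegexQuestionMark is made up for the example.

```java
import java.util.regex.Pattern;

final class RegexQuestionMark {
    /** True if the whole input matches the regular expression. */
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        for (String regex : new String[] {"cat?", "cat??"}) {
            for (String term : new String[] {"ca", "cat", "cats"}) {
                System.out.println(regex + " vs " + term + ": " + fullMatch(regex, term));
            }
        }
    }
}
```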


I can't comment about the wildcard character in a DTD/XML context, I'm not 
that familiar with it.


jch





Re: wildcard search with variable length

2006-02-22 Thread John Haxby

Tiago Silveira wrote:


IMHO, using "cat cat?" or even "cat cat? cat??" is so simple that it doesn't
justify keeping the old, undocumented, arguably incorrect behavior.
 

I don't think there's any question of the old behaviour being incorrect 
-- the javadoc says that ? matches a single character, not zero or one 
characters, a single character.


On the other hand, does Erik's new RegexQuery support "cat.?" (the ".?" 
does match zero or one characters)?   (Where's the javadoc for that?   
I don't see any comments in the source, let alone anything else :-))
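
To make the difference concrete, here is a rough sketch that translates the documented wildcard meaning ("?" is exactly one character, "*" is any run of characters) into a java.util.regex pattern and compares it with the regular expression "cat.?". It is not WildcardQuery's implementation, it says nothing about which syntax RegexQuery actually accepts, and the names WildcardVsRegex, wildcardToRegex and wildcardMatches are invented for the example.

```java
import java.util.regex.Pattern;

final class WildcardVsRegex {
    /** Translates a shell-style wildcard into an equivalent regular expression. */
    static String wildcardToRegex(String wildcard) {
        StringBuilder regex = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '?') {
                regex.append('.');          // exactly one character
            } else if (c == '*') {
                regex.append(".*");         // any run of characters, including none
            } else {
                regex.append(Pattern.quote(String.valueOf(c))); // literal character
            }
        }
        return regex.toString();
    }

    /** True if the whole term matches the wildcard pattern. */
    static boolean wildcardMatches(String wildcard, String term) {
        return Pattern.matches(wildcardToRegex(wildcard), term);
    }

    public static void main(String[] args) {
        System.out.println("wildcard cat? vs cat:  " + wildcardMatches("cat?", "cat"));
        System.out.println("wildcard cat? vs cats: " + wildcardMatches("cat?", "cats"));
        System.out.println("regex cat.? vs cat:    " + Pattern.matches("cat.?", "cat"));
        System.out.println("regex cat.? vs cats:   " + Pattern.matches("cat.?", "cats"));
    }
}
```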


jch




[jira] Commented: (LUCENE-497) update copyright (and licence) prior to release of 1.9

2006-02-15 Thread John Haxby (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-497?page=comments#action_12366465 ] 

John Haxby commented on LUCENE-497:
---

It's not as if I'm a lawyer either or what I say is likely to carry much 
weight, but what Yonik and Erik say matches what the legal people at my 
previous employer (HP) said -- don't change the copyright date unless the file 
has changed.

The underlying reason, it was explained to me, is that if you blindly claim 
copyright for years when you didn't do anything then a judge (if it came down 
to that) is going to take the view that your copyright notices don't actually 
have much value.   It's not, they said, the actual form of the copyright notice 
that matters; it's convincing a judge that you do hold the copyright.   To that end, I asked, 
a good copyright notice is a help and a bad copyright notice is a hindrance.   
Yes, came the answer.

> update copyright (and licence) prior to release of 1.9
> --
>
>  Key: LUCENE-497
>  URL: http://issues.apache.org/jira/browse/LUCENE-497
>  Project: Lucene - Java
> Type: New Feature
> Reporter: Hoss Man
> Priority: Minor

>
> As discussed in email earlier today, it wouldn't hurt to update the Copyright 
> on all of the source files before release 1.9.
> Rather then try to submit a path with all the changes, here's a oneliner that 
> should work on any unix box to update in mass.  If it sees a Copyright string 
> it recognizes, it preserves the start year and adds/replaces the end year...
> find -name \*.java | xargs perl -pi -e 's/Copyright (\(c\) 
> )?(200[0-5])(-\d+)? (The )?Apache Software Foundation/Copyright ${2}-2006 The 
> Apache Software Foundation/;'
> ...it would make sense for someone with commit permissions to run that 
> themselves.
> It also cleans up a few that have a " (c) " in them that doesn't seem 
> standard across the rest of the files, and makes sure that the ASF is refered 
> to as "The" ASF.
> Even after all that, there are a few that may need cleaned up by hand...
> ./src/test/org/apache/lucene/store/TestLock.java: * Copyright (c) 2001,2004 
> The Apache Software Foundation.  All rights
> ./src/test-deprecated/org/apache/lucene/index/DocHelper.java: * Copyright 
> 2004.  Center For Natural Language Processing
> ./contrib/analyzers/src/java/org/apache/lucene/analysis/cn/ChineseAnalyzer.java:
>  * Copyright:   Copyright (c) 2001
> ./contrib/analyzers/src/java/org/apache/lucene/analysis/cn/ChineseFilter.java:
>  * Copyright:Copyright (c) 2001
> ./contrib/analyzers/src/java/org/apache/lucene/analysis/cn/ChineseTokenizer.java:
>  * Copyright:   Copyright (c) 2001
> ...the first is just an anoying format, the rest either have non ASF 
> copyrights, or dual copyrights (!?)
> It also may be a good time to take a look at all the (non-JavaCC generated) 
> java files that don't mention the Apache License, Version 2.0 ...
> @asimov:~/svn/lucene/java$ find src -name \*.java | xargs grep -L "Generated 
> By:JavaCC" | xargs grep -L LICENSE-2.0
> src/java/org/apache/lucene/search/SortComparatorSource.java
> src/java/org/apache/lucene/search/SortComparator.java
> src/test/org/apache/lucene/index/TestTermVectorsReader.java
> src/test/org/apache/lucene/index/TestSegmentTermEnum.java
> src/test/org/apache/lucene/index/TestFieldInfos.java
> src/test/org/apache/lucene/index/TestIndexWriter.java
> src/test/org/apache/lucene/store/TestLock.java
> src/test/org/apache/lucene/store/_TestHelper.java
> src/test/org/apache/lucene/search/TestRangeQuery.java
> src/test/org/apache/lucene/TestHitIterator.java
> src/test/org/apache/lucene/document/TestBigBinary.java
> src/test/org/apache/lucene/analysis/TestISOLatin1AccentFilter.java
> src/test-deprecated/org/apache/lucene/index/TestTermVectorsReader.java
> src/test-deprecated/org/apache/lucene/index/store/FSDirectoryTestCase.java
> src/test-deprecated/org/apache/lucene/index/TestSegmentTermEnum.java
> src/test-deprecated/org/apache/lucene/index/DocHelper.java
> src/test-deprecated/org/apache/lucene/index/TestIndexWriter.java
> src/test-deprecated/org/apache/lucene/search/TestRangeQuery.java

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira





Re: svn commit: r375070

2006-02-09 Thread John Haxby

Daniel Naber wrote:


On Sunday 05 February 2006 19:45, Pasha Bizhan wrote:
 


Does this patch require to reindex all data?
   


URL: http://svn.apache.org/viewcvs?rev=375070&view=rev
Log:
DateTools needs to use UTC for correct collation
(LUCENE-491), patch by John Haxby
 



If your timezone is not UTC and your dates need to be accurate to the hour, 
then yes.
 

That's true, but be aware that, for example, Monday, 6pm PST is Tuesday, 
2am GMT.   Even if you have Resolution.YEAR then the last few hours of 
2005 in California are actually 2006 GMT.


If you're worried about events crossing a date boundary then you'll need 
to re-index.
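
A quick way to see the boundary effect is to format the same instant in both zones. The sketch uses plain java.text.SimpleDateFormat rather than DateTools, the "yyyy" pattern merely stands in for year resolution, and the class name DateBoundaryDemo and its helpers are invented for the example.

```java
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Date;
import java.util.GregorianCalendar;
import java.util.Locale;
import java.util.TimeZone;

final class DateBoundaryDemo {
    /** Formats an instant using the given pattern in the given time zone. */
    static String format(Date when, String pattern, String zoneId) {
        SimpleDateFormat fmt = new SimpleDateFormat(pattern, Locale.ENGLISH);
        fmt.setTimeZone(TimeZone.getTimeZone(zoneId));
        return fmt.format(when);
    }

    /** Builds an instant from California wall-clock time; month is a Calendar constant. */
    static Date californiaTime(int year, int month, int day, int hour) {
        Calendar cal = new GregorianCalendar(TimeZone.getTimeZone("America/Los_Angeles"));
        cal.clear();
        cal.set(year, month, day, hour, 0, 0);
        return cal.getTime();
    }

    public static void main(String[] args) {
        Date newYearsEve = californiaTime(2005, Calendar.DECEMBER, 31, 18);
        System.out.println("California year: " + format(newYearsEve, "yyyy", "America/Los_Angeles"));
        System.out.println("GMT year:        " + format(newYearsEve, "yyyy", "GMT"));
    }
}
```

Any document indexed from the local-time value during those last eight hours of the year would therefore need re-indexing to land in the right year under GMT.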


jch




[jira] Updated: (LUCENE-491) DateTools needs to use UTC for correct collation,

2006-02-02 Thread John Haxby (JIRA)
 [ http://issues.apache.org/jira/browse/LUCENE-491?page=all ]

John Haxby updated LUCENE-491:
--

Attachment: patch

Patch for problem.   Basically, wherever a timezone can be used, we use GMT.

> DateTools needs to use UTC for correct collation,
> -
>
>  Key: LUCENE-491
>  URL: http://issues.apache.org/jira/browse/LUCENE-491
>  Project: Lucene - Java
> Type: Bug
> Versions: CVS Nightly - Specify date in submission
>  Environment: svn trunk at 02-Feb-2005, noon GMT.  OS independent.
> Reporter: John Haxby
>  Attachments: patch, testcase.java
>
> If your local timezone is Europe/London then the times Sun, 30 Oct 2005 
> 00:00:00 +0000 and exactly one hour later are both converted to 200510300100 
> by DateTools.dateToString() with minute resolution.   The Linux date command 
> is useful in seeing why:
> $ date --date "Sun, 30 Oct 2005 00:00:00 +0000"
> Sun Oct 30 01:00:00 BST 2005
> $ date --date "Sun, 30 Oct 2005 01:00:00 +0000"
> Sun Oct 30 01:00:00 GMT 2005
> Both times are 1am in the morning, but one is when DST is in force, the other 
> isn't.   Of course, these are actually different times!
> Of course, if dates are stored in the index with implicit timezone 
> information then not only do we get problems when the clocks go back at the 
> end of summer, but we also have problems crossing timezones.   If a database 
> is created in California and used in Paris then the times are going to be 
> badly skewed (there's a nine hour time difference most of the year).




[jira] Updated: (LUCENE-491) DateTools needs to use UTC for correct collation,

2006-02-02 Thread John Haxby (JIRA)
 [ http://issues.apache.org/jira/browse/LUCENE-491?page=all ]

John Haxby updated LUCENE-491:
--

Attachment: testcase.java

TestCase for problem.

> DateTools needs to use UTC for correct collation,
> -
>
>  Key: LUCENE-491
>  URL: http://issues.apache.org/jira/browse/LUCENE-491
>  Project: Lucene - Java
> Type: Bug
> Versions: CVS Nightly - Specify date in submission
>  Environment: svn trunk at 02-Feb-2005, noon GMT.  OS independent.
> Reporter: John Haxby
>  Attachments: testcase.java
>
> If your local timezone is Europe/London then the times Sun, 30 Oct 2005 
> 00:00:00 +0000 and exactly one hour later are both converted to 200510300100 
> by DateTools.dateToString() with minute resolution.   The Linux date command 
> is useful in seeing why:
> $ date --date "Sun, 30 Oct 2005 00:00:00 +0000"
> Sun Oct 30 01:00:00 BST 2005
> $ date --date "Sun, 30 Oct 2005 01:00:00 +0000"
> Sun Oct 30 01:00:00 GMT 2005
> Both times are 1am in the morning, but one is when DST is in force, the other 
> isn't.   Of course, these are actually different times!
> Of course, if dates are stored in the index with implicit timezone 
> information then not only do we get problems when the clocks go back at the 
> end of summer, but we also have problems crossing timezones.   If a database 
> is created in California and used in Paris then the times are going to be 
> badly skewed (there's a nine hour time difference most of the year).




[jira] Created: (LUCENE-491) DateTools needs to use UTC for correct collation,

2006-02-02 Thread John Haxby (JIRA)
DateTools needs to use UTC for correct collation,
-

 Key: LUCENE-491
 URL: http://issues.apache.org/jira/browse/LUCENE-491
 Project: Lucene - Java
Type: Bug
Versions: CVS Nightly - Specify date in submission
 Environment: svn trunk at 02-Feb-2005, noon GMT.  OS independent.
Reporter: John Haxby


If your local timezone is Europe/London then the times Sun, 30 Oct 2005 
00:00:00 +0000 and exactly one hour later are both converted to 200510300100 by 
DateTools.dateToString() with minute resolution.   The Linux date command is 
useful in seeing why:

$ date --date "Sun, 30 Oct 2005 00:00:00 +0000"
Sun Oct 30 01:00:00 BST 2005

$ date --date "Sun, 30 Oct 2005 01:00:00 +0000"
Sun Oct 30 01:00:00 GMT 2005

Both times are 1am in the morning, but one is when DST is in force, the other 
isn't.   Of course, these are actually different times!

Of course, if dates are stored in the index with implicit timezone information 
then not only do we get problems when the clocks go back at the end of summer, 
but we also have problems crossing timezones.   If a database is created in 
California and used in Paris then the times are going to be badly skewed 
(there's a nine hour time difference most of the year).
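
The collision can be reproduced without Lucene. The sketch below is not the attached testcase.java and does not call DateTools; it uses java.text.SimpleDateFormat with the pattern "yyyyMMddHHmm" as a stand-in for minute resolution, and the class name DstCollisionDemo and its helpers are invented for the example. Formatting in Europe/London gives the same key for two instants an hour apart; formatting in GMT keeps them distinct and correctly ordered.

```java
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Date;
import java.util.GregorianCalendar;
import java.util.Locale;
import java.util.TimeZone;

final class DstCollisionDemo {
    /** Builds an instant from UTC fields; month is a Calendar constant. */
    static Date utc(int year, int month, int day, int hour) {
        Calendar cal = new GregorianCalendar(TimeZone.getTimeZone("GMT"));
        cal.clear();
        cal.set(year, month, day, hour, 0, 0);
        return cal.getTime();
    }

    /** Minute-resolution sort key for an instant, rendered in the given time zone. */
    static String minuteKey(Date when, String zoneId) {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyyMMddHHmm", Locale.ROOT);
        fmt.setTimeZone(TimeZone.getTimeZone(zoneId));
        return fmt.format(when);
    }

    public static void main(String[] args) {
        Date first = utc(2005, Calendar.OCTOBER, 30, 0);  // 01:00 BST
        Date second = utc(2005, Calendar.OCTOBER, 30, 1); // 01:00 GMT, one hour later
        for (String zone : new String[] {"Europe/London", "GMT"}) {
            System.out.println(zone + ": " + minuteKey(first, zone) + " and " + minuteKey(second, zone));
        }
    }
}
```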





Re: access rights

2006-01-24 Thread John Haxby

I'd suggest using a QueryFilter.

What happens when the access rights for a document or a collection of 
documents change?   How do you deal with new users?   I'm asking 
because I'm looking at the same problem but rather than attempting to 
keep the access rights in the index consistent with the access rights in 
the original store, I'm looking at using a QueryFilter that checks the 
original document for access.   It's slow, but it can be cached.   This 
question probably belongs on java-user though.
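
To show the shape of what I mean by "slow, but it can be cached", here is a sketch that deliberately avoids the Lucene API altogether. CachedAccessFilter, canRead and invalidate are hypothetical names; the BiPredicate stands in for whatever check is made against the original document store, and the per-user BitSet is the sort of thing a filter would hand back to the searcher. It is not thread-safe and the cache is simply thrown away whenever access rights change.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;
import java.util.function.BiPredicate;

final class CachedAccessFilter {
    private final int maxDoc;                               // number of documents in the index
    private final BiPredicate<String, Integer> canRead;     // (user, docId): the slow check against the store
    private final Map<String, BitSet> cache = new HashMap<>();
    int storeLookups;                                       // how often the slow check has run

    CachedAccessFilter(int maxDoc, BiPredicate<String, Integer> canRead) {
        this.maxDoc = maxDoc;
        this.canRead = canRead;
    }

    /** Returns the set of document ids the user may see, computing it once per user. */
    BitSet bits(String user) {
        return cache.computeIfAbsent(user, u -> {
            BitSet allowed = new BitSet(maxDoc);
            for (int doc = 0; doc < maxDoc; doc++) {
                storeLookups++;
                if (canRead.test(u, doc)) {
                    allowed.set(doc);
                }
            }
            return allowed;
        });
    }

    /** Drops all cached results, for example after access rights change in the store. */
    void invalidate() {
        cache.clear();
    }

    public static void main(String[] args) {
        CachedAccessFilter filter =
                new CachedAccessFilter(4, (user, doc) -> user.equals("alice") && doc % 2 == 0);
        System.out.println(filter.bits("alice") + " after " + filter.storeLookups + " lookups");
        System.out.println(filter.bits("alice") + " after " + filter.storeLookups + " lookups");
    }
}
```

New users cost one pass over the store on their first query; the awkward part is deciding when to call invalidate().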


jch

Maros Ivanco wrote:


Hi,

I try to implement acces rights mechanism on the top of the lucene. My
situation looks like this: Indexed documents have associated access rights
information. When I construct the query, I append a part, which matches
actual user identity with access rights in the documents. This way the user
gets only the documents s/he can really access, and the number of hits is
really the number of documents s/he can potentionally access. The approach
works (it respects access rights), but the access rights (AR) query part
also affects the score of the documents. I tried two approaches to avoid
the effect of AR query part. First, I tried to set boost factor of the AR
fields during the document indexation to zero. This way I was unable to get
any results. Next, I set boost factor of the AR fields to small number
(0.001). This way I get the results but the computed score is really small
(less then 1%) for the first document in the results.
So, is there any possibility to effectivelly exclude certain fields from
score computation?
Any idea regarding the access rights issue, suggestion for better approach,
... is welcome.

Maros.

P.S.
I found this post on the user list:
http://www.gossamer-threads.com/lists/lucene/java-user/14973?do=post_view_threaded

My approach is the number 3 in the post, but unfortunately no reply deal
with it.





[jira] Commented: (LUCENE-489) Wildcard Queries with leading "*"

2006-01-24 Thread John Haxby (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-489?page=comments#action_12363822 ] 

John Haxby commented on LUCENE-489:
---

I'm sure someone mentioned this on one of the lists a while back, but there's a 
technique that we used for an LDAP server that's applicable here.   It's a bit 
like injecting synonyms: you'd have, say, a SubwordFilter that given "brown" 
would emit "rown" and "own" at the same position.  A "*own" query would then 
simply drop the leading wildcard and look for the word.   We stopped at three 
letters in the LDAP server.   An alternative is to use a 
ReverseAlternativeFilter (say) that emits "brown" and "nworb" at the same 
position, though that only deals with prefix or postfix wildcards, not both.

I'm not sure how you'd stop "own" matching "brown" though.   If someone could 
come up with some example code I don't suppose I'd be the only one who would be 
interested! 
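
In the meantime, the term expansion itself is simple enough to sketch outside Lucene. The code below is plain string handling, not a TokenFilter; SubwordExpansion and its methods are hypothetical names, and wiring the extra terms into an analyzer at the same position is left out.

```java
import java.util.ArrayList;
import java.util.List;

final class SubwordExpansion {
    /** Every proper suffix of the term that is at least minLength characters long. */
    static List<String> suffixes(String term, int minLength) {
        List<String> out = new ArrayList<>();
        for (int start = 1; term.length() - start >= minLength; start++) {
            out.add(term.substring(start));
        }
        return out;
    }

    /** The term spelled backwards, for the reversed-term alternative. */
    static String reversed(String term) {
        return new StringBuilder(term).reverse().toString();
    }

    /** Turns a leading-wildcard query such as "*own" into the plain term to look up. */
    static String stripLeadingWildcard(String query) {
        return query.startsWith("*") ? query.substring(1) : query;
    }

    public static void main(String[] args) {
        System.out.println(suffixes("brown", 3));          // extra terms indexed alongside "brown"
        System.out.println(stripLeadingWildcard("*own"));  // term actually searched for
        System.out.println(reversed("brown"));             // alternative: index the reversed term
    }
}
```

With the reversed alternative, "*own" would instead become a prefix search for reversed("own") against the reversed terms.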

> Wildcard Queries with leading "*"
> -
>
>  Key: LUCENE-489
>  URL: http://issues.apache.org/jira/browse/LUCENE-489
>  Project: Lucene - Java
> Type: Wish
>   Components: QueryParser
> Reporter: Peter Schäfer

>
> It would be nice to have wildcard queries with a leading wildcard ("?" or 
> "*").
> I'm aware that this is a well-known issue, and I do understand the reasons 
> behind it,
> but try explaining that to our end-users ... :-(




Re: [jira] Created: (LUCENE-489) Wildcard Queries with leading "*"

2006-01-23 Thread John Haxby

Peter Schäfer (JIRA) wrote:


It would be nice to have wildcard queries with a leading wildcard ("?" or "*").

I'm aware that this is a well-known issue, and I do understand the reasons 
behind it,
but try explaining that to our end-users ... :-(
 

I'm sure someone mentioned this a while back, but there's a technique 
that we used for an LDAP server that's applicable here.   It's a bit 
like injecting synonyms: you'd have, say, a SubwordFilter that given 
"brown" would emit "rown" and "own" at the same position.  A "*own" 
query would then simply drop the leading wildcard and look for the 
word.   We stopped at three letters in the LDAP server.   An alternative 
is to use a ReverseAlternativeFilter (say) that emits "brown" and 
"nworb" at the same position, though that only deals with prefix or postfix 
wildcards, not both.


I'm not sure how you'd stop "own" matching "brown" though.   If someone 
could come up with some example code I don't suppose I'd be the only one 
who would be interested!


jch




Re: Inconsistency between MultiFieldQueryParser and QueryParser

2006-01-17 Thread John Haxby

Daniel Naber wrote:

These are just simple convenience methods that create a BooleanQuery. 
Making them non-static would create a different set of problems, e.g that 
you need to pass them an array with the same number of elements as the 
constructor was given. So I don't know if this is something that should be 
changed.
 


Ahh.   Sorry, I think I misunderstood.   I think I wanted to do

  qp = new MultiFieldQueryParser(new String[]{"subject", "text"},
                                 new BooleanClause.Occur[]{MUST, SHOULD},
                                 analyzer);
  qp.setLocale(locale);
  qp.setSlop(slop);

and I'm no longer sure that that has any worthwhile benefit.   I think I 
was also thinking that other combinations of occurrences would be useful 
but, this morning and for the life of me, I can't see why.  I think 
having both fields required was possibly what I was thinking of.  Maybe 
I've just lost my marbles :-)


jch




Inconsistency between MultiFieldQueryParser and QueryParser

2006-01-16 Thread John Haxby

Hello All,

Not sure if this should be user, dev or a bug report.  Apologies if this 
is the wrong message to the wrong place!   Happy to correct it if needed.


QueryParser's static parse() method is deprecated, but 
MultiFieldQueryParser has three static parse() methods; moreover, there's 
no constructor that takes a BooleanClause.Occur[] and no non-static method 
that takes a String[] of queries.


Am I missing something here?   I have to admit I haven't looked at the 
code to see what's going on.


jch




Re: NioFile cache performance

2005-12-09 Thread John Haxby

Robert Engels wrote:

Using a 4mb file (so I could be "guarantee" the disk data would be in 
the OS cache as well), the test shows the following results.


Which OS?   If it's Linux, what kernel version and distro?   What 
hardware (disk type, controller, etc.)?


It's important to know: I/O (and caching) is very different between 
Linux 2.4 and 2.6.   The choice of I/O scheduler can also make a 
significant difference on 2.6, depending on the workload.   The type of 
disk and its controller is also important -- and when you get really 
picky, the mobo model number.


I don't dispute your finding for a second, but it would be good to run 
the same test on other platforms to get comparative data: not least 
because you can get the kind of I/O time improvement you're seeing on 
some workloads on different versions of the Linux kernel.


jch




Re: "Advanced" query language

2005-12-05 Thread John Haxby

Yonik Seeley wrote:


I looked into this a year ago... most scripting languages have an
emphasis on script execution speed, not script parsing speed (which is
what we would need).  The scripting languages I tried were horribly
slow at parsing a small script.  The only one that could parse at a
reasonable speed was rhino (javascript) in interp mode.
 

I've always found the lisp syntax very easy to parse.  In this case, 
it's just prefix with the name of the operator being first in the list, eg 
(and "eggs" "oranges").   There are wrinkles for named and optional 
parameters, but the basic syntax is a doddle.
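
As a rough indication of how little is involved, here is a minimal recursive-descent reader for that basic syntax. It is a sketch only: SExprParser is a made-up name, quoted strings and bare symbols both come back as plain Strings, there are no escape sequences, and the named and optional parameters mentioned above are not handled.

```java
import java.util.ArrayList;
import java.util.List;

final class SExprParser {
    private final String src;
    private int pos;

    private SExprParser(String src) {
        this.src = src;
    }

    /** Parses one expression: a List of items for "(...)", otherwise a String atom. */
    static Object parse(String text) {
        SExprParser parser = new SExprParser(text);
        Object result = parser.readExpr();
        parser.skipSpace();
        if (parser.pos != text.length()) {
            throw new IllegalArgumentException("trailing input at " + parser.pos);
        }
        return result;
    }

    private void skipSpace() {
        while (pos < src.length() && Character.isWhitespace(src.charAt(pos))) {
            pos++;
        }
    }

    private Object readExpr() {
        skipSpace();
        if (pos >= src.length()) {
            throw new IllegalArgumentException("unexpected end of input");
        }
        char c = src.charAt(pos);
        if (c == '(') {                       // list: operator first, then its arguments
            pos++;
            List<Object> items = new ArrayList<>();
            while (true) {
                skipSpace();
                if (pos >= src.length()) {
                    throw new IllegalArgumentException("missing ')'");
                }
                if (src.charAt(pos) == ')') {
                    pos++;
                    return items;
                }
                items.add(readExpr());
            }
        }
        if (c == ')') {
            throw new IllegalArgumentException("unexpected ')' at " + pos);
        }
        if (c == '"') {                       // quoted string, no escapes
            int end = src.indexOf('"', pos + 1);
            if (end < 0) {
                throw new IllegalArgumentException("unterminated string");
            }
            String value = src.substring(pos + 1, end);
            pos = end + 1;
            return value;
        }
        int start = pos;                      // bare symbol such as and / or
        while (pos < src.length() && !Character.isWhitespace(src.charAt(pos))
                && "()\"".indexOf(src.charAt(pos)) < 0) {
            pos++;
        }
        return src.substring(start, pos);
    }

    public static void main(String[] args) {
        System.out.println(parse("(and \"eggs\" \"oranges\")"));
    }
}
```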


jch




Re: Version 1.9

2005-09-18 Thread John Haxby

John Haxby wrote:


[...] compiled with gcj that I believe is compiled with gcj [...]


It's only compiled once with gcj, if at all :-)

You can get it from 
http://download.fedora.redhat.com/pub/fedora/linux/core/updates/4/SRPMS/lucene-1.4.3-1jpp_3fc.src.rpm


A quick inspection of the .spec file suggests that it's compiled with 
gcj and, indeed, the compiled RPM 
(http://download.fedora.redhat.com/pub/fedora/linux/core/updates/4/i386/lucene-1.4.3-1jpp_3fc.i386.rpm) 
has both shared libraries and a jar.


jch




Re: Version 1.9

2005-09-18 Thread John Haxby

Jeff Breidenbach wrote:


2) Is anyone testing against kaffe or other non-sun compilers?
This is important to Debian as any software that can only
be built from a closed-source JDK is considered
a second class citizen. As you can see, we've been poking
at this issue on Lucene 1.4.3 for quite some time [1]
and it is tricky. Support from upstream is always appreciated.
 

I don't know much about it, but Fedora Core 4 ships a version of Lucene 
compiled with gcj that I believe is compiled with gcj.   It'd be easy 
enough to see what they do in the .src.rpm though and as Debian 
maintainer I'm sure you're well versed in getting ideas from other builds!


jch
