Re: ParallelReader

2005-04-29 Thread Paul Smith
Doug Cutting wrote:
Please find attached something I wrote today.  It has not yet been 
tested extensively, and the documentation could be improved, but I 
thought it would be good to get comments sooner rather than later.

Would folks find this useful?
My Answer: "Is the Pope German?"
Very useful for something we're about to start, where we have millions 
of construction documents, with meta-data and content.  Most searching 
is done against the meta-data (and must be _fast_), but occasionally 
people want to be able to look inside the file contents, so I can see 
perhaps 1% of searches hitting the content index in this case.

Should it go into the core or in contrib?
+1 to core... (non-binding of course).
Paul Smith


Re: [Performanc]

2005-04-29 Thread Paul Smith

 

At this rate I'm only getting on average 300-400-ish items/second 
added to the index.

I think that's realistic for typical uses of Lucene on common hardware.

Thanks Daniel, it's at least comforting to know that this is 
expected.  Can you or anyone else comment on the CPU profile I sent 
in?  If there were a way of optimizing that loop, it could mean a 
reasonable improvement in indexing speed.

cheers,
Paul Smith


[Performance]: IndexWriter again...

2005-05-15 Thread Paul Smith
Ok, I'm just following up on my email from 29th April titled 
'[Performanc]' (don't you love it when you send before you've typed 
your subject line completely).  The thread is here:

http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200504.mbox/[EMAIL PROTECTED]

In summary, I still firmly believe that IndexWriter.maybeMergeSegments() 
is chewing a lot more CPU than would be ideal.  So I ran a simple 
test: the same test I've done before, using mergeFactor(1000), 
maxBufferedDocs(1), useCompoundFile(false), indexing 5 fields (user 
first/last name/email address).

As a baseline using the latest SVN source code, I'm getting an 
indexing rate of between 490-515 items/second over a number of runs.

By applying the attached simple patch to IndexWriter, I'm getting 
between 945-970 items/second over a number of test runs.  That's a 
significant speed-up.  All the patch does is defer the call to 
maybeMergeSegments() so it is only made every 2000 iterations (2000 is 
totally arbitrary on my part).

I've verified with Luke that the index generated contains the same # 
documents and the same # terms, but I have not had a chance to 
properly set up my local environment to run the test cases.  
Obviously the attached patch is a dirty hack of the highest order.  
In my case I'm re-indexing from scratch every time, so there may be a 
reason why we shouldn't be doing this sort of deferring of method 
calls.  Perhaps the code is optimized around incremental/batch 
updates to _existing_ indexes, at the penalty of making the creation 
of a new index slower than one would like.

Perhaps IndexWriter could benefit from another setting that lets one 
configure how often to call maybeMergeSegments()?  That could of 
course confuse more people than it helps.

I would really appreciate anyone's thoughts on this; I'll be very 
happy to be proven wrong, because it will just help me understand 
more of Lucene.  I would hope that speeding up indexing would benefit 
everyone, particularly the large-scale sites out there.

cheers,
Paul Smith

IndexWriter.patch
Description: Binary data
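
For illustration, the deferral idea amounts to something like the 
sketch below.  This is not the attached patch (whose contents are not 
reproduced here); maybeMergeSegments() is a stand-in for the real 
IndexWriter method, and 2000 is the arbitrary threshold mentioned 
above:

// Sketch: amortize an expensive per-document check over N additions.
class DeferredMergeChecker {
    private static final int CHECK_INTERVAL = 2000; // arbitrary, per the email
    private int docsSinceLastCheck = 0;

    void onDocumentAdded() {
        if (++docsSinceLastCheck >= CHECK_INTERVAL) {
            docsSinceLastCheck = 0;
            maybeMergeSegments(); // the expensive check, now run once per 2000 docs
        }
    }

    private void maybeMergeSegments() {
        // merge-decision logic elided
    }
}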


Re: [Performance]: IndexWriter again...

2005-05-15 Thread Paul Smith
Silly me, here's the patch with the extra code NOT commented out...
Oh my, how embarrassing... :)

Paul
On 16/05/2005, at 4:15 PM, Paul Smith wrote:



Re: [Performance]: IndexWriter again...

2005-05-15 Thread Paul Smith
I'm not even going to say anything this time :-$
On 16/05/2005, at 4:17 PM, Paul Smith wrote:
Silly me, here's the patch with the extra code NOT commented out...
Oh my, how embarrassing... :)

Paul
On 16/05/2005, at 4:15 PM, Paul Smith wrote:




Re: [Performance]: IndexWriter again...

2005-05-15 Thread Paul Smith
something very odd is going on with my attachments...  sorry for the  
spam.

On 16/05/2005, at 4:22 PM, Paul Smith wrote:
I'm not even going to say anything this time :-$
On 16/05/2005, at 4:17 PM, Paul Smith wrote:

Silly me, here's the patch with the extra code NOT commented out...
Oh my, how embarrassing... :)

Paul
On 16/05/2005, at 4:15 PM, Paul Smith wrote:





Re: [Performance]: IndexWriter again...

2005-05-16 Thread Paul Smith
On 16/05/2005, at 5:00 PM, Paul Elschot wrote:
On Monday 16 May 2005 08:24, Paul Smith wrote:
something very odd is going on with my attachments...  sorry for the
spam.

It's usually easier to open a bug in Bugzilla and post the code and
the concerns there. The only disadvantage of Bugzilla is that
you can only add an attachment after the bug has been opened for the 
first time:
http://issues.apache.org/bugzilla/enter_bug.cgi

Thanks Paul.  I'm not sure why subsequent attempts are still stripping  
the attachment; I'll go ahead and file something in Bugzilla and  
cross my fingers I don't look any sillier than I do now.

cheers,
Paul Smith


Map-Reduce

2005-08-03 Thread Paul Smith
I've been reading the Nutch MapReduce stuff [1] and the original  
Google paper [2].

I know there's a mapreduce branch in the Nutch project, but is there  
any plan/talk of perhaps integrating something like that directly  
into the Lucene API?  For projects that need a lower-level API like  
Lucene, rather than the crawl-like nature of Nutch, the potential to  
index lots of information in an efficient manner is very appealing  
indeed.

I'm not suggesting this is _easy_, just curious what folks on the  
Lucene side of things think.  Perhaps a chance to refactor a shared  
library out of Nutch?

I would love to hear anyone's thoughts on the matter.

cheers,

Paul Smith

[1] http://wiki.apache.org/nutch-data/attachments/Presentations/attachments/oscon05.pdf

[2] http://labs.google.com/papers/mapreduce-osdi04.pdf




Re: Map-Reduce

2005-08-04 Thread Paul Smith


On 05/08/2005, at 4:10 AM, Doug Cutting wrote:


Doug Cutting wrote:

Perhaps we need to factor Nutch into two projects, one with NDFS  
and MapReduce and the other with the search-specific code.  This  
falls almost exactly on package lines.  The packages  
org.apache.nutch.{io,ipc,fs,ndfs,mapred} are not dependent on the  
rest of Nutch.




FYI, over on the nutch-dev list, I just proposed that we split  
these packages into a new project that Nutch then depends on, since  
there seems to be interest in using them independently of Nutch.   
Such a split probably wouldn't happen for at least a month.


http://www.mail-archive.com/nutch-dev%40lucene.apache.org/msg00312.html



Awesome, thanks Doug!  I really believe that having this out as a  
separate project will be more useful for everyone.  This will also  
give more exposure to Nutch and Lucene as a whole, because people  
will experiment with the NDFS/MapReduce stuff first (it's a smaller  
thing to comprehend).


cheers,

Paul

Re: Considering lucene

2005-09-29 Thread Paul Smith
This requirement is almost exactly the same as my requirement for the  
log4j project I work on, where I wanted to be able to index every row  
in a text log file as its own Document.

It works fine, but treating each line as a Document turns out to take  
a while to index (searching is fantastic though, I have to say) due to  
the cost of adding a Document to an index.  I don't think Lucene is  
currently tuned (or tunable) to that level of Document granularity,  
so it'll depend on your requirement for timeliness of the indexing.
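
For concreteness, the line-per-Document approach looks roughly like 
this, using the old-style Lucene API that appears elsewhere in this 
archive (Field.Keyword/Field.Text); the paths and field names are 
illustrative, not from any real project:

import java.io.BufferedReader;
import java.io.FileReader;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class LogLineIndexer {
    public static void main(String[] args) throws Exception {
        IndexWriter writer =
            new IndexWriter("/tmp/logindex", new StandardAnalyzer(), true);
        BufferedReader in = new BufferedReader(new FileReader("server.log"));
        int lineNo = 0;
        String line;
        while ((line = in.readLine()) != null) {
            Document doc = new Document();
            doc.add(Field.Keyword("line", String.valueOf(lineNo++))); // row number
            doc.add(Field.Text("contents", line)); // tokenized line text
            writer.addDocument(doc); // this per-Document cost is the bottleneck
        }
        in.close();
        writer.optimize();
        writer.close();
    }
}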


I was hoping (of course it's a big ask) to be able to index a million  
rows of relatively short lines of text (as log files tend to be) in a  
'few moments', no more than 1 minute, but even with pretty grunty  
hardware you run up against the bottleneck of the tokenization  
process (the StandardAnalyzer is not optimal at all in this case  
because of the way it 'signals' EOF with an exception).


There was someone (apologies, I've forgotten his name; I blame the  
holiday I just came back from) who could take a relatively small  
file, such as an XML file, and very quickly index it for on-the-fly  
XPath-like queries using Lucene, which apparently works very well, but  
I'm not sure it scales to massive documents such as log files (and  
your requirements).


cheers,

Paul Smith


On 30/09/2005, at 3:17 PM, <[EMAIL PROTECTED]> <[EMAIL PROTECTED]> wrote:


Hi,



My name is Palmik Bijani and I have recently started a new software  
company
in India. After initial research, Lucene has surfaced as a leading  
contender
for our needs. We have also purchased the Lucene book which we are  
expecting
in a couple weeks. However, I was hoping to get an answer to the  
following
as we are unable to find this information from everything we have  
read so
far on Lucene. We don’t know if the book covers this requirement of  
ours.




Our requirement is for row-based keyword search in a single very  
large text file which can potentially hold millions of rows (with  
delimited fields per row). In other words, we would like Lucene to  
filter and return only the row numbers of the rows that hold the  
keywords we query for in a particular field.



From everything we have seen so far, Lucene can handle a large set  
of files, tokenizes the keywords within each file, and returns the  
matching file name per keyword – but I have not seen anything about  
segmenting and searching by rows.



From Lucene’s context, one can think of each row as a separate  
file, field
data within each row as document content, and each row number as  
the unique

file name.



From what I have read about how Lookoutsoft used Lucene for Outlook  
email searches, it seems to me that it should be possible, as  
fundamentally even email searching is row-based.



Is our requirement something that Lucene can inherently handle  
well, or

would it require extensive tweaking and code changes on our end?



Your response is greatly appreciated.



Thank you,

Palmik

















Re: Considering lucene

2005-10-02 Thread Paul Smith


On 01/10/2005, at 6:30 AM, Erik Hatcher wrote:



On Sep 30, 2005, at 1:26 AM, Paul Smith wrote:


This requirement is almost exactly the same as my requirement for  
the log4j project I work on, where I wanted to be able to index  
every row in a text log file as its own Document.

It works fine, but treating each line as a Document turns out to  
take a while to index (searching is fantastic though, I have to  
say) due to the cost of adding a Document to an index.  I don't  
think Lucene is currently tuned (or tunable) to that level of  
Document granularity, so it'll depend on your requirement for  
timeliness of the indexing.




There are several tunable indexing parameters that can help with  
batch indexing.  By default it is mostly tuned for incremental  
indexing, but for rapid batch indexing you may need to tune it to  
merge less often.


Yep, mergeFactor et al.  We currently have it at 1000 (with 8  
concurrent threads creating Project-based indices, so that could be  
8000 open files during search, unless I'm mistaken), plus we've  
increased the value for maxBufferedDocs as per standard practices.
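
For reference, that tuning amounts to a few setter calls; here's a 
sketch using the settings named in this thread (the index path and 
the maxBufferedDocs value are placeholders, not recommendations):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class BatchTuning {
    public static IndexWriter openBatchWriter(String indexDir) throws Exception {
        IndexWriter writer = new IndexWriter(indexDir, new StandardAnalyzer(), true);
        writer.setMergeFactor(1000);      // merge segments far less often
        writer.setMaxBufferedDocs(10000); // placeholder: buffer more docs in RAM
        writer.setUseCompoundFile(false); // skip compound-file packing
        return writer;
    }
}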





I was hoping (of course it's a big ask) to be able to index a  
million rows of relatively short lines of text (as log files tend  
to be) in a 'few moments', no more than 1 minute, but even with  
pretty grunty hardware you run up against the bottleneck of the  
tokenization process (the StandardAnalyzer is not optimal at all  
in this case because of the way it 'signals' EOF with an exception).




Signals EOF with an exception?  I'm not following that.  Where does  
that occur?




See our recent YourKit "sampling" profile export here:

http://people.apache.org/~psmith/For%20Lucene%20list/IOExceptionProfiling.html

This is a full production test run over 5 hours indexing 6.5 million  
records (approx 30 fields) running on dual P4 Xeon servers with 10K  
RPM SCSI disks.  You'll note that a good chunk (35%) of the indexing  
thread's time is spent in 2 methods of the  
StandardTokenizerTokenManager.  When you look at the source code for  
these 2 methods you will see that they rely on FastCharStream's use  
of an IOException to 'flag' EOF:


if (charsRead == -1)
  throw new IOException("read past eof");

(line 72-ish)

Of course, we _could_ always write our own analyzer, but it would be  
really nice if the out-of-the-box one were even better.
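
A sketch of what a non-throwing alternative could look like: signal 
EOF with a sentinel value instead of an exception (illustrative 
names, not Lucene's actual FastCharStream API):

import java.io.IOException;
import java.io.Reader;

final class SentinelCharStream {
    static final int EOF = -1;
    private final Reader input;
    private final char[] buffer = new char[4096];
    private int pos = 0;
    private int len = 0;

    SentinelCharStream(Reader input) {
        this.input = input;
    }

    /** Returns the next char, or EOF (-1) at end of stream; no exception. */
    int readChar() throws IOException {
        if (pos >= len) {
            len = input.read(buffer, 0, buffer.length);
            pos = 0;
            if (len == -1) {
                return EOF; // sentinel instead of throwing "read past eof"
            }
        }
        return buffer[pos++];
    }
}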





There was someone (apologies, I've forgotten his name; I blame the  
holiday I just came back from) who could take a relatively small  
file, such as an XML file, and very quickly index it for on-the-fly  
XPath-like queries using Lucene, which apparently works very  
well, but I'm not sure it scales to massive documents such as log  
files (and your requirements).




Wolfgang Hoschek and the NUX project may be what you're referring  
to.  He contributed the MemoryIndex feature found under  
contrib/memory.  I'm not sure that feature is a good fit for the log  
file or indexing files line-by-line though.


Yes, Wolfgang's code is very cool, but would only work on small texts.

cheers,

Paul Smith




Re: Float.floatToRawIntBits

2005-11-16 Thread Paul Smith
I can confirm this takes ~20% of an overall indexing operation (see  
the attached link from YourKit):

http://people.apache.org/~psmith/luceneYourkit.jpg

Mind you, the whole "signalling via IOException" approach in  
FastCharStream is a way bigger overhead, although I agree it's much  
harder to fix.


Paul Smith

On 17/11/2005, at 7:21 AM, Yonik Seeley wrote:


Float.floatToRawIntBits (in Java1.4) gives the raw float bits without
normalization (like *(int*)&floatvar would in C).  Since it doesn't do
normalization of NaN values, it's faster (and hopefully optimized to a
simple inline machine instruction by the JVM).

On my Pentium4, using floatToRawIntBits is over 5 times as fast as
floatToIntBits.
That can really add up in something like Similarity.floatToByte() for
encoding norms, especially if used as a way to compress an array of
float during query time as suggested by Doug.

-Yonik
Now hiring -- http://forms.cnet.com/slink?231706
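
A rough micro-benchmark sketch of the difference being described; the 
timing methodology is naive (no JIT warm-up control), so treat the 
numbers as indicative only:

public class FloatBitsBench {
    public static void main(String[] args) {
        float[] floats = new float[1 << 20];
        for (int i = 0; i < floats.length; i++) {
            floats[i] = i * 0.37f;
        }

        int acc = 0; // checksum so the JIT can't eliminate the loops
        long t0 = System.currentTimeMillis();
        for (int i = 0; i < floats.length; i++) {
            acc ^= Float.floatToIntBits(floats[i]);    // normalizes NaNs
        }
        long t1 = System.currentTimeMillis();
        for (int i = 0; i < floats.length; i++) {
            acc ^= Float.floatToRawIntBits(floats[i]); // raw bits, no NaN check
        }
        long t2 = System.currentTimeMillis();

        System.out.println("floatToIntBits:    " + (t1 - t0) + " ms");
        System.out.println("floatToRawIntBits: " + (t2 - t1) + " ms (acc=" + acc + ")");
    }
}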









Re: Float.floatToRawIntBits

2005-11-16 Thread Paul Smith


On 17/11/2005, at 9:24 AM, Doug Cutting wrote:

In general I would not take this sort of profiler output too  
literally.  If floatToRawIntBits is 5x faster, then you'd expect a  
16% improvement from using it, but my guess is you'll see far  
less.  Still, it's probably worth switching & measuring as it might  
be significant.


Yes, I don't think we'll get a 5x speed-up, as it will likely move  
the bottleneck back to the IO layer, but still... if you can reduce  
CPU usage, then multithreaded indexing operations can gain better CPU  
utilization (doing other stuff while waiting for IO).  Seems like an  
easy win, and dead easy to unit test?


I've been meaning to have a crack at reworking FastCharStream, but  
every time I start thinking about it I realise there is a bit of a  
dependency on this IOException signalling EOF, so I'm pretty sure  
it's going to be a much harder task.  The JavaCC stuff is really  
designed for compilers building parse trees, which is usually a  
'once-off' type of usage, but Lucene's usage of it (large indexing  
operations) means the flaws in it are exacerbated.


Paul





Re: Float.floatToRawIntBits

2005-11-16 Thread Paul Smith


On 17/11/2005, at 10:21 AM, Chris Lamprecht wrote:


1. Run profiler
2. Sort methods by CPU time spent
3. Optimize
4. Repeat

:)



Umm, well I know I could make it quicker; it's just whether it still  
_works_ as expected.  Maintaining the contract means I'll need to  
develop some good JUnit tests that I feel confident cover the current  
workings before making changes.  That's the hard bit.


Paul





Re: NioFile cache performance

2005-12-08 Thread Paul Smith
Most of the CPU time is actually used during the synchronization  
with multiple threads.  I hacked together a version of MemoryLRUCache  
that used a ConcurrentHashMap from JDK 1.5, and it was another 50%  
faster!  At a minimum, if the ReadWriteLock class was modified to use  
the 1.5 facilities, some significant additional performance gains  
should be realized.

Would you be able to run the same test in JDK 1.4 but use the  
util.concurrent compatibility pack (supposedly the same classes as in  
Java 5)?  It would be nice to verify whether the gain is the result  
of the different ConcurrentHashMap vs the different JDK itself.

Paul Smith
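
A sketch (with assumed class and method names) of the kind of swap 
being described: replacing a globally synchronized map with JDK 1.5's 
ConcurrentHashMap so reader threads don't serialize on a single lock. 
Eviction is elided; a real LRU needs more care:

import java.util.concurrent.ConcurrentHashMap;

// Assumed names throughout; illustrative only, not the hacked
// MemoryLRUCache itself.
class ConcurrentBlockCache {
    private final ConcurrentHashMap<Long, byte[]> blocks =
            new ConcurrentHashMap<Long, byte[]>();

    byte[] get(long blockId) {
        return blocks.get(blockId); // lock-striped read, no global monitor
    }

    void put(long blockId, byte[] data) {
        blocks.put(blockId, data);
    }
}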



Re: "Advanced" query language

2005-12-21 Thread Paul Smith

Hey all,

I haven't been paying real close attention to this thread, but if any  
of you are looking for something that has _easy_ Object->XML->Object  
round-tripping, you should seriously try XStream  
(http://xstream.codehaus.org).  It's the simplest/easiest API I've  
seen, and BSD licensed too (Apache friendly).  One can register a  
Converter class to assist with anything the built-in converters don't  
handle well; the Converter code is nice and elegant.


Just something to think about maybe?

cheers,

Paul
On 22/12/2005, at 11:20 AM, Chris Hostetter wrote:



I finally got a chance to look at this code today (the best part  
about the last day before vacation is that no one expects you to get  
anything done, so you can ignore your "real work" and spend time on  
things that are more important in the long run), and while I still  
haven't wrapped my head around all of it, I wanted to share my  
thoughts so far on the API...

1) I applaud the pluggable nature of your solution. Looking at the  
Test Case, it is easy to see exactly how a service provider could  
do things like override the behavior of a <...> to be implemented  
as a SpanQuery without their clients being affected at all.  Kudos.

2) Digging into what was involved in writing an ObjectBuilder, I  
found the API somewhat confusing.  I was reminded of this exchange  
you had with Yonik...

: > While SAX is fast, I've found callback interfaces
: > more difficult to deal with while generating nested
: > object graphs... it normally requires one to maintain
: > state in stack(s).
:
: I've gone to some trouble to avoid the effects of this
: on the programming model.

As someone who feels very comfortable with Lucene, but has no
practical experience with SAX, I have to say that I don't really  
feel like the API has a very clean separation from SAX.

I think that the ideal API wouldn't require people writing  
ObjectBuilders to know anything about SAX, or ever need to import  
anything from org.xml.** or javax.xml.**


3) While the *need* to maintain/pass state information should be  
avoided, I can definitely think of uses for this framework that may  
*want* to pass state information -- both down to the ObjectBuilders  
that get used in inner nodes, as well as up to wrapping nodes -- and  
there doesn't seem to be an easy way to do that.  (It could just be  
my lack of SAX knowledge though.)


The best example I can give is if someone (ie: me) wanted to use this
framework to allow boolean queries to be written like this...

   [XML example lost to the archive's tag stripping; the surviving
   clause text was: "a phrase" fuzzy~]

...I want to be able to write a "BooleanClauseWrapperObjectBuilder"  
that can be wrapped around any other ObjectBuilder and will return  
whatever object it does, but will also check for an "occurs"  
attribute, and put that in a state bucket somewhere that the  
BooleanQuery has access to when adding the Query it gets back.

Going the opposite direction, I'd like to be able to have tags that  
set state which is accessible to descendant tags (even if the tags in  
the middle don't know anything about that bit of state), for example:  
specifying how much slop should be used by default in phrase  
queries...

   [XML example lost to the archive's tag stripping; the surviving
   query text was: How Now Brown Cow?]


I haven't had a chance to try implementing this, but at a high  
level, it seems like all of this should be possible and still easy  
to use.  Here's a real rough cut at what I've had floating around in  
the back of my head (I'm doing this straight into email, pardon any  
typos or pseudo-code) ...



/** Could be implemented with SAX, or DOM, or Pull. */
public interface LuceneXmlParser {
    /** this method will call setParser(this) on each handler */
    public void registerHandler(String tag, LuceneXmlHandler h);
    /**
     * primary method for clients; parses the xml and calls processNode
     * on the root node
     */
    public Query parse(InputStream xml);
    /**
     * dispatches to the appropriate handler's process method based on
     * the Node name; may be called by handlers for recursion of
     * children nodes
     */
    public Query processNode(LuceneXmlNode n, State s);
}

public interface LuceneXmlHandler {
    public void setParser(LuceneXmlParser p);
    /**
     * should return a Query that corresponds to the specified node.
     * may read/modify state in any way it wants ... it is recommended
     * that all implementing methods wrap their state before passing it
     * on when processing children.
     */
    public Query process(LuceneXmlNode n, State s);
}

/**
 * A State is a stack frame that can delegate read operations to another
 * State it wraps (if there is one), but it cannot delegate modifying
 * operations.
 * Classes implementing State should provide a constructor that takes
 * another State to wrap.
 */
public interface State extends Map {
    /**
     * for callers that want to know what's in the immediate stack
     * frame without any delegation
     */
    public Map getOuterFrame();
    /* should return a new state that

Re: "Advanced" query language

2006-01-02 Thread Paul Smith


On 03/01/2006, at 11:08 AM, markharw00d wrote:




I thought
you said you "didn't really want to have to design a general API for
parsing XML as part of this project" ?   :)



Having grown tired of messing with my own solution, I tried using  
commons Digester with my example XML but ran into issues, so I'm  
back looking at a custom solution.

Seriously... did you try out XStream?  Digester is just too hard;  
XStream will work so easily you'll be pleasantly surprised.


Paul





Re: JIRA html problems?

2006-01-26 Thread Paul Smith

Looks a bit b0rk3n to me as well.

Maybe some text being displayed isn't being escaped properly, causing  
HTML mayhem?


Paul Smith

On 27/01/2006, at 8:12 AM, Yonik Seeley wrote:


I've been getting bad HTML out of JIRA lately:

http://issues.apache.org/jira/browse/LUCENE

Anyone else?

-Yonik









Re: Tree based BitSet (aka IntegerSet, DocSet...)

2006-01-28 Thread Paul Smith


Unfortunately, the license distributed with the JAR (which we must  
assume takes precedence over whatever is stated on the web pages) is  
much more restrictive: it's the Java Research License, which  
specifically disallows any commercial use.  So, short of  
reimplementing it from scratch, it's of no use except for academic  
study.  Pity.

Would that preclude re-implementing the same algorithm in new source  
code?  I'm not clear on whether that violates the license.


cheers,

Paul Smith




Re: Tree based BitSet (aka IntegerSet, DocSet...)

2006-01-28 Thread Paul Smith


No, I'm pretty sure it wouldn't, so long as you don't look at this  
code, lest you become "tainted" ... ;-)

Isn't that where the phrase "I have no recollection of that,  
Senator" comes in handy? :)


Paul




Re: Lucene and Java 1.5

2006-05-30 Thread Paul Smith


On 31/05/2006, at 7:45 AM, Robert Engels wrote:

Log4j can be configured to delegate to the standard 1.5 logging.  In  
fact this is preferred so you have STANDARDIZED logging support (and  
not a different logger for every library).

All NEW code should use the 1.5 logging for simplicity of  
configuration and for future ease of integration.



Now, first off, I'm a log4j developer, so I thought I'd state that up  
front.

Java's own logging is 'ok'; it does a job, but it is nowhere near the  
production-quality control over logging that you really need.

I would recommend you do not bind yourself to Java logging, but  
decide between JCL or log4j itself.  I have a distinct anti-JCL  
feeling because of classloader-hell issues in the past, but I  
understand the latest version is a lot better, and many Apache  
projects are already linked against it.

Before you make any decision, I'd sit down and plan what events  
you'll actually want to log and at what level.  Good planning will  
make logging in the Lucene library very useful.  You can then decide  
how you're going to log them.


cheers,

Paul Smith



Re: svn commit: r437897 [1/2] - in /lucene/java/trunk: ./ src/java/org/apache/lucene/index/ src/java/org/apache/lucene/store/ src/test/org/apache/lucene/store/

2006-08-28 Thread Paul Smith


Added:
lucene/java/trunk/src/java/org/apache/lucene/index/IndexWriter.java.orig
lucene/java/trunk/src/java/org/apache/lucene/index/doron_2_IndexWriter.patch   (with props)



Just in case this was accidental, were the .orig and .patch files  
meant to be added to the repo?


cheers,

Paul Smith






Re: Is it safe to remove the throw from FastCharStream.refill() ?

2006-10-03 Thread Paul Smith
The throwing of an exception by this class is still being done on the  
Java side at this stage, IIRC, and is also extremely bad for  
performance in Java.  However, I think the client of the class (one  
of the Filters, I think) is expecting the EOF exception as a signal  
that it has reached the end of the stream, from a tokenization point  
of view.

I would love to get rid of it, but I think it would break a lot of  
behaviour.


cheers,

Paul Smith

On 04/10/2006, at 11:48 AM, George Aroush wrote:


Hi folks,

Over at Lucene.Net, we are trying to determine if it's safe to do the
following change: http://issues.apache.org/jira/browse/LUCENENET-8

Can you tell us, if this change is made to the Java Lucene code, how  
it will affect Lucene?  Do you expect it to run faster, and more  
importantly, is it safe?

Thanks.

-- George Aroush











Re: Is it safe to remove the throw from FastCharStream.refill() ?

2006-10-04 Thread Paul Smith
On 05/10/2006, at 3:34 PM, Doron Cohen wrote:

If I read the JIRA issue right, it looks as if this is fixed in  
Lucene 2.0.1.  Is it?  If so, where can I download 2.0.1?

No 2.0.1 was released (yet).  This issue is fixed in the svn head.  
Nightly builds that include this (and other things) are found in  
http://people.apache.org/dist/lucene/java/nightly/ -- be aware that  
these are not announced releases, just nightly builds.

But that issue addresses a performance problem with IndexWriter, not  
the underlying FastCharStream+IOException==bad problem; I'm pretty  
sure that's still there.

cheers,
Paul



Re: ThreadLocal leak (was Re: Leaking org.apache.lucene.index.* objects)

2006-12-17 Thread Paul Smith


On 16/12/2006, at 6:15 PM, Otis Gospodnetic wrote:


Moving to java-dev, I think this belongs here.
I've been looking at this problem some more today and reading about  
ThreadLocals.  It's easy to misuse them and end up with memory  
leaks, apparently... and I think we may have this problem here.


The problem here is that ThreadLocals are tied to Threads, and I  
think the assumption in TermInfosReader and SegmentReader is that  
(search) Threads are short-lived: they come in, scan the index, do  
the search, return and die.  In this scenario, their ThreadLocals  
go to heaven with them, too, and memory is freed up.
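
A minimal illustration (hypothetical names) of the lifecycle just 
described: a value parked in a ThreadLocal stays reachable for as 
long as its thread lives, which is fine for short-lived search 
threads but lingers on pooled ones:

// 1.4-era style: raw ThreadLocal with an initialValue() override.
class PerThreadBuffer {
    private static final ThreadLocal SCRATCH = new ThreadLocal() {
        protected Object initialValue() {
            return new byte[1 << 20]; // 1MB of scratch space per thread
        }
    };

    // On a pooled (long-lived) thread this 1MB stays reachable until
    // the thread exits or the value is explicitly cleared.
    static byte[] get() {
        return (byte[]) SCRATCH.get();
    }
}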


Otis, we have an index server running inside Tomcat, where an  
Application instance makes a search request via a vanilla HTTP post,  
so our connector threads definitely do stay alive for quite a while.  
We're using Lucene 2.0, and our index server is THE most stable of  
all our components: up for over a month (before being taken down for  
updates), searching hundreds of indexes of various sizes up to 7GB,  
serving 1-2 requests/second during peak usage.


No memory leak spotted at our end, but I'm watching this thread with  
interest! :)


cheers,

Paul Smith



Large scale sorting

2007-04-06 Thread Paul Smith
[...]rtHitQueue et al, and I realize that the use of NIO immediately  
pins Lucene to Java 1.4, so I'm sure this is controversial.  But if  
we wish Lucene to go beyond where it is now, I think we need to start  
thinking about this particular problem sooner rather than later.


Happy Easter to all,

Paul Smith




Re: Large scale sorting

2007-04-09 Thread Paul Smith


On 10/04/2007, at 4:18 AM, Doug Cutting wrote:


Paul Smith wrote:

Disadvantages to this approach:
* It's a lot more I/O intensive


I think this would be prohibitive.  Queries matching more than a  
few hundred documents will take several seconds to sort, since  
random disk accesses are required per matching document.  Such an  
approach is only practical if you can guarantee that queries match  
fewer than a hundred documents, which is not generally the case,  
especially with large collections.




I don't disagree with the premise that it involves substantial I/O  
and would increase the time taken to sort, and that's why this  
approach shouldn't be the default mechanism, but it's not too  
difficult to build a disk I/O subsystem that can allocate many  
spindles to service this and to allow the underlying OS to use its  
buffer cache (yes, this is sounding like a database server now,  
isn't it).

I'm working on the basis that it's a LOT harder/more expensive to  
simply allocate more heap to cover the current sorting  
infrastructure.  One hits memory limits faster.  Not everyone can  
afford 64-bit hardware with many GB of RAM to allocate to a heap.  It  
_is_ cheaper/easier to build a disk subsystem to tune this I/O  
approach, and one can still use any RAM as buffer cache for the  
memory-mapped file anyway.
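
For what it's worth, that memory-mapped approach amounts to something 
like this sketch, assuming a hypothetical file of fixed-width 4-byte 
sort values written in document order (not an actual Lucene file 
format):

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.IntBuffer;
import java.nio.channels.FileChannel;

// Map the whole value file once; reads are then served from the OS
// page cache rather than the Java heap.
class MappedSortValues {
    private final IntBuffer values;

    MappedSortValues(String path) throws IOException {
        RandomAccessFile raf = new RandomAccessFile(path, "r");
        values = raf.getChannel()
                    .map(FileChannel.MapMode.READ_ONLY, 0, raf.length())
                    .asIntBuffer();
    }

    int sortValue(int docNo) {
        return values.get(docNo); // index docNo, i.e. byte offset docNo * 4
    }
}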


In my experience, raw search time starts to climb towards one second  
per query as collections grow to around 10M documents (in round  
figures and with lots of assumptions).  Thus, searching on a single  
CPU is less practical as collections grow substantially larger than  
10M documents, and distributed solutions are required.  So it would  
be convenient if sorting is also practical for ~10M document  
collections on standard hardware.  If 10M strings with 20 characters  
are required in memory for efficient search, this requires 400MB.  
This is a lot, but not an unusual amount on today's machines.  
However, if you have a large number of fields, then this approach  
may be problematic and force you to consider a distributed solution  
earlier than you might otherwise.


400MB is not a lot in and of itself, but when one has many of these  
types of indexes, with many sorting fields in many locales on the  
same host, it becomes problematic.  I'm sure there's a point where  
distributing doesn't work over really large collections, because even  
if one partitioned an index across many hosts, one still needs to  
merge-sort the results together.

It would be disappointing if Lucene's innate design limited it to  
10M-document collections before needing to consider distributed  
solutions.  10M is not that many.  It would be better if the sorting  
mechanism in Lucene were a little more decoupled, such that more  
customised designs could be utilised for specific scenarios.  Right  
now it's a one-size-fits-all approach that can't be changed without  
substantial gutting of the code.


cheers,

Paul





Re: Large scale sorting

2007-04-09 Thread Paul Smith


Now, if we could use integers to represent the sort field values,  
which is typically the case for most applications, maybe we can  
afford to have the sort field values stored on disk and do a disk  
lookup for each document matched?  The lookup of the sort field value  
will be as simple as docNo * 4 + offset.

This way, we use the same approach as constructing the norms (proper  
merging for incremental indexing), but at search time we don't load  
the sort field values into memory; instead, we just store them on  
disk.

Will this approach be good enough?


While a nifty idea, I think this only works for a single sort  
locale.  I initially came up with a similar idea, given that the  
terms are already stored in 'sorted' order, and one might be able to  
use a term's position for sorting; it's just that the term ordering  
position is different in different locales.
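
For illustration, the disk lookup proposed above amounts to a seek 
into a file of fixed-width 4-byte ints in document order (hypothetical 
layout and names):

import java.io.IOException;
import java.io.RandomAccessFile;

// One sort value per document, 4 bytes each, written in doc order.
class DiskSortValues {
    private final RandomAccessFile file;

    DiskSortValues(RandomAccessFile file) {
        this.file = file;
    }

    int sortValue(int docNo) throws IOException {
        file.seek((long) docNo * 4); // docNo * 4 (+ any header offset)
        return file.readInt();
    }
}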


Paul




Re: Large scale sorting

2007-04-09 Thread Paul Smith


In our application, we have to sync up the index pretty frequently;  
the warm-up of the index is killing it.

Yep, it speeds up the first sort, but at the cost of making all the  
others slower (maybe significantly so).  That's obviously not ideal,  
but it could make use of sorts in larger indexes practical.


To address your concern about the single sort locale, what about  
creating a sort field for each sort locale?  So, if you have, say,  
10 locales, you will have 10 sort fields, each utilizing the  
mechanism of constructing the norms.




I really don't understand norms properly, so I'm not sure exactly how  
that would help.  I'll have to go over your original email again to  
understand.

My main goal is to get some discussion going amongst the community,  
which hopefully we've kicked along.



Paul




Re: Large scale sorting

2007-04-09 Thread Paul Smith


A memory-saving optimization would be to not load the corresponding
String[] in the string index (as discussed previously), but there is
currently no way to tell the FieldCache that the strings are
unneeded.  The String values are only needed for merging results in a
MultiSearcher.

Yep, which happens all the time for us specifically, because we have  
an 'archive' and a 'week' index.  The week index is merged once per  
week, so the search is always a merged sort across the two (the week  
index is reloaded every 5 seconds or so; the archive index is kept in  
memory once loaded).









Re: Tests, Contribs, and Releases

2007-05-16 Thread Paul Smith
Does Lucene have a Gump run descriptor?  That's quite useful for  
tracking this sort of thing too.  It's very good at nagging! :)

The standard Maven assembly packaging runs the unit tests by default  
too.  Changing the Lucene build system to Maven is not something  
you'd want to jump at without careful thought, but it might be worth  
considering.  I used to be anti-Maven, but since version 2, and since  
Curt Arnold has been setting up the log4j build environment for  
Maven, I've been quite impressed with its capability.


cheers,

Paul Smith


On 17/05/2007, at 8:02 AM, Chris Hostetter wrote:



Hey everybody, this thread has been sitting in my inbox for a while
waiting for me to have a few minutes to look into it...

http://www.nabble.com/Packaging-Lucene-2.1.0-for-Debian--found-2-junit-errors-tf3571676.html


In a nutshell, when a guy from Debian went looking to package  
Lucene, he noticed that the official 2.1.0 release contained 2 test  
failures -- one each in the highlighter and spellchecker contribs.

The specifics of the test failures don't really interest me as much  
as the question: how did we manage to ship a release with test  
failures?

A few things have jumped out at me while looking into this...

1) The task "build-contrib" can be used to walk the contrib directory
   building each contrib; the task "test-contrib" can be used to walk
   the contrib directory testing each contrib.
2) The "test" task only tests the lucene-core ... it does not depend
   on (or call) "test-contrib".
3) The "nightly" build task depends on the "test" and "package-tgz"
   tasks (the latter depends on "build-contrib"), but at no point is
   "test-contrib" run.
4) The steps for creating an official release...
   http://wiki.apache.org/lucene-java/ReleaseTodo
   ...specify using the "dist" and "dist-src" tasks -- neither of
   which depend on *ANY* tests being run (let alone contrib tests).

This seems very strange to me ... I would think that we would want:

  a) nightly builds to run the tests for all contribs, ala...
  b) the release instructions to make it clear that all unit tests
     (core and contrib) should be run prior to creating the
     distribution.


Does anyone see any reason not to make these changes?



-Hoss











Re: Tests, Contribs, and Releases

2007-05-16 Thread Paul Smith


To answer your question, though, I don't see any reason not to make  
the changes to make the current process more repeatable.


Yeah, modding the ant process now is going to be the simpler way to  
catch the current problem.  Still, I'd check the Gump stuff for  
Lucene, because I'd be surprised if that wouldn't have caught this  
and continuously nagged you all to get it fixed.. :)


Paul




Re: How to handle servlet-api.jar in build?

2007-06-12 Thread Paul Smith


On 12/06/2007, at 5:09 PM, markharw00d wrote:

As part of the documentation push I was considering putting together  
an updated demo web app which showed a number of things (indexing,  
search, highlighting, XML Query templates etc.) and was wondering  
what that might mean for the build system if it was dependent on the  
servlet API.  Are there any licence concerns around handling  
servlet-api.jar that I should be aware of?  I know the Apache  
foundation does not like linking to non-Apache code.



You should be fine on this, since many Apache apps need to reference  
the servlet-api and jsp-api jars (JSTL, for a start).

I just don't think you can 'package' up a distribution that includes  
these jars.  That is, a downloaded unit from Apache can't include  
that jar in the distribution.

The log4j project I work on references quite a few non-ASL-licensed  
things, and as long as you can build a distribution environment that  
requires the user to download those (and agree to any licensing bits  
and bobs), you should be fine.


This is where Maven is cool...

cheers,

Paul




Re: How to handle servlet-api.jar in build?

2007-06-12 Thread Paul Smith


On 12/06/2007, at 7:07 PM, mark harwood wrote:


Thanks for the pointers Paul.

I just don't think you can 'package' up a distribution that  
includes these jars in your distribution.


Clearly the binary distribution need not bundle servlet-api.jar -- a  
demo.war file is all that is needed.  However, is the source  
distribution exempt from this restriction?  It would be convenient  
if the build.xml "just worked", referencing our included copy of  
servlet-api.jar rather than requiring the user to configure  
build.properties etc. to point to their copy of the API.  If this  
bundling were an issue, would an acceptable solution be to have an  
ANT task to download the servlet-api.jar from a Sun server?





From other people's posts it sounds like the servlet spec jars have  
a compatible license (quite odd of Sun to do that! :) ), so perhaps  
my comments are more relevant to other licensed jars, such as LGPL  
projects.

The ant idea of automatically getting the file if not provided is a  
good one; I've done that before.  Having said that, this is exactly  
what Maven can do for you with its dependency management.  (I was  
anti-Maven for quite a while, but now I'm converted.)


Paul




Re: Lucene upload to Maven 2 repository

2007-06-18 Thread Paul Smith


On 19/06/2007, at 6:14 AM, Michael Busch wrote:


Hello,

looking at JIRA and the email archives I find several people asking  
us to upload Lucene to the Maven2 repository. Currently there are  
only the artifacts from Lucene core 1.9.1 and 2.0.0 in the  
repository. 1.9.1 is even incomplete, as LUCENE-867 indicates.  
Therefore I ported the maven patch (LUCENE-622) back to the  
releases 1.9.1, 2.0.0, and 2.1.0, added LICENSE.txt and NOTICE.txt  
to the jars and generated all maven artifacts for those releases  
(core + contribs). I uploaded everything for reviewing to the  
staging area: http://people.apache.org/~buschmi/staging_area/maven/.


I made a few tests and the artifacts seem to be fine. I intend to  
upload everything to the maven repository when I officially release  
2.2 unless there are objections.




Any chance of adding source jars as artifacts too?  Makes the Maven  
Eclipse plugin rather nice.  I appreciate the effort in organizing  
the artifacts (particularly the older versions).


cheers,

Paul




Re: Lucene upload to Maven 2 repository

2007-06-18 Thread Paul Smith
I'm just kidding, of course! I'll try to take a look at that.  
However, making these artifacts was already a lot of work and I'm  
not sure how soon I can work on the source artifacts.





I might try and grab the trunk and see if I can work out what's  
needed to do that..


Paul




Re: Lucene upload to Maven 2 repository

2007-06-18 Thread Paul Smith
A quick check: I haven't tried the Maven build system for Lucene yet,  
but getting a clean trunk and doing this:

 mvn -f lucene-parent-pom.xml -Dversion=2.2 install

it appears to be ignoring the version property:

Installing /workspace/lucene-svn/lucene-parent-pom.xml to /Users/paulsmith/.m2/repository/org/apache/lucene/lucene-parent/@version@/[EMAIL PROTECTED]@.pom

Am I missing something?

Paul
On 19/06/2007, at 10:15 AM, Michael Busch wrote:


Paul Smith wrote:


I might try and grab the trunk and see if I can work out what's  
needed to do that..


Paul


That'd be great! In particular we need to figure out which changes  
to the pom.xml files are necessary.


- Michael










Re: Lucene upload to Maven 2 repository

2007-06-18 Thread Paul Smith



lucene_pom.patch
Description: Binary data
Attached is a quick patch for the lucene-core pom so that it does  
compile and package successfully:

mvn -f lucene-core-pom.xml package

ends up with a binary jar in the target/ sub-folder, and

mvn assembly:assembly

creates a source distribution in the target folder too.  I'm assuming  
Lucene requires 1.4-compiled code (the Maven default is 1.3).

We possibly need to refactor some of this pom into the parent pom  
eventually, but it does build cleanly, other than the weird @version@  
still appearing.  I think that is because you are using ant as the  
primary build mechanism and forking some 'mavenness'.  We've been  
mavenizing the log4j project, so I'm gaining some experience with  
this sort of stuff.

cheers,
Paul

On 19/06/2007, at 10:40 AM, Michael Busch wrote:

Paul Smith wrote:
quick check, I haven't tried the maven build system for lucene yet,
but getting a clean trunk, and doing this:

mvn -f lucene-parent-pom.xml -Dversion=2.2 install

It appears to be ignoring the version property:

Installing /workspace/lucene-svn/lucene-parent-pom.xml to /Users/paulsmith/.m2/repository/org/apache/lucene/lucene-parent/@version@/[EMAIL PROTECTED]@.pom

Am I missing something?

Yes... there is no maven build system for Lucene yet ;-)

We actually build with ant and use the maven-ant-tasks to deploy the  
m2 artifacts.  The pom.xml files in trunk are templates.  The ant  
target "generate-maven-artifacts" takes those templates, replaces the  
@version@ with the actual version number, and creates a maven dist  
directory where it deploys all the artifacts for the ibiblio upload.

- Michael

Re: Lucene upload to Maven 2 repository

2007-06-18 Thread Paul Smith
Enhanced version of the previous patch.  Now compiles and executes  
all unit tests (although some of them are failing for me):

mvn -f lucene-core-pom.xml test

You can still do a package (including source distro) and skip the  
tests:

mvn -f lucene-core-pom.xml -Dmaven.test.skip=true package assembly:assembly

cheers,

Paul



Re: Lucene upload to Maven 2 repository

2007-06-18 Thread Paul Smith
*sigh*, with attachment this time:

lucene_pom.2.patch
Description: Binary data
On 19/06/2007, at 11:42 AM, Paul Smith wrote:

Enhanced version of the previous patch.  Now compiles and executes  
all unit tests (although some of them are failing for me):

mvn -f lucene-core-pom.xml test

You can still do a package (including source distro) and skip the  
tests:

mvn -f lucene-core-pom.xml -Dmaven.test.skip=true package assembly:assembly

cheers,
Paul

Re: Lucene upload to Maven 2 repository

2007-06-18 Thread Paul Smith


On 19/06/2007, at 9:58 AM, Michael Busch wrote:


Paul Smith wrote:



Any chance of adding source jars as artifacts too?  Makes the  
Maven Eclipse plugin rather nice.  I appreciate the effort in  
organizing the artifacts (particularly the older versions).


cheers,

Paul



In German we have a saying, something like "Offer them your pinky,  
and they rip off your whole hand." ;-)


I'm just kidding, of course! I'll try to take a look at that.  
However, making these artifacts was already a lot of work and I'm  
not sure how soon I can work on the source artifacts.




Incidentally, for those wanting some links to refer to, the  
maven-assembly-plugin docs are here:

http://maven.apache.org/plugins/maven-assembly-plugin/

Probably the best reference in this case is this one:

http://maven.apache.org/plugins/maven-assembly-plugin/descriptor-refs.html


cheers,

Paul




maven snapshots available for 2.3?

2007-08-08 Thread Paul Smith
I'm thinking no, but just in case: are Lucene 2.3 snapshots published  
anywhere, or should I build one locally?

More broadly, is there any plan to fully Mavenize the Lucene trunk?  
I'm not sure if anyone has had a chance to look at the patch I  
supplied a while back with a pom.xml that appeared to build and test  
OK.  I'm happy to pitch in here.


cheers,

Paul Smith





[jira] Commented: (LUCENE-467) Use Float.floatToRawIntBits over Float.floatToIntBits

2005-11-16 Thread Paul Smith (JIRA)
[ http://issues.apache.org/jira/browse/LUCENE-467?page=comments#action_12357839 ]

Paul Smith commented on LUCENE-467:
---

I probably didn't make my testing framework as clear as I should have.  YourKit 
was set up to use method sampling (waking up every X milliseconds).  I wouldn't 
use the 20% as an 'accurate' figure, but suffice it to say that improving this 
method would 'certainly' improve things.  Only testing the way you have will 
flush out the correct numbers.

We don't use -server (due to some Linux vagaries we've been careful with 
-server because of some stability problems)

> Use Float.floatToRawIntBits over Float.floatToIntBits
> -
>
>  Key: LUCENE-467
>  URL: http://issues.apache.org/jira/browse/LUCENE-467
>  Project: Lucene - Java
> Type: Improvement
>   Components: Other
> Versions: 1.9
> Reporter: Yonik Seeley
> Priority: Minor

>
> Copied From my Email:
>   Float.floatToRawIntBits (in Java1.4) gives the raw float bits without
> normalization (like *(int*)&floatvar would in C).  Since it doesn't do
> normalization of NaN values, it's faster (and hopefully optimized to a
> simple inline machine instruction by the JVM).
> On my Pentium4, using floatToRawIntBits is over 5 times as fast as
> floatToIntBits.
> That can really add up in something like Similarity.floatToByte() for
> encoding norms, especially if used as a way to compress an array of
> float during query time as suggested by Doug.




[jira] Commented: (LUCENE-467) Use Float.floatToRawIntBits over Float.floatToIntBits

2005-11-17 Thread Paul Smith (JIRA)
[ http://issues.apache.org/jira/browse/LUCENE-467?page=comments#action_12357925 ]

Paul Smith commented on LUCENE-467:
---

If you can create a patch against 1.4.3, there is a reasonable possibility that 
I could create a Lucene 1.4.3+ThisPatch jar and re-index in our test 
environment, which was the source of the YourKit graph I provided earlier.  
This should reflect how useful the change might be against a decent baseline.

> Use Float.floatToRawIntBits over Float.floatToIntBits
> -
>
>  Key: LUCENE-467
>  URL: http://issues.apache.org/jira/browse/LUCENE-467
>  Project: Lucene - Java
> Type: Improvement
>   Components: Other
> Versions: 1.9
> Reporter: Yonik Seeley
> Priority: Minor

>
> Copied From my Email:
>   Float.floatToRawIntBits (in Java1.4) gives the raw float bits without
> normalization (like *(int*)&floatvar would in C).  Since it doesn't do
> normalization of NaN values, it's faster (and hopefully optimized to a
> simple inline machine instruction by the JVM).
> On my Pentium4, using floatToRawIntBits is over 5 times as fast as
> floatToIntBits.
> That can really add up in something like Similarity.floatToByte() for
> encoding norms, especially if used as a way to compress an array of
> float during query time as suggested by Doug.




[jira] Commented: (LUCENE-388) [PATCH] IndexWriter.maybeMergeSegments() takes lots of CPU resources

2006-08-13 Thread Paul Smith (JIRA)
[ http://issues.apache.org/jira/browse/LUCENE-388?page=comments#action_12427818 ]

Paul Smith commented on LUCENE-388:
---

Geez, yep, definitely don't put this in; my patch was only a 'suggestion' to 
highlight how it fixes the root cause of the problem.  It is interesting that 
originally all the test cases still pass, yet the problems Yonik highlights are 
real.  It might warrant some extra test cases to cover exactly those 
situations, even if this problem is not addressed.

It would be great if this could be fixed completely though, but I haven't got 
any headspace left to continue research on this one.. sorry :(

> [PATCH] IndexWriter.maybeMergeSegments() takes lots of CPU resources
> 
>
> Key: LUCENE-388
> URL: http://issues.apache.org/jira/browse/LUCENE-388
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
>Affects Versions: CVS Nightly - Specify date in submission
> Environment: Operating System: Mac OS X 10.3
> Platform: Macintosh
>Reporter: Paul Smith
> Attachments: IndexWriter.patch, log-compound.txt, 
> log.optimized.deep.txt, log.optimized.txt, Lucene Performance Test - with & 
> without hack.xls, lucene.34930.patch
>
>
> Note: I believe this to be the same situation with 1.4.3 as with SVN HEAD.
> Analysis using the hprof utility shows that during index creation with many
> documents, the CPU spends a large portion of its time in
> IndexWriter.maybeMergeSegments(), which seems to be a 'waste' compared with
> other valuable CPU-intensive operations such as tokenization etc.
> Using the following test snippet to retrieve some rows from the db and create 
> an
> index:
> Analyzer a = new StandardAnalyzer();
> writer = new IndexWriter(indexDir, a, true);
> writer.setMergeFactor(1000);
> writer.setMaxBufferedDocs(1);
> writer.setUseCompoundFile(false);
> connection = DriverManager.getConnection(
> "jdbc:inetdae7:tower.aconex.com?database=", "secret",
> "squirrel");
> String sql = "select userid, userfirstname, userlastname, email from 
> userx";
> LOG.info("sql=" + sql);
> Statement statement = connection.createStatement();
> statement.setFetchSize(5000);
> LOG.info("Executing sql");
> ResultSet rs = statement.executeQuery(sql);
> LOG.info("ResultSet retrieved");
> int row = 0;
> LOG.info("Indexing users");
> long begin = System.currentTimeMillis();
> while (rs.next()) {
> int userid = rs.getInt(1);
> String firstname = rs.getString(2);
> String lastname = rs.getString(3);
> String email = rs.getString(4);
> String fullName = firstname + " " + lastname;
> Document doc = new Document();
> doc.add(Field.Keyword("userid", userid+""));
> doc.add(Field.Keyword("firstname", firstname.toLowerCase()));
> doc.add(Field.Keyword("lastname", lastname.toLowerCase()));
> doc.add(Field.Text("name", fullName.toLowerCase()));
> doc.add(Field.Keyword("email", email.toLowerCase()));
> writer.addDocument(doc);
> row++;
> if((row % 100)==0){
> LOG.info(row + " indexed");
> }
> }
> double end = System.currentTimeMillis();
> double diff = (end-begin)/1000;
> double rate = row/diff;
> LOG.info("rate:" +rate);
> On my 1.5GHz PowerBook with 1.5Gb RAM and a 5400 RPM drive, my CPU is maxed 
> out,
> and I end up getting a rate of indexing between 490-515 documents/second run
> over 10 times in succession.  
> By applying a simple patch to IndexWriter (see attached shortly), which defers
> the calling of maybeMergeSegments() so that it is only called every 2000
> times(an arbitrary figure), I appear to get a new rate of between 945-970
> documents/second.  Using Luke to look inside each index created between these 
> 2
> there does not appear to be any difference.  Same number of Documents, same
> number of Terms.
> I'm not suggesting one should apply this patch, I'm just highlighting the
> difference in performance that this sort of change gives you.  
> We are about to use Lucene to index 4 million construction document records, 
> and
> so speedi

[jira] Commented: (LUCENE-388) [PATCH] IndexWriter.maybeMergeSegments() takes lots of CPU resources

2006-08-14 Thread Paul Smith (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-388?page=comments#action_12427975 ] 

Paul Smith commented on LUCENE-388:
---

This is where some trace logging would be useful.  Maybe a YourKit memory 
snapshot to see what's going on..?  I can't see why Yonik's patch should 
influence the memory profile: it's just delaying the check for merging until an 
appropriate time, and should not be removing opportunities to merge segments.  
I can't see why checking less often would use more memory.

Obviously something strange is happening.  

> [PATCH] IndexWriter.maybeMergeSegments() takes lots of CPU resources
> 
>
> Key: LUCENE-388
> URL: http://issues.apache.org/jira/browse/LUCENE-388
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
>Affects Versions: CVS Nightly - Specify date in submission
> Environment: Operating System: Mac OS X 10.3
> Platform: Macintosh
>Reporter: Paul Smith
> Assigned To: Yonik Seeley
> Attachments: IndexWriter.patch, log-compound.txt, 
> log.optimized.deep.txt, log.optimized.txt, Lucene Performance Test - with & 
> without hack.xls, lucene.34930.patch, yonik_indexwriter.diff
>
>
> Note: I believe this to be the same situation with 1.4.3 as with SVN HEAD.
> Analysis using hprof utility shows that during index creation with many
> documents highlights that the CPU spends a large portion of it's time in
> IndexWriter.maybeMergeSegments(), which seems to be a 'waste' compared with
> other valuable CPU intensive operations such as tokenization etc.
> Using the following test snippet to retrieve some rows from the db and create 
> an
> index:
> Analyzer a = new StandardAnalyzer();
> writer = new IndexWriter(indexDir, a, true);
> writer.setMergeFactor(1000);
> writer.setMaxBufferedDocs(1);
> writer.setUseCompoundFile(false);
> connection = DriverManager.getConnection(
> "jdbc:inetdae7:tower.aconex.com?database=", "secret",
> "squirrel");
> String sql = "select userid, userfirstname, userlastname, email from 
> userx";
> LOG.info("sql=" + sql);
> Statement statement = connection.createStatement();
> statement.setFetchSize(5000);
> LOG.info("Executing sql");
> ResultSet rs = statement.executeQuery(sql);
> LOG.info("ResultSet retrieved");
> int row = 0;
> LOG.info("Indexing users");
> long begin = System.currentTimeMillis();
> while (rs.next()) {
> int userid = rs.getInt(1);
> String firstname = rs.getString(2);
> String lastname = rs.getString(3);
> String email = rs.getString(4);
> String fullName = firstname + " " + lastname;
> Document doc = new Document();
> doc.add(Field.Keyword("userid", userid+""));
> doc.add(Field.Keyword("firstname", firstname.toLowerCase()));
> doc.add(Field.Keyword("lastname", lastname.toLowerCase()));
> doc.add(Field.Text("name", fullName.toLowerCase()));
> doc.add(Field.Keyword("email", email.toLowerCase()));
> writer.addDocument(doc);
> row++;
> if((row % 100)==0){
> LOG.info(row + " indexed");
> }
> }
> double end = System.currentTimeMillis();
> double diff = (end-begin)/1000;
> double rate = row/diff;
> LOG.info("rate:" +rate);
> On my 1.5GHz PowerBook with 1.5Gb RAM and a 5400 RPM drive, my CPU is maxed 
> out,
> and I end up getting a rate of indexing between 490-515 documents/second run
> over 10 times in succession.  
> By applying a simple patch to IndexWriter (see attached shortly), which defers
> the calling of maybeMergeSegments() so that it is only called every 2000
> times(an arbitrary figure), I appear to get a new rate of between 945-970
> documents/second.  Using Luke to look inside each index created between these 
> 2
> there does not appear to be any difference.  Same number of Documents, same
> number of Terms.
> I'm not suggesting one should apply this patch, I'm just highlighting the
> difference in performance that this sort of change gives you.  
> We are about to use Lucene to index 4 million construction document records, 
> and
> so speed

[jira] Commented: (LUCENE-675) Lucene benchmark: objective performance test for Lucene

2006-09-20 Thread Paul Smith (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-675?page=comments#action_12436437 ] 

Paul Smith commented on LUCENE-675:
---

If you're looking for freely available text in bulk, what about:

http://www.gutenberg.org/wiki/Main_Page

> Lucene benchmark: objective performance test for Lucene
> ---
>
> Key: LUCENE-675
> URL: http://issues.apache.org/jira/browse/LUCENE-675
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Andrzej Bialecki 
> Attachments: LuceneBenchmark.java
>
>
> We need an objective way to measure the performance of Lucene, both indexing 
> and querying, on a known corpus. This issue is intended to collect comments 
> and patches implementing a suite of such benchmarking tests.
> Regarding the corpus: one of the widely used and freely available corpora is 
> the original Reuters collection, available from 
> http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.tar.gz 
> or 
> http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/20news-18828.tar.gz.
>  I propose to use this corpus as a base for benchmarks. The benchmarking 
> suite could automatically retrieve it from known locations, and cache it 
> locally.




[jira] Commented: (LUCENE-675) Lucene benchmark: objective performance test for Lucene

2006-09-20 Thread Paul Smith (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-675?page=comments#action_12436443 ] 

Paul Smith commented on LUCENE-675:
---

From a strict performance point of view, a standard set of documents is 
important, but don't forget other languages.

From a tokenization point of view (separate to this issue), perhaps the 
Gutenberg project would be useful to test the correctness of the analysis phase.

> Lucene benchmark: objective performance test for Lucene
> ---
>
> Key: LUCENE-675
> URL: http://issues.apache.org/jira/browse/LUCENE-675
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Andrzej Bialecki 
> Attachments: LuceneBenchmark.java
>
>
> We need an objective way to measure the performance of Lucene, both indexing 
> and querying, on a known corpus. This issue is intended to collect comments 
> and patches implementing a suite of such benchmarking tests.
> Regarding the corpus: one of the widely used and freely available corpora is 
> the original Reuters collection, available from 
> http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.tar.gz 
> or 
> http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/20news-18828.tar.gz.
>  I propose to use this corpus as a base for benchmarks. The benchmarking 
> suite could automatically retrieve it from known locations, and cache it 
> locally.




[jira] Commented: (LUCENE-833) Indexing of Subversion Repositories.

2007-03-15 Thread Paul Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12481324
 ] 

Paul Smith commented on LUCENE-833:
---

You should try Fisheye!  It uses Lucene internally.

http://www.cenqua.com/fisheye

> Indexing of Subversion Repositories.
> 
>
> Key: LUCENE-833
> URL: https://issues.apache.org/jira/browse/LUCENE-833
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Index
>Reporter: Matt Inger
>
> It would be a big help if Lucene had the ability to index Subversion (or CVS, 
> or whatever) repositories, including revision history.
> Searches (beyond basic text of the source code) might include:
> path:/branches/mybranch  Foo
> history:Foo




[jira] Commented: (LUCENE-1741) Make MMapDirectory.MAX_BBUF user configureable to support chunking the index files in smaller parts

2009-07-13 Thread Paul Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12730551#action_12730551
 ] 

Paul Smith commented on LUCENE-1741:


An algorithm is nice if no specific settings have been specified, but in an 
environment where large indexes are opened more frequently than in the common 
use cases, what happens is that the memory layer hits OOM conditions too often, 
forcing too much GC activity to attempt the operation.

I'd vote for checking whether settings have been requested and using them, and 
if none are set, relying on a self-tuning algorithm.

In a really long-running application, the process address space may become more 
and more fragmented, and the malloc library may not be able to defragment it, 
so the auto-tuning is nice, but it may not be great for all people's needs.

For example, our specific use case (crazy as this may be) is to have many 
different indexes open at any one time, closing and opening them frequently 
(the Realtime Search stuff we are following very closely indeed.. :) ).  I'm 
just thinking that our VM (64bit) may find it difficult to find the contiguous 
non-heap space for the MMap operation after many days/weeks in operation.  

Maybe I'm just paranoid. But for operational purposes, it'd be nice to know we 
could change the setting based on our observations.

thanks!
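
A sketch of the 'explicit setting wins, otherwise self-tune' behaviour being 
asked for here; the class, methods, and default values are hypothetical, not 
the actual MMapDirectory API:

{code}
class ChunkSizeConfig {
    private Integer requestedChunkSizeMB; // null = the operator has not set anything

    void setChunkSizeMB(int mb) {
        this.requestedChunkSizeMB = Integer.valueOf(mb);
    }

    int effectiveChunkSizeMB() {
        if (requestedChunkSizeMB != null) {
            return requestedChunkSizeMB.intValue(); // explicit operational override wins
        }
        // otherwise fall back to a self-tuning heuristic, e.g. based on address-space width
        boolean is64Bit = System.getProperty("os.arch", "").indexOf("64") >= 0;
        return is64Bit ? 1024 : 256; // purely illustrative defaults
    }
}
{code}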



> Make MMapDirectory.MAX_BBUF user configureable to support chunking the index 
> files in smaller parts
> ---
>
> Key: LUCENE-1741
> URL: https://issues.apache.org/jira/browse/LUCENE-1741
> Project: Lucene - Java
>  Issue Type: Improvement
>Affects Versions: 2.9
>Reporter: Uwe Schindler
>Assignee: Uwe Schindler
>Priority: Minor
> Fix For: 2.9
>
> Attachments: LUCENE-1741.patch, LUCENE-1741.patch
>
>
> This is a follow-up to the java-user thread: 
> http://www.lucidimagination.com/search/document/9ba9137bb5d8cb78/oom_with_2_9#9bf3b5b8f3b1fb9b
> It is easy to implement, just add a setter method for this parameter to 
> MMapDir.




[jira] Commented: (LUCENE-1749) FieldCache introspection API

2009-07-24 Thread Paul Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12735242#action_12735242
 ] 

Paul Smith commented on LUCENE-1749:


You know what would be the absolute icing on the cake here: some way, during 
introspection, for code to find large sort fields that can be 
discarded/unloaded as needed (programmatically).

What I'm thinking of here is a use case we've run into where we have had to 
sort by subject.  The number of unique subjects gets pretty large, and while we 
still need to support the use case, it'd be nice to be able to periodically 
'toss' sort fields like this so they don't hog memory permanently while the 
IndexReader is still in memory.  (Sorting by subject is used, just not often, 
so it's a good candidate for tossing.)

Because we have multiple large IndexReaders open concurrently, it'd be nice to 
be able to scan periodically and kick out any unneeded ones.

It's nice to be able to inspect and print out these, but even better if one can 
make changes based on what one finds.
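
A purely hypothetical shape for the eviction use case described above; none of 
these types or methods exist in the FieldCache API under discussion:

{code}
interface FieldCacheEntryView {
    String fieldName();
    long estimatedSizeBytes();
    void evict();
}

class FieldCacheJanitor {
    /** Periodically toss any sort-field entry over a size threshold, e.g. a rarely-used 'subject' field. */
    static void evictLargeEntries(Iterable<FieldCacheEntryView> entries, long maxBytes) {
        for (FieldCacheEntryView e : entries) {
            if (e.estimatedSizeBytes() > maxBytes) {
                e.evict();
            }
        }
    }
}
{code}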



> FieldCache introspection API
> 
>
> Key: LUCENE-1749
> URL: https://issues.apache.org/jira/browse/LUCENE-1749
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Reporter: Hoss Man
>Priority: Minor
> Fix For: 2.9
>
> Attachments: fieldcache-introspection.patch, LUCENE-1749.patch, 
> LUCENE-1749.patch, LUCENE-1749.patch, LUCENE-1749.patch
>
>
> FieldCache should expose an Expert level API for runtime introspection of the 
> FieldCache to provide info about what is in the FieldCache at any given 
> moment.  We should also provide utility methods for sanity checking that the 
> FieldCache doesn't contain anything "odd"...
>* entries for the same reader/field with different types/parsers
>* entries for the same field/type/parser in a reader and it's subreader(s)
>* etc...




[jira] Commented: (LUCENE-1935) Generify PriorityQueue

2009-10-01 Thread Paul Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12761395#action_12761395
 ] 

Paul Smith commented on LUCENE-1935:


I shall perhaps regret asking this, but is there any reason not to use 
java.util.PriorityQueue instead? Seems like reinventing the wheel a bit there 
(I understand historically why Lucene has this class).

(is Lucene 2.9+ now Java 5, or is that a different discussion altogether?)

> Generify PriorityQueue
> --
>
> Key: LUCENE-1935
> URL: https://issues.apache.org/jira/browse/LUCENE-1935
> Project: Lucene - Java
>  Issue Type: Task
>  Components: Other
>Affects Versions: 2.9
>Reporter: Uwe Schindler
>Assignee: Uwe Schindler
> Fix For: 3.0
>
> Attachments: LUCENE-1935.patch
>
>
> Priority Queue should use generics like all other Java 5 Collection API 
> classes. This is very simple, but makes the code more readable.




[jira] Commented: (LUCENE-1935) Generify PriorityQueue

2009-10-01 Thread Paul Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12761408#action_12761408
 ] 

Paul Smith commented on LUCENE-1935:


thanks Uwe, I thought I would regret asking; good points there.  Shame the JDK 
doesn't have a fixed-size PriorityQueue implementation; that seems a bit of a 
glaring omission.
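
For illustration, emulating a fixed-size 'keep the top N' queue on top of 
java.util.PriorityQueue looks something like the sketch below; note the 
two-step poll/offer where a purpose-built fixed-size queue can replace the head 
in a single sift, which is part of why Lucene keeps its own class:

{code}
import java.util.PriorityQueue;

class TopN {
    /** Keep only the n largest values seen; the JDK queue grows unbounded, so the bound is enforced by hand. */
    static PriorityQueue<Integer> keepTopN(int[] values, int n) {
        PriorityQueue<Integer> pq = new PriorityQueue<Integer>(n); // min-heap: head is the smallest kept
        for (int v : values) {
            if (pq.size() < n) {
                pq.offer(Integer.valueOf(v));
            } else if (pq.peek().intValue() < v) {
                pq.poll();                    // drop the current smallest...
                pq.offer(Integer.valueOf(v)); // ...then sift the new value in: two heap operations
            }
        }
        return pq;
    }
}
{code}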

> Generify PriorityQueue
> --
>
> Key: LUCENE-1935
> URL: https://issues.apache.org/jira/browse/LUCENE-1935
> Project: Lucene - Java
>  Issue Type: Task
>  Components: Other
>Affects Versions: 2.9
>Reporter: Uwe Schindler
>Assignee: Uwe Schindler
> Fix For: 3.0
>
> Attachments: LUCENE-1935.patch
>
>
> Priority Queue should use generics like all other Java 5 Collection API 
> classes. This is very simple, but makes the code more readable.




[jira] Commented: (LUCENE-1282) Sun hotspot compiler bug in 1.6.0_04/05 affects Lucene

2008-05-11 Thread Paul Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12595946#action_12595946
 ] 

Paul Smith commented on LUCENE-1282:


Another workaround might be to use '-client' instead of the default '-server' 
(for server-class machines).  This affects a few things, not least this switch:

-XX:CompileThreshold=10,000  Number of method invocations/branches before 
compiling [-client: 1,500]

-server implies a 10,000 value.  I have personally observed problems similar to 
the above with -server, and usually -client ends up 'solving' them.

I'm sure there was also a way to mark a method as not to be JIT-compiled 
(rather than resorting to -Xint, which disables it for everything), but now I 
can't find what that syntax is at all.
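
For what it's worth, the half-remembered syntax is most likely HotSpot's 
'.hotspot_compiler' file, read from the JVM's working directory at startup; 
this is a sketch only, and the class/method named below is an assumed example, 
not a recommendation:

{code}
exclude org/apache/lucene/index/FieldsWriter addDocument
{code}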

> Sun hotspot compiler bug in 1.6.0_04/05 affects Lucene
> --
>
> Key: LUCENE-1282
> URL: https://issues.apache.org/jira/browse/LUCENE-1282
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
>Affects Versions: 2.3, 2.3.1
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.4
>
>
> This is not a Lucene bug.  It's an as-yet not fully characterized Sun
> JRE bug, as best I can tell.  I'm opening this to gather all things we
> know, and to work around it in Lucene if possible, and maybe open an
> issue with Sun if we can reduce it to a compact test case.
> It's hit at least 3 users:
>   
> http://mail-archives.apache.org/mod_mbox/lucene-java-user/200803.mbox/[EMAIL 
> PROTECTED]
>   
> http://mail-archives.apache.org/mod_mbox/lucene-solr-user/200804.mbox/[EMAIL 
> PROTECTED]
>   
> http://mail-archives.apache.org/mod_mbox/lucene-java-user/200805.mbox/[EMAIL 
> PROTECTED]
> It's specific to at least JRE 1.6.0_04 and 1.6.0_05, that affects
> Lucene.  Whereas 1.6.0_03 works OK and it's unknown whether 1.6.0_06
> shows it.
> The bug affects bulk merging of stored fields.  When it strikes, the
> segment produced by a merge is corrupt because its fdx file (stored
> fields index file) is missing one document.  After iterating many
> times with the first user that hit this, adding diagnostics &
> assertions, it seems that a call to fieldsWriter.addDocument somehow
> either fails to run entirely, or fails to invoke its call to
> indexStream.writeLong.  It's as if when hotspot compiles a method,
> there's some sort of race condition in cutting over to the compiled
> code whereby a single method call fails to be invoked (speculation).
> Unfortunately, this corruption is silent when it occurs and only later
> detected when a merge tries to merge the bad segment, or an
> IndexReader tries to open it.  Here's a typical merge exception:
> {code}
> Exception in thread "Thread-10" 
> org.apache.lucene.index.MergePolicy$MergeException: 
> org.apache.lucene.index.CorruptIndexException:
> doc counts differ for segment _3gh: fieldsReader shows 15999 but 
> segmentInfo shows 16000
> at 
> org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:271)
> Caused by: org.apache.lucene.index.CorruptIndexException: doc counts differ 
> for segment _3gh: fieldsReader shows 15999 but segmentInfo shows 16000
> at 
> org.apache.lucene.index.SegmentReader.initialize(SegmentReader.java:313)
> at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:262)
> at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:221)
> at 
> org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:3099)
> at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:2834)
> at 
> org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:240)
> {code}
> and here's a typical exception hit when opening a searcher:
> {code}
> org.apache.lucene.index.CorruptIndexException: doc counts differ for segment 
> _kk: fieldsReader shows 72670 but segmentInfo shows 72671
> at 
> org.apache.lucene.index.SegmentReader.initialize(SegmentReader.java:313)
> at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:262)
> at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:230)
> at 
> org.apache.lucene.index.DirectoryIndexReader$1.doBody(DirectoryIndexReader.java:73)
> at 
> org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:636)
> at 
> or

[jira] Commented: (LUCENE-1282) Sun hotspot compiler bug in 1.6.0_04/05 affects Lucene

2008-05-14 Thread Paul Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12596964#action_12596964
 ] 

Paul Smith commented on LUCENE-1282:


Throwing up an idea here for consideration.  I'm sure it could be shot down, 
but I thought I'd raise it just in case it hasn't already been considered and 
discarded.. 

One of the _classic_ differences between -client and -server mode is the way 
the CPU registers are used.  Is it possible that some of the fields are 
suffering from concurrency issues?  I was wondering if, say, 
BufferedIndexOutput.buffer* may need to be marked volatile?

One easy way to test if this makes a difference is to just try switching 
between explicit use of '-client' and '-server'.  Most newer machines (even 
desktops & laptops) appear to qualify for Sun's 'am I a server-class machine' 
check.  If these problems disappear under -client, this to me would smell more 
and more like volatile-style behaviour, because AIUI -server is more aggressive 
with some of its register optimizations, and I've seen behaviour just like this 
where variables have clearly been written but the changes do not 'appear' on 
the other side.  Even the same thread making the change can be switched across 
to a different CPU right in the middle, and could see different results.

Of course, those people with lots of concurrency experience can probably 
dismiss this theory in a second, but that's fine.
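
A minimal, self-contained sketch of the kind of visibility hazard being 
speculated about here; it has nothing to do with Lucene's actual classes and is 
illustration only:

{code}
public class VisibilityDemo {
    // without 'volatile' here, the reader thread may never observe the write below
    static /* volatile */ boolean done = false;

    public static void main(String[] args) throws InterruptedException {
        Thread reader = new Thread(new Runnable() {
            public void run() {
                while (!done) {
                    // may spin forever: the JIT is free to hoist 'done' into a register
                }
                System.out.println("saw done=true");
            }
        });
        reader.setDaemon(true); // let the JVM exit even if the write never becomes visible
        reader.start();
        Thread.sleep(100);
        done = true; // without volatile (or other synchronization) this write may stay invisible
        reader.join(2000);
    }
}
{code}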

> Sun hotspot compiler bug in 1.6.0_04/05 affects Lucene
> --
>
> Key: LUCENE-1282
> URL: https://issues.apache.org/jira/browse/LUCENE-1282
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
>Affects Versions: 2.3, 2.3.1
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.4
>
> Attachments: corrupt_merge_out15.txt
>
>
> This is not a Lucene bug.  It's an as-yet not fully characterized Sun
> JRE bug, as best I can tell.  I'm opening this to gather all things we
> know, and to work around it in Lucene if possible, and maybe open an
> issue with Sun if we can reduce it to a compact test case.
> It's hit at least 3 users:
>   
> http://mail-archives.apache.org/mod_mbox/lucene-java-user/200803.mbox/[EMAIL 
> PROTECTED]
>   
> http://mail-archives.apache.org/mod_mbox/lucene-solr-user/200804.mbox/[EMAIL 
> PROTECTED]
>   
> http://mail-archives.apache.org/mod_mbox/lucene-java-user/200805.mbox/[EMAIL 
> PROTECTED]
> It's specific to at least JRE 1.6.0_04 and 1.6.0_05, that affects
> Lucene.  Whereas 1.6.0_03 works OK and it's unknown whether 1.6.0_06
> shows it.
> The bug affects bulk merging of stored fields.  When it strikes, the
> segment produced by a merge is corrupt because its fdx file (stored
> fields index file) is missing one document.  After iterating many
> times with the first user that hit this, adding diagnostics &
> assertions, it seems that a call to fieldsWriter.addDocument somehow
> either fails to run entirely, or fails to invoke its call to
> indexStream.writeLong.  It's as if when hotspot compiles a method,
> there's some sort of race condition in cutting over to the compiled
> code whereby a single method call fails to be invoked (speculation).
> Unfortunately, this corruption is silent when it occurs and only later
> detected when a merge tries to merge the bad segment, or an
> IndexReader tries to open it.  Here's a typical merge exception:
> {code}
> Exception in thread "Thread-10" 
> org.apache.lucene.index.MergePolicy$MergeException: 
> org.apache.lucene.index.CorruptIndexException:
> doc counts differ for segment _3gh: fieldsReader shows 15999 but 
> segmentInfo shows 16000
> at 
> org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:271)
> Caused by: org.apache.lucene.index.CorruptIndexException: doc counts differ 
> for segment _3gh: fieldsReader shows 15999 but segmentInfo shows 16000
> at 
> org.apache.lucene.index.SegmentReader.initialize(SegmentReader.java:313)
> at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:262)
> at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:221)
> at 
> org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:3099)
> at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:2834)
> at 
> org.apache.lucene.index.ConcurrentMergeSch

[jira] Commented: (LUCENE-1282) Sun hotspot compiler bug in 1.6.0_04/05 affects Lucene

2008-07-13 Thread Paul Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12613208#action_12613208
 ] 

Paul Smith commented on LUCENE-1282:


Can anyone comment as to whether the JRE 1.6.0_04+ bug affects any _earlier_ 
versions of Lucene? (Say, 2.0, which we're still using.)

I was just reviewing this issue and noticed Michael mentioned this behaviour 
shows in both the ConcurrentMergeScheduler and the SerialMergeScheduler.  
AIUI, the SerialMergeScheduler is effectively the 'old' way of previous 
versions of Lucene, so I'm just starting to think about what effect 1.6.0_04 
might have on earlier versions (this bug is only marked as affecting 2.3+).

The reason I ask is that we're just about to upgrade to 1.6.0_04 -server on 
some of our production machines.  (The reason for not going to 1.6.0_06 is that 
we only started our development test cycle months ago and stuck with _04 until 
the next cycle.)

> Sun hotspot compiler bug in 1.6.0_04/05 affects Lucene
> --
>
> Key: LUCENE-1282
> URL: https://issues.apache.org/jira/browse/LUCENE-1282
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
>Affects Versions: 2.3, 2.3.1
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.4
>
> Attachments: corrupt_merge_out15.txt, crashtest, crashtest.log, 
> hs_err_pid27359.log
>
>
> This is not a Lucene bug.  It's an as-yet not fully characterized Sun
> JRE bug, as best I can tell.  I'm opening this to gather all things we
> know, and to work around it in Lucene if possible, and maybe open an
> issue with Sun if we can reduce it to a compact test case.
> It's hit at least 3 users:
>   
> http://mail-archives.apache.org/mod_mbox/lucene-java-user/200803.mbox/[EMAIL 
> PROTECTED]
>   
> http://mail-archives.apache.org/mod_mbox/lucene-solr-user/200804.mbox/[EMAIL 
> PROTECTED]
>   
> http://mail-archives.apache.org/mod_mbox/lucene-java-user/200805.mbox/[EMAIL 
> PROTECTED]
> It's specific to at least JRE 1.6.0_04 and 1.6.0_05, that affects
> Lucene.  Whereas 1.6.0_03 works OK and it's unknown whether 1.6.0_06
> shows it.
> The bug affects bulk merging of stored fields.  When it strikes, the
> segment produced by a merge is corrupt because its fdx file (stored
> fields index file) is missing one document.  After iterating many
> times with the first user that hit this, adding diagnostics &
> assertions, it seems that a call to fieldsWriter.addDocument somehow
> either fails to run entirely, or fails to invoke its call to
> indexStream.writeLong.  It's as if when hotspot compiles a method,
> there's some sort of race condition in cutting over to the compiled
> code whereby a single method call fails to be invoked (speculation).
> Unfortunately, this corruption is silent when it occurs and only later
> detected when a merge tries to merge the bad segment, or an
> IndexReader tries to open it.  Here's a typical merge exception:
> {code}
> Exception in thread "Thread-10" 
> org.apache.lucene.index.MergePolicy$MergeException: 
> org.apache.lucene.index.CorruptIndexException:
> doc counts differ for segment _3gh: fieldsReader shows 15999 but 
> segmentInfo shows 16000
> at 
> org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:271)
> Caused by: org.apache.lucene.index.CorruptIndexException: doc counts differ 
> for segment _3gh: fieldsReader shows 15999 but segmentInfo shows 16000
> at 
> org.apache.lucene.index.SegmentReader.initialize(SegmentReader.java:313)
> at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:262)
> at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:221)
> at 
> org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:3099)
> at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:2834)
> at 
> org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:240)
> {code}
> and here's a typical exception hit when opening a searcher:
> {code}
> org.apache.lucene.index.CorruptIndexException: doc counts differ for segment 
> _kk: fieldsReader shows 72670 but segmentInfo shows 72671
> at 
> org.apache.lucene.index.SegmentReader.initialize(SegmentReader.java:313)
> at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:262)
> at org.apache.lucene.index.SegmentReader.get(SegmentReader.ja

[jira] Commented: (LUCENE-1372) Proposal: introduce more sensible sorting when a doc has multiple values for a term

2008-09-04 Thread Paul Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628480#action_12628480
 ] 

Paul Smith commented on LUCENE-1372:


Having a Document sorted last because it has "zebra", even though it also has 
"apple", seems way incorrect.  Yes, it would be ideal if Lucene _could_ perform 
the multi-term sort properly, but in the absence of an effective fix in the 
short term, having the lexicographically earlier term 'picked' as the primary 
sort candidate is likely to generate results that match what users would expect 
(even if it's not quite perfect).

Right now it looks blatantly silly at the presentation layer when one presents 
the search results with their data, and shows that "apple,zebra" appears last 
in the list..

> Proposal: introduce more sensible sorting when a doc has multiple values for 
> a term
> ---
>
> Key: LUCENE-1372
> URL: https://issues.apache.org/jira/browse/LUCENE-1372
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.3.2
>Reporter: Paul Cowan
>Priority: Minor
> Attachments: lucene-multisort.patch
>
>
> At the moment, FieldCacheImpl has somewhat disconcerting values when sorting 
> on a field for which multiple values exist for one document. For example, 
> imagine a field "fruit" which is added to a document multiple times, with the 
> values as follows:
> doc 1: {"apple"}
> doc 2: {"banana"}
> doc 3: {"apple", "banana"}
> doc 4: {"apple", "zebra"}
> if one sorts on the field "fruit", the loop in 
> FieldCacheImpl.stringsIndexCache.createValue() (and similarly for the other 
> methods in the various FieldCacheImpl caches) does the following:
>   while (termDocs.next()) {
> retArray[termDocs.doc()] = t;
>   }
> which means that we look over the terms in their natural order and, on each 
> one, overwrite retArray[doc] with the value for each document with that term. 
> Effectively, this overwriting means that a string sort in this circumstance 
> will sort by the LAST term lexicographically, so the docs above will 
> effectively be sorted as if they had the single values ("apple", "banana", 
> "banana", "zebra"), which is nonintuitive. To change this to sort on the first 
> term in the TermEnum seems relatively trivial and low-overhead; while it's 
> not perfect (it's not locale-aware, for example) the behaviour seems much more 
> sensible to me. Interested to see what people think.
> Patch to follow.
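
The proposed change amounts to not letting later (lexicographically greater) 
terms overwrite earlier ones.  Roughly, and assuming the surrounding 
createValue() loop plus the convention that slot 0 means 'no term seen yet for 
this doc':

{code}
while (termDocs.next()) {
    int doc = termDocs.doc();
    if (retArray[doc] == 0) { // keep the FIRST (lexicographically smallest) term for the doc
        retArray[doc] = t;
    }
}
{code}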




[jira] Commented: (LUCENE-1372) Proposal: introduce more sensible sorting when a doc has multiple values for a term

2008-09-04 Thread Paul Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628513#action_12628513
 ] 

Paul Smith commented on LUCENE-1372:


bq. I'm not following this argument. Will it be less silly when {zebra,apple} 
sorts before {banana} ?

Well, at the presentation layer I don't think you'd present it like that (we 
don't).  We'd sort the list of attributes so that it would appear as 
"apple,zebra".

> Proposal: introduce more sensible sorting when a doc has multiple values for 
> a term
> ---
>
> Key: LUCENE-1372
> URL: https://issues.apache.org/jira/browse/LUCENE-1372
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.3.2
>Reporter: Paul Cowan
>Priority: Minor
> Attachments: lucene-multisort.patch
>
>
> At the moment, FieldCacheImpl has somewhat disconcerting values when sorting 
> on a field for which multiple values exist for one document. For example, 
> imagine a field "fruit" which is added to a document multiple times, with the 
> values as follows:
> doc 1: {"apple"}
> doc 2: {"banana"}
> doc 3: {"apple", "banana"}
> doc 4: {"apple", "zebra"}
> if one sorts on the field "fruit", the loop in 
> FieldCacheImpl.stringsIndexCache.createValue() (and similarly for the other 
> methods in the various FieldCacheImpl caches) does the following:
>   while (termDocs.next()) {
> retArray[termDocs.doc()] = t;
>   }
> which means that we look over the terms in their natural order and, on each 
> one, overwrite retArray[doc] with the value for each document with that term. 
> Effectively, this overwriting means that a string sort in this circumstance 
> will sort by the LAST term lexicographically, so the docs above will 
> effectively be sorted as if they had the single values ("apple", "banana", 
> "banana", "zebra"), which is nonintuitive. To change this to sort on the first 
> term in the TermEnum seems relatively trivial and low-overhead; while it's 
> not perfect (it's not locale-aware, for example) the behaviour seems much more 
> sensible to me. Interested to see what people think.
> Patch to follow.




[jira] Commented: (LUCENE-1342) 64bit JVM crashes on Linux

2008-11-18 Thread Paul Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12648768#action_12648768
 ] 

Paul Smith commented on LUCENE-1342:


java version "1.6.0_10"
Java(TM) SE Runtime Environment (build 1.6.0_10-b33)
Java HotSpot(TM) 64-Bit Server VM (build 11.0-b15, mixed mode)

For clarity, there are two Pauls (myself included) and Alison here on the 
discussion thread, all from Aconex.  We're all talking about the same problem 
at the same company, but are sharing in the discussion based on the different 
analysis each of us is doing.

We've recently upgraded from Lucene 2.0 to 2.2 (yes, way behind, but we're 
cautious here..), and are about 4 days from going into production with it.

First off, an observation.  The original bug report here was reported against 
Lucene 2.0, which we've been using in production for nearly 2 years against a 
few different JVMs (Java 1.5, plus a few builds of Java 1.6 up to and 
including 1.6.0_04).  We've never encountered this in production or in our load 
test area using Lucene 2.0.  However, as soon as we switched to Lucene 2.2, 
using the same JRE as production (1.6.0_04), we started seeing these problems.  
After reviewing another HotSpot crash bug (LUCENE-1282) we decided to see if 
JRE 1.6.0_10 made a difference.  Initially it did: we didn't find a problem 
over several load-testing runs and we thought we were fine.  Then a few weeks 
later we started to see it occurring more frequently, yet none of the code 
changes in our application since the initial 1.6.0_10 switch could logically be 
connected to the indexing system at all (our application is split between an 
App and an Index/Search server, and the SVN diff between the load-testing tag 
runs didn't have any code change that was Indexer/Search related).

At the same time we had a strange network problem going on in the load-testing 
area that was causing problems with the App talking to the Indexer, which 
turned out to be caused by a local DNS problem.  Inexplicably, the JRE crash 
hasn't happened since then, that I'm aware of; how that could be related to the 
JRE's HotSpot compilation of Lucene byte-code, I have no idea.. BUT, since we 
had several weeks of stability and then several crashes, this is purely 
anecdotal/coincidental.  I'm still rubbing my rabbit's foot here.  I need to 
chat with Alison & Paul Cowan about this to get more specific details about 
if/when the crash has occurred since the DNS problem was resolved, because it 
could purely be a statistical anomaly (we simply may not have done enough runs 
to flush it out), and frankly I could be mistaken about the # of crashes in the 
load-testing env.

For incremental indexing (which is what is happening during the load test that 
crashes) we are using the compound file format, mergeFactor=default(10), 
minMergeDocs=200, maxMergeDocs=default(MAX_INT).  It's pretty vanilla really.. 
(The reason for a low mergeFactor is that we have several hundred indexes open 
at the same time for different projects, so open file handles become a 
problem.)

I'll let Alison/Paul Cowan comment further; this is just my 5 Aussie cents' 
worth.
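
In Lucene 2.2-era API terms, that configuration looks roughly like the sketch 
below; indexDir and the analyzer choice are placeholders, and 'minMergeDocs' 
maps to setMaxBufferedDocs in the 2.x API:

{code}
import java.io.File;
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

class IncrementalIndexing {
    static IndexWriter openWriter(File indexDir) throws IOException {
        IndexWriter writer = new IndexWriter(indexDir, new StandardAnalyzer(), false); // false = append
        writer.setUseCompoundFile(true); // compound file format: fewer open file handles
        writer.setMergeFactor(10);       // the default; kept low for the same reason
        writer.setMaxBufferedDocs(200);  // 'minMergeDocs' in older releases
        // maxMergeDocs is left at its default (Integer.MAX_VALUE)
        return writer;
    }
}
{code}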



> 64bit JVM crashes on Linux
> --
>
> Key: LUCENE-1342
> URL: https://issues.apache.org/jira/browse/LUCENE-1342
> Project: Lucene - Java
>  Issue Type: Bug
>Affects Versions: 2.0.0
> Environment: 2.6.18-53.el5 x86_64  GNU/Linux
> Java(TM) SE Runtime Environment (build 1.6.0_04-b12)
>Reporter: Kevin Richards
>
> Whilst running lucene in our QA environment we received the following 
> exception. This problem was also reported here : 
> http://confluence.atlassian.com/display/KB/JSP-20240+-+POSSIBLE+64+bit+JDK+1.6+update+4+may+have+HotSpot+problems.
> Is this a JVM problem or a problem in Lucene.
> #
> # An unexpected error has been detected by Java Runtime Environment:
> #
> #  SIGSEGV (0xb) at pc=0x2adb9e3f, pid=2275, tid=1085356352
> #
> # Java VM: Java HotSpot(TM) 64-Bit Server VM (10.0-b19 mixed mode linux-amd64)
> # Problematic frame:
> # V  [libjvm.so+0x1fce3f]
> #
> # If you would like to submit a bug report, please visit:
> #   http://java.sun.com/webapps/bugreport/crash.jsp
> #
> ---  T H R E A D  ---
> Current thread (0x2aab0007f000):  JavaThread "CompilerThread0" daemon 
> [_thread_in_vm, id=2301, stack(0x40a13000,0x40b14000)]
> siginfo:si_signo=SIGSEGV: si_errno=0, si_code=1 (SEGV_MAPERR), 
> si_addr=0x
> Registers:
> RAX=0x, RBX=0x2aab0007f000, RCX=0x, 
> RDX=0x2aab00309aa0
> RSP=0x40b10f60, RBP=0x40b10fb0, RSI=0x2

[jira] Updated: (LUCENE-1342) 64bit JVM crashes on Linux

2008-11-23 Thread Paul Smith (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Smith updated LUCENE-1342:
---

Attachment: hs_err_pid27882.log
hs_err_pid21301.log

2 crash dumps attached.

> 64bit JVM crashes on Linux
> --
>
> Key: LUCENE-1342
> URL: https://issues.apache.org/jira/browse/LUCENE-1342
> Project: Lucene - Java
>  Issue Type: Bug
>Affects Versions: 2.0.0
> Environment: 2.6.18-53.el5 x86_64  GNU/Linux
> Java(TM) SE Runtime Environment (build 1.6.0_04-b12)
>Reporter: Kevin Richards
> Attachments: hs_err_pid10565.log, hs_err_pid21301.log, 
> hs_err_pid27882.log
>
>
> Whilst running lucene in our QA environment we received the following 
> exception. This problem was also reported here : 
> http://confluence.atlassian.com/display/KB/JSP-20240+-+POSSIBLE+64+bit+JDK+1.6+update+4+may+have+HotSpot+problems.
> Is this a JVM problem or a problem in Lucene.
> #
> # An unexpected error has been detected by Java Runtime Environment:
> #
> #  SIGSEGV (0xb) at pc=0x2adb9e3f, pid=2275, tid=1085356352
> #
> # Java VM: Java HotSpot(TM) 64-Bit Server VM (10.0-b19 mixed mode linux-amd64)
> # Problematic frame:
> # V  [libjvm.so+0x1fce3f]
> #
> # If you would like to submit a bug report, please visit:
> #   http://java.sun.com/webapps/bugreport/crash.jsp
> #
> ---  T H R E A D  ---
> Current thread (0x2aab0007f000):  JavaThread "CompilerThread0" daemon 
> [_thread_in_vm, id=2301, stack(0x40a13000,0x40b14000)]
> siginfo:si_signo=SIGSEGV: si_errno=0, si_code=1 (SEGV_MAPERR), 
> si_addr=0x
> Registers:
> RAX=0x, RBX=0x2aab0007f000, RCX=0x, 
> RDX=0x2aab00309aa0
> RSP=0x40b10f60, RBP=0x40b10fb0, RSI=0x2aaab37d1ce8, 
> RDI=0x2aaad000
> R8 =0x2b40cd88, R9 =0x0ffc, R10=0x2b40cd90, 
> R11=0x2b410810
> R12=0x2aab00ae60b0, R13=0x2aab0a19cc30, R14=0x40b112f0, 
> R15=0x2aab00ae60b0
> RIP=0x2adb9e3f, EFL=0x00010246, CSGSFS=0x0033, 
> ERR=0x0004
>   TRAPNO=0x000e
> Top of Stack: (sp=0x40b10f60)
> 0x40b10f60:   2aab0007f000 
> 0x40b10f70:   2aab0a19cc30 0001
> 0x40b10f80:   2aab0007f000 
> 0x40b10f90:   40b10fe0 2aab0a19cc30
> 0x40b10fa0:   2aab0a19cc30 2aab00ae60b0
> 0x40b10fb0:   40b10fe0 2ae9c2e4
> 0x40b10fc0:   2b413210 2b413350
> 0x40b10fd0:   40b112f0 2aab09796260
> 0x40b10fe0:   40b110e0 2ae9d7d8
> 0x40b10ff0:   2b40f3d0 2aab08c2a4c8
> 0x40b11000:   40b11940 2aab09796260
> 0x40b11010:   2aab09795b28 
> 0x40b11020:   2aab08c2a4c8 2aab009b9750
> 0x40b11030:   2aab09796260 40b11940
> 0x40b11040:   2b40f3d0 2023
> 0x40b11050:   40b11940 2aab09796260
> 0x40b11060:   40b11090 2b0f199e
> 0x40b11070:   40b11978 2aab08c2a458
> 0x40b11080:   2b413210 2023
> 0x40b11090:   40b110e0 2b0f1fcf
> 0x40b110a0:   2023 2aab09796260
> 0x40b110b0:   2aab08c2a3c8 40b123b0
> 0x40b110c0:   2aab08c2a458 40b112f0
> 0x40b110d0:   2b40f3d0 2aab00043670
> 0x40b110e0:   40b11160 2b0e808d
> 0x40b110f0:   2aab000417c0 2aab009b66a8
> 0x40b11100:    2aab009b9750
> 0x40b0:   40b112f0 2aab009bb360
> 0x40b11120:   0003 40b113d0
> 0x40b11130:   01002aab0052d0c0 40b113d0
> 0x40b11140:   00b3 40b112f0
> 0x40b11150:   40b113d0 2aab08c2a108 
> Instructions: (pc=0x2adb9e3f)
> 0x2adb9e2f:   48 89 5d b0 49 8b 55 08 49 8b 4c 24 08 48 8b 32
> 0x2adb9e3f:   4c 8b 21 8b 4e 1c 49 8d 7c 24 10 89 cb 4a 39 34 
> Stack: [0x40a13000,0x40b14000],  sp=0x40b10f60,  free 
> space=1015k
> Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native 
> code)
> V  [libjvm.so+0x1fce3f]
> V  [libjvm.so+0x2df2e4]
> V  [libjvm.so+0x2e07d8]
> V  [libjvm.so+0x52b08d]
> V  [libjvm

[jira] Commented: (LUCENE-1342) 64bit JVM crashes on Linux

2008-11-23 Thread Paul Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12650051#action_12650051
 ] 

Paul Smith commented on LUCENE-1342:


yeah, it's definitely a Sun bug, not a Lucene one, but like the other recent 
JVM crash issue it sort of 'affects' Lucene specifically.  Must be something 
about that byte code.  No idea why it does/does not trigger it.

We've raised a Sun bug, but it hasn't 'appeared' online yet (Paul Cowan raised 
it).  Will post the cross link to it once we have confirmation that Sun has 
deemed it 'worthy' to accept it.



> 64bit JVM crashes on Linux
> --
>
> Key: LUCENE-1342
> URL: https://issues.apache.org/jira/browse/LUCENE-1342
> Project: Lucene - Java
>  Issue Type: Bug
>Affects Versions: 2.0.0
> Environment: 2.6.18-53.el5 x86_64  GNU/Linux
> Java(TM) SE Runtime Environment (build 1.6.0_04-b12)
>Reporter: Kevin Richards
> Attachments: hs_err_pid10565.log, hs_err_pid21301.log, 
> hs_err_pid27882.log
>
>
> Whilst running lucene in our QA environment we received the following 
> exception. This problem was also reported here : 
> http://confluence.atlassian.com/display/KB/JSP-20240+-+POSSIBLE+64+bit+JDK+1.6+update+4+may+have+HotSpot+problems.
> Is this a JVM problem or a problem in Lucene.
> #
> # An unexpected error has been detected by Java Runtime Environment:
> #
> #  SIGSEGV (0xb) at pc=0x2adb9e3f, pid=2275, tid=1085356352
> #
> # Java VM: Java HotSpot(TM) 64-Bit Server VM (10.0-b19 mixed mode linux-amd64)
> # Problematic frame:
> # V  [libjvm.so+0x1fce3f]
> #
> # If you would like to submit a bug report, please visit:
> #   http://java.sun.com/webapps/bugreport/crash.jsp
> #
> ---  T H R E A D  ---
> Current thread (0x2aab0007f000):  JavaThread "CompilerThread0" daemon 
> [_thread_in_vm, id=2301, stack(0x40a13000,0x40b14000)]
> siginfo:si_signo=SIGSEGV: si_errno=0, si_code=1 (SEGV_MAPERR), 
> si_addr=0x
> Registers:
> RAX=0x, RBX=0x2aab0007f000, RCX=0x, 
> RDX=0x2aab00309aa0
> RSP=0x40b10f60, RBP=0x40b10fb0, RSI=0x2aaab37d1ce8, 
> RDI=0x2aaad000
> R8 =0x2b40cd88, R9 =0x0ffc, R10=0x2b40cd90, 
> R11=0x2b410810
> R12=0x2aab00ae60b0, R13=0x2aab0a19cc30, R14=0x40b112f0, 
> R15=0x2aab00ae60b0
> RIP=0x2adb9e3f, EFL=0x00010246, CSGSFS=0x0033, 
> ERR=0x0004
>   TRAPNO=0x000e
> Top of Stack: (sp=0x40b10f60)
> 0x40b10f60:   2aab0007f000 
> 0x40b10f70:   2aab0a19cc30 0001
> 0x40b10f80:   2aab0007f000 
> 0x40b10f90:   40b10fe0 2aab0a19cc30
> 0x40b10fa0:   2aab0a19cc30 2aab00ae60b0
> 0x40b10fb0:   40b10fe0 2ae9c2e4
> 0x40b10fc0:   2b413210 2b413350
> 0x40b10fd0:   40b112f0 2aab09796260
> 0x40b10fe0:   40b110e0 2ae9d7d8
> 0x40b10ff0:   2b40f3d0 2aab08c2a4c8
> 0x40b11000:   40b11940 2aab09796260
> 0x40b11010:   2aab09795b28 
> 0x40b11020:   2aab08c2a4c8 2aab009b9750
> 0x40b11030:   2aab09796260 40b11940
> 0x40b11040:   2b40f3d0 2023
> 0x40b11050:   40b11940 2aab09796260
> 0x40b11060:   40b11090 2b0f199e
> 0x40b11070:   40b11978 2aab08c2a458
> 0x40b11080:   2b413210 2023
> 0x40b11090:   40b110e0 2b0f1fcf
> 0x40b110a0:   2023 2aab09796260
> 0x40b110b0:   2aab08c2a3c8 40b123b0
> 0x40b110c0:   2aab08c2a458 40b112f0
> 0x40b110d0:   2b40f3d0 2aab00043670
> 0x40b110e0:   40b11160 2b0e808d
> 0x40b110f0:   2aab000417c0 2aab009b66a8
> 0x40b11100:    2aab009b9750
> 0x40b0:   40b112f0 2aab009bb360
> 0x40b11120:   0003 40b113d0
> 0x40b11130:   01002aab0052d0c0 40b113d0
> 0x40b11140:   00b3 40b112f0
> 0x40b11150:   40b113d0 2aab08c2a108 
> Instructions: (pc=0x2adb9e3f)
> 0x2adb9e2f:   48 89 5d b0 49 8b 55 0

[jira] Commented: (LUCENE-2075) Share the Term -> TermInfo cache across threads

2009-11-18 Thread Paul Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779641#action_12779641
 ] 

Paul Smith commented on LUCENE-2075:


bq. This cache impl should be able to support 1B operations per second for 
almost 300 years (i.e. the time it would take to overflow a long).

Hopefully Sun has released Java 7 by then. :)

> Share the Term -> TermInfo cache across threads
> ---
>
> Key: LUCENE-2075
> URL: https://issues.apache.org/jira/browse/LUCENE-2075
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Michael McCandless
>Priority: Minor
> Fix For: 3.1
>
> Attachments: ConcurrentLRUCache.java, LUCENE-2075.patch, 
> LUCENE-2075.patch, LUCENE-2075.patch
>
>
> Right now each thread creates its own (thread private) SimpleLRUCache,
> holding up to 1024 terms.
> This is rather wasteful, since if there are a high number of threads
> that come through Lucene, you're multiplying the RAM usage.  You're
> also cutting way back on likelihood of a cache hit (except the known
> multiple times we lookup a term within-query, which uses one thread).
> In NRT search we open new SegmentReaders (on tiny segments) often
> which each thread must then spend CPU/RAM creating & populating.
> Now that we are on 1.5 we can use java.util.concurrent.*, eg
> ConcurrentHashMap.  One simple approach could be a double-barrel LRU
> cache, using 2 maps (primary, secondary).  You check the cache by
> first checking primary; if that's a miss, you check secondary and if
> you get a hit you promote it to primary.  Once primary is full you
> clear secondary and swap them.
> Or... any other suggested approach?
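
A compact sketch of the double-barrel idea described above; it is simplified in 
that the barrel swap is not atomic with respect to concurrent puts, which a 
real implementation would need to address:

{code}
import java.util.concurrent.ConcurrentHashMap;

class DoubleBarrelCache<K, V> {
    private final int maxSize;
    private volatile ConcurrentHashMap<K, V> primary = new ConcurrentHashMap<K, V>();
    private volatile ConcurrentHashMap<K, V> secondary = new ConcurrentHashMap<K, V>();

    DoubleBarrelCache(int maxSize) {
        this.maxSize = maxSize;
    }

    V get(K key) {
        V v = primary.get(key);
        if (v != null) return v;
        v = secondary.get(key);
        if (v != null) primary.put(key, v); // hit in secondary: promote it to primary
        return v;
    }

    void put(K key, V value) {
        if (primary.size() >= maxSize) {
            secondary = primary;                     // the full primary becomes the new secondary...
            primary = new ConcurrentHashMap<K, V>(); // ...and a fresh (cleared) primary takes over
        }
        primary.put(key, value);
    }
}
{code}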




[jira] Commented: (LUCENE-966) A faster JFlex-based replacement for StandardAnalyzer

2007-07-26 Thread Paul Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12515882
 ] 

Paul Smith commented on LUCENE-966:
---

We did pretty much the same thing here at Aconex.  The tokenization mechanism 
in the old JavaCC-based analyser is woeful compared to what JFlex outputs.

Nice work!  

> A faster JFlex-based replacement for StandardAnalyzer
> -
>
> Key: LUCENE-966
> URL: https://issues.apache.org/jira/browse/LUCENE-966
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Analysis
>Reporter: Stanislaw Osinski
> Fix For: 2.3
>
> Attachments: AnalyzerBenchmark.java, jflex-analyzer-patch.txt
>
>
> JFlex (http://www.jflex.de/) can be used to generate a faster (up to several 
> times) replacement for StandardAnalyzer. Will add a patch and a simple 
> benchmark code in a while.
