Thanks Otis, I tried Field.Keyword but it didn't seem to make any appreciable difference.
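For reference, what I tried was roughly this (a rough sketch against the Lucene 1.4-era API, not my exact code), along with the kind of stripped-down custom Analyzer Otis suggests below:

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class KeywordSketch {

    // Un-analyzed: the whole line becomes one term, so the Analyzer is
    // bypassed entirely. Word-level queries will no longer match, though;
    // this mainly serves to isolate the analysis cost.
    static Document asKeyword(String line) {
        Document doc = new Document();
        doc.add(Field.Keyword("Line", line));
        return doc;
    }

    // Bare-bones custom Analyzer: whitespace tokenization only,
    // no lowercasing and no stop words.
    static class BareAnalyzer extends Analyzer {
        public TokenStream tokenStream(String fieldName, Reader reader) {
            return new WhitespaceTokenizer(reader);
        }
    }
}

Note that Field.Keyword turns the whole line into a single term, which changes what a search can match, so I treated it purely as a diagnostic.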

I'll have a hunt around with a profiler and see what I can find. I guess my use case is unusual: I need to create a LOT of very small documents.

cheers,

Paul

Otis Gospodnetic wrote:

I believe most of the time is being spent in the Analyzer.  It should
be easy to empirically test this claim by using Field.Keyword instead
of Field.Text (Field.Keyword fields are not analyzed).  If that turns
out to be correct, then you could play with writing a custom and
optimal Analyzer.

Otis

--- Paul Smith <[EMAIL PROTECTED]> wrote:



This relates to a previous post of mine regarding Context of 'lines'
of text (log4j events in my case):





http://www.mail-archive.com/lucene-user@jakarta.apache.org/msg11869.html


I'm going through the process of writing quick-and-dirty test-case/test-bed classes to validate whether my ideas are going to work or not.

For my first test, I thought I would write a quick indexer that indexed a traditional log file by lines, with each line being a Document, so that I could then search for matching lines and then do a context search. Yes, this is exactly what 'grep' does and does very well, but I thought that if one was doing a lot of analysis of a log file (typical when mentally analysing log files) it might be best to index it once and then search quickly many times. A sketch of the Document layout I have in mind follows.
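Here is what I mean, as a hypothetical helper (the field names and the zero-padding scheme are my own illustration, not code from this mail): each line becomes a Document carrying an un-analyzed LineNumber term, so a matching line can be mapped back to its neighbours for context.

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class LineDocument {

    // One Document per log line. The line number is kept as a keyword
    // (un-analyzed) term so hits can be traced back to surrounding lines.
    static Document forLine(long lineNumber, String line) {
        Document doc = new Document();
        doc.add(Field.Keyword("LineNumber", pad(lineNumber)));
        doc.add(Field.UnStored("Line", line));
        return doc;
    }

    // Zero-pad so lexicographic term order matches numeric line order.
    static String pad(long n) {
        String s = Long.toString(n);
        StringBuffer sb = new StringBuffer();
        for (int i = s.length(); i < 10; i++) {
            sb.append('0');
        }
        return sb.append(s).toString();
    }
}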


Turns out that even using JUST a RAMDirectory (which surprised me), writing a Document for every line of text isn't as fast as I was hoping; it is taking significantly longer than I expected. I played around with the mergeFactor settings etc. (the sketch below shows the knobs I mean), but nothing really made much difference to the indexing speed, other than NOT adding the Document to the index at all.... I have tried this out on my Mac laptop as well as a test Linux server, with no noticeable difference. (Both scenarios have the log file being read and the new index on the same physical drive, which I know is not the _best_ setup, but still.)
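For reference, these are the tuning fields I was playing with on the Lucene 1.4-era IndexWriter (the values here are illustrative, not what I actually settled on):

import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.RAMDirectory;

public class TuningSketch {
    public static void main(String[] args) throws Exception {
        IndexWriter writer =
            new IndexWriter(new RAMDirectory(), new SimpleAnalyzer(), true);
        // Public tuning fields on the 1.4-era IndexWriter:
        writer.mergeFactor = 1000;   // merge segments less often
        writer.minMergeDocs = 1000;  // buffer more documents in RAM per segment
        writer.close();
    }
}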


This could well be my own stupidity, so here's what I'm doing.

Statistics on the Log File
=================

The log file is 28 MB, consisting of 409,566 lines, of the form:

[2004-12-21 00:00:00,935 INFO ][ommand.ProcessFaxCmd][http-80-Processor9][192.168.0.220][] Finished processing [mail box=stagingfax][MsgCount=0]
[2004-12-21 00:00:00,986 INFO ][ommand.ProcessFaxCmd][http-80-Processor9][192.168.0.220][] Finished processing [mail box=aconexnz9000][MsgCount=0]
[2004-12-21 00:00:01,126 INFO ][ monitor][http-80-Processor9][192.168.0.220][] Controller duration: 212ms url=/Fax, fowardDuration=-1, total=212
[2004-12-21 00:00:03,668 ERROR][essFaxDeliveryAction][Thread-157][][] Could not connect to mail server! [EMAIL PROTECTED]
javax.mail.AuthenticationFailedException: Login failed: authentication failure
        at com.sun.mail.imap.IMAPStore.protocolConnect(IMAPStore.java:330)
        at javax.mail.Service.connect(Service.java:233)
        at javax.mail.Service.connect(Service.java:134)
        at com.aconex.fax.action.ProcessFaxDeliveryAction.perform(ProcessFaxDeliveryAction.java:68)
        at com.aconex.scheduler.automatedTasks.FaxOutDeliveryMessageProcessorAT.run(FaxOutDeliveryMessageProcessorAT.java:62)

==================
Source code for test-bed:
==================

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.text.NumberFormat;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.RAMDirectory;

public class TestBed1 {

    public static void main(String[] args) throws Exception {
        if (args.length < 1) throw new IllegalArgumentException("not enough args");
        String filename = args[0];
        File file = new File(filename);
        Analyzer a = new SimpleAnalyzer();
        String indexLoc = "/tmp/testbed1/";
        //IndexWriter writer = new IndexWriter(indexLoc, a, true);

        // Index entirely in RAM to take disk I/O out of the measurement.
        RAMDirectory ramDir = new RAMDirectory();
        IndexWriter ramWriter = new IndexWriter(ramDir, a, true);

        long length = file.length();
        BufferedReader fileReader = new BufferedReader(new FileReader(file));
        String line;
        double processed = 0;
        NumberFormat nf = NumberFormat.getPercentInstance();
        nf.setMaximumFractionDigits(0);
        String percent;
        String lastPercent = " ";
        long lines = 0;

        // One Document per log line; the "Line" field is indexed but not stored.
        while ((line = fileReader.readLine()) != null) {
            Document doc = new Document();
            doc.add(Field.UnStored("Line", line));
            ramWriter.addDocument(doc);
            processed += line.length();
            lines++;
            percent = nf.format(processed / length);
            if (!percent.equals(lastPercent)) {
                lastPercent = percent;
                System.out.println(percent + " (lines=" + lines + ")");
            }
        }
        fileReader.close();
        ramWriter.close();
        //writer.optimize();
        //writer.close();
    }
}


=======

I did other simple tests measuring exactly how long it takes Java to just read the lines of the file, and that is mega quick in comparison. It's actually the "ramWriter.addDocument(doc)" line which seems to have the biggest amount of work to do, and probably for good reason. I had originally tried to use Field.Text(...) to keep the line with the index for context later on, but even Field.UnStored doesn't really make that much difference from a stopwatch point of view (Field.Text creates a bigger index, of course). A rough way to separate out the analysis cost is sketched below.
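To see how much of addDocument() is tokenization, one crude (hypothetical) test is to run the Analyzer alone over every line and drain the tokens, using the 1.4-era TokenStream API, without indexing anything:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.analysis.TokenStream;

public class AnalyzeOnly {
    public static void main(String[] args) throws Exception {
        Analyzer a = new SimpleAnalyzer();
        BufferedReader reader = new BufferedReader(new FileReader(args[0]));
        long start = System.currentTimeMillis();
        long tokens = 0;
        String line;
        while ((line = reader.readLine()) != null) {
            // Tokenize the line exactly as indexing would, but discard the output.
            TokenStream ts = a.tokenStream("Line", new StringReader(line));
            while (ts.next() != null) {
                tokens++;
            }
            ts.close();
        }
        reader.close();
        System.out.println(tokens + " tokens in "
                + (System.currentTimeMillis() - start) + "ms");
    }
}

If this accounts for most of the wall-clock time, the Analyzer is the bottleneck; if not, the cost is in the indexing machinery itself.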


I might set up a profiler and work through where it's taking the time, but you guys probably already know the answer.

I'm going to need much higher throughput for my utility to be useful.

Maybe that's just not achievable.

Thoughts?

cheers,

Paul Smith












