On Dec 20, 2004, at 4:08 AM, Daniel Cortes wrote:
I have to show my boss whether Lucene is the best option for creating a search engine for a new portal.
I want to know how many documents you have in your index.
And how big is your DB?

I highly recommend you use Luke to examine the index. It is a great tool to have handy. It shows these statistics and many others.


The formats that the portal has to support are html, jsp, txt, doc, pdf, and ppt.

HTML, TXT, DOC, and PDF are all quite straightforward to do. PPT is possible; perhaps POI will do the trick. JSP depends on how you want to analyze it. If any text in the file should be indexed (including JSP directives, taglibs, and HTML), then you can treat it as a text file. If you need to eliminate the tags, then you'll need to parse the JSP somehow. However, I strongly recommend that content not reside in JSP pages but rather in a content management system, database, or such.
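
As a rough illustration of the "eliminate the tags" route, here is a minimal sketch that strips JSP blocks and markup with regular expressions before the text is handed to an indexer. The class and method names (JspTextExtractor, extractText) are hypothetical, and the regex approach is an assumption made for brevity: it will misbehave on scriptlets that contain "%>" inside a string literal, so a real parser is likely the safer choice.

```java
import java.util.regex.Pattern;

public class JspTextExtractor {
    // JSP directives, scriptlets, expressions, and comments: <% ... %>
    private static final Pattern JSP_BLOCK = Pattern.compile("<%.*?%>", Pattern.DOTALL);
    // HTML and taglib tags such as <p>, </c:if>, <jsp:include ... />
    private static final Pattern TAG = Pattern.compile("<[^>]+>");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    /** Returns only the visible text of a JSP source string (crude, regex-based). */
    public static String extractText(String jsp) {
        String text = JSP_BLOCK.matcher(jsp).replaceAll(" ");
        text = TAG.matcher(text).replaceAll(" ");
        return WHITESPACE.matcher(text).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        String jsp = "<%@ page language=\"java\" %>\n"
                + "<html><body><h1>Welcome</h1>\n"
                + "<% int n = 1; %><p>Portal news</p></body></html>";
        System.out.println(extractText(jsp)); // prints "Welcome Portal news"
    }
}
```

The JSP blocks are removed first so that a ">" inside a scriptlet is not mistaken for the end of a tag.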


Another question that I have is:
I'm playing with the files from the book Lucene in Action, and I tried the handling types example. The data folder contains 5 files, and the created index contains five documents, but the only one that contains any words in the index is the .html file.
Does everybody get the same result?

Perhaps you are taking the output you see from "ant ExtensionFileHandler" as an indication of what words were indexed. This output, however, is showing Document.toString() which only shows the text in stored fields. This particular example does not actually index the documents - it shows the generalized handling framework and the parsing of the files into a Lucene Document. Most of the file handlers use unstored fields. The output I get is shown below. The handlers have successfully extracted the text from the files.

Maybe you're referring to the FileIndexer example? We did not expose this one to the Ant launcher. If FileIndexer is the code you're trying, let me know what you've tried and how you're looking for the words that you expect to see. Again, most of the fields are unstored (meaning the original content is not stored in the index, only the terms extracted through analysis).
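
To make the stored versus unstored distinction concrete, here is a small toy model in plain Java. It is not Lucene code: ToyField, containsTerm, and the split-on-non-word-characters "analyzer" are all assumptions invented for illustration, and the Text<...> / UnStored<...> labels merely imitate the output seen in this thread. The sketch shows how a field can be fully searchable while its toString() reveals no text.

```java
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

public class StoredVsUnstoredSketch {

    /** Toy field: always analyzed into terms; the original text is kept only if stored. */
    public static final class ToyField {
        private final String name;
        private final String storedValue; // null when the field is unstored
        private final Set<String> terms = new TreeSet<>();

        public ToyField(String name, String text, boolean stored) {
            this.name = name;
            this.storedValue = stored ? text : null;
            // Crude stand-in for an analyzer: lowercase, split on non-word characters.
            for (String term : text.toLowerCase(Locale.ROOT).split("\\W+")) {
                if (!term.isEmpty()) {
                    terms.add(term);
                }
            }
        }

        /** True if a search for this term would match the field. */
        public boolean containsTerm(String term) {
            return terms.contains(term.toLowerCase(Locale.ROOT));
        }

        /** Shows the value only when it was stored. */
        @Override
        public String toString() {
            return storedValue != null
                    ? "Text<" + name + ":" + storedValue + ">"
                    : "UnStored<" + name + ">";
        }
    }

    public static void main(String[] args) {
        ToyField title = new ToyField("title", "Laptop power supplies", true);
        ToyField body = new ToyField("body", "Code, Write, Fly", false);
        System.out.println(title);                    // Text<title:Laptop power supplies>
        System.out.println(body);                     // UnStored<body>
        System.out.println(body.containsTerm("fly")); // true
    }
}
```

The body field prints no words, yet a lookup for "fly" still succeeds, which is the situation described above.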


        Erik


# to make the output cleaner for e-mailing I set ANT_ARGS like this:
% echo $ANT_ARGS
-logger org.apache.tools.ant.NoBannerLogger -emacs -Dnopause=true

% ant ExtensionFileHandler -Dfile=src/lia/handlingtypes/data/addressbook-entry.xml
Buildfile: build.xml


ExtensionFileHandler:

This example demonstrates the file extension document handler.
Documents with extensions .xml, .rtf, .doc, .pdf, .html, and .txt are
all handled by the framework. The contents of the Lucene Document
built for the specified file is displayed.


skipping input as property nopause has already been set.
skipping input as property file has already been set.
Running lia.handlingtypes.framework.ExtensionFileHandler...
log4j:WARN No appenders could be found for logger (org.apache.commons.digester.Digester.sax).
log4j:WARN Please initialize the log4j system properly.
Document<Keyword<type:individual> Keyword<name:Zane Pasolini> Keyword<address:999 W. Prince St.> Keyword<city:New York> Keyword<province:NY> Keyword<postalcode:10013> Keyword<country:USA> Keyword<telephone:+1 212 345 6789>>


% ant ExtensionFileHandler -Dfile=src/lia/handlingtypes/data/HTML.html
Buildfile: build.xml

ExtensionFileHandler:

This example demonstrates the file extension document handler.
Documents with extensions .xml, .rtf, .doc, .pdf, .html, and .txt are
all handled by the framework. The contents of the Lucene Document
built for the specified file is displayed.


skipping input as property nopause has already been set.
skipping input as property file has already been set.
Running lia.handlingtypes.framework.ExtensionFileHandler...
Document<Text<title:Laptop power supplies are available in First Class only> Text<body:Code, Write, Fly This chapter is being written 11,000 meters above New Foundland.>>


% ant ExtensionFileHandler -Dfile=src/lia/handlingtypes/data/PlainText.txt
Buildfile: build.xml


ExtensionFileHandler:

This example demonstrates the file extension document handler.
Documents with extensions .xml, .rtf, .doc, .pdf, .html, and .txt are
all handled by the framework. The contents of the Lucene Document
built for the specified file is displayed.


skipping input as property nopause has already been set.
skipping input as property file has already been set.
Running lia.handlingtypes.framework.ExtensionFileHandler...
Document<UnStored<body>>

% ant ExtensionFileHandler -Dfile=src/lia/handlingtypes/data/PDF.pdf
Buildfile: build.xml

ExtensionFileHandler:

This example demonstrates the file extension document handler.
Documents with extensions .xml, .rtf, .doc, .pdf, .html, and .txt are
all handled by the framework. The contents of the Lucene Document
built for the specified file is displayed.


skipping input as property nopause has already been set.
skipping input as property file has already been set.
Running lia.handlingtypes.framework.ExtensionFileHandler...
log4j:WARN No appenders could be found for logger (org.pdfbox.pdfparser.PDFParser).
log4j:WARN Please initialize the log4j system properly.
Document<UnStored<body>>


% ant ExtensionFileHandler -Dfile=src/lia/handlingtypes/data/RTF.rtf
Buildfile: build.xml

ExtensionFileHandler:

This example demonstrates the file extension document handler.
Documents with extensions .xml, .rtf, .doc, .pdf, .html, and .txt are
all handled by the framework. The contents of the Lucene Document
built for the specified file is displayed.


skipping input as property nopause has already been set.
skipping input as property file has already been set.
Running lia.handlingtypes.framework.ExtensionFileHandler...
Document<UnStored<body>>

% ant ExtensionFileHandler -Dfile=src/lia/handlingtypes/data/MSWord.doc
Buildfile: build.xml

ExtensionFileHandler:

This example demonstrates the file extension document handler.
Documents with extensions .xml, .rtf, .doc, .pdf, .html, and .txt are
all handled by the framework. The contents of the Lucene Document
built for the specified file is displayed.


skipping input as property nopause has already been set.
skipping input as property file has already been set.
Running lia.handlingtypes.framework.ExtensionFileHandler...
Document<UnStored<body>>
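
For readers who want a feel for the extension-based dispatch that the runs above exercise, here is a minimal sketch under stated assumptions. It is not the book's ExtensionFileHandler: the class ExtensionDispatchSketch, its register and handle methods, and the use of plain strings in place of Lucene Documents are all hypothetical simplifications.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

public class ExtensionDispatchSketch {
    // Maps a lowercase file extension to a function that extracts text from raw content.
    private final Map<String, Function<String, String>> handlers = new HashMap<>();

    public void register(String extension, Function<String, String> handler) {
        handlers.put(extension.toLowerCase(Locale.ROOT), handler);
    }

    /** Picks a handler by file extension and returns the extracted text. */
    public String handle(String fileName, String rawContent) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            throw new IllegalArgumentException("No extension: " + fileName);
        }
        String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        Function<String, String> handler = handlers.get(ext);
        if (handler == null) {
            throw new IllegalArgumentException("No handler registered for ." + ext);
        }
        return handler.apply(rawContent);
    }

    public static void main(String[] args) {
        ExtensionDispatchSketch dispatcher = new ExtensionDispatchSketch();
        dispatcher.register("txt", String::trim);
        dispatcher.register("html",
                content -> content.replaceAll("<[^>]+>", " ").replaceAll("\\s+", " ").trim());
        System.out.println(dispatcher.handle("PlainText.txt", "  hello  "));          // hello
        System.out.println(dispatcher.handle("HTML.html", "<p>Code, Write, Fly</p>")); // Code, Write, Fly
    }
}
```

Keeping the handlers in a map means a new format only needs one more register call, with no change to the dispatch logic.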

