Re: Indexing other documents type than html and txt (XML)
I second the motion to have a place to store contributed Document generators. I've developed an HTML file handler that creates a Document using JTidy under the covers to DOM-ify the input, pull only the non-tag text into a content field, and strip out the title as a separate field. It would actually be far more extensible if it handed the DOM-ified HTML off to Peter's XMLDocument class, so that XPath could be used to turn things into fields. I'm not sure how my code compares to the demo HTMLParser.jj (mine probably requires cleaner HTML, and may not be as fast, but it has the ability to use a DOM to extract elements/attributes). How does lucene-dev feel about creating a 'contrib' area in CVS for these kinds of things that folks really need to make Lucene come to life for them, but that are obviously not part of the main engine?

Erik

----- Original Message -----
From: [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Thursday, November 29, 2001 12:03 PM
Subject: Re: Indexing other documents type than html and txt (XML)

> I have started to create a set of generic Lucene document types that can
> be easily manipulated depending on the fields. [...]
> Here is my current version of the XMLDocument, based on the Document
> example from Lucene. [...] I hope this helps.
> --Peter
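Erik's JTidy approach (DOM-ify the HTML, keep only the text, pull the title into its own field) can be illustrated with standard JDK DOM classes, assuming the input is already well-formed XHTML, which is exactly what JTidy produces from messy HTML. This is a hedged sketch, not Erik's actual handler; the class and method names here are made up for illustration.

```java
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import java.io.ByteArrayInputStream;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class XhtmlTextExtractor {

    /** Recursively collects the content of all text nodes under the given node. */
    static void collectText(Node node, StringBuffer out) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            out.append(node.getNodeValue());
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collectText(children.item(i), out);
        }
    }

    /** Returns the contents of the first title element, or "" if there is none. */
    static String title(Document dom) {
        NodeList titles = dom.getElementsByTagName("title");
        if (titles.getLength() == 0) return "";
        StringBuffer sb = new StringBuffer();
        collectText(titles.item(0), sb);
        return sb.toString().trim();
    }

    /** Parses a well-formed XHTML string into a DOM. */
    static Document parse(String xhtml) throws Exception {
        DocumentBuilder b = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return b.parse(new ByteArrayInputStream(xhtml.getBytes("UTF-8")));
    }
}
```

The `collectText()` output would go into the content field and `title()` into the title field of the Lucene Document.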
Re: Indexing other documents type than html and txt
Here is another version of something I had posted earlier. It attempts to read the text out of binary files. It is not perfect and doesn't work at all on PDF. It permits you to use the Reader form of a Field when indexing.

--

import java.util.*;
import java.io.*;

/**
 * <p>This class is designed to retrieve text from binary files. The occasion
 * for its development was to find a generic way to index typical office
 * documents, which are almost always in a proprietary and binary form.
 *
 * <p>This class will <b>not</b> work with PDF files.
 *
 * <p>You can exercise some control over the result by using the
 * <code>setCharArray()</code> method and the <code>setShortestToken()</code>
 * method.
 * <ul>
 * <li><code>setCharArray()</code>: allows you to override the default
 * characters to keep. All others are eliminated. The default keepers are all
 * ASCII characters plus whitespace. This means that if a text file is the
 * input, it will pass through unchanged (except that consecutive blanks are
 * squeezed to a single blank).
 * <li><code>setShortestToken()</code>: allows you to keep only strings of a
 * minimum length. By default the length is zero, meaning that all tokens are
 * passed.
 * </ul>
 *
 * <p>Note lastly that this class is only designed to work with ASCII. It may
 * not be difficult to change it to support Unicode, but I do not know how to
 * do that.
 */
public class BinaryReader extends java.io.FilterReader {

    // private vars
    // for debugging
    private int count = 0;
    private int rawcnt = 0;

    private int shortestToken = 0;

    // default char set to keep, blank out everything else
    private char[][] charArray = {
        {'!', '~'},
        {'\t', '\t'},
        {'\r', '\r'},
        {'\n', '\n'},
    };

    private String leftovers = "";

    private char charFilter(char c) {
        for (int i = 0; i < charArray.length; i++) {
            if (c >= charArray[i][0] && c <= charArray[i][1]) {
                return c;
            }
        }
        return ' ';
    }

    public BinaryReader(Reader in) {
        super(in);
    }

    /**
     * <p>This method may be used to override the ranges of characters that
     * are retained. All others are eliminated. The default is:
     * <code>
     * private char[][] charArray = {
     *     {'!', '~'},
     *     {'\t', '\t'},
     *     {'\r', '\r'},
     *     {'\n', '\n'},
     * };
     * </code>
     * <p>Note that the ranges are inclusive and that to pick out a single
     * character instead of a range, just make that character both the min
     * and max (as shown for the whitespace characters above).
     * @param keepers array of ranges to keep
     */
    public void setCharArray(char[][] keepers) {
        // in each row, column 1 is min and column 2 is max
        // to pick out a single character instead of a range
        // just make it both min and max.
        charArray = keepers;
    }

    /**
     * <p>This method may be used to eliminate short strings of text. By
     * default it takes even single letters, since the value is initialized
     * to zero. For example, if the length 3 is used, single and two letter
     * strings will not be returned.
     * <p><b>Warning: the test doesn't always work for strings that begin a
     * line of text (at least in DOS/Windows).</b>
     * @param len length of shortest strings to pass
     */
    public void setShortestToken(int len) {
        shortestToken = len;
    }

    /**
     * <p>Reads a single character and runs it through the filter. The (int)
     * character returned will either be -1 for end-of-file, a blank
     * (indicating it was filtered), or the character unchanged.
     */
    public int read() throws IOException {
        int c = in.read();
        if (c == -1) return c;
        rawcnt++;
        count++;
        return charFilter((char) c);
    }

    /**
     * <p>Reads from the stream and populates the supplied char array.
     * @param cbuf character buffer to fill
     * @return number of characters actually placed into the buffer
     */
    public int read(char[] cbuf) throws IOException {
        return read(cbuf, 0, cbuf.length);
    }

    /**
     * <p>Reads from the stream and populates the supplied char array using
     * the offset and length provided.
     * @param cbuf character buffer to fill
     * @param off offset to begin filling the array
     * @param len maximum characters to place into the array
     * @return number of characters actually placed into the buffer
     */
    public int read(char[] cbuf, int off, int len) throws IOException {
        char[] cb = new char[len];
        int cnt = in.read(cb);
        if (cnt == -1) {
            //System.out.println("At end, rawcnt is " + rawcnt);
            return cnt; // done
        }
        int cnt2 = cnt;
        int loc = off;
        for (int i = 0; i < cnt; i++) {
            cbuf[loc++] = charFilter(cb[i]);
        }
        char[] weeded = filter(new String(cbuf, off, cnt));
        if (weeded.length > -1) {
            cnt2 = weeded.length;
            // redo buffer
            for (int i = 0; i < cnt2; i++) {
                cbuf[off + i] = weeded[i];
            }
        }
        rawcnt += cnt;
        count += cnt2;
        return cnt2;
    }

    private char[] filter(String instring) {
        // record the buffer size (ie, size of incoming string)
        int max = instring.length();
        // combine leftovers into incoming string and reset leftovers
        String s = leftovers + instring;
        leftovers = "";
        StringBuffer sb = new
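The heart of BinaryReader is the range-based character filter: every character outside the configured keep-ranges is replaced with a blank. The following standalone demo (a simplification written for this post, not the class above) shows how the default ranges behave:

```java
public class CharRangeFilterDemo {

    // same default ranges as BinaryReader: printable ASCII plus tab, CR, LF
    static final char[][] KEEP = {
        {'!', '~'}, {'\t', '\t'}, {'\r', '\r'}, {'\n', '\n'},
    };

    /** Returns c unchanged if it falls in a kept range, otherwise a blank. */
    static char filter(char c) {
        for (int i = 0; i < KEEP.length; i++) {
            if (c >= KEEP[i][0] && c <= KEEP[i][1]) return c;
        }
        return ' ';
    }

    /** Applies the filter to every character of a string. */
    static String filter(String s) {
        StringBuffer sb = new StringBuffer();
        for (int i = 0; i < s.length(); i++) {
            sb.append(filter(s.charAt(i)));
        }
        return sb.toString();
    }
}
```

Ordinary text passes through untouched, while control bytes from a binary file collapse to blanks, which the full class then squeezes and (optionally) weeds by token length.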
Re: Indexing other documents type than html and txt (XML)
I'll take on creating a Document repository. I would like to get some ideas from people about what kinds of Documents they are creating and what people want. What's the next step, Doug?

--Peter

On Friday, November 30, 2001, at 08:22 AM, Doug Cutting wrote:

From: Erik Hatcher [mailto:[EMAIL PROTECTED]]
> How does lucene-dev feel about creating a 'contrib' area in CVS for
> these kinds of things that folks really need to make Lucene come to
> life for them, but are obviously not part of the main engine?

I think this is a fine idea, but it needs to be managed. We don't want an area where anyone can upload anything. It could easily become filled with things that don't even compile, and would cause folks more headaches than it would relieve. So if someone would like to volunteer to administer this area, then I'm for it. Administration would include some limited testing of each contributed module, ensuring that each has documentation, rejecting poorly written modules, writing a top-level document that describes all contributed modules, etc. Anyone interested?

Doug
Re: Indexing other documents type than html and txt
You'd have to write parsers for each of those document types to convert them to text and then index that. Sure, you can feed it something like XML, but then you may want to consider something like xmldb.org instead.

Otis

--- Antonio Vazquez [EMAIL PROTECTED] wrote:
> Hi all,
> I have a doubt. I know that Lucene can index HTML and text documents,
> but can it index other types of documents like PDF, DOC, and XLS
> documents? If it can, how can I implement it? Perhaps it can be
> implemented like the HTML and txt indexing?
> regards
> Antonio
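The parser-per-document-type idea above can be sketched as a simple dispatch on file extension, so the indexer picks the right converter before handing text to Lucene. This is a sketch; the parser names are placeholders for whatever converters you actually plug in:

```java
import java.util.Hashtable;

public class ParserRegistry {

    // maps a lowercased file extension to a parser name (placeholders here)
    private final Hashtable parsers = new Hashtable();

    /** Registers a converter for files with the given extension. */
    public void register(String extension, String parserName) {
        parsers.put(extension.toLowerCase(), parserName);
    }

    /** Returns the parser registered for the file's extension, or null. */
    public String parserFor(String filename) {
        int dot = filename.lastIndexOf('.');
        if (dot < 0) return null;
        String ext = filename.substring(dot + 1).toLowerCase();
        return (String) parsers.get(ext);
    }
}
```

In a real indexer the table would hold converter objects rather than names, each turning its format into plain text (or a Reader) for a Field.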
Re: Indexing other documents type than html and txt (XML)
I have started to create a set of generic Lucene document types that can be easily manipulated depending on the fields. I know others have generated Documents out of PDF. Is there some place we can add contributed classes to the Lucene web page?

Here is my current version of the XMLDocument, based on the Document example from Lucene. It's a bit slow. It takes a path and, based on field name / xpath pairs (key / value) from either an array or a property file, generates an appropriate Lucene document with the specified fields. I have not tested all permutations of Document (I have used the File and Properties ones) and it works. Note: it uses the Xalan example ApplyXpath class to evaluate the xpaths. I hope this helps.

--Peter

--

package xxx.lucene.xml;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.DateField;
import java.util.Properties;
import java.util.Enumeration;
import java.io.File;
import java.io.FileInputStream;
// ApplyXpath is the Xalan sample class; adjust its package to your setup

/**
 * A utility for making a Lucene document from an XML source and a set of
 * xpaths, based on the Document example from Lucene.
 */
public class XMLDocument {

    private XMLDocument() {
    }

    /**
     * @param file document to be converted to a Lucene document
     * @param propertyList properties where the key is the field name and
     * the value is the XML xpath.
     * @throws FileNotFoundException
     * @throws Exception
     * @return Lucene document
     */
    public static Document Document(File file, Properties propertyList)
            throws java.io.FileNotFoundException, Exception {
        Document doc = new Document();
        // add path
        doc.add(Field.Text("path", file.getPath()));
        // add date modified
        doc.add(Field.Keyword("modified",
                DateField.timeToString(file.lastModified())));
        // add field list in property list
        Enumeration e = propertyList.propertyNames();
        while (e.hasMoreElements()) {
            String key = (String) e.nextElement();
            String xpath = propertyList.getProperty(key);
            // assumes the Xalan sample class exposes a helper returning
            // the matching values as a String[]
            String[] valueArray = ApplyXpath.apply(file.getPath(), xpath);
            StringBuffer value = new StringBuffer();
            for (int i = 0; i < valueArray.length; i++) {
                value.append(valueArray[i]);
            }
            //System.out.println("add key " + key + " with value = " + value);
            filter(key, value); // filter() is not shown in this post
            doc.add(Field.Text(key, value.toString()));
        }
        return doc;
    }

    /**
     * @return Lucene document
     * @param fieldNames field names for the Lucene document
     * @param file document to be converted to a Lucene document
     * @param xpaths XML xpaths for the information you want to get
     * @throws Exception
     */
    public static Document Document(File file, String[] fieldNames,
            String[] xpaths) throws Exception {
        if (fieldNames.length != xpaths.length) {
            throw new IllegalArgumentException("String arrays are not equal size");
        }
        Properties propertyList = new Properties();
        // generate properties from the arrays
        for (int i = 0; i < fieldNames.length; i++) {
            propertyList.setProperty(fieldNames[i], xpaths[i]);
        }
        Document doc = Document(file, propertyList);
        return doc;
    }

    /**
     * @param path path of the document to be converted to a Lucene document
     * @param fieldNames field names for the Lucene document
     * @param xpaths XML xpaths for the information you want to get
     * @throws Exception
     * @return Lucene document
     */
    public static Document Document(String path, String[] fieldNames,
            String[] xpaths) throws Exception {
        File file = new File(path);
        Document doc = Document(file, fieldNames, xpaths);
        return doc;
    }

    /**
     * @param path path of the document you want to convert to a Lucene document
     * @param propertyList properties where the key is the field name and
     * the value is the XML xpath.
     * @throws Exception
     * @return Lucene document
     */
    public static Document Document(String path, Properties propertyList)
            throws Exception {
        File file = new File(path);
        Document doc = Document(file, propertyList);
        return doc;
    }

    /**
     * @param documentPath path of the
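Peter's field-name/xpath mapping can also be driven by the javax.xml.xpath API that ships with current JDKs, avoiding the Xalan sample class entirely. This is a hedged sketch of that idea, not Peter's code; the class name and method are made up for illustration:

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import java.io.ByteArrayInputStream;
import java.util.Enumeration;
import java.util.Hashtable;
import java.util.Properties;
import org.w3c.dom.NodeList;

public class XPathFields {

    /**
     * Evaluates each field-name/xpath pair against the XML and returns a
     * table mapping field name to the concatenated matching text, the same
     * shape of result that XMLDocument feeds into Field.Text().
     */
    public static Hashtable extract(String xml, Properties fieldToXpath)
            throws Exception {
        org.w3c.dom.Document dom = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")));
        XPath xpath = XPathFactory.newInstance().newXPath();
        Hashtable fields = new Hashtable();
        Enumeration e = fieldToXpath.propertyNames();
        while (e.hasMoreElements()) {
            String field = (String) e.nextElement();
            NodeList nodes = (NodeList) xpath.evaluate(
                    fieldToXpath.getProperty(field), dom, XPathConstants.NODESET);
            StringBuffer value = new StringBuffer();
            for (int i = 0; i < nodes.getLength(); i++) {
                value.append(nodes.item(i).getTextContent());
            }
            fields.put(field, value.toString());
        }
        return fields;
    }
}
```

Each entry of the returned table could then be added to a Lucene Document as a field, exactly as the while loop in XMLDocument does.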