Re: Indexing other documents type than html and txt (XML)

2001-11-30 Thread Erik Hatcher

I second the motion to have a place to store contributed Document
generators.

I've developed an HTML file handler that creates a Document using JTidy
under the covers to DOM'ify the HTML, pulls only the untagged text into a
content field, and strips the title out as a separate field.  It would
actually be far more extensible if it handed the DOM'ified HTML off to
Peter's XMLDocument class so that XPath could be used to turn things into
fields.  I'm not sure how my code compares to the demo HTMLParser.jj (mine
probably requires cleaner HTML, and may not be as fast, but has the ability
to use a DOM to extract elements/attributes).
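For concreteness, here is a rough sketch of the title/content split, operating on markup that has already been tidied into well-formed XML (the plain JDK DOM parser here stands in for the JTidy step; the HtmlFields class and method names are just for illustration, not my actual handler):

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class HtmlFields {

    // Parse well-formed (already tidied) markup into a DOM tree.
    public static Document parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xhtml.getBytes("UTF-8")));
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    // The title, stripped out as its own field.
    public static String title(Document dom) {
        NodeList titles = dom.getElementsByTagName("title");
        return titles.getLength() > 0
                ? titles.item(0).getTextContent().trim() : "";
    }

    // All text under <body> with the tags removed, for the content field.
    public static String content(Document dom) {
        NodeList bodies = dom.getElementsByTagName("body");
        if (bodies.getLength() == 0) return "";
        return bodies.item(0).getTextContent().trim().replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        Document dom = parse("<html><head><title>Hello</title></head>"
                + "<body><h1>Hi</h1> <p>some text</p></body></html>");
        System.out.println(title(dom));   // Hello
        System.out.println(content(dom)); // Hi some text
    }
}
```

The two strings would then go into something like Field.Text("title", ...) and Field.Text("contents", ...).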

How does lucene-dev feel about creating a 'contrib' area in CVS for these
kinds of things that folks really need to make Lucene come to life for them,
but are obviously not part of the main engine?

Erik

- Original Message -
From: [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Thursday, November 29, 2001 12:03 PM
Subject: Re: Indexing other documents type than html and txt (XML)


 I have started to create a set of generic lucene document types that can
 be easily manipulated depending on the fields.
 I know others have generated Documents out of PDF.
 Is there some place we can add contributed classes to the lucene web
 page?

 Here is my current version of the XMLDocument based on . It's a bit slow.
 It uses a path (taken from the Document example) and, based on field name /
 xpath pairs (key / value) from either an array or a property file, generates
 an appropriate lucene document with the specified fields.

 I have not tested all permutations of Document (I have used the File and
 Properties variants) and it works.

 Note:
 It uses the xalan example ApplyXpath class to evaluate the xpaths against the XML.

 I hope this helps.

 --Peter

Re: Indexing other documents type than html and txt

2001-11-30 Thread Cecil, Paula New

Here is another version of something I had posted earlier.  It attempts to
read the text out of binary files.  It's not perfect and doesn't work at all
on PDF.  It permits you to use the Reader form of a Field to index.
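The core of the approach is a range-based character filter: keep the characters that fall inside a set of [min, max] ranges, blank out everything else, and squeeze runs of blanks. A self-contained sketch of just that piece (the CharRangeFilter class name is mine, not from the code below):

```java
public class CharRangeFilter {
    private final char[][] ranges;

    public CharRangeFilter(char[][] ranges) { this.ranges = ranges; }

    // Keep chars inside any [min, max] range; blank out everything else.
    public char filter(char c) {
        for (char[] r : ranges) {
            if (c >= r[0] && c <= r[1]) return c;
        }
        return ' ';
    }

    // Filter a whole string, squeezing runs of blanks to a single blank.
    public String filter(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) sb.append(filter(s.charAt(i)));
        return sb.toString().replaceAll("  +", " ");
    }

    public static void main(String[] args) {
        // Default keepers from the class below: printable ASCII plus whitespace.
        char[][] keep = { {'!', '~'}, {'\t', '\t'}, {'\r', '\r'}, {'\n', '\n'} };
        CharRangeFilter f = new CharRangeFilter(keep);
        System.out.println(f.filter("abc\u0000\u0001def")); // abc def
    }
}
```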
import java.util.*;
import java.io.*;

/**
<p>This class is designed to retrieve text from binary files.
The occasion for its development was to find a generic way to
index typical office documents, which are almost always in a
proprietary and binary form.
<p>This class will <b>not</b> work with PDF files.
<p>You can exercise some control over the result by using the
<code>setCharArray()</code> method and the
<code>setShortestToken()</code> method.
<ul>
<li><code>setCharArray()</code>: allows you to override the default
characters to keep.  All others are eliminated.  The default keepers
are all ASCII characters plus whitespace.  This means that if a
text file is the input, it will pass thru unchanged (except that
consecutive blanks are squeezed to a single blank).
<li><code>setShortestToken()</code>: allows you to keep only strings of
a minimum length.  By default the length is zero, meaning that all
tokens are passed.
</ul>
<p>Note lastly that this class is only designed to work with ASCII.
It may not be difficult to change it to support Unicode, but I do
not know how to do that.
*/

public class BinaryReader
  extends java.io.FilterReader
{
  // private vars
  // for debugging
  private int count=0;
  private int rawcnt=0;
  private int shortestToken = 0;
  // default char set to keep, blank out everything else
  private char[][] charArray = {
{'!', '~'},
{'\t', '\t'},
{'\r', '\r'},
{'\n', '\n'},
  };

  private String leftovers = "";

  private char charFilter(char c) {
    for (int i = 0; i < charArray.length; i++) {
      if ( c >= charArray[i][0] && c <= charArray[i][1] ) {
        return c;
      }
    }
    return ' ';
  }

  public BinaryReader(Reader in) {
super(in);
  }

/**
<p>This method may be used to override the ranges of characters
that are retained.  All others are eliminated.  The default is:
<code>
  private char[][] charArray = {
    {'!', '~'},
    {'\t', '\t'},
    {'\r', '\r'},
    {'\n', '\n'},
  };
</code>
<p>Note that the ranges are inclusive and that to pick out a
single character instead of a range, just make that character
both the min and max (as shown for the whitespace characters above).
@param keepers array of ranges to keep
*/
  public void setCharArray( char[][] keepers ) {
// in each row, column 1 is min and column 2 is max
// to pick out a single character instead of a range
// just make it both min and max.
charArray = keepers;
  }

/**
<p>This method may be used to eliminate short strings of text.
By default it accepts even single letters, since the value is
initialized to zero.  For example, if a
length of 3 is used, one- and two-letter strings will not
be returned.
<p><b>Warning: the test doesn't always work for strings that
begin a line of text (at least in DOS/Windows).</b>
@param len length of the shortest strings to pass
*/
  public void setShortestToken(int len) {
shortestToken = len;
  }

/**
<p>Reads a single character and runs it through the filter.  The
(int) character returned will either be -1 for end-of-file,
a blank (indicating it was filtered), or the character unchanged.
*/
  public int read() throws IOException
  {
    int c = in.read();
    if ( c == -1 ) return c;
    rawcnt++;
    count++;
    return charFilter((char)c);
  }
/**
<p>Reads from the stream and populates the supplied char array.
@param cbuf character buffer to fill
@return number of characters actually placed into the buffer
*/
  public int read(char[] cbuf) throws IOException
  {
return read(cbuf, 0, cbuf.length);
  }

/**
<p>Reads from the stream and populates the supplied char array
using the offset and length provided.
@param cbuf character buffer to fill
@param off offset to begin filling the array
@param len maximum number of characters to place into the array
@return number of characters actually placed into the buffer
*/

  public int read(char[] cbuf, int off, int len)
    throws IOException
  {
    char[] cb = new char[len];
    int cnt = in.read(cb);
    if ( cnt == -1 ) {
      //System.out.println("At end, rawcnt is " + rawcnt);
      return cnt; // done
    }
    int cnt2 = cnt;
    int loc = off;
    for ( int i = 0; i < cnt; i++ ) {
      cbuf[loc++] = charFilter(cb[i]);
    }

    char[] weeded = filter(new String(cbuf, off, cnt));
    if ( weeded.length > -1 ) {
      cnt2 = weeded.length;
      // redo buffer
      for (int i = 0; i < cnt2; i++) {
        cbuf[off+i] = weeded[i];
      }
    }

    rawcnt += cnt;
    count += cnt2;
    return cnt2;
  }

  private char[] filter(String instring)
  {
    // record the buffer size (ie, size of incoming string)
    int max = instring.length();
    // combine leftovers into the incoming string and reset leftovers
    String s = leftovers + instring;
    leftovers = "";

StringBuffer sb = new 

Re: Indexing other documents type than html and txt (XML)

2001-11-30 Thread carlson

I'll take on creating a Document repository.
I would like to get some ideas from people about what kinds of Documents
they are creating and what people want.

What's the next step Doug?

--Peter


On Friday, November 30, 2001, at 08:22 AM, Doug Cutting wrote:

 From: Erik Hatcher [mailto:[EMAIL PROTECTED]]

 How does lucene-dev feel about creating a 'contrib' area in
 CVS for these
 kinds of things that folks really need to make Lucene come to
 life for them,
 but are obviously not part of the main engine?

 I think this is a fine idea, but it needs to be managed.  We don't want an
 area where anyone can upload anything.  It could easily become filled with
 things that don't even compile, and would cause folks more headaches than it
 would relieve.

 So if someone would like to volunteer to administer this area, then I'm for
 it.  Administration would include some limited testing of each contributed
 module, ensuring that each has documentation, rejecting poorly written
 modules, writing a top-level document that describes all contributed
 modules, etc.  Anyone interested?

 Doug

 --
 To unsubscribe, e-mail:   mailto:lucene-user-
 [EMAIL PROTECTED]
 For additional commands, e-mail: mailto:lucene-user-
 [EMAIL PROTECTED]








Re: Indexing other documents type than html and txt

2001-11-29 Thread Otis Gospodnetic

You'd have to write parsers for each of those document types to convert
them to text and then index that.
Sure, you can feed it something like XML, but then you may want to consider
something like xmldb.org instead.
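In other words: one small extractor per document type, each producing plain text that Lucene can index. A rough sketch of the shape of that (the interface and class names are mine; real pdf/doc/xls support needs real parsing libraries behind the same interface):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public class TextExtractors {

    // One extractor per document type, all producing plain text.
    public interface Extractor {
        String extract(InputStream in) throws IOException;
    }

    // Trivial extractor for plain text; a PDF or Word extractor would
    // plug in here with the same signature.
    public static class PlainText implements Extractor {
        public String extract(InputStream in) throws IOException {
            BufferedReader r = new BufferedReader(
                    new InputStreamReader(in, StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder();
            String line;
            while ((line = r.readLine()) != null) sb.append(line).append('\n');
            return sb.toString().trim();
        }
    }

    private static final Map<String, Extractor> BY_EXT = new HashMap<String, Extractor>();
    static { BY_EXT.put("txt", new PlainText()); }

    // Dispatch on file extension and return the extracted text.
    public static String extract(String ext, InputStream in) {
        Extractor e = BY_EXT.get(ext);
        if (e == null) throw new IllegalArgumentException("no extractor for ." + ext);
        try {
            // The returned string would then go into a Lucene field,
            // e.g. doc.add(Field.Text("contents", text));
            return e.extract(in);
        } catch (IOException ex) {
            throw new RuntimeException(ex);
        }
    }
}
```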

Otis

--- Antonio Vazquez [EMAIL PROTECTED] wrote:
 
 Hi all,
 I have a question. I know that lucene can index html and text documents,
 but can it index other types of documents, like pdf, doc, and xls? If it
 can, how can I implement it? Perhaps it can be implemented like the html
 and txt indexing?
 
 regards
 
 Antonio
 
 
 _
 Do You Yahoo!?
 Get your free @yahoo.com address at http://mail.yahoo.com
 
 
 


__
Do You Yahoo!?
Yahoo! GeoCities - quick and easy web site hosting, just $8.95/month.
http://geocities.yahoo.com/ps/info1





Re: Indexing other documents type than html and txt (XML)

2001-11-29 Thread carlson

I have started to create a set of generic lucene document types that can
be easily manipulated depending on the fields.
I know others have generated Documents out of PDF.
Is there some place we can add contributed classes to the lucene web
page?

Here is my current version of the XMLDocument based on . It's a bit slow.
It uses a path (taken from the Document example) and, based on field name /
xpath pairs (key / value) from either an array or a property file, generates
an appropriate lucene document with the specified fields.

I have not tested all permutations of Document (I have used the File and
Properties variants) and it works.

Note:
It uses the xalan example ApplyXpath class to evaluate the xpaths against the XML.

I hope this helps.
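As an aside, the same field-name / xpath mapping can be sketched with the JDK's built-in javax.xml.xpath support (Java 5+) in place of the Xalan ApplyXpath sample; the XPathFields class here is only an illustration of the mapping loop, not the posted code:

```java
import java.io.ByteArrayInputStream;
import java.util.Enumeration;
import java.util.Properties;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

public class XPathFields {

    // Evaluate each field-name -> xpath pair against the XML and collect
    // the resulting field-name -> value pairs.
    public static Properties fields(String xml, Properties nameToXpath) {
        try {
            Document dom = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")));
            Properties out = new Properties();
            Enumeration e = nameToXpath.propertyNames();
            while (e.hasMoreElements()) {
                String field = (String) e.nextElement();
                String value = XPathFactory.newInstance().newXPath()
                        .evaluate(nameToXpath.getProperty(field), dom);
                // with Lucene this becomes: doc.add(Field.Text(field, value));
                out.setProperty(field, value);
            }
            return out;
        } catch (Exception ex) {
            throw new RuntimeException(ex);
        }
    }

    public static void main(String[] args) {
        Properties map = new Properties();
        map.setProperty("title", "/book/title");
        map.setProperty("author", "/book/author");
        Properties f = fields(
            "<book><title>Lucene</title><author>Doug</author></book>", map);
        System.out.println(f.getProperty("title"));  // Lucene
        System.out.println(f.getProperty("author")); // Doug
    }
}
```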

--Peter

--

package xxx.lucene.xml;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.DateField;

import org.apache.xalan.ApplyXpath; // the Xalan example class; adjust to the package you placed it in
import java.util.Properties;
import java.io.File;
import java.util.Enumeration;
import java.io.FileInputStream;

/**
* A utility for making a lucene document from an XML source and a set of
* xpaths, based on the Document example from Lucene.
*/
public class XMLDocument
{
private XMLDocument() { }

 /**
  * @param file file to be converted to a lucene document
  * @param propertyList properties where the key is the field
  * name and the value is the XML xpath
  * @throws FileNotFoundException
  * @throws Exception
  * @return lucene document
  */
public static Document Document (File file, Properties propertyList)
throws java.io.FileNotFoundException , Exception
{
    Document doc = new Document();

    // add path
    doc.add(Field.Text("path", file.getPath()));

    // add date modified
    doc.add(Field.Keyword("modified",
        DateField.timeToString(file.lastModified())));

    // add each field in the property list
    Enumeration e = propertyList.propertyNames();
    while (e.hasMoreElements())
    {
        String key = (String) e.nextElement();
        String xpath = propertyList.getProperty(key);
        String[] valueArray = ApplyXpath(file.getPath(), xpath);
        StringBuffer value = new StringBuffer();
        for (int i = 0; i < valueArray.length; i++)
        {
            value.append(valueArray[i]);
        }
        //System.out.println("add key " + key + " with value = " + value);
        filter(key, value);
        doc.add(Field.Text(key, value.toString()));
    }

    return doc;
}

 /**
  * @return lucene document
  * @param file file to be converted to a lucene document
  * @param fieldNames field names for the lucene document
  * @param xpaths XML xpaths for the information you want to get
  * @throws Exception
  */
 public static Document Document(File file, String[] fieldNames,
 String[] xpaths) throws Exception
 {
     if (fieldNames.length != xpaths.length)
     {
         throw new IllegalArgumentException("String arrays are not of equal size");
     }

     Properties propertyList = new Properties();

     // generate properties from the arrays
     for (int i = 0; i < fieldNames.length; i++) {
         propertyList.setProperty(fieldNames[i], xpaths[i]);
     }

     Document doc = Document (file, propertyList);
     return doc;
 }

 /**
  * @param path path of the file to be converted to a lucene document
  * @param fieldNames field names for the lucene document
  * @param xpaths XML xpaths for the information you want to get
  * @throws Exception
  * @return lucene document
  */
 public static Document Document(String path, String[] 
fieldNames, String[] xpaths)
 throws Exception
 {
 File file = new File(path);
 Document doc = Document (file, fieldNames, xpaths);
 return doc;
 }

 /**
  * @param path path of document you want to convert to a lucene 
document
  * @param propertyList properties where the key is the field 
name and the value is the
  * XML xpath.
  * @throws Exception
  * @return lucene document
  */
 public static Document Document(String path, Properties 
propertyList)
 throws Exception
 {
 File file = new File(path);
 Document doc = Document (file, propertyList);
 return doc;
 }

 /**
  * @param documentPath path of the