Re: indexing incrementally concurrently

2004-07-05 Thread Michael Wechner
Erik Hatcher wrote:
On Jul 5, 2004, at 9:00 AM, Michael Wechner wrote:
If several users are saving documents on the server concurrently
and the index is to be updated incrementally during saving ... do
I have to make sure that this is thread-safe, or does Lucene
take care of this?

Only a single IndexWriter instance at a time can be used - so you will 
need to coordinate things.  Multiple threads can share a single 
IndexWriter though, so no worries there.
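
One way to do the coordination described above is to funnel every update through a single worker thread. The sketch below uses only the JDK; IndexUpdater is a hypothetical stand-in for whatever code owns the one IndexWriter, and the class and method names are assumptions made for illustration, not part of Lucene.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-in for the code that wraps the single IndexWriter. */
interface IndexUpdater {
    void update(String documentId);
}

/** Funnels updates from many saving threads through one worker thread. */
class SerializedIndexUpdates {
    private final ExecutorService worker = Executors.newSingleThreadExecutor();
    private final IndexUpdater updater;

    SerializedIndexUpdates(IndexUpdater updater) {
        this.updater = updater;
    }

    /** Called by any thread that has just saved a document. */
    Future<?> documentSaved(String documentId) {
        return worker.submit(() -> updater.update(documentId));
    }

    /** Stops accepting updates and waits for the queued ones to finish. */
    void shutdown() throws InterruptedException {
        worker.shutdown();
        worker.awaitTermination(30, TimeUnit.SECONDS);
    }

    public static void main(String[] args) throws Exception {
        List<String> indexed = new ArrayList<>(); // written only by the worker thread
        SerializedIndexUpdates updates = new SerializedIndexUpdates(indexed::add);
        Thread[] savers = new Thread[4];
        for (int i = 0; i < savers.length; i++) {
            String id = "doc-" + i;
            savers[i] = new Thread(() -> updates.documentSaved(id));
            savers[i].start();
        }
        for (Thread t : savers) {
            t.join();
        }
        updates.shutdown();
        System.out.println(indexed.size() + " updates applied");
    }
}
```

Because only the worker thread ever calls the updater, the saving threads never need to know how the writer is opened or closed.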

ok. Thanks very much for the info
Michi
Erik
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


--
Michael Wechner
Wyona Inc.  -   Open Source Content Management   -   Apache Lenya
http://www.wyona.com  http://cocoon.apache.org/lenya/
[EMAIL PROTECTED][EMAIL PROTECTED]


incrementally indexing a million documents

2004-06-14 Thread Michael Wechner
I am trying to index around a million documents. The problem is
that I run out of memory while sorting by uid when I go through
the directory recursively.
Well, I could add more memory, but this wouldn't really solve my problem,
because at some point I will always run out of memory (e.g. 10 million 
documents).

Is there another approach than sorting by uid?
Thanks
Michi


Re: what web crawler work best with Lucene?

2004-04-27 Thread Michael Wechner
Tuan Jean Tee wrote:

Has anyone implemented any open source web crawler with Lucene? I have
a dynamic website and am looking at putting in a search tool. Your
advice is very much appreciated.
 

there is a crawler included within Apache Lenya 
http://cocoon.apache.org/lenya/

src/java/org/apache/lenya/search/crawler/*

or you might try LARM

http://jakarta.apache.org/lucene/docs/lucene-sandbox/larm/overview.html

HTH

Michi


Thank you.

 





sorting by date (XML)

2004-04-27 Thread Michael Wechner
my XML files contain something like

<date>
 <year>2004</year><month>04</month><day>27</day>...
</date>
and I would like to sort by this date.

So I guess I need to modify the Documentparser and generate something like
a millisecond field and then sort by this, correct?
Has anyone done something like this yet?

Thanks

Michi



Re: sorting by date (XML)

2004-04-27 Thread Michael Wechner
Nader S. Henein wrote:

Here's my two cents on this:
Either way you will need to combine the date into one field, but if you use a
millisecond representation you will not be able to use the FLOAT sort type
and you'll have to use STRING sort (slower), because the millisecond
representation is longer than FLOAT allows. So you have three options:
1) Use MMDD and sort by FLOAT type
 

OK, I guess I will take the FLOAT type then

2) Use the millisecond representation and sort by STRING type
3) If the date you're entering here is the date of indexing then you can
just sort by DOC type (which is the DOC ID) and save yourself the pain
 

unfortunately this isn't possible.
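
Combining the separate year, month and day values into one field, as discussed in this thread, might look like the sketch below. The class and method names are made up for illustration. It produces a zero-padded string key rather than a number: if the year is included, an eight-digit value such as 20040427 is larger than the range in which a 32-bit float can represent every integer exactly, so neighbouring days could end up with the same float value.

```java
import java.util.Locale;

/** Builds one sortable key from separate year, month and day values. */
class DateSortKey {
    /** Returns a zero-padded yyyyMMdd string suitable for a single field. */
    static String toSortKey(int year, int month, int day) {
        if (month < 1 || month > 12 || day < 1 || day > 31) {
            throw new IllegalArgumentException(
                    "invalid date: " + year + "-" + month + "-" + day);
        }
        // Locale.ROOT keeps the digits ASCII regardless of the default locale.
        return String.format(Locale.ROOT, "%04d%02d%02d", year, month, day);
    }

    public static void main(String[] args) {
        // Values as they would be read from the year, month and day elements.
        System.out.println(toSortKey(2004, 4, 27)); // prints 20040427
    }
}
```

Keys built this way sort chronologically when compared as plain strings, because every key has the same length.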

Thanks a lot for your help

Michi

Hope this helps.

Nader Henein


 





Re: sorting by date (XML)

2004-04-27 Thread Michael Wechner
Robert Koberg wrote:

Ah. Great - thanks! I see you added it to the wiki. Thanks again :)


I guess you mean

http://wiki.apache.org/jakarta-lucene/IndexingDateFields

Thanks as well

Michi


This is perfect in my case since iso8601 is in the format:

2004-04-27T01:23:33

Luckily so far, from my logs, hardly anyone uses the date search. I 
guess I should have been doing this from the beginning, don't know why 
I didn't...

best,
-Rob

Erik



Re: org.apache.lucene.demo.IndexHTML - parse JSP files?

2003-03-24 Thread Michael Wechner
John Bresnik wrote:

Does anyone know of a quick and easy way to get this demo
[org.apache.lucene.demo.IndexHTML] to parse JSP files as well? I used a
crawler to create a local [static] version of the site, i.e. the files are no
longer JSP files, just the HTML output from the original JSP files. But in
the interest of keeping the URLs intact, I need to parse the .jsp extensions.
The short question is: does anyone know of a way to *not* ignore the *.jsp
files?
just modify IndexHTML: there is one line in there which decides what 
extension it will index.
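
The actual line in IndexHTML is not quoted in this thread, so the following is only a sketch of what such an extension check could look like once .jsp is added. The class name, the method name and the list of extensions are assumptions.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Decides by file extension whether a file is handed to the indexer. */
class IndexableExtensions {
    // Assumed list; ".jsp" is the addition asked about in this thread.
    private static final Set<String> EXTENSIONS =
            new HashSet<>(Arrays.asList(".html", ".htm", ".txt", ".jsp"));

    static boolean shouldIndex(String path) {
        int dot = path.lastIndexOf('.');
        if (dot < 0) {
            return false; // no extension at all
        }
        String extension = path.substring(dot).toLowerCase(Locale.ROOT);
        return EXTENSIONS.contains(extension);
    }

    public static void main(String[] args) {
        System.out.println(shouldIndex("site/products/list.jsp"));
        System.out.println(shouldIndex("site/images/logo.gif"));
    }
}
```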

HTH

Michael

thanks.





Re: xpdf parser usage for lucene

2003-02-25 Thread Michael Wechner
Pinky Iyer wrote:

Hi !
  I am trying to use xpdf as a PDF parser. The problem is that when I encounter
a file with a .pdf extension, I call the pdftotext script to convert it to text, which in
turn uses the file system and leaves the same file with a .txt extension in the same dir.
How can I get this as a stream and not use the file system at all? Also, how do I
access the summary and title info?
xpdf has an option to turn the PDF into an HTML instead of txt, which 
allows you to use an HTMLParser
for populating the fields.

Concerning the extension: when you create your Lucene document, you 
could replace the txt extension
by the pdf extension in the case of the uri field.
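
Swapping the extension back for the uri field could be as small as the helper below. This is a sketch only; the class and method names are invented for illustration, and it assumes the text file sits next to the PDF under the same base name, as described in the question.

```java
/** Maps an extracted text file back to the PDF it was generated from. */
class PdfUri {
    /** Replaces a trailing ".txt" with ".pdf"; rejects any other path. */
    static String originalPdfUri(String textPath) {
        String suffix = ".txt";
        if (!textPath.endsWith(suffix)) {
            throw new IllegalArgumentException("not a .txt path: " + textPath);
        }
        return textPath.substring(0, textPath.length() - suffix.length()) + ".pdf";
    }

    public static void main(String[] args) {
        System.out.println(originalPdfUri("docs/report.txt")); // prints docs/report.pdf
    }
}
```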

HTH

Michael

Anybody who has done this before, please help!
Thanks!
Pinky Iyer
 





Re: PLAN: WebLucene -- Lucene Web interface, use XML as a lightweightprotocol.

2003-02-20 Thread Michael Wechner
That's very interesting.

I have tried something similar by integrating
Lucene into Wyona, which is a CMS based on Cocoon,
and I also separated Structure from Layout. You can try it out at

HTML:

http://195.226.6.70:8080/wyona-cms/oscom/search-oscom/lucene?publication-id=all&queryString=Cocoon+Wyona&fields=all&find=Search

XML:

http://195.226.6.70:8080/wyona-cms/oscom/search-oscom/lucene.xml?publication-id=all&queryString=Cocoon+Wyona&fields=all&find=Search

I think XooMLe also did a pretty good job:

http://www.dentedreality.com.au/xoomle/search/

Maybe we can find a way to join efforts

Thanks

Michael


Che Dong wrote:

http://sourceforge.net/projects/weblucene/

WebLucene: Lucene Web interface, use XML as a lightweight protocol. 

Developers convert data sources (text, DB, MS Word, PDF, etc.) into a standard XML format, index it with the Lucene engine, and get full-text search results via HTTP as XML output. Users can easily integrate this with a JSP, ASP or PHP front end, or use XSLT on the server side to transform the output.

Developers can integrate the Lucene full-text search engine with old MSSQL + ASP, MySQL + PHP and Oracle + JSP based web applications.

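[Diagram, flattened in the archive: MySQL, Oracle and MSSQL (DB), MS Word and PDF
 are converted to XML and fed into the Lucene index; results come back as XML for
 JSP, ASP and PHP front ends, or pass through XSLT to XHTML, text or XML.
 The middle of the diagram is labelled "Web Lucene".]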
   
i18n issue: since Java is Unicode based, users can index data sources (XML) in different charsets into one Lucene index (in Unicode) and output results according to the languages the client browser supports.
[Diagram, flattened in the archive: GBK, BIG5, SJIS and ISO-8859-1 input is converted to Unicode (XML); Unicode (XML) output is converted to BIG5, GB2312, SJIS or ISO-8859-1.]
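
The charset handling described above comes down to decoding on the way in and encoding on the way out. A minimal sketch with the JDK only follows; the class and method names are assumptions and not WebLucene code. Charsets such as GBK, BIG5 or SJIS are passed by name in the same way, provided the Java runtime ships them.

```java
import java.nio.charset.Charset;

/** Converts between legacy charsets and Java's Unicode strings. */
class CharsetBridge {
    /** Decodes source bytes in their declared charset into a String. */
    static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    /** Encodes a result for a client that asked for a particular charset. */
    static byte[] encode(String text, String charsetName) {
        return text.getBytes(Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        byte[] latin1 = {(byte) 0xE9};               // e with acute accent in ISO-8859-1
        String text = decode(latin1, "ISO-8859-1");  // now a Unicode string
        byte[] utf8 = encode(text, "UTF-8");         // re-encoded for the client
        System.out.println(utf8.length + " bytes in UTF-8");
    }
}
```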


Che, Dong
http://www.chedong.com/tech/









no-index or index

2003-01-30 Thread Michael Wechner
Hi

I am looking for an HTMLParser which skips text tagged by
<no-index> or something similar. This way I could exclude for
instance a global navigation section within the HTML

<no-index>
International<br>
Business<br>
Science<br>
...
</no-index>

It seems that the current demo/HTMLParser 
(http://lucene.sourceforge.net/cgi-bin/faq/faqmanager.cgi?file=chapter.indexing&toc=faq#q11)
is not capable of doing something like that.

Any pointers are very welcome.

Thanks a lot

Michael





Re: no-index or index

2003-01-30 Thread Michael Wechner
Erik Hatcher wrote:


If you look at the contributions/ant area of the Lucene sandbox in
CVS you'll see my HtmlDocument class which uses JTidy.

Rather than making up some invalid HTML tag, I'd recommend you
separate your navigation section with a <div> or <span> with a
special class="navigation" or something like that.  Then use JTidy to
ignore such tags that have that class.  Then you get valid, clean
HTML and the ability to filter it for indexing.


Well, I haven't  found out how to use JTidy to ignore such tags that 
have such a class. So I just
added some code to your class HtmlDocument within the getBodyText method:

 if (child.getNodeName().equals("span")) {
     org.w3c.dom.Attr attribute = ((Element) child).getAttributeNode("class");
     if (attribute != null) {
         if (attribute.getValue().equals("lucene-no-index")) {
             System.out.println("HtmlDocument.getBodyText(): ignore span!");
             break;
         }
     }
     System.out.println("HtmlDocument.getBodyText(): accept span!");
 }

This way text will be ignored within
<span class="lucene-no-index">...</span>
It's not perfect, but it's working very well for the moment.
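
For anyone who wants to try the idea without the sandbox code, here is a self-contained sketch of the same filtering. It is not Erik's HtmlDocument class: it swaps JTidy for the XML parser that ships with the JDK, so it only accepts well-formed XHTML, and the class and method names are invented for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

/** Extracts text from well-formed XHTML, skipping marked span elements. */
class NoIndexText {
    static final String SKIP_CLASS = "lucene-no-index";

    /** Appends the text below node, leaving out spans marked with SKIP_CLASS. */
    static void collect(Node node, StringBuilder out) {
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.TEXT_NODE) {
                out.append(child.getNodeValue());
            } else if (child.getNodeType() == Node.ELEMENT_NODE) {
                Element element = (Element) child;
                boolean skip = element.getNodeName().equals("span")
                        && SKIP_CLASS.equals(element.getAttribute("class"));
                if (!skip) {
                    collect(element, out); // recurse into ordinary elements
                }
            }
        }
    }

    /** Parses the markup and returns its text with whitespace collapsed. */
    static String extract(String xhtml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
        StringBuilder out = new StringBuilder();
        collect(doc.getDocumentElement(), out);
        return out.toString().trim().replaceAll("\\s+", " ");
    }

    public static void main(String[] args) throws Exception {
        String page = "<body><span class=\"lucene-no-index\">International Business</span>"
                + "<p>Real content</p></body>";
        System.out.println(extract(page)); // prints Real content
    }
}
```

Like the snippet above, it compares the whole class attribute, so a span carrying several class names would not be skipped.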

Two remarks:

1) I noticed that demo/HTMLDocument (resp. demo/html/HTMLParser) sets:

 contents= title + body

 and your class HtmlDocument

contents=body


2) I got two Javadoc warnings, because @return was empty within 
HtmlDocument (getDocument() and Document())


Thanks very much for your help

Michael







Erik







Re: no-index or index

2003-01-30 Thread Michael Wechner
Ronnie Kolehmainen wrote:


Michael,

the HtmlDocument class supports ignoring tags, ie all text inside specified
tag names is ignored. Look at the setIgnoreTags(String [] ignoredtags)
method. Remember to also include script and style in this array along
with your custom tag names.



I am not able to find the method setIgnoreTags() (I have updated my
jakarta-lucene and jakarta-lucene-sandbox). Or would that have been within
the attachment? I guess the attachments are skipped by the mailing list server.

I am now using Erik's code from sandbox.

Anyway, thanks a lot for your help

Michael


Hope this is any help for you.

See below for the message from an old thread.

/Ronnie


 


   


Message sent on dec 9 2002:


HI,

these are the classes i use. I only use them to extract the text stuff, so
they don't have methods for getting document title and such. However text
extraction has worked fine for me.

The HtmlParser main method takes a file path as argument and outputs the
contents to a file named html.txt - useful when testing.

/Ronnie


 

-Original Message-
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]]
Sent: 7 December 2002 17:12
To: Lucene Users List
Subject: Re: SV: Indexing HTML


I have had good experiences with nekoHTML parser.

Otis

--- Leo Galambos [EMAIL PROTECTED] wrote:

I'm not sure this is a solution to your problem. However, it seems that the
HTMLParser used by the IndexHTML class has problems parsing the document
(there is a test class included in the jar):

java -cp C:\projects\lucene\jakarta-lucene\bin\lucene-demos.jar org.apache.lucene.demo.html.Test f01529.txt
Title: Webcz.cz - Power of search
Parse Aborted: Encountered \' at line 106, column 27.
Was expecting one of:
   ArgName ...
   TagEnd ...
/Ronnie

Hi Ronnie!

I know about it and the exception is handled well (see log file below). I
have found a better example than 1529, try this:
http://com-os2.ms.mff.cuni.cz/bugs/f00034.txt This file cannot go through
the Lucene HTML parser (I have tried 1.2 and IBM JDK 1.3.1r3). The file is
specific, i.e. it has two titles, two base tags etc.

I have no debugger here, so I cannot find the line where the bug is. If
you try your magic, please let me know about the patch. :) THX

-g-



adding save/d00320/f01516.html
Parse Aborted: Lexical error at line 68, column 11.  Encountered:
\u0178
(376), after : 
:
adding save/d00320/f01527.html
Parse Aborted: Encountered = at line 83, column 48.
Was expecting one of:
   ArgName ...
   TagEnd ...

adding save/d00320/f01528.html





   


 







Re: no-index or index

2003-01-30 Thread Michael Wechner
Kelvin Tan wrote:


My suggestion would be to modify HTMLParser to do the job. Don't think it's 
very difficult. I'm unaware of any existing HTML Parsers which support that 
functionality...


Maybe Erik wants to include an improved version of my code snippet 
into CVS.

I guess I am not the only one wanting to exclude certain parts from an 
HTML page ;-)

All the best

Michael



Regards,
Kelvin


The book giving manifesto - http://how.to/sharethisbook






Re: no-index or index

2003-01-30 Thread Michael Wechner
Erik Hatcher wrote:


On Thursday, January 30, 2003, at 06:59  PM, Michael Wechner wrote:


snip/




2) I got two Javadoc warnings, because @return was empty within 
HtmlDocument (getDocument() and Document())


picky picky!  :)  But thanks - I'll correct those too. 


sorry for that, but ant resp. javadoc was picky :-)




I'm not ready to commit my changes - I'll do so in a few weeks when I 
get some refactoring done on IndexTask. 


No problem

Thanks

Michael




Erik






Re: no-index or index

2003-01-30 Thread Michael Wechner
Erik Hatcher wrote:


On Thursday, January 30, 2003, at 07:07  PM, Michael Wechner wrote:


Maybe Erik wants to include an improved version of my code snippet 
into CVS.


Only if it can be made generic somehow - but that might be a bit 
tricky to implement depending on how crazy we wanted to get with it.  
The HtmlDocument class is really meant to be just an example of how to 
use the Ant index task I wrote along with the 
FileExtensionDocumentHandler that uses it.  So its original purpose 
was not to be a robust HTML document indexer, but an example piece of 
a larger puzzle. 


sure, no problem. Actually I think it's good to have small demo code and 
larger industrial strength code.




I guess I am not the only one wanting to exclude certain parts from 
an HTML page ;-)


I've seen this request come up in the recent past, in fact.  And it's a
perfectly reasonable one, especially if you are in charge of the HTML.


yeah, I am not sure if there is a standard way to do this. I just know
from an Atomz demo that they are using something like this.
It would be nice if there were a standard tag for this, or at
least if the Open Source search engine projects could agree on one.
Having it configurable would also be nice, of course, but I think it
wouldn't be necessary to begin with.

Thanks

Michael



Erik






Re: Indexing other documents (.pdf et .doc)

2002-12-19 Thread Michael Wechner
Friaa Nafaa wrote:

 Hello, I use Lucene with Tomcat and I can now index and search all HTML documents. But I would like to index other documents such as PDF or Word (.doc). I hope that someone can help me!



Concerning PDF:

Before indexing you should extract the text from the PDF and save it
as .txt (Then you can index the .txt, but reference the PDF uri). To do 
this have a look at


http://www.foolabs.com/xpdf/download.html

or

http://www.pdfbox.org/

These links are listed at

http://jakarta.apache.org/lucene/docs/contributions.html

Also take a look at the FAQ

HTH

Michael





Score: Lucene 1.2 versus 1.3-dev1

2002-12-16 Thread Michael Wechner
Hi

I started to deploy Lucene 1.3-dev1 from CVS very recently and
noticed that the score is kind of different.

In the case of Lucene1.2 I received scores such as for instance

 3.45345234 * 10e-1

In the case of Lucene1.3-dev1 I am receiving scores such as for instance

 3.23232131 *10e-8

Is this correct or do I have to change something within my Lucene
implementation?

Thanks a lot in advance

Michael





Re: Score: Lucene 1.2 versus 1.3-dev1

2002-12-16 Thread Michael Wechner
Eric Isakson wrote:

Did you rebuild your index?


No, of course not ;-)

Thanks a lot for the pointer

Michael




from CHANGES.TXT:
 12. Added support for boosting the score of documents and fields via
 the new methods Document.setBoost(float) and Field.setBoost(float).

 Note: This changes the encoding of an indexed value.  Indexes
 should be re-created from scratch in order for search scores to
 be correct.  With the new code and an old index, searches will
 yield very large scores for shorter fields, and very small scores
 for longer fields.  Once the index is re-created, scores will be
 as before. (cutting)
