Re: Lucene crawler plan

2003-06-30 Thread Clemens Marschner
There's an experimental webcrawler in the lucene-sandbox area called larm-webcrawler (see http://jakarta.apache.org/lucene/docs/lucene-sandbox/larm/overview.html), and a project on Sourceforge (http://larm.sf.net) that tries to leverage this on a higher level. I want to encourage you to go on that

Re: Fw: LARM

2003-06-07 Thread Clemens Marschner
After some sabbatical time, I created a project on Sourceforge to restart the LARM development. LARM will be a full-featured search engine based on Lucene. The scope is corporate intranets or portions of the web, databases, and file systems, for people that want a Java open source solution. The U

Re: User documentation for scoring

2003-02-24 Thread Clemens Marschner
That seems to be empty --Clemens - Original Message - From: "Ype Kingma" <[EMAIL PROTECTED]> To: <[EMAIL PROTECTED]> Sent: Sunday, February 23, 2003 7:24 PM Subject: User documentation for scoring > Dear developers, > > Attached is a first attempt for some user documentation for > scori

Lucene at Jahia

2003-02-19 Thread Clemens Marschner
I noticed the Jahia CMS System (www.jahia.org) uses the Lucene search engine by default (and includes the Lucene logo) in the default installation. Clemens - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-

Re: [LARM] using SEDA, pros and cons

2003-02-18 Thread Clemens Marschner
Couldn't look at your code yet, but: AFAIK SEDA was developed for asynchronous I/O, which would mean a redesign of the central FetcherTask class. If every thread downloads 50 files at once, you only need a couple of them in parallel to saturate the network interfaces. Clemens - Original Mes

[VOTE] anybody? Re: [LARM] Merlin and other additions

2003-02-16 Thread Clemens Marschner
> very cool, thanks, David. > > +1 that we put this into lucene-sandbox/projects/larm and use it as a basis > for our developments. > > +1 that David gets committer access to lucene-sandbox > > Clemens

Re: [LARM] Merlin and other additions

2003-02-13 Thread Clemens Marschner
very cool, thanks, David. +1 that we put this into lucene-sandbox/projects/larm and use it as a basis for our developments. +1 that David gets committer access to lucene-sandbox Clemens - Original Message - From: "David Worms" <[EMAIL PROTECTED]> To: <[EMAIL PROTECTED]> Sent: Tuesday,

[LARM] next steps

2003-01-31 Thread Clemens Marschner
Great, so how should we go on? I suggest we wait for you, David, so that you can make the code a little more stable and change the things you mentioned. You said something about two weeks (?). I would say we should then be at a point where we could get rid of de.lanlab.* packages and move the rest

Avalonized LARM

2003-01-31 Thread Clemens Marschner
One more Q: 7. As far as I can see, each MessageProcessor (State/MessageListener in your terms) adds _itself_ to a message handler that it has to know about (as defined in DefaultMessageListenerSelector.xinfo). Doesn't this violate the IoC pattern? Shouldn't an external component initialize the me

Fw: LARM / Re: Avalonized WebCrawler

2003-01-31 Thread Clemens Marschner
While sipping a grande latte, I came up with the following questions (more to come): 1. I wonder how ...crawl.fetcher is working, since there seem to be some typos: - DefaultFetcherTaskFacotry.xinfo (o<->t) contains a reference to com.celavi.crawl.fetcher.FetcherTaskFacotry which

Re: Avalonized WebCrawler

2003-01-28 Thread Clemens Marschner
David, just one thing, while I'm reading the code: Have you had a look at our thoughts here: http://cvs.apache.org/viewcvs.cgi/jakarta-lucene-sandbox/projects/larm/docs/ ? > Future: > I am committed to pursue the development of the crawler. I hope many > current and future developers will follow

Re: Avalonized WebCrawler

2003-01-28 Thread Clemens Marschner
Great news, this will push us forward! Will have a look at it immediately (after breakfast, of course! :-) Clemens - Original Message - From: "Otis Gospodnetic" <[EMAIL PROTECTED]> To: "Lucene Developers List" <[EMAIL PROTECTED]>; "Avalon framework users" <[EMAIL PROTECTED]> Sent: Tuesd

The LARM project

2002-12-01 Thread Clemens Marschner
eir mailing list asking for support, and received an enormous response. We would like to get in contact with them - perhaps a viable collaboration can emerge. The rest of the documents can be downloaded from CVS from jakarta-lucene-sandbox/projects/larm/docs Regards, Clemens Marschner (with Oti

Re: collections classes

2002-11-18 Thread Clemens Marschner
It doesn't use a lot of collection classes, but where it does, it uses ones that are JDK 1.1.8 compatible: Vector instead of ArrayList, Enumeration instead of Iterator. There was a discussion about this a while ago, but the result was that until there is a compelling reason, this behavior will not be changed.
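
To make the Vector/Enumeration versus ArrayList/Iterator distinction concrete, here is a minimal sketch of the same traversal in both styles. The class and method names are hypothetical and are not taken from Lucene; generics and StringBuilder are used only for brevity, and code that really targets JDK 1.1.8 would use raw types, casts, and StringBuffer instead.

```java
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.Iterator;
import java.util.List;
import java.util.Vector;

// Hypothetical demo class, not part of Lucene.
class LegacyCollectionsDemo {

    // JDK 1.1 style: Vector traversed with an Enumeration.
    static String joinLegacy(Vector<String> items) {
        StringBuilder sb = new StringBuilder();
        for (Enumeration<String> e = items.elements(); e.hasMoreElements();) {
            sb.append(e.nextElement());
            if (e.hasMoreElements()) {
                sb.append(',');
            }
        }
        return sb.toString();
    }

    // JDK 1.2+ style: List traversed with an Iterator.
    static String joinModern(List<String> items) {
        StringBuilder sb = new StringBuilder();
        for (Iterator<String> it = items.iterator(); it.hasNext();) {
            sb.append(it.next());
            if (it.hasNext()) {
                sb.append(',');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Vector<String> legacy = new Vector<String>();
        legacy.addElement("title");
        legacy.addElement("body");

        List<String> modern = new ArrayList<String>();
        modern.add("title");
        modern.add("body");

        System.out.println(joinLegacy(legacy));
        System.out.println(joinModern(modern));
    }
}
```

Both methods produce the same result; the only difference is which collection API the caller has to have available.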

Re: getAllFieldNames diffs

2002-11-12 Thread Clemens Marschner
Instead of returning Object[] or Collection I would consider returning an iterator. Iterators can be designed to be data-driven; that is, temporary objects are created only when next() is called, not when the method that returns the iterator is called. There are powerful frameworks like the XXL library that extensively
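
A minimal sketch of such a data-driven iterator might look as follows. The class name LazyFieldNames, the raw name array, and the creation counter are all assumptions made for illustration; this is not the actual getAllFieldNames code or the XXL library API.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical lazy iterator: each result object is built only when next() is called.
class LazyFieldNames implements Iterator<String> {
    private final String[] rawNames;
    private int position = 0;
    private int created = 0; // number of result objects built so far (for demonstration)

    LazyFieldNames(String[] rawNames) {
        // Constructing the iterator does no per-element work.
        this.rawNames = rawNames;
    }

    public boolean hasNext() {
        return position < rawNames.length;
    }

    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        created++;
        // The temporary object is created here, on demand.
        return rawNames[position++].trim();
    }

    public void remove() {
        throw new UnsupportedOperationException();
    }

    int createdCount() {
        return created;
    }

    public static void main(String[] args) {
        LazyFieldNames names = new LazyFieldNames(new String[] {" title ", " body "});
        System.out.println(names.createdCount()); // nothing built yet
        System.out.println(names.next());
        System.out.println(names.createdCount());
    }
}
```

A caller that stops after the first few elements never pays for the rest, which is the advantage over returning a fully populated Object[] or Collection.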

Re: Diffs for enabling query rewriting

2002-11-10 Thread Clemens Marschner
for enabling query rewriting > Hm, developers are not responding to this 3 week old email. :( > Clemens, could you also provide some unit tests with this? > > Thanks, > Otis > > > --- Clemens Marschner <[EMAIL PROTECTED]> wrote: > > Enclosed you find the diffs

Forrestize Lucene

2002-11-06 Thread Clemens Marschner
Hi, there's a project called Apache Forrest that seeks to replace jakarta-site2 with a Cocoon-driven process that creates better navigation and printable document versions (PDF) on the fly. I think Lucene should adopt that asap as well. I'm always confused when I use Jakarta, because navigation ten

Re: LARM web crawler: use lucene itself for visited URLs

2002-10-31 Thread Clemens Marschner
"Lucene Developers List" <[EMAIL PROTECTED]>; "Clemens Marschner" <[EMAIL PROTECTED]> Sent: Thursday, October 31, 2002 9:10 AM Subject: Re: LARM web crawler: use lucene itself for visited URLs > On Wednesday 30 October 2002 23:30, Clemens Marschner wrote: >

Re: LARM web crawler: use lucene itself for visited URLs

2002-10-30 Thread Clemens Marschner
There's a good paper on compressing URLs at http://citeseer.nj.nec.com/suel01compressing.html. It takes advantage of the regular structure of the sorted list of URLs and compresses the resulting structure with some Huffman encoding. I have already implemented a somewhat simpler algorithm that can co
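
To illustrate how a sorted URL list can be exploited, here is a minimal sketch of front coding: each URL is stored as the length of the prefix it shares with its predecessor plus the remaining suffix. This is a generic technique chosen for illustration; it is neither the algorithm from the cited paper (which adds Huffman encoding) nor the simpler algorithm mentioned in the message, whose details are not given here. The class name and the `|` separator are assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical front-coding sketch for a sorted list of URLs.
class UrlFrontCoding {

    // Encodes each URL as "<sharedPrefixLength>|<suffix>" relative to the previous URL.
    static List<String> encode(List<String> sortedUrls) {
        List<String> encoded = new ArrayList<String>();
        String previous = "";
        for (String url : sortedUrls) {
            int common = 0;
            int limit = Math.min(previous.length(), url.length());
            while (common < limit && previous.charAt(common) == url.charAt(common)) {
                common++;
            }
            encoded.add(common + "|" + url.substring(common));
            previous = url;
        }
        return encoded;
    }

    // Reverses encode(): rebuilds each URL from the previous one.
    static List<String> decode(List<String> encoded) {
        List<String> urls = new ArrayList<String>();
        String previous = "";
        for (String entry : encoded) {
            // The first '|' always ends the length field, because digits never contain '|'.
            int separator = entry.indexOf('|');
            int common = Integer.parseInt(entry.substring(0, separator));
            String url = previous.substring(0, common) + entry.substring(separator + 1);
            urls.add(url);
            previous = url;
        }
        return urls;
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList("http://a.com/x", "http://a.com/y");
        List<String> encoded = encode(urls);
        System.out.println(encoded);
        System.out.println(decode(encoded).equals(urls));
    }
}
```

Because neighbouring URLs in a sorted list usually share the scheme, host, and much of the path, most entries shrink to a small number and a short suffix; an entropy coder could then be applied on top.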

Re: Development plans for Lucene?

2002-10-30 Thread Clemens Marschner
Otis, Kelvin, and I have been discussing how we could take Lucene to the next level. We have some components in the sandbox (LARM Webcrawler, Indyo indexing framework) that have to be woven together with Lucene. This could end up as a real search engine server. I call it "Lucene Advanced Retrie

Re: Diffs for enabling query rewriting

2002-10-30 Thread Clemens Marschner
Could any committer please have a look at the diffs I posted a week ago? --Clemens - Original Message - From: "Clemens Marschner" <[EMAIL PROTECTED]> To: "Lucene Developers List" <[EMAIL PROTECTED]> Sent: Wednesday, October 23, 2002 10:38 PM Subject: Di

Re: Lucene Site Updated

2002-10-30 Thread Clemens Marschner
Thanks, Peter, for the update. For those of you who read the doc on the LARM crawler: I added some new sections, reflecting the changed state of the CVS version of the crawler: - command line options were extended such that a list of URLs can now be transmitted, not only one. - URL normalization wa

Re: Lucene files

2002-10-28 Thread Clemens Marschner
Should I cast this doc into XML? I would like to do that, as well as the ranking function from the FAQ --Clemens - Original Message - From: "Peter Carlson" <[EMAIL PROTECTED]> To: "Lucene Developers List" <[EMAIL PROTECTED]> Sent: Monday, October 28, 2002 4:18 PM Subject: Re: Lucene files

Diffs for enabling query rewriting

2002-10-23 Thread Clemens Marschner
Enclosed are the diffs I promised for enabling query rewriting. This also enables tools such as the HTML term highlighter (http://www.iq-computing.de/lucene/highlight.jsp). There's one difference from the white paper there: I didn't want to make arrays public, so getClauses() in BooleanClause o

Re: Docs@Sandbox

2002-10-23 Thread Clemens Marschner
Do I need committer access to /Lucene for that? I only have it for lucene-sandbox --C. - Original Message - From: "Otis Gospodnetic" <[EMAIL PROTECTED]> To: "Lucene Developers List" <[EMAIL PROTECTED]>; "Clemens Marschner" <[EMAIL PROTECTE

Docs@Sandbox

2002-10-23 Thread Clemens Marschner
Could someone (Otis?) please give some hints on how docs for sandbox projects can be created such that they appear on the Lucene web site? Clemens -- http://www.cmarschner.net

Re: LARM work

2002-10-18 Thread Clemens Marschner
to understand the concepts. That in turn raises the entry cost for other people. Clemens - Original Message - From: "Otis Gospodnetic" <[EMAIL PROTECTED]> To: "Clemens Marschner" <[EMAIL PROTECTED]>; "Mehran Mehr" <[EMAIL PROTECTED]> Sent:

Re: your crawler

2002-09-20 Thread Clemens Marschner
Re: cvs commit: jakarta-lucene/src/java/org/apache/lucene/index FieldsReader.java> - Original Message - >From: Halácsy Péter >To: [EMAIL PROTECTED] >Sent: Friday, September 20, 2002 12:10 PM >Subject: your crawler > > >BTW what is the status of the LARM crawler. 2 months ago I promised I c

Re: cvs commit: jakarta-lucene/src/java/org/apache/lucene/index FieldsReader.java

2002-09-19 Thread Clemens Marschner
Again, I would favor a configuration file for the JRefactory pretty printer that contains those standards. The pretty printer is available as a command-line tool and as an integration with a variety of IDEs, and it guarantees standard conformance. Clemens

Re: Query Rewriting

2002-09-12 Thread Clemens Marschner
I would prefer a standardized template for the pretty printer of JRefactory, which can be used stand-alone or from within IDEs (JBuilder, Cafe, Netbeans, Elixir). http://jrefactory.sourceforge.net/cspretty.html I haven't spent a minute thinking about coding standards since I got that. You can w

Re: Query Rewriting

2002-09-12 Thread Clemens Marschner
> Please submit diffs. Yeah, I'll do that on the weekend, have to get the latest from CVS (no fast Internet connection today). My IDE converts tabs to spaces when saving. Severe? >@@ -151,7 +151,7 @@ > Term term = enum.term(); > if (term != null && term.fi

Uppercase/lowercase in GermanStemmer

2002-09-11 Thread Clemens Marschner
I had a problem with the German stemmer, since it tries to detect nouns by looking for an uppercase first letter. This information is only used when a word ends with "t" in which case it is not stemmed. However, it's very naive to think words are nouns if and only if they begin with a capital let

Query Rewriting

2002-09-08 Thread Clemens Marschner
I want to perform some rewriting rules on the queries I get. The best way to do that is to edit the parse tree. However, the Query classes do not contain any methods for reading out or altering their contents or to clone them. Is there any reason for that? Or is this just a feature nobody has ne

Re: AND on two weighted fields

2002-08-21 Thread Clemens Marschner
No, since it has to be possible that one or more of the tokens occur in only one field. That means with the current query parser I can only simulate that with (+((field1:(+token1 +token2 +token3)^2 +field2:(token1 token2 token3))) (+((field1:(token1 +token2 +token3)^2 +field2:(+token1 token2 toke

AND on two weighted fields

2002-08-21 Thread Clemens Marschner
I need to perform an AND query on two fields and weight the results according to which field they came from. That is, I would need something like (field1^2 OR field2^1):(+token1 +token2 +token3) This means that _all_ of the tokens _have_ to occur in either one of these fields, and

Re: document & field boosting

2002-08-12 Thread Clemens Marschner
Hi, Doug, do you think the ranking function as stated in the FAQ (http://lucene.sourceforge.net/cgi-bin/faq/faqmanager.cgi?file=chapter.search&toc=faq#q31) is still correct after the recent changes? Clemens - Original Message - From: "Doug Cutting" <[EMAIL PROTECTED]> To: <[EMAIL PROT

Article about Lucene

2002-08-12 Thread Clemens Marschner
As I already mentioned, "Java Magazin" has a feature story about Lucene in its current issue. http://www.javamagazin.de/itr/ausgaben/show.php3?id=99&nodeid=20 Unfortunately, it's not available online. Only the source code for the example application is. Clemens --

Re: LARM: Configuration RFC

2002-08-12 Thread Clemens Marschner
> My overall impression is that this is overly complicated. > My brain is probably tired (past 1 AM), but I can't help but think that > there must be a simpler way Hi Otis, sorry for the delay, because of that I will repeat most of my original message: > > I distinguish (logically, not nece

Re: I need your advice

2002-07-29 Thread Clemens Marschner
If you'd like to get some insight into the LARM crawler, feel free to read http://cvs.apache.org/viewcvs.cgi/jakarta-lucene-sandbox/contributions/webcrawler-LARM/doc/webcrawler_tech_overview.pdf http://cvs.apache.org/viewcvs.cgi/jakarta-lucene-sandbox/contributions/webcrawler-LARM/CHANGES.txt http:/

Re: Good Software/Documentation was Re: I need your advice

2002-07-29 Thread Clemens Marschner
> > I get rated by the feedback I get from readers, so if you like > the articles, please fill out the little form at the bottom. > That's the kind of feedback we'd need for all our docs :-) --Clemens

Jakarta docs // Re: I need your advice

2002-07-29 Thread Clemens Marschner
May I add that the jakarta site structure doesn't really support good documentation? I often find it very confusing. I often find myself on jakarta-general pages when I expect to be on lucene pages. Docs for different projects differ more than you would expect from the standardized output format.

Book on crawlers (and almost all other features of current search engines)

2002-07-28 Thread Clemens Marschner
Soumen Chakrabarti will have a chapter on crawlers in his new book "Mining the Web: Discovering Knowledge from Hypertext Data" (Morgan Kaufmann), which will be the first one I know of about this topic. http://www.cse.iitb.ac.in/soumen/main/book-toc.ps http://www.amazon.com/exec/obidos/ASIN/1

LARM as an Avalon Phoenix application

2002-07-16 Thread Clemens Marschner
The more I read about Avalon, the more I get the impression that this framework would solve a lot of the needs I outlined in the "Config RFC" document I posted some days ago. I'm really thinking about "avalonizing" the crawler. Any comments? Clemens -- http

Re: Configuration RFC

2002-07-15 Thread Clemens Marschner
> Below are some comments and questions from someone just getting into the > crawling concepts, but trying to provide constructive ideas. Very very good. That's the kind of discussion I wanted. > 1) The MessageQueue system seems to be somewhat problematic because of > memory issues. This seems l

Re: Configuration RFC

2002-07-14 Thread Clemens Marschner
> This could be done in two ways: ... sorry.. the second way: Write a Source that puts in new messages, but runs in its own thread. I have to add that sources don't exist at this time. Clemens

Re: Configuration RFC

2002-07-14 Thread Clemens Marschner
> >> Also, would the source be able to say, I want to get all files which meet > >> this pattern and blindly attempt to get the set of files with changing > >> parameters? A message generator? > > > > I don't really know if I get your point. What do you want to accomplish? > > So there are some si

Re: Configuration RFC

2002-07-13 Thread Clemens Marschner
You're right about (at least) one thing: This all seems pretty much like Avalon to me. After writing the doc, I read a little in the Avalon docs and found all that very similar. I already mentioned that some days ago. --Clemens

Re: Configuration RFC

2002-07-13 Thread Clemens Marschner
> I think you may have mentioned it, but how do the Sources fit in. > > For example, one of the goal I am trying to get would be to get a URL > (xyz.html) and then change the url based on some pattern. So change xyz.html > to xyz.xml and get the .xml as a request. > I think you mentioned below th

LARM: Configuration RFC

2002-07-13 Thread Clemens Marschner
ok, this is my proposal for the crawler configuration. And you tell me if I'm reinventing the wheel: Overview I distinguish (logically, not necessarily on a class level) between 5 different types of components: - "filters" are parts of the message pipeline. They get a message and eithe

Fw: Configuration RFC

2002-07-13 Thread Clemens Marschner
ok, this is my proposal for the crawler configuration. And you tell me if I'm reinventing the wheel: Overview I distinguish (logically, not necessarily on a class level) between 5 different types of components: - "filters" are parts of the message pipeline. They get a message and eithe

Article about Lucene

2002-07-11 Thread Clemens Marschner
For the ones capable of reading German... there will be an article about the Lucene search engine in the next issue of the German "Java-Magazin". So expect a lot of new Lucene users with questions like "vat is ze vay how lucene is vorking?" :-) --Clemens -

Mehran Mehr as Lucene-Sandbox developer

2002-06-28 Thread Clemens Marschner
Hi, I'd like to propose adding Mehran Mehr to the list of Lucene Sandbox developers. He's contributed build files for the LARM web crawler, and wants to maintain them in the future. He said that after finishing a major project in a few weeks he'd like to contribute to other parts of the LARM sub

Re: Becoming a collaborator?

2002-06-28 Thread Clemens Marschner
> Perhaps a jakarta.apache.org/lucene/sandbox/ documentation directory is > in order. +1 :-) If it can be maintained by lucene-sandbox contributors: perfect. Clemens

Re: Becoming a collaborator for LARM?

2002-06-28 Thread Clemens Marschner
VS version of the lucene-sandbox with the LARM repository (if you don't know how, please refer to http://jakarta.apache.org/site/cvsindex.html) - read docs/technical_overview.{rtf|pdf}, CHANGES.txt, and TODO.txt (the 2 latter being updated at every commit) - tell me your ideas and what you

LARM Crawler: Repository

2002-06-21 Thread Clemens Marschner
Ok I think I got your point. > You have MySQL to hold your links. > You have N crawler threads. > You don't want to hit MySQL a lot, so you get links to crawl in batches > (e.g. each crawler thread tells MySQL: give me 1000 links to crawl). [just to make it clear: this looks like the threads wo
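
The batching idea quoted above can be sketched as follows. An in-memory queue stands in for the MySQL link table, and the class name LinkStore, its methods, and the round-trip counter are hypothetical; the real LARM repository interface is not shown in this message.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical stand-in for a database-backed link repository.
class LinkStore {
    private final Queue<String> pending = new ArrayDeque<String>();
    private int roundTrips = 0; // how often the (simulated) database was hit

    synchronized void add(String url) {
        pending.add(url);
    }

    // Hands out up to maxSize links in one call, so a crawler thread
    // contacts the store once per batch instead of once per link.
    synchronized List<String> nextBatch(int maxSize) {
        roundTrips++;
        List<String> batch = new ArrayList<String>();
        while (batch.size() < maxSize && !pending.isEmpty()) {
            batch.add(pending.poll());
        }
        return batch;
    }

    synchronized int roundTrips() {
        return roundTrips;
    }

    public static void main(String[] args) {
        LinkStore store = new LinkStore();
        for (int i = 0; i < 5; i++) {
            store.add("http://example.org/page" + i);
        }
        List<String> batch;
        while (!(batch = store.nextBatch(2)).isEmpty()) {
            System.out.println(batch);
        }
        System.out.println(store.roundTrips());
    }
}
```

The methods are synchronized so that several crawler threads could share one store; with a real database the same effect would come from a single query that claims a block of rows.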

Avalon anybody?

2002-06-21 Thread Clemens Marschner
> > One last thought: > > - the crawler should be started as a daemon process (at least > > optionally) > > - it should wake up from time to time to crawl changed pages > > - it should provide a management and status interface to the outside. > > - it internally needs the ability to run servi

Re: LARM Crawler and the case of the missing HTTPClient.zip

2002-06-21 Thread Clemens Marschner
I don't get a 404 with this URL: http://www.innovation.ch/java/HTTPClient/ The URLs of the archives are http://www.innovation.ch/java/HTTPClient/HTTPClient.tar.gz or http://www.innovation.ch/java/HTTPClient/HTTPClient.zip Regards, Clemens - Original Message - From: "Matthew King" <[EMAI

Re: LARM Web Crawler: note on normalized URLs

2002-06-21 Thread Clemens Marschner
> It may be even nicer to use some DB implemented in Java, such as > HyperSQL (I think that's the name) or Smyle > (https://sourceforge.net/projects/smyle/) or Berkeley DB > (http://www.sleepycat.com/), although MySQL may be simpler if you want > to create a crawler that can be run on a cluster o

LARM Crawler: Status

2002-06-11 Thread Clemens Marschner
Hi, I just want to keep you informed on how we plan to integrate the LARM crawler with Lucene. I'm working with Mehran Mehr on two major topics: 1. Lucene storage: We want to see a web document as a bunch of name-value pairs, one of which is the URL and the other could be the document itself. >

Re: Clemens Marschner as Lucene Sandbox developer

2002-05-04 Thread Clemens Marschner
Thanks for the flowers. I'm writing a technical overview document at the moment, to facilitate further development. Clemens > +1 - I am excited to try out his contribution.

Re: Compressing Links / Was: Re: Web Crawler

2002-04-24 Thread Clemens Marschner
> see this:http://www.almaden.ibm.com/cs/k53/www9.final/ > > . "In CS2, each URL is stored in 10 bytes. In CS1, each link requires 8 bytes to store as both an in-link and out-link; in CS2, an average of only 3.4 bytes are used. Second, CS2 provides additional functionality in the form of a host da

Re: Web Crawler

2002-04-24 Thread Clemens Marschner
> I can tell you in advance that have all the visited links in memory will kill your machine after about 150'000 links, i tested that, i crawled amazon.com and after 200'000 links the cpu was 100%, no response to event,nothing.The best thing? < It's not that bad with my crawler. I crawled 600.000

Web Crawler

2002-04-24 Thread Clemens Marschner
n() method). Since I just used it for myself, this was fine so far. Cheers, Clemens Marschner