[Nutch-dev] Re: Nutch / CGI

2005-03-31 Thread Olaf Thiele
Hi everybody, I haven't used either PLucene or PyLucene. Those were just guesses at what could be used. Thanks for pointing to PyLucene though, I didn't know it existed. Jeff: Concerning your CGI implementation, you might want to follow the OpenSearch API thread. The REST implementation proposed

[Nutch-dev] war target in build.xml

2005-03-31 Thread Jack Tang
Hi, I don't know why the war target in build.xml (in svn) does not include jakarta-oro-2.0.7.jar. <target name="war" depends="jar,generate-docs"> ... <lib dir="${lib.dir}"> <include name="lucene*.jar"/> <include name="taglibs-*.jar"/> <include

Re: [Nutch-dev] Re: New nutch plugin

2005-03-31 Thread Stefan Groschupf
On 31.03.2005 at 03:55, Rohit Kulkarni wrote: I tried to search for the parse zip files plugin implementation you mentioned...but couldn't find it It was in the old bug tracker, but this is not available any more. However, your plugin is easy to implement. Just uncompress the content and then

[Nutch-dev] Re: tools cleanup

2005-03-31 Thread Doug Cutting
Doug Cutting wrote: The proposal: One more: 7. No code should call NutchConf.get() except a tool's main(). Doug --- This SF.net email is sponsored by Demarc: A global provider of Threat Management Solutions. Download our HomeAdmin security

[Nutch-dev] Re: tools cleanup

2005-03-31 Thread Andrzej Bialecki
John X wrote: On Thu, Mar 31, 2005 at 12:45:39AM +0200, Stefan Groschupf wrote: Actually it is difficult to have tools using ndfs and local file system. What do people think about introducing a ndfs notation in paths like it is used in protocol handlers? (ala http:// or file://) I don't mean to
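
The scheme-prefix idea quoted above (ndfs paths written like protocol-handler URLs) could look roughly like the following. This is only a minimal sketch, not Nutch code: the class name, the helper `schemeOf`, the example path, and the choice to treat a path without a prefix as "file" are all assumptions made for illustration.

```java
import java.net.URI;

public class PathSchemeSketch {
    // Hypothetical helper: returns the scheme prefix of a path such as
    // "ndfs:///crawl/segments". Falling back to "file" for paths without
    // a prefix is an assumption, not something the thread specifies.
    static String schemeOf(String path) {
        String scheme = URI.create(path).getScheme();
        return scheme == null ? "file" : scheme;
    }

    public static void main(String[] args) {
        // A tool could branch on the scheme to pick ndfs or the local file system.
        System.out.println(schemeOf("ndfs:///crawl/segments"));
        System.out.println(schemeOf("file:///crawl/segments"));
        System.out.println(schemeOf("/crawl/segments"));
    }
}
```

Note that `URI.create` rejects paths containing spaces or backslashes, so a real implementation would need its own parsing or escaping.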

[Nutch-dev] Re: Nutch / CGI

2005-03-31 Thread Jeff Breidenbach
Actually, I was thinking about a program written in Java using the CGI interface instead of the servlet framework. Sounds like nobody is working on that. On Wed, 30 Mar 2005 2:40 am, Olaf Thiele wrote: Hi Jeff, as segments are stored in a home-grown file format you will need to program your

Re: [Nutch-dev] Re: New nutch plugin

2005-03-31 Thread Rohit Kulkarni
Hi Stefan, I am new to this mailing list and came across the parse zip file plugin discussion. Back in the days I already contributed such a plugin. Browse the old list archive or bugzilla. I tried to search for the parse zip files plugin implementation you mentioned...but couldn't find it

[Nutch-dev] Re: OpenSearch API (Re: Nutch / CGI)

2005-03-31 Thread Jérôme Charron
In particular, I would love to see a REST contribution. Yes, I think it's a great idea too. Once this is implemented, search.jsp can be replaced with a filter that applies a stylesheet to XML search results. Servlet => XML => HTML instead of Servlet => HTML In my opinion, it is the front-end

[Nutch-dev] hits page list

2005-03-31 Thread Feri
Dear Developers, I would like to show a page list (1-10) at the end of the hit list (as Google does). The problem is that when there are several hits from one site, Hits.getTotal() does not return the real end of the hits. When I click on, e.g., the 3rd page, the result is an empty page (NutchBean.search is out of

[Nutch-dev] tools cleanup

2005-03-31 Thread Doug Cutting
I propose we clean up Nutch's tools as follows. First, some definitions: 1. An action is an operation on Nutch data. For example, GenerateSegmentFromDB, FetchSegment, UpdateDB, IndexSegment, MergeIndexes, SearchServer, etc. are all actions. 2. A tool invokes an action from the command line. The

[Nutch-dev] Date range and url search

2005-03-31 Thread Rohit Kulkarni
Hi, Just wanted to know if nutch supports date range search (say query for web pages updated in last X days) and url search (like the site: in google) yet. If yes what syntax should be used while giving the query ? Thanks, Rohit

[Nutch-dev] Re: OpenSearch API (Re: Nutch / CGI)

2005-03-31 Thread Doug Cutting
Andrzej Bialecki wrote: This is yet another case that speaks in favor of adding an out-of-the-box XML API to Nutch. Yes, I agree. * REST - HTTP GET or POST request, with query parameters contained in GET or POST parameters. An XML data document with results is a response. Lightweight, easy to

[Nutch-dev] [jira] Created: (NUTCH-33) MIME content type detector (using magic char sequences)

2005-03-31 Thread Jerome Charron (JIRA)
MIME content type detector (using magic char sequences) --- Key: NUTCH-33 URL: http://issues.apache.org/jira/browse/NUTCH-33 Project: Nutch Type: New Feature Reporter: Jerome Charron Priority: Minor

[Nutch-dev] Re: tools cleanup

2005-03-31 Thread Stefan Groschupf
Doug, The proposal: 1. Actions and tools should be separate classes, in separate files. Wonderful! :-) That will make a set of things (e.g. run nutch in a container) very easy. 3. All actions must implement the following interface: Inversion of control makes a lot of sense! 5. All plugins must

[Nutch-dev] [jira] Commented: (NUTCH-7) analyze tool takes up all the disk space when there are circular links

2005-03-31 Thread Phoebe Miller (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-7?page=comments#action_61899 ] Phoebe Miller commented on NUTCH-7: --- I have fixed this problem by changing the update database tool. Basically, links from a page are not added if the page has already been

[Nutch-dev] Re: tools cleanup

2005-03-31 Thread Doug Cutting
Andrzej Bialecki wrote: This also nicely solves the non-obvious requirement that all ndfs paths must begin with a slash... I fixed that a while back. Things that don't start with a slash are currently made relative to /user/$USER. Doug
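
The rule described in this message (paths without a leading slash are taken relative to /user/$USER) can be illustrated with a small sketch. It is not the actual ndfs code: the class name, the `resolve` helper, and passing the user name in as a parameter are assumptions made for the example.

```java
public class NdfsPathSketch {
    // Hypothetical helper illustrating the rule from the message:
    // absolute paths are kept, anything else is placed under /user/<user>.
    static String resolve(String path, String user) {
        if (path.startsWith("/")) {
            return path;
        }
        return "/user/" + user + "/" + path;
    }

    public static void main(String[] args) {
        // The user name would normally come from the environment.
        String user = System.getProperty("user.name");
        System.out.println(resolve("crawl/segments", user));
        System.out.println(resolve("/crawl/segments", user));
    }
}
```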

[Nutch-dev] Re: OpenSearch API (Re: Nutch / CGI)

2005-03-31 Thread Doug Cutting
Jérôme Charron wrote: Servlet => XML => HTML instead of Servlet => HTML In my opinion, it is the dream front-end architecture. But more pragmatically, I'm not sure it's a good idea. XSL transformation is a rather slow process!! And the Nutch front-end must be very responsive. I don't think this

[Nutch-dev] RE: [jira] Commented: (NUTCH-7) analyze tool takes up all the disk space when there are circular links

2005-03-31 Thread Jay Yu
Is your change to the update db tool going to be in the next release? Have you tested it? Thanks for the fix! -Original Message- From: Phoebe Miller (JIRA) [mailto:[EMAIL PROTECTED] Sent: Thursday, March 31, 2005 8:59 AM To: nutch-dev@incubator.apache.org Subject: [jira] Commented:

[Nutch-dev] Re: hits page list

2005-03-31 Thread Roger Dunk
The way I do it is thus: When hits.totalIsExact(), the final page can be found simply from hits.getTotal() When NOT hits.totalIsExact(), I run the query again, this time retrieving say 1000 urls (the max number of results I allow to be returned). Using a loop (increment counter by number of
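
For the first case in this message, where hits.totalIsExact() holds, the final page follows from the total by a ceiling division. The sketch below shows only that arithmetic; the class name, the `lastPage` helper, and the `hitsPerPage` parameter are assumptions for illustration, and the re-query approach for the inexact case is not covered because the message is cut off.

```java
public class LastPageSketch {
    // Hypothetical helper: number of the final result page (1-based) when
    // the total hit count is exact. Returns 0 when there are no hits.
    static long lastPage(long total, int hitsPerPage) {
        if (hitsPerPage <= 0) {
            throw new IllegalArgumentException("hitsPerPage must be positive");
        }
        // Ceiling division without floating point.
        return (total + hitsPerPage - 1) / hitsPerPage;
    }

    public static void main(String[] args) {
        // In a search page, total would come from hits.getTotal().
        System.out.println(lastPage(25, 10));
        System.out.println(lastPage(30, 10));
    }
}
```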