[Nutch-dev] Re: hits page list

2005-03-31 Thread Roger Dunk
The way I do it is thus: When hits.totalIsExact(), the final page can be found simply from hits.getTotal() When NOT hits.totalIsExact(), I run the query again, this time retrieving say 1000 urls (the max number of results I allow to be returned). Using a loop (increment counter by number of res
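The last-page calculation described above can be sketched roughly as follows. This is only an illustration in Python, not Nutch code: `count_hits` is a hypothetical callback standing in for "run the query again with a larger result cap", and the loop mentioned in the truncated message is not reproduced here.

```python
def last_page(total, total_is_exact, hits_per_page, count_hits=None, max_results=1000):
    """Return the 1-based number of the final results page (0 when there are no hits).

    total, total_is_exact -- stand-ins for hits.getTotal() and hits.totalIsExact()
    count_hits            -- hypothetical callback: re-runs the query asking for up
                             to `cap` results and returns how many came back
    max_results           -- the most results the front end will ever page through
    """
    if total_is_exact:
        n = total
    else:
        # The reported total is only an estimate, so count the real hits,
        # never going beyond the allowed maximum.
        n = min(count_hits(max_results), max_results)
    # Integer ceiling division: a partially filled page still counts as a page.
    return (n + hits_per_page - 1) // hits_per_page
```

With 10 hits per page, an exact total of 95 gives page 10 as the last page; an inexact total is replaced by whatever count the second query returns.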

[Nutch-dev] RE: [jira] Commented: (NUTCH-7) analyze tool takes up all the disk space when there are circular links

2005-03-31 Thread Jay Yu
Is your change to the update db tool going to be in the next release? Have you tested it? Thanks for the fix! -Original Message- From: Phoebe Miller (JIRA) [mailto:[EMAIL PROTECTED] Sent: Thursday, March 31, 2005 8:59 AM To: nutch-dev@incubator.apache.org Subject: [jira] Commented: (NUTC

[Nutch-dev] Re: OpenSearch API (Re: Nutch / CGI)

2005-03-31 Thread Doug Cutting
Jérôme Charron wrote: Servlet => XML => HTML instead of Servlet => HTML In my opinion, it is the front-end "dreamed" architecture. But more pragmatically, I'm not sure it's a good idea. XSL transformation is a rather slow process!! And the Nutch front-end must be very responsive. I don't think this

[Nutch-dev] Re: tools cleanup

2005-03-31 Thread Doug Cutting
Andrzej Bialecki wrote: This also nicely solves the non-obvious requirement that all ndfs paths must begin with a slash... I fixed that a while back. Things that don't start with a slash are currently made relative to /user/$USER. Doug
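The path rule described in this message can be illustrated with a small sketch. The function name and the `user` argument are made up for the example; only the "/user/$USER" base comes from the message.

```python
import posixpath


def resolve_ndfs_path(path, user):
    """Resolve a path the way the message describes (illustrative only).

    Paths that start with a slash are used as given; anything else is
    taken relative to /user/<user>.
    """
    if path.startswith("/"):
        return path
    return posixpath.join("/user", user, path)
```

For a hypothetical user "doug", "crawl/segments" would resolve to "/user/doug/crawl/segments", while "/data/db" is left alone.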

[Nutch-dev] [jira] Commented: (NUTCH-7) analyze tool takes up all the disk space when there are circular links

2005-03-31 Thread Phoebe Miller (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-7?page=comments#action_61899 ] Phoebe Miller commented on NUTCH-7: --- I have fixed this problem by changing the update database tool, basically, links from a page are not added if the page has already been pro

[Nutch-dev] Re: tools cleanup

2005-03-31 Thread Stefan Groschupf
Doug, The proposal: 1. Actions and tools should be separate classes, in separate files. Wonderful! :-) That will make a set of things (e.g. run nutch in a container) very easy. 3. All actions must implement the following interface: Inversion of control makes a lot of sense! 5. All plugins must imp

[Nutch-dev] [jira] Created: (NUTCH-33) MIME content type detector (using magic char sequences)

2005-03-31 Thread Jerome Charron (JIRA)
MIME content type detector (using magic char sequences) --- Key: NUTCH-33 URL: http://issues.apache.org/jira/browse/NUTCH-33 Project: Nutch Type: New Feature Reporter: Jerome Charron Priority: Minor Extens
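As a rough idea of what detection by magic character sequences means, here is a minimal sketch. It is not the NUTCH-33 implementation; the table below lists a few well-known file signatures chosen for illustration, and the function name and default type are assumptions.

```python
# A few well-known leading byte sequences ("magic numbers") and the
# MIME types they usually indicate. The table is deliberately tiny.
MAGIC_TABLE = [
    (b"%PDF-", "application/pdf"),
    (b"PK\x03\x04", "application/zip"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
]


def detect_mime(data, default="application/octet-stream"):
    """Guess a MIME type from the first bytes of `data`."""
    for magic, mime in MAGIC_TABLE:
        if data.startswith(magic):
            return mime
    # No signature matched: fall back to a generic binary type.
    return default
```

A real detector would typically also support signatures at non-zero offsets and combine the result with the file extension and the server-supplied Content-Type.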

[Nutch-dev] Re: OpenSearch API (Re: Nutch / CGI)

2005-03-31 Thread Doug Cutting
Andrzej Bialecki wrote: This is yet another case that speaks in favor of adding an "out-of-the-box" XML API to Nutch. Yes, I agree. * REST - HTTP GET or POST request, with query parameters contained in GET or POST parameters. An XML data document with results is a response. Lightweight, easy to
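The REST option described here (query parameters in, an XML results document out) might look roughly like the sketch below. The parameter name "query", the element names, and the `search` callback are all invented for the example; none of them is taken from Nutch or from the OpenSearch format.

```python
from urllib.parse import parse_qs
import xml.etree.ElementTree as ET


def handle_search(query_string, search):
    """Turn a GET query string into an XML results document.

    query_string -- the raw part after '?', e.g. "query=nutch"
    search       -- hypothetical callback returning (title, url) pairs
    """
    params = parse_qs(query_string)
    query = params.get("query", [""])[0]

    root = ET.Element("results", {"query": query})
    for title, url in search(query):
        hit = ET.SubElement(root, "hit")
        ET.SubElement(hit, "title").text = title
        ET.SubElement(hit, "url").text = url
    # ElementTree escapes special characters in text and attributes.
    return ET.tostring(root, encoding="unicode")
```

Because the response is plain XML, any client that can issue an HTTP request and parse XML could consume it, which is the "lightweight" property the message points to.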

[Nutch-dev] Date range and url search

2005-03-31 Thread Rohit Kulkarni
Hi, Just wanted to know if Nutch supports date range search (say, a query for web pages updated in the last X days) and url search (like the site: operator in Google) yet. If yes, what syntax should the query use? Thanks, Rohit

[Nutch-dev] tools cleanup

2005-03-31 Thread Doug Cutting
I propose we clean up Nutch's tools as follows. First, some definitions: 1. An "action" is an operation on Nutch data. For example, GenerateSegmentFromDB, FetchSegment, UpdateDB, IndexSegment, MergeIndexes, SearchServer, etc. are all actions. 2. A "tool" invokes an action from the command line.
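The action/tool split defined above could be pictured like this. It is only a sketch of the separation of concerns: the class, its `run` method, and the usage string are hypothetical and do not reproduce the interface the proposal goes on to define.

```python
import sys


class UpdateDBAction:
    """Hypothetical action: operates on data and knows nothing about argv."""

    def __init__(self, db_path):
        self.db_path = db_path

    def run(self):
        # A real action would update the database; here we only report.
        return f"updated {self.db_path}"


def update_db_tool(argv):
    """Hypothetical tool: parses the command line, then invokes the action."""
    if len(argv) != 1:
        print("usage: updatedb <db_path>", file=sys.stderr)
        return 1
    print(UpdateDBAction(argv[0]).run())
    return 0
```

Keeping argument parsing out of the action means the same action could be driven from a command line, a test, or a container without change.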

[Nutch-dev] hits page list

2005-03-31 Thread Feri
Dear Developers, I would like to show a page list (1-10) that runs to the end of the hit list, as Google does. The problem is that when there are several hits from one site, Hits.getTotal() does not return the real end of the hits. When I click on, e.g., page 3, the result is an empty page (NutchBean.search is out of t

Re: [Nutch-dev] Re: Nutch / CGI

2005-03-31 Thread Minty
<[EMAIL PROTECTED]> wrote: > Plucene's index format is currently/still compatible with that of > Lucene's (hm, which version... don't remember off the top of my head), > but may not be for much longer. not at all convinced about that, if you check out the latest "stable" code from cpan.org it's u

[Nutch-dev] Re: OpenSearch API (Re: Nutch / CGI)

2005-03-31 Thread Jérôme Charron
> In particular, I would love to see a REST contribution. Yes, I think it's a great idea too. > Once this is implemented, search.jsp can be replaced with a filter that > applies a stylesheet to XML search results. Servlet => XML => HTML instead of Servlet => HTML In my opinion, it is the front-end

Re: [Nutch-dev] Re: New nutch plugin

2005-03-31 Thread Rohit Kulkarni
Hi Stefan, I am new to this mailing list and came across the parse zip file plugin discussion.. >Back in the days I already contributed such plugin. >Browse the old list archive or bugzilla. I tried to search for the parse zip files plugin implementation you mentioned...but couldn't find it co

[Nutch-dev] Re: Nutch / CGI

2005-03-31 Thread Jeff Breidenbach
Actually, I was thinking about a program written in Java using the CGI interface instead of the servlet framework. Sounds like nobody is working on that. On Wed, 30 Mar 2005 2:40 am, Olaf Thiele wrote: Hi Jeff, as segments are stored in a home-grown file format you will need to program your own

[Nutch-dev] Re: tools cleanup

2005-03-31 Thread Andrzej Bialecki
John X wrote: On Thu, Mar 31, 2005 at 12:45:39AM +0200, Stefan Groschupf wrote: Actually it is difficult to have tools using ndfs and local file system. What do people think about introducing an ndfs notation in paths like it is used in protocol handlers? (ala http:// or file://) I don't mean to wr

[Nutch-dev] Re: tools cleanup

2005-03-31 Thread Feng Zhou
I second this. But it would still be useful to keep the current NDFS config entries. This is because if these URIs become the main method of using ndfs, they could end up in a lot of scripts users write. Then it would be inconvenient to change the namenode. Maybe we could use ndfs:///path (three s
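One way to read the ndfs:///path idea is that an empty host part falls back to the namenode from the existing config entries. The sketch below assumes that reading; the function name, the "host:port" format, and the example namenode values are illustrative, not part of Nutch.

```python
from urllib.parse import urlsplit


def parse_ndfs_uri(uri, default_namenode):
    """Split an ndfs URI into (namenode, path).

    default_namenode -- assumed to come from the existing NDFS config entry;
                        used when the URI leaves the host part empty,
                        as in ndfs:///path
    """
    parts = urlsplit(uri)
    if parts.scheme != "ndfs":
        raise ValueError(f"not an ndfs URI: {uri!r}")
    namenode = parts.netloc or default_namenode
    return namenode, parts.path
```

Under this reading, scripts that use ndfs:///path keep working when the namenode moves, because only the config entry has to change.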

[Nutch-dev] Re: tools cleanup

2005-03-31 Thread Doug Cutting
Doug Cutting wrote: The proposal: One more: 7. No code should call NutchConf.get() except a tool's main(). Doug --- This SF.net email is sponsored by Demarc: A global provider of Threat Management Solutions. Download our HomeAdmin security softwa

Re: [Nutch-dev] Re: New nutch plugin

2005-03-31 Thread Stefan Groschupf
On 31.03.2005 at 03:55, Rohit Kulkarni wrote: I tried to search for the parse zip files plugin implementation you mentioned...but couldn't find it It was in the old bug tracker, but this is not available any more. However your plugin is easy to realize. Just uncompress the content and then query

[Nutch-dev] war target in build.xml

2005-03-31 Thread Jack Tang
Hi, I don't know why the war target in build.xml (in svn) does not include jakarta-oro-2.0.7.jar. ... ... Can someone explain it? Regards /Jack --

[Nutch-dev] Re: Nutch / CGI

2005-03-31 Thread Olaf Thiele
Hi everybody, I haven't used either PLucene or PyLucene. Those were just guesses for what could be used. Thanks for pointing to PyLucene though, didn't know it existed. Jeff: Concerning your CGI implementation, you might want to follow the OpenSearch API Thread. The REST implementation proposed by