[Nutch-dev] [jira] Updated: (NUTCH-110) OpenSearchServlet outputs illegal xml characters

2006-06-16 Thread John VanDyk (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-110?page=all ] John VanDyk updated NUTCH-110: -- Attachment: fixIllegalXmlChars08-v2.patch Stefan's patch didn't apply cleanly for me on svn revision 413155 so I re-did it. This patch fixes the i

[Nutch-dev] Re: Nutch Parsing PDFs, and general PDF extraction

2006-02-28 Thread John X
a pdf file largely depends on the pdf parsing library it uses, currently PDFBox. It won't be very difficult to switch to other libraries. However it seems hard to find a free/open implementation that can parse every pdf file in the wild. There is an alternative: use nutch's

[Nutch-dev] Re: tool to mount nutch filesystem

2006-02-07 Thread John X
Hi, Mike, On Tue, Feb 07, 2006 at 10:18:11AM -0800, Michael Cafarella wrote: > > John, > > This is a pretty awesome idea. Do you have any performance > numbers or experience with it you can share? No number yet. Just created it for my immediate use of browsing and moving a

[Nutch-dev] [jira] Commented: (NUTCH-193) move NDFS and MapReduce to a separate project

2006-02-02 Thread John Xing (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-193?page=comments#action_12365051 ] John Xing commented on NUTCH-193: - what's in the name hadoop? Because "had oops"? > move NDFS and MapReduce to

[Nutch-dev] Re: tool to mount nutch filesystem

2006-02-02 Thread John X
On Sat, Jan 21, 2006 at 09:23:01AM -0800, John X wrote: > Hi, Sami, > > On Sat, Jan 21, 2006 at 05:32:37PM +0200, Sami Siren wrote: > > >I have created a simple tool to mount nutch filesystem on linux. > > >http://nutch.neasys.com/ndfs/fuse-nutchfs-0.1.0.tar.gz > >

[Nutch-dev] [jira] Created: (NUTCH-199) tool to mount ndfs on linux

2006-02-02 Thread John Xing (JIRA)
tool to mount ndfs on linux --- Key: NUTCH-199 URL: http://issues.apache.org/jira/browse/NUTCH-199 Project: Nutch Type: New Feature Components: ndfs Environment: linux only Reporter: John Xing Assigned to: John Xing tool to mount

[Nutch-dev] [jira] Updated: (NUTCH-199) tool to mount ndfs on linux

2006-02-02 Thread John Xing (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-199?page=all ] John Xing updated NUTCH-199: Attachment: fuse-hadoop-0.1.0.tar.gz It works with current nutch-0.8-dev. Will be ported to hadoop after ndfs is moved. > tool to mount ndfs on li

[Nutch-dev] Re: need volunteer to develop search for apache.org

2006-01-27 Thread John X
Hi, Stefan, On Thu, Jan 26, 2006 at 10:17:52PM +0100, Stefan Groschupf wrote: > John, > if you need any kind of support let me know. Especially I can help > out with UI related stuff, however I also can help with all other > issues. Really appreciated. With all the support from t

[Nutch-dev] Re: need volunteer to develop search for apache.org

2006-01-26 Thread John X
On Thu, Jan 26, 2006 at 12:19:38PM -0800, Doug Cutting wrote: > John X wrote: > >Please count me in. > > Thanks, John. My pleasure. > > I forgot to mention that I'd prefer a committer for this, and you're a > committer, so that works well! > > >Is

[Nutch-dev] Re: tool to mount nutch filesystem

2006-01-21 Thread John X
rowse, no read/write yet. > > > >Doug and Mike: any plan to make ndfs codes into a separate package? > > John, > > I didn't check out your version yet, but I have also written > a version which is read/write capable, should we combine our efforts here? Sure, why not ;-)

[Nutch-dev] Re: tool to mount nutch filesystem

2006-01-21 Thread John X
Hi, Otis, On Fri, Jan 20, 2006 at 09:31:16PM -0800, [EMAIL PROTECTED] wrote: > Hi John, > > NDFS + MapReduce will soon become a separate Lucene sub-project. In one sub-project or two separately? Thanks, John > > Otis > > - Original Message > From: John X

[Nutch-dev] tool to mount nutch filesystem

2006-01-20 Thread John X
I have created a simple tool to mount nutch filesystem on linux. http://nutch.neasys.com/ndfs/fuse-nutchfs-0.1.0.tar.gz Check README inside for how to set up. It is very barebone, only browse, no read/write yet. Doug and Mike: any plan to make ndfs codes into a separate package? Best, John

[Nutch-dev] small bug

2005-08-24 Thread John Maraist
Hi, Just found your project's web page from its log entries at my web site. Just to let you know, it leaves an outdated URL in the log, http://www.nutch.org/docs/en/bot.html , which gives an error on your current documentation. Best,

[Nutch-dev] Re: extend java.net.URL?

2005-08-11 Thread John X
y used in nutch, I am also reluctant > > to replace every one with my own MyURL. It seems I will have > > to hack java.net.URL source directly. This is not portable > > though. I am wondering if there are better alternatives, or > > some tricks can be applied. > >

[Nutch-dev] extend java.net.URL?

2005-08-10 Thread John X
source directly. This is not portable though. I am wondering if there are better alternatives, or some tricks can be applied. Thanks, John --- SF.Net email is Sponsored by the Better Software Conference & EXPO September 19-22, 2005 * San Franc

[Nutch-dev] Re: [VOTE] new Nutch committers

2005-06-08 Thread John X
maintain it. > > Formally, by Apache rules, we need a total of three +1 votes and no -1 > votes from the Lucene PMC. Votes by non-PMC committers and developers > are not binding but are encouraged. > > My votes: > > Jérôme Charron: +1 > Piotr

[Nutch-dev] Re: Upcoming work on Fetcher

2005-04-28 Thread John X
Hi, Andrzej, Could you give us a brief on what you are going to change, so that we can weather your storm better ;-)? Thanks, John On Fri, Apr 29, 2005 at 12:13:49AM +0200, Andrzej Bialecki wrote: > Hi, > > This is just a heads-up that I will be working extensively (under a >

[Nutch-dev] Bug: Nutch indexer crashed

2005-04-25 Thread John Doe
I spent about 30 minutes trying to figure out how to submit a bug via JIRA. There must be a way, but it's not shown on any of the JIRA pages I clicked on. Anyway, here's the bug report: Component: indexer Priority: major After running for several hours on the intranet, the Nutch indexer crashe

[Nutch-dev] [jira] Closed: (NUTCH-33) MIME content type detector (using magic char sequences)

2005-04-17 Thread John Xing (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-33?page=history ] John Xing closed NUTCH-33: -- Resolution: Fixed > MIME content type detector (using magic char sequences) > --- > > K

[Nutch-dev] [jira] Commented: (NUTCH-33) MIME content type detector (using magic char sequences)

2005-04-17 Thread John Xing (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-33?page=comments#action_63022 ] John Xing commented on NUTCH-33: Just committed. Thanks. Nutch is licensed under the Apache License. If freedesktop mime database uses GPL, it could be problematic to have it

[Nutch-dev] [jira] Closed: (NUTCH-19) Space in Java.exe path chokes bin/nutch

2005-04-17 Thread John Xing (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-19?page=history ] John Xing closed NUTCH-19: -- Resolution: Fixed > Space in Java.exe path chokes bin/nutch > --- > > Key: NUTCH-19 >

[Nutch-dev] [jira] Closed: (NUTCH-22) ontology supported query refinement

2005-04-17 Thread John Xing (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-22?page=history ] John Xing closed NUTCH-22: -- Resolution: Fixed > ontology supported query refinement > --- > > Key: NUTCH-22 > URL: http://is

[Nutch-dev] [jira] Commented: (NUTCH-30) rss feed parser

2005-04-17 Thread John Xing (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-30?page=comments#action_63010 ] John Xing commented on NUTCH-30: Could we have an updated patch & zip against most recent svn? Also I am not sure it is a good idea to have parse-rss capture any mime

[Nutch-dev] [jira] Commented: (NUTCH-33) MIME content type detector (using magic char sequences)

2005-04-17 Thread John Xing (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-33?page=comments#action_63004 ] John Xing commented on NUTCH-33: Hi, Jerome and Hari, I committed your contributions last night. Thanks a lot. However, I just noticed that TestMimeTypes.java uses

[Nutch-dev] [jira] Commented: (NUTCH-33) MIME content type detector (using magic char sequences)

2005-04-14 Thread John Xing (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-33?page=comments#action_62877 ] John Xing commented on NUTCH-33: My +1 vote for this contribution. If no objection, I will commit it over the weekend. John > MIME content type detector (using magic c

[Nutch-dev] [jira] Commented: (NUTCH-33) MIME content type detector (using magic char sequences)

2005-04-07 Thread John Xing (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-33?page=comments#action_62415 ] John Xing commented on NUTCH-33: >What is your opinion about this point: >1. Is it the calling code that check the mime.magic property and call >the >getMimeTyp

[Nutch-dev] [jira] Commented: (NUTCH-33) MIME content type detector (using magic char sequences)

2005-04-06 Thread John Xing (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-33?page=comments#action_62316 ] John Xing commented on NUTCH-33: Hi, Jerome, I guess file extension check will be on all the time, but magic check can be an option. Though not ideal, a system wide property

[Nutch-dev] [jira] Commented: (NUTCH-33) MIME content type detector (using magic char sequences)

2005-04-05 Thread John Xing (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-33?page=comments#action_62195 ] John Xing commented on NUTCH-33: Just skimmed the code. The xml approach looks good. Two minor comments: (1) make magic check an option with a boolean property such as

[Nutch-dev] [jira] Assigned: (NUTCH-33) MIME content type detector (using magic char sequences)

2005-04-05 Thread John Xing (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-33?page=history ] John Xing reassigned NUTCH-33: -- Assign To: John Xing > MIME content type detector (using magic char sequences) > --- > >

[Nutch-dev] Re: Date range and url search

2005-04-01 Thread John X
have to be included (property plugin.includes in conf/nutch-default.xml or better conf/nutch-site.xml) John On Wed, Mar 30, 2005 at 05:33:46PM -0800, Rohit Kulkarni wrote: > Hi, > > Just wanted to know if nutch supports date range search (say query for > web pages updated in last X days) an

[Nutch-dev] Re: tools cleanup

2005-03-30 Thread John X
I don't mean to write a protocol handler for ndfs (this would be nice > to have) but I just mean something like: > > bin/nutch generate ndfs://namenode:8010/myNDFSFolder/mydb > /mylocalsegment/ > ndfs path: ndfs://namenode:8010/myNDFSFolder/mydb > local pat

[Nutch-dev] Re: tools cleanup

2005-03-30 Thread John X
ld be processed uniformly too. There are various styles. We need to agree on one. John

Re: [Nutch-dev] I made parse-rss work, but ... Re: Huge Problem trying to develop plugin for Nutch

2005-03-28 Thread John X
eve it's also bad idea to use jaxen-full.jar (use jaxen-core.jar plus a more specific jaxen dom jar) Do you really need commons-httpclient-3.0-beta1.jar (and possibly others)? John > > Thanks again! > > > Cheers, > Chris > > > > On 3/28/05 9:37 AM, "

Re: [Nutch-dev] I made parse-rss work, but ... Re: Huge Problem trying to develop plugin for Nutch

2005-03-28 Thread John X
; I will have a look and will try to find a fix. His zip is still available at http://baron.pagemewhen.com:8080/~chris/parse-rss.zip John > > Stefan > > __ http://www.neasys.com - A Good Place to Be Come to visit us today! -

[Nutch-dev] I made parse-rss work, but ... Re: Huge Problem trying to develop plugin for Nutch

2005-03-27 Thread John X
sible causes? One note: there is a tool called net.nutch.parse.ParserChecker, that you can use to debug parser plugins. It is more convenient to use it than start a crawler. Will you be able to contribute this plugin after the dust settles? Best, John On Sat, Mar 26, 2005 at 01:32:34PM -0800, CH

[Nutch-dev] Re: Huge Problem trying to develop plugin for Nutch

2005-03-26 Thread John X
On Sat, Mar 26, 2005 at 01:13:33PM -0800, CHRIS A MATTMANN wrote: > Hi John, > > Thanks for your reply. Actually I already have the feedparser working from > the command line. I also included a program, test2.java with my original > email that shows how I can dynamically loa

[Nutch-dev] Re: Huge Problem trying to develop plugin for Nutch

2005-03-26 Thread John X
Why try it the hard way? You may want to create a simple tool, just calling feedparser to parse your hi.rss? Have that work first, then worry about dynamic loading and nutch plugin system. Let us know when you have the simple tool. John On Fri, Mar 25, 2005 at 06:08:50PM -0800, Chris Mattmann

[Nutch-dev] Re: Mime/Magic mapper

2005-03-25 Thread John X
On Sat, Mar 26, 2005 at 01:48:05AM +0100, Jérôme Charron wrote: > Does somebody know why John Xing deactivate the mime.magic.file > support in protocol-file plugin? The "disabled" are only hooks to use mimetype/magic mapper. The mapper I used in a project had license issue (can

[Nutch-dev] Re: Licenses

2005-03-23 Thread John X
On Wed, Mar 23, 2005 at 10:36:12PM -0800, Hari Kodungallur wrote: > Hi John, > > I will do them. No problems. > But one question: for (3) is there any other way other than reading > the /usr/share/magic.mime file? I am curious whether there is a > platform independent way which

[Nutch-dev] Re: servlet Cached.java

2005-03-23 Thread John X
On Wed, Mar 23, 2005 at 09:10:43AM -0800, Doug Cutting wrote: > John X wrote: > >Attached please find servlet Cached.java that serves raw Content > >of any mime type. Current cached.jsp handles mime type text/* only. > >If no objection, it is going to be committed in a few day

[Nutch-dev] Re: servlet Cached.java

2005-03-23 Thread John X
's the format I'm working > on right now and I think its use is widespread so it might be useful to > implement these features. Could you provide a code snippet or better a patch? Thanks, John > > Stephan > > > On Wed, March 23, 2005 11:19, Andrzej Bialecki said:

[Nutch-dev] Re: servlet Cached.java

2005-03-23 Thread John X
On Wed, Mar 23, 2005 at 11:19:36AM +0100, Andrzej Bialecki wrote: > John X wrote: > >Hi, All, > > > >Attached please find servlet Cached.java that serves raw Content > >of any mime type. Current cached.jsp handles mime type text/* only. > >If no objection, it is go

[Nutch-dev] Re: Licenses

2005-03-23 Thread John X
On Wed, Mar 23, 2005 at 12:35:30AM -0800, Hari Kodungallur wrote: > On Wed, 23 Mar 2005 00:51:53 -0800, John X <[EMAIL PROTECTED]> wrote: > > > > It will be great if you can help on that. Plugin index-more also uses it. > > I know there are two opensource efforts: >

[Nutch-dev] Re: Licenses

2005-03-23 Thread John X
now. I am currently short of time, so any help will be greatly appreciated. One interesting observation: there is an activation.jar (under ./common/lib/) in jakarta-tomcat-4.1.31.tar.gz We need to find out which one this is? John > > Aside: it would have been nice if there was a Mimetype map

[Nutch-dev] servlet Cached.java

2005-03-22 Thread John X
Hi, All, Attached please find servlet Cached.java that serves raw Content of any mime type. Current cached.jsp handles mime type text/* only. If no objection, it is going to be committed in a few days. John --- Cached.java --- diff -Nur --exclude='

[Nutch-dev] Re: [jira] Updated: (NUTCH-10) extension points are defined multiple times

2005-03-16 Thread John X
Hi, Stefan, The patch does not seem to include the code of nutch-extensionpoints. Or am I missing something? Thanks, John On Wed, Mar 16, 2005 at 08:09:21PM +0100, Stefan Groschupf (JIRA) wrote: > [ http://issues.apache.org/jira/browse/NUTCH-10?page=history ] > > Stefan G

[Nutch-dev] Re: good luck for incubation time

2005-03-11 Thread John X
Thomas & Stefan, nice ;-) Is favicon updated too? John On Thu, Mar 10, 2005 at 11:02:26PM +0100, Stefan Groschupf wrote: > Dear nutch developers, > > congratulations for joining apache incubation program! > I'm personal sure that this is another big step for nutch and it i

[Nutch-dev] Re: fetcher.retry.max

2005-03-11 Thread John X
Looks like a relic from older crawler, RequestScheduler.java, that was removed from source quite a while ago. John On Fri, Mar 11, 2005 at 10:49:45PM +0100, Stefan Groschupf wrote: > k, > you are right it was sounding so curious that i was searching in the > latest subversion code,

[Nutch-dev] Re: Test Failures

2005-03-11 Thread John X
Which version you use? Recently nutch's moved from sf.net to apache, due to concerns over licences of some jars, a few plugins have been "disabled". It takes time to make all clean again. Meanwhile, you may want to ignore ant test. John On Fri, Mar 11, 2005 at 01:48:57PM -0800, H

[Nutch-dev] Re: Simple bug in jsp pages.

2005-03-11 Thread John X
> nobody noticed it for quite a while, refine-query-init.jsp seems to be > commented out by default, but the code > in search.jsp should be executed in my opinion. > Regards, > Piotr You are right. I will fix them in repository this weekend. John > > > ---

Re: [Nutch-dev] Plugins - sum up

2005-03-08 Thread John X
ys.com/patch/20040703/note.txt > Unknown (or bad-known) by myself : > ONTOLOGY It supports ontology based heuristic query. Furthermore, url filters have been converted into plugins: urlfilter-prefix and urlfilter-regex John

Re: [Nutch-dev] Plugins problems

2005-03-04 Thread John X
On Thu, Mar 03, 2005 at 09:52:40AM +0100, Christophe Noel wrote: > Hello, > > I need to know more about the parse-ext plugin ... what can it do for > example ? > There is a note about parse-ext. Please go to http://nutch.neasys.com/patch/ and check links under "

Re: [Nutch-dev] Injecting URLs from database

2005-02-16 Thread John X
ler" for NNTP, any code that tries to new a URL > with "nntp://" will get an exception (and I think the URL filtering does > this). > > Question: Does this make sense that Nutch depends on URLs, thus any > schema not supported by the JVM (JVM supports h

Re: [Nutch-dev] make URLFilter as plugin

2005-02-15 Thread John X
Hi, Chirag, Thanks for your detailed report. Do you think rules engine would be good for UrlNormalizer? Can nutch possibly benefit from rules engine in other ways? John On Mon, Feb 14, 2005 at 04:08:19PM -0500, Chirag Chaman wrote: > John: > > We did some research and ran some te

Re: [Nutch-dev] make URLFilter as plugin

2005-02-08 Thread John X
On Tue, Feb 08, 2005 at 09:41:28AM -0500, Chirag Chaman wrote: > John: > > We tested with QuickRules (YasuTech). > The only non-commercial one I've used is Jess -- though it may have license > issues. > > I know there is a big move to get open source XML rules engine ma

Re: [Nutch-dev] make URLFilter as plugin

2005-02-08 Thread John X
On Tue, Feb 08, 2005 at 09:41:28AM -0500, Chirag Chaman wrote: > John: > > We tested with QuickRules (YasuTech). > The only non-commercial one I've used is Jess -- though it may have license > issues. > > I know there is a big move to get open source XML rules engine ma

Re: [Nutch-dev] make URLFilter as plugin

2005-02-08 Thread John X
Hi, Chirag, By all means, please go ahead. I will check them too, for my own needs, then compare note with you or whoever must be interested. We can have 1. sorted out first and worry about 2. later. Thanks, John On Tue, Feb 08, 2005 at 11:39:09AM -0500, Chirag Chaman wrote: > Thankx -- t

Re: [Nutch-dev] make URLFilter as plugin

2005-02-08 Thread John X
Hi, Chirag, Since nutch urlfilter has been converted into plugin, I am going to take on the idea of rule-based filtering as you suggested before, maybe a new urlfilter plugin. Which commercial RETE engine you used? Any open source one? Thanks, John On Mon, Jan 31, 2005 at 08:03:03PM -0500

[Nutch-dev] Re: patch available now Re: make URLFilter as plugin

2005-02-04 Thread John X
If no objection, I will commit tomorrow. John On Tue, Feb 01, 2005 at 07:03:10PM -0800, John X wrote: > Attached please find my patch to make current url filters > as plugins. Now I can apply both net.nutch.net.RegexURLFilter > and net.nutch.net.PrefixURLFilter at the same time. >
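
The idea of applying a regex filter and a prefix filter at the same time can be sketched as a simple chain. This is only an illustration, not the patch itself: the `UrlFilter` interface, the `UrlFilterChainDemo` class, the example URLs, and the convention that a filter returns the URL to keep it or `null` to drop it are all assumptions made for this sketch.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical stand-in for a URL filter extension point: return the URL to keep it, null to drop it. */
interface UrlFilter {
    String filter(String url);
}

final class UrlFilterChainDemo {
    /** Keeps only URLs that start with one of the given prefixes. */
    static UrlFilter prefixFilter(List<String> prefixes) {
        return url -> prefixes.stream().anyMatch(url::startsWith) ? url : null;
    }

    /** Drops URLs that contain a match for the given regular expression. */
    static UrlFilter rejectRegexFilter(String regex) {
        Pattern pattern = Pattern.compile(regex);
        return url -> pattern.matcher(url).find() ? null : url;
    }

    /** Runs the filters in list order and stops at the first rejection. */
    static String applyAll(List<UrlFilter> filters, String url) {
        String current = url;
        for (UrlFilter f : filters) {
            if (current == null) {
                return null;
            }
            current = f.filter(current);
        }
        return current;
    }

    public static void main(String[] args) {
        List<UrlFilter> chain = Arrays.asList(
            prefixFilter(Arrays.asList("http://example.org/")),
            rejectRegexFilter("\\.(gif|jpg)$"));
        System.out.println(applyAll(chain, "http://example.org/index.html"));
        System.out.println(applyAll(chain, "http://example.org/logo.gif"));
        System.out.println(applyAll(chain, "http://other.example/index.html"));
    }
}
```

Because a URL must pass every filter, the order of the list affects only how early a rejection happens, not the final result, as long as the filters do not rewrite the URL.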

[Nutch-dev] patch available now Re: make URLFilter as plugin

2005-02-01 Thread John X
lugin might have to be written. John On Mon, Jan 31, 2005 at 04:53:08PM -0800, John X wrote: > Hi, All, > > I propose to define plugin extension point for URLFilter, and > convert current RegexURLFilter.java, PrefixURLFilter.java, etc., into > plugins. However there is one requireme

Re: [Nutch-dev] make URLFilter as plugin

2005-02-01 Thread John X
On Tue, Feb 01, 2005 at 10:38:06AM +0100, Andrzej Bialecki wrote: > John X wrote: > >Stefan, > > > >On Tue, Feb 01, 2005 at 01:55:03AM +0100, Stefan Groschupf wrote: > > > >>John, > >> > >>by the way, is the url filter multithreaded? > &

Re: [Nutch-dev] make URLFilter as plugin

2005-01-31 Thread John X
suggest, either by own invention or by calling commercial lib/engine. However, I do not quite follow your discussion about 3xx forwards. John On Mon, Jan 31, 2005 at 08:03:03PM -0500, Chirag Chaman wrote: > John: > > This is a very good idea -- and one that we currently use as a "hack

Re: [Nutch-dev] make URLFilter as plugin

2005-01-31 Thread John X
Stefan, On Tue, Feb 01, 2005 at 01:55:03AM +0100, Stefan Groschupf wrote: > John, > > by the way, is the url filter multithreaded? > Do you think it is possible to implement the url filter extension > point multithreaded? As far as I know, none of the tools that currentl

[Nutch-dev] make URLFilter as plugin

2005-01-31 Thread John X
applied. I have not checked, but I assume, by default, we can always name plugins in alphabetical order. Stefan: any better way to do this? If no one thinks this is a bad idea, I am going to start work on it right away. John

Re: [Nutch-dev] patch for segslice to filter url by pattern

2005-01-30 Thread John X
On Sun, Jan 30, 2005 at 05:57:46PM +0100, Andrzej Bialecki wrote: > John X wrote: > >Hi, All, > > > >Attached is a patch for segslice to filter entries by url pattern. > >If no objection, I will commit tomorrow. > > I couldn't object, because I was a

Re: [Nutch-dev] need a lib to know location of html element

2005-01-30 Thread John X
Hi, Stefan, On Sun, Jan 30, 2005 at 12:31:12PM +0100, Stefan Groschupf wrote: > John, > > >I need a lib/tool that can tell me physical location of a particular > >html element as the page would have been displayed by a browser. > I'm not sure if I understand you correc

[Nutch-dev] need a lib to know location of html element

2005-01-29 Thread John X
eatly appreciated. John

Re: [Nutch-dev] get unparsed content

2005-01-29 Thread John X
st means it stores FetcherOutput.java data structure, not that of Content.java. John > > Can someone give me any hint where the unparsed raw content is stored > in unparsed and in parsing mode? > > Thanks! > Stefan > > > > -

Re: [Nutch-dev] Does crawl fetch documents that can't have outbound links before the indexing phase?

2005-01-28 Thread John X
ranet crawls > until the final indexing would cut out a lot of unneeded work, I > think. Yes. Typically you crawl htmls in first rounds. Then crawl other mimetypes. The control is done via ./conf/regex-urlfilter.txt and ./conf/nutch-site.xml John ---

[Nutch-dev] patch for segslice to filter url by pattern

2005-01-28 Thread John X
Hi, All, Attached is a patch for segslice to filter entries by url pattern. If no objection, I will commit tomorrow. John --- src/java/net/nutch/segment/SegmentSlicer.java.ori 2004-12

Re: [Nutch-dev] ArrayIndexOutOfBoundsException during fetch

2005-01-08 Thread John X
l be some codes are not thread safe. I had identified it a while ago. Guess have to commit the patch using Doug's suggestion. Will try to have that fixed over the weekend. Meanwhile you may want to search list archive. The thread was probably before Xmas. John > > Before I try to

Re: [Nutch-dev] make BasicUrlNormalizer.java thread safe

2004-12-20 Thread John X
On Mon, Dec 20, 2004 at 03:40:44PM -0800, Doug Cutting wrote: > John X wrote: > >BasicUrlNormalizer.java should be made thread safe as > > > >< public String normalize(String urlString) > >--- > > > >> public synchronized String normalize(String

[Nutch-dev] make BasicUrlNormalizer.java thread safe

2004-12-20 Thread John X
BasicUrlNormalizer.java should be made thread safe as < public String normalize(String urlString) --- > public synchronized String normalize(String urlString) If no objection, I will commit it late. John
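
A minimal sketch of why the `synchronized` keyword helps, assuming the method relies on mutable state shared across calls. The message does not say which state is shared inside BasicUrlNormalizer; the `SharedStateNormalizer` class, its scratch buffer, and the toy fragment-stripping rule below are hypothetical and exist only to illustrate the one-word change.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class SharedStateNormalizer {
    // Assumed shared scratch buffer: without synchronization, two threads could interleave writes to it.
    private final StringBuilder scratch = new StringBuilder();

    /** Toy normalization: strips a trailing "#fragment". Only one thread at a time may run this method. */
    public synchronized String normalize(String urlString) {
        scratch.setLength(0);
        int hash = urlString.indexOf('#');
        scratch.append(hash >= 0 ? urlString.substring(0, hash) : urlString);
        return scratch.toString();
    }

    public static void main(String[] args) throws InterruptedException {
        SharedStateNormalizer normalizer = new SharedStateNormalizer();
        AtomicInteger mismatches = new AtomicInteger();
        Thread[] workers = new Thread[4];
        for (int t = 0; t < workers.length; t++) {
            final String input = "http://example.org/page" + t + "#top";
            final String expected = "http://example.org/page" + t;
            workers[t] = new Thread(() -> {
                for (int i = 0; i < 10000; i++) {
                    if (!expected.equals(normalizer.normalize(input))) {
                        mismatches.incrementAndGet();
                    }
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            w.join();
        }
        System.out.println("mismatches: " + mismatches.get());
    }
}
```

The trade-off is that a synchronized method serializes all callers on one instance, so fetcher threads sharing a normalizer would queue behind each other; giving each thread its own instance is a common alternative.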

Re: [Nutch-dev] improve fetcher thread handling?

2004-12-20 Thread John X
s more sophisticated in design, but had issues and was abandoned due to lack of maintenance. You may want to check it again. John

Re: [Nutch-dev] to delete or not, that's a question

2004-12-16 Thread John X
On Wed, Dec 15, 2004 at 12:32:18AM +0100, Andrzej Bialecki wrote: > John X wrote: > >Hi, Mike, > > > >The current behavior of addLocalFile() in nfs is: > >local src is REMOVED after being added to nfs. > > > >Doing so has the benefit of space saving, but

[Nutch-dev] minor patch for SegmentReader.java

2004-12-16 Thread John X
Hi, Andrzej, SegmentReader.java fails when option -nocontent, etc. are on. Attached is a patch. If looks okay, I will commit it with my other patch for ndfs tomorrow. John --- ./nutch-cvs-20041215/src/java/net/nutch/segment/SegmentReader.java 2004-12-05 01:43:48.0 -0800 +++ ./nutch-cvs

Re: [Nutch-dev] to delete or not, that's a question

2004-12-15 Thread John X
On Wed, Dec 15, 2004 at 02:33:29AM +0100, Andrzej Bialecki wrote: > John X wrote: > >On Wed, Dec 15, 2004 at 12:32:18AM +0100, Andrzej Bialecki wrote: > > > >>John X wrote: > >> > >>>Hi, Mike, > >>> > >>>The current behavior of

[Nutch-dev] to delete or not, that's a question

2004-12-14 Thread John X
other tools). If this sounds good, I will prepare a patch. John

Re: [Nutch-dev] kick-start: Nutch to top-level project (search.apache.org)

2004-12-14 Thread John X
ns. > > Yes, we need to go through the lib/ and src/plugin/*/lib directories and > check the license of each jar file. Could someone volunteer to write up > an inventory of these, with licenses? Thanks. I will create a list for this. John -

Re: [Nutch-dev] Alternative Content Types

2004-12-08 Thread John X
. Interested? John On Tue, Dec 07, 2004 at 11:39:02AM -0500, Mike Richmond wrote: > To Whom It May Concern: > > I am a Java developer looking to get involved with a project. I came across > your site and noticed that there is a lot of attention paid to PDF parsing. > I'm curious why

[Nutch-dev] URGENT FUND TRANSFER\ INVESTMENT

2004-12-07 Thread john koromah
FROM:JOHN  KOROMAH PHON:27-73-267-1376 EMAIL: [EMAIL PROTECTED] ATT: DIRECTOR/CEO, I got your contact through the South African Trade and Business Information Services by a very    reliable friend of mine who introduced your capability, personality and business address to me. I am MR john Koromah

Re: [Nutch-dev] Alternative Content Types

2004-12-07 Thread John X
fied in http://www.nutch.org/docs/en/policies.html We are always in need of good content parsers, any type! Thanks, John On Tue, Dec 07, 2004 at 06:00:52PM +0100, Stéphane Lagraulet wrote:

Re: [Nutch-dev] ontology query refinement

2004-11-29 Thread John X
It's committed. On Mon, Nov 15, 2004 at 11:53:34PM -0800, John X wrote: > Hi, All, > > I have tried this plugin. It is quite useful. thanks mike. > If no one objects, I will commit it with a few modifications > late this week. > > John > > On Tue, Nov 09, 200

Re: [Nutch-dev] Experience with a big index

2004-11-28 Thread John X
for now, I might have more ;-) Thanks a lot. John On Sun, Nov 28, 2004 at 05:33:59PM -0800, Michael Cafarella wrote: > > Hi everyone, > > A few weeks ago I completed a research project that involved building > a 50-100m page Nutch crawl. I've been working on Nutch a

Re: [Nutch-dev] Strange refetch date

2004-11-18 Thread John X
On Thu, Nov 18, 2004 at 11:07:50AM -0800, John X wrote: > On Thu, Nov 18, 2004 at 01:54:57PM +0100, Sven Wende wrote: > > Hi, > > > > I used the "readdb" command to list some information about the pages in my > > database. > > > > Th

Re: [Nutch-dev] Strange refetch date

2004-11-18 Thread John X
ET 292278994 > > Other pages have "normal" dates, like: > > Next fetch: Wed Dec 22 13:50:42 CET 2004 > > > > I wonder about that strange year indicator. Is this a bug, or a "feature" ? Which option did you use? A more detailed log/dump will be
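
One possible explanation for the year 292278994, offered here as an observation and not as something the thread confirms: it is the year Java reports when a timestamp of `Long.MAX_VALUE` milliseconds is turned into a date. The sketch below shows that conversion; the class name `RefetchDateDemo` and the helper `yearOf` are made up for the example.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.util.Date;

final class RefetchDateDemo {
    /** Returns the UTC calendar year for a timestamp given in milliseconds since the epoch. */
    static int yearOf(long epochMillis) {
        return Instant.ofEpochMilli(epochMillis).atZone(ZoneOffset.UTC).getYear();
    }

    public static void main(String[] args) {
        // Largest representable millisecond timestamp.
        System.out.println(yearOf(Long.MAX_VALUE));
        // The same value printed through java.util.Date, in the default time zone.
        System.out.println(new Date(Long.MAX_VALUE));
    }
}
```

If that is what happened here, the page's next-fetch time was most likely set to the maximum value deliberately or by an overflow, which a more detailed dump would help confirm.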

Re: [Nutch-dev] contribution to nutch development - alternative content types

2004-11-18 Thread John X
. I think that is too much for a > newbie, give people the chance to start get starting fast and small. I agree. We'd better do it sooner than later. John

Re: [Nutch-dev] ontology query refinement

2004-11-15 Thread John X
Hi, All, I have tried this plugin. It is quite useful. thanks mike. If no one objects, I will commit it with a few modifications late this week. John On Tue, Nov 09, 2004 at 08:41:05PM -0800, michael j pan wrote: > hi all, > > i have developed a plugin for ontology-supported query r

Re: [Nutch-dev] New SegmentMergeTool in CVS now

2004-11-14 Thread John X
ault mode of > operation, because it saves a lot of disk IO, and anyway parsing in a > separate stage is more bullet-proof. > My +1 vote. John > -- > Best regards, > Andrzej Bialecki

Re: [Nutch-dev] Segment API

2004-11-14 Thread John X
On Sun, Nov 14, 2004 at 10:02:13PM +0100, Andrzej Bialecki wrote: > John X wrote: > > >One thorny issue is: how to deal with various FetcherOutput states. > >Before parsing was separated from fetching, failed parsing > >was logged as NOT_FOUND. Now it will be marked as CA

Re: [Nutch-dev] Segment API

2004-11-13 Thread John X
as separated from fetching, failed parsing was logged as NOT_FOUND. Now it will be marked as CANT_PARSE. We may have to increase VERSION in FetcherOutput from 4 to 5, so that "old" ./fetcher can be easily distinguished from new ./fetcher and ./fetcher_output. I did not do that because not feel c

Re: [Nutch-dev] Re: Fetcher Hung

2004-11-12 Thread John X
hat take less than 12 hours to fetch, > using, e.g., the -numFetchers parameter when generating fetchlists. But > this is substantially more complicated if you're currently using the > crawl command. There is one more way to debug: run fetcher with -noParsing option, then run

Re: [Nutch-dev] VOTE: licenses

2004-11-11 Thread John X
On Fri, Nov 12, 2004 at 11:38:04AM +1030, Nick Lothian wrote: > > > I believe that the best way to encourage people to contribute is for the > community to continue to develop innovative and useful software. That > encourages people to continually update their versions, which acts as a > significa

Re: [Nutch-dev] VOTE: licenses

2004-11-11 Thread John X
+1 for license change and top-level search.apache.org John On Thu, Nov 11, 2004 at 12:17:08PM +0100, Doug Cutting wrote: > > My belief is that we should disband the Nutch non-profit organization > and assign the copyright for Nutch software to the Apache Foundation, > switching Nut

[Nutch-dev] domain

2004-11-10 Thread John
Hi I own webspider.com, would you be interested in using the domain to implement nutch. John

Re: [Nutch-dev] Segment tools

2004-11-05 Thread John X
ery useful. > > I attach the source code. The tool to dump segments data (currently in > net.nutch.tools.DumpSegment) could be moved here, or left as it is - > suggestions are welcome. Please do so. Thanks, John

Re: [Nutch-dev] Re: Any way to sort hits by date?

2004-11-01 Thread John X
cook a query plugin to search for it. I planned to do a query-more plugin (for content-type, content-length as well as last-modified), but never got around to do it and won't have time soon. It will be great if you can contribute along the line. John

Re: [Nutch-dev] PDF parsing speed

2004-10-31 Thread John X
On Sun, Oct 31, 2004 at 03:43:22PM -0500, Luke Baker wrote: > On 10/31/2004 12:22 PM, John X wrote: > [snip] > >What are the numbers for kb/s and bytes/page? > >I have a collection of mostly mswords, ppts and some pdfs, the numbers are > >041001 194517 10 status: 0.1771225

Re: [Nutch-dev] PDF parsing speed

2004-10-31 Thread John X
On Sat, Oct 30, 2004 at 11:06:18AM -0400, Luke Baker wrote: > Hey all, > > Does anyone else have the problem of the pdf parser taking up so many > resources that it slows down the whole parsing process? I ran the fetch > with the -noParsing option (thanks John!). I then ran th

Re: [Nutch-dev] adding extra fields to the default nutch crawl.

2004-10-28 Thread John X
riting the page would be trivial, but I'm not > sure if you can be running a fetch or something while injecting a new > url. (does that make sense?) No. Please explain. John > > also.. the clustering option is really nice, IMHO it should be on by > default. > >
