[ http://issues.apache.org/jira/browse/NUTCH-110?page=all ]
John VanDyk updated NUTCH-110:
--
Attachment: fixIllegalXmlChars08-v2.patch
Stefan's patch didn't apply cleanly for me on svn revision 413155 so I re-did
it.
This patch fixes the i
a pdf file largely
depends on the pdf parsing library it uses, currently PDFBox.
It won't be very difficult to switch to other libraries.
However it seems hard to find a free/open implementation
that can parse every pdf file in the wild. There is an alternative:
use nutch's
Hi, Mike,
On Tue, Feb 07, 2006 at 10:18:11AM -0800, Michael Cafarella wrote:
>
> John,
>
> This is a pretty awesome idea. Do you have any performance
> numbers or experience with it you can share?
No number yet. Just created it for my immediate use of browsing
and moving a
[ http://issues.apache.org/jira/browse/NUTCH-193?page=comments#action_12365051 ]
John Xing commented on NUTCH-193:
-
what's in the name hadoop? Because "had oops"?
> move NDFS and MapReduce to
On Sat, Jan 21, 2006 at 09:23:01AM -0800, John X wrote:
> Hi, Sami,
>
> On Sat, Jan 21, 2006 at 05:32:37PM +0200, Sami Siren wrote:
> > >I have created a simple tool to mount nutch filesystem on linux.
> > >http://nutch.neasys.com/ndfs/fuse-nutchfs-0.1.0.tar.gz
> >
tool to mount ndfs on linux
---
Key: NUTCH-199
URL: http://issues.apache.org/jira/browse/NUTCH-199
Project: Nutch
Type: New Feature
Components: ndfs
Environment: linux only
Reporter: John Xing
Assigned to: John Xing
tool to mount
[ http://issues.apache.org/jira/browse/NUTCH-199?page=all ]
John Xing updated NUTCH-199:
Attachment: fuse-hadoop-0.1.0.tar.gz
It works with current nutch-0.8-dev. Will be ported to hadoop after ndfs is
moved.
> tool to mount ndfs on li
Hi, Stefan,
On Thu, Jan 26, 2006 at 10:17:52PM +0100, Stefan Groschupf wrote:
> John,
> if you need any kind of support let me know. Especially I can help
> out with UI related stuff, however I also can help with all other
> issues.
Really appreciated. With all the support from t
On Thu, Jan 26, 2006 at 12:19:38PM -0800, Doug Cutting wrote:
> John X wrote:
> >Please count me in.
>
> Thanks, John.
My pleasure.
>
> I forgot to mention that I'd prefer a committer for this, and you're a
> committer, so that works well!
>
> >Is
rowse, no read/write yet.
> >
> >Doug and Mike: any plan to make ndfs codes into a separate package?
>
> John,
>
> I didn't check out your version yet, but I have also written
> a version which is read/write capable, should we combine our efforts here?
Sure, why not ;-)
Hi, Otis,
On Fri, Jan 20, 2006 at 09:31:16PM -0800, [EMAIL PROTECTED] wrote:
> Hi John,
>
> NDFS + MapReduce will soon become a separate Lucene sub-project.
In one sub-project or two separately?
Thanks,
John
>
> Otis
>
> - Original Message
> From: John X
I have created a simple tool to mount nutch filesystem on linux.
http://nutch.neasys.com/ndfs/fuse-nutchfs-0.1.0.tar.gz
Check README inside for how to set up.
It is very barebone, only browse, no read/write yet.
Doug and Mike: any plan to make ndfs codes into a separate package?
Best,
John
Hi,
Just found your project's web page from its log entries at my web site. Just
to let you know, it leaves an outdated URL in the log,
http://www.nutch.org/docs/en/bot.html , which gives an error on your current
documentation.
Best,
y used in nutch, I am also reluctant
> > to replace every one with my own MyURL. It seems I will have
> > to hack java.net.URL source directly. This is not portable
> > though. I am wondering if there are better alternatives, or
> > some tricks can be applied.
> >
source
directly. This is not portable though. I am wondering if there are
better alternatives, or some tricks can be applied.
Thanks,
John
---
SF.Net email is Sponsored by the Better Software Conference & EXPO
September 19-22, 2005 * San Franc
maintain it.
>
> Formally, by Apache rules, we need a total of three +1 votes and no -1
> votes from the Lucene PMC. Votes by non-PMC committers and developers
> are not binding but are encouraged.
>
> My votes:
>
> Jérôme Charron: +1
> Piotr
Hi, Andrzej,
Could you give us a brief on what you are going to change,
so that we can weather your storm better ;-)?
Thanks,
John
On Fri, Apr 29, 2005 at 12:13:49AM +0200, Andrzej Bialecki wrote:
> Hi,
>
> This is just a heads-up that I will be working extensively (under a
>
I spent about 30 minutes trying to figure out how to submit a bug via JIRA.
There must be a way, but it's not shown on any of the JIRA pages I clicked on.
Anyway, here's the bug report:
Component: indexer
Priority: major
After running for several hours on the intranet, the Nutch indexer crashe
[ http://issues.apache.org/jira/browse/NUTCH-33?page=history ]
John Xing closed NUTCH-33:
--
Resolution: Fixed
> MIME content type detector (using magic char sequences)
> ---
>
> K
[ http://issues.apache.org/jira/browse/NUTCH-33?page=comments#action_63022
]
John Xing commented on NUTCH-33:
Just committed. Thanks.
Nutch is licensed under the Apache License.
If freedesktop mime database uses GPL, it could be problematic
to have it
[ http://issues.apache.org/jira/browse/NUTCH-19?page=history ]
John Xing closed NUTCH-19:
--
Resolution: Fixed
> Space in Java.exe path chokes bin/nutch
> ---
>
> Key: NUTCH-19
>
[ http://issues.apache.org/jira/browse/NUTCH-22?page=history ]
John Xing closed NUTCH-22:
--
Resolution: Fixed
> ontology supported query refinement
> ---
>
> Key: NUTCH-22
> URL: http://is
[ http://issues.apache.org/jira/browse/NUTCH-30?page=comments#action_63010
]
John Xing commented on NUTCH-30:
Could we have an updated patch & zip against most recent svn?
Also I am not sure it is a good idea to have parse-rss capture
any mime
[ http://issues.apache.org/jira/browse/NUTCH-33?page=comments#action_63004
]
John Xing commented on NUTCH-33:
Hi, Jerome and Hari,
I committed your contributions last night. Thanks a lot.
However, I just noticed that TestMimeTypes.java uses
[ http://issues.apache.org/jira/browse/NUTCH-33?page=comments#action_62877
]
John Xing commented on NUTCH-33:
My +1 vote for this contribution.
If no objection, I will commit it over the weekend.
John
> MIME content type detector (using magic c
[ http://issues.apache.org/jira/browse/NUTCH-33?page=comments#action_62415
]
John Xing commented on NUTCH-33:
>What is your opinion about this point:
>1. Is it the calling code that check the mime.magic property and call >the
>getMimeTyp
[ http://issues.apache.org/jira/browse/NUTCH-33?page=comments#action_62316
]
John Xing commented on NUTCH-33:
Hi, Jerome,
I guess file extension check will be on all the time,
but magic check can be an option. Though not ideal, a system
wide property
[ http://issues.apache.org/jira/browse/NUTCH-33?page=comments#action_62195
]
John Xing commented on NUTCH-33:
Just skimmed the code. The xml approach looks good.
Two minor comments:
(1) make magic check an option with a boolean property
such as
[ http://issues.apache.org/jira/browse/NUTCH-33?page=history ]
John Xing reassigned NUTCH-33:
--
Assign To: John Xing
> MIME content type detector (using magic char sequences)
> ---
>
>
have to be included
(property plugin.includes in conf/nutch-default.xml or better
conf/nutch-site.xml)
John
On Wed, Mar 30, 2005 at 05:33:46PM -0800, Rohit Kulkarni wrote:
> Hi,
>
> Just wanted to know if nutch supports date range search (say query for
> web pages updated in last X days) an
I don't mean to write a protocol handler for ndfs (this would be nice
> to have) but I just mean something like:
>
> bin/nutch generate ndfs://namenode:8010/myNDFSFolder/mydb
> /mylocalsegment/
> ndfs path: ndfs://namenode:8010/myNDFSFolder/mydb
> local pat
ld be processed uniformly too.
There are various styles. We need to agree on one.
John
eve it's also a bad idea to use jaxen-full.jar
(use jaxen-core.jar plus a more specific jaxen dom jar)
Do you really need commons-httpclient-3.0-beta1.jar (and possibly others)?
John
>
> Thanks again!
>
>
> Cheers,
> Chris
>
>
>
> On 3/28/05 9:37 AM, "
> I will have a look and will try to find a fix.
His zip is still available at
http://baron.pagemewhen.com:8080/~chris/parse-rss.zip
John
>
> Stefan
>
>
__
http://www.neasys.com - A Good Place to Be
Come to visit us today!
-
sible causes?
One note: there is a tool called net.nutch.parse.ParserChecker, that
you can use to debug parser plugins. It is more convenient
to use it than start a crawler.
Will you be able to contribute this plugin after the dust settles?
Best,
John
On Sat, Mar 26, 2005 at 01:32:34PM -0800, CH
On Sat, Mar 26, 2005 at 01:13:33PM -0800, CHRIS A MATTMANN wrote:
> Hi John,
>
> Thanks for your reply. Actually I already have the feedparser working from
> the command line. I also included a program, test2.java with my original
> email that shows how I can dynamically loa
Why try it the hard way? You may want to
create a simple tool, just calling feedparser to parse your hi.rss?
Have that work first, then worry about dynamic loading and nutch plugin system.
Let us know when you have the simple tool.
John
On Fri, Mar 25, 2005 at 06:08:50PM -0800, Chris Mattmann
On Sat, Mar 26, 2005 at 01:48:05AM +0100, Jérôme Charron wrote:
> Does somebody know why John Xing deactivated the mime.magic.file
> support in protocol-file plugin?
The "disabled" parts are only hooks to use a mimetype/magic mapper.
The mapper I used in a project had license issue (can
On Wed, Mar 23, 2005 at 10:36:12PM -0800, Hari Kodungallur wrote:
> Hi John,
>
> I will do them. No problems.
> But one question: for (3) is there any other way other than reading
> the /usr/share/magic.mime file? I am curious whether there is a
> platform independent way which
On Wed, Mar 23, 2005 at 09:10:43AM -0800, Doug Cutting wrote:
> John X wrote:
> >Attached please find servlet Cached.java that serves raw Content
> >of any mime type. Current cached.jsp handles mime type text/* only.
> >If no objection, it is going to be committed in a few day
's the format I'm working
> on right now and I think its use is widespread so it might be useful to
> implement these features.
Could you provide a code snippet or better a patch?
Thanks,
John
>
> Stephan
>
>
> On Wed, March 23, 2005 11:19, Andrzej Bialecki said:
On Wed, Mar 23, 2005 at 11:19:36AM +0100, Andrzej Bialecki wrote:
> John X wrote:
> >Hi, All,
> >
> >Attached please find servlet Cached.java that serves raw Content
> >of any mime type. Current cached.jsp handles mime type text/* only.
> >If no objection, it is go
On Wed, Mar 23, 2005 at 12:35:30AM -0800, Hari Kodungallur wrote:
> On Wed, 23 Mar 2005 00:51:53 -0800, John X <[EMAIL PROTECTED]> wrote:
> >
> > It will be great if you can help on that. Plugin index-more also uses it.
> > I know there are two opensource efforts:
>
now.
I am currently short of time, so any help will be greatly appreciated.
One interesting observation: there is an activation.jar
(under ./common/lib/) in jakarta-tomcat-4.1.31.tar.gz
We need to find out which one this is?
John
>
> Aside: it would have been nice if there was a Mimetype map
Hi, All,
Attached please find servlet Cached.java that serves raw Content
of any mime type. Current cached.jsp handles mime type text/* only.
If no objection, it is going to be committed in a few days.
John
--- Cached.java ---
diff -Nur --exclude='
Hi, Stefan,
The patch does not seem to include the code of nutch-extensionpoints.
Or am I missing something? Thanks,
John
On Wed, Mar 16, 2005 at 08:09:21PM +0100, Stefan Groschupf (JIRA) wrote:
> [ http://issues.apache.org/jira/browse/NUTCH-10?page=history ]
>
> Stefan G
Thomas & Stefan,
nice ;-) Is favicon updated too?
John
On Thu, Mar 10, 2005 at 11:02:26PM +0100, Stefan Groschupf wrote:
> Dear nutch developers,
>
> congratulations for joining apache incubation program!
> I'm personal sure that this is another big step for nutch and it i
Looks like a relic from older crawler, RequestScheduler.java,
that was removed from source quite a while ago.
John
On Fri, Mar 11, 2005 at 10:49:45PM +0100, Stefan Groschupf wrote:
> k,
> you are right it was sounding so curious that i was searching in the
> latest subversion code,
Which version do you use?
Nutch recently moved from sf.net to Apache. Due to concerns over the
licenses of some jars, a few plugins have been "disabled".
It takes time to make everything clean again.
Meanwhile, you may want to ignore ant test.
John
On Fri, Mar 11, 2005 at 01:48:57PM -0800, H
> nobody noticed it for quite a while, refine-query-init.jsp seems to be
> commented out by default, but the code
> in search.jsp should be executed in my opinion.
> Regards,
> Piotr
You are right. I will fix them in repository this weekend.
John
>
>
> ---
ys.com/patch/20040703/note.txt
> Unknown (or bad-known) by myself :
> ONTHOLOGY
It supports ontology based heuristic query.
Furthermore, url filters have been converted into plugins:
urlfilter-prefix and urlfilter-regex
John
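With the URL filters converted into plugins, several filters can be applied to the same URL one after another. Below is a minimal sketch of that chaining, not the actual Nutch plugin code: the `SketchUrlFilter` interface, the `UrlFilterChain` class and the sample rules are hypothetical, and the convention that a filter returns null to reject a URL is an assumption made for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical stand-in for a URL filter extension point (not the Nutch API).
interface SketchUrlFilter {
    // Returns the URL if it is accepted, or null if it is rejected (assumed convention).
    String filter(String url);
}

class UrlFilterChain {
    // Accepts only URLs that start with one of the given prefixes.
    static SketchUrlFilter prefixFilter(String... prefixes) {
        return url -> {
            for (String prefix : prefixes) {
                if (url.startsWith(prefix)) {
                    return url;
                }
            }
            return null;
        };
    }

    // Rejects any URL in which the regular expression matches somewhere.
    static SketchUrlFilter regexRejectFilter(String regex) {
        Pattern pattern = Pattern.compile(regex);
        return url -> pattern.matcher(url).find() ? null : url;
    }

    // Runs the filters in order; the first rejection (null) ends the chain.
    static String apply(List<SketchUrlFilter> filters, String url) {
        String current = url;
        for (SketchUrlFilter filter : filters) {
            if (current == null) {
                return null;
            }
            current = filter.filter(current);
        }
        return current;
    }

    // Sample chain with made-up rules: one prefix filter, then one regex filter.
    static List<SketchUrlFilter> demoChain() {
        return Arrays.asList(
            prefixFilter("http://example.com/"),
            regexRejectFilter("\\.(gif|jpg|png)$"));
    }

    public static void main(String[] args) {
        System.out.println(apply(demoChain(), "http://example.com/a.html"));
        System.out.println(apply(demoChain(), "http://example.com/logo.gif"));
    }
}
```

In this sketch the order of the list decides the order in which the filters run, so a cheap prefix check can go before a more expensive regex check.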
On Thu, Mar 03, 2005 at 09:52:40AM +0100, Christophe Noel wrote:
> Hello,
>
> I need to know more about the parse-ext plugin ... what can it do for
> example ?
>
There is a note about parse-ext. Please go to http://nutch.neasys.com/patch/
and check links under "
ler" for NNTP, any code that tries to new a URL
> with "nntp://" will get an exception (and I think the URL filtering does
> this).
>
> Question: Does it make sense that Nutch depends on URLs, thus any
> schema not supported by the JVM (JVM supports h
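The behavior described in the quoted question can be checked with a small sketch. It assumes a stock JVM with no protocol handler registered for nntp; the class and method names are made up for illustration. The `java.net.URL` constructor throws `MalformedURLException` for a protocol it has no handler for, while `java.net.URI` only parses the string and does not need a handler.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;

// Hypothetical helper class, not part of Nutch.
class SchemeCheckSketch {
    // True if java.net.URL can be constructed, i.e. a protocol handler is available.
    static boolean urlConstructorAccepts(String spec) {
        try {
            new URL(spec);
            return true;
        } catch (MalformedURLException e) {
            return false;
        }
    }

    // java.net.URI parses the syntax only, so the scheme is available for any protocol.
    static String schemeOf(String spec) {
        return URI.create(spec).getScheme();
    }

    public static void main(String[] args) {
        String spec = "nntp://news.example.com/comp.lang.java";
        System.out.println("URL accepted: " + urlConstructorAccepts(spec));
        System.out.println("URI scheme:   " + schemeOf(spec));
    }
}
```

Code that only needs to inspect or filter a URL string could work on the `URI` form, and construct a `URL` only once a handler for the protocol is known to exist.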
Hi, Chirag,
Thanks for your detailed report.
Do you think rules engine would be good for UrlNormalizer?
Can nutch possibly benefit from rules engine in other ways?
John
On Mon, Feb 14, 2005 at 04:08:19PM -0500, Chirag Chaman wrote:
> John:
>
> We did some research and ran some te
On Tue, Feb 08, 2005 at 09:41:28AM -0500, Chirag Chaman wrote:
> John:
>
> We tested with QuickRules (YasuTech).
> The only non-commercial one I've used is Jess -- though it may have license
> issues.
>
> I know there is a big move to get open source XML rules engine ma
Hi, Chirag,
By all means, please go ahead.
I will check them too, for my own needs, then compare note with you
or whoever may be interested.
We can have 1. sorted out first and worry about 2. later.
Thanks,
John
On Tue, Feb 08, 2005 at 11:39:09AM -0500, Chirag Chaman wrote:
> Thankx -- t
Hi, Chirag,
Since nutch urlfilter has been converted into plugin, I am going to
take on the idea of rule-based filtering as you suggested before, maybe
a new urlfilter plugin.
Which commercial RETE engine did you use?
Any open source one?
Thanks,
John
On Mon, Jan 31, 2005 at 08:03:03PM -0500
If no objection, I will commit tomorrow.
John
On Tue, Feb 01, 2005 at 07:03:10PM -0800, John X wrote:
> Attached please find my patch to make current url filters
> as plugins. Now I can apply both net.nutch.net.RegexURLFilter
> and net.nutch.net.PrefixURLFilter at the same time.
>
lugin might have to be written.
John
On Mon, Jan 31, 2005 at 04:53:08PM -0800, John X wrote:
> Hi, All,
>
> I propose to define plugin extension point for URLFilter, and
> convert current RegexURLFilter.java, PrefixURLFilter.java, etc., into
> plugins. However there is one requireme
On Tue, Feb 01, 2005 at 10:38:06AM +0100, Andrzej Bialecki wrote:
> John X wrote:
> >Stefan,
> >
> >On Tue, Feb 01, 2005 at 01:55:03AM +0100, Stefan Groschupf wrote:
> >
> >>John,
> >>
> >>by the way, is the url filter multithreaded?
> >
suggest, either by own invention or by calling commercial lib/engine.
However, I do not quite follow your discussion about 3xx forwards.
John
On Mon, Jan 31, 2005 at 08:03:03PM -0500, Chirag Chaman wrote:
> John:
>
> This is a very good idea -- and one that we currently use as a "hack
Stefan,
On Tue, Feb 01, 2005 at 01:55:03AM +0100, Stefan Groschupf wrote:
> John,
>
> by the way, is the url filter multithreaded?
> Do you think it is possible to implement the url filter extension
> point multithreaded?
As far as I know, none of the tools that currentl
applied. I have not checked, but
I assume, by default, we can always name plugins in alphabetical order.
Stefan: any better way to do this?
If no one thinks this is a bad idea, I am going to start work on it
right away.
John
On Sun, Jan 30, 2005 at 05:57:46PM +0100, Andrzej Bialecki wrote:
> John X wrote:
> >Hi, All,
> >
> >Attached is a patch for segslice to filter entries by url pattern.
> >If no objection, I will commit tomorrow.
>
> I couldn't object, because I was a
Hi, Stefan,
On Sun, Jan 30, 2005 at 12:31:12PM +0100, Stefan Groschupf wrote:
> John,
>
> >I need a lib/tool that can tell me physical location of a particular
> >html element as the page would have been displayed by a browser.
> I'm not sure if I understand you correc
eatly appreciated.
John
st means it stores FetcherOutput.java
data structure, not that of Content.java.
John
>
> Can someone give me any hint where the unparsed raw content is stored
> in unparsed and in parsing mode?
>
> Thanks!
> Stefan
>
>
>
> -
ranet crawls
> until the final indexing would cut out a lot of unneeded work, I
> think.
Yes. Typically you crawl HTML pages in the first rounds, then crawl other mimetypes.
The control is done via ./conf/regex-urlfilter.txt and ./conf/nutch-site.xml
John
---
Hi, All,
Attached is a patch for segslice to filter entries by url pattern.
If no objection, I will commit tomorrow.
John
--- src/java/net/nutch/segment/SegmentSlicer.java.ori 2004-12
l be some codes are not thread safe.
I had identified it a while ago. I guess I have to commit the patch
using Doug's suggestion. Will try to have that fixed over the weekend.
Meanwhile you may want to search list archive. The thread was probably
before Xmas.
John
>
> Before I try to
On Mon, Dec 20, 2004 at 03:40:44PM -0800, Doug Cutting wrote:
> John X wrote:
> >BasicUrlNormalizer.java should be made thread safe as
> >
> >< public String normalize(String urlString)
> >---
> >
> >> public synchronized String normalize(String
BasicUrlNormalizer.java should be made thread safe as
< public String normalize(String urlString)
---
> public synchronized String normalize(String urlString)
If no objection, I will commit it later.
John
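The change above only adds `synchronized` to the method signature. A minimal sketch of why that matters, assuming the normalizer keeps mutable state that is shared between calls: the class below is hypothetical and is not the BasicUrlNormalizer source, and its normalization rules (lowercase scheme and host, default path "/") are made up for illustration. It expects an absolute URL with a host.

```java
import java.net.URI;
import java.util.Locale;

// Hypothetical normalizer, not the Nutch BasicUrlNormalizer.
class SketchUrlNormalizer {
    // Shared scratch buffer: two unsynchronized callers could interleave writes to it.
    private final StringBuilder buffer = new StringBuilder();

    // synchronized lets only one thread at a time use the shared buffer of this instance.
    public synchronized String normalize(String urlString) {
        URI uri = URI.create(urlString.trim());
        buffer.setLength(0);
        buffer.append(uri.getScheme().toLowerCase(Locale.ROOT)).append("://");
        buffer.append(uri.getHost().toLowerCase(Locale.ROOT));
        if (uri.getPort() != -1) {
            buffer.append(':').append(uri.getPort());
        }
        String path = uri.getRawPath();
        buffer.append(path == null || path.isEmpty() ? "/" : path);
        if (uri.getRawQuery() != null) {
            buffer.append('?').append(uri.getRawQuery());
        }
        return buffer.toString();
    }

    public static void main(String[] args) {
        SketchUrlNormalizer normalizer = new SketchUrlNormalizer();
        System.out.println(normalizer.normalize("HTTP://Example.COM"));
    }
}
```

Synchronizing the whole method is the simplest fix but serializes all callers of one instance; giving each thread its own buffer would avoid the lock at the cost of a slightly larger change.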
s more sophisticated in design,
but had issues and was abandoned due to lack of maintenance.
You may want to check it again.
John
On Wed, Dec 15, 2004 at 12:32:18AM +0100, Andrzej Bialecki wrote:
> John X wrote:
> >Hi, Mike,
> >
> >The current behavior of addLocalFile() in nfs is:
> >local src is REMOVED after being added to nfs.
> >
> >Doing so has the benefit of space saving, but
Hi, Andrzej,
SegmentReader.java fails when options such as -nocontent are on.
Attached is a patch. If it looks okay, I will commit it with my other
patch for ndfs tomorrow.
John
--- ./nutch-cvs-20041215/src/java/net/nutch/segment/SegmentReader.java
2004-12-05 01:43:48.0 -0800
+++ ./nutch-cvs
On Wed, Dec 15, 2004 at 02:33:29AM +0100, Andrzej Bialecki wrote:
> John X wrote:
> >On Wed, Dec 15, 2004 at 12:32:18AM +0100, Andrzej Bialecki wrote:
> >
> >>John X wrote:
> >>
> >>>Hi, Mike,
> >>>
> >>>The current behavior of
other tools). If this sounds good, I will prepare a patch.
John
ns.
>
> Yes, we need to go through the lib/ and src/plugin/*/lib directories and
> check the license of each jar file. Could someone volunteer to write up
> an inventory of these, with licenses? Thanks.
I will create a list for this.
John
-
. Interested?
John
On Tue, Dec 07, 2004 at 11:39:02AM -0500, Mike Richmond wrote:
> To Whom It May Concern:
>
> I am a Java developer looking to get involved with a project. I came across
> your site and noticed that there is a lot of attention paid to PDF parsing.
> I'm curious why
FROM:JOHN KOROMAH PHON:27-73-267-1376 EMAIL: [EMAIL PROTECTED] ATT: DIRECTOR/CEO, I got your contact through the South African Trade and Business Information Services by a very reliable friend of mine who introduced your capability, personality and business address to me. I am MR john Koromah
fied in
http://www.nutch.org/docs/en/policies.html
We are always in need of good content parsers, any type!
Thanks,
John
On Tue, Dec 07, 2004 at 06:00:52PM +0100, Stéphane Lagraulet wrote:
It's committed.
On Mon, Nov 15, 2004 at 11:53:34PM -0800, John X wrote:
> Hi, All,
>
> I have tried this plugin. It is quite useful. thanks mike.
> If no one objects, I will commit it with a few modifications
> late this week.
>
> John
>
> On Tue, Nov 09, 200
for now, I might have more ;-)
Thanks a lot.
John
On Sun, Nov 28, 2004 at 05:33:59PM -0800, Michael Cafarella wrote:
>
> Hi everyone,
>
> A few weeks ago I completed a research project that involved building
> a 50-100m page Nutch crawl. I've been working on Nutch a
On Thu, Nov 18, 2004 at 11:07:50AM -0800, John X wrote:
> On Thu, Nov 18, 2004 at 01:54:57PM +0100, Sven Wende wrote:
> > Hi,
> >
> > I used the "readdb" command to list some information about the pages in my
> > database.
> >
> > Th
ET 292278994
>
> Other pages have "normal" dates, like:
>
> Next fetch: Wed Dec 22 13:50:42 CET 2004
>
>
>
> I wonder about that strange year indicator. Is this a bug, or a "feature" ?
Which option did you use? A more detailed log/dump will be
. I think that is too much for a
> newbie; give people the chance to get started fast and small.
I agree. We'd better do it sooner than later.
John
Hi, All,
I have tried this plugin. It is quite useful. thanks mike.
If no one objects, I will commit it with a few modifications
late this week.
John
On Tue, Nov 09, 2004 at 08:41:05PM -0800, michael j pan wrote:
> hi all,
>
> i have developed a plugin for ontology-supported query r
ault mode of
> operation, because it saves a lot of disk IO, and anyway parsing in a
> separate stage is more bullet-proof.
>
My +1 vote.
John
> --
> Best regards,
> Andrzej Bialecki
On Sun, Nov 14, 2004 at 10:02:13PM +0100, Andrzej Bialecki wrote:
> John X wrote:
>
> >One thorny issue is: how to deal with various FetcherOutput states.
> >Before parsing was separated from fetching, failed parsing
> >was logged as NOT_FOUND. Now it will be marked as CA
as separated from fetching, failed parsing
was logged as NOT_FOUND. Now it will be marked as CANT_PARSE.
We may have to increase VERSION in FetcherOutput from 4 to 5,
so that "old" ./fetcher can be easily distinguished from new ./fetcher
and ./fetcher_output. I did not do that because not feel c
hat take less than 12 hours to fetch,
> using, e.g., the -numFetchers parameter when generating fetchlists. But
> this is substantially more complicated if you're currently using the
> crawl command.
There is one more way to debug: run fetcher with -noParsing option,
then run
On Fri, Nov 12, 2004 at 11:38:04AM +1030, Nick Lothian wrote:
> >
> I believe that the best way to encourage people to contribute is for the
> community to continue to develop innovative and useful software. That
> encourages people to continually update their versions, which acts as a
> significa
+1 for license change and top-level search.apache.org
John
On Thu, Nov 11, 2004 at 12:17:08PM +0100, Doug Cutting wrote:
>
> My belief is that we should disband the Nutch non-profit organization
> and assign the copyright for Nutch software to the Apache Foundation,
> switching Nut
Hi
I own webspider.com. Would you be interested in using the domain to
implement nutch?
John
ery useful.
>
> I attach the source code. The tool to dump segments data (currently in
> net.nutch.tools.DumpSegment) could be moved here, or left as it is -
> suggestions are welcome.
Please do so.
Thanks,
John
cook a query
plugin to search for it. I planned to do a query-more plugin (for content-type,
content-length as well as last-modified), but never got around to
doing it and won't have time soon. It would be great if you
could contribute along these lines.
John
On Sun, Oct 31, 2004 at 03:43:22PM -0500, Luke Baker wrote:
> On 10/31/2004 12:22 PM, John X wrote:
> [snip]
> >What are the numbers for kb/s and bytes/page?
> >I have a collection of mostly mswords, ppts and some pdfs, the numbers are
> >041001 194517 10 status: 0.1771225
On Sat, Oct 30, 2004 at 11:06:18AM -0400, Luke Baker wrote:
> Hey all,
>
> Does anyone else have the problem of the pdf parser taking up so many
> resources that it slows down the whole parsing process? I ran the fetch
> with the -noParsing option (thanks John!). I then ran th
riting the page would be trivial, but I'm not
> sure if you can be running a fetch or something while injecting a new
> url. (does that make sense?)
No. Please explain.
John
>
> also.. the clustering option is really nice, IMHO it should be on by
> default.
>
>