Hi Folks,
I've recently encountered the following error using the crawl tool:
050426 214400 fetching
http://search.csmonitor.com/specials/neocon/index.html
050426 214401 fetching http://perspolis.usc.edu/Users/shahram/
050426 214401 fetching
http://www.cnn.com/rssclick/2005/TECH/science/
Very good, I will try to do it!
2005/4/27, Andy Liu <[EMAIL PROTECTED]>:
>
> You can cut and paste this code into any indexing plugin, or create a new
> one:
>
> // add links
> Outlink[] outlinks = parse.getData().getOutlinks();
> int end = Math.min(outlinks.length,
> UpdateDatabaseTool.MAX_OUTL
I'll work on this.
On 25/04/05, Doug Cutting <[EMAIL PROTECTED]> wrote:
> Someone could re-write this plugin to use the Swing RTF parser that
> comes with the JVM:
--
Cheers,
Hasan Diwan <[EMAIL PROTECTED]>
---
SF.Net email is sponsored by:
sorry, typo in last email: "all searches" = "allow searches"
On 4/26/05, Andy Liu <[EMAIL PROTECTED]> wrote:
> You can cut and paste this code into any indexing plugin, or create a new one:
>
> // add links
> Outlink[] outlinks = parse.getData().getOutlinks();
> int end = Math.min(outl
You can cut and paste this code into any indexing plugin, or create a new one:
// add links
Outlink[] outlinks = parse.getData().getOutlinks();
int end = Math.min(outlinks.length,
UpdateDatabaseTool.MAX_OUTLINKS_PER_PAGE);
for (int i = 0; i < end; i++) {
Outlink link = outli
Hi,
Could any Nutch expert here help me find a way to use the
GetLinks class?
It seems that my messages are being ignored...
Isn't it a great feature to be able to search for "link:www.xxx.com"?
Or even to be able to show where an image comes from when searching
for image files
Hope s
Jakob Heidebrecht wrote:
I get this error when I try to build Nutch with ant.
What version of Java are you using? What version of Nutch are you
compiling? On what platform?
Doug
Hi,
I get this error when I try to build Nutch with ant.
Does anybody know what it is?
Regards
Jakob
compile-core:
[javac] Compiling 2 source files to /data/nutch/trunk/build/classes
[javac] Found 1 semantic error compiling
"/data/nutch/trunk/src/java/org/apache/nutch/ipc/Client.java"
Hi! This is the ezmlm program. I'm managing the
nutch-dev@incubator.apache.org mailing list.
I'm working for my owner, who can be reached
at [EMAIL PROTECTED]
Messages to you from the nutch-dev mailing list seem to
have been bouncing. I've attached a copy of the first bounce
message I received.
[ http://issues.apache.org/jira/browse/NUTCH-51?page=comments#action_63670
]
Doug Cutting commented on NUTCH-51:
---
You need index-basic.
> Removing a plugin after fetch but before indexing causes errors
> --
Parser plugin for MS Excel files
Key: NUTCH-52
URL: http://issues.apache.org/jira/browse/NUTCH-52
Project: Nutch
Type: Improvement
Components: fetcher
Reporter: Rohit Kulkarni
Priority: Trivial
Attachments: parse-msexcel.
Hello,
Page object does not contain html page content. To access fetched page
content you have to iterate over segment data and extract it from there.
Please have a look at SegmentReader class - it gives you a simple API to
access all segment data.
Regards
Piotr
Hasan Diwan wrote:
On 23/04/05, r
Hi,
I'm trying to get some debug messages to be printed on the screen
while Nutch is crawling, but I can't get it to work. Even a plain
System.out.println() doesn't work!
The funny part is that I'm unable to create files using simple syntax
like this inside the Fetcher.java class.
bos = new
Hello,
I am attaching a minor patch for datanode command line handling that
allows one to pass the name of the data directory as a command line
parameter. If it is not passed, the data directory configured in the
Nutch config file is used.
It is very useful for running multiple instances of datanode on the same
host -
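For readers following along, the fallback described in this message can be sketched roughly as below. This is only an illustration and not the attached patch: the class name, the resolveDataDir helper, and the default path are all hypothetical, and the real datanode reads its default from the Nutch configuration.

```java
import java.io.File;

// Illustrative sketch only; none of these names come from Nutch.
class DataDirArgSketch {

    // Use the first command line argument as the data directory when it is
    // present and non-empty; otherwise fall back to the configured value.
    static File resolveDataDir(String[] args, String configuredDir) {
        if (args.length > 0 && !args[0].isEmpty()) {
            return new File(args[0]);
        }
        return new File(configuredDir);
    }

    public static void main(String[] args) {
        // "/tmp/nutch/data" is an assumed stand-in for the configured default.
        File dir = resolveDataDir(args, "/tmp/nutch/data");
        System.out.println("Using data directory: " + dir.getPath());
    }
}
```

With this shape, each instance started on the same host could be given its own directory as an argument, while an instance started without arguments keeps the configured one.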
Hasan Diwan wrote:
On 22/04/05, Doug Cutting <[EMAIL PROTECTED]> wrote:
http://cvs.sourceforge.net/viewcvs.py/nutch/nutch/src/plugin/parse-rtf/lib/
Is there a licensing issue in importing this into subversion?
Yes. Look inside that jar. It is LGPL.
http://wiki.apache.org/jakarta/Using_LGPL'd_cod
Hi all.
Has anyone written a version of the FetchListTool that only adds a URL
to the fetch list if it complies with a particular Regex URL filter? If
so, would they be prepared to share? I need to do something like this,
but I dislike re-inventing wheels.
Essentially, I'm doing an intranet-ty
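As a rough illustration of the idea in this question, the filtering step could look something like the sketch below. It uses only java.util.regex and does not touch FetchListTool or any Nutch API; the class name, the filterUrls helper, the example.org pattern, and the sample URLs are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Illustrative sketch only; this is not Nutch code.
class RegexFetchListSketch {

    // Keep only the candidate URLs that match the allowed pattern in full.
    static List<String> filterUrls(List<String> candidates, Pattern allowed) {
        List<String> kept = new ArrayList<>();
        for (String url : candidates) {
            if (allowed.matcher(url).matches()) {
                kept.add(url);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Assumed intranet pattern: any host under example.org, http or https.
        Pattern intranet =
            Pattern.compile("^https?://([a-z0-9-]+\\.)*example\\.org(/.*)?$");
        List<String> candidates = Arrays.asList(
            "http://intranet.example.org/docs/index.html",
            "http://www.cnn.com/rssclick/2005/TECH/science/");
        // Only URLs accepted by the pattern would go on to the fetch list.
        System.out.println(filterUrls(candidates, intranet));
    }
}
```

Whether the check belongs in the fetch list generation or earlier, when links are added to the database, is a separate design question that this sketch does not answer.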
[ http://issues.apache.org/jira/browse/NUTCH-51?page=comments#action_63657
]
byron miller commented on NUTCH-51:
---
I have index-more set up.
protocol-http|urlfilter-regex|parse-(text|html|pdf)|index-more|query-(basic|site|url)|clustering-carrot2|ontolo
[ http://issues.apache.org/jira/browse/NUTCH-53?page=all ]
Rohit Kulkarni updated NUTCH-53:
Attachment: parse-zip.zip
The plugin is tested with the latest nutch SVN and seems to work
fine.
Currently handles and calls parsers for the following types of f