[Nutch-general] Separating nutch and hadoop configurations.

2007-07-11 Thread Briggs
I am currently trying to figure out how to deploy Nutch and Hadoop separately. I want to configure Hadoop outside of Nutch and have Nutch use that service, rather than configuring hadoop within nutch. I would think all that Nutch should need to know is the urls to connect to Hadoop, but can't

Re: [Nutch-general] Separating nutch and hadoop configurations.

2007-07-11 Thread Briggs
at all. Though, I needed to replace hadoop-12.whatever.jar with the latest within the nutch build. It seems to be working. Yay. Thanks. On 7/11/07, Andrzej Bialecki [EMAIL PROTECTED] wrote: Briggs wrote: I am currently trying to figure out how to deploy Nutch and Hadoop separately. I

Re: [Nutch-general] NUTCH-479 Support for OR queries - what is this about

2007-07-09 Thread Briggs
Thanks for the answer. That was helpful. I was sooo wrong. On 7/7/07, Andrzej Bialecki [EMAIL PROTECTED] wrote: Briggs wrote: Please keep this thread going as I am also curious to know why this has been 'forked'. I am sure that most of this lies within the original OPIC filter but I

Re: [Nutch-general] NUTCH-479 Support for OR queries - what is this about

2007-07-07 Thread Briggs
Please keep this thread going as I am also curious to know why this has been 'forked'. I am sure that most of this lies within the original OPIC filter but I still can't understand why straightforward lucene queries have not been used within the application. On 7/6/07, Kai_testing Middleton

Re: [Nutch-general] Reload index

2007-06-20 Thread Briggs
. On 6/20/07, Naess, Ronny [EMAIL PROTECTED] wrote: Thanks, Briggs. I will try to create a new NutchBean to see if that solves the reloading issue. By the way, your former mail does not seem to have reached the mailing list. I can't seem to find it anyway. -Ronny -Original message- From

Re: [Nutch-general] Reload index

2007-06-19 Thread Briggs
By the way, I was wrong about one thing, you can't override the 'get' method of nutch bean because it's static. Doh, that was a silly oversight. But again, if you are using nutch and you need to 'reload' the index, you need only to create a new NutchBean (that is if the NutchBean is what you are

Re: [Nutch-general] Reload index

2007-06-18 Thread Briggs
I would say that the best thing to do is to create a new nutch bean. I never cared much for the nutch bean containing logic to store itself in a servlet context. I do not believe that this is the place for such logic. It should be up to the user to place the nutch bean into the servlet context

Re: [Nutch-general] fetch failing while crawling

2007-06-15 Thread Briggs
Yeah, you still don't have the agent configured. All your values for the agent (the <value></value> needs a value) are blank. So, you need to at least configure an agent name. On 6/15/07, karan thakral [EMAIL PROTECTED] wrote: i m using crawl on the cygwin while working on windows but the
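A quick way to spot blank agent settings is to scan the config for empty values. The following is only a sketch: the XML string is a made-up stand-in for a nutch-site.xml, and the helper name blank_agent_properties is hypothetical, not part of Nutch.

```python
import xml.etree.ElementTree as ET

# Made-up sample in the shape of a nutch-site.xml; a real file would be read from disk.
SAMPLE_SITE_XML = """<configuration>
  <property>
    <name>http.agent.name</name>
    <value></value>
  </property>
  <property>
    <name>http.agent.description</name>
    <value></value>
  </property>
</configuration>
"""


def blank_agent_properties(xml_text):
    """Return the names of http.agent.* properties whose <value> is empty."""
    root = ET.fromstring(xml_text)
    blank = []
    for prop in root.findall("property"):
        name = (prop.findtext("name") or "").strip()
        value = (prop.findtext("value") or "").strip()
        if name.startswith("http.agent.") and not value:
            blank.append(name)
    return blank


print(blank_agent_properties(SAMPLE_SITE_XML))
```

Any name it reports still needs a value between its value tags; http.agent.name is the one the reply above says is required at minimum.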

Re: [Nutch-general] fetch failing while crawling

2007-06-15 Thread Briggs
Oh and as for the web interface, take a look at the wiki page: http://wiki.apache.org/nutch/NutchTutorial The bottom of the page has a section on searching. On 6/15/07, Briggs [EMAIL PROTECTED] wrote: Yeah, you still don't have the agent configured. All your values for the agent (the value

Re: [Nutch-general] Explanation of topN

2007-06-08 Thread Briggs
Well, the quick/simple explanation is: If you have 5 urls with their associated nutch scores: http://a.com/something1 = 5.0 http://b.com/something2 = 4.0 http://c.com/something3 = 3.0 http://d.com/something4 = 2.0 http://e.com/something5 = 1.0 Then you set nutch to crawl with topN = 3 then a,b,c
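The selection described in this message can be sketched in a few lines. This assumes topN simply keeps the N highest-scoring URLs for the next round; the function name select_top_n is hypothetical and the scores are the illustrative ones from the message, not real crawldb output.

```python
# Illustrative scores copied from the message; real scores live in the crawldb.
scores = {
    "http://a.com/something1": 5.0,
    "http://b.com/something2": 4.0,
    "http://c.com/something3": 3.0,
    "http://d.com/something4": 2.0,
    "http://e.com/something5": 1.0,
}


def select_top_n(scored_urls, top_n):
    """Return the top_n URLs with the highest scores, best first."""
    ranked = sorted(scored_urls.items(), key=lambda item: item[1], reverse=True)
    return [url for url, _score in ranked[:top_n]]


selected = select_top_n(scores, 3)
print(selected)
```

With topN = 3 the a, b and c URLs are kept and d and e are left for a later round.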

Re: [Nutch-general] indexing only special documents

2007-06-07 Thread Briggs
Ronny, your way is probably better. See, I was only dealing with the fetched properties. But, in your case, you don't fetch it, which gets rid of all that wasted bandwidth. For dealing with types that can be dealt with via the file extension, this would probably work better. On 6/7/07,
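The idea of skipping documents by file extension before they are fetched can be illustrated with a small filter. This is a sketch only: the extension list and the function name should_fetch are assumptions made for the example, and in Nutch itself this kind of rule would normally live in a URL filter plugin's configuration rather than in Python.

```python
from urllib.parse import urlparse

# Hypothetical list of extensions we do not want to fetch at all.
SKIPPED_EXTENSIONS = {".pdf", ".zip", ".jpg"}


def should_fetch(url, skipped=SKIPPED_EXTENSIONS):
    """Return False when the URL path ends in one of the skipped extensions."""
    path = urlparse(url).path.lower()
    return not any(path.endswith(ext) for ext in skipped)


for candidate in (
    "http://example.com/index.html",
    "http://example.com/report.PDF",
):
    print(candidate, should_fetch(candidate))
```

Because the decision is made from the URL alone, rejected documents cost no bandwidth, which is the advantage the message points out over filtering on fetched properties.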

Re: [Nutch-general] Loading mechnism of plugin classes and singleton objects

2007-06-06 Thread Briggs
this unnecessary object creation. On 6/6/07, Briggs [EMAIL PROTECTED] wrote: FYI, I ran into the same problem. I wanted my filters to be instantiated only once, and they not only get instantiated repeatedly, but the classloading is flawed in that it keeps reloading the classes. So, if you ever dump

Re: [Nutch-general] Loading mechnism of plugin classes and singleton objects

2007-06-06 Thread Briggs
FYI, I ran into the same problem. I wanted my filters to be instantiated only once, and they not only get instantiated repeatedly, but the classloading is flawed in that it keeps reloading the classes. So, if you ever dump the stats from your app (use 'jmap -histo') you can see all the classes

Re: [Nutch-general] urls/nutch in local is invalid

2007-06-06 Thread Briggs
is urls/nutch a file or directory? On 6/6/07, Martin Kammerlander [EMAIL PROTECTED] wrote: Hi I wanted to start a crawl like it is done in the nutch 0.8.x tutorial. Unfortunately I get the following error: [EMAIL PROTECTED] nutch-0.8.1]$ bin/nutch crawl urls/nutch -dir crawl.test -depth

Re: [Nutch-general] urls/nutch in local is invalid

2007-06-06 Thread Briggs
for nutch or 'only' mailing list? thx martin Zitat von Briggs [EMAIL PROTECTED]: is urls/nutch a file or directory? On 6/6/07, Martin Kammerlander [EMAIL PROTECTED] wrote: Hi I wanted to start a crawl like it is done in the nutch 0.8.x tutorial. Unfortunately I get the following error

Re: [Nutch-general] indexing only special documents

2007-06-06 Thread Briggs
You set that up in your nutch-site.xml file. Open the nutch-default.xml file (located in NUTCH_INSTALL_DIR/conf). Look for this element: <property> <name>plugin.includes</name>
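As far as I understand it, the value of plugin.includes is a regular expression matched against plugin ids, so narrowing what gets parsed and indexed comes down to editing that expression. The sketch below shows the matching idea only; the pattern is an example written for illustration, not the default from any particular release, and the helper name is hypothetical.

```python
import re

# Example value in the style of plugin.includes (an assumption, not a real default).
PLUGIN_INCLUDES = r"protocol-http|urlfilter-regex|parse-(text|html)|index-basic"


def is_included(plugin_id, pattern=PLUGIN_INCLUDES):
    """Return True when the whole plugin id matches the include pattern."""
    return re.fullmatch(pattern, plugin_id) is not None


for plugin_id in ("parse-html", "parse-pdf", "index-basic"):
    print(plugin_id, is_included(plugin_id))
```

Here parse-pdf is rejected because only the text and html parsers are named in the example pattern; adding pdf to that group would enable it.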

[Nutch-general] Content Type Not Resolved Correctly?

2007-06-01 Thread Briggs
So, I have been having huge problems with parsing. It seems that many urls are being ignored because the parser plugins throw an exception saying there is no parser found for, what is reportedly, an unresolved contentType. So, if you look at the exception:

Re: [Nutch-general] Content Type Not Resolved Correctly?

2007-06-01 Thread Briggs
(Unix) DAV/1.0.3 ApacheJServ/1.1.2 Cache-Control: no-store X-Highwire-SessionId: xlz2cgcww1.JS1 Set-Cookie: JServSessionIdroot=xlz2cgcww1.JS1; path=/ Transfer-Encoding: chunked Content-Type: text/html So, I'm lost. On 6/1/07, Doğacan Güney [EMAIL PROTECTED] wrote: Hi, On 6/1/07, Briggs [EMAIL

Re: [Nutch-general] Content Type Not Resolved Correctly?

2007-06-01 Thread Briggs
PROTECTED] wrote: Hi, On 6/1/07, Briggs [EMAIL PROTECTED] wrote: So, I have been having huge problems with parsing. It seems that many urls are being ignored because the parser plugins throw an exception saying there is no parser found for, what is reportedly, an unresolved contentType. So

Re: [Nutch-general] Content Type Not Resolved Correctly?

2007-06-01 Thread Briggs
-SessionId: nh2ukcdpv1.JS1 Set-Cookie: JServSessionIdroot=nh2ukcdpv1.JS1; path=/ Transfer-Encoding: chunked Content-Type: text/html So, that's it. any ideas? On 6/1/07, Briggs [EMAIL PROTECTED] wrote: So, here is one: http://hea.sagepub.com/cgi/alerts Segment Reader reports: Content

Re: [Nutch-general] Nutch on Windows. ssh: command not found

2007-05-30 Thread Briggs
So, when in cygwin, if you type 'ssh' (without the quotes), do you get the same error? If so, then you need to go back into the cygwin setup and install ssh. On 5/30/07, Ilya Vishnevsky [EMAIL PROTECTED] wrote: Hello. I try to run shell scripts starting Nutch. I use Windows XP, so I installed

[Nutch-general] Speed up indexing....

2007-05-30 Thread Briggs
Anyone have any good configuration ideas for indexing/merging with 0.9 using hadoop on a local fs? Our segment merging is taking an extremely long time compared with nutch 0.7. Currently, I am trying to merge 300 segments, which amounts to about 1gig of data. It has taken hours to merge, and

Re: [Nutch-general] Problem crawling in Nutch 0.9

2007-05-14 Thread Briggs
Just curious, did you happen to limit the number of urls using the topN switch? On 5/14/07, Annona Keene [EMAIL PROTECTED] wrote: I recently upgraded to 0.9, and I've started encountering a problem. I began with a single url and crawled with a depth of 10, assuming I would get every page on

Re: [Nutch-general] Nutch Indexer

2007-05-01 Thread Briggs
I would assume that it need these for handling the indexing of the link scores. Lucene puts no scoring weight on things such as urls, page rank and such. Since lucene only indexes documents, and calculates its keyword/query relevancy based only on term vectors (or whatever) nutch needs to add the

Re: [Nutch-general] Nutch Indexer

2007-05-01 Thread Briggs
Man, I should proofread this stuff before I send them. That is all I have to say. On 5/1/07, Briggs [EMAIL PROTECTED] wrote: I would assume that it need these for handling the indexing of the link scores. Lucene puts no scoring weight on things such as urls, page rank and such. Since lucene

[Nutch-general] Nutch and running crawls within a container.

2007-04-30 Thread Briggs
somewhere else in the code. Since Nutch handles so much of its threading, could this be causing the problem? I am not sure if I should x-post this to the dev group or not. Anyway, thanks. Briggs -- Conscious decisions by conscious minds are what make reality real

Re: [Nutch-general] Nutch and running crawls within a container.

2007-04-30 Thread Briggs
, but then another crept up. On 4/30/07, Sami Siren [EMAIL PROTECTED] wrote: Briggs wrote: Version: Nutch 0.9 (but this applies to just about all versions) I'm really in a bind. Is anyone crawling from within a web application, or is everyone running Nutch using the shell scripts provided

Re: [Nutch-general] Nutch and running crawls within a container.

2007-04-30 Thread Briggs
I'll look around the code to make sure I am creating only one instance of Configuration in my classes, and will play around with the maxpermgen settings. Any other input from people that have attempted this sort of setup would be appreciated. On 4/30/07, Briggs [EMAIL PROTECTED] wrote: Well

Re: [Nutch-general] Removing pages from index immediately

2007-04-27 Thread Briggs
(or webdb I believe in your version) to include a flag of 'good/bad' or something. On 4/27/07, Briggs [EMAIL PROTECTED] wrote: Isn't this what you are looking for? org.apache.nutch.tools.PruneIndexTool. On 4/27/07, franklinb4u [EMAIL PROTECTED] wrote: hi Enis, This is franklin

Re: [Nutch-general] Removing pages from index immediately

2007-04-27 Thread Briggs
Well, it looks like the link I sent you goes to the 0.9 version of the nutch api. There is a link error on the nutch project site because the 0.7.2 doc link points to the 0.9 docs. On 4/27/07, Briggs [EMAIL PROTECTED] wrote: Here is the link to the docs: http://lucene.apache.org/nutch

Re: [Nutch-general] Using nutch just for the crawler/fetcher

2007-04-25 Thread Briggs
If you are just looking to have a seed list of domains, and would like to mirror their content for indexing, why not just use the unix tool 'wget'? It will mirror the site on your system and then you can just index that. On 4/25/07, John Kleven [EMAIL PROTECTED] wrote: Hello, I am hoping

Re: [Nutch-general] Index

2007-04-24 Thread Briggs
On the nutch wiki there is this tutorial: http://wiki.apache.org/nutch/NutchHadoopTutorial There is also (it is for version 0.8, but can still work with 0.9): http://lucene.apache.org/nutch/tutorial8.html On 4/24/07, ekoje ekoje [EMAIL PROTECTED] wrote: Hi Guys, I would like to create a

Re: [Nutch-general] Index

2007-04-24 Thread Briggs
Perhaps someone else can chime in on this. I am not sure of exactly what you are asking. The indexing is based on Lucene. So, if you need to understand how the indexing works you will need to look into the Lucene documentation. If you are only looking to add custom fields and such to the

Re: [Nutch-general] How to delete already stored indexed fields???

2007-04-20 Thread Briggs
If you look into the BasicIndexingFilter.java plugin source you will see that this is where those default fields get indexed. So, you can either create a new plugin that is configurable for the properties you want to index, or remove this plugin. Here is the snippet of code that is in the

Re: [Nutch-general] How to dump all the valid links which has been crawled?

2007-04-20 Thread Briggs
, Meryl Silverburgh [EMAIL PROTECTED] wrote: Can you please tell me what this command means? What are the top 35 links? How does nutch rank the top 35 links? bin/nutch readdb crawl/crawldb -topN 35 test On 4/19/07, Briggs [EMAIL PROTECTED] wrote: Those links are links that were discovered

Re: [Nutch-general] Classpath and plugins question

2007-04-19 Thread Briggs
Look into org.apache.nutch.plugin. The custom plugin classloader and the resource loader reside in there. On 4/18/07, Antony Bowesman [EMAIL PROTECTED] wrote: I'm looking to use the Nutch parsing framework in a separate Lucene project. I'd like to be able to use the existing plugins

Re: [Nutch-general] Classpath and plugins question

2007-04-19 Thread Briggs
I'll add that the PluginRepository is the class that recurses through your plugins directory, loads each plugin's descriptor file, and then loads all dependencies for each plugin within their own classloader. On 4/19/07, Briggs [EMAIL PROTECTED] wrote: Look into org.apache.nutch.plugin
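The directory walk described here can be pictured with a short sketch. It covers only the discovery step (finding each plugin's plugin.xml descriptor) and does not attempt the per-plugin classloading; the function name and the temporary directory layout are assumptions made for the example, not Nutch code.

```python
import os
import tempfile


def find_plugin_descriptors(plugins_dir):
    """Recurse through plugins_dir and return the paths of all plugin.xml files."""
    descriptors = []
    for dirpath, _dirnames, filenames in os.walk(plugins_dir):
        if "plugin.xml" in filenames:
            descriptors.append(os.path.join(dirpath, "plugin.xml"))
    return sorted(descriptors)


# Build a throwaway plugins directory with two fake plugins to walk.
with tempfile.TemporaryDirectory() as root:
    for plugin_id in ("parse-html", "index-basic"):
        os.makedirs(os.path.join(root, plugin_id))
        with open(os.path.join(root, plugin_id, "plugin.xml"), "w") as handle:
            handle.write("<plugin id='%s'/>" % plugin_id)
    found = [
        os.path.basename(os.path.dirname(path))
        for path in find_plugin_descriptors(root)
    ]

print(found)
```

For someone reusing the parsing framework outside Nutch, the practical point is that the plugins directory has to be reachable in this layout, one folder per plugin with its descriptor inside.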

[Nutch-general] Nutch and Crawl Frequency

2007-04-19 Thread Briggs
Nutch 0.9 Anyone know if it is possible to be more granular regarding crawl frequency? Meaning, that I would like some sites to be crawled more often than others. Like, a news site should be crawled every day, but your average business website should be crawled every 30 days. So, is it possible

Re: [Nutch-general] Nutch and Crawl Frequency

2007-04-19 Thread Briggs
Cool, cool. Thanks! On 4/19/07, Gal Nitzan [EMAIL PROTECTED] wrote: As it is right now... You answered the question yourself :-) ... Separate db's and the whole ceremony... -Original Message- From: Briggs [mailto:[EMAIL PROTECTED] Sent: Thursday, April 19, 2007 10:02 PM

Re: [Nutch-general] Forcing update of some URLs

2007-04-19 Thread Briggs
From what I have gathered, you may want to keep multiple crawldbs for your crawls. So, you could have a crawldb for more frequent crawls and fire off nutch and read that db with the appropriate configs for that job. I was hoping for the same mechanism, but it looks like we need to write
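One way to picture the multiple-crawldb approach is a small scheduler that decides which crawldb is due for a run. Everything in this sketch is hypothetical: the crawldb paths, the intervals and the function name are made up for illustration, and actually launching Nutch against the chosen db is left out.

```python
from datetime import datetime, timedelta

# Hypothetical crawldbs and how often each one should be recrawled.
CRAWL_DBS = {
    "crawl-news/crawldb": timedelta(days=1),
    "crawl-business/crawldb": timedelta(days=30),
}


def due_crawldbs(last_run, now, intervals=CRAWL_DBS):
    """Return the crawldbs whose interval has elapsed since their last run."""
    due = []
    for crawldb, interval in intervals.items():
        previous = last_run.get(crawldb)
        if previous is None or now - previous >= interval:
            due.append(crawldb)
    return sorted(due)


now = datetime(2007, 4, 19)
last_run = {
    "crawl-news/crawldb": datetime(2007, 4, 17),
    "crawl-business/crawldb": datetime(2007, 4, 10),
}
print(due_crawldbs(last_run, now))
```

In this example only the news crawldb is due, since the business one was crawled nine days ago against a 30-day interval.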

Re: [Nutch-general] How to dump all the valid links which has been crawled?

2007-04-19 Thread Briggs
Those links are links that were discovered. It does not mean that they were fetched; they weren't. On 4/12/07, Meryl Silverburgh [EMAIL PROTECTED] wrote: I think I found out the answer to my previous question by doing this: bin/nutch readlinkdb crawl/linkdb/ -dump test But my next question

[Nutch-general] Source of Outlink and how to get Outlinks in 0.9

2007-04-18 Thread Briggs
(my application) up to date with the latest and greatest! Thanks for your time! And once I really get through this code I promise to start posting answers. Briggs. -- Conscious decisions by conscious minds are what make reality real

Re: [Nutch-general] Source of Outlink and how to get Outlinks in 0.9

2007-04-18 Thread Briggs
? I'll just have to figure out the format here so I can parse it. I'll probably write a wrapper that exports to xml or something to make transformation of this easier. Anyway, am I on the right track? Briggs. On 4/18/07, Briggs [EMAIL PROTECTED] wrote: Is it possible to determine from which

[Nutch-general] Logger duplicates entries by the thousands

2007-03-23 Thread Briggs
Currently using 0.7.2. We have a process that runs crawltool from within an application, perhaps hundreds of times during the course of the day. The problem I am seeing is that over time the log statements from my application (I am using commons logging and Log4j) are also being logged within

Re: [Nutch-general] Logger duplicates entries by the thousands

2007-03-23 Thread Briggs
where that is happening... It's either Nutch or ActiveMQ stuff. Anyway, Have fun and Cheers! On 3/23/07, Briggs [EMAIL PROTECTED] wrote: Currently using 0.7.2. We have a process that runs crawltool from within an application, perhaps hundreds of times during the course of the day. The problem I

[Nutch-general] List Domains and adding Boost Values for Custom Fields

2007-01-31 Thread Briggs
that if there are high boost values for those fields, they will be pushed to the top. Thanks again! Briggs -- Conscious decisions by conscious minds are what make reality real

[Nutch-general] Plugin ClassLoader issues...

2007-01-31 Thread Briggs
So, I am having ClassLoader issues with plugins. It seems that the PluginRepository does some weird class loading (PluginClassLoader) when it starts up. Does this mean that my plugin will not inherit the classpath of my web application that it is loaded within? A simple example is that my webapp

Re: [Nutch-general] Plugin ClassLoader issues...

2007-01-31 Thread Briggs
Well, I found this: http://wiki.apache.org/nutch/WhatsTheProblemWithPluginsAndClass-loading Arrrgh. Well, looks like I am going to use JMX to have my plugin talk to my application. That way I won't have several copies of my business jars around. On 1/31/07, Briggs [EMAIL PROTECTED

[Nutch-general] Merging large sets of segments, help.

2007-01-24 Thread Briggs
that was crawled and I want to merge them all into several large segments. So, if anyone has any pointers I would appreciate it. Has anyone else attempted to keep segments at this granularity? This doesn't seem to work so well. briggs / Conscious decisions by conscious minds are what make

Re: [Nutch-general] Merging large sets of segments, help.

2007-01-24 Thread Briggs
and index those. Sorry for my ignorance, but not really sure how to scale nutch correctly. Do you know of a document, or some pointers as to how segment/index data should be stored? briggs / Conscious decisions by conscious minds are what make reality real

Re: [Nutch-general] Merging large sets of segments, help.

2007-01-24 Thread Briggs
Cool, thanks for your responses! Next time I should probably mention that we are using 0.7.2. Not quite sure if we can even think about moving to something 'more current' as I don't really know the reasons to. briggs / Most of this information is already available on the Nutch Wiki. All I