PruneRegexTool

2006-12-14 Thread Bryan Woliner
Nutchers, I know that I have seen many posts to this list regarding the usage of nutch's prune tool (org.apache.nutch.tools.PruneIndexTool) and that many of those posts noted the difficulty of having to pass Lucene queries as parameters (for those of us who don't already have a firm understandin

Can PruneIndexTool still be used in Nutch 0.8.1?

2006-12-12 Thread Bryan Woliner
Hi, When using 0.7.x I often used the PruneIndexTool, but I noticed that calling bin/nutch Prune no longer works and Prune is not included in the 0.8 command-line options section of the nutch wiki. Furthermore, when I call the command "locate PruneIndexTool", all the returned files start with: " nu

Does nutch 0.8.x have a command like bin/nutch fetchlist -dumpurls

2006-11-12 Thread Bryan Woliner
Hi, When I was using nutch 0.7, I found the bin/nutch fetchlist -dumpurls command to be very useful. However, I have not been able to find an equivalent command in nutch 0.8.x. Essentially all I want to do is dump all urls stored in a certain segment (or group of segments) into a text file. In

Two Errors in Nutch 0.8 Tutorial?

2006-07-25 Thread Bryan Woliner
I am certainly far from a nutch expert, but it appears to me that there are two errors in the current Nutch 0.8 tutorial. First off, here is the version of Nutch 0.8 that I am using, in case there have been changes made in newer versions that invalidate my comments: -bash-2.05b$ svn info Path: . U

Dissecting the Nutch Search Page (Please Help!)

2006-07-23 Thread Bryan Woliner
I am trying to modify the standard nutch search page (for nutch 0.8-dev) and have several questions: 1. Do most people modify the search.html file directly, or is it better to modify the files that are used to automatically generate the search.html page. If the latter is the case, are there any fi

Re: Problems switching over from nutch 0.7.1 to nutch 0.8 (dev) -- zero search results & problem with invertlinks

2006-06-20 Thread Bryan Woliner
search GUI.) I changed conf/log4j.properties like outlined below, to enable full debug logging. (Only changed lines are shown). #log4j.rootLogger=INFO,DRFA log4j.rootLogger=DEBUG, stdout #log4j.logger.org.apache.nutch=INFO #log4j.logger.org.apache.hadoop=WARN I hope this helps. -kuro > From: Bryan
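Laid out as they would sit in conf/log4j.properties, the changed lines quoted above are (original values left commented out, exactly as the reply shows them):

```properties
#log4j.rootLogger=INFO,DRFA
log4j.rootLogger=DEBUG, stdout
#log4j.logger.org.apache.nutch=INFO
#log4j.logger.org.apache.hadoop=WARN
```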

Error when calling bin/nutch inject -- java.io.IOException: config()

2006-06-19 Thread Bryan Woliner
On June 13th, I downloaded the trunk version of nutch-0.8-dev and then built it using ant. I then created a valid urls file and put it in the urlsdir subdirectory of my nutch directory. I also made sure that my conf/regex-urlfilter.txt file was valid. At that point, I tried to do my first whole-

Problems switching over from nutch 0.7.1 to nutch 0.8 (dev) -- zero search results & problem with invertlinks

2006-06-15 Thread Bryan Woliner
Hi All, I have been using Nutch 0.7.1 for some time (although I am certainly not an expert) and am now in the process of switching over to Nutch 0.8. However, I have run into a couple of problems along the way and am hoping that those of you who have been using nutch 0.8 for a while will take a q

What are valid names and location(s) for segments

2006-03-09 Thread Bryan Woliner
I am using nutch 0.7.1 and have a couple questions about valid segment names and locations: I can get nutch to work fine when I store my segments, with their original nutch assigned names in the folder: "/usr/local/nutch-0.7.1/live/segments/" and then start tomcat from the "/usr/local/nutch-0.7.1/
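The directory the search webapp looks in is governed by the searcher.dir entry in nutch-site.xml (the same entry mentioned elsewhere in this archive); a sketch, using the 0.7.1 layout named in the post as a purely illustrative value:

```xml
<!-- nutch-site.xml sketch: searcher.dir points the webapp at the directory
     holding the index and segments; the path here is just the layout from
     the post, not a required location -->
<property>
  <name>searcher.dir</name>
  <value>/usr/local/nutch-0.7.1/live</value>
</property>
```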

Re: Do not index seed page?

2006-01-24 Thread Bryan Woliner
I have a similar issue and have begun working on a tool that would prune an index using a file of regexes. When I get it working I will be happy to make it publicly available. -Bryan On 1/23/06, Stefan Groschupf <[EMAIL PROTECTED]> wrote: > > Blocking a page in a url filter will also not fetch a

Common Lucene Queries for PruneIndexTool -- GROUPS of files or folders

2006-01-16 Thread Bryan Woliner
OK, I have spent a fair amount of time trying to figure out how to create the correct Lucene queries to use with the PruneIndexTool. I have read the wiki page for bin/nutch Prune, looked at the Lucene Query Parser Syntax page and browsed past mailing list discussions on the subject. Accordingly,
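For reference, a minimal queries.txt sketch; the quoted-URL form is the one the 2005-08-21 pruning thread further down reports working, and each line is read by PruneIndexTool as an ordinary Lucene query (the url: field query is an assumption, so verify it against the bin/nutch Prune wiki page):

```
"http://www.alternet.org/911oneyearlater/"
url:alternet.org
```

The file would then be passed to the tool via its queries-file option as documented on that wiki page.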

Re: How can no URLs be fetched until the 11th round of fetching?

2006-01-15 Thread Bryan Woliner
olved!) > Sample3: Smth changed on a site, "redirect" added/changed, etc. > Sample4: web-master modified robots.txt > Sample5: big first HTML file, network errors during first 10 fetch attempts, > etc. > > It should be very uncommon behaviour, but it may happen... >

How can no URLs be fetched until the 11th round of fetching?

2006-01-15 Thread Bryan Woliner
I am using Nutch 0.7.1 (no mapreduce) and did a whole-web crawl with 14 rounds of fetching and a urls file with one URL in it. No urls were fetched during the first 10 rounds, but then in the 11th round one URL was fetched and increasingly more URLs were fetched in rounds 12-14. I am basing the num

multiple nutch-site.xml files possible?

2006-01-10 Thread Bryan Woliner
Is it possible to have multiple nutch-site.xml files and somehow tell nutch which one to use at runtime? As a corollary, where in the nutch code is nutch directed to this file and the nutch-default.xml file? I know that the nutch environment variable, $NUTCH_CONF_DIR, can be changed to specify a
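The post's own suggestion can be sketched in shell. The directory layout below is hypothetical; the point is only that $NUTCH_CONF_DIR selects which nutch-site.xml the launcher scripts read (the bin/nutch invocation is left commented out):

```shell
# Hypothetical layout: one conf directory per nutch-site.xml variant
export NUTCH_CONF_DIR="$HOME/nutch-confs/siteA"
echo "conf dir: $NUTCH_CONF_DIR"
# bin/nutch crawl urls -dir crawl-siteA   # would now load siteA's nutch-site.xml
```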

port :8080 no longer brings up Nutch search page!

2006-01-04 Thread Bryan Woliner
When I originally installed nutch and tomcat on my machine, I needed to change the ownership and permission of certain files in subdirectories of the "../jakarta-tomcat-4.1.31/" folder in order to be able to use tomcat and nutch together. I've had no problems with tomcat for some time, however I am

Re: port :8080 no longer brings up Nutch search page!

2006-01-04 Thread Bryan Woliner
Nevermind, I was able to fix it by renaming the tomcat/webapps/ROOT/ directory and then restarting tomcat, which recreated the root directory from the ROOT.war file. I must have messed up some of the permissions in the ROOT folder. On 1/4/06, Bryan Woliner <[EMAIL PROTECTED]> wrote: >

Best setup for multiple nutch users on one server

2005-12-24 Thread Bryan Woliner
For a while I have been the only one testing nutch on my server, but now I have a couple of colleagues that are going to start working on my nutch project. For those of you who also have several people working with nutch on the same server, I have a couple of questions: 1. Do you have one installa

Re: which files/directories are needed after a segment or index merge

2005-12-22 Thread Bryan Woliner
Thanks Stefan! I guess I should have looked at the searcher.dir entry in nutch-site.xml to start with. For the record, I was able to search the index of the merged-segment successfully after I created a /nutch-0.7.1/Live/segments/ folder, put my segments in that directory and started tomcat from th

Re: which files/directories are needed after a segment or index merge

2005-12-21 Thread Bryan Woliner
Thanks for the info. I copied two merged segments to my live directory, then merged them together and indexed the twice-merged segment. However, when I started tomcat from the Live directory and opened mysite:8080, I got an org.apache.jasper.JasperException with a root cause of a java.lang

Re: which files/directories are needed after a segment or index merge

2005-12-21 Thread Bryan Woliner
om merging crawl db's. > Than you only need the merged segment, the linkdb and the index from > the merged segment. > The 10 segments used to build the merged segment can be removed. > > Hope this helps, you should only may change the scripts to have a 10 > round loop to c

which files/directories are needed after a segment or index merge

2005-12-21 Thread Bryan Woliner
I am using nutch 0.7.1 (non-mapred) and am a little confused about how to move the contents of several "test" crawls into a single "live" directory. Any suggestions are very much appreciated! I want to have a "Live" directory that contains all the indexes that are ready to be searched. The first

Re: Filesystem structure for the web front-end.

2005-12-18 Thread Bryan Woliner
I have a similar question, but I am using nutch 0.7.1 (non-mapred). Any suggestions are very much appreciated! I want to have a "Live" directory that contains all the indexes that are ready to be searched. The first index I want to add to the "Live" directory comes from a crawl with 10 rounds of

Is there a simple command to count the number of docs in an index?

2005-12-12 Thread Bryan Woliner
Is there an easy way to count the number of documents/URLs in a particular index? Basically I am looking for the index equivalent of the bin/nutch segread -list command, which outputs the number of URLs in a segment. Thanks, Bryan

Re: Luke and Indexes

2005-12-08 Thread Bryan Woliner
t? Thanks again for the help, Bryan On 12/8/05, Andrzej Bialecki <[EMAIL PROTECTED]> wrote: > Bryan Woliner wrote: > > >I have a couple very basic questions about Luke and indexes in > >general. Answers to any of these questions are much appreciated: > > > >1. I

Luke and Indexes

2005-12-07 Thread Bryan Woliner
I have a couple very basic questions about Luke and indexes in general. Answers to any of these questions are much appreciated: 1. In the Luke overview tab, what does "Index version" refer to? 2. Also in the overview tab, if "Has Deletions?" is equal to yes, where are the possible sources of dele

Number of URLs in segment fetchlist vs. Number of URLs in index

2005-12-05 Thread Bryan Woliner
How is the number of URLs in a group of segments' fetchlists related to the number of urls in an index? Specifically, when I call the following command using the "segments2" directory, I find out that there are 166 entries in 15 segments: $ bin/nutch segread -list -dir segments However, when I

Re: RegexURLFilter / testing regex-urlfilter.txt

2005-11-30 Thread Bryan Woliner
Sorry if the answer to this question should be obvious, but where in the bin/nutch script do you need to add the following line to be able to test your regex-urlfilter.txt file from the command line? CLASSPATH=${CLASSPATH}:$NUTCH_HOME/plugins/urlfilter-regex/urlfilter-regex.jar On 11/29/05, Tho

Has anyone gotten the date query to function properly?

2005-11-21 Thread Bryan Woliner
If people have gotten the date query to work properly, it would be great to know the steps they used to get it working. I added the following property entry to my nutch-site.xml file and used the search phrase: "url:http date:19000101-20051231" (which returned zero results). plugin.includes

Re: Crawling a page for links, but not indexing it

2005-11-18 Thread Bryan Woliner
Dean, I think you could also use the PruneIndexTool to "prune" those parent pages from the index. If you search the mail list archive you can find some discussion of how the PruneIndexTool works. -Bryan On 11/18/05, Dean Elwood <[EMAIL PROTECTED]> wrote: > Hi Jerome, > > Thanks. So essentially I

Re: Crawling multiple sites

2005-11-16 Thread Bryan Woliner
To increase the depth of a whole web crawl you need to fetch additional rounds, then update the database with the newly fetched URLs (eventually you will also need to index these URLs along with the "homepage" URLs fetched in the first round). The following part of the tutorial details how the 2nd an
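The generate/fetch/update cycle described above can be sketched as a loop. The command names are the 0.7-era ones from the tutorial; they are echoed rather than executed here, since the db and segment paths vary per install:

```shell
for round in 1 2 3; do
  echo "round $round: bin/nutch generate db segments"           # write a new fetchlist
  echo "round $round: bin/nutch fetch segments/<newest>"        # fetch that fetchlist
  echo "round $round: bin/nutch updatedb db segments/<newest>"  # merge new links into the db
done
```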

A couple of questions about the "date:" query

2005-11-13 Thread Bryan Woliner
OK, I believe that I correctly included the "more indexing" and "more query" plugins, which should allow searches using the "date:" query field. However, I am currently unable to search by date ranges. I tried to use the search string that doug cutting suggested in an e-mail to the list on 9/12/2005.

Re: Using FetchListEntry -dumpurls

2005-11-13 Thread Bryan Woliner
ckage > structure and scripts where updated so you are probably using old script > with new release. > Regards > Piotr > > > Bryan Woliner wrote: > > Hi, > > > > I am trying to dump all of the URLS from a segment to a text file. I was > > able to do this s

Using FetchListEntry -dumpurls

2005-11-07 Thread Bryan Woliner
Hi, I am trying to dump all of the URLS from a segment to a text file. I was able to do this successfully under Nutch 0.6 but am not able to do so under 0.7.1. Please take a look at the line below and let me know if you can figure out why I'm getting an error. Perhaps it is due to a change from versio

Re: Collections.

2005-10-25 Thread Bryan Woliner
The regular expressions that you use in your regex-urlfilter.txt file allow you to specify that Nutch should only crawl certain parts of a domain. For example, you could limit your search to URLs that start with news.domain or www.domain.com/news If you search the mail
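The idea can be illustrated with Python's re module (Nutch's RegexURLFilter uses Java/Perl5 regexes, but the ordered, first-matching-rule semantics sketched here are the same; the domain names and rules are placeholders, not the stock configuration):

```python
import re

# Hypothetical rules in the spirit of regex-urlfilter.txt: tried in order,
# '+' accepts, '-' rejects, and the first matching rule wins.
rules = [
    ("+", re.compile(r"^https?://news\.example\.com/")),
    ("+", re.compile(r"^https?://www\.example\.com/news/")),
    ("-", re.compile(r".")),  # everything else is rejected
]

def accept(url):
    for sign, rx in rules:
        if rx.search(url):
            return sign == "+"
    return False  # URLs matched by no rule are dropped

print(accept("http://news.example.com/today.html"))    # True
print(accept("http://www.example.com/sports/x.html"))  # False
```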

Search returns zero results when it should, but blank page when 1 or more results should be returned

2005-10-25 Thread Bryan Woliner
I realized that the subject of my last posting (" org.apache.jasper.JasperException- Root Cause - java.lang.NullPointerException") was pretty cryptic and didn't really explain my primary question, which is: Has anyone had the problem where after a whole web search, you go to your Nutch sea

org.apache.jasper.JasperException ----- Root Cause ----- java.lang.NullPointerException

2005-10-24 Thread Bryan Woliner
Ok, So I was working with nutch a while back and then got sidetracked for about a month and am coming back to it now. I am using Nutch 0.6 and I have a bash script I wrote that calls the basic nutch commands necessary for a basic "whole-web crawl." As far as I can remember, the script worked fine

Re: Where are indexes stored and where to store indexes

2005-08-24 Thread Bryan Woliner
ikely that this is the case). Thanks, Bryan On 8/24/05, Bryan Woliner <[EMAIL PROTECTED]> wrote: > I know that this is a really basic question, but once you index segment(s), > where is the index stored? > > On a related note, I read in numerous emails to the list that you

Where are indexes stored and where to store indexes

2005-08-24 Thread Bryan Woliner
I know that this is a really basic question, but once you index segment(s), where is the index stored? On a related note, I read in numerous emails to the list that you can search more than one index at the same time if they are in the same location when you start tomcat. Where is the correct l

Re: Adding small batches of fetched URLs to a larger aggregate segment/index

2005-08-23 Thread Bryan Woliner
, Bryan Woliner <[EMAIL PROTECTED]> wrote: > > Hi, > > I have a number of sites that I want to crawl, then merge their segments > and create a single index. One of the main reasons I want to do this is that > I want some of the sites in my index to be crawls on a daily basi

Adding small batches of fetched URLs to a larger aggregate segment/index

2005-08-23 Thread Bryan Woliner
Hi, I have a number of sites that I want to crawl, then merge their segments and create a single index. One of the main reasons I want to do this is that I want some of the sites in my index to be crawled on a daily basis, others on a weekly basis, etc. Each time I re-crawl a site, I want to add

Re: Constructing queries for pruning single URLs

2005-08-21 Thread Bryan Woliner
Thanks, I got it to work using that. -Bryan On 8/20/05, Andrzej Bialecki <[EMAIL PROTECTED]> wrote: > > Bryan Woliner wrote: > > > I have created a queries.txt file with one line: > > > > "http://www.alternet.org/911oneyearlater/"; > > > &

Constructing queries for pruning single URLs

2005-08-20 Thread Bryan Woliner
Hi, I have a number of URLs that I want to include in my fetch but not in my final index. For example, a number of these pages have links to URLs that I want to include in my final index, but the original page does not itself have content that I want to include. It seems that the best way to do

Re: Fetching pages with query strings

2005-08-17 Thread Bryan Woliner
use a query string does not, > by > > itself, mean that the site isn't recursive. > > > > I'm not sure if Nutch itself has an internal limitation on the > > representation of URL's, if so, then that would be the only reason I can > > think of to exclude query st

Fetching pages with query strings

2005-08-16 Thread Bryan Woliner
By default the regex-urlfilter.txt file excludes URLs that contain query strings (i.e. include "?"). Could somebody explain the reason for excluding these sites? Is there something risky about including them in a crawl? Is there anyone who is not excluding these files, and if so, how has it worke
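The stock rule in question is `-[?*!@=]`: any URL containing one of those characters (probable queries, session ids, etc.) is rejected. A quick illustration of its effect using Python's re module:

```python
import re

# The stock filter rule rejects URLs containing ? * ! @ or =
# (character class copied from the default url filter file).
skip = re.compile(r"[?*!@=]")

def filtered_out(url):
    return bool(skip.search(url))

print(filtered_out("http://example.com/page?id=3"))  # True: query string
print(filtered_out("http://example.com/news.html"))  # False: plain page
```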

Re: using the FetchListEntry -dumplist command

2005-08-10 Thread Bryan Woliner
-dumpurls and not -dumplist. And > when FetchListEntry is run without any recognized command line options > it simply prints nothing. (I was doing it in nutch svn version not 0.6 > but I think it was not changed). > Regards, > Piotr > > > On 8/10/05, Bryan Woliner <[EMA

Re: using the FetchListEntry -dumplist command

2005-08-10 Thread Bryan Woliner
That's a good catch, but removing the space did not alter the output in my foo.txt file. When you enter this same command (with the same arguments), from the nutch directory, do you get a list of URLs in your output file? Thanks, Bryan On 8/10/05, Piotr Kosiorowski <[EMAIL PROTECTED]> wrote:

using the FetchListEntry -dumplist command

2005-08-09 Thread Bryan Woliner
I am trying to retrieve all of the URLs that were fetched and stored in the most recently created segment. It is my understanding that the following command would output a file with one URL per line to a text file named foo.txt. -bash-2.05b$ echo $s1 segments/20050809221613 -bash-2.05b$ bin/nu

Re: Nutch related tomcat error: HTTP Status 500 - No Context configured to process this request

2005-08-04 Thread Bryan Woliner
older in your tomcat/webapps folder. > > Stefan > > Am 04.08.2005 um 20:11 schrieb Bryan Woliner: > > > I have an urls file and regex-urlfilter.txt file that work fine on my > > personal machine that have running nutch. I am trying to setup a linux > > server running

Nutch related tomcat error: HTTP Status 500 - No Context configured to process this request

2005-08-04 Thread Bryan Woliner
I have an urls file and regex-urlfilter.txt file that work fine on my personal machine that has nutch running. I am trying to set up a linux server running redhat 9 to run nutch and am having the following problem. I do a simple "whole-web" crawl fine and everything seems to work. I start tomca

Two Questions: Refetching and searching the archive of this list

2005-08-03 Thread Bryan Woliner
Two questions: 1. Is there a way to search all archived messages from this mailing list? 2. Is there a way to configure the fetcher to refetch only those pages that either: (i) didn't exist during the last fetch; or (ii) have been modified since the last fetch? I know people have asked questions sim

Re: How to view the URLs stored in a segment

2005-07-19 Thread Bryan Woliner
fechlist prints: > Usage: FetchListEntry (-local | -ndfs ) [ -recno N | > -dumpurls ] segmentDir > So to dump list of urls in fetchlist you have to use -dumpurls option. > Regards > Piotr > > Bryan Woliner wrote: > > Could someone give me a basic example of how to view the con

Nutch Search Page

2005-07-18 Thread Bryan Woliner
Where is the Nutch search webpage stored and/or generated from? What do you have to do to modify it?

How to view the URLs stored in a segment

2005-07-18 Thread Bryan Woliner
Could someone give me a basic example of how to view the contents of the URLs stored in a segment. I searched some of the old e-mails from this list and it seems like FetchTool is the correct way to do this? My segments directory is called "segments" and is assigned the value suggested in the Nutch

Nutch Search Page

2005-07-16 Thread Bryan Woliner
Where is the Nutch search page (i.e. the page that comes up when you go to http://localhost:8080/) stored and/or generated from? What do you have to do to modify it? Thanks, Bryan

Some basic questions about URL filters

2005-07-10 Thread Bryan Woliner
Hi, I am a Nutch newbie and have a handful of questions about the URL filtering process. I am using nutch 0.6. I am trying to do a whole web crawl. Eventually I want to inject a couple hundred URLs from one urlfile and limit the search results to a list of URLs, which are similar but not ex

Basic Whole-Web Crawl Question: Problem running fetch for the first time

2005-07-06 Thread Bryan Woliner
Hi, I was able to crawl/index/search a couple of sites using the "intranet crawl" instructions in the tutorial. I am now trying to go through the whole-web crawl instructions in the tutorial and only got through a few steps before I ran into an error the first time I called bin/nutch fetch. (Note

Will nutch work with my webhost?

2005-07-01 Thread Bryan Woliner
Hi all, I'm a Nutch newbie and was able to download and install nutch on my Dell Laptop running Windows XP, and although I have a lot to figure out still, I was able to do a couple successful crawl/index/searches using the intranet search and am about to begin my exploration of the whole-web searc