Re: Ignoring Robots.txt

2009-09-11 Thread John Mendenhall
ots.txt? We had a similar situation. We modified the parse-html plugin, with a configurable flag to adhere to robots.txt or not adhere to robots.txt. Works great. JohnM -- john mendenhall j...@surfutopia.net surf utopia internet services

Re: nutch fetch of redirects not ending up in index

2008-12-04 Thread John Mendenhall
contain the successfully fetched urls > and the redirected intermediate urls. At least that is what I think is > happening. > > The final number indexed should be the successfully fetched urls, which > would be db_fetched. > > Dennis Anything I can do to help debug this?

Re: nutch fetch of redirects not ending up in index

2008-12-04 Thread John Mendenhall
g it to 3 and > >your redirects should go down. > > > >Dennis > > > >John Mendenhall wrote: > >>>We are using nutch version nutch-2008-07-22_04-01-29. > >>>We have a crawldb with over 500k urls. > >>> > >>>The statu

Re: nutch fetch of redirects not ending up in index

2008-12-03 Thread John Mendenhall
t 8 times per day, with only small incremental progress each round. Should topN be higher? Or, do we need to rebuild the entire crawl database? Please let me know if there is any information I need to provide. Thanks in advance for any assistance provided. JohnM -- john mendenhall [EMAIL PROTECTED] surf utopia internet services

nutch fetch of redirects not ending up in index

2008-12-01 Thread John Mendenhall
errors, to ensure they are not something serious? Of course, this is not even close to the missing numbers we should be seeing. Thanks in advance for any assistance or pointers to other resources or ideas. JohnM -- john mendenhall [EMAIL PROTECTED] surf utopia internet services

Re: nutch parsetext missing for some urls

2008-10-23 Thread John Mendenhall
thout titles. We have worked through this issue and the titles now exist, along with the corresponding text. JohnM -- john mendenhall [EMAIL PROTECTED] surf utopia internet services

Re: nutch parsetext missing for some urls

2008-10-21 Thread John Mendenhall
> Can u post some of the urls for which parse text is missing. I am unable to post the actual urls. This is a private project for which exact urls cannot be shared. JohnM > On Tue, Oct 21, 2008 at 6:44 AM, John Mendenhall <[EMAIL PROTECTED]>wrote: > > > We are usin

nutch parsetext missing for some urls

2008-10-20 Thread John Mendenhall
guarantee all urls get a parsetext, and hopefully, a title? Thanks in advance for any assistance or pointers to other resources or ideas. JohnM -- john mendenhall [EMAIL PROTECTED] surf utopia internet services

Re: nutch mergedb filter does not appear to be filtering

2008-10-20 Thread John Mendenhall
directory. Then, start the hadoop processes. Once the filtering is done, we stop the hadoop processes. Then, we unset the NUTCH_CONF_DIR and HADOOP_CONF_DIR environment variables. Finally, we restart the hadoop processes. Everything works like a charm now. JohnM -- john mendenhall [EMAIL PROTECTED] surf utopia internet services
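
The sequence described above could be sketched roughly as follows. This is a dry-run illustration only: it prints each step instead of executing it. The first step (pointing both variables at the alternate configuration directory) is an assumption, since the message is truncated before that point. The function name, the crawldb paths, and /path/to/filter-conf are hypothetical, and the script locations assume a stock layout with start-all.sh and stop-all.sh under bin/.

```sh
#!/bin/sh
# Dry-run helper: print the command instead of running it, so the order of
# steps can be reviewed before anything touches a real cluster.
run() { echo "+ $*"; }

filter_with_alternate_conf() {
    conf_dir=$1   # hypothetical directory holding the filtering configuration

    # Assumed first step (the original message is cut off before this point).
    NUTCH_CONF_DIR=$conf_dir
    HADOOP_CONF_DIR=$conf_dir
    export NUTCH_CONF_DIR HADOOP_CONF_DIR

    run bin/start-all.sh    # start the hadoop processes
    run bin/nutch mergedb crawl/crawldb_filtered crawl/crawldb -filter
    run bin/stop-all.sh     # stop them once the filtering is done

    # Return to the default configuration before restarting.
    unset NUTCH_CONF_DIR HADOOP_CONF_DIR
    run bin/start-all.sh    # restart the hadoop processes
}

filter_with_alternate_conf /path/to/filter-conf
```

To execute the steps for real, the body of run would need to become "$@", after checking every path against the local installation.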

Re: nutch mergedb filter does not appear to be filtering

2008-10-14 Thread John Mendenhall
. Does anyone have any thoughts or ideas for what we can do to get this to work with the NUTCH_CONF_DIR? Thank you in advance for any pointers. JohnM -- john mendenhall [EMAIL PROTECTED] surf utopia internet services

nutch mergedb filter does not appear to be filtering

2008-10-13 Thread John Mendenhall
specific I should be looking at first. Thanks in advance for any guidance or ideas provided. Thanks! JohnM -- john mendenhall [EMAIL PROTECTED] surf utopia internet services

Re: Error: Failed to get the current user's information: Login failed: Cannot run program "whoami":

2008-04-29 Thread John Mendenhall
a box configured like Linux. Assumptions were made in the default shell script. We have had nutch running on Windows, Linux, and Solaris. To get it to run on any of these boxes, we had to change the basic scripts. JohnM -- john mendenhall [EMAIL PROTECTED] sur

Re: hadoop dfs -ls and nutch generate/fetch commands

2008-04-18 Thread John Mendenhall
er. Pipe it through sort before you use tail. You can only delete old segments after the refetch time has passed for that segment and all entries in that segment have been refetched. JohnM -- john mendenhall [EMAIL PROTECTED] surf utopia internet services
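
The sort-before-tail advice might look like the sketch below. It assumes segment directories are named by timestamp (as in the /var/nutch/crawl/segments/20080128132506 paths quoted elsewhere in this archive), so that a plain lexical sort is also chronological. The function name latest_segment and the sample paths are hypothetical.

```sh
#!/bin/sh
# latest_segment: read one segment path per line on stdin and print the
# newest one. Sorting first matters because a directory listing is not
# guaranteed to come back in chronological order.
latest_segment() {
    sort | tail -n 1
}

# Example with made-up paths; in practice the input would be a listing of
# the segments directory reduced to one path per line.
printf '%s\n' \
    crawl/segments/20080418101500 \
    crawl/segments/20080401093000 \
    crawl/segments/20080410120000 | latest_segment
```

Without the sort, tail simply returns whatever the listing happened to print last, which may not be the most recent segment.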

nutch data on *nix and windows

2008-04-16 Thread John Mendenhall
-- john mendenhall [EMAIL PROTECTED] surf utopia internet services

Re: Next Generation Nutch

2008-04-11 Thread John Mendenhall
n in the long run. Assuming this is the way Nutch moves forward, do we allow Nutch to stay as-is, with plugins and all, and create a new project? Or, do we not worry about abandoning the current setup and changing it en masse? JohnM -- john mendenhall [EMAIL PROTECTED] surf utopia internet services

nutch, hadoop, and windows

2008-04-11 Thread John Mendenhall
be seem to think that is the best way to go. Any thoughts? Thanks! JohnM -- john mendenhall [EMAIL PROTECTED] surf utopia internet services

Re: Cluster Summary

2008-03-20 Thread John Mendenhall
:50030/jobtracker.jsp, the cluster summary shows only one > > node. ? > > > > Any suggestions > > > > > > MapsReducesTasks/NodeNodes 0241 <http://ascot1:50030/machines.jsp> Did you see all nodes listed in the output of the start-all script? It should list

Re: nutch 0.9, tomcat 6.0.14, nutchbean okay, tomcat search error

2008-03-15 Thread John Mendenhall
'nutch'. Fixed that and it works like a charm. Thanks again! JohnM -- john mendenhall [EMAIL PROTECTED] surf utopia internet services

nutch 0.9, tomcat 6.0.14, nutchbean okay, tomcat search error

2008-03-15 Thread John Mendenhall
ng it anywhere. Is there a place where I can set the memory footprint for tomcat to use more memory? Or, is there another place I should be looking? Thanks in advance for any pointers or assistance. JohnM -- john mendenhall [EMAIL PROTECTED] surf utopia internet services

Re: Error when adding nutch-0.9 war file to tomcat

2008-03-06 Thread John Mendenhall
unpacked. Then, the nutch app is the default URL for your tomcat setup. I hope this helps. JohnM -- john mendenhall [EMAIL PROTECTED] surf utopia internet services

Re: nutch 0.9, multiple nodes, dedup error and Failed to transfer blk_-1407334809134504262

2008-03-05 Thread John Mendenhall
using nutch 0.9. > Thanks ! > > On Fri, Jan 11, 2008 at 12:57 AM, John Mendenhall <[EMAIL PROTECTED]> > wrote: > > > Hello, > > > > I am running nutch 0.9 currently. > > I am running on 4 nodes, one is the master, in > > addition to being a slave.

Re: Nutch 0.9 mysterious failure to crawl sites (stopping at depth=0)

2008-02-20 Thread John Mendenhall
and it fixed my > > problem: > > > > http://www.mail-archive.com/[EMAIL PROTECTED]/msg01991.html Look at NUTCH-503, not NUTCH-507. I have no experience with NUTCH-507. JohnM -- john mendenhall [EMAIL PROTECTED] surf utopia internet services

Re: Nutch 0.9 mysterious failure to crawl sites (stopping at depth=0)

2008-02-20 Thread John Mendenhall
ere is a problem with the Generator. There was a change committed after 0.9 was released. I implemented this change and it fixed my problem: http://www.mail-archive.com/[EMAIL PROTECTED]/msg01991.html JohnM -- john mendenhall [EMAIL PROTECTED] surf utopia internet services

Re: Nutch 0.9 mysterious failure to crawl sites (stopping at depth=0)

2008-02-20 Thread John Mendenhall
kay to me. I would start looking at the logs closely. I would try setting your log4j properties to INFO or DEBUG level for the generator step. The inject is obviously working since your stats shows the urls in the crawldb as unfetched. So, debug the generator. JohnM -- john mendenhall [E
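
One way to raise the log level for the generator step is to add a logger line to conf/log4j.properties. The sketch below assumes log4j 1.x property syntax and that the generator logs under its class name, org.apache.nutch.crawl.Generator. The helper name enable_generator_debug is hypothetical.

```sh
#!/bin/sh
# enable_generator_debug FILE: append a DEBUG logger line for the generator
# to a log4j properties file, unless that logger is already configured there.
enable_generator_debug() {
    props=$1
    # Assumed logger name: the generator's fully qualified class name.
    logger=log4j.logger.org.apache.nutch.crawl.Generator
    if ! grep -q "^$logger=" "$props" 2>/dev/null; then
        echo "$logger=DEBUG" >> "$props"
    fi
}

# Typical use, run from the nutch directory:
#   enable_generator_debug conf/log4j.properties
```

The helper leaves an existing entry for that logger untouched, so a level set earlier has to be edited by hand.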

Re: Nutch 0.9 mysterious failure to crawl sites (stopping at depth=0)

2008-02-20 Thread John Mendenhall
> Any help at all would be much appreciated. Post the command you ran, plus a sample of the urls in the url file, plus your filter. We can start from there. JohnM -- john mendenhall [EMAIL PROTECTED] surf utopia internet services

nutch 0.9, mapred-default.xml, hadoop-site.xml file usage on slaves

2008-02-13 Thread John Mendenhall
s? Thanks in advance for any pointers or rules of thumb you can provide. JohnM -- john mendenhall [EMAIL PROTECTED] surf utopia internet services

nutch 0.9, task status, task logs

2008-02-13 Thread John Mendenhall
ask: task_0018_m_02_0 - Thanks in advance for any assistance you can provide. JohnM -- john mendenhall [EMAIL PROTECTED] surf utopia internet services

Re: nutch 0.9, mergesegs error

2008-02-13 Thread John Mendenhall
On Tue, 05 Feb 2008, John Mendenhall wrote: > - > Merging 14 segments to /var/nutch/crawl/mergesegs_dir/20080201220906 > SegmentMerger: adding /var/nutch/crawl/segments/20080128132506 > SegmentMerger: adding ... > SegmentMerger: using segment data from: content crawl_gener

Re: Deleteing an index document in nutch

2008-02-07 Thread John Mendenhall
/nutch merge -workingdir $NUTCHTMPDIR $NEWINDEXDIR $NEWINDEXESDIR The variable names should be self-explanatory. If not, just let me know. JohnM -- john mendenhall [EMAIL PROTECTED] surf utopia internet services

Re: nutch 0.9, mergesegs error

2008-02-07 Thread John Mendenhall
On Tue, 05 Feb 2008, John Mendenhall wrote: > I am running nutch 0.9. > I have run nutch mergesegs many times before. > The last couple times I have run, I get the following > errors: > > - > Merging 14 segments to /var/nutch/crawl/mergesegs_dir/20080201220906 > Segm

nutch 0.9, mergesegs error

2008-02-05 Thread John Mendenhall
Why is log4j not finding the log4j.properties file? The nutch script in nutch/bin already adds the conf dir to the class path. Thanks in advance for any assistance you can provide. JohnM -- john mendenhall [EMAIL PROTECTED] surf utopia internet services

Re: nutch 0.9, multiple nodes, not fetching topN links to fetch

2008-01-30 Thread John Mendenhall
some with 1.5gb ram, and others with 4gb ram. Sorry for all the questions. The fetch issue is my current wall I am trying to overcome. Should this be debugged in the fetch process or is it possible the generate process is only outputting 3%-4% of the topN value? Thanks in advance for any poi

Re: nutch 0.9, fetch2, fetcher.parse conf value not used

2008-01-30 Thread John Mendenhall
configuration value is being used. I recommend we modify Fetcher2.java to use this value instead of requiring it to be on the command line. JohnM -- john mendenhall [EMAIL PROTECTED] surf utopia internet services

Re: New Installation - Problems - Error 500

2008-01-29 Thread John Mendenhall
he jsp pages are in the jsp directory. Simple, huh? If you want to just modify what is already in the tomcat directory, they are located in the webapps/ROOT directory in various directories, assuming you renamed it to ROOT. I hope that helps. JohnM -- john mendenhall [EMAIL PROTECTED] surf utopia internet services

Re: Nutch and Hadoop

2008-01-28 Thread John Mendenhall
onfiguration files and what you are setting. JohnM -- john mendenhall [EMAIL PROTECTED] surf utopia internet services

nutch 0.9, fetch2, fetcher.parse conf value not used

2008-01-26 Thread John Mendenhall
problem? Thanks in advance for any assistance you can provide. JohnM -- john mendenhall [EMAIL PROTECTED] surf utopia internet services

Re: nutch 0.9, multiple nodes, not fetching topN links to fetch

2008-01-26 Thread John Mendenhall
3 pure slaves. What is the best procedure for turning off the 3 slaves? Should I go back to a "local" setup only, without the overhead of hadoop dfs? What is the best recommendation? Thanks! JohnM -- john mendenhall [EMAIL PROTECTED] surf utopia internet services

Re: nutch 0.9, multiple nodes, not fetching topN links to fetch

2008-01-25 Thread John Mendenhall
On Fri, 25 Jan 2008, Dennis Kubes wrote: > Yes you would need to run parsing after fetching and before updatedb. Thanks! JohnM > John Mendenhall wrote: > >On Fri, 25 Jan 2008, Dennis Kubes wrote: > > > >>>Is the recommendation to run fetcher in parsing mode? >

Re: nutch 0.9, multiple nodes, not fetching topN links to fetch

2008-01-25 Thread John Mendenhall
would complete the download and if the parsing failed you would > still have the page content and be able to try again without refetching. To clarify, run the parsing after the fetch process and before the updatedb process, correct? Thanks! JohnM -- john mendenhall [EMAIL PROTECTED] surf utopia internet services
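
The ordering being confirmed here (fetch without parsing, then parse, then updatedb) could be sketched as below. It is a dry run that only prints the commands. The crawldb and segment paths are hypothetical, and the -noParsing flag is assumed to be how the fetcher is told to skip parsing in this version.

```sh
#!/bin/sh
# Dry-run helper: print the command instead of running it.
run() { echo "+ $*"; }

# fetch_parse_update CRAWLDB SEGMENT: fetch a segment without parsing,
# parse it as a separate step, and only then update the crawldb.
fetch_parse_update() {
    crawldb=$1
    segment=$2
    run bin/nutch fetch "$segment" -noParsing   # download only
    run bin/nutch parse "$segment"              # can be retried without refetching
    run bin/nutch updatedb "$crawldb" "$segment"
}

fetch_parse_update crawl/crawldb crawl/segments/20080125000000
```

Keeping parse as its own step means a parsing failure can be retried against the already downloaded content.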

Re: nutch 0.9, multiple nodes, not fetching topN links to fetch

2008-01-25 Thread John Mendenhall
m > the same host are assigned to the same map task. All hosts are the same. Every one of them. If there is no way to split them up, this seems to imply the distributed nature of nutch is lost when attempting to build an index for a single large site. Please correct me if I am wrong in this presumption. Thanks! JohnM -- john mendenhall [EMAIL PROTECTED] surf utopia internet services
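
The effect described above, where every URL from one host lands in the same map task, can be illustrated with a toy partitioner. This is not Nutch's actual partitioning code, only a sketch of the idea of hashing the host name. The function names and the example.com URLs are hypothetical.

```sh
#!/bin/sh
# host_of URL: strip the scheme, then everything from the first "/", ":" or "?".
host_of() {
    printf '%s\n' "$1" |
        sed -e 's#^[A-Za-z][A-Za-z0-9+.-]*://##' -e 's#[/:?].*$##'
}

# partition_of URL N: map the URL's host to one of N partitions by taking a
# checksum of the host name modulo N.
partition_of() {
    sum=$(host_of "$1" | cksum | cut -d ' ' -f 1)
    echo $((sum % $2))
}

# Every URL on the same host yields the same partition number, so a crawl of
# a single site keeps only one of the N partitions busy.
for url in http://www.example.com/a http://www.example.com/b/c http://www.example.com/d?x=1; do
    echo "$url -> partition $(partition_of "$url" 4)"
done
```

With only one distinct host in the input, adding more partitions changes nothing, which matches the behaviour reported in the thread.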

Re: nutch 0.9, multiple nodes, not fetching topN links to fetch

2008-01-24 Thread John Mendenhall
slots. What settings do I need to modify to get the generated topN (10) urls to be spread out amongst all map task slots? Thanks! JohnM -- john mendenhall [EMAIL PROTECTED] surf utopia internet services

Re: deprecated methods in org.apache.nutch.searcher.IndexSearcher

2008-01-24 Thread John Mendenhall
uals(LocalFileSystem.NAME)) { > ... > > because Hadoop reserves a specific URI of the local FS abstraction, no > matter what is its implementation. I found LocalFileSystem documentation at http://hadoop.apache.org/core/docs/r0.14.4/api/org/apache/hadoop/fs/LocalFileSyste

Re: deprecated methods in org.apache.nutch.searcher.IndexSearcher

2008-01-23 Thread John Mendenhall
On Wed, 23 Jan 2008, John Mendenhall wrote: > I am using nutch-0.9. > > In the searcher.IndexSearcher class, there is a getDirectory > method that uses the following two calls: > > - > if ("local".equals(this.fs.getName())) { > return FSDi

deprecated methods in org.apache.nutch.searcher.IndexSearcher

2008-01-23 Thread John Mendenhall
e just remove the boolean? Please let me know how we are planning on modifying this code to adhere to the APIs we are using. Thanks! JohnM -- john mendenhall [EMAIL PROTECTED] surf utopia internet services

Re: nutch 0.9, multiple nodes, not fetching topN links to fetch

2008-01-23 Thread John Mendenhall
? > Or, is this something else in the configuration? > > Is this error the cause of only doing 3% of the 100k > urls I requested to be done? > > Or, is it a problem with the other 96 map tasks not doing > anything? > > Thanks again for all of your help. > > JohnM Does anyone have any thoughts on how I can begin addressing the issues I am experiencing above? Thanks in advance for any pointers anyone can provide. JohnM -- john mendenhall [EMAIL PROTECTED] surf utopia internet services

Re: nutch 0.9, multiple nodes, not fetching topN links to fetch

2008-01-21 Thread John Mendenhall
ause of only doing 3% of the 100k urls I requested to be done? Or, is it a problem with the other 96 map tasks not doing anything? Thanks again for all of your help. JohnM -- john mendenhall [EMAIL PROTECTED] surf utopia internet services

Re: nutch 0.9, multiple nodes, not fetching topN links to fetch

2008-01-21 Thread John Mendenhall
It sends me the jobdetails.jsp page, which is what I reported on. It seems to me you are referring to another interface. Can you please let me know where I should be looking for the errors in the fetcher tasks themselves? Thanks! JohnM -- john mendenhall [EMAIL PROTECTED] surf utopia internet services

Re: nutch 0.9, multiple nodes, not fetching topN links to fetch

2008-01-19 Thread John Mendenhall
e to check on the bandwidth available for fetching. Variable mapred.map.tasks is set to 97. Variable mapred.reduce.tasks is set to 17. Variable fetcher.threads.fetch is set to 10. Thanks again for any pointers you can provide. JohnM > John Mendenhall wrote: > >Hello, > >

nutch 0.9, multiple nodes, not fetching topN links to fetch

2008-01-19 Thread John Mendenhall
higher than the default of 2? Is there something in the logs I should look for to determine the exact cause of this problem? Thank you in advance for any assistance that can be provided. If you need any additional information, please let me know and I'll send it. Thanks! JohnM -- john mende

nutch 0.9, multiple nodes, logging missing

2008-01-17 Thread John Mendenhall
ditional information, please let me know and I'll send them. Thanks! JohnM -- john mendenhall [EMAIL PROTECTED] surf utopia internet services

nutch 0.9, multiple nodes, dedup error

2008-01-10 Thread John Mendenhall
figuration.(Configuration.java:93) at org.apache.hadoop.fs.FsShell.main(FsShell.java:910) - If you need me to post log excerpts from the other slaves, please let me know and I'll put them up. Thanks! JohnM -- john mendenhall [EMAIL PROTECTED] surf utopia internet services

Re: How to get the crawl database free of links to recrawl only from seed URL?

2007-08-24 Thread John Mendenhall
d and no new URLs will be added. I hope that helps. JohnM -- john mendenhall [EMAIL PROTECTED] surf utopia internet services

Re: how to update CrawlDB instead of Recrawling???

2007-08-21 Thread John Mendenhall
> > update our crawldb instead of re-crawling . > > > > So do u have any solution that how to update crawldb which already have > > been crawled and storing some useful information. > > > > It's nice if I find any solutions from u or any of ur colleagues. > > > > With Thanks & Regards, > > > > Ratnesh,V2Solutions India -- john mendenhall [EMAIL PROTECTED] surf utopia internet services

Re: nutch links repository

2007-08-20 Thread John Mendenhall
ation on <url> to System.out
    -topN <nnnn> <out_dir> [<min>]   dump top <nnnn> urls sorted by score to <out_dir>
        [<min>]   skip records with scores below this value.
                  This can significantly improve performance.
Or, you can write your own class that outputs whatever you want from the database... Joh

Re: Error with Nutch 0.9

2007-07-31 Thread John Mendenhall
merge several segment indexes
  dedup      remove duplicates from a set of segment indexes
  plugin     load a plugin and run one of its classes main()
  server     run a search server
 or
  CLASSNAME  run the class named CLASSNAME
Most commands print help

Re: Error with Nutch 0.9

2007-07-31 Thread John Mendenhall
ainThread.run(libgcj.so.8rh) > Caused by: java.lang.ClassNotFoundException: admin not found in > gnu.gjc.runtime ... > > I did this a couple of weeks ago. At that point I couldn't find any > documentation for Nutch 0.9, so I tried the > ./bin/nutch admin db -create > > is that the Problem? nutch 0.9 do

site-specific classes

2007-07-19 Thread John Mendenhall
know where I should ask, or where I can find the docs on these kinds of queries. Thanks! JohnM -- john mendenhall [EMAIL PROTECTED] surf utopia internet services