Re: [Nutch-general] Nutch and distributed searching (w/ apologies)

2007-08-02 Thread Dennis Kubes
the database files are replaced. And you would continually get the best urls in your index for the space you have. I imagine that this is very similar to how the google dance works. Dennis Kubes charlie w wrote: On 8/1/07, Dennis Kubes [EMAIL PROTECTED] wrote: I am currently writing a python script

Re: [Nutch-general] Nutch and distributed searching (w/ apologies)

2007-08-01 Thread Dennis Kubes
I am currently writing a python script to automate this whole process from inject to pushing out to search servers. It should be done in a day or two and I will post it on the wiki. Dennis Kubes charlie w wrote: Thanks very much for the extended reply; lots of food for thought. WRT

Re: [Nutch-general] Nutch and distributed searching (w/ apologies)

2007-07-31 Thread Dennis Kubes
index is never down. Hope this helps and let me know if you have any questions. Dennis Kubes

[Nutch-general] Really big indexing and timeouts?

2007-07-30 Thread Dennis Kubes
Is anybody doing really big indexing jobs on Nutch and Hadoop, say 50M or more and seeing indexer timeout jobs? Dennis

Re: [Nutch-general] Adding Patches

2007-07-24 Thread Dennis Kubes
Also it is best to open new JIRA issues and attach the patch inside the JIRA if you are wanting the patch included in Nutch releases. Dennis Kubes Marcin Okraszewski wrote: If you think of contributing patch to Nutch to be included in sources some day, you should probably do it against head

Re: [Nutch-general] four nutch merge commands: mergedb, mergesegs, mergelinkdb, merge

2007-07-18 Thread Dennis Kubes
If I am reading the message right :) then yes that problem would have been fixed by now. I believe that problem was with an earlier version of Nutch (0.7). Dennis Kubes Kai_testing Middleton wrote: Am I correct that the 'new' mergedb and mergelinkdb commands together would fix this problem

Re: [Nutch-general] Indexing exits with Job Failed

2007-07-09 Thread Dennis Kubes
, say -Xmx512M (we have ours set for -Xmx1024M). Dennis Kubes Jason Ma wrote: I'm running Nutch on RedHat Linux with Java 1.6.0_01. I have successfully crawled and indexed smaller quantities of data in the past. However, after I tried to scale up the crawling, Nutch would give an exception

Re: [Nutch-general] No buffer space available (maximum connections reached?): connect

2007-06-29 Thread Dennis Kubes
Sounds to me like you have reached the maximum number of open connections or ran out of memory or swap space. What is the available space on the box, how much memory do you have and how much swap? Dennis Kubes Fritz Bein wrote: Hi, after about 500'000 fetches I receive the message

Re: [Nutch-general] Deploying Nutch on Tomcat

2007-06-27 Thread Dennis Kubes
version of Tomcat in the 5.x or 6.x range. Dennis Kubes Jason Ma wrote: Hi, I'm new to Nutch and Tomcat, so there are doubtlessly many stupid things that I've done. I'm running Nutch 0.8.1 and Tomcat 4.1.36, on RedHat Linux with Java 1.6.0_01. I have uncompressed the nutch-0.8.1.war file

Re: [Nutch-general] Distributed index

2007-06-22 Thread Dennis Kubes
index per search server I would love to hear about it. The former suggestions of space and architecture are what we have experienced. Dennis Kubes

Re: [Nutch-general] Distributed index

2007-06-22 Thread Dennis Kubes
calculations may be somewhat low on the segment space. Dennis Kubes other question would be what part of those 4G is taken by index, i think it's the majority, but i might be very wrong... You said above that you don't want local storage. Search has to be on local file systems. While

Re: [Nutch-general] Distributed index

2007-06-21 Thread Dennis Kubes
hard drives (less if you can find them). Network is ideally Gigabit ethernet. Dennis Kubes Karol Rybak Programmer University of Internet Technology and Management

Re: [Nutch-general] Distributed index

2007-06-21 Thread Dennis Kubes
Andrzej Bialecki wrote: Dennis Kubes wrote: 100 million pages = 50-100 servers and 20-40T of space distributed. Ideally the setup would be processing machines and search servers. You [..] That's a very nice description - thanks, Dennis. I think it would be useful to include

Re: [Nutch-general] Hadoop oddity

2007-06-07 Thread Dennis Kubes
I was asking if you can ping the master from the slaves. Can you hit the namenode from one or more of the remote datanodes? If so, in the hadoop-site.xml files on the datanodes, is the namenode variable pointing to the FQDN of the namenode instead of local? Dennis Kubes Bolle, Jeffrey F

Re: [Nutch-general] stackoverflow error

2007-06-06 Thread Dennis Kubes
fixes the problem but it is not very robust and has no unit tests as of yet. I have run this successfully myself. I will provide a more robust patch when time allows but this should help you for now. Dennis Kubes djames wrote: Thanks a lot for your help I'll give you a feedback

Re: [Nutch-general] Hadoop oddity

2007-06-06 Thread Dennis Kubes
If the hosts file on the namenode is not setup correctly it could be listening only on localhost. Make sure your /etc/hosts file looks something like this: 127.0.0.1 localhost, localhost.localdomain x.x.x.x yourcomputer.domain.tld Dennis Kubes Bolle, Jeffrey F. wrote

Re: [Nutch-general] Parallelizing URLFiltering

2007-05-31 Thread Dennis Kubes
of the servers are open to the public. search domain.com nameserver 127.0.0.1 nameserver 208.67.222.222 nameserver 208.67.220.220 nameserver 4.2.2.1 nameserver 4.2.2.2 nameserver 4.2.2.3 nameserver 4.2.2.4 nameserver 4.2.2.5 Dennis Kubes Enzo Michelangeli wrote: - Original Message - From

Re: [Nutch-general] Parallelizing URLFiltering

2007-05-31 Thread Dennis Kubes
, not tens of thousands). We are also using BIND and our current index is 52,519,267 pages so you should be fine with this. I think djbdns is just easier to use. Are you using any big DNS caches as backups? Dennis Kubes I've had positive experience with djbdns / tinydns package, with some

Re: [Nutch-general] parser not found for contentType=application/pdf

2007-05-17 Thread Dennis Kubes
In the nutch-default.xml file you have the configuration option plugin.includes. Copy that property to the nutch-site.xml file and change the parse-(text|html|js) to look like this parse-(text|html|js|pdf) This will enable the pdf parser plugin. Dennis Kubes Sævaldur Arnar Gunnarsson wrote
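
The effect of that change can be illustrated with a minimal Python sketch. It assumes that the plugin.includes value is a regular expression that must match a plugin id in full; the helper name plugin_enabled is hypothetical and not part of Nutch, and only the parse-(...) fragment quoted above is modelled.

```python
import re


def plugin_enabled(includes_pattern, plugin_id):
    """Return True if plugin_id matches the includes pattern in full.

    Assumption: the pattern is applied as a whole-string regular
    expression match against each plugin id.
    """
    return re.fullmatch(includes_pattern, plugin_id) is not None


if __name__ == "__main__":
    before = "parse-(text|html|js)"
    after = "parse-(text|html|js|pdf)"
    for pattern in (before, after):
        print(pattern, plugin_enabled(pattern, "parse-pdf"))
```

With the original fragment parse-pdf does not match; with |pdf added it does, which is why the plugin is only picked up after the edit.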

Re: [Nutch-general] Generic Question about initial seed

2007-05-16 Thread Dennis Kubes
. Your second crawl will be around 54 million pages. And a depth of 3 will give you over 300 million pages. These are the numbers that we are currently seeing. Dennis Kubes bbrown wrote: This is kind of a generic question. Are there any stats on how many pages will get crawled based on some

Re: [Nutch-general] Nutch Crawling error

2007-05-14 Thread Dennis Kubes
It should look like this but change out domain for your domain. Try this and let me know if it works. 127.0.0.1 dhcppc0.domain.com dhcppc0 localhost.localdomain localhost Dennis Kubes Reza Harditya wrote: Hi Dennis, Yes dhcppc0 is the machine that Nutch is on. And yes

Re: [Nutch-general] Nutch Crawling error

2007-05-13 Thread Dennis Kubes
For some reason the nutch process can't resolve the hosts. This could be due to incorrect setup of dns on the machine or a firewall or proxy in place. See if you can ping one of the urls (hosts) that you are trying to fetch. Dennis Kubes Reza Harditya wrote: Hi, I'm a new nutch user

Re: [Nutch-general] Nutch Crawling error

2007-05-13 Thread Dennis Kubes
PROTECTED] wrote: I have checked and confirmed that the hosts I'm trying to fetch are actually accessible (ping requests and loading the site itself). However, I still get the same error. Any other alternatives? On 5/14/07, Dennis Kubes [EMAIL PROTECTED] wrote: For some reason the nutch

Re: [Nutch-general] nutch and hadoop: can't launch properly the name node

2007-05-02 Thread Dennis Kubes
The problem may be that the machine is listening on only the local interface. If you do a ping myhostname from the local box you should receive the real IP and not the loopback address. Let me know if this was the problem or if you need more help. Dennis Kubes cybercouf wrote: I'm trying to setup
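
The same check can be scripted instead of reading ping output. The following is a small Python sketch, not something from the thread: the function names are illustrative, and it only looks at the IPv4 address that the standard library resolver returns for a host name.

```python
import ipaddress
import socket


def is_loopback(address):
    """Return True if the given IP address string is a loopback address."""
    return ipaddress.ip_address(address).is_loopback


def hostname_resolves_to_loopback(hostname):
    """Resolve hostname to an IPv4 address and report whether it is loopback."""
    return is_loopback(socket.gethostbyname(hostname))


if __name__ == "__main__":
    # Check the name this machine reports for itself.
    name = socket.gethostname()
    print(name, hostname_resolves_to_loopback(name))
```

If the second function returns True for the machine's own host name, the name is bound to the loopback interface, which is the situation described above.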

Re: [Nutch-general] nutch and hadoop: can't launch properly the name node

2007-05-02 Thread Dennis Kubes
What errors are you seeing in your hadoop-namenode and datanode logs? Dennis Kubes cybercouf wrote: Yes it is. Here are more details: $ cat /etc/hosts 127.0.0.1 localhost 84.x.x.x myhostname.mydomain.com myhostname # ping myhostname PING myhostname.mydomain.com (84.x.x.x) 56(84

Re: [Nutch-general] nutch and hadoop: can't launch properly the name node

2007-05-02 Thread Dennis Kubes
Is your hadoop jar in the lib directory named hadoop-0.4.0-patched.jar! with the exclamation point? If it is, that may be causing the error. Also let me know if you can ping the namenode from any of the data nodes. Dennis Kubes cybercouf wrote: I tried both with localhost

Re: [Nutch-general] Hardware Crashes and Garbage Collection on Nutch/Hadoop

2007-04-21 Thread Dennis Kubes
Andrzej Bialecki wrote: Dennis Kubes wrote: So we moved 50 machines to a data center for a beta cluster of a new search engine based on Nutch and Hadoop. We fired all of the machines up and started fetching and almost immediately started experiencing JVM crashes and checksum/IO errors

[Nutch-general] Hardware Crashes and Garbage Collection on Nutch/Hadoop

2007-04-20 Thread Dennis Kubes
, but in part I hope some of this information helps someone else to avoid having to spend a week tracking down hardware and weird JVM problems. Dennis Kubes

Re: [Nutch-general] Long URL's in results

2007-04-14 Thread Dennis Kubes
We use a substring in the JSP pages to chop off the URL after 150 characters. Then it shows something like this, with the ellipsis: http://www.somelongurl.com/?w=with;a;big;long;query;string... Dennis Kubes rubdabadub wrote: Hi: You have two options 1. Don't crawl/index URLs having more than X char
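
The truncation itself is simple enough to sketch. The original does it in the JSP pages; the Python below only illustrates the same idea, and the function name shorten_url and its parameters are made up for the example.

```python
def shorten_url(url, limit=150, marker="..."):
    """Cut url after `limit` characters and append a marker if it was cut."""
    if len(url) <= limit:
        return url
    return url[:limit] + marker


if __name__ == "__main__":
    long_url = "http://www.somelongurl.com/?w=" + ";".join(["word"] * 60)
    print(shorten_url(long_url))
```

URLs at or under the limit are returned unchanged, so the marker only appears when something was actually removed.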

Re: [Nutch-general] Help please trying to crawl local file system

2007-04-05 Thread Dennis Kubes
Did you set the agent name in the Nutch configuration? I think even when crawling only the local file system the agent name still needs to be set. If not set I believe nothing is fetched and errors are thrown but you would only see this if your logging was set up for it. Dennis Kubes jim

[Nutch-general] Wikia Search Engine? Anyone working on it?

2007-03-24 Thread Dennis Kubes
haven't really seen anybody that has been active on the lists say they are going to be involved in the project though? What is everyone's interest level on this? Dennis Kubes

Re: [Nutch-general] Crawl not crawling entire page

2007-03-22 Thread Dennis Kubes
Nutch by default will only parse the first 65536 bytes of an http request. You can change this to your desired limit by changing the http.content.limit configuration variable. Another question is whether some of the links are duplicates? Dennis Kubes Mike Howarth wrote: Thanks

Re: [Nutch-general] Nutch conf reading

2007-03-15 Thread Dennis Kubes
If within nutch: Configuration conf = NutchConfiguration.create(); Object obj = conf.get("my.variable.name")... or another get method Dennis Kubes djames wrote: Thanks for your help, but when I call this method it can't be resolved. Is there an import I must do

Re: [Nutch-general] Nutch conf reading

2007-03-14 Thread Dennis Kubes
in the hadoop source code. Dennis Kubes djames wrote: Hello, I need to add a parameter in the conf file of nutch. What is the method to read the xml file in nutch? Thanks

Re: [Nutch-general] Any hints for debuging errors like java.io.exception: read 95 bytes, should read 159 ?

2007-03-14 Thread Dennis Kubes
. The error is basically stating that you wrote something out but haven't read it back in. Dennis Kubes qi wu wrote: Hi, I am trying to modify the Fetcher code in Nutch.81 , but always get the exceptions below in the hadoop.log. java.lang.RuntimeException: java.io.IOException: Version

Re: [Nutch-general] How to avoid outlinks on jpg/css/... ?

2007-03-09 Thread Dennis Kubes
|suffix)... Then you will need the prefix-urlfilter.txt and suffix-urlfilter.txt files in the conf directory. Below is a configuration that only crawls http pages with specific suffixes. On the suffix we start by allowing everything and then specifically deny certain file types. Dennis Kubes
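
The filtering logic described here (accept only certain prefixes, allow every suffix by default, then deny specific ones) can be sketched in Python. This is an illustration of the idea only: it does not reproduce the format of prefix-urlfilter.txt or suffix-urlfilter.txt, and the function name and the example suffix list are assumptions.

```python
def accept_url(url, allowed_prefixes=("http://",), denied_suffixes=(".jpg", ".css")):
    """Accept a URL if it has an allowed prefix and no denied suffix.

    Suffixes are allowed by default; only the listed ones are rejected.
    The comparison is case-insensitive.
    """
    lowered = url.lower()
    if not lowered.startswith(tuple(allowed_prefixes)):
        return False
    return not lowered.endswith(tuple(denied_suffixes))


if __name__ == "__main__":
    for candidate in ("http://example.com/index.html",
                      "http://example.com/logo.JPG",
                      "ftp://example.com/index.html"):
        print(candidate, accept_url(candidate))
```

Note that a plain suffix test like this one does not catch URLs where a query string follows the file extension.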

Re: [Nutch-general] Java Programmatic Access to Invoking Search

2007-03-09 Thread Dennis Kubes
will need some experience with various query types. How do we specify the directory where our crawl results are located to the query engine? This is specified by the searcher.dir configuration variable. Dennis Kubes Is the API for Lucene the one I should use to retrieve results? How

Re: [Nutch-general] Behavior of nutch-site.xml vs. hadoop-site.xml

2007-03-02 Thread Dennis Kubes
properties are overridden, not the entire file. Practically you should define properties having to do with Hadoop (e.g. the DFS, MapReduce, etc.) in the hadoop-site.xml and properties having to do with Nutch (e.g. fetcher, url-normalizers, etc.) in the nutch-site.xml. Dennis Kubes Ricardo J. Méndez wrote

Re: [Nutch-general] Behavior of nutch-site.xml vs. hadoop-site.xml

2007-03-02 Thread Dennis Kubes
.. How does that happen .. cos I am trying my best to read the code but I can't go beyond parse.. I started at crawl :-) After looking through it I don't want to hijack the thread, I just thought you answered the question so clearly.. Regards On 3/2/07, Dennis Kubes [EMAIL PROTECTED] wrote

Re: [Nutch-general] Behavior of nutch-site.xml vs. hadoop-site.xml

2007-03-02 Thread Dennis Kubes
of eclipse you can move around the item on the classpath and put your favored conf directory first. Dennis Kubes Ricardo J. Méndez http://ricardo.strangevistas.net/

Re: [Nutch-general] Customizing crawling

2007-02-22 Thread Dennis Kubes
into the CrawlDb. Or if you are writing your own parse plugin, simply don't add the link to the Outlinks. Dennis Kubes Thanks in advance, Ricardo J. Méndez http://ricardo.strangevistas.net/

Re: [Nutch-general] re-fetch

2007-02-22 Thread Dennis Kubes
inject, generate, fetch process...don't use the same path) Then you can merge those results using mergedb for the CrawlDb and mergesegs for the Segments. You shouldn't have to do a full recrawl unless you don't know what pages were changed. Dennis Kubes Thanks Peter

Re: [Nutch-general] Incremental crawl using Nutch

2007-02-22 Thread Dennis Kubes
You can use the python automation script found at: http://wiki.apache.org/nutch/Automating_Fetches_with_Python I almost have a new version ready. Will post it in the next couple of days to the wiki. Dennis Kubes sandeep pujar wrote: Greetings, Are there ways we can initiate incremental

Re: [Nutch-general] focused crawls -- where to add parse filter

2007-02-18 Thread Dennis Kubes
x, y, and z then I wouldn't do it through HtmlParseFilter I would probably go with the lucene after index approach. Dennis Kubes -Brian

Re: [Nutch-general] Want to study Nutch, do I need to read the source code one word by one?

2007-02-17 Thread Dennis Kubes
through the tutorials you will have an understanding of how the system runs. Then read the Becoming_A_Nutch_Developer document on the wiki and follow the steps. This will get you started, when you have questions or errors post messages to the user list to get help. Dennis Kubes boycanfly wrote

Re: [Nutch-general] focused crawls -- where to add parse filter

2007-02-17 Thread Dennis Kubes
. This may be an area that we need to add an extension point to if one doesn't already exist. I am sure there are many more people out there that would like to selectively store content based on the content. Dennis Kubes Brian Whitman wrote: In doing whole-internet focused crawls we'd like

Re: [Nutch-general] Web Proxy Authentication

2007-02-15 Thread Dennis Kubes
Fetcher is using the correct proxy but the DNS isn't getting out. Take a look at this, it might help. http://www.rgagnon.com/javadetails/java-0085.html Dennis Kubes Damian Florczyk wrote: ekoje ekoje napisał(a): Hello, I tried to modify Nutch in order to pass through a web proxy as advice

Re: [Nutch-general] Writing plugin example

2007-02-12 Thread Dennis Kubes
Someone overwrote the login page to the wiki. I restored it and you should now be able to login regularly. Dennis Kubes rubdabadub wrote: On 2/12/07, Ricardo J. Méndez [EMAIL PROTECTED] wrote: Hi, I was checking out the plug in writing example on the Wiki at http://wiki.apache.org/nutch

[Nutch-general] ClassNotFoundException on Hadoop Trunk

2007-01-31 Thread Dennis Kubes
Is anybody else getting ClassNotFoundExceptions when running the injector on the newest trunk of Hadoop? Dennis

Re: [Nutch-general] Vertical Search Means

2007-01-30 Thread Dennis Kubes
It means searching a specific domain such as automotive, health, etc. How to do it is another story, short answer you could either index only specific sites that you know are in the domain or you could create ways to determine automatically if a page is in a domain. Dennis Kubes Reddeppa

Re: [Nutch-general] New to Nutch, a few questions

2007-01-30 Thread Dennis Kubes
believe this is default) and then you can add a required field to the query in the search.jsp for the language like this: query.addRequiredTerm("en", "lang"); // substitute language for en Many thanks, Nes Dennis Kubes

Re: [Nutch-general] Fetcher threads automation

2007-01-29 Thread Dennis Kubes
for the logging.conf file. Is that file in the same directory as the JobStream.py script? In the top of the logging file there is a section called formatters like this: [formatters] keys=simple Dennis Kubes Justin Hartman wrote: Hi Dennis This is a great contribution and I personally thank you

Re: [Nutch-general] Fetcher threads automation

2007-01-29 Thread Dennis Kubes
file at all - should i? Regards Justin On 1/29/07, Dennis Kubes [EMAIL PROTECTED] wrote: Justin, Thanks for the update. I will update the script and the wiki to be able to run this from a clean, no previous fetches run. Currently it did assume that there were at least some previous

Re: [Nutch-general] Lease expired exception

2007-01-28 Thread Dennis Kubes
There was some work done on this problem in hadoop a while back so my guess is you are probably using a version of Nutch 0.8? Take a look at HADOOP-563 in the Jira. Dennis Kubes djames wrote: Hello, During the parse of a fetch of 600 000 pages in a cluster of 5 boxes, the job failed with this

Re: [Nutch-general] Fetcher threads automation

2007-01-28 Thread Dennis Kubes
of job streams in python but that is not complete yet. Andrzej, do you think this is something we should post to the wiki? Dennis Kubes Justin Hartman wrote: Hi all Just have a couple more questions which remain unclear to me at this stage. 1. I'm fetching urls on a P4 2.8ghz machine

Re: [Nutch-general] Lease expired exception

2007-01-28 Thread Dennis Kubes
+Calls%22 That being said it is important to have the time synchronized between the machines and there are other errors (mostly stalls) that will occur if they are not synchronized. Dennis Kubes djames wrote: Thanks a lot for your response, I'm using nutch 0.8.1. I will rebuild hadoop

Re: [Nutch-general] Fetcher threads automation

2007-01-28 Thread Dennis Kubes
It is up on the wiki at the following location. http://wiki.apache.org/nutch/Automating_Fetches_with_Python It has also been added to the front page. Dennis Kubes Andrzej Bialecki wrote: Dennis Kubes wrote: We have a python script with logging which fully automates the fetching

Re: [Nutch-general] Nutch Crawler (.81) picking up strange links

2007-01-12 Thread Dennis Kubes
limit file types with prefix, suffix, or regex filters. Let me know if you need to know more about how to do that. Dennis Kubes Steve Kallestad wrote: I've implemented nutch as a site search to try it out. When I crawl my own site with nutch, I end up with a strange set of links

Re: [Nutch-general] Filtering URLs in CrawlDB

2007-01-09 Thread Dennis Kubes
My stupid mistake. I am using an older version, customized .8 branch which didn't have normalization. I added normalization to it but in the process wasn't updating the key with the normalized url for mergesegs filtering. Dennis Andrzej Bialecki wrote: Dennis Kubes wrote: If I wrote a new

Re: [Nutch-general] Issues Starting Hadoop Process in Nutch0.9l.1

2007-01-07 Thread Dennis Kubes
and 0.8.2, same problems, and I also tried with the 0.9.2 version and can't succeed, so I feel there is something to do with configurations? Dennis Kubes wrote: Can you ping the master computer (name node) from the slave (data node) computers. Also is your namenode configuration

Re: [Nutch-general] re-parse hang?

2007-01-04 Thread Dennis Kubes
What nutch version are you using and what is your setup? An 80K reparse should only take a few minutes at most. Dennis Brian Whitman wrote: On yesterday's nutch-nightly, from Dennis Kubes' suggestions on how to normalize URLs, I removed the parsed folders via rm -rf crawl_parse parse_data

Re: [Nutch-general] Issues Starting Hadoop Process in Nutch0.9l.1

2007-01-04 Thread Dennis Kubes
I would take a look at the processes on the namenode server and see if the namenode has started up. It doesn't look like it did. If this is a new install, did you format the namenode? Dennis srinath wrote: Hi, While starting hadoop process we are getting the following error in logs

Re: [Nutch-general] NutchBean searching options

2007-01-03 Thread Dennis Kubes
NutchBean creates a query through the [Query query = Query.parse(args[0], conf);] call in its main method. The actual query object is created behind the scenes by the whole nutch analysis mechanism. This does a lot of work that is helpful in creating general queries but it is not the only

Re: [Nutch-general] Duplicate URLs with slightly different URIs.. how to normalize?

2007-01-03 Thread Dennis Kubes
-parse, re-index and then dedup. Another option is a url filter that simply removes urls with the #a as they are internal links. Again you would need to re-parse, etc. Let me know if you need more information on how to do this. Dennis Kubes Brian Whitman wrote: I'm using Solr to search

Re: [Nutch-general] NUTCH 0.8.1: Difficulties with Analyzers

2007-01-01 Thread Dennis Kubes
I have not used the French analyzer...but did you use the French analyzer for both indexing and searching? Dennis [EMAIL PROTECTED] wrote: I am having a hard time implementing the French Analyzer... Any help will be immensely appreciated. Here are the details, first I tried with the

Re: [Nutch-general] how to crawl Specified type files?

2006-12-31 Thread Dennis Kubes
You can use prefix and suffix filters by making sure the plugin.includes variable in the nutch-*.xml file has the urlfilters configured with the urlfilter variable like so: urlfilter-(prefix|suffix)... Then you will need the prefix-urlfilter.txt and suffix-urlfilter.txt files in the conf

Re: [Nutch-general] Need help with deleteduplicates

2006-12-29 Thread Dennis Kubes
(that.hash)) { // order first by hash return this.hash.compareTo(that.hash); ... So, is that where I would place my similarity score and return that value there? Dennis Kubes wrote: If I am understanding what you are asking, in the getRecordReader method of the InputFormat inner

Re: [Nutch-general] query to hit all

2006-11-08 Thread Dennis Kubes
Segment is indexed as a field so you could write a query filter that includes the segment name. You could also use an IndexReader and loop through document by document from 0 to maxDoc() - 1 checking for the segment field. The second option is much more resource intensive though. Dennis

Re: [Nutch-general] generate db segments topN with TYPE

2006-10-23 Thread Dennis Kubes
You could use suffix filters to filter out any document that isn't a PDF. Dennis Marco Vanossi wrote: Hi, Do you think there is an easy way to make nutch generate a list of only certain document types to fetch? For example: If one would like to crawl only PDF docs (after some pages

Re: [Nutch-general] fetch fails at reduce stage because can not sense heartbeat for 600 seconds

2006-10-18 Thread Dennis Kubes
I agree with Andrzej that a thread dump would be best. Also what version of nutch are you using? Dennis Andrzej Bialecki wrote: Mike Smith wrote: Hi Dennis, But it doesn't make sense since the reducers' keys are URLs and the heartbeat cannot be sent when the reduce task is called. Since

Re: [Nutch-general] Extending BasicQueryFilter for a new plugiin?

2006-10-17 Thread Dennis Kubes
I don't know exactly what you are wanting to do below. Adding a term through a query filter would be something like this: import org.apache.nutch.searcher.FieldQueryFilter; import org.apache.hadoop.conf.Configuration; public class NewQueryFilter extends FieldQueryFilter { public

Re: [Nutch-general] java 1.5 or 1.4

2006-10-17 Thread Dennis Kubes
A guess would be that somewhere in your classpath you have the wrong version of xalan. Dennis NG-Marketing, M.Schneider wrote: Hello list, when I use Java 1.4 everything works well, but if I switch to 1.5 i have the following error:

Re: [Nutch-general] fetch fails at reduce stage because can not sense heartbeat for 600 seconds

2006-10-17 Thread Dennis Kubes
I have seen this happen before if the box is loaded down with too many tasks and the IO is maxed. I have also seen this happen when the regex filters spin out. We changed our systems to use only prefix and suffix url filters and that cleared up those types of problems for us. Dennis Mike

Re: [Nutch-general] HELP: Why crawled files so small? nutch version 0.8.1

2006-10-11 Thread Dennis Kubes
Did you set the user agent name in the nutch-site.xml file? Dennis kevin wrote: Why is the crawled file so small? Total size: 12.4 KB I used this command: ./nutch crawl urls -dir crawled -depth 20 However, the website I crawled is not so small. Regards!

Re: [Nutch-general] no results in nutch 0.8.1

2006-09-28 Thread Dennis Kubes
the *. Also, in the log file, I can not find any error regarding this - Original Message - From: Dennis Kubes [EMAIL PROTECTED] To: nutch-user@lucene.apache.org Sent: Wednesday, September 27, 2006 7:59 PM Subject: Re: no results in nutch 0.8.1 Did you setup the user agent name

Re: [Nutch-general] no results in nutch 0.8.1

2006-09-27 Thread Dennis Kubes
Did you set up the user agent name in the nutch-site.xml file or nutch-default.xml file? Dennis carmmello wrote: I have followed the steps in the 0.8.1 tutorial and, also, I have been using Nutch for some time now, without seeing the kind of problem I am encountering now. After I have

Re: [Nutch-general] how to turn on fetcher log?

2006-09-21 Thread Dennis Kubes
It depends on settings in the conf/log4j.properties file for the level of logging. The log files are in the HADOOP_LOG_DIR directory which can be set in the hadoop-env.sh file in the conf directory. Usually the file is called hadoop-phoenix-tasktracker... Dennis Mike Smith wrote: Hi, I

Re: [Nutch-general] Fetcher aborts with hung threads

2006-09-18 Thread Dennis Kubes
Do a search on this mailing list for fetcher slowness and you will find a thread detailing this subject. Basically it is due to long crawl delays. Patches have been submitted on that thread. Dennis Bruno Thiel wrote: Hi all, I have got a problem with the fetcher (nutch-0.8). The Fetcher

Re: [Nutch-general] How to build nutch with ant?

2006-09-17 Thread Dennis Kubes
Run ant package. The full distribution is under the build/nutch-x.x folder. heack wrote: I run ant in nutch base dir, and it compiles successfully. But it does not generate nutch-0.8.jar or nutch-0.8.war, only a nutch-0.8.job file (and other plugin classes) in the build folder. What options should I use

Re: [Nutch-general] (Problem)Why after I ran ant in nutch base dir, NO nutch-0.8.jar found in build dir?

2006-09-17 Thread Dennis Kubes
because the default target is job which creates the job file, run package to create all. heack wrote: Only a nutch-0.8.job file there. And also question what the next step should I do after I modified source code like NutchAnalysis.jj and use ant to build it? The search.jsp seems not use

Re: [Nutch-general] Nutch Cannot Find Indexed Pages?

2006-09-14 Thread Dennis Kubes
Does it not have anything in the database or are there entries in the index but nothing is being returned by the search? Dennis victor_emailbox wrote: Can anyone help? Thanks. victor_emailbox wrote: Hi, I followed all the steps in the 0.8 tutorial except that I have only 2 urls in the

Re: [Nutch-general] ClassNotFoundException while using segread

2006-09-12 Thread Dennis Kubes
Isn't this the same problem that was happening before with the SegmentMerger I think where the nutch-x.x.jar needed to be added to the classpath on all of the task trackers. We added the following code to our hadoop script just below the other for loops and redeployed script and restarted all

[Nutch-general] crawl_generate

2006-09-11 Thread Dennis Kubes
Besides the initial fetch is the crawl_generate folder in a segment used anywhere else? Would it be safe to delete or not have the crawl_generate folder while searching? Dennis

Re: [Nutch-general] java.lang.OutOfMemoryError: Java heap space

2006-09-10 Thread Dennis Kubes
I don't know if it is the same in 0.7.2 but in 0.8 there is a hadoop-env.sh file where you can uncomment the JAVA_OPTS variable and give the heap more memory. Either way the JVM must be started with more memory, something like this VM option: -Xmx1024M for a 1 GB heap. Dennis Bogdan Kecman

Re: [Nutch-general] Customize the crawl process

2006-09-08 Thread Dennis Kubes
You would need to modify Fetcher line 433 to use a text output format like this: job.setOutputFormat(TextOutputFormat.class); and you would need to modify Fetcher line 307 to only collect the information you are looking for, maybe something like this: Outlink[] links =

Re: [Nutch-general] # of tasks executed in parallel

2006-09-08 Thread Dennis Kubes
How many urls are you fetching and does each machine have the same settings as below? Remember that number of fetchers is number of fetcher threads per task per machine. So you would be running 2 tasks per machine * 12 threads * 3 machines = 72 fetchers. Dennis Vishal Shah wrote: Hi,
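
The multiplication can be written out as a short Python helper. The function and parameter names are illustrative only; the figures are the ones quoted in the message (2 tasks per machine, 12 threads per task, 3 machines).

```python
def total_fetcher_threads(machines, tasks_per_machine, threads_per_task):
    """Total concurrent fetcher threads across the whole cluster."""
    return machines * tasks_per_machine * threads_per_task


if __name__ == "__main__":
    print(total_fetcher_threads(machines=3, tasks_per_machine=2, threads_per_task=12))
```

For those figures the product is 72 fetcher threads in total.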

Re: [Nutch-general] Reduce Error during fetch

2006-09-08 Thread Dennis Kubes
You may be running into problems with regex stalls on filtering. Try removing the regex filter from the nutch-site.xml plugin.includes property. I was having similar problems before switching to just use prefix and suffix filters as below. I attached my prefix and suffix url filter files

Re: [Nutch-general] two nutch indexes on same webserver

2006-09-08 Thread Dennis Kubes
Assuming you have two separate war files deployed, it should be as easy as setting the searcher.dir property in the nutch-site.xml file in the different web-inf directories to the separate index locations. If you want to go the distributed searching route there is an in-depth explanation on

Re: [Nutch-general] how to combine two run's result for search

2006-09-05 Thread Dennis Kubes
Are those like the shuttle boards? Smaller 1/4 size boxes? Dennis Zaheed Haque wrote: Renaud: Yes or no! I have done some testing as Dennis Kubes suggested and got results similar to his test. In short having 4 nutch search servers in one box but on 4 different disks, within my case

Re: [Nutch-general] how to combine two run's result for search

2006-09-04 Thread Dennis Kubes
You can keep the indexes separate and use the distributed search server, one per index, or you can use the mergedb and mergesegs commands to merge the two runs into a single crawldb and a single segment, then re-run invertlinks and index to create a single index file which can then be

Re: [Nutch-general] how to set NUTCH_JAVA_HOME

2006-08-29 Thread Dennis Kubes
In Windows, under Control Panel - System - Advanced - Environment Variables - System Variables. Dennis Kubes Philip Brown wrote: nutnoob wrote: How do I set NUTCH_JAVA_HOME? I have Java installed on the machine but don't know how to set it for Nutch. Please help me. see link: Setting

Re: [Nutch-general] How long to get 100 million page

2006-08-24 Thread Dennis Kubes
You will also need more than 1 terabyte to get to 100 million pages. A good rule of thumb is 2 gigs * replication factor for every 1 million pages. Dennis Dan Morrill wrote: Hi, I found that with a 3 meg DSL line I was averaging 8 pages per second with a similar set up, to reach 100

Re: [Nutch-general] problem in crawling......

2006-08-22 Thread Dennis Kubes
Unfortunately you have to start over. We started breaking our crawls into 100K to 500K runs because of this. Dennis Abdelhakim Diab wrote: Hi all: What can I do if I was crawling a big list of sites and the crawler suddenly stopped because of some problem? Must I rerun the whole process or I

Re: [Nutch-general] what Linux distribution goes best with Nutch?

2006-08-17 Thread Dennis Kubes
I installed nutch, tomcat, and java fresh. All of my FC5 installs use only a minimal set of packages, I think just editors, admin tools and base. I don't put X servers on them. We also use network boots and kickstart loads to get a consistent install across machines. We install java,

Re: [Nutch-general] On fetcher slowness

2006-08-13 Thread Dennis Kubes
But the problems that I run into are the fetcher thread hangs, and the crawl delay/robots.txt file (please see Dennis Kubes's posting on this). Yes, these are definitely problems. Stefan has been working on a queue-based fetcher that uses NIO. Seems very promising, but not yet ready for prime time. -- Ken

Re: [Nutch-general] crawl w/o store

2006-08-13 Thread Dennis Kubes
You can add the property to the nutch-site.xml file to take precedence over the default in the nutch-default.xml file. The value is as below. This is for Nutch 0.8. I am not sure if this is the same for 0.7.2. <property> <name>fetcher.store.content</name> <value>false</value> <description>If true, fetcher

Re: [Nutch-general] On fetcher slowness

2006-08-13 Thread Dennis Kubes
maximize my full bandwidth. But the problems that I run into are the fetcher thread hangs, and the crawl delay/robots.txt file (please see Dennis Kubes's posting on this). Yes, these are definitely problems. Stefan has been working on a queue-based fetcher that uses NIO. Seems very promising

Re: [Nutch-general] crawl-urlfilter subpages of domains

2006-08-12 Thread Dennis Kubes
You can use a suffix filter if there are no query strings. Dennis Jens Martin Schubert wrote: Hello, is it possible to crawl e.g. http://www.domain.com, but to skip crawling all urls matching to (http://www.domain.com/subpage/) I tried to achieve this with

Re: [Nutch-general] number of mapper

2006-08-10 Thread Dennis Kubes
There is also a mapred.tasktracker.tasks.maximum variable which may be causing the task number to be different. Dennis Murat Ali Bayir wrote: Hi everybody, Although I change the number of mappers in hadoop-site.xml and use the job.setNumMapTasks method, the system gives another number as a

Re: [Nutch-general] problems with start-all command

2006-08-10 Thread Dennis Kubes
The name node is running. Run the bin/stop-all.sh script first and then do a ps -ef | grep NameNode to see if the process is still running. If it is, it may need to be killed by hand with kill -9 processid. The second problem is the setup of ssh keys as described in the previous email. Also I would
