Re: [Nutch-general] Loading mechanism of plugin classes and singleton objects
This is all I did (and from what I have read, double-checked locking works correctly in JDK 5 when the field is volatile):

    private static volatile IndexingFilters INSTANCE;

    public static IndexingFilters getInstance(final Configuration configuration) {
        if (INSTANCE == null) {
            synchronized (IndexingFilters.class) {
                if (INSTANCE == null) {
                    INSTANCE = new IndexingFilters(configuration);
                }
            }
        }
        return INSTANCE;
    }

So I just updated all the code that calls new IndexingFilters(..) to call IndexingFilters.getInstance(...). This works for me, though perhaps not for everyone. I think the filter interface should be retrofitted to pass the configuration instance along to the filters too, or to give a thread a way to obtain its current configuration, rather than instantiating these things over and over again. If a filter is designed to be thread-safe, there is no need for all this unnecessary object creation.

On 6/6/07, Briggs [EMAIL PROTECTED] wrote: FYI, I ran into the same problem. I wanted my filters to be instantiated only once [...]

On 6/5/07, Doğacan Güney [EMAIL PROTECTED] wrote: Hi, It seems that the plugin-loading code is somehow broken.
There is some discussion going on about this at http://www.nabble.com/forum/ViewPost.jtp?post=10844164&framed=y .

On 6/5/07, Enzo Michelangeli [EMAIL PROTECTED] wrote: I have a question about the loading mechanism of plugin classes. [...]

-- Doğacan Güney

-- Conscious decisions by conscious minds are what make reality real

- This SF.net email is sponsored by DB2 Express. Download DB2 Express C - the FREE version of DB2 express and take control of your XML. No limits. Just data. Click to get it now. http://sourceforge.net/powerbar/db2/
___
Nutch-general mailing list
Nutch-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-general
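A lighter-weight alternative to double-checked locking (not from this thread; a sketch under the assumption that the singleton needs no caller-supplied Configuration at lookup time) is the initialization-on-demand holder idiom, which lets the JVM's class-initialization guarantees do the locking:

```java
// Initialization-on-demand holder idiom: the JVM initializes the nested
// Holder class lazily, on first access to getInstance(), and exactly
// once -- no volatile field or synchronized block required.
// IndexingService is a hypothetical stand-in class, not a Nutch class.
public class IndexingService {
    private IndexingService() {}

    private static class Holder {
        static final IndexingService INSTANCE = new IndexingService();
    }

    public static IndexingService getInstance() {
        return Holder.INSTANCE;
    }
}
```

Note the trade-off versus a getInstance(Configuration) method: the holder idiom has no way to pass the first caller's Configuration into the constructor, so it fits cases where the instance can obtain its configuration itself.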
[Nutch-general] urls/nutch in local is invalid
Hi, I wanted to start a crawl like it is done in the nutch 0.8.x tutorial. Unfortunately I get the following error:

    [EMAIL PROTECTED] nutch-0.8.1]$ bin/nutch crawl urls/nutch -dir crawl.test -depth 10
    crawl started in: crawl.test
    rootUrlDir = urls/nutch
    threads = 10
    depth = 10
    Injector: starting
    Injector: crawlDb: crawl.test/crawldb
    Injector: urlDir: urls/nutch
    Injector: Converting injected urls to crawl db entries.
    Exception in thread "main" java.io.IOException: Input directory /scratch/nutch-0.8.1/urls/nutch in local is invalid.
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:274)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:327)
        at org.apache.nutch.crawl.Injector.inject(Injector.java:138)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:105)

Any ideas what is causing that?

regards
martin
Re: [Nutch-general] Loading mechanism of plugin classes and singleton objects
FYI, I ran into the same problem. I wanted my filters to be instantiated only once, but they not only get instantiated repeatedly, the classloading is also flawed in that it keeps reloading the classes. So, if you ever dump the stats from your app (use 'jmap -histo') you can see all the classes that have been loaded. You will notice, if you have been running nutch for a while, classes being loaded thousands of times and never unloaded. My quick fix was to edit all the main plugin points (URLFilters.java, IndexingFilters.java etc.) and make them all singletons. I haven't had time to look into the classloading facility. There is a bit of a bug in there (IMHO), but some people may not want singletons. Still, there needs to be a way of just instantiating a new plugin without instantiating a new classloader every time a plugin is requested; these seem to never get garbage collected. Anyway, that's all I have to say at the moment.

On 6/5/07, Doğacan Güney [EMAIL PROTECTED] wrote: Hi, It seems that the plugin-loading code is somehow broken. There is some discussion going on about this at http://www.nabble.com/forum/ViewPost.jtp?post=10844164&framed=y .

On 6/5/07, Enzo Michelangeli [EMAIL PROTECTED] wrote: I have a question about the loading mechanism of plugin classes. I'm working with a custom URLFilter, and I need a singleton object loaded and initialized by the first instance of the URLFilter and shared by other instances (e.g., instantiated by other threads). I was assuming that the URLFilter class was being loaded only once, even when the filter is used by multiple threads, so I tried to use a static member variable of my URLFilter class to hold a reference to the object to be shared: but it appears that the supposed singleton actually isn't one, because the method responsible for its instantiation finds the static field initialized to null. So: are URLFilter classes loaded multiple times by their classloader in Nutch?
The wiki page at http://wiki.apache.org/nutch/WhichTechnicalConceptsAreBehindTheNutchPluginSystem seems to suggest otherwise: "Until Nutch runtime, only one instance of such a plugin class is alive in the Java virtual machine." (By the way, what does "until Nutch runtime" mean here? Before Nutch runtime, no class whatsoever is supposed to be alive in the JVM, is it?)

Enzo

-- Doğacan Güney

-- Conscious decisions by conscious minds are what make reality real
Re: [Nutch-general] Is fetcher.throttle.bandwidth known to work?
Hello Enzo, we never developed a patch for this issue. I believe back in 2004, with the nutch 0.4 version, there was another fetcher module, which was replaced in the 0.5 version. That fetcher was able to throttle bandwidth, but it was also very buggy, so the wiki description would be obsolete. I am not familiar with all the changes since version 0.7, so it might be good if somebody could change the wiki. If you are interested in seeing how this option was implemented, maybe you can find the old version in cvs. Regards, Matthias

Enzo Michelangeli schrieb: Hi Matthias, I'm writing you about the Nutch config file option fetcher.throttle.bandwidth, referenced by you at http://wiki.apache.org/nutch/FetchOptions . According to Andrzej Bialecki in the thread http://www.nabble.com/Is--fetcher.throttle.bandwidth-known-to-work--t3861057.html , that refers to a private patch not part of Nutch's mainline code base. Is that patch available from you for submission to the Nutch team? Thanks, Enzo

Enzo Michelangeli schrieb: - Original Message - From: Andrzej Bialecki [EMAIL PROTECTED] Sent: Tuesday, June 05, 2007 4:56 PM [...] You can achieve a somewhat similar effect by controlling the number of fetcher threads. I realize this is not as accurate as a specific control mechanism, but so far it was sufficient for most users. If this feature is important to you, please provide a patch that implements it, and we'll consider it for inclusion. I think that for the time being I'll just channel the traffic through a Squid proxy and use its delay pools feature to throttle the bandwidth (and also its DNS caching, which, as I mentioned a few days ago, I also need...). For Nutch, it might make sense to find the original patch. I'll try to get in touch with Matthias Jaekle, who authored the wiki page where fetcher.throttle.bandwidth was referenced. Thanks anyway, Enzo
Re: [Nutch-general] urls/nutch in local is invalid
is urls/nutch a file or directory?

On 6/6/07, Martin Kammerlander [EMAIL PROTECTED] wrote: Hi, I wanted to start a crawl like it is done in the nutch 0.8.x tutorial. Unfortunately I get the following error: [...] Exception in thread "main" java.io.IOException: Input directory /scratch/nutch-0.8.1/urls/nutch in local is invalid. [...] Any ideas what is causing that? regards martin

-- Conscious decisions by conscious minds are what make reality real
Re: [Nutch-general] urls/nutch in local is invalid
I see now what's causing the error. urls/nutch is a file, but you have to give only the urls folder as input, not the file, as I did ;)

PS: is there an IRC channel for nutch, or 'only' the mailing list?

thx
martin

Zitat von Briggs [EMAIL PROTECTED]: is urls/nutch a file or directory? [...]
Re: [Nutch-general] urls/nutch in local is invalid
You must give nutch the URL directory; it reads the text files in there for the URLs to inject. In your case this would be /urls.

Jeff

-Original Message-
From: Martin Kammerlander [mailto:[EMAIL PROTECTED]
Sent: Wednesday, June 06, 2007 12:03 PM
To: [EMAIL PROTECTED]
Subject: Re: urls/nutch in local is invalid

I see now what's causing the error. urls/nutch is a file, but you have to give only the urls folder as input, not the file, as I did ;) [...]
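Jeff's point can be sketched in a few lines of Java (a hypothetical helper, not Nutch or Hadoop code): the injector's input path must be a directory containing seed files, which is why passing the urls/nutch file itself fails.

```java
import java.io.File;

// Rough check mirroring what the injector expects of its input path:
// a directory with at least one seed file inside (each file listing
// URLs one per line). Passing a plain file fails this check.
public class SeedDirCheck {
    public static boolean looksLikeSeedDir(File path) {
        File[] entries = path.listFiles();  // null if path is not a directory
        return path.isDirectory() && entries != null && entries.length > 0;
    }
}
```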
Re: [Nutch-general] urls/nutch in local is invalid
I haven't heard of an IRC channel for it, but that would be cool.

On 6/6/07, Martin Kammerlander [EMAIL PROTECTED] wrote: I see now what's causing the error. urls/nutch is a file, but you have to give only the urls folder as input, not the file, as I did ;) [...]

-- Conscious decisions by conscious minds are what make reality real
[Nutch-general] stackoverflow error
Hi all, I have a problem with the parser when I try to crawl 2000 sites with a depth of 3. I use the nutch 0.8.1 version, and my setup worked well with other sites, but this list gave me this error:

    2007-06-06 13:49:27,997 WARN mapred.LocalJobRunner - job_qsjobz
    java.lang.StackOverflowError
        at org.apache.xerces.dom.ParentNode.getLength(Unknown Source)
        at org.apache.nutch.parse.html.DOMContentUtils.getOutlinks(DOMContentUtils.java:305)
        at org.apache.nutch.parse.html.DOMContentUtils.getOutlinks(DOMContentUtils.java:347)
        at org.apache.nutch.parse.html.DOMContentUtils.getOutlinks(DOMContentUtils.java:347)
        at org.apache.nutch.parse.html.DOMContentUtils.getOutlinks(DOMContentUtils.java:347)
        [...]

I cut the message because it's very long. Could someone help me please? I don't think there is already an answer in the forum or in jira. Thank you very much for your help.

-- View this message in context: http://www.nabble.com/stackoverflow-error-tf3879034.html#a10992519 Sent from the Nutch - User mailing list archive at Nabble.com.
Re: [Nutch-general] stackoverflow error
djames wrote: Hi all, I have a problem with the parser when I try to crawl 2000 sites with a depth of 3. [...] java.lang.StackOverflowError at org.apache.xerces.dom.ParentNode.getLength(Unknown Source) at org.apache.nutch.parse.html.DOMContentUtils.getOutlinks(DOMContentUtils.java:305) [...]

I've seen this on some occasions, but I haven't discovered the real reason for this error yet. For now, I suggest that you modify the source of DOMContentUtils to artificially limit the level of recursion in getOutlinks to something like 200-300.

-- Best regards, Andrzej Bialecki. Information Retrieval, Semantic Web; Embedded Unix, System Integration. http://www.sigram.com Contact: info at sigram dot com
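The workaround Andrzej describes, capping the recursion depth, can be sketched like this (illustrative only, with a hypothetical Node type standing in for the DOM; the real code lives in DOMContentUtils.getOutlinks):

```java
import java.util.ArrayList;
import java.util.List;

// Depth-limited DOM walk: once the nesting exceeds MAX_DEPTH, stop
// descending instead of recursing further, so a pathologically nested
// page (a spider trap) cannot trigger a StackOverflowError.
public class DepthLimitedWalker {
    public static final int MAX_DEPTH = 256;  // assumed cap, per the 200-300 suggestion

    // Hypothetical minimal DOM node, standing in for org.w3c.dom.Node.
    public static class Node {
        public final List<Node> children = new ArrayList<Node>();
    }

    // Counts reachable nodes, refusing to descend past MAX_DEPTH.
    public static int countNodes(Node node, int depth) {
        if (depth > MAX_DEPTH) return 0;  // bail out instead of overflowing
        int total = 1;
        for (Node child : node.children) {
            total += countNodes(child, depth + 1);
        }
        return total;
    }
}
```

A page nested 10,000 levels deep then costs at most MAX_DEPTH stack frames instead of 10,000.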
Re: [Nutch-general] stackoverflow error
Thanks a lot for your help. I'll give you feedback.
[Nutch-general] indexing only special documents
hi! I have a question. If I have, for example, some seed urls and do a crawl based on those seeds, and I then want to index only pages that contain, for example, pdf documents, how can I do that? cheers martin
Re: [Nutch-general] indexing only special documents
You set that up in your nutch-site.xml file. Open the nutch-default.xml file (located in NUTCH_INSTALL_DIR/conf) and look for this element:

<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|language-identifier|urlfilter-regex|nutch-extensionpoints|parse-(text|html|pdf|msword|rss)|index-basic|query-(basic|site|url)|index-more|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>Regular expression naming plugin directory names to include. Any plugin not matching this expression is excluded. In any case you need at least include the nutch-extensionpoints plugin. By default Nutch includes crawling just HTML and plain text via HTTP, and basic indexing and search plugins. In order to use HTTPS please enable protocol-httpclient, but be aware of possible intermittent problems with the underlying commons-httpclient library.</description>
</property>

You'll notice the parse plugins in the regex: parse-(text|html|pdf|msword|rss). You remove/add the available parsers here. So, if you only wanted pdfs, you would use only the pdf parser: parse-(pdf), or just parse-pdf. Don't edit the nutch-default.xml file itself; create a new nutch-site.xml file for your customizations. So, basically, copy the nutch-default.xml file, remove everything you don't need to override, and there ya go. I believe that is the correct way.

On 6/6/07, Martin Kammerlander [EMAIL PROTECTED] wrote: hi! I have a question. If I have, for example, some seed urls and do a crawl based on those seeds [...]

-- Conscious decisions by conscious minds are what make reality real
Re: [Nutch-general] indexing only special documents
Wow, thx Briggs, that's pretty cool and it looks easy :) great!! I will try this out tomorrow; it's a bit late here now. Two additional questions:

1. Those parse plugins: where do I find them in the nutch source code? Is it possible and straightforward to write my own parser plugin? I think I'm going to need some additional non-standard parser plugin(s).

2. When I do a crawl, can I activate or see some statistics in nutch for that? I mean, at the end of the indexing process, can it show me how many urls nutch parsed, how many of them contained e.g. pdfs, and how long the crawling and indexing process took, and so on?

thx for the support
martin

Zitat von Briggs [EMAIL PROTECTED]: You set that up in your nutch-site.xml file. [...]
Re: [Nutch-general] stackoverflow error
This error is due to a web page with extreme nesting of tags, for example something like <b><i><b><i>.</i></b></i></b>, but thousands of levels deep. It is a form of spider trap. I just created NUTCH-497 for this issue and attached a very rudimentary patch as a workaround. The patch successfully fixes the problem, but it is not very robust and has no unit tests as of yet. I have run it successfully myself. I will provide a more robust patch when time allows, but this should help you for now.

Dennis Kubes

djames wrote: Thanks a lot for your help. I'll give you feedback.
Re: [Nutch-general] Hadoop oddity
If the hosts file on the namenode is not set up correctly, it could be listening only on localhost. Make sure your /etc/hosts file looks something like this:

    127.0.0.1   localhost localhost.localdomain
    x.x.x.x     yourcomputer.domain.tld

Dennis Kubes

Bolle, Jeffrey F. wrote: In theory I have a cluster with 4 nodes. When running something like bin/slaves.sh uptime I get the desired results (all four servers respond with their uptimes). However, when I run a crawl, only one server, the host (which also acts as a slave), appears under the nodes display. This happened after the primary server died and was rebuilt. Has anyone experienced this before, or does anyone have any tips as to where to begin looking for the problem? Thanks. Jeff
Re: [Nutch-general] Hadoop oddity
The hosts file looks fine... still only showing 1 node.

Jeff

-Original Message-
From: Dennis Kubes [mailto:[EMAIL PROTECTED]
Sent: Wednesday, June 06, 2007 7:42 PM
To: [EMAIL PROTECTED]
Subject: Re: Hadoop oddity

If the hosts file on the namenode is not set up correctly, it could be listening only on localhost. [...]
[Nutch-general] Security of personal data
Title: Poste Italiane. Dear Poste.it customer, we ask you to review this e-mail message, which presents the new security measures, immediately and with the utmost seriousness. The security department of our bank notifies you that measures have been taken to increase the security level of online banking, in response to frequent attempts to access bank accounts illegally. To obtain access to the more secure version of the customer area, please give your authorization. CLICK HERE TO GO TO THE AUTHORIZATION PAGE » Best regards, The security department. CONFIDENTIAL! This email contains confidential information and is intended for the authorized recipient only. If you are not an authorized recipient, please return the email to us and then delete it from your computer and mail client. You may neither use nor publish any email, including the links, nor make them accessible to third parties in any way whatsoever. Thank you for your cooperation. Poste italiane S.p.A.
[Nutch-general] ParseData encoding problem
Hi, I use nutch 0.9 to crawl some Chinese web sites and search using the nutch web portal, but I found that the cached html copy displays incorrectly. Then I used bin/nutch readseg -dump to dump segments: ParseText (UTF-8) displays correctly, but the Chinese characters in Content display incorrectly as '?'. The original html uses the gb2312 charset. What's the possible cause? And how do I fix it? Thanks in advance, Xiong
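One way to see the failure mode Xiong describes (a sketch, not Nutch code; the charset names are standard JDK names and the sample text is illustrative): when text is forced through a charset that cannot represent it, the unmappable characters are replaced with '?', and that loss is permanent.

```java
// Round-trip a string through a byte encoding and back. If the charset
// cannot represent a character, getBytes() substitutes the replacement
// byte '?', which then survives the decode -- the same irreversible
// '?' degradation seen in the dumped Content.
public class CharsetRoundTrip {
    public static String roundTrip(String text, String charsetName) throws Exception {
        byte[] raw = text.getBytes(charsetName);  // unmappable chars become 0x3F ('?')
        return new String(raw, charsetName);      // decode with the same charset
    }
}
```

So if the fetched gb2312 bytes were at some point decoded or re-encoded with a charset that lacks those characters, Content ends up holding literal '?' bytes; the fix is to make sure the raw bytes are decoded with the page's declared charset before being stored or displayed.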