OOM error during parsing with nekohtml
Hi All,

We are getting an OutOfMemoryError during the processing of
http://www.fotofinity.com/cgi-bin/homepages.cgi . We have also applied the
NUTCH-497 patch to our source code, but the error actually occurs during the
parse method. Does anybody have any idea about this? Here is the complete
stack trace:

java.lang.OutOfMemoryError: Java heap space
        at java.lang.String.toUpperCase(String.java:2637)
        at java.lang.String.toUpperCase(String.java:2660)
        at org.cyberneko.html.filters.NamespaceBinder.bindNamespaces(NamespaceBinder.java:443)
        at org.cyberneko.html.filters.NamespaceBinder.startElement(NamespaceBinder.java:252)
        at org.cyberneko.html.HTMLTagBalancer.callStartElement(HTMLTagBalancer.java:1009)
        at org.cyberneko.html.HTMLTagBalancer.startElement(HTMLTagBalancer.java:639)
        at org.cyberneko.html.HTMLTagBalancer.startElement(HTMLTagBalancer.java:646)
        at org.cyberneko.html.HTMLScanner$ContentScanner.scanStartElement(HTMLScanner.java:2343)
        at org.cyberneko.html.HTMLScanner$ContentScanner.scan(HTMLScanner.java:1820)
        at org.cyberneko.html.HTMLScanner.scanDocument(HTMLScanner.java:789)
        at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:478)
        at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:431)
        at org.cyberneko.html.parsers.DOMFragmentParser.parse(DOMFragmentParser.java:164)
        at org.apache.nutch.parse.html.HtmlParser.parseNeko(HtmlParser.java:265)
        at org.apache.nutch.parse.html.HtmlParser.parse(HtmlParser.java:229)
        at org.apache.nutch.parse.html.HtmlParser.getParse(HtmlParser.java:168)
        at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:84)
        at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:75)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:175)
        at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1445)

Regards,
Shailendra
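For anyone reproducing this, a first stopgap (independent of any parser fix) is simply to give the parse map tasks more heap. A minimal sketch against the Hadoop API of this era; the property name mapred.child.java.opts is standard Hadoop, but the -Xmx value is an illustrative assumption, not a tuned recommendation:

    import org.apache.hadoop.mapred.JobConf;

    public class HeapConfigSketch {
        public static void main(String[] args) {
            // Raise the heap available to each map/reduce child JVM.
            // Equivalent to setting this property in hadoop-site.xml.
            // -Xmx512m is an illustrative value, not a recommendation.
            JobConf job = new JobConf();
            job.set("mapred.child.java.opts", "-Xmx512m");
            System.out.println(job.get("mapred.child.java.opts"));
        }
    }

This only buys headroom; as the rest of the thread shows, the underlying issue is how NekoHTML handles this particular page.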
RE: OOM error during parsing with nekohtml
I successfully ran the whole-web crawl on my new Ubuntu OS, and I am ready to
fix the bug. I need someone to guide me to the most up-to-date source code
and the bug assignment. Thank you in advance!!

Adam Shuy, President
ePacific Web Design Hosting
Professional Web/Software developer
TEL: 408-272-6946
www.epacificweb.com

-----Original Message-----
From: Shailendra Mudgal [mailto:[EMAIL PROTECTED]]
Sent: Monday, July 16, 2007 3:05 AM
To: [EMAIL PROTECTED]; nutch-dev@lucene.apache.org
Subject: OOM error during parsing with nekohtml

[quoted message and stack trace omitted; see the original post above]
[jira] Created: (NUTCH-515) Next fetch time is set incorrectly
Next fetch time is set incorrectly
----------------------------------

                 Key: NUTCH-515
                 URL: https://issues.apache.org/jira/browse/NUTCH-515
             Project: Nutch
          Issue Type: Bug
          Components: fetcher
    Affects Versions: 1.0.0
            Reporter: Doğacan Güney
            Priority: Blocker
             Fix For: 1.0.0

After NUTCH-61, the db.default.fetch.interval option is deprecated and
superseded by db.fetch.interval.default. However, various parts of Nutch
still use the old option. Since the old option is in days (with the default
being 30) and the new option is in seconds (default ~2.5M, i.e. 30 days
expressed in seconds), when Nutch fetches a url, its next fetch time is set
to ***30 SECONDS*** later. This means that Nutch keeps refetching the same
urls over and over.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
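The failure mode is easy to see in miniature. A sketch with hypothetical variable names (not the actual Nutch source) of what happens when the day-valued default of the old option is consumed by code that expects seconds:

    public class FetchIntervalSketch {
        public static void main(String[] args) {
            // Old option db.default.fetch.interval: value in DAYS, default 30.
            int intervalFromOldKey = 30;

            // Post-NUTCH-61 code treats fetch intervals as SECONDS
            // (db.fetch.interval.default, default 2592000 s = 30 days).
            long now = System.currentTimeMillis();
            long nextFetchTime = now + intervalFromOldKey * 1000L;

            // Prints 30: the url becomes due again 30 seconds later,
            // not 30 days later, hence the endless refetching.
            System.out.println((nextFetchTime - now) / 1000 + " seconds");
        }
    }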
[jira] Commented: (NUTCH-439) Top Level Domains Indexing / Scoring
[ https://issues.apache.org/jira/browse/NUTCH-439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12512930 ]

Doğacan Güney commented on NUTCH-439:
-------------------------------------

A big +1 from me. Though, it may be useful to break this patch into multiple
pieces (fixes to OPIC and the build system as one patch, core changes as
another, and the plugin as a third).

IMHO, most usages of URL.getHost should be replaced with this patch's
getDomainName. For example, the host field in the index currently gets a big
boost, but it is easy to spam hosts: just buy a domain 'example.com', set up
your own DNS, and add 'foo.example.com', 'bar.example.com', 'baz.example.com'.
I have actually seen a lot of spam sites that do this. Doing this in the
linkdb reduces anchor spam (where 'foo.example.com' gives a link to
'bar.example.com' and Nutch considers this an external link and stores the
anchor).

Another example is the generator. Instead of partitioning on host or IP, we
can partition urls based on their domains (see the sketch below). This avoids
the overhead of resolving IPs (and IP resolving has its own problems: urls
under the same domain [sometimes even the same url] may be served from
different IPs [think load balancers and such]) and will be much more polite
and resistant to honey pots.

Top Level Domains Indexing / Scoring
------------------------------------

                 Key: NUTCH-439
                 URL: https://issues.apache.org/jira/browse/NUTCH-439
             Project: Nutch
          Issue Type: New Feature
          Components: indexer
    Affects Versions: 0.9.0
            Reporter: Enis Soztutar
         Attachments: tld_plugin_v1.0.patch, tld_plugin_v1.1.patch, tld_plugin_v2.0.patch, tld_plugin_v2.1.patch

Top Level Domains (TLDs) are the last part(s) of the host name in the DNS
system. TLDs are managed by the Internet Assigned Numbers Authority (IANA),
which divides them into three groups: infrastructure, generic (such as com,
edu) and country-code TLDs (such as en, de, tr). Indexing the top level
domain, and optionally boosting on it, is needed to improve search results
and enhance locality.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
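To make the getDomainName idea concrete, here is a toy sketch of grouping hosts by registered domain. The suffix table is a tiny hard-coded assumption; the actual patch presumably ships a proper IANA-derived TLD list, and none of the names below come from it:

    import java.net.URL;
    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;

    public class DomainSketch {
        // Toy suffix table; a real implementation needs full TLD/ccTLD data.
        private static final Set<String> SUFFIXES =
            new HashSet<String>(Arrays.asList("com", "org", "net", "co.uk"));

        /** Return the registered domain: the shortest host suffix that is
         *  one label longer than a known public suffix. */
        static String getDomainName(URL url) {
            String[] labels = url.getHost().split("\\.");
            for (int i = 0; i < labels.length; i++) {
                String candidate = join(labels, i + 1); // drop leading labels
                if (SUFFIXES.contains(candidate)) {
                    return join(labels, i); // one label more than the suffix
                }
            }
            return url.getHost(); // unknown suffix: fall back to full host
        }

        private static String join(String[] labels, int from) {
            StringBuilder sb = new StringBuilder();
            for (int i = from; i < labels.length; i++) {
                if (sb.length() > 0) sb.append('.');
                sb.append(labels[i]);
            }
            return sb.toString();
        }

        public static void main(String[] args) throws Exception {
            // foo.example.com and bar.example.com collapse to one key,
            // so they land in the same generator partition.
            System.out.println(getDomainName(new URL("http://foo.example.com/")));
            System.out.println(getDomainName(new URL("http://bar.example.com/")));
        }
    }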
Re: OOM error during parsing with nekohtml
You could try looking at this discussion:

http://www.mail-archive.com/nutch-dev@lucene.apache.org/msg06571.html

--Kai

----- Original Message -----
From: Tsengtan A Shuy [EMAIL PROTECTED]
To: nutch-dev@lucene.apache.org; [EMAIL PROTECTED]
Sent: Monday, July 16, 2007 3:45:59 AM
Subject: RE: OOM error during parsing with nekohtml

[quoted reply and stack trace omitted; see the messages above]
[jira] Commented: (NUTCH-515) Next fetch time is set incorrectly
[ https://issues.apache.org/jira/browse/NUTCH-515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12513019 ]

Andrzej Bialecki commented on NUTCH-515:
----------------------------------------

+1 - sorry for the mess-up ...

Next fetch time is set incorrectly
----------------------------------

                 Key: NUTCH-515
                 URL: https://issues.apache.org/jira/browse/NUTCH-515
             Project: Nutch
          Issue Type: Bug
          Components: fetcher
    Affects Versions: 1.0.0
            Reporter: Doğacan Güney
            Priority: Blocker
             Fix For: 1.0.0
         Attachments: NUTCH-515.patch

(Issue description as in the creation notice above.)

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-515) Next fetch time is set incorrectly
[ https://issues.apache.org/jira/browse/NUTCH-515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12513040 ]

Doğacan Güney commented on NUTCH-515:
-------------------------------------

With more than a hundred config options, and with the way we use Hadoop's
configuration system (not that there is anything wrong with it, but we have
to specify a default value in most cases, and we generally specify what is in
nutch-default.xml as the default), there are bound to be mistakes somewhere
no matter how careful one is. I think this is my third wrong-configuration-option
fix, and I wonder how many I am missing.

Perhaps we could add a ConfParams class that stores parameter names. I mean,
if you need, say, the db.outlinks.max.per.page option, you get its key as
ConfParams.DB_OUTLINKS_MAX_PER_PAGE (so
conf.getInt(ConfParams.DB_OUTLINKS_MAX_PER_PAGE, 100)). Or we could add a
hierarchy to it: ConfParams.IndexParams.MAX_TOKENS. A class with tens of
static final strings in it is not the most elegant thing, but IMHO it is
better than what we are currently doing.

Next fetch time is set incorrectly
----------------------------------

                 Key: NUTCH-515
                 URL: https://issues.apache.org/jira/browse/NUTCH-515
             Project: Nutch
          Issue Type: Bug
          Components: fetcher
    Affects Versions: 1.0.0
            Reporter: Doğacan Güney
            Priority: Blocker
             Fix For: 1.0.0
         Attachments: NUTCH-515.patch

(Issue description as in the creation notice above.)

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
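A minimal sketch of what such a constants class could look like. The class and the nested hierarchy are purely illustrative (nothing like this exists in Nutch), and the key strings simply mirror the ones named in the comment, so they may not match nutch-default.xml exactly:

    import org.apache.hadoop.conf.Configuration;

    /** Illustrative sketch of the proposed constants class. */
    public final class ConfParams {
        private ConfParams() {}

        // Flat style:
        public static final String DB_OUTLINKS_MAX_PER_PAGE =
            "db.outlinks.max.per.page";

        // Or a hierarchy, as suggested above:
        public static final class IndexParams {
            private IndexParams() {}
            public static final String MAX_TOKENS = "indexer.max.tokens";
        }

        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // Usage exactly as in the comment above:
            int maxOutlinks = conf.getInt(DB_OUTLINKS_MAX_PER_PAGE, 100);
            System.out.println(maxOutlinks);
        }
    }

A typo in a constant then fails at compile time instead of silently falling back to a default, which is the point of the proposal.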
[jira] Commented: (NUTCH-506) Nutch should delegate compression to Hadoop
[ https://issues.apache.org/jira/browse/NUTCH-506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12513044 ]

Doğacan Güney commented on NUTCH-506:
-------------------------------------

If there are no objections, I am going to commit this one. Just to get more
comments, here is a break-down of what this patch does:

* Remove all compression code from Nutch. This means no more
  writeCompressedString-s or writeCompressedStringArray-s, and no more
  CompressedWritable-s. All changes are done in a backward-compatible manner.
  Also, after this change Content's version is -1 and new changes should
  *decrease* that number. See NUTCH-392 for more details.

* Respect the io.seqfile.compression.type setting for all structures except
  ParseText, which is always compressed as RECORD. Also, for some reason,
  crawl_generate is not compressed.

Why are we doing this? Because Hadoop can compress these structures for us
efficiently, both in space and in time. I have done some tests with different
compression settings in NUTCH-392, and BLOCK compression really makes a
difference. I think for a large enough crawl the overall space savings will
be around 20% - 40%. Note that this is basically free (there may even be a
small performance gain) if you are using Hadoop's native libraries.

Nutch should delegate compression to Hadoop
-------------------------------------------

                 Key: NUTCH-506
                 URL: https://issues.apache.org/jira/browse/NUTCH-506
             Project: Nutch
          Issue Type: Improvement
            Reporter: Doğacan Güney
             Fix For: 1.0.0
         Attachments: compress.patch, NUTCH-506.patch

Some data structures within Nutch (such as Content, ParseText) handle their
own compression. We should delegate all compression to Hadoop. Also, Nutch
should respect the io.seqfile.compression.type setting. Currently, even if
io.seqfile.compression.type is BLOCK or RECORD, Nutch overrides it for some
structures and sets it to NONE. (However, IMO, ParseText should always be
compressed as RECORD for performance reasons.)

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
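For readers wanting to try the BLOCK setting themselves, a minimal sketch against the Hadoop API of that era; the output path and the Text/Text record types are placeholder assumptions, not anything from the patch:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.SequenceFile.CompressionType;
    import org.apache.hadoop.io.Text;

    public class BlockCompressionSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // The setting the patch respects; BLOCK batches many records
            // per compressed chunk and usually compresses best.
            conf.set("io.seqfile.compression.type", "BLOCK");

            FileSystem fs = FileSystem.get(conf);
            Path out = new Path("/tmp/compressed-demo"); // placeholder path

            // Era-appropriate createWriter overload with an explicit type.
            SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, out, Text.class, Text.class, CompressionType.BLOCK);
            writer.append(new Text("key"), new Text("value"));
            writer.close();
        }
    }

With the native libraries loaded, the compression codec work happens in native code, which is why the cost is close to free.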
Re: OOM error during parsing with nekohtml
Hi all,

Thanks for your suggestions. I am running the parse on a single url
(http://www.fotofinity.com/cgi-bin/homepages.cgi); for other urls the parse
works perfectly. We are getting this error because of the HTML of the page:
it contains many anchor tags which are not closed properly, and hence the
NekoHTML parser throws this exception. The page can be parsed successfully
using TagSoup. We think this is a bug in the NekoHTML parser.

Regards,
Shailendra

On 7/16/07, Tsengtan A Shuy [EMAIL PROTECTED] wrote:

  Thank you for the info. The OOM exception in your previous email indicates
  that your system is running out of heap memory. You either have
  instantiated too many objects, or there are memory leaks in the source
  code. Hope this will help you! Cheers!!

  Adam Shuy, President
  ePacific Web Design Hosting
  Professional Web/Software developer
  TEL: 408-272-6946
  www.epacificweb.com

  [earlier quoted messages and stack trace omitted; see the thread above]
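Since the page parses fine with TagSoup, note that Nutch can switch parsers without code changes. A minimal sketch of selecting TagSoup via configuration; the property name parser.html.impl is taken from nutch-default.xml of this era, so verify it against your checkout:

    import org.apache.hadoop.conf.Configuration;

    public class ParserImplSketch {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // parse-html consults parser.html.impl; "tagsoup" selects
            // TagSoup instead of the default NekoHTML ("neko").
            conf.set("parser.html.impl", "tagsoup");
            System.out.println("parser.html.impl = " + conf.get("parser.html.impl"));
        }
    }

In a real deployment this would normally go into nutch-site.xml rather than be set programmatically.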