OOM error during parsing with nekohtml

2007-07-16 Thread Shailendra Mudgal

Hi All,

We are getting an OOM exception while processing
http://www.fotofinity.com/cgi-bin/homepages.cgi . We have also applied the
NUTCH-497 patch to our source code, but the error still occurs inside the
parse method.
Does anybody have any idea about this? Here is the complete stack trace:

java.lang.OutOfMemoryError: Java heap space
        at java.lang.String.toUpperCase(String.java:2637)
        at java.lang.String.toUpperCase(String.java:2660)
        at org.cyberneko.html.filters.NamespaceBinder.bindNamespaces(NamespaceBinder.java:443)
        at org.cyberneko.html.filters.NamespaceBinder.startElement(NamespaceBinder.java:252)
        at org.cyberneko.html.HTMLTagBalancer.callStartElement(HTMLTagBalancer.java:1009)
        at org.cyberneko.html.HTMLTagBalancer.startElement(HTMLTagBalancer.java:639)
        at org.cyberneko.html.HTMLTagBalancer.startElement(HTMLTagBalancer.java:646)
        at org.cyberneko.html.HTMLScanner$ContentScanner.scanStartElement(HTMLScanner.java:2343)
        at org.cyberneko.html.HTMLScanner$ContentScanner.scan(HTMLScanner.java:1820)
        at org.cyberneko.html.HTMLScanner.scanDocument(HTMLScanner.java:789)
        at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:478)
        at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:431)
        at org.cyberneko.html.parsers.DOMFragmentParser.parse(DOMFragmentParser.java:164)
        at org.apache.nutch.parse.html.HtmlParser.parseNeko(HtmlParser.java:265)
        at org.apache.nutch.parse.html.HtmlParser.parse(HtmlParser.java:229)
        at org.apache.nutch.parse.html.HtmlParser.getParse(HtmlParser.java:168)
        at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:84)
        at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:75)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:175)
        at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1445)


Regards,
Shailendra


RE: OOM error during parsing with nekohtml

2007-07-16 Thread Tsengtan A Shuy
I successfully ran the whole-web crawl on my new Ubuntu OS, and I am
ready to fix the bug. I need someone to point me to the most up-to-date
source code and the bug assignment.

Thank you in advance!! 

Adam Shuy, President
ePacific Web Design & Hosting
Professional Web/Software developer
TEL: 408-272-6946
www.epacificweb.com



[jira] Created: (NUTCH-515) Next fetch time is set incorrectly

2007-07-16 Thread JIRA
Next fetch time is set incorrectly
--

 Key: NUTCH-515
 URL: https://issues.apache.org/jira/browse/NUTCH-515
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 1.0.0
Reporter: Doğacan Güney
Priority: Blocker
 Fix For: 1.0.0


After NUTCH-61, the db.default.fetch.interval option is deprecated and
superseded by db.fetch.interval.default. However, various parts of Nutch still
use the old option. Since the old option is in days (with the default being 30)
and the new option is in seconds (default is ~25), when Nutch fetches a URL its
next fetch time is set as ***30 SECONDS*** later. This means that Nutch keeps
refetching the same URLs over and over.
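To make the unit mismatch concrete, here is a minimal sketch (the real bug is
spread across several call sites in Nutch; this is only an illustration, and
the class name is made up):

import org.apache.hadoop.conf.Configuration;

// Illustration of the units mix-up; not actual Nutch code.
public class FetchIntervalMismatch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Deprecated option, denominated in DAYS (default 30):
    int oldValue = conf.getInt("db.default.fetch.interval", 30);
    // Code that still reads the old option but treats the value as
    // SECONDS schedules the next fetch 30 seconds from now...
    long wrongNextFetch = System.currentTimeMillis() + oldValue * 1000L;
    // ...instead of 30 days from now:
    long rightNextFetch =
        System.currentTimeMillis() + oldValue * 24L * 60 * 60 * 1000;
    System.out.println("wrong: " + wrongNextFetch + ", right: " + rightNextFetch);
  }
}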

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-439) Top Level Domains Indexing / Scoring

2007-07-16 Thread JIRA

[ https://issues.apache.org/jira/browse/NUTCH-439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12512930 ]

Doğacan Güney commented on NUTCH-439:
-

A big +1 from me. Though it may be useful to break this patch into multiple
pieces (fixes to OPIC and the build system as a separate patch, core changes
as a separate patch, and the plugin as a separate patch).

IMHO, most usages of URL.getHost should be replaced with this patch's
getDomainName. For example, the host field in the index currently gets a big
boost, but it is easy to spam hosts: just buy the domain 'example.com', set up
your own DNS, and add 'foo.example.com', 'bar.example.com', 'baz.example.com'.
I have actually seen a lot of spam sites that do this. Doing this in the
linkdb reduces anchor spam (where 'foo.example.com' links to 'bar.example.com'
and Nutch considers this an external link and stores the anchor).

Another example is the generator. Instead of partitioning on host or IP, we
can partition URLs by their domains. This avoids the overhead of resolving IPs
(and IP resolution has its own problems: URLs under the same domain [sometimes
even the same URL] may be served from different IPs [think load balancers and
such]) and would be much more polite and resistant to honey pots.
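To illustrate the idea, here is a deliberately naive sketch (not the patch's
actual getDomainName, which consults a proper TLD list; this one keeps only
the last two host labels and is wrong for suffixes like co.uk):

import java.net.URL;

// Naive domain extractor, for illustration only.
public class DomainSketch {
  public static String getDomainName(URL url) {
    String host = url.getHost();
    String[] labels = host.split("\\.");
    int n = labels.length;
    return n <= 2 ? host : labels[n - 2] + "." + labels[n - 1];
  }

  public static void main(String[] args) throws Exception {
    // Both subdomains map to the same partition key, so the generator
    // could group them into one fetch list:
    System.out.println(getDomainName(new URL("http://foo.example.com/")));
    System.out.println(getDomainName(new URL("http://bar.example.com/")));
  }
}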

 Top Level Domains Indexing / Scoring
 

 Key: NUTCH-439
 URL: https://issues.apache.org/jira/browse/NUTCH-439
 Project: Nutch
  Issue Type: New Feature
  Components: indexer
Affects Versions: 0.9.0
Reporter: Enis Soztutar
 Attachments: tld_plugin_v1.0.patch, tld_plugin_v1.1.patch, 
 tld_plugin_v2.0.patch, tld_plugin_v2.1.patch


 Top Level Domains (TLDs) are the last part(s) of a host name in the DNS
 system. TLDs are managed by the Internet Assigned Numbers Authority (IANA),
 which divides them into three groups: infrastructure, generic (such as com,
 edu), and country-code TLDs (such as en, de, tr). Indexing the top level
 domain, and optionally boosting on it, is needed to improve search results
 and enhance locality.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: OOM error during parsing with nekohtml

2007-07-16 Thread Kai_testing Middleton
You could try looking at this discussion:
http://www.mail-archive.com/nutch-dev@lucene.apache.org/msg06571.html

--Kai


[jira] Commented: (NUTCH-515) Next fetch time is set incorrectly

2007-07-16 Thread Andrzej Bialecki (JIRA)

[ https://issues.apache.org/jira/browse/NUTCH-515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12513019 ]

Andrzej Bialecki  commented on NUTCH-515:
-

+1 - sorry for the mess-up ...

 Next fetch time is set incorrectly
 --

 Key: NUTCH-515
 URL: https://issues.apache.org/jira/browse/NUTCH-515
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 1.0.0
Reporter: Doğacan Güney
Priority: Blocker
 Fix For: 1.0.0

 Attachments: NUTCH-515.patch


 After NUTCH-61, the db.default.fetch.interval option is deprecated and
 superseded by db.fetch.interval.default. However, various parts of Nutch
 still use the old option. Since the old option is in days (with the default
 being 30) and the new option is in seconds (default is ~25), when Nutch
 fetches a URL its next fetch time is set as ***30 SECONDS*** later. This
 means that Nutch keeps refetching the same URLs over and over.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-515) Next fetch time is set incorrectly

2007-07-16 Thread JIRA

[ https://issues.apache.org/jira/browse/NUTCH-515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12513040 ]

Doğacan Güney commented on NUTCH-515:
-

With more than a hundred config options, and with the way we use Hadoop's
configuration system (not that there is anything wrong with it, but we have to
specify a default value in most cases, and we generally specify what is in
nutch-default.xml as the default value), there are bound to be mistakes
somewhere no matter how careful one is. I think this is my third
wrong-configuration-option fix, and I wonder how many I am missing.

Perhaps we can add a ConfParams class that stores parameter names. I mean, if
you need, say, the db.outlinks.max.per.page option, you get its key as
ConfParams.DB_OUTLINKS_MAX_PER_PAGE (so
conf.getInt(ConfParams.DB_OUTLINKS_MAX_PER_PAGE, 100)). Or we can add a
hierarchy to it: ConfParams.IndexParams.MAX_TOKENS. A class with tens of
static final strings in it is not the most elegant thing, but IMHO it is
better than what we are currently doing.
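A sketch of what such a class might look like (class and constant names here
are hypothetical, not an agreed API):

// Hypothetical sketch of the proposed ConfParams idea.
public final class ConfParams {
  private ConfParams() {} // constants only, never instantiated

  public static final String DB_OUTLINKS_MAX_PER_PAGE =
      "db.outlinks.max.per.page";
  public static final String DB_FETCH_INTERVAL_DEFAULT =
      "db.fetch.interval.default";

  // Optional hierarchy, as suggested above:
  public static final class IndexParams {
    public static final String MAX_TOKENS = "indexer.max.tokens";
  }
}

// Usage: conf.getInt(ConfParams.DB_OUTLINKS_MAX_PER_PAGE, 100)
// A typo in the key then becomes a compile error instead of a silently
// ignored option.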



 Next fetch time is set incorrectly
 --

 Key: NUTCH-515
 URL: https://issues.apache.org/jira/browse/NUTCH-515
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 1.0.0
Reporter: Doğacan Güney
Priority: Blocker
 Fix For: 1.0.0

 Attachments: NUTCH-515.patch


 After NUTCH-61, the db.default.fetch.interval option is deprecated and
 superseded by db.fetch.interval.default. However, various parts of Nutch
 still use the old option. Since the old option is in days (with the default
 being 30) and the new option is in seconds (default is ~25), when Nutch
 fetches a URL its next fetch time is set as ***30 SECONDS*** later. This
 means that Nutch keeps refetching the same URLs over and over.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-506) Nutch should delegate compression to Hadoop

2007-07-16 Thread JIRA

[ https://issues.apache.org/jira/browse/NUTCH-506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12513044 ]

Doğacan Güney commented on NUTCH-506:
-

If there are no objections, I am going to commit this one.

Just to get more comments, here is a break-down of what this patch does:

* Remove all compression code from Nutch. This means no more
writeCompressedString-s or writeCompressedStringArray-s, and no more
CompressedWritable-s. All changes are done in a backward-compatible manner.
Also, after this change, Content's version is -1 and new changes should
*decrease* that number. See NUTCH-392 for more details.

* Respect the io.seqfile.compression.type setting for all structures except
ParseText. ParseText is always compressed as RECORD. Also, for some reason,
crawl_generate is not compressed.

Why are we doing this? Because Hadoop can efficiently (both in space and in
time) compress these structures for us. I have done some tests with different
compression settings in NUTCH-392, and BLOCK compression really makes a
difference. I think for a large enough crawl, overall space savings will be
around 20%-40%. Note that this is basically free (there may even be a small
performance gain) if you are using Hadoop's native libraries.
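For reference, a minimal standalone sketch of what delegating compression to
Hadoop looks like at the SequenceFile level (not the patch itself; the file
name and key/value types are arbitrary):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class BlockCompressionDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // BLOCK compression packs many records together before compressing,
    // which is where most of the space savings come from.
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, new Path("demo.seq"), Text.class, Text.class,
        SequenceFile.CompressionType.BLOCK);
    writer.append(new Text("http://example.com/"), new Text("content"));
    writer.close();
  }
}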

 Nutch should delegate compression to Hadoop
 ---

 Key: NUTCH-506
 URL: https://issues.apache.org/jira/browse/NUTCH-506
 Project: Nutch
  Issue Type: Improvement
Reporter: Doğacan Güney
 Fix For: 1.0.0

 Attachments: compress.patch, NUTCH-506.patch


 Some data structures within Nutch (such as Content and ParseText) handle
 their own compression. We should delegate all compression to Hadoop.
 Also, Nutch should respect the io.seqfile.compression.type setting.
 Currently, even if io.seqfile.compression.type is BLOCK or RECORD, Nutch
 overrides it for some structures and sets it to NONE (however, IMO,
 ParseText should always be compressed as RECORD for performance reasons).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: OOM error during parsing with nekohtml

2007-07-16 Thread Shailendra Mudgal

Hi all,

Thanks for your suggestions.

I am running the parse on a single URL
(http://www.fotofinity.com/cgi-bin/homepages.cgi). For other URLs the parse
works perfectly; we are getting this error because of the HTML of this
particular page. The page contains many anchor tags that are not closed
properly, and the NekoHTML parser throws this exception on them. The page can
be parsed successfully using TagSoup, so we think this is a bug in the
NekoHTML parser.
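For anyone who wants to confirm this outside Nutch, here is a minimal
standalone sketch that feeds a saved copy of the page to the same NekoHTML
entry point Nutch uses (the file argument and the small heap, e.g. -Xmx64m,
are assumptions to trigger the failure quickly):

import java.io.FileInputStream;

import org.apache.html.dom.HTMLDocumentImpl;
import org.cyberneko.html.parsers.DOMFragmentParser;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.html.HTMLDocument;
import org.xml.sax.InputSource;

// Run as: java -Xmx64m NekoRepro saved-page.html
public class NekoRepro {
  public static void main(String[] args) throws Exception {
    DOMFragmentParser parser = new DOMFragmentParser();
    HTMLDocument doc = new HTMLDocumentImpl();
    DocumentFragment frag = doc.createDocumentFragment();
    parser.parse(new InputSource(new FileInputStream(args[0])), frag);
    System.out.println("parsed OK");
  }
}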


Regards,
Shailendra







On 7/16/07, Tsengtan A Shuy [EMAIL PROTECTED] wrote:


Thank you for the info.
The OOM exception in your previous email indicates that your system is
running out of heap memory: either you have instantiated too many objects,
or there is a memory leak in the source code.
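If it is simply heap pressure, a hedged workaround in a Hadoop-based setup is
to raise the heap of the child JVMs that run the parse tasks. Normally this
property goes into hadoop-site.xml; it is shown here programmatically, with
the 512m value as an assumption:

import org.apache.hadoop.mapred.JobConf;

// Sketch: give the map/reduce child JVMs a larger heap.
public class HeapSettingDemo {
  public static void main(String[] args) {
    JobConf job = new JobConf();
    job.set("mapred.child.java.opts", "-Xmx512m");
    System.out.println(job.get("mapred.child.java.opts"));
  }
}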

Hope this will help you!
Cheers!!

Adam Shuy, President
ePacific Web Design & Hosting
Professional Web/Software developer
TEL: 408-272-6946
www.epacificweb.com
