OOM error during parsing with nekohtml
Hi All,

We are getting an OutOfMemoryError during the processing of
http://www.fotofinity.com/cgi-bin/homepages.cgi . We have also applied the
NUTCH-497 patch to our source code, but the error actually occurs during the
parse method. Does anybody have any idea about this? Here is the complete
stack trace:

java.lang.OutOfMemoryError: Java heap space
        at java.lang.String.toUpperCase(String.java:2637)
        at java.lang.String.toUpperCase(String.java:2660)
        at org.cyberneko.html.filters.NamespaceBinder.bindNamespaces(NamespaceBinder.java:443)
        at org.cyberneko.html.filters.NamespaceBinder.startElement(NamespaceBinder.java:252)
        at org.cyberneko.html.HTMLTagBalancer.callStartElement(HTMLTagBalancer.java:1009)
        at org.cyberneko.html.HTMLTagBalancer.startElement(HTMLTagBalancer.java:639)
        at org.cyberneko.html.HTMLTagBalancer.startElement(HTMLTagBalancer.java:646)
        at org.cyberneko.html.HTMLScanner$ContentScanner.scanStartElement(HTMLScanner.java:2343)
        at org.cyberneko.html.HTMLScanner$ContentScanner.scan(HTMLScanner.java:1820)
        at org.cyberneko.html.HTMLScanner.scanDocument(HTMLScanner.java:789)
        at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:478)
        at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:431)
        at org.cyberneko.html.parsers.DOMFragmentParser.parse(DOMFragmentParser.java:164)
        at org.apache.nutch.parse.html.HtmlParser.parseNeko(HtmlParser.java:265)
        at org.apache.nutch.parse.html.HtmlParser.parse(HtmlParser.java:229)
        at org.apache.nutch.parse.html.HtmlParser.getParse(HtmlParser.java:168)
        at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:84)
        at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:75)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:175)
        at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1445)

Regards,
Shailendra
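For anyone reproducing this, a first stopgap (independent of any parser fix) is simply to give the parse map tasks more heap. A minimal sketch against the Hadoop API of this era; the property name mapred.child.java.opts is standard Hadoop, but the -Xmx value is an illustrative assumption, not a tuned recommendation:

    import org.apache.hadoop.mapred.JobConf;

    public class HeapConfigSketch {
        public static void main(String[] args) {
            // Raise the heap available to each map/reduce child JVM.
            // Equivalent to setting this property in hadoop-site.xml.
            // -Xmx512m is an illustrative value, not a recommendation.
            JobConf job = new JobConf();
            job.set("mapred.child.java.opts", "-Xmx512m");
            System.out.println(job.get("mapred.child.java.opts"));
        }
    }

This only buys headroom; as the rest of the thread shows, the underlying issue is how NekoHTML handles this particular page.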
RE: OOM error during parsing with nekohtml
I successfully ran the whole-web crawl on my new Ubuntu OS, and I am ready to
fix the bug. I need someone to guide me to the most up-to-date source code
and the bug assignment. Thank you in advance!!

Adam Shuy, President
ePacific Web Design Hosting
Professional Web/Software developer
TEL: 408-272-6946
www.epacificweb.com

-----Original Message-----
From: Shailendra Mudgal [mailto:[EMAIL PROTECTED]]
Sent: Monday, July 16, 2007 3:05 AM
To: [EMAIL PROTECTED]; nutch-dev@lucene.apache.org
Subject: OOM error during parsing with nekohtml

[quoted message and stack trace omitted; see the original post above]
[jira] Created: (NUTCH-515) Next fetch time is set incorrectly
Next fetch time is set incorrectly
----------------------------------

                 Key: NUTCH-515
                 URL: https://issues.apache.org/jira/browse/NUTCH-515
             Project: Nutch
          Issue Type: Bug
          Components: fetcher
    Affects Versions: 1.0.0
            Reporter: Doğacan Güney
            Priority: Blocker
             Fix For: 1.0.0

After NUTCH-61, the db.default.fetch.interval option is deprecated and
superseded by db.fetch.interval.default. However, various parts of Nutch
still use the old option. Since the old option is in days (with the default
being 30) and the new option is in seconds (default ~2.5M, i.e. 30 days
expressed in seconds), when Nutch fetches a url, its next fetch time is set
to ***30 SECONDS*** later. This means that Nutch keeps refetching the same
urls over and over.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
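The failure mode is easy to see in miniature. A sketch with hypothetical variable names (not the actual Nutch source) of what happens when the day-valued default of the old option is consumed by code that expects seconds:

    public class FetchIntervalSketch {
        public static void main(String[] args) {
            // Old option db.default.fetch.interval: value in DAYS, default 30.
            int intervalFromOldKey = 30;

            // Post-NUTCH-61 code treats fetch intervals as SECONDS
            // (db.fetch.interval.default, default 2592000 s = 30 days).
            long now = System.currentTimeMillis();
            long nextFetchTime = now + intervalFromOldKey * 1000L;

            // Prints 30: the url becomes due again 30 seconds later,
            // not 30 days later, hence the endless refetching.
            System.out.println((nextFetchTime - now) / 1000 + " seconds");
        }
    }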
[jira] Commented: (NUTCH-439) Top Level Domains Indexing / Scoring
[ https://issues.apache.org/jira/browse/NUTCH-439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12512930 ]

Doğacan Güney commented on NUTCH-439:
-------------------------------------

A big +1 from me. Though, it may be useful to break this patch into multiple
pieces (fixes to OPIC and the build system as one patch, core changes as
another, and the plugin as a third).

IMHO, most usages of URL.getHost should be replaced with this patch's
getDomainName. For example, the host field in the index currently gets a big
boost, but it is easy to spam hosts: just buy a domain 'example.com', set up
your own DNS, and add 'foo.example.com', 'bar.example.com', 'baz.example.com'.
I have actually seen a lot of spam sites that do this. Doing this in the
linkdb reduces anchor spam (where 'foo.example.com' gives a link to
'bar.example.com' and Nutch considers this an external link and stores the
anchor).

Another example is the generator. Instead of partitioning on host or IP, we
can partition urls based on their domains (see the sketch below). This avoids
the overhead of resolving IPs (and IP resolving has its own problems: urls
under the same domain [sometimes even the same url] may be served from
different IPs [think load balancers and such]) and will be much more polite
and resistant to honey pots.

Top Level Domains Indexing / Scoring
------------------------------------

                 Key: NUTCH-439
                 URL: https://issues.apache.org/jira/browse/NUTCH-439
             Project: Nutch
          Issue Type: New Feature
          Components: indexer
    Affects Versions: 0.9.0
            Reporter: Enis Soztutar
         Attachments: tld_plugin_v1.0.patch, tld_plugin_v1.1.patch, tld_plugin_v2.0.patch, tld_plugin_v2.1.patch

Top Level Domains (TLDs) are the last part(s) of the host name in the DNS
system. TLDs are managed by the Internet Assigned Numbers Authority (IANA),
which divides them into three groups: infrastructure, generic (such as com,
edu) and country-code TLDs (such as en, de, tr). Indexing the top level
domain, and optionally boosting on it, is needed to improve search results
and enhance locality.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
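To make the getDomainName idea concrete, here is a toy sketch of grouping hosts by registered domain. The suffix table is a tiny hard-coded assumption; the actual patch presumably ships a proper IANA-derived TLD list, and none of the names below come from it:

    import java.net.URL;
    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;

    public class DomainSketch {
        // Toy suffix table; a real implementation needs full TLD/ccTLD data.
        private static final Set<String> SUFFIXES =
            new HashSet<String>(Arrays.asList("com", "org", "net", "co.uk"));

        /** Return the registered domain: the shortest host suffix that is
         *  one label longer than a known public suffix. */
        static String getDomainName(URL url) {
            String[] labels = url.getHost().split("\\.");
            for (int i = 0; i < labels.length; i++) {
                String candidate = join(labels, i + 1); // drop leading labels
                if (SUFFIXES.contains(candidate)) {
                    return join(labels, i); // one label more than the suffix
                }
            }
            return url.getHost(); // unknown suffix: fall back to full host
        }

        private static String join(String[] labels, int from) {
            StringBuilder sb = new StringBuilder();
            for (int i = from; i < labels.length; i++) {
                if (sb.length() > 0) sb.append('.');
                sb.append(labels[i]);
            }
            return sb.toString();
        }

        public static void main(String[] args) throws Exception {
            // foo.example.com and bar.example.com collapse to one key,
            // so they land in the same generator partition.
            System.out.println(getDomainName(new URL("http://foo.example.com/")));
            System.out.println(getDomainName(new URL("http://bar.example.com/")));
        }
    }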
Re: OOM error during parsing with nekohtml
You could try looking at this discussion:

http://www.mail-archive.com/nutch-dev@lucene.apache.org/msg06571.html

--Kai

----- Original Message -----
From: Tsengtan A Shuy [EMAIL PROTECTED]
To: nutch-dev@lucene.apache.org; [EMAIL PROTECTED]
Sent: Monday, July 16, 2007 3:45:59 AM
Subject: RE: OOM error during parsing with nekohtml

[quoted reply and stack trace omitted; see the messages above]
[jira] Commented: (NUTCH-515) Next fetch time is set incorrectly
[ https://issues.apache.org/jira/browse/NUTCH-515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12513019 ]

Andrzej Bialecki commented on NUTCH-515:
----------------------------------------

+1 - sorry for the mess-up ...

Next fetch time is set incorrectly
----------------------------------

                 Key: NUTCH-515
                 URL: https://issues.apache.org/jira/browse/NUTCH-515
             Project: Nutch
          Issue Type: Bug
          Components: fetcher
    Affects Versions: 1.0.0
            Reporter: Doğacan Güney
            Priority: Blocker
             Fix For: 1.0.0
         Attachments: NUTCH-515.patch

(Issue description as in the creation notice above.)

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-515) Next fetch time is set incorrectly
[ https://issues.apache.org/jira/browse/NUTCH-515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12513040 ]

Doğacan Güney commented on NUTCH-515:
-------------------------------------

With more than a hundred config options, and with the way we use Hadoop's
configuration system (not that there is anything wrong with it, but we have
to specify a default value in most cases, and we generally specify what is in
nutch-default.xml as the default), there are bound to be mistakes somewhere
no matter how careful one is. I think this is my third wrong-configuration-option
fix, and I wonder how many I am missing.

Perhaps we could add a ConfParams class that stores parameter names. I mean,
if you need, say, the db.outlinks.max.per.page option, you get its key as
ConfParams.DB_OUTLINKS_MAX_PER_PAGE (so
conf.getInt(ConfParams.DB_OUTLINKS_MAX_PER_PAGE, 100)). Or we could add a
hierarchy to it: ConfParams.IndexParams.MAX_TOKENS. A class with tens of
static final strings in it is not the most elegant thing, but IMHO it is
better than what we are currently doing.

Next fetch time is set incorrectly
----------------------------------

                 Key: NUTCH-515
                 URL: https://issues.apache.org/jira/browse/NUTCH-515
             Project: Nutch
          Issue Type: Bug
          Components: fetcher
    Affects Versions: 1.0.0
            Reporter: Doğacan Güney
            Priority: Blocker
             Fix For: 1.0.0
         Attachments: NUTCH-515.patch

(Issue description as in the creation notice above.)

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
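A minimal sketch of what such a constants class could look like. The class and the nested hierarchy are purely illustrative (nothing like this exists in Nutch), and the key strings simply mirror the ones named in the comment, so they may not match nutch-default.xml exactly:

    import org.apache.hadoop.conf.Configuration;

    /** Illustrative sketch of the proposed constants class. */
    public final class ConfParams {
        private ConfParams() {}

        // Flat style:
        public static final String DB_OUTLINKS_MAX_PER_PAGE =
            "db.outlinks.max.per.page";

        // Or a hierarchy, as suggested above:
        public static final class IndexParams {
            private IndexParams() {}
            public static final String MAX_TOKENS = "indexer.max.tokens";
        }

        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // Usage exactly as in the comment above:
            int maxOutlinks = conf.getInt(DB_OUTLINKS_MAX_PER_PAGE, 100);
            System.out.println(maxOutlinks);
        }
    }

A typo in a constant then fails at compile time instead of silently falling back to a default, which is the point of the proposal.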
[jira] Commented: (NUTCH-506) Nutch should delegate compression to Hadoop
[ https://issues.apache.org/jira/browse/NUTCH-506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12513044 ]

Doğacan Güney commented on NUTCH-506:
-------------------------------------

If there are no objections, I am going to commit this one. Just to get more
comments, here is a break-down of what this patch does:

* Remove all compression code from Nutch. This means no more
  writeCompressedString-s or writeCompressedStringArray-s, and no more
  CompressedWritable-s. All changes are done in a backward-compatible manner.
  Also, after this change Content's version is -1 and new changes should
  *decrease* that number. See NUTCH-392 for more details.

* Respect the io.seqfile.compression.type setting for all structures except
  ParseText, which is always compressed as RECORD. Also, for some reason,
  crawl_generate is not compressed.

Why are we doing this? Because Hadoop can compress these structures for us
efficiently, both in space and in time. I have done some tests with different
compression settings in NUTCH-392, and BLOCK compression really makes a
difference. I think for a large enough crawl the overall space savings will
be around 20% - 40%. Note that this is basically free (there may even be a
small performance gain) if you are using Hadoop's native libraries.

Nutch should delegate compression to Hadoop
-------------------------------------------

                 Key: NUTCH-506
                 URL: https://issues.apache.org/jira/browse/NUTCH-506
             Project: Nutch
          Issue Type: Improvement
            Reporter: Doğacan Güney
             Fix For: 1.0.0
         Attachments: compress.patch, NUTCH-506.patch

Some data structures within Nutch (such as Content, ParseText) handle their
own compression. We should delegate all compression to Hadoop. Also, Nutch
should respect the io.seqfile.compression.type setting. Currently, even if
io.seqfile.compression.type is BLOCK or RECORD, Nutch overrides it for some
structures and sets it to NONE. (However, IMO, ParseText should always be
compressed as RECORD for performance reasons.)

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
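For readers wanting to try the BLOCK setting themselves, a minimal sketch against the Hadoop API of that era; the output path and the Text/Text record types are placeholder assumptions, not anything from the patch:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.SequenceFile.CompressionType;
    import org.apache.hadoop.io.Text;

    public class BlockCompressionSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // The setting the patch respects; BLOCK batches many records
            // per compressed chunk and usually compresses best.
            conf.set("io.seqfile.compression.type", "BLOCK");

            FileSystem fs = FileSystem.get(conf);
            Path out = new Path("/tmp/compressed-demo"); // placeholder path

            // Era-appropriate createWriter overload with an explicit type.
            SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, out, Text.class, Text.class, CompressionType.BLOCK);
            writer.append(new Text("key"), new Text("value"));
            writer.close();
        }
    }

With the native libraries loaded, the compression codec work happens in native code, which is why the cost is close to free.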
Re: OOM error during parsing with nekohtml
Hi all,

Thanks for your suggestions. I am running the parse on a single url
(http://www.fotofinity.com/cgi-bin/homepages.cgi); for other urls the parse
works perfectly. We are getting this error because of the HTML of the page:
it contains many anchor tags which are not closed properly, and hence the
NekoHTML parser throws this exception. The page can be parsed successfully
using TagSoup. We think this is a bug in the NekoHTML parser.

Regards,
Shailendra

On 7/16/07, Tsengtan A Shuy [EMAIL PROTECTED] wrote:

  Thank you for the info. The OOM exception in your previous email indicates
  that your system is running out of heap memory. You either have
  instantiated too many objects, or there are memory leaks in the source
  code. Hope this will help you! Cheers!!

  Adam Shuy, President
  ePacific Web Design Hosting
  Professional Web/Software developer
  TEL: 408-272-6946
  www.epacificweb.com

  [earlier quoted messages and stack trace omitted; see the thread above]
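Since the page parses fine with TagSoup, note that Nutch can switch parsers without code changes. A minimal sketch of selecting TagSoup via configuration; the property name parser.html.impl is taken from nutch-default.xml of this era, so verify it against your checkout:

    import org.apache.hadoop.conf.Configuration;

    public class ParserImplSketch {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // parse-html consults parser.html.impl; "tagsoup" selects
            // TagSoup instead of the default NekoHTML ("neko").
            conf.set("parser.html.impl", "tagsoup");
            System.out.println("parser.html.impl = " + conf.get("parser.html.impl"));
        }
    }

In a real deployment this would normally go into nutch-site.xml rather than be set programmatically.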