Re: bug with generate performance

2007-09-07 Thread Andrzej Bialecki

Doğacan Güney wrote:


Others have also reported a problem with generate performance. It
seems we have a problem here but I can not reproduce this behaviour so
I am not sure what causes it. Can you open a JIRA issue and enter your
comments there? Also, how you are running generate will be very
helpful (what is generate.max.per.host? what is -topN argument, etc.)


Also the value of generate.max.per.host.by.ip - this could be a 
DNS-related issue.



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Meta Tags and Indexing

2007-09-07 Thread Jeff Maki
Hello everyone,

I'm working on a project that is essentially a searchable database of
citations (academic ones). Nutch, naturally, was the searching tool I
decided to use because of its full-featuredness. And cost.

Anyway, we had the requirement to be able to sort results by year (for
instance) and restrict results based on type (journal article, book,
etc.). I couldn't find a way to label the results (to use a Google
term), so I ended up writing a plugin to do so.

Based heavily on the example plugin for nutch, this plugin adds data
found in HTML page meta fields to the index, and allows one to
optionally use them in querying (or sorting). Not as full-featured as
google's labeling solution, perhaps, but it's a start.

Just thought I'd post it in case it saves anybody else some time...

Link: http://upclose.lrdc.pitt.edu/people/maki_assets/metatag.tar.gz

Thanks!

-Jeff


Re: Limiting outlink tags.

2007-09-07 Thread Doğacan Güney
Hi Marcin,

On 9/7/07, Marcin Okraszewski [EMAIL PROTECTED] wrote:
 Hi,
 I have noticed that Nutch considers img/@src as an outlink. I suppose in many 
 cases people do not want to treat an image as an outlink. At least I don't 
 want to. The same is true of script/@src. But it seems there is no way to 
 limit outlink tags. DOMContentUtils.getOutlinks() takes links from all of 
 a,area,form,frame,iframe,script,link,img. Only the form element can be turned 
 off, by the parser.html.form.use_action parameter.

 I would suggest introducing a new configuration parameter which could be 
 used to turn certain elements on or off. It could simply be a single 
 parameter containing a comma-separated list of tags to be turned off.

 What is your opinion? If you think it is a valid issue I can make a patch for 
 this.

There is already NUTCH-488 open for this (with a patch). Feel free to
add comments/patches/etc. there. Btw, I agree that using a CSV is
better than using a new configuration parameter for every tag.
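
A minimal sketch of the comma-separated approach discussed above, in plain
Java. This is not the NUTCH-488 patch and does not touch DOMContentUtils;
the class and method names are hypothetical, and the tag list is the one
quoted in Marcin's message.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Nutch: decides which tags yield outlinks.
final class OutlinkTagFilter {
    // Tags named in the message above as current outlink sources.
    static final List<String> ALL_TAGS = Arrays.asList(
        "a", "area", "form", "frame", "iframe", "script", "link", "img");

    // Parses a value such as "img, script" into a set of lower-case tag names.
    static Set<String> parseIgnored(String csv) {
        Set<String> ignored = new HashSet<>();
        if (csv == null) {
            return ignored;
        }
        for (String part : csv.split(",")) {
            String tag = part.trim().toLowerCase(Locale.ROOT);
            if (!tag.isEmpty()) {
                ignored.add(tag);
            }
        }
        return ignored;
    }

    // Returns the tags that would still be scanned for outlinks.
    static List<String> activeTags(String csv) {
        Set<String> ignored = parseIgnored(csv);
        List<String> active = new ArrayList<>();
        for (String tag : ALL_TAGS) {
            if (!ignored.contains(tag)) {
                active.add(tag);
            }
        }
        return active;
    }

    public static void main(String[] args) {
        // Assumed configuration value: ignore images and scripts.
        System.out.println(activeTags("img, SCRIPT"));
    }
}
```

A single list-valued parameter keeps the configuration small: adding a tag
to the ignore list needs no new property and no code change.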


 Regards,
 Marcin




-- 
Doğacan Güney


Re: bug with generate performance

2007-09-07 Thread Doğacan Güney
Hi,

On 8/31/07, misc [EMAIL PROTECTED] wrote:

 Hello-

 I am almost certain I have found a nasty bug with nutch generate.

 Problem: Nutch generate can take many hours, even a day to complete (on a 
 crawldb that has less than 2 million urls).

 I added debug code to Generator-Selector.map to see when map is called 
 and returns, and observed interesting behavior, described here:

 1. Most of the time, when generate is run urls are processed in chunky 
 batches, usually about 40 at a time, followed by a 1 second delay.  I timed 
 the delay, and it really is a 1 second delay (ie- 30 batches was 30 seconds.) 
  When this happens it takes hours to complete.

 2. Sometimes (randomly as far as I can tell) when I run nutch, the urls 
 are processed without delays.  It is an all or nothing event, either I run 
 and all urls process quickly without delay (in minutes), or more likely I get 
 the chunky processing with many 1 second delays and the program takes hours 
 to end.  The one exception is

 3. When the processing runs quickly I've seen the main thread end (I have 
 some profiling going, so I know when a thread ends), and then more likely 
 than not a second thread begins where the first starts, chunky like usual.  
 Although I sometimes can get fast processing in one thread, it is almost 
 impossible for me to get it in all threads and therefore general processing 
 is very slow (hours).

 4. I tried to put in more debug code to find the line where the delays 
 occurred, but the last line printed to the log at a delay seemed random, 
 leading me to believe that the log is not being flushed uniformly.

 5. The profiler I used seemed to imply that about 100% of the time was 
 spent in java.lang.Thread.sleep.  I am not completely familiar with the 
 profiler I used so I am not completely sure I interpreted this correctly.

 I will keep debugging here, but perhaps someone here has some insight 
 into what might be happening?

Others have also reported a problem with generate performance. It
seems we have a problem here but I can not reproduce this behaviour so
I am not sure what causes it. Can you open a JIRA issue and enter your
comments there? Also, how you are running generate will be very
helpful (what is generate.max.per.host? what is -topN argument, etc.)


 thanks
 -J


-- 
Doğacan Güney


[jira] Created: (NUTCH-550) Parse fails if db.max.outlinks.per.page is -1

2007-09-07 Thread JIRA
Parse fails if db.max.outlinks.per.page is -1
-

 Key: NUTCH-550
 URL: https://issues.apache.org/jira/browse/NUTCH-550
 Project: Nutch
  Issue Type: Bug
Reporter: Doğacan Güney
Assignee: Doğacan Güney
 Fix For: 1.0.0
 Attachments: NUTCH-550.patch

See here 
http://www.nabble.com/nutch-nightly%3A-IllegalArgumentException%3A-Illegal-Capacity%3A--1-tf4245360.html#a12511661

One of (my|the) earlier commits broke ParseOutputFormat such that if 
db.max.outlinks.per.page is -1, ParseOutputFormat tries to create an ArrayList 
of size -1. The solution is to simply make maxOutlinks variable 
Integer.MAX_VALUE if db.max.outlinks.per.page is -1 .
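
The fix described above can be sketched in plain Java. This is not the
attached NUTCH-550.patch and not the real ParseOutputFormat code; the class
and method names are hypothetical. The capacity clamp to the number of
outlinks actually present is an added assumption, so that "unlimited" does
not ask ArrayList for Integer.MAX_VALUE slots.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration, not part of Nutch.
final class OutlinkLimit {
    // Maps the "unlimited" setting (-1, or any negative value) to Integer.MAX_VALUE.
    static int effectiveMaxOutlinks(int configured) {
        return configured < 0 ? Integer.MAX_VALUE : configured;
    }

    // Keeps at most the configured number of outlinks.
    static List<String> limit(String[] outlinks, int configured) {
        int max = effectiveMaxOutlinks(configured);
        // new ArrayList<>(-1) would throw IllegalArgumentException, which is
        // the failure reported in this issue; the clamp avoids both that and
        // an oversized allocation.
        int capacity = Math.min(max, outlinks.length);
        List<String> kept = new ArrayList<>(capacity);
        for (int i = 0; i < capacity; i++) {
            kept.add(outlinks[i]);
        }
        return kept;
    }

    public static void main(String[] args) {
        String[] links = {"http://a/", "http://b/", "http://c/"};
        System.out.println(limit(links, -1)); // unlimited: keeps every link
        System.out.println(limit(links, 2));  // keeps the first two links
    }
}
```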

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: Labeling URLs a-la Google

2007-09-07 Thread ogjunk-nutch
Hi Jeff,

Nice.  Could you submit this to JIRA as a patch?

Otis
 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/  -  Tag  -  Search  -  Share

----- Original Message -----
From: Jeff Maki [EMAIL PROTECTED]
To: nutch-dev@lucene.apache.org
Sent: Thursday, September 6, 2007 4:04:18 PM
Subject: Labeling URLs a-la Google

Hello everybody,

I'm working on a project that is essentially a searchable database for
academic citations at the University of Pittsburgh. One of our
searching requirements was to be able to break the search results into
sections--in order to do this, I implemented something similar to
Google's labels.

It's based heavily on the example plugin, and maybe not so pretty
code-wise, but it's a start.

Downloadable here:
http://upclose.lrdc.pitt.edu/people/maki_assets/nutch-regex-label.tar.gz

You configure it by adding something like the below to your nutch-site.xml file:

<property>
  <name>extension.regexlabeler.labels</name>
  <value>
http://dev3\.informalscience\.org/research.*\.php.* = firsttag secondtag thirdtag,
http://dev3\.informalscience\.org/project.*\.php.* = project,
http://www.?\.informalscience\.org.* = oldsite,
http://dev3\.informalscience\.org.* = devsite
  </value>
</property>

Notes:
* Format of each line is regular expression=labels, space delimited
* URLS must be unique.
* Multiple tags for the same pattern are delimited by a space.
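
For readers who want to see how a value in this format might be interpreted,
here is a minimal sketch in plain Java. It is not the plugin's code: the
class name is hypothetical, and it assumes that entries are split on commas
(so a pattern must not contain one), that the last '=' in an entry separates
the pattern from its labels, that a pattern must match the whole URL, and
that a URL collects the labels of every matching pattern.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration of the "regex = label label, ..." format.
final class RegexLabelConfig {
    private static final class Rule {
        final Pattern pattern;
        final List<String> labels;

        Rule(Pattern pattern, List<String> labels) {
            this.pattern = pattern;
            this.labels = labels;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    // Parses the property value into an ordered list of rules.
    static RegexLabelConfig parse(String value) {
        RegexLabelConfig config = new RegexLabelConfig();
        for (String entry : value.split(",")) {
            String trimmed = entry.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            int eq = trimmed.lastIndexOf('=');
            if (eq < 0) {
                throw new IllegalArgumentException("Missing '=' in: " + trimmed);
            }
            String regex = trimmed.substring(0, eq).trim();
            List<String> labels = new ArrayList<>();
            // Labels are whitespace-delimited and may wrap across lines.
            for (String label : trimmed.substring(eq + 1).trim().split("\\s+")) {
                if (!label.isEmpty()) {
                    labels.add(label);
                }
            }
            config.rules.add(new Rule(Pattern.compile(regex), labels));
        }
        return config;
    }

    // Returns the labels of every pattern that matches the whole URL.
    List<String> labelsFor(String url) {
        List<String> result = new ArrayList<>();
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).matches()) {
                result.addAll(rule.labels);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String value =
            "http://dev3\\.informalscience\\.org/research.*\\.php.* = firsttag\n"
            + "secondtag thirdtag,\n"
            + "http://dev3\\.informalscience\\.org.* = devsite";
        RegexLabelConfig config = parse(value);
        System.out.println(config.labelsFor("http://dev3.informalscience.org/research/list.php"));
    }
}
```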

Hope this saves somebody some time,

-Jeff

(BTW, Nutch has worked very well for us--excellent project!)





[jira] Created: (NUTCH-551) performance for generate is often really bad

2007-09-07 Thread Jim (JIRA)
performance for generate is often really bad


 Key: NUTCH-551
 URL: https://issues.apache.org/jira/browse/NUTCH-551
 Project: Nutch
  Issue Type: Bug
  Components: generator
Affects Versions: 0.9.0
 Environment: Ubuntu, Core duo 2.4GhZ, 1 gig ram, 750GB hard drive.
 The ethernet connection has a dedicated 1gb connection to the web, so 
certainly that isn't a problem.
I have tested on nutch 0.9 and the newest daily build from 2007-08-28.

Reporter: Jim



Generate often takes many hours to finish (6+), where I would expect it 
to be done in minutes.

This behavior has been observed for topN of small (~100) and large 
(~100) values.  Other configuration values are

generate.max.per.host: -1

generate.max.per.host.by.ip: false

I added debug code to Generator-Selector.map to see when map is called 
and returns, and observed interesting behavior, described here:

1. Most of the time, when generate is run urls are processed in chunky 
batches, usually about 40 at a time, followed by a 1 second delay.  I timed the 
delay, and it really is a 1 second delay (ie- 30 batches was 30 seconds.)  When 
this happens it takes hours to complete.

2. Sometimes (randomly as far as I can tell) when I run nutch, the urls 
are processed without delays.  It is an all or nothing event, either I run and 
all urls process quickly without delay (in minutes), or more likely I get the 
chunky processing with many 1 second delays and the program takes hours to end. 
 The one exception is

3. When the processing runs quickly I've seen the main thread end (I 
have some profiling going, so I know when a thread ends), and then more likely 
than not a second thread begins where the first starts, chunky like usual.  
Although I sometimes can get fast processing in one thread, it is almost 
impossible for me to get it in all threads and therefore general processing is 
very slow (hours).

4. I tried to put in more debug code to find the line where the delays 
occurred, but the last line printed to the log at a delay seemed random, leading 
me to believe that the log is not being flushed uniformly.  The timestamps in 
the log always indicate that the delay is either right before or after the 
first log item in the map function.

5. The profiler I used seemed to imply that about 100% of the time was 
spent in java.lang.Thread.sleep.  I am not completely familiar with the 
profiler I used so I am not completely sure I interpreted this correctly.
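
As a rough check on the numbers in this report: with batches of about 40
urls each followed by a 1 second delay, the crawldb of just under 2 million
urls mentioned on the mailing list would need about 50,000 batches, i.e.
roughly 50,000 seconds or close to 14 hours, which is consistent with the
run times observed. The sketch below shows one way the debug timing could be
done so that stalls are reported with explicit flushing. It is plain Java
with hypothetical names, not Nutch code, and it assumes the caller passes in
System.nanoTime() at the start of each map call.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical debugging aid, not part of Nutch.
final class MapGapTracker {
    private final long thresholdNanos;
    private boolean seenFirstCall = false;
    private long lastCallNanos;
    private long calls = 0;
    private final List<Long> stalledCalls = new ArrayList<>();

    MapGapTracker(long thresholdMillis) {
        this.thresholdNanos = thresholdMillis * 1_000_000L;
    }

    // Record one map() invocation; nowNanos would normally be System.nanoTime().
    void record(long nowNanos) {
        if (seenFirstCall && nowNanos - lastCallNanos >= thresholdNanos) {
            stalledCalls.add(calls);
            System.err.println("stall of " + (nowNanos - lastCallNanos) / 1_000_000L
                + " ms before call " + calls);
            // Flush explicitly so the message is not held back in a buffer.
            System.err.flush();
        }
        seenFirstCall = true;
        lastCallNanos = nowNanos;
        calls++;
    }

    // Zero-based indexes of the calls that were preceded by a stall.
    List<Long> stalledCalls() {
        return stalledCalls;
    }

    // Estimated run time if every batch is followed by a fixed delay.
    static double estimatedHours(long urls, int batchSize, double delaySeconds) {
        return (urls / (double) batchSize) * delaySeconds / 3600.0;
    }

    public static void main(String[] args) {
        System.out.println(estimatedHours(2_000_000L, 40, 1.0) + " hours");
    }
}
```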





Re: bug with generate performance

2007-09-07 Thread misc


Hello-

   I've filed a bug report and included the extra required information 
(generate.max.per.host = -1, error seen with small topN around 100 and large 
topN around 100).


   I've since tried to run with a debugger, but the slowness went away 
(ugh).  I also know that dns lookups are not the problem as I ran with 
wireshark running and there were no dns lookups.


   thanks
   -Jim









[jira] Commented: (NUTCH-551) performance for generate is often really bad

2007-09-07 Thread Jim (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12525882
 ] 

Jim commented on NUTCH-551:
---


It is really maddening, but I can not reproduce the bug with the jdb debugger 
attached.  Whenever I run with jdb generate just finishes immediately (in 
minutes).  On the bright side I am now certain that there *is* a bug, because I 
can see from start to finish how long generate should take (minutes as opposed 
to hours).

Also, I've been watching the map/reduce log progress, and also the output log 
at the same time and have verified that the chunkyness has something to do with 
the slowdown.  The logs show a progression of the map in a steady fashion until 
the logs start pausing every second for a second.  Then the percentage only 
goes very slowly.







Pl...Give me example

2007-09-07 Thread m.harig

Hi all, I am new to Nutch. I need a sample Java program for crawling and
searching with Nutch. Could anyone guide me on this, or tell me where I can
get the code? I got the nutch-0.9 war and deployed it, but I am not able to
understand it. Please help.
-- 
View this message in context: 
http://www.nabble.com/Pl...Give-me-example-tf4404885.html#a12566694
Sent from the Nutch - Dev mailing list archive at Nabble.com.