[jira] Created: (NUTCH-305) Update crawl and url filter lists to exclude jpeg|JPEG|bmp|BMP

2006-06-09 Thread chris finne (JIRA)
Update crawl and url filter lists to exclude jpeg|JPEG|bmp|BMP
--

 Key: NUTCH-305
 URL: http://issues.apache.org/jira/browse/NUTCH-305
 Project: Nutch
Type: Bug

Versions: 0.8-dev
Reporter: chris finne







Re: anchor text modifications

2006-06-09 Thread Andrzej Bialecki

Brian Higgins wrote:

Hi,
I'm pretty new to Nutch and I'm trying to modify the code so that it stores
the words before and after a hyperlink as well as the anchor text. I've been
looking through the Nutch code for a couple of days and I'm still a little
unclear about the layout...
Nutch parses incoming web pages in HTMLParser.java, right? I can't seem to
find the code in there for URL processing, though - where exactly does it
parse the anchor text and write it to the database?


It collects outlinks in DOMContentUtils.getOutlinks. You will need to 
get the preceding sibling nodes, or a parent node, to collect more of 
the surrounding text.
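As a rough sketch (hypothetical helper code, not existing Nutch code), given
the DOM Node of an anchor element encountered while the tree is walked, the
surrounding words could be gathered from neighbouring text nodes like this:

import org.w3c.dom.Node;

public class AnchorContext {

  // Text from the siblings preceding the anchor, capped at maxChars characters.
  static String textBefore(Node anchor, int maxChars) {
    StringBuilder buf = new StringBuilder();
    for (Node n = anchor.getPreviousSibling();
         n != null && buf.length() < maxChars; n = n.getPreviousSibling()) {
      if (n.getNodeType() == Node.TEXT_NODE) {
        buf.insert(0, n.getNodeValue() + " ");
      }
    }
    return buf.toString().trim();
  }

  // Text from the siblings following the anchor, capped at maxChars characters.
  static String textAfter(Node anchor, int maxChars) {
    StringBuilder buf = new StringBuilder();
    for (Node n = anchor.getNextSibling();
         n != null && buf.length() < maxChars; n = n.getNextSibling()) {
      if (n.getNodeType() == Node.TEXT_NODE) {
        buf.append(n.getNodeValue()).append(" ");
      }
    }
    return buf.toString().trim();
  }
}

In DOMContentUtils.getOutlinks itself, such helpers would be called at the
point where the anchor text is collected, and the result stored alongside the
outlink's anchor.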


--
Best regards,
Andrzej Bialecki 
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




[jira] Updated: (NUTCH-305) Update crawl and url filter lists to exclude jpeg|JPEG|bmp|BMP

2006-06-09 Thread Stefan Neufeind (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-305?page=all ]

Stefan Neufeind updated NUTCH-305:
--

Attachment: suffix-urlfilter.txt

Attached is a suffix-urlfilter.txt that might be interesting to some people. 
More contributions are welcome at any time. Maybe we should ship such a list and 
use the suffix filter instead of regex to filter by document extension?
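For illustration only (the exact default pattern in conf/crawl-urlfilter.txt
and conf/regex-urlfilter.txt may differ), the change the issue title asks for
amounts to adding the missing extensions to the existing suffix-exclusion
rule, along the lines of:

-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$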

 Update crawl and url filter lists to exclude jpeg|JPEG|bmp|BMP
 --

  Key: NUTCH-305
  URL: http://issues.apache.org/jira/browse/NUTCH-305
  Project: Nutch
 Type: Bug

 Versions: 0.8-dev
 Reporter: chris finne
  Attachments: suffix-urlfilter.txt






Adding new urls in WebDB

2006-06-09 Thread Lourival Júnior

Hi all!

I have some problems updating my WebDB. I have a page, test.htm, that has 4
links to 4 PDF documents. I run the crawler, and then when I execute this
command:

bin/nutch readdb Mydir/db -stats

I get this output:

Number of pages: 5
Number of links: 4

That's OK. The problem is when I add 4 more links to test.htm. I want a
script that re-crawls or updates my WebDB without my having to delete the Mydir
folder. I hope I am being clear.
I found some shell scripts to do this, but they don't do what I want.
I always get the same number of pages and links.

Can anyone help me?

--
Lourival Junior
Universidade Federal do Pará
Curso de Bacharelado em Sistemas de Informação
http://www.ufpa.br/cbsi
Msn: [EMAIL PROTECTED]


Re: Adding new urls in WebDB

2006-06-09 Thread Stefan Neufeind

Hi,

please re-read the mailing-list archives from ... hmm ... yesterday,
I think. You'll have to make a small modification to be able to re-inject
your URL so that it is re-crawled on the next run. Otherwise a page will
only be re-crawled after a configurable number of days, which is the
same value also used for the PDFs.
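(For reference, the interval mentioned above is set in the configuration; a
minimal nutch-site.xml override might look like the following - the property
name db.default.fetch.interval is assumed here, and the value is in days:)

<property>
  <name>db.default.fetch.interval</name>
  <value>30</value>
  <description>Default number of days between re-fetches of a page.</description>
</property>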


Regards,
 Stefan


Re: Adding new urls in WebDB

2006-06-09 Thread Lourival Júnior

Hi Stefan,

Sorry, I couldn't find the mail you referred to :(.

Look at this shell script (I'm using Cygwin on Windows 2000):

#!/bin/bash

# Set JAVA_HOME to reflect your systems java configuration
export JAVA_HOME=/cygdrive/c/Arquivos\ de\ programas/Java/jre1.5.0

# Start the index update
bin/nutch generate crawl-LEGISLA/db crawl-LEGISLA/segments -topN 1000
s=`ls -d crawl-LEGISLA/segments/2* | tail -1`
echo Segment is $s
bin/nutch fetch $s
bin/nutch updatedb crawl-LEGISLA/db $s
bin/nutch analyze crawl-LEGISLA/db 5
bin/nutch index $s
bin/nutch dedup crawl-LEGISLA/segments crawl-LEGISLA/tmpfile

# Merge segments to prevent too many open files exception in Lucene
bin/nutch mergesegs -dir crawl-LEGISLA/segments -i -ds
s=`ls -d crawl-LEGISLA/segments/2* | tail -1`
echo Merged Segment is $s

rm -rf crawl-LEGISLA/index

I found it on the wiki page of the Nutch project. It gives some errors at
execution time. I don't know if it is correct... Do you have another example
of how to do this job?






--
Lourival Junior
Universidade Federal do Pará
Curso de Bacharelado em Sistemas de Informação
http://www.ufpa.br/cbsi
Msn: [EMAIL PROTECTED]


Re: [jira] Commented: (NUTCH-258) Once Nutch logs a SEVERE log item, Nutch fails forevermore

2006-06-09 Thread Scott Ganyo

Thanks, Chris!  (And thank you, Andrzej for interpreting my rantings!)

That plan sounds fantastic and I would be happy to help out.

Scott

On Jun 5, 2006, at 1:01 PM, Chris Mattmann wrote:


Hi Andrzej,

 The main problem, as Scott observed, is that the static flag affects all
 instances of the task executing inside the same JVM. If there are
 several Fetcher tasks (or any other tasks that check for the SEVERE flag!)
 belonging to different jobs, all of them will quit. This is certainly
 not the intended behavior.

Got it.

 In fact, I believe that this would make a fantastic anti-pattern. If this
 kind of behavior is *really* wanted (and I argue below that it should not
 be), it should be done through an explicit mechanism, not as a side-effect.

 I have a proposal for a simple solution: set a flag in the current
 Configuration instance, and check for this flag. The Configuration
 instance provides a task-specific context persisting throughout the
 lifetime of a task - but limited only to that task. Voila - problem
 solved. We get rid of the dubious use of LogFormatter (I hope, Chris, that
 even you would agree that this pattern is slightly .. unusual ;) )

What, unusual? Huh? :-)

 and we gain a flexible mechanism limited in scope to the current task,
 which ensures isolation from other tasks in the same JVM. How about that?

+1

I like your proposed solution. I haven't really used multiple fetchers
inside the same process much; however, I do have an application that
calls fetches in more of a sequential way in the same JVM. So, I guess I
just never ran across the behavior. The thing I like about the proposed
solution is its separation and isolation of a task context, which I think
Nutch (now relying on Hadoop as the underlying architectural computing
platform) needed to address.

So, to summarize, the proposed resolution is:

* add a flag field in the Configuration instance to signify whether or not a
SEVERE error has been logged within a task's context

* check this field within the fetcher to determine whether or not to stop
the fetcher, just for that fetching task identified by its Configuration
(and no others)

Is this representative of what you're proposing, Andrzej? If so, I'd like to
take the lead on contributing a small patch that handles this, and then it
would be great if people like Scott could test this out in their existing
environments where this error was manifesting itself.

Thanks!

Cheers,
  Chris

(BTW: would you like me to re-open the JIRA issue, or do you want to do it?)

__
Chris A. Mattmann
[EMAIL PROTECTED]
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group

_
Jet Propulsion Laboratory        Pasadena, CA
Office: 171-266B                 Mailstop: 171-246
___

Disclaimer: The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.






Re: Adding new urls in WebDB

2006-06-09 Thread Stefan Neufeind
This one here:

http://www.mail-archive.com/nutch-user%40lucene.apache.org/msg04829.html


Regards,
 Stefan



Re: Adding new urls in WebDB

2006-06-09 Thread Lourival Júnior

Thanks a lot!






--
Lourival Junior
Universidade Federal do Pará
Curso de Bacharelado em Sistemas de Informação
http://www.ufpa.br/cbsi
Msn: [EMAIL PROTECTED]


Nutch logging questions

2006-06-09 Thread Jérôme Charron

Hi,

I'm currently working on NUTCH-303 so that Nutch uses the Commons Logging
facade API, with log4j as the default implementation. All the code is now
switched to the Commons Logging API, and I have replaced some System.out and
printStackTrace calls with Commons Logging.
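(Illustrative only - a made-up class showing the kind of replacement
described above, using the Commons Logging API:)

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

public class LoggingExample {
  private static final Log LOG = LogFactory.getLog(LoggingExample.class);

  void doWork() {
    try {
      LOG.info("work finished");      // instead of System.out.println(...)
    } catch (Exception e) {
      LOG.error("work failed", e);    // instead of e.printStackTrace()
    }
  }
}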

To finalize this patch, my remaining questions concern the configuration:

1. Should the back-end and the front-end have the same logging
configuration?
2. What kind of configuration do you think is the best default?
For now, I have used the same log4j properties as Hadoop (see
http://svn.apache.org/viewvc/lucene/hadoop/trunk/conf/log4j.properties?view=markup&pathrev=411254
) for the back-end, and
I was thinking of using stdout for the front-end.
What do you think about this?
3. When using the default DRFA appender (Daily Rolling File Appender) in
Nutch, should I log to the Hadoop log file or to a Nutch-specific file?
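(A minimal sketch of the kind of log4j.properties under discussion -
illustrative only, not the actual Hadoop or Nutch file; the DRFA appender
name and the hadoop.log.* variables follow Hadoop's convention:)

log4j.rootLogger=INFO,DRFA
log4j.appender.DRFA=org.apache.log4j.DailyRollingFileAppender
log4j.appender.DRFA.File=${hadoop.log.dir}/${hadoop.log.file}
log4j.appender.DRFA.DatePattern=.yyyy-MM-dd
log4j.appender.DRFA.layout=org.apache.log4j.PatternLayout
log4j.appender.DRFA.layout.ConversionPattern=%d{ISO8601} %-5p %c{2} - %m%n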

Thanks for your feedback.

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/


Re: Nutch logging questions

2006-06-09 Thread Doug Cutting

Jérôme Charron wrote:

For now, I have used the same log4j properties as Hadoop (see
http://svn.apache.org/viewvc/lucene/hadoop/trunk/conf/log4j.properties?view=markup&pathrev=411254
) for the back-end, and
I was thinking of using stdout for the front-end.
What do you think about this?


We should use console rather than stdout, so that it can be 
distinguished from application output.
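(For illustration, the kind of console appender meant here - one writing to
stderr so that log output stays separate from application stdout; the
appender name and pattern are assumptions, not an actual project file:)

log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{2}: %m%n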


http://issues.apache.org/jira/browse/HADOOP-292

Doug


[jira] Updated: (NUTCH-258) Once Nutch logs a SEVERE log item, Nutch fails forevermore

2006-06-09 Thread Chris A. Mattmann (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-258?page=all ]

Chris A. Mattmann updated NUTCH-258:


Attachment: NUTCH-258.Mattmann.060906.patch.txt

Hi Folks,

  Attached is a patch that implements the two suggested fixes for this issue. I 
had to go through the Nutch code and look for LOG.severe calls, and then add an 
additional

conf.set(NutchConfiguration.LOG_SEVERE_FIELD, NutchConfiguration.LOG_SEVERE);

at the bottom of each. I also had to go through several places in the code where 
SEVERE errors were being logged and make sure that those pieces of code had 
access to the Configuration object. I ran unit-level tests and compilation, but 
no system-level tests. Could Scott or someone else who was experiencing this 
problem test out this patch and let me know whether it fixes the issue?
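(A minimal sketch of the idea, with the constant names taken from the comment
above but the values assumed - this is not the attached patch itself:)

import org.apache.hadoop.conf.Configuration;

public class SevereFlagSketch {
  // stand-ins for NutchConfiguration.LOG_SEVERE_FIELD / LOG_SEVERE
  static final String LOG_SEVERE_FIELD = "nutch.log.severe";
  static final String LOG_SEVERE = "true";

  // called wherever LOG.severe(...) is invoked, so the flag is per-task
  static void markSevere(Configuration conf) {
    conf.set(LOG_SEVERE_FIELD, LOG_SEVERE);
  }

  // checked by the fetcher loop instead of the JVM-wide LogFormatter.hasLoggedSevere()
  static boolean hasLoggedSevere(Configuration conf) {
    return LOG_SEVERE.equals(conf.get(LOG_SEVERE_FIELD));
  }
}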

Thanks!

Cheers,
  Chris




 Once Nutch logs a SEVERE log item, Nutch fails forevermore
 --

  Key: NUTCH-258
  URL: http://issues.apache.org/jira/browse/NUTCH-258
  Project: Nutch
 Type: Bug

   Components: fetcher
 Versions: 0.8-dev
  Environment: All
 Reporter: Scott Ganyo
 Assignee: Chris A. Mattmann
 Priority: Critical
  Attachments: NUTCH-258.Mattmann.060906.patch.txt, dumbfix.patch

 Once a SEVERE log item is written, Nutch shuts down any fetching forevermore. 
  This is from the run() method in Fetcher.java:
 public void run() {
   synchronized (Fetcher.this) {activeThreads++;} // count threads

   try {
     UTF8 key = new UTF8();
     CrawlDatum datum = new CrawlDatum();

     while (true) {
       if (LogFormatter.hasLoggedSevere()) // something bad happened
         break;                            // exit
 Notice the last two lines.  This will prevent Nutch from ever fetching again 
 once this is hit, as LogFormatter stores this data in a static. 
 (Also note that LogFormatter.hasLoggedSevere() is also checked in 
 org.apache.nutch.net.URLFilterChecker and will disable that class as well.)
 This must be fixed or Nutch cannot be run as any kind of long-running 
 service.  Furthermore, I believe it is a poor decision to rely on a logging 
 event to determine the state of the application - this could have any number 
 of side-effects that would be extremely difficult to track down.  (As it 
 already has for me.)




[jira] Created: (NUTCH-306) DistributedSearch.Client liveAddresses concurrency problem

2006-06-09 Thread Grant Glouser (JIRA)
DistributedSearch.Client liveAddresses concurrency problem
--

 Key: NUTCH-306
 URL: http://issues.apache.org/jira/browse/NUTCH-306
 Project: Nutch
Type: Bug

  Components: searcher  
Versions: 0.7, 0.8-dev
Reporter: Grant Glouser
Priority: Critical


Under heavy load, hits returned by DistributedSearch.Client can become out of 
sync with the Client's live server list.

DistributedSearch.Client maintains an array of live search servers 
(liveAddresses).  This array is updated at intervals by a watchdog thread.  
When the Client returns hits from a search, it tracks which hits came from 
which server by saving an index into the liveAddresses array (as Hit.indexNo).

The problem occurs when the search servers cannot service some remote procedure 
calls before the client times out (due to heavy load, for example).  If the 
Client returns some Hits from a search, and then the array of liveAddresses 
changes while the Hits are still being used, the indexNos for those Hits can 
become invalid, referring to different servers than the Hit originated from (or 
no server at all!).

Symptoms of this problem include:

- ArrayIndexOutOfBoundsException (when the array of liveAddresses shrinks, a 
Hit from the last server in liveAddresses in the previous update cycle now has 
an indexNo past the end of the array)

- IOException: read past EOF (suppose a hit comes back from server A with a doc 
number of 1000.  Then the watchdog thread updates liveAddresses and now the Hit 
looks like it came from server B, but server B only has 900 documents.  Trying 
to get details for the hit will read past EOF in server B's index.)

- Of course, you could also get a silent failure in which you find a hit on 
server A, but the details/summary are fetched from server B.  To the user, it 
would simply look like an incorrect or nonsense hit.

We have solved this locally by removing the liveAddresses array.  Instead, the 
watchdog thread updates an array of booleans (same size as the array of 
defaultAddresses) that indicate whether that address responded to the latest 
call from the watchdog thread.  Hit.indexNo is then always an index into the 
complete array of defaultAddresses, so it is stable and always valid.  Callers 
of getDetails()/getSummary()/etc. must still be aware that these methods may 
return null when the corresponding server is unable to respond.
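(A rough sketch of that approach - names assumed for illustration, not the
attached patch:)

import java.net.InetSocketAddress;

public class LivenessSketch {
  private final InetSocketAddress[] defaultAddresses;
  private volatile boolean[] alive;              // replaced wholesale by the watchdog thread

  public LivenessSketch(InetSocketAddress[] defaultAddresses) {
    this.defaultAddresses = defaultAddresses;
    this.alive = new boolean[defaultAddresses.length];
  }

  // Watchdog publishes a fresh liveness snapshot; array indexes never shift.
  void updateLiveness(boolean[] latest) {
    this.alive = latest.clone();
  }

  // Hit.indexNo always indexes defaultAddresses, so it stays valid; callers
  // must handle null when the server did not respond to the last check.
  InetSocketAddress serverFor(int indexNo) {
    return alive[indexNo] ? defaultAddresses[indexNo] : null;
  }
}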





[jira] Updated: (NUTCH-306) DistributedSearch.Client liveAddresses concurrency problem

2006-06-09 Thread Grant Glouser (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-306?page=all ]

Grant Glouser updated NUTCH-306:


Attachment: DistributedSearch.java-patch

 DistributedSearch.Client liveAddresses concurrency problem
 --

  Key: NUTCH-306
  URL: http://issues.apache.org/jira/browse/NUTCH-306
  Project: Nutch
 Type: Bug

   Components: searcher
 Versions: 0.7, 0.8-dev
 Reporter: Grant Glouser
 Priority: Critical
  Attachments: DistributedSearch.java-patch





0.8 release

2006-06-09 Thread Sami Siren
How would folks feel about releasing 0.8 now? There have been quite a lot
of improvements and new features since the 0.7 series, and I strongly feel
that we should push the first 0.8 series release (alpha/beta) out the door
now. It would IMO lower the barrier for first-timers to try the 0.8 series,
and that would give us more feedback about the overall quality.

If there is consensus about this, I can volunteer to be the RM.

--
Sami Siren