Re: Problem with latest SVN during reduce phase

2006-01-12 Thread Lukas Vlcek
Hi,
I am facing this error as well. I have now located one particular document
which is causing it (it is an MS Word document which can't be properly
parsed by the parser). I have sent it to Andrzej in a separate email. Let's
see if that helps...
Lukas

On 1/11/06, Dominik Friedrich <[EMAIL PROTECTED]> wrote:
> I got this exception a lot, too. I haven't tested the patch by Andrzej
> yet, but instead I just put the doc.add() lines in the indexer reduce
> function in a try-catch block. This way the indexing finishes even with
> a null value, and I can see which documents haven't been indexed in the
> log file.
>
> Wouldn't it be a good idea to catch every exception that only affects
> one document in loops like this? At least I don't like it if an indexing
> process dies after a few hours because one document triggers such an
> exception.
>
> best regards,
> Dominik
>
> Byron Miller wrote:
> > 060111 103432 reduce > reduce
> > 060111 103432 Optimizing index.
> > 060111 103433 closing > reduce
> > 060111 103434 closing > reduce
> > 060111 103435 closing > reduce
> > java.lang.NullPointerException: value cannot be null
> >         at org.apache.lucene.document.Field.<init>(Field.java:469)
> >         at org.apache.lucene.document.Field.<init>(Field.java:412)
> >         at org.apache.lucene.document.Field.UnIndexed(Field.java:195)
> >         at org.apache.nutch.indexer.Indexer.reduce(Indexer.java:198)
> >         at org.apache.nutch.mapred.ReduceTask.run(ReduceTask.java:260)
> >         at org.apache.nutch.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:90)
> > Exception in thread "main" java.io.IOException: Job failed!
> >         at org.apache.nutch.mapred.JobClient.runJob(JobClient.java:308)
> >         at org.apache.nutch.indexer.Indexer.index(Indexer.java:259)
> >         at org.apache.nutch.crawl.Crawl.main(Crawl.java:121)
> > [EMAIL PROTECTED]:/data/nutch/trunk$
> >
> >
> > Pulled today's build and got the above error. No problems with
> > running out of disk space or anything like that. This
> > is a single instance, local file systems.
> >
> > Any way to recover the crawl/finish the reduce job from
> > where it failed?
> >
> >
> >
>
>
>


Re: Problem with latest SVN during reduce phase

2006-01-12 Thread Pashabhai
Hi ,

   A very similar exception occurs while indexing a
page which does not have body content (and sometimes
no title).

051223 194717 Optimizing index.
java.lang.NullPointerException
        at org.apache.nutch.indexer.basic.BasicIndexingFilter.filter(BasicIndexingFilter.java:75)
        at org.apache.nutch.indexer.IndexingFilters.filter(IndexingFilters.java:63)
        at org.apache.nutch.crawl.Indexer.reduce(Indexer.java:217)
        at org.apache.nutch.mapred.ReduceTask.run(ReduceTask.java:260)
        at ...

 Looking into the source code of BasicIndexingFilter,
it is trying to do:
doc.add(Field.UnStored("content", parse.getText()));

I guess adding a null check on the parse object,
if (parse != null), should solve the problem.
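A minimal sketch of the kind of guard I mean (illustration only -- the
SafeContentField helper below is made up for this example and is not part of
the Nutch tree; the real fix would sit directly in BasicIndexingFilter.filter()
before the doc.add() call):

  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;

  // Hypothetical helper, for illustration only.
  public class SafeContentField {

    /** Adds the "content" field only when the parse text is usable, so a
     *  single unparsable document cannot abort the whole reduce task
     *  (Lucene's Field constructor rejects a null value, as the earlier
     *  Field.<init> trace shows). */
    public static void addContent(Document doc, String parseText) {
      if (parseText == null) {
        parseText = "";    // fall back to an empty string instead of null
      }
      doc.add(Field.UnStored("content", parseText));
    }
  }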

I will confirm once I have tested it locally.

Thanks
P








Re: Problem with latest SVN during reduce phase

2006-01-12 Thread Lukas Vlcek
Hi,
I think this issue may be more complex. If I remember my test
correctly, the parse object was not null. parse.getText() was not
null either (it just contained an empty String).
If a document is not parsed correctly then an "empty" parse is returned
instead (parseStatus.getEmptyParse()), which should be OK, but I haven't
had a chance to check whether this can cause any trouble during
index optimization.
Lukas



Speed up searching

2006-01-12 Thread YourSoft

Dear Developers,

I think this great improvement is missing from the latest Nutch/Lucene
nightly build:

http://issues.apache.org/jira/browse/LUCENE-443

Best Regards,
   Ferenc


NutchQuery adding non required Terms

2006-01-12 Thread Stefan Groschupf

Hi,
I would love to build a Nutch Query object via the API rather than using the
query parser.
In my case I need the complete set of boolean operators in the query:
required (AND), non-required (OR), and prohibited (NOT) terms.
I notice that in general it would be possible to add such a clause to
the Query object, since the BasicQueryFilter just copies the
isRequired and isProhibited parameters.
However, the clauses ArrayList is private and there is no method in
the Nutch Query object that allows adding custom terms or clauses
with isRequired and isProhibited set.


Did I miss something in general that prevents supporting non-required
terms in Nutch?
Would people agree to adding a little method that takes these
parameters?
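Something along these lines is what I have in mind (purely hypothetical --
the method name and the Clause/Term constructors below are my guess and may
not match the current Query internals):

  // Hypothetical addition to org.apache.nutch.searcher.Query -- a sketch only.
  // Assumes a Clause constructor taking (Term, isRequired, isProhibited),
  // which may differ from what is actually in trunk.
  public void addTerm(String termText,
                      boolean isRequired,
                      boolean isProhibited) {
    clauses.add(new Clause(new Term(termText), isRequired, isProhibited));
  }

Callers could then express an optional (OR) term with addTerm("foo", false,
false), a required one with (true, false), and a prohibited one with
(false, true).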



Thanks for any comments.
Stefan 


Re: NutchQuery adding non required Terms

2006-01-12 Thread Doug Cutting

Stefan Groschupf wrote:
Did I miss something in general to be able to support non-required
terms in Nutch?


I left OR and nesting out of the API to simplify what query filters have 
to process.  Nutch's query features are approximately what Google 
supported for its first three years.  (Google did not add OR until 2000, 
I think.)


If we permit optional clauses then we need to make sure that each query 
filter can handle them correctly.


For example, the query "+A +B" is translated by query-basic into 
something like:


+(title:a OR content:a OR anchors:a OR url:a OR host:a)
+(title:b OR content:b OR anchors:b OR url:b OR host:b)
title:"a b"~999
content:"a b"~999
anchors:"a b"~999
url:"a b"~999
host:"a b"~999

The query "+A B" (where B is optional) should remove the plus in the 
second line above.  So it should not be too hard to change query-basic 
to be able to handle optional terms in the default field.  Perhaps 
that's the only query filter that would need to be updated.  And it 
looks like LuceneQueryOptimizer already checks that filterized clauses 
are required.
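At the Lucene level the difference is just whether the clause is required or
optional; a small standalone illustration (assuming a Lucene release where
BooleanQuery is still mutable and BooleanClause.Occur exists -- the field
names are only examples):

  import org.apache.lucene.index.Term;
  import org.apache.lucene.search.BooleanClause;
  import org.apache.lucene.search.BooleanQuery;
  import org.apache.lucene.search.TermQuery;

  public class OptionalTermExample {
    public static void main(String[] args) {
      // "+A B": a must match, b only boosts documents that happen to contain it.
      BooleanQuery q = new BooleanQuery();
      q.add(new TermQuery(new Term("content", "a")), BooleanClause.Occur.MUST);
      q.add(new TermQuery(new Term("content", "b")), BooleanClause.Occur.SHOULD);
      System.out.println(q);   // prints something like: +content:a content:b
    }
  }

query-basic would have to emit the SHOULD (optional) variant instead of MUST
for the expanded clauses of an optional term -- essentially the "remove the
plus" change described above.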


It would be good to have some unit tests for query filtering.

Doug


quit the maillist

2006-01-12 Thread Su Yan
Hi,

May I quit the nutch-dev mailing list? Thank you!

Sue


MapReduce and segment merging

2006-01-12 Thread Mike Alulin
Is it possible to merge segments in the map reduce version of Nutch?



Re: MapReduce and segment merging

2006-01-12 Thread Andrzej Bialecki

Mike Alulin wrote:

Is it possible to merge segments in the map reduce version of Nutch?
  


Not yet.

--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




Re: MapReduce and segment merging

2006-01-12 Thread Mike Alulin
Then how do people use the new version if they need, let's say, daily crawls
of new/updated pages? I crawl updated pages every 24 hours, and if I do not
merge the segments I will soon have hundreds of them. What is the best
solution in this case?

A full recrawl is not a good option, as I have millions of documents and I DO
know which of them were updated without requesting them.








Where is org.apache.nutch.protocol.http.api.HttpBase?

2006-01-12 Thread Jack Tang
Hi Guys

I have updated the source code to the current SVN head. However, I cannot
find the org.apache.nutch.protocol.http.api.HttpBase class. Was it
left out?

Thanks
/Jack

--
Keep Discovering ... ...
http://www.jroller.com/page/jmars


Re: MapReduce and segment merging

2006-01-12 Thread Andrzej Bialecki



This is a development version, nobody said it's feature complete. 
Patience, my friend... or spend some effort to improve it. ;-)


--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




[jira] Created: (NUTCH-172) Segment merger

2006-01-12 Thread Mike Alulin (JIRA)
Segment merger
--

 Key: NUTCH-172
 URL: http://issues.apache.org/jira/browse/NUTCH-172
 Project: Nutch
Type: New Feature
Versions: 0.8-dev
 Environment: Any
Reporter: Mike Alulin


The map reduce version is missing segment merging, which can be very important
when one wants to have frequent crawls of updated pages only.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



RE: MapReduce and segment merging

2006-01-12 Thread Goldschmidt, Dave
Could you also just copy segments out of NDFS to the local filesystem, perform
the merges locally, then copy the segments back into NDFS?

DaveG






Re: Where is org.apache.nutch.protocol.http.api.HttpBase?

2006-01-12 Thread Stefan Groschupf

I guess it is in:
src/plugin/lib-http/




---
company:http://www.media-style.com
forum:http://www.text-mining.org
blog:http://www.find23.net




RE: MapReduce and segment merging

2006-01-12 Thread Byron Miller
I was thinking that Nutch needs some sort of workflow
manager. This way you could build jobs off specific
workflows and hopefully recover jobs based upon the
portion of the workflow where they are stuck (or restart a
job if it failed, if processing time > x hours, and other such
workflow rules).

Something like that could also send notifications when
jobs are done, trigger other events, and provide a
management interface showing what your cluster is up to, or
allow configuration types to be defined per
batch job/workflow process "in process". For example,
if I'm building a blog index I may want more, smaller
segments based upon daily fetches, while for other jobs
I may want fewer, larger segments.

Does something like that make much sense for where the
mapred branch is going?

Is workflow the right term for such a beast?

-byron



--- "Goldschmidt, Dave" <[EMAIL PROTECTED]>
wrote:

> Could you also just copy segments out of NDFS to
> local -- perform merges
> in local -- then copy segments back into NDFS?
> 
> DaveG
> 
> 
> -Original Message-
> From: Andrzej Bialecki [mailto:[EMAIL PROTECTED] 
> Sent: Thursday, January 12, 2006 2:14 PM
> To: nutch-dev@lucene.apache.org
> Subject: Re: MapReduce and segment merging
> 
> Mike Alulin wrote:
> > Then how people uses the new version if they need
> let's say daily
> crawls of the new/updated pages? I crawl updated
> pages every 24 hours
> and if I do not merge the segments, soon I will have
> hundreds of them.
> What is the best solution in this case? 
> >
> >   Full recrawl is not a good option as i have
> millions of documents
> and I DO know which of them were updated without
> requesting them.
> >   
> 
> This is a development version, nobody said it's
> feature complete. 
> Patience, my friend... or spend some effort to
> improve it. ;-)
> 
> -- 
> Best regards,
> Andrzej Bialecki <><
>  ___. ___ ___ ___ _ _  
> __
> [__ || __|__/|__||\/|  Information Retrieval,
> Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System
> Integration
> http://www.sigram.com  Contact: info at sigram dot
> com
> 
> 
> 



Re: NutchQuery adding non required Terms

2006-01-12 Thread Stefan Groschupf

Thanks for the hint.
I would love to add non-required terms and nesting to the Query
object API, and I will also provide some unit tests, but since I'm not a
javacc geek it will only extend the Java API, not the query parser.

Would such an extension be welcome?

Stefan




---
company:http://www.media-style.com
forum:http://www.text-mining.org
blog:http://www.find23.net




Re: NutchQuery adding non required Terms

2006-01-12 Thread Doug Cutting

Stefan Groschupf wrote:
I would love to add non-required terms and nesting to the Query object
API, and I will also provide some unit tests, but since I'm not a javacc
geek it will only extend the Java API, not the query parser.

Would such an extension be welcome?


I think we should start with just adding non-required terms, and leave 
nesting as a subsequent step.


I also agree that we can leave this out of the query parser as a start.

Doug


[jira] Updated: (NUTCH-87) Efficient site-specific crawling for a large number of sites

2006-01-12 Thread Matt Kangas (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-87?page=all ]

Matt Kangas updated NUTCH-87:
-

Attachment: build.xml.patch
urlfilter-whitelist.tar.gz

THIS REPLACES THE PREVIOUS TARBALL
SEE THE INCLUDED README.txt FOR USAGE GUIDELINES

Place both of these files into ~nutch/src/plugin, then:
- untar the tarball
- apply the patch to ~nutch/src/plugin/build.xml to permit urlfilter-whitelist
to be built

Next, cd ~nutch and build ("ant").

A JUnit test is included. It will be run automatically by "ant test-plugins".

Then follow the instructions in ~nutch/src/plugin/urlfilter-whitelist/README.txt

> Efficient site-specific crawling for a large number of sites
> 
>
>  Key: NUTCH-87
>  URL: http://issues.apache.org/jira/browse/NUTCH-87
>  Project: Nutch
> Type: New Feature
>   Components: fetcher
>  Environment: cross-platform
> Reporter: AJ Chen
>  Attachments: JIRA-87-whitelistfilter.tar.gz, build.xml.patch, 
> urlfilter-whitelist.tar.gz
>
> There is a gap between whole-web crawling and single (or handful) site 
> crawling. Many applications actually fall in this gap, and usually require
> crawling a large number of selected sites, say 10 domains. The current
> CrawlTool is designed for a handful of sites. So, this request calls for a
> new feature or improvement on CrawlTool so that the "nutch crawl" command can
> efficiently deal with a large number of sites. One requirement is to add or
> change smallest amount of code so that this feature can be implemented sooner 
> rather than later. 
> There is a discussion about adding a URLFilter to implement this requested 
> feature, see the following thread - 
> http://www.mail-archive.com/nutch-dev%40lucene.apache.org/msg00726.html
> The idea is to use a hashtable in URLFilter for looking up regex for any 
> given domain. A hashtable will be much faster than the list implementation
> currently used in RegexURLFilter (a rough sketch of the idea follows below).
> Fortunately, Matt Kangas has implemented such an idea before for his own
> application and is willing to make it available
> for adaptation to Nutch. I'll be happy to help him in this regard.  
> But, before we do it, we would like to hear more discussions or comments 
> about this approach or other approaches. Particularly, let us know what 
> potential downside will be for hashtable lookup in a new URLFilter plugin.
> AJ Chen
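A rough sketch of the hashtable-lookup idea described above (hypothetical
code, not the attached urlfilter-whitelist plugin; the real plugin implements
Nutch's URLFilter extension point, which is elided here):

  import java.net.URL;
  import java.util.ArrayList;
  import java.util.HashMap;
  import java.util.List;
  import java.util.Map;
  import java.util.regex.Pattern;

  // Hypothetical illustration only: one regex list per host, found via an
  // O(1) hashtable lookup instead of scanning a single long regex list.
  public class HostKeyedUrlFilter {

    private final Map patternsByHost = new HashMap();

    /** Registers a regex that is allowed for the given host. */
    public void allow(String host, String regex) {
      List list = (List) patternsByHost.get(host);
      if (list == null) {
        list = new ArrayList();
        patternsByHost.put(host, list);
      }
      list.add(Pattern.compile(regex));
    }

    /** Returns the URL if a pattern registered for its host matches, else null. */
    public String filter(String urlString) {
      try {
        String host = new URL(urlString).getHost();
        List list = (List) patternsByHost.get(host);
        if (list == null) {
          return null;                 // host not whitelisted
        }
        for (int i = 0; i < list.size(); i++) {
          if (((Pattern) list.get(i)).matcher(urlString).find()) {
            return urlString;
          }
        }
        return null;
      } catch (Exception e) {
        return null;                   // malformed URL -> reject
      }
    }
  }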




[jira] Commented: (NUTCH-87) Efficient site-specific crawling for a large number of sites

2006-01-12 Thread Matt Kangas (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-87?page=comments#action_12362584 ] 

Matt Kangas commented on NUTCH-87:
--

JIRA-87-whitelistfilter.tar.gz is OBSOLETE. Use the newer tarball + patch file 
instead.





Nutch/Lucene Document Model

2006-01-12 Thread Chih How Bong
Hi all,
  I just got my hands dirty with Nutch recently, especially in extending its
functionality.
  I learned that Nutch/Lucene implement their document retrieval model with a
TF vector-based approach. I wonder whether other document models, such as
fuzzy set or probabilistic models, are implemented in Nutch/Lucene.
  The objective of proposing and implementing a number of document models is
to enable us to further improve document ranking in Nutch. Please understand
that I am not questioning the efficiency of the current Nutch document
ranking. I would just like to see more options in Nutch, especially in how
documents are modelled and how well the models perform.
  Is it worthwhile to move in this direction? Please comment.
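For what it is worth, one existing knob is Lucene's Similarity class, which
lets you swap the TF/IDF weighting without touching the rest of the engine. A
minimal sketch (assuming the Lucene 1.9-era Similarity API; the class name is
made up):

  import org.apache.lucene.search.DefaultSimilarity;

  // Hypothetical example of a custom weighting -- not an existing Nutch class.
  public class SublinearTfSimilarity extends DefaultSimilarity {

    /** Dampen raw term frequency so long, repetitive documents
     *  do not dominate the ranking. */
    public float tf(float freq) {
      return freq > 0 ? 1.0f + (float) Math.log(freq) : 0.0f;
    }
  }

It would then be installed on the searcher (or writer) via setSimilarity().
Truly different models (probabilistic, fuzzy, ...) would need deeper changes,
but for simple re-weighting experiments this is a cheap place to start.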

Bong Chih How


Re: Problem with latest SVN during reduce phase

2006-01-12 Thread Pashabhai
Hi ,

   You are right, the Parse object is not null even though
the page has no content and no title.

   Could it be the FetcherOutput object?

 
P   





java.io.EOFException ... at org.apache.nutch.ndfs.DataNode$DataXceiver.run...

2006-01-12 Thread Rafi Iz

Hi,
I am running mapreduce with 3 machines:
one name node and two datanodes.
I am using the latest revision of nutch 0.8, revision number 368582, and 
java version jdk1.5.0_06


I tried a very simple thing on all three machines:
moving a file from local to NDFS:
bin/nutch ndfs -put tmp /user/rafi/tmp10

and I got the same message from all the machines:
060113 011422 Recovered from failed datanode connection

Looking at the log files of the datanodes, I see the following message:
060112 212301 39 DataXCeiver
java.io.EOFException
   at java.io.DataInputStream.readFully(DataInputStream.java:178)
   at java.io.DataInputStream.readLong(DataInputStream.java:380)
   at org.apache.nutch.ndfs.DataNode$DataXceiver.run(DataNode.java:432)
   at java.lang.Thread.run(Thread.java:595)


Am I missing something?

Thanks,
Rafi
