junit test failed

2005-08-28 Thread AJ Chen
I'm a newcomer, trying to test Nutch for vertical search. I downloaded 
the code and compiled it in Cygwin. But the unit test failed with the 
following message:


test-core:
  [delete] Deleting directory nutch\trunk\build\test\data
   [mkdir] Created dir: nutch\trunk\build\test\data

BUILD FAILED
nutch\trunk\build.xml:173: Could not create task or type of type: junit.

Did I miss anything for junit? Appreciate your help.


AJ Chen




Re: junit test failed

2005-08-28 Thread AJ Chen

Michael,
See http://wiki.apache.org/nutch/HowToContribute for the unit tests. JUnit is 
a tool Nutch uses for its unit tests. The Nutch package includes quite a 
few test classes. It's a good idea to run the tests as a way to check for 
any unexpected consequences introduced by new code.


Apparently, the command "ant test" does not work. Does anybody have an idea 
how to make the unit tests work?


AJ

Michael Ji wrote:


What does the junit test stand for? A particular patch?

Sorry if my question is silly.

Michael Ji,

--- AJ Chen <[EMAIL PROTECTED]> wrote:

 


I'm a new comer, trying to test Nutch for vertical
search. I downloaded 
the code and compiled it in cygwin. But, the unit
test failed with the 
following message:


test-core:
  [delete] Deleting directory
nutch\trunk\build\test\data
   [mkdir] Created dir: nutch\trunk\build\test\data

BUILD FAILED
nutch\trunk\build.xml:173: Could not create task or
type of type: junit.

Did I miss anything for junit? Appreciate your help.


AJ Chen



   





 



--
AJ (Anjun) Chen, Ph.D.
Canova Bioconsulting 
Marketing * BD * Software Development

748 Matadero Ave., Palo Alto, CA 94306, USA
Cell 650-283-4091, [EMAIL PROTECTED]
---


Re: junit test failed

2005-08-28 Thread AJ Chen
I'm using Ant 1.6.5, which has junit.jar in ANT_HOME/lib.  I replaced 
ANT_HOME/lib/ant-junit.jar with the latest junit-3.8.1.jar that 
comes with Nutch, but it did not work.  I haven't seen anything like 
"the optional tasks jar file" described in the instructions, though. Any 
idea?


Can someone also verify that Nutch's build.xml defines the right classpath 
element for the junit task? 


AJ



Fuad Efendi wrote:


http://ant.apache.org/manual/index.html

JUnit
Description
This task runs tests from the JUnit testing framework. The latest
version of the framework can be found at http://www.junit.org. This task
has been tested with JUnit 3.0 up to JUnit 3.8.1; it won't work with
versions prior to JUnit 3.0.

Note: This task depends on external libraries not included in the Ant
distribution. See Library Dependencies for more information. 


Note: You must have junit.jar and the class files for the <junit> task
in the same classpath. You can do one of: 

Put both junit.jar and the optional tasks jar file in ANT_HOME/lib. 
Do not put either in ANT_HOME/lib, and instead include their locations
in your CLASSPATH environment variable. 
Do neither of the above, and instead, specify their locations using a
<classpath> element in the build file. See the FAQ for details. 




-Original Message-
From: Fuad Efendi [mailto:[EMAIL PROTECTED] 
Sent: Sunday, August 28, 2005 9:45 PM

To: nutch-dev@lucene.apache.org; 'nutch-dev'
Subject: RE: junit test failed


Check version of ANT! 



Line 173:

nutch\trunk\build.xml:173: Could not create task or type of type: junit.


Probably, your current ANT configuration/version does not have such a
task defined, 

Regards,
Fuad Efendi


-Original Message-
From: AJ Chen [mailto:[EMAIL PROTECTED] 
Sent: Sunday, August 28, 2005 9:01 PM

To: nutch-dev
Subject: junit test failed


I'm a new comer, trying to test Nutch for vertical search. I downloaded 
the code and compiled it in cygwin. But, the unit test failed with the 
following message:


test-core:
  [delete] Deleting directory nutch\trunk\build\test\data
   [mkdir] Created dir: nutch\trunk\build\test\data

BUILD FAILED
nutch\trunk\build.xml:173: Could not create task or type of type: junit.

Did I miss anything for junit? Appreciate your help.


AJ Chen







 



--
AJ (Anjun) Chen, Ph.D.
Canova Bioconsulting 
Marketing * BD * Software Development

748 Matadero Ave., Palo Alto, CA 94306, USA
Cell 650-283-4091, [EMAIL PROTECTED]
---


Re: junit test failed

2005-08-28 Thread AJ Chen

Fuad, thanks. problem solved.

Fuad Efendi wrote:


I just reproduced it on my Windows XP; I had the same problem with Ant 1.6.3.

It's not a version problem (as mentioned by Erik Hatcher).

I simply copied the junit-3.8.1.jar file into apache-ant-1.6.3\lib and the
problem disappeared.

You should restore this file:
ANT_HOME/lib/ant-junit.jar

And copy junit-3.8.1.jar into apache-ant-1.6.3\lib.



-Original Message-
From: AJ Chen [mailto:[EMAIL PROTECTED] 
Sent: Monday, August 29, 2005 12:00 AM

To: nutch-dev@lucene.apache.org
Subject: Re: junit test failed


I'm using ant1.6.5, which has junit.jar in the ANT_HOME/lib.  I replace 
the ANT_HOME/lib/ant-junit.jar with the latest junit-3.8.1.jar that 
comes with Nutch.  But, it did not work.  I haven't seen anything like 
"the optional tasks jar file" described in the instruction, though. Any 
idea?


Can someone also verify Nutch's build.xml defines the right classpath 
element for junit task? 


AJ



Fuad Efendi wrote:

 


http://ant.apache.org/manual/index.html

JUnit
Description
This task runs tests from the JUnit testing framework. The latest 
version of the framework can be found at http://www.junit.org. This 
task has been tested with JUnit 3.0 up to JUnit 3.8.1; it won't work 
with versions prior to JUnit 3.0.


Note: This task depends on external libraries not included in the Ant 
distribution. See Library Dependencies for more information.


Note: You must have junit.jar and the class files for the <junit> task 
in the same classpath. You can do one of:


Put both junit.jar and the optional tasks jar file in ANT_HOME/lib.
Do not put either in ANT_HOME/lib, and instead include their locations
in your CLASSPATH environment variable. 
Do neither of the above, and instead, specify their locations using a
<classpath> element in the build file. See the FAQ for details. 




-Original Message-
From: Fuad Efendi [mailto:[EMAIL PROTECTED]
Sent: Sunday, August 28, 2005 9:45 PM
To: nutch-dev@lucene.apache.org; 'nutch-dev'
Subject: RE: junit test failed


Check version of ANT!


Line 173:

nutch\trunk\build.xml:173: Could not create task or type of type: 
junit.



Probably, your current ANT configuration/version does not have such a 
task defined, 


Regards,
Fuad Efendi


-Original Message-
From: AJ Chen [mailto:[EMAIL PROTECTED]
Sent: Sunday, August 28, 2005 9:01 PM
To: nutch-dev
Subject: junit test failed


I'm a new comer, trying to test Nutch for vertical search. I downloaded
the code and compiled it in cygwin. But, the unit test failed with the 
following message:


test-core:
 [delete] Deleting directory nutch\trunk\build\test\data
  [mkdir] Created dir: nutch\trunk\build\test\data

BUILD FAILED
nutch\trunk\build.xml:173: Could not create task or type of type: 
junit.


Did I miss anything for junit? Appreciate your help.


AJ Chen









   



 



--
AJ (Anjun) Chen, Ph.D.
Canova Bioconsulting 
Marketing * BD * Software Development

748 Matadero Ave., Palo Alto, CA 94306, USA
Cell 650-283-4091, [EMAIL PROTECTED]
---


manage crawling cycles and progress

2005-09-01 Thread AJ Chen
Seeded with a list of URLs, the Nutch whole-web crawler will take an 
unknown number of generate/fetch/updatedb cycles to reach some level of 
completeness, both for internal links and for outlinks. It's crucial to 
monitor the progress. I'd appreciate some suggestions or best practices 
on the following questions:


1. After each cycle, how can I list which URLs were fetched successfully 
and, separately, which URLs failed? (A rough log-scanning sketch follows 
after this list.)
2. Are there tools to create progress reports?
3. Will the failed URLs be included in the next fetchlist generated by 
"nutch generate"? If not, how can I control when these failed URLs get 
fetched again?
4. What's a good way to measure the completeness of crawling a list of 
sites, say 1000 seed URLs? For the internal links of a site, how can I 
determine that all internal links were fetched, or at least tried?  Same 
question for outlinks.
5. I see a great need for automation of this process. Is there a tool or 
plan in Nutch for such automation? Has anybody developed an automated 
process that can be shared?
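
For question 1, a quick way to get a per-cycle report is simply to scan the fetcher log. Below is a minimal sketch in plain Java, not a Nutch tool; it assumes the log lines look like the ones quoted later in this archive ("fetching <url>" and "fetch of <url> failed with: ..."), which may differ between Nutch versions.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: summarizes one cycle's fetcher log into attempted vs. failed URLs. */
public class FetchLogReport {
    public static void main(String[] args) throws IOException {
        List<String> attempted = new ArrayList<String>();
        List<String> failed = new ArrayList<String>();
        BufferedReader in = new BufferedReader(new FileReader(args[0]));
        String line;
        while ((line = in.readLine()) != null) {
            // Failure lines look like: "... fetch of <url> failed with: <exception>"
            int f = line.indexOf("fetch of ");
            int w = line.indexOf(" failed with:");
            if (f >= 0 && w > f) {
                failed.add(line.substring(f + "fetch of ".length(), w));
            } else {
                // Attempts are logged as: "... fetching <url>"
                int s = line.indexOf("fetching ");
                if (s >= 0) {
                    attempted.add(line.substring(s + "fetching ".length()).trim());
                }
            }
        }
        in.close();
        System.out.println("attempted: " + attempted.size() + ", failed: " + failed.size());
        for (String url : failed) {
            System.out.println("FAILED " + url);
        }
    }
}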


Thanks,
AJ



Re: Automating workflow using ndfs

2005-09-02 Thread AJ Chen
I'm also thinking about implementing an automated workflow of 
fetchlist->crawl->updateDb->index. Although my project may not require 
NDFS because it only concerns deep crawling of 100,000 sites, an 
appropriate workflow is still needed to automatically take care of 
failed URLs, newly added URLs, daily updates, etc.  I'd appreciate it if 
somebody could share experience on the design of such a workflow. (A 
rough driver sketch follows below.)
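
For what it's worth, the generate/fetch/updatedb loop can be driven by a small Java program that shells out to bin/nutch. The sketch below is only an illustration under assumptions: the sub-command names and arguments (generate, fetch, updatedb, the db/segments paths, and -topN) follow the 0.7 whole-web tutorial and should be checked against your Nutch version, and the failed-URL bookkeeping and error handling are left out.

import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

/** Hypothetical driver: runs N generate/fetch/updatedb cycles by shelling out to bin/nutch. */
public class CrawlCycleDriver {

    // Sub-command names and arguments below are assumptions based on the 0.7 whole-web tutorial.
    static void nutch(String... args) throws IOException, InterruptedException {
        String[] cmd = new String[args.length + 1];
        cmd[0] = "bin/nutch";
        System.arraycopy(args, 0, cmd, 1, args.length);
        Process p = new ProcessBuilder(cmd).redirectErrorStream(true).start();
        // Stream the tool's output to our stdout so progress stays visible.
        InputStream out = p.getInputStream();
        byte[] buf = new byte[4096];
        for (int n; (n = out.read(buf)) != -1; ) {
            System.out.write(buf, 0, n);
        }
        if (p.waitFor() != 0) {
            throw new IOException("bin/nutch " + args[0] + " exited with an error");
        }
    }

    public static void main(String[] args) throws Exception {
        int cycles = Integer.parseInt(args[0]);
        for (int i = 0; i < cycles; i++) {
            nutch("generate", "db", "segments", "-topN", "50000");
            // generate creates a new timestamped segment directory; fetch the newest one.
            File[] segs = new File("segments").listFiles();
            Arrays.sort(segs);
            String latest = segs[segs.length - 1].getPath();
            nutch("fetch", latest);
            nutch("updatedb", "db", latest);
        }
    }
}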


The Nutch intranet crawler (or site-specific crawler, as I prefer to 
call it) is an automated process, but it's designed to conveniently deal 
with just a handful of sites.  With a larger number of selected sites, I 
expect a modified version is needed.  One modification I can think of is 
to create a lookup table in the urlfilter object for the domains to be 
crawled and their corresponding regular expressions.  The goal is to 
avoid entering 100,000 regexes in crawl-urlfilter.txt and checking ALL 
of them for each URL. Any comment? (A rough sketch of such a filter 
follows below.)
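
To make the idea concrete, here is a minimal sketch of such a lookup-table filter in plain Java. It is not a drop-in Nutch URLFilter plugin (the plugin wiring and the exact URLFilter interface of your Nutch version are left out), and the class name, the host-to-patterns map, and the "empty list means whole host allowed" rule are assumptions for illustration.

import java.net.URL;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

/**
 * Hypothetical host-keyed whitelist filter: instead of testing 100,000 regexes
 * against every URL, hash on the host first and only test that host's patterns.
 */
public class HostTableFilter {

    // host -> path patterns allowed for that host; an empty list means "accept everything on this host".
    private final Map<String, List<Pattern>> table = new HashMap<String, List<Pattern>>();

    public void allow(String host, List<Pattern> pathPatterns) {
        table.put(host.toLowerCase(), pathPatterns);
    }

    /** Returns the URL if it passes, or null to drop it (the usual Nutch URLFilter convention). */
    public String filter(String urlString) {
        try {
            URL url = new URL(urlString);
            List<Pattern> patterns = table.get(url.getHost().toLowerCase());
            if (patterns == null) {
                return null;            // host is not in the whitelist
            }
            if (patterns.isEmpty()) {
                return urlString;       // whole host allowed, no per-path check
            }
            for (Pattern p : patterns) {
                if (p.matcher(url.getPath()).find()) {
                    return urlString;   // one of this host's patterns matched
                }
            }
            return null;
        } catch (Exception e) {
            return null;                // malformed URL: reject
        }
    }
}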


thanks,
-AJ


Jay Lorenzo wrote:

Thanks, that's good information - it sounds like I need to take a closer 
look at index deployment to see what the best solution is for automating 
index management.


The initial email was more about understanding what the envisioned workflow 
would be for automating the creation of those indexes in an NDFS system, 
meaning what choices are available for automating the 
fetchlist->crawl->updateDb->index part of the equation when you have a node 
hosting a webdb and a number of nodes crawling and indexing.

If I use a message-based system, I assume I would create new fetchlists at 
given locations in NDFS and message the fetchers where to find them. Once 
crawled, I then need to update the webdb with the links discovered during 
the crawl.


Maybe this is too complex a solution, but my sense is that map-reduce 
systems still need a way to manage the workflow/control that has to occur 
if you want to create pipelines that generate indexes.


Thanks,

Jay Lorenzo

On 8/31/05, Doug Cutting <[EMAIL PROTECTED]> wrote:
 


I assume that in most NDFS-based configurations the production search
system will not run out of NDFS. Rather, indexes will be created
offline for a deployment (i.e., merging things to create an index per
search node), then copied out of NDFS to the local filesystem on a
production search node and placed in production. This can be done
incrementally, where new indexes are deployed without re-deploying old
indexes. In this scenario, new indexes are rotated in replacing old
indexes, and the .del file for every index is updated, to reflect
deduping. There is no code yet which implements this.

Is this what you were asking?

Doug


Jay Lorenzo wrote:
   


I'm pretty new to nutch, but in reading through the mail lists and other
papers, I don't think I've really seen any discussion on using ndfs with
respect to automating end to end workflow for data that is going to be
searched (fetch->index->merge->search).

The few crawler designs I'm familiar with typically have spiders
(fetchers) and
indexers on the same box. Once pages are crawled and indexed the indexes
are pipelined to merge/query boxes to complete the workflow.

When I look at the nutch design and ndfs, I'm assuming the design intent
for 'pure ndfs' workflow is for the webdb to generate segments on a ndfs
partition, and once the updating of the webdb is completed, the segments
are processed 'on-disk' by the subsequent
fetcher/index/merge/query mechanisms. Is this a correct assumption?

Automating this kind of continuous workflow usually is dependent on the
implementation of some kind of control mechanism to assure that the
correct sequence of operations is performed.

Are there any recommendations on the best way to automate this
workflow when using ndfs? I've prototyped a continuous workflow system
using a traditional pipeline model with per stage work queues, and I see
how that could be applied to a clustered filesystem like ndfs, but I'm
curious to hear what the design intent or best practice is envisioned
for automating ndfs based implementations.


Thanks,

Jay

 



 



--
AJ (Anjun) Chen, Ph.D.
Canova Bioconsulting 
Marketing * BD * Software Development

748 Matadero Ave., Palo Alto, CA 94306, USA
Cell 650-283-4091, [EMAIL PROTECTED]
---


Re: Automating workflow using ndfs

2005-09-02 Thread AJ Chen
From reading http://wiki.apache.org/nutch/DissectingTheNutchCrawler, it 
seems that a new urlfilter is a good place to extend the inclusion-regex 
capability.  The new urlfilter would be defined by the urlfilter.class 
property, which gets loaded by the URLFilterFactory.
A regex is necessary because you want to include URLs matching certain 
patterns.


Can anybody who has implemented a URLFilter plugin before share some 
thoughts on this approach? I expect the new filter must have all the 
capabilities that the current RegexURLFilter.java has, so that it won't 
require changes in any other classes. The difference is that the new filter 
uses a hash table to efficiently look up the regexes for included domains 
(a large number!).


BTW, I can't find the urlfilter.class property in any of the configuration 
files in Nutch 0.7. Does version 0.7 still support the urlfilter extension? 
Is there any difference relative to what's described in the 
DissectingTheNutchCrawler doc cited above?


Thanks,
AJ

Earl Cahill wrote:

The goal is to 
avoid entering 100,000 regex in the
craw-urlfilter.xml and checking ALL 
these regex for each URL. Any comment?
   



Sure seems like just some hash look up table could
handle it.  I am having a hard time seeing when you
really need a regex and a fixed list wouldn't do. 
Especially if you have forward and maybe a backwards

lookup as well in a multi-level hash, to perhaps
include/exclude at a certain subdomain level, like

include: com->site->good (for good.site.com stuff)
exclude: com->site->bad (for bad.site.com)

and kind of walk backwards, kind of like dns.  Then
you could just do a few hash lookups instead of
100,000 regexes.

I realize I am talking about host and not page level
filtering, but if you want to include everything from
your 100,000 sites, I think such a strategy could
work.

Hope this makes sense.  Maybe I could write some code
and see if it works in practice.  If nothing else,
maybe the hash stuff could just be another filter
option in conf/crawl-urlfilter.txt.

Earl


 



--
AJ (Anjun) Chen, Ph.D.
Canova Bioconsulting 
Marketing * BD * Software Development

748 Matadero Ave., Palo Alto, CA 94306, USA
Cell 650-283-4091, [EMAIL PROTECTED]
---


Re: Automating workflow using ndfs

2005-09-02 Thread AJ Chen

Matt,
This is great! It would be very useful to Nutch developers if your code 
can be shared.  I'm sure quite a few applications will benefit from it 
because it fills a gap between whole-web crawling and single site (or a 
handful of sites) crawling.  I'll be interested in adapting your plugin 
to Nutch convention.

Thanks,
-AJ

Matt Kangas wrote:


AJ and Earl,

I've implemented URLFilters before. In fact, I have a  
WhitelistURLFilter that implements just what you describe: a  
hashtable of regex-lists. We implemented it specifically because we  
want to be able to crawl a large number of known-good paths through  
sites, including paths through CGIs. The hash is a Nutch ArrayFile,  
which provides low runtime overhead. We've tested it on 200+ sites  
thus far, and haven't seen any indication that it will have problems  
scaling further.


The filter and its supporting WhitelistWriter currently rely on a few  
custom classes, but it should be straightforward to adapt to Nutch  
naming conventions, etc. If you're interested in doing this work, I  
can see if it's ok to publish our code.


BTW, we're currently alpha-testing the site that uses this plugin,  
and preparing for a public beta. I'll be sure to post here when we're  
finally open for business. :)


--Matt


On Sep 2, 2005, at 11:43 AM, AJ Chen wrote:

From reading http://wiki.apache.org/nutch/DissectingTheNutchCrawler, 
it seems that a new urlfilter is a good  place to extend the 
inclusion regex capability.  The new urlfilter  will be defined by 
urlfilter.class property, which gets loaded by  the URLFilterFactory.
Regex is necessary because you want to include urls matching  certain 
patterns.


Can anybody who implemented URLFilter plugin before share some  
thoughts about this approach? I expect the new filter must have all  
capabilities that the current RegexURLFilter.java has so that it  
won't require change in any other classes. The difference is that  
the new filter uses a hash table for efficiently looking up regex  
for included domains (a large number!).


BTW, I can't find urlfilter.class property in any of the  
configuration files in Nutch-0.7. Does 0.7 version still support  
urlfilter extension? Any difference relative to what's described in  
the doc DissectingTheNutchCrawler cited above?


Thanks,
AJ

Earl Cahill wrote:



The goal is to avoid entering 100,000 regex in the
craw-urlfilter.xml and checking ALL these regex for each URL. Any  
comment?





Sure seems like just some hash look up table could
handle it.  I am having a hard time seeing when you
really need a regex and a fixed list wouldn't do. Especially if  you 
have forward and maybe a backwards

lookup as well in a multi-level hash, to perhaps
include/exclude at a certain subdomain level, like

include: com->site->good (for good.site.com stuff)
exclude: com->site->bad (for bad.site.com)

and kind of walk backwards, kind of like dns.  Then
you could just do a few hash lookups instead of
100,000 regexes.

I realize I am talking about host and not page level
filtering, but if you want to include everything from
your 100,000 sites, I think such a strategy could
work.

Hope this makes sense.  Maybe I could write some code
and see if it works in practice.  If nothing else,
maybe the hash stuff could just be another filter
option in conf/crawl-urlfilter.txt.

Earl




--
AJ (Anjun) Chen, Ph.D.
Canova Bioconsulting Marketing * BD * Software Development
748 Matadero Ave., Palo Alto, CA 94306, USA
Cell 650-283-4091, [EMAIL PROTECTED]
---



--
Matt Kangas / [EMAIL PROTECTED]





--
AJ (Anjun) Chen, Ph.D.
Canova Bioconsulting 
Marketing * BD * Software Development

748 Matadero Ave., Palo Alto, CA 94306, USA
Cell 650-283-4091, [EMAIL PROTECTED]
---


Re: "db.max.outlinks.per.page" is misunderstood?

2005-09-07 Thread AJ Chen
My understanding is that only up to the maximum number of outlinks are 
processed for a page when updating the web db. I assume the same page 
won't get fetched and processed again in the next fetch/update cycles, 
so you won't get the outlinks exceeding that maximum no matter how many 
cycles you run.


To make sure all of the outlinks are processed for a page, 
db.max.outlinks.per.page must be set to a number larger than the 
number of outlinks on the page. If that is true, then the max number 
would have to be determined at run time, since the number of outlinks 
varies from page to page. 


Is my understanding correct?

AJ
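
For illustration only, the behavior being discussed amounts to a per-page truncation along these lines. This is a sketch, not the actual Nutch code; the class and method are hypothetical, though the property name and the NutchConf.get().getInt(...) pattern for reading it both appear elsewhere in this archive.

import java.util.List;

public class OutlinkLimit {

    // In Nutch 0.7 the limit would be read roughly as:
    //   int max = NutchConf.get().getInt("db.max.outlinks.per.page", 100);
    // It is hard-coded here to keep the sketch self-contained.
    static final int MAX_OUTLINKS_PER_PAGE = 100;

    /** Keeps only the first N outlinks of a page; the rest are simply never added. */
    static List<String> limitOutlinks(List<String> outlinks) {
        int keep = Math.min(outlinks.size(), MAX_OUTLINKS_PER_PAGE);
        return outlinks.subList(0, keep);
    }
}

As Stefan Groschupf notes in the follow-up thread, the limit applies per page, not per fetch cycle, so pages with 90, 80, or 50 outlinks are under the default limit and keep all of their outlinks.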


Jack Tang wrote:


Hi All

Here is the "db.max.outlinks.per.page" property and its description in
nutch-default.xml

  <property>
    <name>db.max.outlinks.per.page</name>
    <value>100</value>
    <description>The maximum number of outlinks that we'll process for a
    page.</description>
  </property>

I don't think the description is right.
Say, my crawler feeds are:
http://www.a.com/index.php (90 outlinks)
http://www.b.com/index.jsp  (80 outlinks)
http://www.c.com/index.html (50 outlinks)

and the number of crawler threads is 30. Do you think the remaining URLs
((80 - 10) outlinks + 50 outlinks) will be fetched?

I think the description should be "The maximum number of outlinks in
one fetching phase."


Regards
/Jack
 



--
AJ (Anjun) Chen, Ph.D.
Canova Bioconsulting 
Marketing * BD * Software Development

748 Matadero Ave., Palo Alto, CA 94306, USA
Cell 650-283-4091, [EMAIL PROTECTED]
---


Re: "db.max.outlinks.per.page" is misunderstood?

2005-09-07 Thread AJ Chen

Jack,
Set the max to 100, but run 10 cycles (i.e., depth=10) with the 
CrawlTool. You may see all the outlinks are collected toward the end.  3 
cycles is usually not enough.

-AJ

Jack Tang wrote:


Yes, Stefan.
But it missed some URLs, and I set the value to 3000, then everything is OK

/Jack

On 9/8/05, Stefan Groschupf <[EMAIL PROTECTED]> wrote:
 


Jack,
That is max outlinks per html page.
All your example pages have less than 100 outlinks, right?!
Stefan

Am 07.09.2005 um 18:43 schrieb Jack Tang:

   


Hi All

Here is the "db.max.outlinks.per.page" property and its description in
nutch-default.xml
   
  <property>
    <name>db.max.outlinks.per.page</name>
    <value>100</value>
    <description>The maximum number of outlinks that we'll
    process for a page.</description>
  </property>

I don't think the description is right.
Say, my crawler feeds are:
http://www.a.com/index.php (90 outlinks)
http://www.b.com/index.jsp  (80 outlinks)
http://www.c.com/index.html (50 outlinks)

and the number of crawler threads is 30. Do you think the remaining URLs
((80 - 10) outlinks + 50 outlinks) will be fetched?

I think the description should be "The maximum number of outlinks in
one fetching phase."


Regards
/Jack
--
Keep Discovering ... ...
http://www.jroller.com/page/jmars


 


---
company:http://www.media-style.com
forum:http://www.text-mining.org
blog:http://www.find23.net




   




 



--
AJ (Anjun) Chen, Ph.D.
Canova Bioconsulting 
Marketing * BD * Software Development

748 Matadero Ave., Palo Alto, CA 94306, USA
Cell 650-283-4091, [EMAIL PROTECTED]
---


Re: fetch performance

2005-09-09 Thread AJ Chen
Hi Andrzej,
Thanks for the suggestion. I'm using the pdf plugin that
comes with Nutch from SVN.  Where can I get the unreleased
PDFBox version 0.7.2 that works for you?
-AJ



On 9/9/05, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
> 
> AJ wrote:
> > I tried to run 10 cycles of fetch/updatedb. In the 3rd cycle, the fetch
> > list had 8810 urls. Fetch ran pretty fast on my laptop before 4000
> > pages were fetched. After 4000 pages, it suddenly switched to a very slow
> > speed, about 30 mins for just 100 pages. My laptop also started to run
> > at 100% CPU all the time. Is there a threshold for fetch list size,
> > above which fetch performance will be degraded? Or was it because of my
> > laptop? I know the "-topN" option can control the fetch size. But topN=4000
> > seems too small because it will end up with thousands of segments. Is there
> > a good rule of thumb for the topN setting?
> >
> > A related question is how big a segment should be in order to keep the
> > number of segments small without hitting fetch performance too much. For
> > example, to crawl 1 million pages in one run (has many fetch cycles),
> > what will be a good limit for each fetch list?
> 
> There are no artificial limits like that - I'm routinely fetching
> segments of 1 mln pages. Most likely what happened to you is that:
> 
> * you are using Nutch version with PDFBox 0.7.1 or below
> 
> * you fetched a rare kind of PDF, which puts PDFBox in a tight loop
> 
> * the thread that got stuck is consuming 99% of your CPU. :-)
> 
> Solution: upgrade PDFBox to the yet unreleased 0.7.2 .
> 
> 
> --
> Best regards,
> Andrzej Bialecki <><
> ___. ___ ___ ___ _ _ __
> [__ || __|__/|__||\/| Information Retrieval, Semantic Web
> ___|||__|| \| || | Embedded Unix, System Integration
> http://www.sigram.com Contact: info at sigram dot com
> 
>


Re: fetch performance

2005-09-10 Thread AJ Chen

Andrzej, Thanks.
A related question: Some of the sites I crawl use https: or redirect to 
https:.  Nutch's default settings do not recognize https: as a valid URL. 
Is there a way to crawl URLs starting with "https:"?


-AJ


Andrzej Bialecki wrote:


AJ Chen wrote:


Hi Andrzej,
Thanks for the suggestion. I'm using pdf plugin that
comes with nutch from vsn.  Where to get the PDFBox
unreleased version 0.7.2 that works for you? 



http://www.pdfbox.com/dist

If you are not too familiar with the classpath setting in plugin.xml 
then it's better to just replace the old JAR with the new one, but 
keeping the same name as the old JAR.






Re: fetch performance

2005-09-10 Thread AJ Chen
Nutch 0.7's default plugin.includes property does not include 
protocol-httpclient. After it's added, crawling does recognize https 
URLs.  Thanks.  However, there are still two kinds of errors related to 
https.


(1) NoRouteToHostException.  It occurs very often, for example,

050910 150336 fetching https://www.picoscript.com/products.aspx
050910 150336 fetch of https://www.picoscript.com/products.aspx failed 
with: java.lang.Exception: java.net.NoRouteToHostException: No route to host: connect

(2) It does not recognize an https URL redirected from an http URL. This 
occurs very often, for example:


050910 150341 fetch of 
http://www.cellsciences.com/content/c2-contact.asp failed with: 
java.lang.Exception: org.apache.n
utch.protocol.http.HttpException: Not an HTTP 
url:https://www.cellsciences.com/content/c2-contact.asp


Any idea what happens?

-AJ

Andrzej Bialecki wrote:


AJ Chen wrote:


Andrzej, Thanks.
A related question: Some of the sites I crawl use https: or redirect 
to https:.  Nutch default setting does not recognize https: as valid 
url. Is there a way to crawl url starting with "https:"?



Which version of Nutch? 0.7 recognizes and supports https urls, 
through the protocol-httpclient plugin.






how to deal with large/slow sites

2005-09-11 Thread AJ Chen
In vertical crawling, there are always some large sites that have tens 
of thousands of pages. Fetching a page from these large sites very often 
returns "retry later" because http.max.delays is exceeded.  Setting 
appropriate values for http.max.delays and fetcher.server.delay can 
minimize this kind of URL dropping. However, in my application I still 
see 20-50% of URLs dropped from a few large sites even with a pretty 
long delay setting: http.max.delays=20 and fetcher.server.delay=5.0, 
effectively 100 seconds per host.


Two questions:
(1) Is there a better approach to deep-crawl large sites?  Should we 
treat large sites differently from smaller sites?  I notice Doug and 
Andrzej had discussed potential solutions to this problem, but does 
anybody have a good short-term solution?


(2) Will the dropped URLs be picked up again in subsequent cycles of 
fetchlist/segment/fetch/updatedb?  If so, running more cycles should 
eventually fetch the dropped URLs.  Does db.default.fetch.interval 
(default is 30 days) influence when the dropped URLs will be fetched 
again?


Appreciate your advice.
AJ



how to reuse webDB with new urls

2005-09-13 Thread AJ Chen
Once I create a webDB, can I inject new root URLs into the same webDB 
repeatedly? After each injection, I would run as many cycles of 
generate/fetch/updatedb as needed to fetch all web pages from the new sites. 
I think this will allow me to gradually build a comprehensive vertical site. 
Any comment or suggestion?
-AJ


Re: how to reuse webDB with new urls

2005-09-14 Thread AJ Chen
Before re-injecting a new set of URLs into the webdb, I'll wait until all 
fetch operations (generate + fetch + updatedb) are done.  I'm not sure 
whether it's necessary, but it's cleaner.


One more question: Should I run UpdateSegmentsFromDb to update the 
segments before any new injection?  Does segment updating affect URL 
injection and fetchlist generation? 


-AJ

Jay Lorenzo wrote:

What about the issue of maintaining some semblance of ACIDity? Don't you 
have to make sure that the generation of fetchlists and the updates are run 
synchronously, ie one update or generate at a time?


On 9/13/05, Michael Ji <[EMAIL PROTECTED]> wrote:
 


I think this scenario will work.

Just a bit worry about the filter performance if the
domain site number is in scale of thundreds of
thousands.

Michael Ji

--- AJ Chen <[EMAIL PROTECTED]> wrote:

   


Once I create a webDB, can I inject new root urls to
the same webDB
repeatly? After each injection, run as many cycles
of
generate/fetch/updatedb to fetch all web pages from
the new sites. I think
this will allow me to gradually build a
comprehensive vertical site. Any
comment or suggestion?
-AJ

 





   



 



--
AJ (Anjun) Chen, Ph.D.
Canova Bioconsulting 
Marketing * BD * Software Development

748 Matadero Ave., Palo Alto, CA 94306, USA
Cell 650-283-4091, [EMAIL PROTECTED]
---


saving log file

2005-09-20 Thread AJ Chen
Following the tutorial, I redirect the log messages to a log file. But 
when crawling 1 million pages, this log file can become huge, and writing 
log messages to a huge file can slow down the fetching process.  Is 
there a better way to manage the log?  Maybe saving it to a series of 
smaller files? I'd appreciate your suggestions.

-AJ



Re: saving log file

2005-09-21 Thread AJ Chen

Jerome, thanks a lot. This is helpful.
-AJ

Jérôme Charron wrote:


Following the tutorial, I redirect the log messages to a log file. But,
when crawling 1 million pages, this log file can become hugh and writing
log messages to a huge file can slow down the fetching process. Is
there a better way to manage the log? maybe saving it to a series of
smaller files? appreciate your suggestions.
   



Change your JDK logging behavior in $JAVA_HOME/jre/lib/logging.properties 
by adding a FileHandler with a limit greater than 0.
The logs will then be rotated across many log files (each with a maximum 
size = limit).
See also 
http://java.sun.com/j2se/1.4.2/docs/api/java/util/logging/package-summary.html

http://java.sun.com/j2se/1.4.2/docs/api/java/util/logging/FileHandler.html

Regards

Jérôme
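
If you would rather configure this in code than in logging.properties, a minimal sketch with the standard java.util.logging API looks like the following; the file pattern, size limit, and file count are arbitrary example values.

import java.io.IOException;
import java.util.logging.FileHandler;
import java.util.logging.Logger;
import java.util.logging.SimpleFormatter;

public class RotatingLogSetup {
    public static void main(String[] args) throws IOException {
        // Rotate across 10 files of at most ~10 MB each; %g is the generation number.
        FileHandler handler = new FileHandler("fetcher.%g.log", 10 * 1024 * 1024, 10, true);
        handler.setFormatter(new SimpleFormatter());
        // Attach to the root logger so existing JDK-logging output (as Nutch 0.7 uses) is captured.
        Logger.getLogger("").addHandler(handler);
        Logger.getLogger(RotatingLogSetup.class.getName()).info("rotating log configured");
    }
}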
 



what contibute to fetch slowing down

2005-09-28 Thread AJ Chen
I started the crawler with about 2000 sites.  The fetcher could achieve 
7 pages/sec initially, but the performance gradually dropped to about 2 
pages/sec, sometimes even 0.5 pages/sec.  The fetch list had 300k pages 
and I used 500 threads. What are the main causes of this slowdown? 
Below is some sample status output:


050927 005952 status: segment 20050927005922, 100 pages, 3 errors, 
1784615 bytes, 14611 ms

050927 005952 status: 6.8441586 pages/s, 954.2334 kb/s, 17846.15 bytes/page
050927 010005 status: segment 20050927005922, 200 pages, 9 errors, 
3656863 bytes, 28170 ms
050927 010005 status: 7.0997515 pages/s, 1014.1726 kb/s, 18284.314 
bytes/page


after sometime ...
050927 171818 status: segment 20050927070752, 101400 pages, 7201 errors, 
2593026554 bytes, 36216316 ms

050927 171818 status: 2.799843 pages/s, 559.3617 kb/s, 25572.254 bytes/page
050927 171832 status: segment 20050927070752, 101500 pages, 7204 errors, 
2595591632 bytes, 36230516 ms

050927 171832 status: 2.8015058 pages/s, 559.6956 kb/s, 25572.332 bytes/page

Thanks,
AJ



Re: what contibute to fetch slowing down

2005-10-02 Thread AJ Chen
Update on the fetch performance of my current run: download speed has been
stable at 3.8 pages/sec, about 640 kbps. This is probably limited by my
bandwidth - regular DSL service, promising up to 1.5 Mbps inbound but
realistically only 640 kbps.

More than 1 million pages were fetched, but it took several days at the current
speed - just too slow. I'm planning to get more bandwidth. Could someone
share their experience on what stable rate (pages/sec) can be achieved using
a 3 Mbps or 10 Mbps inbound connection?

Thanks,
AJ


On 9/28/05, AJ Chen <[EMAIL PROTECTED]> wrote:
>
> I started the crawler with about 2000 sites. The fetcher could achieve
> 7 pages/sec initially, but the performance gradually dropped to about 2
> pages/sec, sometimes even 0.5 pages/sec. The fetch list had 300k pages
> and I used 500 threads. What are the main causes of this slowing down?
> Below are sample status:
>
> 050927 005952 status: segment 20050927005922, 100 pages, 3 errors,
> 1784615 bytes, 14611 ms
> 050927 005952 status: 6.8441586 pages/s, 954.2334 kb/s, 17846.15 bytes/page
> 050927 010005 status: segment 20050927005922, 200 pages, 9 errors,
> 3656863 bytes, 28170 ms
> 050927 010005 status: 7.0997515 pages/s, 1014.1726 kb/s, 18284.314
> bytes/page
>
> after sometime ...
> 050927 171818 status: segment 20050927070752, 101400 pages, 7201 errors,
> 2593026554 bytes, 36216316 ms
> 050927 171818 status: 2.799843 pages/s, 559.3617 kb/s, 25572.254 bytes/page
> 050927 171832 status: segment 20050927070752, 101500 pages, 7204 errors,
> 2595591632 bytes, 36230516 ms
> 050927 171832 status: 2.8015058 pages/s, 559.6956 kb/s, 25572.332 bytes/page
>
> Thanks,
> AJ
>
>


fetch speed issue

2005-10-10 Thread AJ Chen
Another observation: when the same size fetch list and same number of
threads were used, the fetcher started at a different speed in different runs,
ranging from 200 kb/s to 1200 kb/s. I'm using DSL at home, so this variation
in download speed could be due to variation in the DSL connection. If using
stable connections like T1 or fiber, I expect the fetcher should start at
the same speed. Could someone using a T1 line or fiber connection verify that
the fetcher always starts at a similar speed? Given a large enough number of
threads, does your fetcher always reliably achieve the maximum speed, i.e.
use the full bandwidth of the connection?

Thanks,
AJ


Re: [jira] Updated: (NUTCH-109) Nutch - Fetcher - Performance Test - new Protocol-HTTPClient-Innovation

2005-10-11 Thread AJ Chen
Fuad,
Several days for 120,000 pages? That's very slow. Could you show some status
lines in the log file? (grep "status:") What's the bandwidth you have?

-AJ

On 10/11/05, Fuad Efendi (JIRA) <[EMAIL PROTECTED]> wrote:
>
> [ http://issues.apache.org/jira/browse/NUTCH-109?page=all ]
>
> Fuad Efendi updated NUTCH-109:
> --
>
> Summary: Nutch - Fetcher - Performance Test - new
> Protocol-HTTPClient-Innovation (was: Nutch - Fetcher - HTTP - Performance
> Testing & Tuning)
>
> I performed performance tests, using default Apache HTTPD Web-Server
> installation, with crawled 120,000 pages (I used Teleport Ultra to crawl
> HTML pages from www.apache.org , I spent probably
> 10 hours)
>
> Everything run in a separate LAN, Windows XP (Client with Nutch 0.7.1),
> and Suse Linux 9.3 (Server with Apache)
>
> I measured crawl for 21,000 pages (Depth=6, Threads=20) (it seems to take
> few days to crawl all 120,000 pages):
>
> Protocol-HTTPClient-Innovation:
> 1,321,470 milliseconds
>
> Protocol-HTTP:
> 26,946,076 milliseconds
>
> Protocol-HttpClient:
> 27,062,854 milliseconds
>
>
> P.S.
> Please note, Protocol-HTTPClient-Innovation plugin is only basic version,
> v.0.1.0,
> HttpFactory is growing and contains cache (3 TCP connections per Host)
> http://www.innovation.ch/java/HTTPClient/ is very old but _production_
> level... style of a source code may seem too old... you may need to change
> "enum" to "enumeration" in downloaded source files in order to compile it
> :)))
>
> Very popular load-generating tool is based on HTTPClient (Innovation):
> http://grinder.sourceforge.net/
> http://www.innovation.ch/java/HTTPClient/
>
>
> > Nutch - Fetcher - Performance Test - new Protocol-HTTPClient-Innovation
> > ---
> >
> > Key: NUTCH-109
> > URL: http://issues.apache.org/jira/browse/NUTCH-109
> > Project: Nutch
> > Type: Improvement
> > Components: fetcher
> > Versions: 0.7, 0.8-dev, 0.6, 0.7.1
> > Environment: Nutch: Windows XP, J2SE 1.4.2_09
> > Web Server: Suse Linux, Apache HTTPD, apache2-worker, v. 2.0.53
> > Reporter: Fuad Efendi
> > Attachments: protocol-httpclient-innovation-0.1.0.zip
> >
> > 1. TCP connection costs a lot, not only for Nutch and end-point web
> servers, but also for intermediary network equipment
> > 2. Web Server creates Client thread and hopes that Nutch really uses
> HTTP/1.1, or at least Nutch sends "Connection: close" before closing in JVM
> "Socket.close()" ...
> > I need to perform very objective tests, probably 2-3 days; new plugin
> crawled/parsed 23,000 pages for 1,321 seconds; it seems that existing
> http-plugin needs few days...
> > I am using separate network segment with Windows XP (Nutch), and Suse
> Linux (Apache HTTPD + 120,000 pages)
> > Please find attached new plugin based on
> http://www.innovation.ch/java/HTTPClient/
> > Please note:
> > Class HttpFactory contains cache of HTTPConnection objects; each object
> run each thread; each object is absolutely thread-safe, so we can send
> multiple GET requests using single instance:
> > private static int CLIENTS_PER_HOST = NutchConf.get().getInt("
> http.clients.per.host", 3);
> > I'll add more comments after finishing tests...
>
> --
> This message is automatically generated by JIRA.
> -
> If you think it was sent incorrectly contact one of the administrators:
> http://issues.apache.org/jira/secure/Administrators.jspa
> -
> For more information on JIRA, see:
> http://www.atlassian.com/software/jira
>
>


how to make fetcher to use the full bandwidth

2005-10-13 Thread AJ Chen
I try to fetch as fast as possible by using more threads on a large fetch
list. But the fetcher starts downloading at a speed much lower than the full
bandwidth allows, and the starting download speed varies a lot from run to run,
200 kb/s to 1200 kb/s on my DSL line. This variation also happens on a T1 line
that I just tested.
Could someone share experience on how to make the fetcher use the full
bandwidth? We know the speed drops gradually during a long fetch run. But
can the fetch achieve the highest speed the bandwidth allows when the fetch
starts?

AJ


Re: how to make fetcher to use the full bandwidth

2005-10-13 Thread AJ Chen
Thanks, Rod. Were you always able to fill the pipe under the same
conditions? I'm puzzled by the difference in fetch speed even when the same
number of threads and root URLs are used.

I don't have a local DNS server yet. To avoid overwhelming my ISP's DNS server, I
use only 10 threads for the first fetch run, so the fetch speed is
expectedly not great in that run. But in the second fetch run, I use 500
threads and it can fill the pipe sometimes, but most of the time uses 1/5 of the
pipe. The number of hosts, >1500, may be small. How many hosts are usually
used in your crawls?

AJ


On 10/13/05, Rod Taylor <[EMAIL PROTECTED]> wrote:
>
> On Thu, 2005-10-13 at 13:35 -0700, AJ Chen wrote:
> > I try to fetch as fast as it can by using more threads on a large fetch
> > list. But, the fetcher starts download at speed much lower than the full
> > bandwidth allows. And the start download speed varies a lot from run to
> run,
> > 200kb/s to 1200kb/s on my DSL line. This variation also happens on T1
> line
> > that I just tested.
> > Could someone share experience on how to make fetcher use the full
> > bandwidth? We know the speed drops gradually during a long fetch run.
> But,
> > can the fetch achieve the highest speed allowed by the bandwidth when
> fetch
> > starts?
>
> I found that for high bandwidth (50Mbits and above) DNS seems to be a
> limiting factor.
>
> 4000 threads with a local caching DNS server seems to be enough to fill
> the pipe though
>
> --
> Rod Taylor <[EMAIL PROTECTED]>
>
>
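
Rod's point about DNS can be partially addressed inside the JVM as well. The snippet below is only a sketch, not a replacement for a local caching DNS server: it turns up the JVM's own DNS caches via the standard networkaddress.cache.ttl security properties, and the TTL values are arbitrary examples.

import java.net.InetAddress;
import java.security.Security;

public class DnsCacheTuning {
    public static void main(String[] args) throws Exception {
        // Cache successful host lookups for one hour instead of the JVM default.
        // These properties must be set before the first lookup is performed.
        Security.setProperty("networkaddress.cache.ttl", "3600");
        // Cache failed lookups briefly so dead hosts don't hammer the resolver.
        Security.setProperty("networkaddress.cache.negative.ttl", "300");

        // Example lookup; repeated resolutions of the same host now hit the in-JVM cache.
        InetAddress addr = InetAddress.getByName("lucene.apache.org");
        System.out.println(addr.getHostAddress());
    }
}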


merge indices from multiple webdb

2005-10-25 Thread AJ Chen
Has anyone merged indices from two separate webdbs? I have two separate webdbs
and need to find a good way to combine them for unified search.
AJ


Re: merge indices from multiple webdb

2005-10-25 Thread AJ Chen
Thanks so much, Graham. This should do it.
A related question: After the merge, is it possible to build the new webdb
as well? The link data for the merged db can be different from the two
original db. In order to have accurate page ranking, the link data should be
updated.

AJ

On 10/25/05, Graham Stead <[EMAIL PROTECTED]> wrote:
>
> I am by no means a Nutch expert yet, but this is how I merged two
> separate segments so I could search through them:
>
> Step 1:
> $ bin/nutch mergesegs -local -o testmerge -i
> ../crawls/foo/segments/20051018224434/
> ../crawls/bar/segments/20051018225505/
> < bunch of stuff happens >
>
> This creates a segment 20051023112848 in the testmerge folder. The
> segment contains a combined index as well as copies of all information
> from the two input segments.
>
> Step 2:
> This wasn't quite enough to search with, however. I copied the index
> folder and organized the directories into the same structure as used
> during a crawl, then was able to run the Tomcat searcher on the new
> segment.
>
> After copying/moving/reorganizing I have:
>
> $ ls -l testmerge/
> total 0
> drwxrwxrwx+ 2 Oct 23 11:42 index
> drwxrwxrwx+ 3 Oct 23 11:42 segments
>
> $ ls -l testmerge/segments/
> total 0
> drwxrwxrwx+ 7 Oct 23 11:28 20051023112848
>
>
> Step 3:
> Then place this in Tomcat's nutch-site.xml file:
>
> 
> <property>
>   <name>searcher.dir</name>
>   <value>C:\path_to_testmerge\testmerge</value>
> </property>
> 
>
> Run Tomcat and search away.
>
> Hope this helps,
> -Graham
>
> > -Original Message-
> > From: AJ Chen [mailto:[EMAIL PROTECTED]
> > Sent: Tuesday, October 25, 2005 4:03 PM
> > To: nutch-dev@lucene.apache.org
> > Subject: merge indices from multiple webdb
> >
> > Has anyone merged indices from two separate webdb? I have two
> > separate webdb and need to find a good way to combine them
> > for unified search.
> > AJ
> >
>


Re: merge indices from multiple webdb

2005-10-25 Thread AJ Chen
How do you build a new webdb from the merged segment/index? Could you provide
detailed steps for the process you described? Thanks.

AJ

On 10/25/05, Andrey Ilinykh <[EMAIL PROTECTED]> wrote:
>
> If you merge two segments page ranks are off. You have to build new webdb,
> calculate page rank and then build one more segment again.
>
> Thank you,
> Andrey
>
> -Original Message-
> From: AJ Chen [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, October 25, 2005 2:02 PM
> To: nutch-dev@lucene.apache.org
> Subject: Re: merge indices from multiple webdb
>
>
> Thanks so much, Graham. This should do it.
> A related question: After the merge, is it possible to build the new webdb
> as well? The link data for the merged db can be different from the two
> original db. In order to have accurate page ranking, the link data should
> be
> updated.
>
> AJ
>
> On 10/25/05, Graham Stead <[EMAIL PROTECTED]> wrote:
> >
> > I am by no means a Nutch expert yet, but this is how I merged two
> > separate segments so I could search through them:
> >
> > Step 1:
> > $ bin/nutch mergesegs -local -o testmerge -i
> > ../crawls/foo/segments/20051018224434/
> > ../crawls/bar/segments/20051018225505/
> > < bunch of stuff happens >
> >
> > This creates a segment 20051023112848 in the testmerge folder. The
> > segment contains a combined index as well as copies of all information
> > from the two input segments.
> >
> > Step 2:
> > This wasn't quite enough to search with, however. I copied the index
> > folder and organized the directories into the same structure as used
> > during a crawl, then was able to run the Tomcat searcher on the new
> > segment.
> >
> > After copying/moving/reorganizing I have:
> >
> > $ ls -l testmerge/
> > total 0
> > drwxrwxrwx+ 2 Oct 23 11:42 index
> > drwxrwxrwx+ 3 Oct 23 11:42 segments
> >
> > $ ls -l testmerge/segments/
> > total 0
> > drwxrwxrwx+ 7 Oct 23 11:28 20051023112848
> >
> >
> > Step 3:
> > Then place this in Tomcat's nutch-site.xml file:
> >
> > 
> > <property>
> >   <name>searcher.dir</name>
> >   <value>C:\path_to_testmerge\testmerge</value>
> > </property>
> > 
> >
> > Run Tomcat and search away.
> >
> > Hope this helps,
> > -Graham
> >
> > > -Original Message-
> > > From: AJ Chen [mailto:[EMAIL PROTECTED]
> > > Sent: Tuesday, October 25, 2005 4:03 PM
> > > To: nutch-dev@lucene.apache.org
> > > Subject: merge indices from multiple webdb
> > >
> > > Has anyone merged indices from two separate webdb? I have two
> > > separate webdb and need to find a good way to combine them
> > > for unified search.
> > > AJ
> > >
> >
>


debug JSP with eclipse

2005-10-29 Thread AJ Chen
I'm using Eclipse for the Nutch Java code and trying to set up Eclipse for
debugging JSP pages. I have the WST plugin installed, created a new dynamic
web project called nutch071web, and imported all the web content and jars.
But it fails to run the index.jsp page; see the error message below. Is anyone
using Eclipse to debug Nutch JSP pages? I would appreciate some pointers from
you.

Oct 29, 2005 11:06:24 PM org.apache.catalina.core.StandardContextresourcesStart
SEVERE: Error starting static Resources
java.lang.IllegalArgumentException: Document base
C:\nutch\nutch071web\.deployables\nutch071web does not exist or is not a
readable directory
at org.apache.naming.resources.FileDirContext.setDocBase(FileDirContext.java
:140)

AJ


java open source software for Tagging ?

2005-11-07 Thread AJ Chen
Although tagging is not directly related to Nutch, I think combining Nutch
search with the ability to tag search-result pages would be quite powerful.
Has anyone implemented tagging on a Nutch search site? Is there a Java open
source package for tagging functionality?
AJ


severe error in fetch

2005-12-25 Thread AJ Chen
I have repeatedly seen the following severe error while fetching 
400,000 pages with 200 threads.  What may cause "Host connection pool 
not found"? This type of error must be avoided; otherwise the fetcher 
will stop prematurely. 
 
051224 075950 SEVERE Host connection pool not found, 
hostConfig=HostConfiguration[host=https://www.kodak.com]

java.lang.RuntimeException: SEVERE error logged.  Exiting fetcher.

Thanks,
AJ



Re: severe error in fetch

2005-12-25 Thread AJ Chen
Stefan,
Here is the trace in my log.  My SSFetcher (for site-specific fetching) is the
same as the Nutch Fetcher except that the URLFilters it uses include an
additional filter based on domain names. Line 363 is:
    throw new RuntimeException("SEVERE error logged.  Exiting fetcher.");


051224 075950 SEVERE Host connection pool not found,
hostConfig=HostConfiguration[host=https://www.kodak.com]
java.lang.RuntimeException: SEVERE error logged.  Exiting fetcher.
at vscope.crawl.SSFetcher.run(SSFetcher.java:363)
at vscope.crawl.SSFetcher.main(SSFetcher.java:510)
at vscope.crawl.SSCrawler.main(SSCrawler.java:251)

Thanks,
AJ

On 12/25/05, Stefan Groschupf <[EMAIL PROTECTED]> wrote:
>
> Hi,
> Can you provide a detailed stacktrace from the log file?
>
> Stefan
>
> Am 25.12.2005 um 23:38 schrieb AJ Chen:
>
> > I have seen repeatedly the following severe errors during fetching
> > 400,000 pages with 200 threads.  What may cause "Host connection
> > pool not found"? This type of error must be avoided, otherwise the
> > fetcher will stop prematurely.  051224 075950 SEVERE Host
> > connection pool not found, hostConfig=HostConfiguration
> > [host=https://www.kodak.com]
> > java.lang.RuntimeException: SEVERE error logged.  Exiting fetcher.
> >
> > Thanks,
> > AJ
> >
> >
>
> ---
> company:http://www.media-style.com
> forum:http://www.text-mining.org
> blog:http://www.find23.net
>
>
>
>


Re: severe error in fetch

2005-12-30 Thread AJ Chen
This problem is recurring. It happens when fetching
https://www.kodak.com:0/something.  I guess the port number 0 is the cause
of the problem, because there is no problem fetching
https://www.kodak.com/anything.  See the log entries:

051230 105257 fetching
https://www.kodak.com:0/eknec/PageQuerier.jhtml?pq-path=2/782/2608/2610/4074/7058&pq-locale=en_US&_loopback=1
051230 105305 SEVERE Host connection pool not found,
hostConfig=HostConfiguration[host=https://www.kodak.com]
java.lang.RuntimeException: SEVERE error logged.  Exiting fetcher.

Is it right that some specific port numbers can cause the connection pool
problem in httpclient? If so, I can filter out URLs containing these
troublesome ports until httpclient is fixed. (A small sketch of such a
filter follows below.)
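
Filtering out such URLs before they reach httpclient is straightforward; below is a minimal sketch of the idea in plain Java (treating an explicit port 0 as bogus is the only rule shown, and the URLFilter-style return convention is an assumption for illustration).

import java.net.URL;

public class BadPortFilter {

    /** Returns the URL unchanged if its port looks sane, or null to drop it. */
    static String filter(String urlString) {
        try {
            URL url = new URL(urlString);
            int port = url.getPort();   // -1 means "no explicit port", which is fine
            if (port == 0) {
                return null;            // explicit port 0, as in https://host:0/..., is bogus
            }
            return urlString;
        } catch (Exception e) {
            return null;                // malformed URL: reject
        }
    }

    public static void main(String[] args) {
        System.out.println(filter("https://www.kodak.com:0/eknec/PageQuerier.jhtml")); // null
        System.out.println(filter("https://www.kodak.com/products"));                  // passes through
    }
}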

Thanks,
AJ

On 12/26/05, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
>
> AJ Chen wrote:
>
> >Stefan,
> >Here is the trace in my log.  My SSFetcher (for site-specific fetch) is
> the
> >same as nutch Fetcher except that the URLFilters it uses has additional
> >filter based on domain names. Line 363 is
> >throw new RuntimeException("SEVERE error logged.  Exiting
> >fetcher.");
> >
> >
> >051224 075950 SEVERE Host connection pool not found,
> >hostConfig=HostConfiguration[host=https://www.kodak.com]
> >
> >
>
> This error comes from the httpclient library (you won't get a better
> stacktrace, you need to redefine the java.util.logging properties to get
> more info). I'm in the process of upgrading to the latest release, but
> it's trivial, you can try it yourself. Hopefully this should solve the
> issue.
>
> --
> Best regards,
> Andrzej Bialecki <><
> ___. ___ ___ ___ _ _   __
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>
>


how to add additional factor at search time to ranking score

2005-12-31 Thread AJ Chen
My vertical search application will use an additional factor for page 
ranking, which is assigned to each page at search time. I'm trying to 
figure out a good way to integrate this additional dynamic factor into 
the Nutch score.  I'd appreciate any suggestions or pointers.


It would be great if I could add some new functions to the Nutch code to 
accomplish this. But if it requires customizing Lucene code, that's 
fine. I have tried to use the most recent release (1.4.3) of the Lucene 
source code, but it did not work.  Are the Lucene jar files included in 
the Nutch release (0.7.1) very different from Lucene 1.4.3?  If yes, is 
it possible to get the source code for the Lucene used in Nutch?


Thanks,
AJ



Re: problems http-client

2006-01-06 Thread AJ Chen
I have started to see this problem recently. topN=20 per crawl, but
fetched pages = 15 - 17, while error pages = 2000 - 5000.  >25000
pages are missing.  This is reproducible with Nutch 0.7.1; both protocol-http
and protocol-httpclient are included.

I also see lots of "Response content length is not known" messages in the log,
but I can't find where they come from.  Which class logs this message?

AJ

On 12/19/05, Stefan Groschupf <[EMAIL PROTECTED]> wrote:
>
> Hi there,
>
> is there someone out there that can confirm a problem we discovered?
>
> We was wondering why not all pages of a  generated segments was
> fetched. The most strange thing was that the  sum of errors and
> sucesspages was never the same as we defined in topN when generating
> a sgemtent .
> First we discovered a problem with the segment size, but I can not
> reproduce the problem anymore with the latest trunk code. :-/
> Very strange since I don't think something changed something but I
> was able to reproduce that the size of the segment is around than 50%
> of the defined size (topN) on 2 different map reduce installations.
>
> Anyway today we note that when fetching with http-client the sum of
> errors and fetched pages is  much less than the size defined when
> generating the segment.
> Changing to protocol-http solves the problem.
> Has anyone also note this behavior?
>
> Thanks for comments.
> Stefan
>
>
>
>
>
>


does nutch follow HEAD element?

2006-06-16 Thread AJ Chen

I'm about to use Nutch to crawl semantic data. Links to semantic data files
(RDF, OWL, etc.) can be placed in two places: (1) in the HEAD, via a <link>
element; (2) in the BODY, via an <a href=...> link.  Does the Nutch crawler
follow the HEAD <link>?

I'm also creating a semantic data publishing tool, and I would appreciate any
suggestions regarding the best way to make RDF files visible to the Nutch
crawler.

There was a brief discussion last year on the topic of crawling the semantic
web.  I believe this is a growing area. I would like to make Nutch a
component of the new semantic data publishing and crawling system that I'm
working on.  It would be great if a Nutch expert could share some pointers
as to how Nutch can optimally support such a system, or how such a system
should be designed to optimally take advantage of Nutch.

Best,
AJ


Re: does nutch follow HEAD element?

2006-06-16 Thread AJ Chen

Andrzej, thanks so much. It's great that Nutch follows the HEAD <link>, since
it's the preferred place for autodiscovery of RDF/OWL data. The type
attribute inside the <link> tag can be set to "application/owl+xml" or
"application/rdf+xml" so that the Nutch crawler knows the linked resource
has RDF/OWL content.

A related question: If I want Nutch to fetch only RDF/OWL files, is it
possible to generate the fetchlist with only the URLs whose type is
"application/owl+xml" or "application/rdf+xml"? Using the file extension does
not always work, because the resource URL may not have an extension like ".rdf".
If Nutch keeps the declared type for each <link> item it finds, then that type
can be used later when selecting URLs for the fetchlist. (A rough
autodiscovery sketch follows below.)
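
As a rough illustration of the autodiscovery step (not Nutch's actual parse-html code), the sketch below pulls <link> elements with an RDF/OWL type attribute out of an HTML page with a simple regex. A real implementation would walk the DOM the parser already builds; the regex here assumes well-formed markup with the type attribute appearing before href, and the example URL is made up.

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RdfLinkDiscovery {

    // Very loose match for <link ... type="application/rdf+xml|owl+xml" ... href="...">.
    private static final Pattern LINK = Pattern.compile(
            "<link\\b[^>]*type=[\"']application/(?:rdf|owl)\\+xml[\"'][^>]*href=[\"']([^\"']+)[\"']",
            Pattern.CASE_INSENSITIVE);

    /** Returns the href values of RDF/OWL <link> elements found in the HTML. */
    static List<String> findSemanticLinks(String html) {
        List<String> hrefs = new ArrayList<String>();
        Matcher m = LINK.matcher(html);
        while (m.find()) {
            hrefs.add(m.group(1));
        }
        return hrefs;
    }

    public static void main(String[] args) {
        String head = "<head><link rel=\"alternate\" type=\"application/rdf+xml\""
                + " href=\"http://example.org/data.rdf\"/></head>";
        System.out.println(findSemanticLinks(head)); // prints [http://example.org/data.rdf]
    }
}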

I plan to use Nutch to crawl specifically for RDF/OWL files and then parse
them into Lucene documents stored in a Lucene index. This Lucene index
of semantic data will be searched from the same Nutch search interface.

Thanks,
AJ

On 6/16/06, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:


AJ Chen wrote:
> I'm about to use nutch to crawl semantic data. Links to semantic data
> files
> (RDF, OWL, etc.) can be placed in two places: (1) HEAD <link ...>; (2)
> BODY <a href...>.  Does nutch crawler follows the HEAD <link>?

Yes. Please see parse-html//DOMContentUtils.java for details.

>
> I'm also creating a semantic data publishing tool, I would appreciate
any
> suggestion regarding the best way to make RDF files visible to nutch
> crawler.

Well, Nutch is certainly not a competitor to an RDF triple-store ;) It
may be used to collect RDF files, and then the map-reduce jobs can be
used to massively process these files to annotate large numbers of
target resources (e.g. add metadata to pages in the crawldb). You could
also load them to a triple store and use that to annotate resources in
Nutch, to provide a better searching experience (e.g. searching by
concept, by semantic relationships, finding similar concepts in other
ontologies, etc).

In the end, the model that Nutch supports the best is the Lucene model,
which is an unordered bag of documents with multiple fields
(properties). If you can translate your required model into this, then
you're all set. Nutch/Hadoop provides also a scalable processing
framework, which is quite useful for enhancing the existing data with
data from external sources (e.g. databases, triplestore, ontologies,
semantic nets and such).

In some cases, when this external infrastructure is efficient enough,
it's possible to combine it on-the-fly (I have successfully used this
approach with WordNet, Wikipedia and DMOZ), in other cases you will need
to do some batch pre-processing to make this external metadata available
as a part of Nutch documents ... again, the framework of map/reduce and
DFS is very useful for that (and I have used this approach too, even
with the same data as above).

--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com





Crawl error

2006-07-09 Thread AJ Chen

I checked out the 0.8 code from trunk and tried to set it up in Eclipse.
When trying to run Crawl from Eclipse using the args "urls -dir crawl -depth 3
-topN 50", I got the following error, which started from LogFactory.getLog(
Crawl.class). Any idea which file was not found?  There is a URL file under
the urls directory. Thanks,

log4j:ERROR setFile(null,true) call failed.
java.io.FileNotFoundException: \ (The system cannot find the path specified)
   at java.io.FileOutputStream.openAppend(Native Method)
   at java.io.FileOutputStream.(FileOutputStream.java:177)
   at java.io.FileOutputStream.(FileOutputStream.java:102)
   at org.apache.log4j.FileAppender.setFile(FileAppender.java:289)
   at org.apache.log4j.FileAppender.activateOptions(FileAppender.java:163)
   at org.apache.log4j.DailyRollingFileAppender.activateOptions(
DailyRollingFileAppender.java:215)
   at org.apache.log4j.config.PropertySetter.activate(PropertySetter.java
:256)
   at org.apache.log4j.config.PropertySetter.setProperties(
PropertySetter.java:132)
   at org.apache.log4j.config.PropertySetter.setProperties(
PropertySetter.java:96)
   at org.apache.log4j.PropertyConfigurator.parseAppender(
PropertyConfigurator.java:654)
   at org.apache.log4j.PropertyConfigurator.parseCategory(
PropertyConfigurator.java:612)
   at org.apache.log4j.PropertyConfigurator.configureRootCategory(
PropertyConfigurator.java:509)
   at org.apache.log4j.PropertyConfigurator.doConfigure(
PropertyConfigurator.java:415)
   at org.apache.log4j.PropertyConfigurator.doConfigure(
PropertyConfigurator.java:441)
   at org.apache.log4j.helpers.OptionConverter.selectAndConfigure(
OptionConverter.java:468)
   at org.apache.log4j.LogManager.<clinit>(LogManager.java:122)
   at org.apache.log4j.Logger.getLogger(Logger.java:104)
   at org.apache.commons.logging.impl.Log4JLogger.getLogger(
Log4JLogger.java:229)
   at org.apache.commons.logging.impl.Log4JLogger.<init>(Log4JLogger.java
:65)
   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
   at sun.reflect.NativeConstructorAccessorImpl.newInstance(
NativeConstructorAccessorImpl.java:39)
   at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(
DelegatingConstructorAccessorImpl.java:27)
   at java.lang.reflect.Constructor.newInstance(Constructor.java:494)
   at org.apache.commons.logging.impl.LogFactoryImpl.newInstance(
LogFactoryImpl.java:529)
   at org.apache.commons.logging.impl.LogFactoryImpl.getInstance(
LogFactoryImpl.java:235)
   at org.apache.commons.logging.impl.LogFactoryImpl.getInstance(
LogFactoryImpl.java:209)
   at org.apache.commons.logging.LogFactory.getLog(LogFactory.java:351)
   at org.apache.nutch.crawl.Crawl.<clinit>(Crawl.java:38)
log4j:ERROR Either File or DatePattern options are not set for appender
[DRFA].

-AJ


Re: [Nutch-dev] Crawl error

2006-07-10 Thread AJ Chen

My classpath has "conf" folder. NUTCH_JAVA_HOME is set. In fact, nutch
0.71is working well from my eclipse. I suspect the error comes from
changes in
verions 0.8. The problem is the log message does not say what file is not
found. So, it's hard to debug.  Any idea?
Thanks,
AJ

On 7/9/06, Stefan Groschupf <[EMAIL PROTECTED]> wrote:


Try to put the "conf" folder to your classpath in eclipse and set the
environemnt variables that are setted in  bin/nutch.

Btw, please do not crosspost.
Thanks.
Stefan
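
Stefan's advice amounts to reproducing what bin/nutch sets up. A minimal sketch of
doing the same from a launcher class, assuming the stock conf/log4j.properties
builds the DRFA appender's file from the hadoop.log.dir and hadoop.log.file system
properties (the class name and values below are only examples):

public class EclipseCrawlLauncher {
  public static void main(String[] args) throws Exception {
    // Set the properties bin/nutch normally passes as -D options, before log4j starts.
    System.setProperty("hadoop.log.dir", "logs");         // any writable directory
    System.setProperty("hadoop.log.file", "hadoop.log");
    org.apache.nutch.crawl.Crawl.main(args);              // e.g. "urls -dir crawl -depth 3 -topN 50"
  }
}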

On 09.07.2006 at 21:47, AJ Chen wrote:

> I checked out the 0.8 code from trunk and tried to set it up in
> eclipse.
> When trying to run Crawl from Eclipse using args "urls -dir crawl -
> depth 3
> -topN 50", I got the following error, which started from
> LogFactory.getLog(
> Crawl.class). Any idea what file was not found?  There is a url
> file under
> directory urls. Thanks,
>
> log4j:ERROR setFile(null,true) call failed.
> java.io.FileNotFoundException: \ (The system cannot find the path
> specified)
>at java.io.FileOutputStream.openAppend(Native Method)
>at java.io.FileOutputStream.(FileOutputStream.java:177)
>at java.io.FileOutputStream.(FileOutputStream.java:102)
>at org.apache.log4j.FileAppender.setFile(FileAppender.java:289)
>at org.apache.log4j.FileAppender.activateOptions
> (FileAppender.java:163)
>at org.apache.log4j.DailyRollingFileAppender.activateOptions(
> DailyRollingFileAppender.java:215)
>at org.apache.log4j.config.PropertySetter.activate
> (PropertySetter.java
> :256)
>at org.apache.log4j.config.PropertySetter.setProperties(
> PropertySetter.java:132)
>at org.apache.log4j.config.PropertySetter.setProperties(
> PropertySetter.java:96)
>at org.apache.log4j.PropertyConfigurator.parseAppender(
> PropertyConfigurator.java:654)
>at org.apache.log4j.PropertyConfigurator.parseCategory(
> PropertyConfigurator.java:612)
>at org.apache.log4j.PropertyConfigurator.configureRootCategory(
> PropertyConfigurator.java:509)
>at org.apache.log4j.PropertyConfigurator.doConfigure(
> PropertyConfigurator.java:415)
>at org.apache.log4j.PropertyConfigurator.doConfigure(
> PropertyConfigurator.java:441)
>at org.apache.log4j.helpers.OptionConverter.selectAndConfigure(
> OptionConverter.java:468)
>at org.apache.log4j.LogManager.(LogManager.java:122)
>at org.apache.log4j.Logger.getLogger(Logger.java:104)
>at org.apache.commons.logging.impl.Log4JLogger.getLogger(
> Log4JLogger.java:229)
>at org.apache.commons.logging.impl.Log4JLogger.
> (Log4JLogger.java
> :65)
>at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
> Method)
>at sun.reflect.NativeConstructorAccessorImpl.newInstance(
> NativeConstructorAccessorImpl.java:39)
>at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(
> DelegatingConstructorAccessorImpl.java:27)
>at java.lang.reflect.Constructor.newInstance(Constructor.java:494)
>at org.apache.commons.logging.impl.LogFactoryImpl.newInstance(
> LogFactoryImpl.java:529)
>at org.apache.commons.logging.impl.LogFactoryImpl.getInstance(
> LogFactoryImpl.java:235)
>at org.apache.commons.logging.impl.LogFactoryImpl.getInstance(
> LogFactoryImpl.java:209)
>at org.apache.commons.logging.LogFactory.getLog(LogFactory.java:
> 351)
>at org.apache.nutch.crawl.Crawl.(Crawl.java:38)
> log4j:ERROR Either File or DatePattern options are not set for
> appender
> [DRFA].
>
> -AJ
>




fetcher status missing in log file

2006-08-30 Thread AJ Chen

I'm using nutch-0.9-dev from svn.  hadoop.log has records from fetching
except the status line. Is there a setting required to print the fetch
status line?  The status is set in Fetcher.java via report.setStatus(string),
but where does the report object print the status?
thanks,
--
AJ Chen
http://web2express.org
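
One plausible reading is that the string handed to report.setStatus() is surfaced by
the MapReduce framework (task tracker UI, or the LocalJobRunner progress lines)
rather than written out by the Fetcher itself, so getting it into hadoop.log would
mean logging it explicitly as well. A rough sketch only; the method and variable
names below are illustrative, not the actual Fetcher ones:

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.mapred.Reporter;

public class FetchStatusSketch {
  private static final Log LOG = LogFactory.getLog(FetchStatusSketch.class);

  static void reportStatus(Reporter reporter, long pages, long errors, float pagesPerSec) {
    String status = pages + " pages, " + errors + " errors, " + pagesPerSec + " pages/s";
    reporter.setStatus(status);  // shown by the framework (task tracker / LocalJobRunner)
    LOG.info(status);            // also written to the configured log file, e.g. hadoop.log
  }
}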


log error in deploying nutch-0.9-dev.jar

2006-09-07 Thread AJ Chen

I'm customizing 0.9-dev code for my vertical search engine.  After rebuilding
the nutch-0.9-dev.jar and putting it into ROOT\WEB-INF\lib, there is an error
when starting Tomcat:
log4j:ERROR setFile(null,true) call failed.
java.io.FileNotFoundException: \ (The system cannot find the path specified)
   at java.io.FileOutputStream.openAppend(Native Method)
   at java.io.FileOutputStream.<init>(Unknown Source)
   at java.io.FileOutputStream.<init>(Unknown Source)
   at org.apache.log4j.FileAppender.setFile(FileAppender.java:289)
   at org.apache.log4j.FileAppender.activateOptions(FileAppender.java
:163)

log4j:ERROR Either File or DatePattern options are not set for appender
[DRFA].

It complains about the log file.  When crawling, the nutch script sets the
directory/file for the hadoop log.  But, when doing search, this is not set.
Does nutch write a log to hadoop.log during web search?  How can I make it
not look for the log file?

Thanks,
AJ
--
http://web2express.org


Re: log error in deploying nutch-0.9-dev.jar

2006-09-07 Thread AJ Chen

This is solved. I had accidentally put log4j.properties into
ROOT\WEB-INF\classes.
-aj

On 9/7/06, AJ Chen <[EMAIL PROTECTED]> wrote:


I'm customizing 0.9-dev code for my vertical search engine.  After rebuild
the nutch-0.9-dev.jar and put it into ROOT\WEB-INF\lib, there is an error
when starting Tomcat:
log4j:ERROR setFile(null,true) call failed.
java.io.FileNotFoundException: \ (The system cannot find the path
specified)
at java.io.FileOutputStream.openAppend(Native Method)
at java.io.FileOutputStream.(Unknown Source)
at java.io.FileOutputStream.(Unknown Source)
at org.apache.log4j.FileAppender.setFile(FileAppender.java:289)
at org.apache.log4j.FileAppender.activateOptions(FileAppender.java
:163)

log4j:ERROR Either File or DatePattern options are not set for appender
[DRFA].

It complains about log file.  When crawling, nutch script sets the
directory/file for hadoop log.  But, when doing search, this is not set.
Does nutch write log to haddop.log during web search?  How to make it not
to look for log file?

Thanks,
AJ
--
http://web2express.org





--
AJ Chen, PhD
http://web2express.org


outlink extractor finds lots of junk

2006-10-23 Thread AJ Chen

During fetching, OutlinkExtractor.getOutlinks() finds lots of junk, such as
the following:
rdf:about=
xmlns:pdf=
http://ns.adobe.com/pdf/1.3/
pdf:Producer
pdf:Producer
rdf:Description
rdf:Description
rdf:about=
xmlns:xap=
http://ns.adobe.com/xap/1.0/
xap:CreatorTool
xap:CreatorTool
xap:ModifyDate
T14:43:23-07:00

This is because the defined URL_PATTERN matches things that are not web
links. Is there a fix for it?  Is there a way to set protocols (e.g. http,
https) for the desired outlinks? That way, only links using one of the
specified protocols would be considered an "outlink".  I'm using 0.9-dev code.

Thanks,
--
AJ Chen, PhD
http://web2express.org
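
A sketch of the kind of protocol restriction asked about above, written as a
standalone helper rather than a patch to OutlinkExtractor's URL_PATTERN: only
absolute http/https/ftp URLs are accepted, so fragments such as "rdf:about=" or
"xmlns:pdf=" are dropped.

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ProtocolLinkSketch {
  // Only candidates that start with an allowed protocol are kept.
  private static final Pattern URL_WITH_PROTOCOL =
      Pattern.compile("(?:https?|ftp)://[^\\s\"'<>]+", Pattern.CASE_INSENSITIVE);

  public static List<String> extract(String text) {
    List<String> links = new ArrayList<String>();
    Matcher m = URL_WITH_PROTOCOL.matcher(text);
    while (m.find()) {
      links.add(m.group());
    }
    return links;
  }
}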


how to minimize reduce operations when using single machine

2006-10-27 Thread AJ Chen

I use 0.9-dev code and the local file system to crawl on a single machine.
After fetching pages, nutch spends a huge amount of time doing "reduce > sort"
and "reduce > reduce". This is not necessary since it uses only the
local file system.  I'm not familiar with the map-reduce code, but guess it may
be possible to control the number of map and reduce operations.  Is it
possible to configure nutch to break the fetch job into only a few sub-operations
so that there will be only one or a few map and reduce operations?  What setting
or code can be changed to minimize the time spent on map-reduce operations when
crawling with a single machine?

Thanks,
AJ
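
For reference, the knobs in question are ordinary Hadoop job settings; below is a
sketch of pinning them low on a single box. The same properties can be set in
hadoop-site.xml (mapred.map.tasks, mapred.reduce.tasks); the values are only
illustrative and this is not a guaranteed fix.

import org.apache.hadoop.mapred.JobConf;

public class LocalTuningSketch {
  public static void tuneForSingleMachine(JobConf job) {
    job.setNumMapTasks(2);     // fewer, larger map tasks on one machine
    job.setNumReduceTasks(1);  // a single reduce keeps sorting/merging to one pass
  }
}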


need help to speed up map-reduce

2006-11-06 Thread AJ Chen

Sorry for repeating this question. But I have to find a solution, otherwise
the crawling is too slow to be practical.  I'm using nutch 0.9-dev on one
linux server to crawl millions of pages.  The fetching itself is reasonable,
but the map-reduce operations are killing the performance. For example,
fetching takes 10 hours and map-reduce also takes 10 hours, which makes the
overall performance very slow. Can anyone share experience on how to speed
up map-reduce for single-server crawling?  A single server uses the local file
system, so it should spend very little time doing map and reduce, shouldn't
it?

Thanks,
--
AJ Chen, PhD
http://web2express.org


Re: [jira] Resolved: (NUTCH-395) Increase fetching speed

2006-11-13 Thread AJ Chen

Sami,
Thanks for resolving this serious issue.  I just updated my code from trunk
and plan to test fetch speed. But, there is a runtime error related to
switching from UTF8 to Text. Since the error is from hadoop, how do I fix
it?

java.lang.ClassCastException: org.apache.hadoop.io.UTF8
   at org.apache.nutch.crawl.Generator$Selector.map(Generator.java:108)
   at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:46)
   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:213)
   at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java
:105)

Thanks,
AJ


On 11/13/06, Sami Siren (JIRA) <[EMAIL PROTECTED]> wrote:


 [ http://issues.apache.org/jira/browse/NUTCH-395?page=all ]

Sami Siren resolved NUTCH-395.
--

Fix Version/s: 0.9.0
   Resolution: Fixed

applied to trunk with some additional whitespace changes.

> Increase fetching speed
> ---
>
> Key: NUTCH-395
> URL: http://issues.apache.org/jira/browse/NUTCH-395
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher
>Affects Versions: 0.8.1, 0.9.0
>Reporter: Sami Siren
> Assigned To: Sami Siren
> Fix For: 0.9.0
>
> Attachments: nutch-0.8-performance.txt,
NUTCH-395-trunk-metadata-only-2.patch, NUTCH-395-trunk-metadata-only.patch
>
>
> There has been some discussion on nutch mailing lists about the fetcher
being slow; this patch tries to address that. The patch is just a quick hack
and needs some cleaning up; it also currently applies to the 0.8 branch and
not trunk, and it has also not been tested at scale. What does it change?
> Metadata - the original metadata uses spellchecking, new version does
not (a decorator is provided that can do it and it should perhaps be used
where http headers are handled but in most of the cases the functionality is
not required)
> Reading/writing various data structures - patch tries to do io more
efficiently see the patch for details.
> Initial benchmark:
> A small benchmark was done to measure the performance of changes with a
script that basically does the following:
> -inject a list of urls into a fresh crawldb
> -create fetchlist (10k urls pointing to local filesystem)
> -fetch
> -updatedb
> original code from 0.8-branch:
> real 10m51.907s
> user 10m9.914s
> sys 0m21.285s
> after applying the patch
> real 4m15.313s
> user 3m42.598s
> sys 0m18.485s







--
AJ Chen, PhD
http://web2express.org


Re: [jira] Commented: (NUTCH-395) Increase fetching speed

2006-11-22 Thread AJ Chen

I checked out the code from trunk after Sami committed the change. I started
a new crawl db and ran several cycles of crawl sequentially on one linux
server. See below for the real numbers from my test.  The performance is
still poor because the crawler still spends too much time in reduce and
update operations.

#crawl cycle: topN=20
2006-11-17 17:25:27,367 INFO  crawl.Generator - Generator: segment:
crawl/segments/20061117172527
2006-11-17 17:47:45,837 INFO  fetcher.Fetcher - Fetcher: segment:
crawl/segments/20061117172527
# 8 hours fetching ~20 pages
2006-11-18 03:13:31,992 INFO  mapred.LocalJobRunner - 183644 pages, 5506
errors, 5.4 pages/s, 1043 kb/s,
# 4 hours doing "reduce"
2006-11-18 07:30:38,085 INFO  crawl.CrawlDb - CrawlDb update: starting
# 4 hours update db
2006-11-18 11:17:54,000 INFO  crawl.CrawlDb - CrawlDb update: done

#crawl cycle: topN=500,000 pages
2006-11-18 13:22:51,530 INFO  crawl.Generator - Generator: segment:
crawl/segments/20061118132251
2006-11-18 14:50:07,006 INFO  fetcher.Fetcher - Fetcher: segment:
crawl/segments/20061118132251
# fetching for 16 hours
2006-11-19 06:53:34,923 INFO  mapred.LocalJobRunner - 394343 pages, 19050
errors, 6.8 pages/s, 1439 kb/s,
# reduce for 11 hours
2006-11-19 17:49:15,778 INFO  crawl.CrawlDb - CrawlDb update: segment:
crawl/segments/20061118132251
# update db for 10 hours
2006-11-20 03:55:22,882 INFO  crawl.CrawlDb - CrawlDb update: done

#crawl cycle: topN=600,000 pages
2006-11-20 08:14:51,463 INFO  crawl.Generator - Generator: segment:
crawl/segments/20061120081451
2006-11-20 11:31:22,384 INFO  fetcher.Fetcher - Fetcher: segment:
crawl/segments/20061120081451
#fetching for 18 hours
2006-11-21 06:00:08,504 INFO  mapred.LocalJobRunner - 410078 pages, 26316
errors, 6.2 pages/s, 1257 kb/s,
#reduce for 11 hours
2006-11-21 17:26:38,213 INFO  crawl.CrawlDb - CrawlDb update: starting
#update for 13 hours
2006-11-22 06:25:48,592 INFO  crawl.CrawlDb - CrawlDb update: done


-AJ


On 11/13/06, Andrzej Bialecki (JIRA) <[EMAIL PROTECTED]> wrote:


[
http://issues.apache.org/jira/browse/NUTCH-395?page=comments#action_12449292]

Andrzej Bialecki  commented on NUTCH-395:
-

+1 - this patch looks good to me - if you could just fix the whitespace
issues prior to committing, so that it conforms to the coding style ...

> Increase fetching speed
> ---
>
> Key: NUTCH-395
> URL: http://issues.apache.org/jira/browse/NUTCH-395
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher
>Affects Versions: 0.9.0, 0.8.1
>Reporter: Sami Siren
> Assigned To: Sami Siren
> Attachments: nutch-0.8-performance.txt,
NUTCH-395-trunk-metadata-only-2.patch, NUTCH-395-trunk-metadata-only.patch
>
>
> There have been some discussion on nutch mailing lists about fetcher
being slow, this patch tried to address that. the patch is just a quich hack
and needs some cleaning up, it also currently applies to 0.8 branch and
not trunk and it has also not been tested in large. What it changes?
> Metadata - the original metadata uses spellchecking, new version does
not (a decorator is provided that can do it and it should perhaps be used
where http headers are handled but in most of the cases the functionality is
not required)
> Reading/writing various data structures - patch tries to do io more
efficiently see the patch for details.
> Initial benchmark:
> A small benchmark was done to measure the performance of changes with a
script that basically does the following:
> -inject a list of urls into a fresh crawldb
> -create fetchlist (10k urls pointing to local filesystem)
> -fetch
> -updatedb
> original code from 0.8-branch:
> real10m51.907s
> user10m9.914s
> sys 0m21.285s
> after applying the patch
> real4m15.313s
> user3m42.598s
> sys 0m18.485s







--
AJ Chen, PhD
Palo Alto, CA
http://web2express.org


Re: [jira] Commented: (NUTCH-395) Increase fetching speed

2006-11-22 Thread AJ Chen

Linux box, opteron 2Ghz, 2GB RAM, DSL download bandwidth up to 5mbps.

This is a new crawldb, crawling on 4000 selected sites, total ~1 million
pages fetched after last run.

use default regex-urlfilter.txt except for :
-(?i)\.(ai|asf|au|avi|bz2|bin|bmp|c|class|css|dmg|doc|dot|dvi|eps|exe|gif|gz|h|hqx|ico|iso|jar|java|jnlp|jpeg|jpg|lha|md5|mov|
mp3|mp4|mpg|msi|ogg|png|pps|ppt|ps|psd|ram|ris|rm|rpm|rss|rtf|sit|swf|tar|tbz|tbz2|tgz|tif|wav|wmf|wmv|xls|z|zip)\)?$
[EMAIL PROTECTED]

additional filter to limit urls to the selected domains  (hashtable
implementation)

plugins:
protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic

use default org.apache.nutch.net.URLNormalizer

thanks for helping,
AJ


parse only html and text

On 11/22/06, Sami Siren <[EMAIL PROTECTED]> wrote:


What kind of hardware are you running on? Your pages per sec ratio seems
very low to me.

How big was your crawldb when you started and how big was it at end?

What kind of filters and normalizers are you using?

--
  Sami Siren

AJ Chen wrote:
> I checked out the code from trunk after Sami committed the change. I
> started
> out a new crawl db and run several cycles of crawl sequentially on one
> linux
> server. See below for the real numbers from my test.  The performance is
> still poor because the crawler still spend too much time in reduce and
> update operations.
>
> #crawl cycle: topN=20
> 2006-11-17 17:25:27,367 INFO  crawl.Generator - Generator: segment:
> crawl/segments/20061117172527
> 2006-11-17 17:47:45,837 INFO  fetcher.Fetcher - Fetcher: segment:
> crawl/segments/20061117172527
> # 8 hours fetching ~20 pages
> 2006-11-18 03:13:31,992 INFO  mapred.LocalJobRunner - 183644 pages, 5506
> errors, 5.4 pages/s, 1043 kb/s,
> # 4 hours doing "reduce"
> 2006-11-18 07:30:38,085 INFO  crawl.CrawlDb - CrawlDb update: starting
> # 4 hours update db
> 2006-11-18 11:17:54,000 INFO  crawl.CrawlDb - CrawlDb update: done
>
> #crawl sycle: topN=500,000 pages
> 2006-11-18 13:22:51,530 INFO  crawl.Generator - Generator: segment:
> crawl/segments/20061118132251
> 2006-11-18 14:50:07,006 INFO  fetcher.Fetcher - Fetcher: segment:
> crawl/segments/20061118132251
> # fetching for 16 hours
> 2006-11-19 06:53:34,923 INFO  mapred.LocalJobRunner - 394343 pages,
19050
> errors, 6.8 pages/s, 1439 kb/s,
> # reduce for 11 hours
> 2006-11-19 17:49:15,778 INFO  crawl.CrawlDb - CrawlDb update: segment:
> crawl/segments/20061118132251
> # update db for 10 hours
> 2006-11-20 03:55:22,882 INFO  crawl.CrawlDb - CrawlDb update: done
>
> #crawl cycle: topN=600,000 pages
> 2006-11-20 08:14:51,463 INFO  crawl.Generator - Generator: segment:
> crawl/segments/20061120081451
> 2006-11-20 11:31:22,384 INFO  fetcher.Fetcher - Fetcher: segment:
> crawl/segments/20061120081451
> #fetching for 18 hours
> 2006-11-21 06:00:08,504 INFO  mapred.LocalJobRunner - 410078 pages,
26316
> errors, 6.2 pages/s, 1257 kb/s,
> #reduce for 11 hours
> 2006-11-21 17:26:38,213 INFO  crawl.CrawlDb - CrawlDb update: starting
> #update for 13 hours
> 2006-11-22 06:25:48,592 INFO  crawl.CrawlDb - CrawlDb update: done
>
>
> -AJ
>
>
> On 11/13/06, Andrzej Bialecki (JIRA) <[EMAIL PROTECTED]> wrote:
>>
>> [
>>
http://issues.apache.org/jira/browse/NUTCH-395?page=comments#action_12449292
]
>>
>>
>> Andrzej Bialecki  commented on NUTCH-395:
>> -
>>
>> +1 - this patch looks good to me - if you could just fix the whitespace
>> issues prior to committing, so that it conforms to the coding style ...
>>
>> > Increase fetching speed
>> > ---
>> >
>> > Key: NUTCH-395
>> > URL: http://issues.apache.org/jira/browse/NUTCH-395
>> > Project: Nutch
>> >  Issue Type: Improvement
>> >  Components: fetcher
>> >Affects Versions: 0.9.0, 0.8.1
>> >Reporter: Sami Siren
>> > Assigned To: Sami Siren
>> > Attachments: nutch-0.8-performance.txt,
>> NUTCH-395-trunk-metadata-only-2.patch,
>> NUTCH-395-trunk-metadata-only.patch
>> >
>> >
>> > There have been some discussion on nutch mailing lists about fetcher
>> being slow, this patch tried to address that. the patch is just a
>> quich hack
>> and needs some cleaning up, it also currently applies to 0.8 branch and
>> not trunk and it has also not been tested in large. What it changes?
>> > Metadata - the original metadata uses spellchecking, new version does
>> not (a decorator is provided that can do it and it should perhaps be
used

Re: Reviving Nutch 0.7

2007-01-22 Thread AJ Chen

On 1/22/07, Doug Cutting <[EMAIL PROTECTED]> wrote:



Finally, web crawling, indexing and searching are data-intensive.
Before long, users will want to index tens or hundreds of millions of
pages.  Distributed operation is soon required at this scale, and
batch-mode is an order-of-magnitude faster.  So be careful before you
threw those features out: you might want them back soon.

Doug


As a developer building an application on top of Nutch, my experience is that
I can't go back to version 0.7x because the features in version 0.8/0.9 are
so much needed even for non-distributed crawling/indexing. For example, I
can run crawling/indexing on a linux server and a windows laptop separately,
and merge the newly crawled databases into the main crawldb. I remember
v0.7 can't merge separate crawldbs without lots of customization.

It may take some time to switch from 0.7x to v0.8/0.9, especially if you
have lots of customization code. But, once you get over this one hurdle, you
will enjoy the new and better features in the 0.8/0.9 versions.  Also, this may
be the time to re-think the design of your application. For my own project,
I always try to separate my code from the nutch core code as much as possible so
that I can easily upgrade the application to keep up with new nutch releases.
Staying away from the newest nutch version seems backward to me.

AJ
--
AJ Chen, PhD
Palo Alto, CA
http://web2express.org


[jira] Created: (NUTCH-87) Efficient site-specific crawling for a large number of sites

2005-09-02 Thread AJ Chen (JIRA)
Efficient site-specific crawling for a large number of sites


 Key: NUTCH-87
 URL: http://issues.apache.org/jira/browse/NUTCH-87
 Project: Nutch
Type: New Feature
  Components: fetcher  
 Environment: cross-platform
 Reporter: AJ Chen


There is a gap between whole-web crawling and crawling a single site (or a handful
of sites). Many applications actually fall in this gap, and usually require
crawling a large number of selected sites, say 10 domains. The current CrawlTool
is designed for a handful of sites. So, this request calls for a new feature or
improvement on CrawlTool so that the "nutch crawl" command can efficiently deal with
a large number of sites. One requirement is to add or change the smallest amount of
code so that this feature can be implemented sooner rather than later.

There is a discussion about adding a URLFilter to implement this requested 
feature, see the following thread - 
http://www.mail-archive.com/nutch-dev%40lucene.apache.org/msg00726.html
The idea is to use a hashtable in a URLFilter for looking up the regex for any given
domain. A hashtable will be much faster than the list implementation currently used
in RegexURLFilter.  Fortunately, Matt Kangas has implemented such an idea before
for his own application and is willing to make it available for adaptation to
Nutch. I'll be happy to help him in this regard.

But, before we do it, we would like to hear more discussion or comments about
this approach or other approaches. In particular, let us know what the potential
downsides would be of a hashtable lookup in a new URLFilter plugin.

AJ Chen
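
As a sketch of the hashtable idea described above (an illustration only, without the
plugin and configuration plumbing a real Nutch URLFilter needs): keep one set of
allowed hosts and do an O(1) check on each URL's host instead of scanning a long
list of regular expressions.

import java.net.MalformedURLException;
import java.net.URL;
import java.util.HashSet;
import java.util.Set;

public class DomainFilterSketch {
  private final Set<String> allowedHosts = new HashSet<String>();

  public DomainFilterSketch(Iterable<String> hosts) {
    for (String h : hosts) {
      allowedHosts.add(h.toLowerCase());
    }
  }

  // Returns the URL if its host is in the allowed set, null otherwise
  // (the same accept/reject convention a URLFilter uses).
  public String filter(String urlString) {
    try {
      String host = new URL(urlString).getHost().toLowerCase();
      return allowedHosts.contains(host) ? urlString : null;
    } catch (MalformedURLException e) {
      return null;  // unparsable URLs are rejected
    }
  }
}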






[jira] Created: (NUTCH-398) map-reduce very slow when crawling on single server

2006-11-07 Thread AJ Chen (JIRA)
map-reduce very slow when crawling on single server
---

 Key: NUTCH-398
 URL: http://issues.apache.org/jira/browse/NUTCH-398
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 0.8.1
 Environment: linux and windows
Reporter: AJ Chen


This seems to be a bug, so I'm creating a ticket here. I'm using nutch 0.9-dev to
crawl the web on one linux server. With the default hadoop
configuration (local file system, no distributed crawling), the Generator and
Fetcher spend a disproportionate amount of time on map-reduce operations. For
example:
2006-11-01 20:32:44,074 INFO  crawl.Generator - Generator: segment:
crawl/segments/20061101203244
... (doing map and reduce for 2 hours )
2006-11-01 22:28:11,102 INFO  fetcher.Fetcher - Fetcher: segment:
crawl/segments/20061101203244
... (fetching 12 hours )
2006-11-02 11:15:10,590 INFO  mapred.LocalJobRunner - 175383 pages, 16583
errors, 3.8 pages/s, 687 kb/s,
2006-11-02 11:17:24,039 INFO  mapred.LocalJobRunner - reduce > sort
... (but doing reduce>sort and reduce>reduce for 8 hours )
2006-11-02 19:13:38,882 INFO  crawl.CrawlDb - CrawlDb update: segment:
crawl/segments/20061101203244

Since it's crawling on a single machine, such slow map-reduce operations are not
expected.
