refetching interval

2006-04-21 Thread Michael Ji
Hi,

I am using Nutch 0.7 and found the following code in
FetchListTool.java:

private static final long FETCH_GENERATION_DELAY_MS =
7 * 24 * 60 * 60 * 1000;

That seems to mean the next refetch is always scheduled 7 days
later, no matter what fetch interval is set in nutch-site.xml.
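
For context, a minimal sketch contrasting the two values (the property name db.default.fetch.interval and the NutchConf.getInt accessor are assumptions based on Nutch 0.7; verify against your tree):

import org.apache.nutch.util.NutchConf;   // assumed location of NutchConf in 0.7

public class FetchIntervalCheck {
  public static void main(String[] args) {
    // Hard-coded generation delay from FetchListTool.java: 7 days in ms.
    long generationDelayMs = 7L * 24 * 60 * 60 * 1000;

    // Configured refetch interval (in days), read from nutch-default.xml /
    // nutch-site.xml; property name assumed to be db.default.fetch.interval.
    int intervalDays = NutchConf.get().getInt("db.default.fetch.interval", 30);
    long intervalMs = intervalDays * 24L * 60 * 60 * 1000;

    System.out.println("generation delay (ms):    " + generationDelayMs);
    System.out.println("configured interval (ms): " + intervalMs);
  }
}

The constant appears to govern only how soon a URL can be regenerated into a new fetchlist, while the configured interval sets the page's own next-fetch time; treat that reading as an assumption and check FetchListTool.java itself.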

I am puzzled. Could anyone give me a hint?

thanks,

Michael,




compile search.jsp

2006-03-04 Thread Michael Ji

Hi,
 
I made a change to search.jsp under /nutch/src/web/jsp,
hoping it would be reflected in the skin of the Nutch
search page.
 
I ran ant war and replaced ROOT.war in
tomcat/webapps,
 
and I also shut down and restarted Tomcat;
 
but the Nutch search page stays the same, and the
bean.LOG.info output is also unchanged, even though I am
writing new log messages.
 
I wonder if I missed a compilation or deployment step.
 
Thanks for your help,
 
Michael,
 




entrance point of Nutch search page

2006-03-03 Thread Michael Ji
hi,

Which JSP file is the entry point for the Nutch search page?

I saw Nutch using

search(Query query, int numHits, String dedupField,
String sortField, boolean reverse)

to get search results.

But I am not sure which JSP triggers this call.

Is it inside the Tomcat container?
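
For orientation, a rough sketch of the call search.jsp makes in Nutch 0.7; the search signature is the one quoted above, but treat the other class and method names as assumptions and check src/web/jsp/search.jsp in your tree (it assumes the page imports org.apache.nutch.searcher.*):

// Inside search.jsp (JSP scriptlet), roughly:
String queryString = request.getParameter("query");   // user input from the form
NutchBean bean = NutchBean.get(application);           // shared bean kept in the servlet context
Query query = Query.parse(queryString);
Hits hits = bean.search(query, 10, "site", null, false); // dedup by site; null sortField = relevance (assumption)
for (int i = 0; i < hits.getLength(); i++) {
  Hit hit = hits.getHit(i);
  HitDetails details = bean.getDetails(hit);
  // render title, URL and summary for this hit ...
}

So the servlet container (Tomcat) compiles and runs search.jsp, and the JSP in turn drives NutchBean.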

thanks,

Michael,



Re: Halloween Joke at Google

2005-11-02 Thread Michael Ji
hi Byron:

Did you run LinkAnalysisTool to update the scores in the
fetched segment? I believe that gives the most accurate
PageRank-style score; otherwise, in IndexSegment.java Nutch
computes the score from the number of anchor links pointing
to the page.

Michael Ji,

--- Byron Miller [EMAIL PROTECTED] wrote:

 We run with
 
 fetchlist.score.by.link.count=true and
 indexer.boost.by.link.count=true
 
 We haven't run a stand-alone analysis, so it's just how the
 database is updated when we run updatedb (per the
 recommendations a few months back, when the results were
 found to be pretty darn close!)
 
 Even though my scale is still much smaller than Google's,
 it is amazing how closely the results can match!
 
 Makes you wonder just how much of the net is useful ;)
 
 -byron
 
 
 
 --- Andrzej Bialecki [EMAIL PROTECTED] wrote:
 
  Byron Miller wrote:
  
  Actually, to add fuel to the fire, using Nutch out of
  the box, searching for miserable failure yields the
  same thing.
 
  http://www.mozdex.com/search.jsp?query=miserablefailure
  

  
  
  I'm curious... could you check if the anchors come
  from the same site, 
  or from different sites? Do you run with 
  fetchlist.score.by.link.count=true and
  indexer.boost.by.link.count=true?
  
  Anyway, that's how the PageRank is _supposed_ to
  work - it should give a 
  higher score to sites that are highly linked, and
  also it should 
  strongly consider the anchor text as an indication
  of the page's true 
  subject ... ;-)
  
  -- 
  Best regards,
  Andrzej Bialecki 
   ___. ___ ___ ___ _ _  
  __
  [__ || __|__/|__||\/|  Information Retrieval,
  Semantic Web
  ___|||__||  \|  ||  |  Embedded Unix, System
  Integration
  http://www.sigram.com  Contact: info at sigram dot
  com
  
  
  
 
 






searching return 0 hit

2005-10-18 Thread Michael Ji
Somehow my search engine stopped showing results, even
though I can see the index with LukeAll. (It worked fine
before.)

I replaced the ROOT.war file in Tomcat with Nutch's and
launched Tomcat from Nutch's segment directory (parallel
to the index subdirectory).

Should I reinstall Tomcat, or is this a Nutch indexing
issue? My system is running on Linux.

thanks,

Michael Ji,
-

051019 215411 11 query: com
051019 215411 11 searching for 20 raw hits
051019 215411 11 total hits: 0
051019 215449 12 query request from 65.34.213.205
051019 215449 12 query: net
051019 215449 12 searching for 20 raw hits






Re: searching return 0 hit

2005-10-18 Thread Michael Ji
hi Gal:

I do stop Tomcat before indexing and restart it
afterwards.

I am not sure what the first server you mentioned
refers to. The Linux box in my case?

thanks,

Michael Ji,

--- Gal Nitzan [EMAIL PROTECTED] wrote:

 Hi Michael,
 
 At least on my side, every time I run the indexer I must
 stop the server and then Tomcat, and then restart first
 the server and then Tomcat.
 
 I have asked about this twice in this list but
 nobody answered.
 
 I'm not sure it is the same issue, but try it.
 
 Regards,
 
 Gal.
 
 
 Michael Ji wrote:
  Somehow, I found my search engine didn't show the
  result, even I can see the index from LukeAll. (
 It
  works fine before )
 
  I replace ROOT.WAR file in tomcat by nutch's and
  launch tomcat in nutch's segment directory (
 parallel
  to index subdir )
 
  Should I reinstall Tomcat? Or will that be nutch's
  indexing issue? My system is running in Linux. 
 
  thanks,
 
  Michael Ji,
  -
 
  051019 215411 11 query: com
  051019 215411 11 searching for 20 raw hits
  051019 215411 11 total hits: 0
  051019 215449 12 query request from 65.34.213.205
  051019 215449 12 query: net
  051019 215449 12 searching for 20 raw hits
 
 
 
  
 

 
 
 






next score usage

2005-10-14 Thread Michael Ji
hi,

I saw several discussions about the Distributed Link
Analysis Tool before, and I still have a question about
the usage of the nextScore field in the Page data
structure.

It seems the Distributed Link Analysis Tool updates this
field via OutlinkWithTarget (as I understand it, that
means the link has a target page).

But I didn't see how the nextScore field is used in the
search results, because in IndexSegment.java only the
score field of Page is used to generate the boost value of
a document when indexing a segment, and that is what
affects the search rank of a document in the Lucene index.

thanks,

Michael Ji,






Document Duplication for Multiple Segment Merge

2005-10-14 Thread Michael Ji
hi,

When Nutch's IndexMerger.java is called, the indexes
from multiple segment directories are merged into one
target directory.

I wonder how Lucene handles the case where identical
documents exist in two segments. Is the older document
(lower timestamp) deleted?

thanks,

Michael Ji,






Re: Document Duplication for Multiple Segment Merge

2005-10-14 Thread Michael Ji
hi Yonik:

Does that mean that when two documents have the same MD5
content in two different segments, IndexMerger.java will
keep both of them?

When I look at the code of IndexSegment.java, it handles
MD5 deduplication by keeping the one with the higher
document ID.

So, when refetching happens, the old segment should be
discarded entirely, and a strategy must ensure that each
segment corresponds to a fetchlist with the same refetch
interval. Is that how Nutch handles refetching?


Michael Ji,

--- Yonik Seeley [EMAIL PROTECTED] wrote:

 There is no concept in Lucene of document identity linked
 to any fields of a document.
 You need to handle removal of duplicates yourself.
 
 -Yonik
 Now hiring -- http://tinyurl.com/7m67g
 
 
 On 10/14/05, Michael Ji [EMAIL PROTECTED] wrote:
 
  hi,
 
  When Nutch's IndexMerger.java is called, the
 indexes
  from multiple segment directories are merged to
 one
  target directory.
 
  I wonder how lucene deals with the case when
 identical
  documents existing in two segments. Is the older
  document ( lower time stamp ) deleted?
 
  thanks,
 
  Michael Ji,
 
 
 
 
 
 







Re: Document Duplication for Multiple Segment Merge

2005-10-14 Thread Michael Ji
Sorry, I guess I pointed to the wrong Java class name.

I want to confirm whether SegmentMerger.java in Lucene
does dedup or not. I traced through a couple of classes
starting from SegmentMerger.java, such as
SegmentReader.java and IndexWriter.java, and I didn't see
a dedup mechanism.
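
In case it helps, a minimal sketch of doing the deduplication yourself on top of the Lucene API of that era; the "digest" field name is purely hypothetical, and Nutch's own dedup lives in its DeleteDuplicates tool rather than in Lucene:

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;

// Hypothetical example: remove every document whose (assumed) "digest"
// field carries a given MD5 hash. Lucene itself never does this for you.
public class DedupSketch {
  public static int deleteByHash(String indexDir, String md5) throws Exception {
    IndexReader reader = IndexReader.open(indexDir);
    try {
      // delete(Term) removes all documents containing the term and
      // returns how many were deleted (Lucene 1.4-era API).
      return reader.delete(new Term("digest", md5));
    } finally {
      reader.close();
    }
  }
}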

thanks,

Michael Ji,

--- Yonik Seeley [EMAIL PROTECTED] wrote:

 Sorry, I've only briefly looked at Nutch, so you should
 ask on that mailing list.
 Lucene doesn't do deduping.
 
 
 -Yonik
 Now hiring -- http://tinyurl.com/7m67g
 
 On 10/14/05, Michael Ji [EMAIL PROTECTED] wrote:
 
  hi Yonik:
 
  Does that mean when two documents has same MD5
 content
  in two different segments, IndexMerger.java will
 keep
  both of them?
 
  When I look at the code of IndexSegment.java, it
  handle MD5 dedupling by keeping the one with
 higher
  document ID.
 
 








Re: crawl db stats

2005-10-14 Thread Michael Ji
Use DBAdminTool to dump the webdb; you will get the whole
list of Pages in text format.

Michael Ji,

--- Stefan Groschupf [EMAIL PROTECTED] wrote:

 Hi,
 is there any chance to read the statistics of the
 nutch 0.8 crawl db  
 or a trick to get an idea of how many pages are
 already crawled?
 Thanks for the hints.
 Stefan
 
 






Re: crawl db stats

2005-10-14 Thread Michael Ji
Or you can use segread in bin/nutch to dump a newly
fetched segment and see which pages it fetched.

Michael Ji,

--- Stefan Groschupf [EMAIL PROTECTED] wrote:

 Which class do you mean?
 There is the old webdbadmin tool, but I guess this will
 not work for the new crawl db.
 The bin/nutch admin command isn't supported anymore.
 Thanks
 Stefan
 
 
 On 15.10.2005 at 00:21, Michael Ji wrote:
 
  using DBAdminTool to dump the webdb and you can
 get
  whole list of Pages in text format,
 
  Michael Ji,
 
  --- Stefan Groschupf [EMAIL PROTECTED] wrote:
 
 
  Hi,
  is there any chance to read the statistics of the
  nutch 0.8 crawl db
  or a trick to get an idea of how many pages are
  already crawled?
  Thanks for the hints.
  Stefan
 
 
 
 
 
 
 
 
 
 
 






Re: How can I unsubscribe from the mailing list?

2005-10-02 Thread Michael Ji
http://lucene.apache.org/nutch/mailing_lists.html

--- [EMAIL PROTECTED] wrote:

 Does any body know how I can unsubscribe from this
 mailing list?
  Thanks,
 Nima
 






RE: what contributes to fetch slowing down

2005-10-02 Thread Michael Ji
Kelvin's OC implementation queues fetch requests by host
and uses the HTTP 1.1 protocol. It is currently a Nutch
patch.

Michael Ji,
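
For reference, a minimal illustration of the two java.net.Socket options discussed in the quoted message below; this is plain J2SE API, and the key point is that setReuseAddress only matters if it is set before the socket is bound or connected:

import java.net.InetSocketAddress;
import java.net.Socket;

public class SocketOptionsSketch {
  public static Socket open(String host, int port) throws Exception {
    Socket s = new Socket();
    // Must be set before the socket is bound/connected to have any effect.
    s.setReuseAddress(true);
    s.connect(new InetSocketAddress(host, port), 10000); // 10 s connect timeout
    // Read timeout for subsequent reads on this socket (milliseconds).
    s.setSoTimeout(10000);
    return s;
  }
}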

--- Fuad Efendi [EMAIL PROTECTED] wrote:

 Some suggestion to improve performance:
 
 
 1. Decrease randomization of FetchList.
  
 Here is comment from FetchListTool:
 /**
  * The TableSet class will allocate a given FetchListEntry
  * into one of several ArrayFiles.  It chooses which
  * ArrayFile based on a hash of the URL's domain name.
  *
  * It uses a hash of the domain name so that pages are
  * allocated to a random ArrayFile, but same-host pages
  * go to the same file (for efficiency purposes during
  * fetch).
  *
  * Further, within a given file, the FetchListEntry items
  * appear in random order.  This is so that we don't
  * hammer the same site over and over again during fetch.
  *
  * Each table should receive a roughly even number of
  * entries, but all URLs for a specific domain name will
  * be found in a single table.  If the dataset is weirdly
  * skewed toward large domains, there may be an uneven
  * distribution.
  */
 
 Same-host pages go to the same file - they should go in a
 sequence, without mixing/randomizing with other hosts'
 pages...
 
 We fetch a single URL, then we forget about the existence
 of this TCP/IP connection; we even forget that the web
 server created a client process to handle our HTTP
 requests - this is what Keep-Alive is for. Creating a TCP
 connection, and additionally creating such a client
 process on a web server, costs a lot of CPU on both
 sides, Nutch and the web server.
 
 I suggest using a single Keep-Alive thread to fetch a
 single host, without randomization.
 
 
 2. Use/investigate more features of the Socket API, such as
 public void setSoTimeout(int timeout)
 public void setReuseAddress(boolean on)
 
 I found this in the J2SE API for setReuseAddress
 (default: false):
 =
 When a TCP connection is closed the connection may remain
 in a timeout state for a period of time after the
 connection is closed (typically known as the TIME_WAIT
 state or 2MSL wait state). For applications using a well
 known socket address or port it may not be possible to
 bind a socket to the required SocketAddress if there is a
 connection in the timeout state involving the socket
 address or port.
 =
 
 It probably means that we reach a huge number (65000!) of
 waiting TCP ports after Socket.close(), and fetcher
 threads are blocked by the OS, waiting for it to release
 some of these ports... Am I right?
 
 
 P.S.
 Anyway, using the Keep-Alive option is very important not
 only for us but also for production web sites.
 
 Thanks,
 Fuad
 
 
 
 
 
 -Original Message-
 From: Fuad Efendi [mailto:[EMAIL PROTECTED] 
 Sent: Friday, September 30, 2005 10:58 PM
 To: nutch-dev@lucene.apache.org; [EMAIL PROTECTED]
 Subject: RE: what contributes to fetch slowing down
 
 
 Dear Nutchers,
 
 
 I noticed the same problem twice, with a Pentium Mobile
 2 GHz / Windows XP / 2 GB box, and with a
 2 x Opteron 252 / SUSE Linux / 4 GB box.
 
 I have only one explanation, which should probably be
 mirrored in JIRA:
 
 
 
 Network.
 
 
 
 1.
 I never had such a problem with The Grinder,
 http://grinder.sourceforge.net, which is based on the
 alternate HTTPClient
 http://www.innovation.ch/java/HTTPClient/index.html.
 Apache SF should really review their HttpClient RC3(!!!)
 accordingly; HTTPClient (upper-HTTP-case) is not alpha, it
 is a production version... I used Grinder a lot; it allows
 executing 32 processes with 64 threads each on 2048 MB of
 RAM...
 
 
 2.
 I found this in the Sun API:
 java.net.Socket
 public void setReuseAddress(boolean on) - please check the API!!!
 
 
 3.
 I saw this code in your protocol-http:
 ... HTTP/1.0 ...
 Why? Why version 1.0??? It should understand the server's
 replies such as Connection: close / Connection: keep-alive
 etc.
 
 
 4.
 By the way, how many file descriptors does UNIX need in
 order to maintain 65536 network sockets?
 
 
 Respectfully,
 Fuad
 
 P.S.
 Sorry guys, I don't have enough time to participate...
 Could you please test this suspicious behaviour and very
 strange opinion? Should I create a new bug report in JIRA?
 
 Sun's Socket, Apache's HttpClient, UNIX's networking...
 
 
 
 
 -Original Message-
 From: Daniele Menozzi [mailto:[EMAIL PROTECTED] 
 Sent: Wednesday, September 28, 2005 4:42 PM
 To: nutch-dev@lucene.apache.org
 Subject: Re: what contributes to fetch slowing down
 
 
 On  10:27:55 28/Sep , AJ Chen wrote:
  I started the crawler with about 2000 sites.  The
 fetcher could
  achieve
  7 pages/sec initially, but the performance
 gradually dropped to about
 2 
  pages/sec, sometimes even 0.5 pages/sec.  The
 fetch list had 300k
 pages 
  and I used 500 threads. What are the main causes
 of this slowing down?
 
 
 I have the same problem; I've tried with different
 numbers of fetchers (10,20,50,100

possibility of adding a customized data field in the nutch Page Class

2005-09-17 Thread Michael Ji
hi there,

I am trying to add a new data field to the Page class, a
simple String.

I followed the URL field in the Page class as a template,
but when I run WebDBInjector it gives me the following
error message. It seems readFields() is not reading from
the right position.

I wonder whether it is feasible to make this change in
the Page class, since as I understand it the Nutch webdb
has a fairly involved structure and set of operations.
From an OO view, all the Page fields should be accessed
through the Page class interface, but I have just hit
something weird.
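
For what it's worth, the usual cause of an EOFException like the one below is that write() and readFields() have fallen out of sync. A minimal sketch of the symmetric change; the field name extraTag is hypothetical, and real Page serialization also involves a version byte and the existing fields, all omitted here:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

// Hypothetical fragment: whatever write() emits, readFields() must
// consume in exactly the same order and format.
public class PageFieldSketch {
  private String extraTag = "";

  public void write(DataOutput out) throws IOException {
    // ... existing fields written first, in their original order ...
    out.writeUTF(extraTag);     // new field written last
  }

  public void readFields(DataInput in) throws IOException {
    // ... existing fields read first, in the same order ...
    extraTag = in.readUTF();    // and read back last, symmetrically
  }
}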

thanks,

Michael Ji,

- 


Exception in thread "main" java.io.EOFException
at
java.io.DataInputStream.readUnsignedShort(DataInputStream.java:310)
at org.apache.nutch.io.UTF8.readFields(UTF8.java:101)
at org.apache.nutch.db.Page.readFields(Page.java:146)
at
org.apache.nutch.io.SequenceFile$Reader.next(SequenceFile.java:278)
at
org.apache.nutch.io.MapFile$Reader.next(MapFile.java:349)
at
org.apache.nutch.db.WebDBWriter$PagesByURLProcessor.mergeEdits(WebDBWriter.java:618)
at
org.apache.nutch.db.WebDBWriter$CloseProcessor.closeDown(WebDBWriter.java:557)
at
org.apache.nutch.db.WebDBWriter.close(WebDBWriter.java:1544)
at
org.apache.nutch.db.WebDBInjector.close(WebDBInjector.java:336)
at
org.apache.nutch.db.WebDBInjector.main(WebDBInjector.java:581)






Re: Index Infos

2005-09-17 Thread Michael Ji
There is a book called Lucene in Action that has been
published.

Michael Ji

--- Daniele Menozzi [EMAIL PROTECTED] wrote:

 Hi all, can you please point me to a document that
 describes the indexing step? I haven't found anything in
 the wiki, and I do not understand very well what
 happens...
 
 Thank you again!!!
   Menoz
 
 -- 
 Free Software Enthusiast
Debian Powered Linux User #332564 
http://menoz.homelinux.org
 






Re: Problems on Crawling

2005-09-16 Thread Michael Ji
Take a look at this good Nutch doc:

http://wiki.apache.org/nutch/DissectingTheNutchCrawler

Michael Ji

--- Daniele Menozzi [EMAIL PROTECTED] wrote:

 Hi all, I have questions regarding
 org.apache.nutch.tools.CrawlTool: I have not really
 understood the relationship between depth, segments and
 fetching.
 Take the tutorial for example; I understand these 2
 steps:
 
   bin/nutch admin db -create
   bin/nutch inject db -dmozfile content.rdf.u8
 -subset 3000
 
 but, when I do this:
 
   bin/nutch generate db segments
 
 what happens? I think a dir called 'segments' is created,
 and inside it I can find the links I previously injected.
 OK. Next steps:
   
   bin/nutch fetch $s1 
   bin/nutch updatedb db $s1 
 
 OK, no problems here.
 But now I cannot understand what happens with this
 command:
 
   bin/nutch generate db segments
 
 It is the same command as above, but now I haven't
 injected anything into the DB; it only contains the pages
 I've previously fetched.
 So, does it mean that when I generate a segment, it will
 automagically be filled with links found in the fetched
 pages? And where are these links saved? And who saves
 these links?
 
 Thank you so much, this work is really interesting!
   Menoz
 
 -- 
 Free Software Enthusiast
Debian Powered Linux User #332564 
http://menoz.homelinux.org
 




Re: Nutch 6.1 running issue

2005-09-11 Thread Michael Ji
hi Andrzej:

Thanks for your correction. The patch compiles
successfully and runs well on Nutch 0.7.

Just a curious question:

As stated in NUTCH-61:
"...if content is unmodified it doesn't have to be
fetched and processed..."

I tested refetching a page whose content had not been
modified, and the NUTCH-61 build DID parse this page into
content/, parse_data/, and parse_text/.

I took a look at the code:

In Fetcher.java:

ProtocolOutput output =
  protocol.getProtocolOutput(fle);
ProtocolStatus pstat = output.getStatus();
:
switch (pstat) {
:
:
case ProtocolStatus.NOTMODIFIED:
  handleFetch(fle, output);
  break;
:
:
}


Should we just do nothing in the NOTMODIFIED case, which
is the status set when the content MD5 equals the page MD5
in the http protocol code?

handleFetch() actually parses the content and writes the
output data structures to segments/.

Thanks,

Michael Ji,





--- Andrzej Bialecki [EMAIL PROTECTED] wrote:

 Michael Ji wrote:
  
  FetchListEntry value = new FetchListEntry();
  Page page = (Page)value.getPage().clone();
 
  value seems to be an empty FetchListEntry instance. Will
  that cause the getPage().clone() call to fail because it
  is null?
 
 Please try to replace this logic with the following:
 
 FetchListEntry value = new FetchListEntry();
 while (topN > 0 && reader.next(key, value)) {
   Page page = value.getPage();
   if (page != null) {
     Page p = new Page();
     p.set(page);
     page = p;
   }
   if (forceRefetch) {
     Page p = value.getPage();
     // reset fetchTime and MD5, so that the content will
     // always be new and unique.
     p.setNextFetchTime(0L);
     p.setMD5(MD5Hash.digest(p.getURL().toString()));
   }
   tables.append(value);
   topN--;
 
 
 This patchset still needs a lot of thought and work.
 Even the part that 
 avoids re-fetching unmodified content needs
 additional thinking - it's 
 easy to end up in a state, where Nutch cannot be
 forced to re-fetch the 
 page because every time you try it remains
 unmodified - but you need 
 refetching the actual data because e.g. you lost
 that segment data...
 
 -- 
 Best regards,
 Andrzej Bialecki 
   ___. ___ ___ ___ _ _  
 __
 [__ || __|__/|__||\/|  Information Retrieval,
 Semantic Web
 ___|||__||  \|  ||  |  Embedded Unix, System
 Integration
 http://www.sigram.com  Contact: info at sigram dot
 com
 
 




Nutch-87 Setup

2005-09-10 Thread Michael Ji
hi Matt:

Your NUTCH-87 patch is a good idea, and I believe it
provides a solution for a good-sized controlled domain,
say hundreds of thousands of sites.

I am currently trying to apply it to Nutch 0.7.

I have several questions I would like to clarify:

1)
Should I create two plug-in classes in Nutch:
one for WhitelistURLFilter and
one for WhitelistWriter?

2)
I found that Whitelist.java refers to
import epile.util.LogLevel;

and WhitelistURLFilter.java refers to
import epile.crawl.util.StringURL;
import epile.util.LogLevel;

Do these packages exist in the Nutch lib? If not, should
we add a new epile*.jar?

3)
If we want to use NUTCH-87, do we need to change the Nutch
core code?

I plan to replace RegexURLFilter with WhitelistURLFilter
everywhere it appears.

Is that the right approach?

thanks,

Michael Ji,







nutch excerpt

2005-09-05 Thread Michael Ji
hi,

There is a Summarizer class that accepts a text string
and generates a summary for a hit's page.

My question is:

1)
Where does the raw text string come from? Is it read from
the db files in segment/content/?

2)
I guess some code must call Summarizer.main to trigger it,
but I couldn't find which code does that.

thanks,

Michael Ji



link analysis in OC

2005-09-05 Thread Michael Ji
hi Kelvin:

Does OC compute page scores the same way Nutch's crawl does?

I found that Nutch's indexing computes the document boost
value from the score/anchor data in the segment/fetchlist
data structure.

I guess OC won't generate this boost score by itself or
use its own data structure. So if we want this score saved
in the Lucene index, we need to use Nutch's generate step
to produce the fetchlist and build the webdb.

That means OC has to live alongside Nutch's webdb and
other data structures.

Is my thought right?

thanks,

Michael Ji



Re: bot-traps and refetching

2005-08-30 Thread Michael Ji
hi Kelvin:

I believe my previous email about further concerns around
controlled crawling confused you a bit because of my
immature thoughts. But I believe controlled crawling is
generally very important for an efficient vertical
crawling application.

After reviewing our previous discussion, I think the
solutions for bot-traps and refetching in OC might be
combined into one.

1) Refetching will look at the FetcherOutput of the last
run and queue the URLs by domain name (for the HTTP 1.1
protocol), as your FetcherThread does.

2) We might simply count the number of URLs within the
same domain (on the fly, as we queue?). If that number
goes over a certain threshold, we stop adding new URLs for
that domain - equivalent to controlled crawling, but
limiting breadth rather than depth.

Will it work as proposed?
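
For illustration, a minimal sketch of the per-host cap described in point 2; the helper is hypothetical and not part of OC or Nutch:

import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: cap how many URLs per host are admitted to the queue.
public class HostCapSketch {
  private final Map<String, Integer> counts = new HashMap<String, Integer>();
  private final int maxPerHost;

  public HostCapSketch(int maxPerHost) {
    this.maxPerHost = maxPerHost;
  }

  /** Returns true if the URL's host is still under its quota. */
  public boolean admit(String host) {
    Integer n = counts.get(host);
    int next = (n == null) ? 1 : n.intValue() + 1;
    if (next > maxPerHost) {
      return false;             // over threshold: stop adding URLs for this host
    }
    counts.put(host, Integer.valueOf(next));
    return true;
  }
}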

thanks, 

Michael Ji,


--- Kelvin Tan [EMAIL PROTECTED] wrote:

 Michael,
 
 On Sun, 28 Aug 2005 07:31:06 -0700 (PDT), Michael Ji
 wrote:
  Hi Kelvin:
 
  2) refetching
 
  If OC's fetchlist is online (memory residence),
 the next time
  refetch we have to restart from seeds.txt once
 again. Is it right?
 
 
 Maybe with the current implementation. But if you
 Implement a CrawlSeedSource that reads in the
 FetcherOutput directory in the Nutch segment, then
 you can seed a crawl using what's already been
 fetched.
 
 







Re: Fetcher for constrained crawls

2005-08-28 Thread Michael Ji
Hi Jeremy:

1)
I guess the solution/patch provided by Kelvin tries to
improve site fetching performance in several ways.

One of these is using HTTP 1.1 features. His crawler works
site by site - a sequence of URLs with the same host. See
his write-up at
http://www.supermind.org/index.php?cat=17


2)
I think your approach builds on the existing Nutch design
with minimal data structure changes in the webdb.

I am running tests on Kelvin's patch now. Could you
provide more detail about your patch so that I can test it
as well?

thanks,

Michael Ji

--- Jeremy Calvert [EMAIL PROTECTED] wrote:

 Like Kelvin, I too have been trying to get limited crawl
 capabilities out of Nutch.
 
 I've come up with a simplistic approach.  I'm afraid I
 haven't had time to try out Kelvin's approach.
 
 I extend Page to store a depth and a radius byte.
 Loosely speaking, depth is the distance you can hop
 within a given site (based on domainID), and radius is
 the distance you can hop once you've left the site.
 
 You set these when you inject seed URLs.
 
 When you create new pages from outgoing links, you call
 
 linkedPage.propagateDepthAndRadius(pageWithOutgoingLink)
 
 where:
 /**
  * @param incoming The pointing page.
  */
 public void propagateDepthAndRadius(Page incoming) {
   boolean sameSite = false;
   try {
     sameSite = this.computeDomainID() == incoming.computeDomainID();
   } catch (MalformedURLException e) {
     // oh well, I guess they're different domains.
   }
   if (sameSite && incoming.depth > 0) {
     // same site: decrement depth, maintain radius
     this.depth = (byte) (incoming.depth - 1);
     this.radius = incoming.radius;
   } else {
     // different sites or out of depth: decrement radius
     this.depth = 0;
     this.radius = (byte) (incoming.radius - 1);
   }
 }
 
 If the page already exists when you go to add it to the
 DB (with instruction ADD_PAGE_IFN_PRESENT), you take the
 max of the existing depth and radius with the newly
 assigned depth and radius.
 
 The overall code modifications are about 30 lines...
 small additions to WebDBWriter and Page.
 
 From there, it's fun and handy to have depth and radius
 at your disposal when creating the fetchlist.  I've
 written a new FetchListTool to make use of them, to keep
 out things that are at the end of their constraints and
 to prioritize pages to fetch.  I also perturb the
 priorities slightly, by 0.001%, so that if I do have
 enough domains to prevent my fetches from piling up on a
 single host, I generally do.
 
 Impacts:
 WebDBWriter (12 lines)
 Page (~20 lines)
 Requires new or modified FetchList tool.
 
 It's a simple and elegant solution for constrained
 crawls, but it does touch the WebDB.  I'm interested to
 hear people's thoughts, and would be more than happy to
 contribute a patch.
 
 J
 




bot-traps and refetching

2005-08-28 Thread Michael Ji
Hi Kelvin:

1) bot-traps problem for OC

If we have a crawling depth for each starting host, it
seems the crawl will eventually terminate (we can
decrement the depth value each time an outlink falls
within the same host's domain).

Let me know if my thought is wrong.

2) refetching

If OC's fetchlist is kept online (memory-resident), then
the next time we refetch we have to restart from seeds.txt
again. Is that right?

3) page content checking

In the OC API, I found WebDBContentSeenFilter, which uses
the Nutch webdb data structure to check whether fetched
page content has been seen before. That means we have to
use Nutch to create a webdb (maybe via nutch/updatedb) in
order to support this function. Is that right?

thanks,

Michael,








crawling ability of NUTCH-84

2005-08-28 Thread Michael Ji
hi Kelvin:

Just a curious question.

As I understand it, the goal for Nutch's global crawling
ability is to reach 10 billion pages, based on the
MapReduce implementation.

OC seems to fall in the middle: it is for controlled,
industry-domain crawling. How many sites is it aiming for?
A couple of thousand sites?

I believe the key requirement for industry-domain
crawling is timely updating, so checking the content of a
fetched page and saving post-parsing time is critical.
thanks,

Michael Ji,






Re: junit test failed

2005-08-28 Thread Michael Ji
What is the JUnit test for? A particular patch?

Sorry if my question is silly.

Michael Ji,

--- AJ Chen [EMAIL PROTECTED] wrote:

 I'm a new comer, trying to test Nutch for vertical
 search. I downloaded 
 the code and compiled it in cygwin. But, the unit
 test failed with the 
 following message:
 
 test-core:
[delete] Deleting directory
 nutch\trunk\build\test\data
 [mkdir] Created dir: nutch\trunk\build\test\data
 
 BUILD FAILED
 nutch\trunk\build.xml:173: Could not create task or
 type of type: junit.
 
 Did I miss anything for junit? Appreciate your help.
 
 
 AJ Chen
 
 
 




Re: newbie questions

2005-08-25 Thread Michael Ji
I guess you need to delete the doc from the Lucene index
so it won't be hit;

you can use Luke (the lukeall admin tool for Lucene
indexes) to inspect and manipulate the indexed content.

Michael Ji,

--- haipeng du [EMAIL PROTECTED] wrote:

 Does Lucene have a way to delete a document via the index
 writer? And how can Lucene search for documents that have
 the value something without worrying about which field
 name carries it? For example: if one document has field
 test1 with value something and another has field test2
 with value something, both should be hit.
 Thanks a lot.
 -- 
 Haipeng Du
 Software Engineer
 Comphealth, 
 Salt Lake City
 

 
 





Re: newbie questions

2005-08-25 Thread Michael Ji
I guess so, but I haven't tried it before.

Michael Ji,

--- haipeng du [EMAIL PROTECTED] wrote:

 yes, that is right. Could I do that from Lucene API?
 Thanks a lot.
 
 On 8/25/05, Michael Ji [EMAIL PROTECTED] wrote:
  I guess, you need to delete a doc from lucene
 search
  engine to avoid to be hit it;
  
  you can use LuakeAll (an admin tool for lucene
  indexing) to manupilate the indexed content;
  
  Michael Ji,
  
  --- haipeng du [EMAIL PROTECTED] wrote:
  
   could lucene have a way to delete a document
 from
   index writer? how
   could lucene to search documents that have value
   something and do
   not need to worry about what kind of field names
   with it. For example:
   if document has field name: test1 with value
   something and another
   one has field name: test2 with value something
   should both be
   hitted.
   Thanks a lot.
   --
   Haipeng Du
   Software Engineer
   Comphealth,
   Salt Lake City
  
  
 

  
  
  
  
  
 

  
  
 
 
 -- 
 Haipeng Du
 Software Engineer
 Comphealth, 
 Salt Lake City
 

 
 





dumping lucene index to text file

2005-08-21 Thread Michael Ji
hi,

I wonder if I can dump the content of the individual
files in the index directory to a text format, i.e. see
the text stored in the index files.

I saw that Plucene has this ability, but I found it only
supports indexes generated by an old Lucene version.

Could you kindly provide any suggestions?
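
For reference, a minimal sketch of dumping stored fields with the Lucene API itself; it only recovers stored fields, and the exact field names present depend on how the index was built:

import java.util.Enumeration;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;

// Prints the stored fields of every live document in an index directory.
public class DumpIndex {
  public static void main(String[] args) throws Exception {
    IndexReader reader = IndexReader.open(args[0]);
    for (int i = 0; i < reader.maxDoc(); i++) {
      if (reader.isDeleted(i)) continue;      // skip deleted docs
      Document doc = reader.document(i);      // only *stored* fields come back
      for (Enumeration e = doc.fields(); e.hasMoreElements();) {
        Field f = (Field) e.nextElement();
        System.out.println(i + "\t" + f.name() + "\t" + f.stringValue());
      }
    }
    reader.close();
  }
}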

thanks,

Michael Ji





crawl-urlfilter.txt mechanics

2005-08-21 Thread Michael Ji

Hi,

When I use intranet crawling, i.e. call
bin/nutch crawl ..., crawl-urlfilter.txt works - it
filters out the URLs that do not match the domains I
included.

Actually, when I take a look at CrawlTool.java, the
config files are read into Java properties via
'NutchConf.get().addConfResource("crawl-tool.xml")'.

But:

When I call each step explicitly myself, such as:

Loop
   generate segment
   fetch
   updateDB

crawl-urlfilter.txt doesn't take effect.

My question is:

1) If I want to control the crawler's behavior in the
second case, should I call 'NutchConf.get()...' myself?

2) Where exactly does the URL filter apply? In the
fetcher? And after being loaded from the .xml and .txt
files, is all the configuration data kept in properties
for the lifetime of the Nutch process?

thanks,

Michael Ji




MD5 in fetchlist / fetcher

2005-08-19 Thread Michael Ji
hi there,

I dumped the contents of segment/fetchlist and
segment/fetcher.

My question is: why isn't the MD5 signature of the page
content saved in the fetchlist?

In my mind, it would save CPU time when a page is
unchanged, because we could skip the parsing step. If we
had the MD5 in the fetchlist, we could do the comparison
directly in memory; with the MD5 only in the fetcher
output, we have to look it up in a local file in order to
compare it with the MD5 of the newly fetched content.
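
To make the idea concrete, a minimal sketch of that comparison; MD5Hash is the Nutch hash class quoted elsewhere in this archive, but the byte[] overload of digest and the surrounding method are assumptions:

import org.apache.nutch.io.MD5Hash;

// Hypothetical fragment: skip parsing when the freshly fetched content
// hashes to the same MD5 we already have for the page.
public class SkipUnchangedSketch {
  public static boolean shouldParse(byte[] fetchedContent, MD5Hash previousMD5) {
    MD5Hash newMD5 = MD5Hash.digest(fetchedContent);
    return !newMD5.equals(previousMD5);   // parse only when the content changed
  }
}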

Did I miss something important, or is my dump wrong?

thanks,

Michael Ji 

fetchlist
fetch: true
page: Version: 4
URL: http://www.sina.com/
ID: d6a83e9c17e05d5602709a63c241bf68
Next fetch: Sun Aug 21 20:15:06 CDT 2005
Retries since fetch: 0
Retry interval: 30 days
Num outlinks: 0
Score: 1.0
NextScore: 1.0

anchors: 0

fetcher
fetch: true
page: Version: 4
URL: http://www.sina.com/
ID: d6a83e9c17e05d5602709a63c241bf68
Next fetch: Sun Aug 21 20:15:06 CDT 2005
Retries since fetch: 0
Retry interval: 30 days
Num outlinks: 0
Score: 1.0
NextScore: 1.0

anchors: 0
Fetch Result:
MD5Hash: 56eae3c2556cb10a00e7346738dcb318
ProtocolStatus: success(1), lastModified=0
FetchDate: Sun Aug 14 20:15:13 CDT 2005






Re: Parse-html should be enhanced!

2005-08-18 Thread Michael Ji
Would an extension from an existing extension point be a solution?

Our ongoing project also needs to handle site-specific
crawling cases. We are thinking about extending the
current Java classes to fit our usage.

Michael Ji,

--- Jack Tang [EMAIL PROTECTED] wrote:

 Hi Nutchers
 
 I think the parse-html parser should be enhanced. In some
 of my projects (intranet search engines), we only need
 the content matched by specified detectors and want to
 filter out the junk - say, the content between
 <div class=start-here> and </div>, or detectors like
 XPath expressions. Any thoughts on this enhancement?
 
 Regards
 /Jack
 -- 
 Keep Discovering ... ...
 http://www.jroller.com/page/jmars
 




Merge Lucene to Nutch

2005-08-17 Thread Michael Ji
As I understand it, Nutch is a crawling/searching
application based on Lucene.

Just a curious question: when Lucene has a new
version/release, how is it merged into Nutch?

I didn't see explicit Lucene Java source in the Nutch
source tree, and I don't think Nutch implements the
low-level API independently of Lucene.

Thanks,

Michael Ji



Re: How to extend Nutch?

2005-08-10 Thread Michael Ji
hi Fuad:

I am probably doing the same thing. I think a plugin is
the right place to put my own code.

But I am not sure why we need to touch the other config files.

Regards,

Michael Ji
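
For illustration, a minimal sketch of the Lucene side of such an extension - adding an extra field to the Document before it is indexed. The surrounding IndexingFilter plugin interface and its plugin.xml wiring are Nutch-version-specific and omitted, and "mytag" is a hypothetical field name:

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

// Hypothetical helper: whatever the plugin interface looks like in your
// Nutch version, its body boils down to adding fields like this.
public class ExtraFieldSketch {
  public static Document addExtraField(Document doc, String value) {
    // Field.Text: stored and indexed, so the value is both searchable
    // and retrievable for display on the results page.
    doc.add(Field.Text("mytag", value));
    return doc;
  }
}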

--- Fuad Efendi [EMAIL PROTECTED] wrote:

 
 I need some pre-processing to add additional fields to
 the Document, and to show them on a web page.
 I probably need to work with plugins, and to modify the
 config files...
 
 nutch-conf.xsl
 nutch-default.xml
 nutch-site.xml
 
 Am I right? 
 Thanks
 
 
 -Original Message-
 From: Fuad Efendi [mailto:[EMAIL PROTECTED] 
 Sent: Wednesday, August 10, 2005 2:15 PM
 To: nutch-user@lucene.apache.org
 Subject: RE: [Nutch-general] How to extend Nutch
 
 
 So, I need to modify some existing classes, don't I?
 
 
 -Original Message-
 From: [EMAIL PROTECTED]
 [mailto:[EMAIL PROTECTED] 
 Sent: Wednesday, August 10, 2005 1:48 PM
 To: [EMAIL PROTECTED]
 Subject: Re: [Nutch-general] How to extend Nutch
 
 
 Probably IndexingFilter or HtmlParser for indexing, and
 for searching I think there is something in
 org.apache.nutch.search - some class that starts with
 Raw. I just saw this in the Javadoc earlier.
 
 Otis
 
 --- Fuad Efendi [EMAIL PROTECTED] wrote:
 
  I need specific pre-processing of a html-page, to
 add more fields to 
  Document before storing it in Index, and to modify
 web-interface 
  accordingly.
  
  Where is the base point of extension?
  Thanks!
  
 
 
 
 




Re: Http Max Delays

2005-07-29 Thread Feng \(Michael\) Ji
I ran into that problem before; after I changed the
http.timeout and http.max.delays values to 100 times the
default settings, the problem went away.

You might look at nutch-default.xml and override the
values in nutch-site.xml.

Michael,
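
For example, a nutch-site.xml override of the two properties mentioned above; the values are arbitrary and should be tuned for your crawl:

<property>
  <name>http.timeout</name>
  <value>30000</value>  <!-- milliseconds; hypothetical value -->
</property>
<property>
  <name>http.max.delays</name>
  <value>100</value>    <!-- hypothetical value -->
</property>

Keep in mind that a very large http.max.delays makes fetcher threads wait longer on busy hosts instead of giving up.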

--- Drew Farris [EMAIL PROTECTED] wrote:

 By any chance are you crawling many pages stored on a
 single server or a small number of servers? If so, take a
 look at:
 

http://www.mail-archive.com/nutch-developers%40lists.sourceforge.net/msg04414.html

http://www.mail-archive.com/nutch-developers%40lists.sourceforge.net/msg04427.html
 
 On 7/27/05, Christophe Noel
 [EMAIL PROTECTED] wrote:
  Hello,
  
  When I'm fetching, I really get too many HTTP timeouts
  with the default Nutch parameters.
  
  Does anyone have tips to improve that point ?
  
  Thanks very much.
  
  Christophe Noël.
  www.cetic.be
  
  =
  
  org.apache.nutch.protocol.RetryLater: Exceeded
 http.max.delays: retry later.
  at
 

org.apache.nutch.protocol.httpclient.Http.blockAddr(Http.java:133)
  at
 

org.apache.nutch.protocol.httpclient.Http.getProtocolOutput(Http.java:201)
  at
 

org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:135)
  org.apache.nutch.protocol.RetryLater: Exceeded
 http.max.delays: retry later.
  at
 

org.apache.nutch.protocol.httpclient.Http.blockAddr(Http.java:133)
  at
 

org.apache.nutch.protocol.httpclient.Http.getProtocolOutput(Http.java:201)
  at
 

org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:135)
 
 







http.max.delays

2005-07-27 Thread Feng \(Michael\) Ji
Hi there:

I checked the log file and found that some site links hit
the error "Exceeded http.max.delays: retry later".

I changed the corresponding value in the conf file,
nutch-default.xml, to 300, but that still seems not to be
enough. Will that affect crawling performance?

Any idea?

thanks,

Michael



Nutch's intranet VS internet crawling

2005-07-24 Thread Feng \(Michael\) Ji

I wonder whether there is any difference between the two,
or does intranet crawling just mean indicating the
intranet site explicitly in crawl-urlfilter.txt under
/conf?

thanks,

Michael,




RE: fetching behavior of Nutch

2005-07-24 Thread Feng \(Michael\) Ji
Thanks Howie,

that points me in the right direction.

Michael,

--- Howie Wang [EMAIL PROTECTED] wrote:

 There are probably two settings you'll need to tweak
 in nutch-default.xml
 
 http.content.limit -- by default it's 64K, if the
 page is
 larger than that, then it essentially truncates the
 file.
 You could be missing lots of links that appear later
 in
 the page.
 
 max.outlinks.per.page -- by default it's 100. You
 might
 want to increase this since for pages with something
 like
 a nested navigation sidebar with tons of links, it
 won't
 get any links from the main part of the page.
 
 The *.xml files are fairly descriptive. So just
 reading through
 them can be pretty helpful. I don't know if there is
 a full
 guide to the config files.
 
 Howie
 
 
 
 
 1)
 I did several test runs fetching pages from two websites.
 The fetching depth is 10.
 
 After checking the log files, I found the number of
 actually fetched page links differs greatly between the
 two sites.
 
 On one site with lots of news, only the first two depth
 levels ran well, and only 5 links were fetched. The
 actual number of links on that site is far beyond that.
 
 The other site fetched through all 10 rounds and got
 hundreds of links.
 
 I wonder if anyone has had a similar experience. Should I
 set up the configuration files in /conf/?
 
 2)
 Also, in the Nutch /conf/ directory I found several
 configuration files. Actually, I only modified
 crawl-urlfilter.txt to let it accept all URLs (*.*).
 
 Is that proper?
 
 I really haven't touched the other conf files. Is there a
 guideline on how to use these files?
 
 thanks,
 
 Michael,
 
 
 
 
 
 




Re: a silly question

2005-07-16 Thread Feng \(Michael\) Ji
hi there:

Actually, this is the first time I have set up Nutch.

1) I crawled a website in the Nutch directory.

2) I deleted ROOT in Tomcat and copied Nutch*.war to
tomcat/webapps.

3) I can run Tomcat successfully both before and after
copying the *.war file; that is, I can see the default
Tomcat homepage before and the Nutch search page after.

4) I changed file permissions to 777 with chmod for both
the Nutch and Tomcat folders. Are there any particular
files or folders whose permissions I need to take care of?

Then, when I type text into the search box and hit the
search button, it gives me an HTTP status 500 error
message. Any more suggestions?

I found that the search engine actually runs search.jsp.
Where is this JSP? Somewhere in Nutch, I guess.

thanks,

Michael,

--- Fredrik Andersson [EMAIL PROTECTED]
wrote:

 Have you deleted your old nutch.war (and the directory
 named nutch) in the Tomcat webapps directory prior to
 inserting a new war file? Tomcat can act pretty strange
 if you don't delete your old x.war and the webapps/x dir.
 
 Using port 8080? Permissions on the .jsp and war file OK?
 Can you access the standard Tomcat site by browsing to
 localhost:8080 (Congratulations, you have successfully
 installed Tomcat blabla...)?
 
 On 7/16/05, Feng (Michael) Ji [EMAIL PROTECTED]
 wrote:
  
  Hi there,
  
  I know this questions might be related to nutch's
 user
  group, but if any one could help me a bit, I will
  really appreciate it.
  
  --
  I try to run Nutch in my Linux server and followed
  each step in the tutorial of
  http://lucene.apache.org/nutch/tutorial.html
  
  I can run Nutch successfully, e.g. crawling and saving
  the db to my local server.
  
  I can launch the Tomcat web page with the Nutch search
  home page successfully, but after I hit the search
  button, Tomcat gives me an HTTP status 500 error...
  
  It looks like the JSP (Java) doesn't compile properly in
  the Tomcat container.
  
  I think it is not a Tomcat problem; somehow Nutch doesn't
  compile well, or maybe some Nutch library isn't being
  accessed properly.
  
  Has anyone had a similar experience? Any suggestion will
  be very helpful,
  
  thanks ahead,
  
  Michael,
  
  
 
 




