Re: How to implement web dictionary in nutch

2005-06-23 Thread Andy Liu
I posted a patch for this a little while ago.

http://issues.apache.org/jira/browse/NUTCH-48

It's a spell checker with a dictionary based on the terms in your
index.  I think this is what you're looking for.
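For anyone curious about the general technique, here is a rough, untested sketch
of a "did you mean" built from the terms already in a Lucene index: walk the term
dictionary and suggest the closest term by edit distance.  This only illustrates
the idea, it is not what the NUTCH-48 patch actually does, and the index path and
field name are placeholders:

  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.index.TermEnum;

  public class DidYouMean {

    // Suggest the indexed term closest to the query word, or null if nothing is close.
    public static String suggest(String indexDir, String field, String word) throws Exception {
      IndexReader reader = IndexReader.open(indexDir);
      TermEnum terms = reader.terms();
      String best = null;
      int bestDist = Integer.MAX_VALUE;
      while (terms.next()) {
        if (!field.equals(terms.term().field())) continue;
        String candidate = terms.term().text();
        int d = editDistance(word, candidate);
        if (d > 0 && d < bestDist) { bestDist = d; best = candidate; }
      }
      terms.close();
      reader.close();
      return bestDist <= 2 ? best : null;   // only suggest near misses
    }

    // Plain Levenshtein edit distance.
    static int editDistance(String a, String b) {
      int[][] d = new int[a.length() + 1][b.length() + 1];
      for (int i = 0; i <= a.length(); i++) d[i][0] = i;
      for (int j = 0; j <= b.length(); j++) d[0][j] = j;
      for (int i = 1; i <= a.length(); i++) {
        for (int j = 1; j <= b.length(); j++) {
          int subst = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
          d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1), d[i - 1][j - 1] + subst);
        }
      }
      return d[a.length()][b.length()];
    }
  }

A real implementation would also weight suggestions by how frequent the candidate
term is in the index, so rare misspellings don't get suggested back.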

Andy

On 22 Jun 2005 06:47:34 -, bala santhanam
<[EMAIL PROTECTED]> wrote:
> 1) How do I implement a web dictionary in Nutch? I.e., if I'm searching for a phrase
> 
> "free downloas" google automatically offers me
> Did you mean: "free downloads"
> 
> So it is able to maintain a web dictionary and correct my spelling 
> mistake. How do I implement this in Nutch? Are there any ready-made 
> classes available?
> 
> 2) Also, if I'm searching for "agentcy", search engines offer
> Did you mean: "agency". But when "agentcy" keeps being searched for, search engines 
> eventually add the word "agentcy" to the web dictionary themselves. How does that 
> word get added? Suggestions please.
>


Re: How to implement web dictionary in nutch

2005-06-23 Thread Andy Liu
If anybody has had a chance to play with this patch, let me know what
your experiences are.  There's always room for improvement.  Just post
your comments on JIRA.

Andy

On 6/23/05, Stefan Groschupf <[EMAIL PROTECTED]> wrote:
> > http://issues.apache.org/jira/browse/NUTCH-48
> >
> > It's a spell checker with a dictionary based on the terms in your
> > index.  I think this is what you're looking for.
> 
> Guys,
> I would love to see people voting for issues,
> since issue votes are the (maybe only) voice of the community. :-D
> 
> I already voted for this issue some time ago and would love to see
> such a solution in the sources.
> 
> Cheers,
> Stefan
>


Re: Iterating spidered pages

2005-07-05 Thread Andy Liu
You can use a SegmentReader object to give you references to the
FetcherOutput, ParseData, and Content objects for each page in the
segment.  The raw page data is encapsulated within the Content object
so you can parse out whatever you want from it.

However, somebody correct me if I'm wrong, but I don't think you can
update individual ArrayFile entries once they've been written.  So
while you're looping over each ParseData entry, you can write your
updated ParseData objects to a temporary ArrayFile and then swap it in
for the old one when you're done.
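Roughly like this -- an untested sketch, and the constructor and next(...)
signatures below are what I remember of the 0.6/0.7-era SegmentReader, so
double-check them against your source tree before relying on it:

  import java.io.File;
  import org.apache.nutch.fetcher.FetcherOutput;
  import org.apache.nutch.parse.ParseData;
  import org.apache.nutch.parse.ParseText;
  import org.apache.nutch.protocol.Content;
  import org.apache.nutch.segment.SegmentReader;

  public class SegmentDump {
    public static void main(String[] args) throws Exception {
      SegmentReader reader = new SegmentReader(new File(args[0]));  // segment directory
      FetcherOutput fo = new FetcherOutput();
      Content content = new Content();
      ParseText parseText = new ParseText();
      ParseData parseData = new ParseData();
      // next(...) fills in one entry per call and returns false at the end.
      while (reader.next(fo, content, parseText, parseData)) {
        byte[] raw = content.getContent();   // the raw fetched page
        // ... parse whatever you need out of 'raw' and write your updated
        // ParseData to a temporary ArrayFile here ...
        System.out.println(content.getUrl() + " : " + raw.length + " bytes");
      }
      reader.close();
    }
  }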

Andy

On 7/5/05, Fredrik Andersson <[EMAIL PROTECTED]> wrote:
> Hi!
> 
> I'm new to this list, so hello to you all.
> 
> Here's the gig - I have crawled and indexed a bunch of pages. The HTML
> Parser used in nutch only parses out the title, text, metadata and
> outlinks. Is there any way to extend this set of attributes
> post-crawling (i.e, without rewriting HtmlParser.java)? I'd like to
> iterate all the crawled pages, access their raw data, parse out some
> chunk of text and save it as a detail field or similar.
> 
> I haven't really got the full hang of all the connections in the
> API yet, so forgive a poor guy for being a newbie.
> 
> Big thanks in advance,
> Fredrik
>


Re: [Nutch-dev] getDiscriptor

2005-07-21 Thread Andy Liu
Deprecating them won't hurt, especially for those creating custom
extension points and using the misspelled method names.
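I.e., something along these lines -- purely illustrative, with a stand-in class
and a made-up parameter, since the real PluginRepository method may look different:

  public class PluginRegistry {   // stand-in class, not the real PluginRepository

    /** @deprecated misspelled; use {@link #dependencyIsAvailable(String)} instead. */
    public boolean dependencyIsAvailabel(String pluginId) {
      return dependencyIsAvailable(pluginId);
    }

    public boolean dependencyIsAvailable(String pluginId) {
      // stand-in for the existing lookup logic, under the corrected name
      return false;
    }
  }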

On 7/21/05, Jérôme Charron <[EMAIL PROTECTED]> wrote:
> > And also PluginRepository.dependencyIsAvailabel - rename, or
> > deprecate and correct?
> 
> Erik, what a good code reviewer you are!
> You know what I think about deprecated methods (if there's a chance it's
> used outside of Nutch, then it must be deprecated; if the impact is only on
> Nutch internal code, there's no need to deprecate)
> ;)
> 
> Jerome
> 
> --
> http://motrech.free.fr/
> http://www.frutch.org/
> 
>


Re: IndexOptimizer bug?

2005-07-22 Thread Andy Liu
I believe this tool is unfinished and unsupported.

On 7/22/05, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
> I found an IndexOptimizer in Nutch.
> When I run it, it throws an exception:
> 
> Optimizing url:http from 226957 to 22696
> Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 22697
>         at org.apache.lucene.util.PriorityQueue.put(PriorityQueue.java:46)
>         at org.apache.nutch.indexer.IndexOptimizer$OptimizingTermPositions.seek(IndexOptimizer.java:153)
>         at org.apache.lucene.index.SegmentMerger.appendPostings(SegmentMerger.java:325)
>         at org.apache.lucene.index.SegmentMerger.mergeTermInfo(SegmentMerger.java:296)
>         at org.apache.lucene.index.SegmentMerger.mergeTermInfos(SegmentMerger.java:270)
>         at org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:234)
>         at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:96)
>         at org.apache.lucene.index.IndexWriter.addIndexes(IndexWriter.java:578)
>         at org.apache.nutch.indexer.IndexOptimizer.optimize(IndexOptimizer.java:215)
>         at org.apache.nutch.indexer.IndexOptimizer.main(IndexOptimizer.java:235)
>


Detecting CJKV / Asian language pages

2005-08-01 Thread Andy Liu
The Nutch language identifier plugin currently doesn't handle
CJKV pages.  Does anybody here have any experience with automatically
detecting the language of such pages?

I know there are specific encodings which give away what language the
page is in, but for Asian-language pages that use Unicode or its
variants, I'm out of luck.
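For the Unicode case, one crude starting point is to count characters by Unicode
block: kana means Japanese, hangul means Korean, and ideographs with neither is
probably Chinese.  A rough sketch (the thresholds are arbitrary, and telling
Chinese variants apart, or handling mixed pages, needs something smarter, e.g.
n-gram profiles):

  public class CjkSniffer {

    // Very rough guess ("ja", "ko", "zh" or null) based on Unicode blocks alone.
    public static String sniff(String text) {
      int kana = 0, hangul = 0, ideographs = 0, total = 0;
      for (int i = 0; i < text.length(); i++) {
        char c = text.charAt(i);
        if (Character.isWhitespace(c)) continue;
        total++;
        Character.UnicodeBlock block = Character.UnicodeBlock.of(c);
        if (block == Character.UnicodeBlock.HIRAGANA
            || block == Character.UnicodeBlock.KATAKANA) kana++;
        else if (block == Character.UnicodeBlock.HANGUL_SYLLABLES) hangul++;
        else if (block == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS) ideographs++;
      }
      if (total == 0) return null;
      if (kana > total * 0.05) return "ja";          // kana occurs only in Japanese
      if (hangul > total * 0.05) return "ko";
      if (ideographs > total * 0.3) return "zh";     // ideographs with no kana/hangul
      return null;
    }
  }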

Andy


Re: Memory usage

2005-08-02 Thread Andy Liu
How do you figure that it takes 1.5 GB of RAM for 30M pages?  I believe
that when the Lucene indexes are read, it reads all the numbered *.f*
files and the *.tii files into memory.  The numbered *.f* files
contain the length normalization values for each indexed field (1 byte
per doc), and the .tii file contains every kth term (k=128 by default,
I think).

For 30M documents, each *.f* file is 30 megs, and your .tii file
should be less than 100 megs.  For 8 indexed fields, you'd be looking
at a memory footprint of about 340M.  Any extra memory on the server
can be used for buffer caching which will speed up searches.
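The arithmetic, spelled out (the 1 byte per doc per indexed field for norms and
the ~100 MB ceiling for the .tii term index are the assumptions here):

  public class IndexMemoryEstimate {
    public static void main(String[] args) {
      long docs = 30000000L;        // documents in the index
      int indexedFields = 8;        // fields that carry norms
      long normBytes = docs * indexedFields;    // numbered *.f* files: 1 byte/doc/field
      long tiiBytes  = 100L * 1024 * 1024;      // rough ceiling for the in-memory term index
      long totalMb = (normBytes + tiiBytes) / (1024 * 1024);
      // ~229 MB of norms plus <100 MB of terms, i.e. roughly 330-340 MB for 30M docs.
      System.out.println("Estimated search-time footprint: ~" + totalMb + " MB");
    }
  }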

If you'd like, you can set up search servers to spread the load across
separate machines.

The servlet container you use shouldn't make much of a difference in
memory usage.

Andy

On 8/2/05, Jay Pound <[EMAIL PROTECTED]> wrote:
> I'm testing an index of 30 million pages; it requires 1.5 GB of RAM to search
> using Tomcat 5. I plan on having an index with multiple billion pages, but
> if this is to scale, then even with 16 GB of RAM I won't be able to have an
> index larger than 320 million pages? How can I distribute the memory
> requirements across multiple machines, or is there another servlet program
> (like Resin) that will require less memory to operate? Has anyone else run
> into this?
> Thanks,
> -Jay Pound
> 
> 
>


Re: Memory usage2

2005-08-02 Thread Andy Liu
I have found that merging indexes does help performance significantly.
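If you want to try it by hand with plain Lucene (I believe Nutch also ships an
index merger tool), the core of a merge looks roughly like this; the paths come
from the command line, and the analyzer doesn't matter when you're only merging:

  import java.io.File;
  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.store.Directory;
  import org.apache.lucene.store.FSDirectory;

  public class MergeIndexes {
    // args[0] = new merged index dir, args[1..] = existing per-segment index dirs
    public static void main(String[] args) throws Exception {
      IndexWriter writer = new IndexWriter(args[0], new StandardAnalyzer(), true);
      Directory[] dirs = new Directory[args.length - 1];
      for (int i = 1; i < args.length; i++) {
        dirs[i - 1] = FSDirectory.getDirectory(new File(args[i]), false);
      }
      writer.addIndexes(dirs);   // merge everything into the new index
      writer.optimize();         // leave a single segment, which searches faster
      writer.close();
    }
  }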

If you're not using the cached pages for anything, I believe you can
delete the /content directory for each segment and the engine should
work fine (test before you try it for real!).  However, if you ever have
to reindex the segments for whatever reason, you'll run into problems
without the /content dirs.

Nutch doesn't use the HITS algorithm.  Nutch's analyze phase was based
on PageRank, but it's no longer supported.  By default Nutch
boosts documents based on the number of incoming links, which works well in
small document collections, but is not a robust method in a whole-web
environment.  In terms of search quality, Nutch can't quite hang with
the "big dogs" of search just yet.  There's still much work to be done
in the areas of search quality and spam resistance.

Andy

On 8/2/05, Fredrik Andersson <[EMAIL PROTECTED]> wrote:
> Hi Jay!
> 
> Why not use the "Google approach" and buy lots of cheap
> workstations/servers to distribute the search on? You can really get
> away cheap these days, compared to high-end servers. Even if NDFS
> isn't fully up to par in 0.7-dev yet, you can still move your indices
> around to separate computers and distribute them that way.  Writing a
> small client/server for this purpose can be done in a matter of hours.
> Gathering as much data as you have on one server sounds like a bad
> idea to me, no matter how monstrous that server is.
> 
> Regarding the HITS algorithm - check out the example on the Nutch
> website for the Internet crawl, where you select the top scorers after
> you've finished a segment (of arbitrary size), and continue crawling
> from those high-ranking sites. That way you will get the most
> authoritative sites in your index first, which is good.
> 
> Good night,
> Fredrik
> 
> On 8/2/05, Jay Pound <[EMAIL PROTECTED]> wrote:
> > 
> > One last important question: if I merge my indexes, will searching be faster
> > than if I don't merge them? I currently have 20 directories of 1-1.7 million
> > pages each.
> > And if I split these indexes up across multiple machines, will searching
> > be faster? I couldn't get the nutch-server to work, but I'm using 0.6.
> > ...
> > Thank you
> > -Jay Pound
> > Fromped.com
> > BTW, Windows 2000 is not 100% stable with dual-core processors. Nutch is OK,
> > but it can't do too many things at once or I'll get a kernel inpage error
> > (guess it's time to migrate to Windows Server 2003, damn)
> > - Original Message -
> > From: "Doug Cutting" <[EMAIL PROTECTED]>
> > To: 
> > Sent: Tuesday, August 02, 2005 1:53 PM
> > Subject: Re: Memory usage
> >
> >
> > > Try the following settings in your nutch-site.xml:
> > >
> > > 
> > >io.map.index.skip
> > >7
> > > 
> > >
> > > 
> > >indexer.termIndexInterval
> > >1024
> > > 
> > >
> > > The first causes data files to use considerably less memory.
> > >
> > > The second affects index creation, so must be done before you create the
> > > index you search.  It's okay if your segment indexes were created
> > > without this, you can just (re-)merge indexes and the merged index will
> > > get the setting and use less memory when searching.
> > >
> > > Combining these two I have searched a 40+M page index on a machine using
> > > about 500MB of RAM.  That said, search times with such a large index are
> > > not good.  At some point, as your collection grows, you will want to
> > > merge multiple indexes containing different subsets of segments and put
> > > each on a separate box and search them with distributed search.
> > >
> > > Doug
> > >
> > > Jay Pound wrote:
> > > > I'm testing an index of 30 million pages, it requires 1.5gb of ram to
> > search
> > > > using tomcat 5, I plan on having an index with multiple billion pages,
> > but
> > > > if this is to scale then even with 16GB of ram I wont be able to have an
> > > > index larger than 320million pages? how can I distribute the memory
> > > > requirements across multiple machines, or is there another servlet
> > program
> > > > (like resin) that will require less memory to operate, has anyone else
> > run
> > > > into this?
> > > > Thanks,
> > > > -Jay Pound
> > > >
> > > >
> > >
> > >
> >
> >
> >
>


Re: Strange search results

2005-08-03 Thread Andy Liu
The fieldNorm is lengthNorm * document boost.  The final value is
"rounded" so that's why you're getting such clean numbers for your
fieldNorm.  If you're finding that these pages have too high a
boost, you can lower indexer.score.power in your conf file.
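To make the numbers concrete, here's a made-up example of how a clean 5.0 can
fall out, assuming the document boost is roughly score^indexer.score.power (the
score and token count below are invented for illustration):

  public class FieldNormExample {
    public static void main(String[] args) {
      float score = 100f;          // hypothetical link-analysis score of a popular page
      float scorePower = 0.5f;     // indexer.score.power (default 0.5); lower it to tame boosts
      float boost = (float) Math.pow(score, scorePower);        // 10.0
      int numTokens = 4;           // e.g. a 4-token title or anchor
      float lengthNorm = (float) (1.0 / Math.sqrt(numTokens));  // 0.5
      float fieldNorm = boost * lengthNorm;                     // 5.0
      // On top of that, Lucene packs the norm into a single byte with only a few
      // bits of precision, so whatever the exact product was, explain() shows an
      // already-rounded value.
      System.out.println("fieldNorm ~= " + fieldNorm);
    }
  }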

As for your problem in #2, look at the explain page to see how that
search result got there.  Maybe there's a high score for an anchor
match.  The anchor text doesn't show up in the text of the page, so
maybe that's it.

Andy

On 8/3/05, Howie Wang <[EMAIL PROTECTED]> wrote:
> Hi,
> 
> I've been noticing some strange search results recently. I seem
> to be getting two issues.
> 
> 1. The fieldNorm for certain terms is unusually high for certain sites
> for anchors and titles. And they are usually just whole numbers (4.0, 5.0,
> etc).
> I find this strange since the lengthNorm used to calculate this is
> very unlikely to result in an integer. It's either 1/sqrt(numTokens) or
> 1/log(e+numTokens). Where is 5.0 coming from?
> 
> 2. I'm getting hits for sites that don't contain ANY of the terms in my
> search. This is exacerbated by issue #1 since the fieldNorm boosts this
> page to the top of the results. I thought it might be because of  my
> changes for stemming, but this happens for search terms that are not
> changed by stemming at all.
> 
> Anyone run into something like this? Any ideas on how to start debugging?
> 
> Thanks,
> Howie
>


Re: near-term plan

2005-08-04 Thread Andy Liu
Sounds good.  I've used the io and fs classes for non-Nutch purposes,
so this separation makes sense.

On 8/4/05, Doug Cutting <[EMAIL PROTECTED]> wrote:
> Here's a near-term plan for Nutch.
> 
> 1. Release Nutch 0.7, based on current trunk.  We should do this ASAP.
> Are there bugs in trunk that we need to fix before this can be done?
> The trunk will be copied to a 0.7 release branch.
> 
> 2. Merge the mapred branch to trunk.
> 
> 3. Move the packages org.apache.nutch.{io,ipc,fs,ndfs,mapred} into a
> separate project for distributed computing tools.  If the Lucene PMC
> approves this, it would be a new Lucene sub-project, a Nutch sibling.
> 
> Does this sound reasonable to folks?
> 
> Doug
> 
>


Re: Injecting documents manually.

2005-08-12 Thread Andy Liu
This is built into Nutch.  Instead of injecting http:// url's, use
file:// , and Nutch will use protocol-file to fetch the files locally.
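For example, a seed file of file:// URLs plus making sure protocol-file is
enabled in plugin.includes.  The plugin list below is only an illustration --
keep whatever else you already use -- and check that your URL filter rules
don't exclude file: URLs:

  # urls/seeds.txt
  file:///data/static-site/index.html
  file:///data/static-site/docs/

  <!-- nutch-site.xml -->
  <property>
    <name>plugin.includes</name>
    <value>protocol-file|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)</value>
  </property>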

Andy

On 8/12/05, Dawid Weiss <[EMAIL PROTECTED]> wrote:
> 
> Has anyone considered/implemented injecting static pages with a
> different URL scheme? I mean the rare scenario where you have tons of
> static HTML pages and want to avoid rerouting requests through your
> own web server, but rather fetch them directly from disk, prefixing their
> disk path with a given URL prefix.
> 
> I looked at the problem briefly (I admit) and it seems it'd require some
> manual coding because of the split between the indexer and fetcher pipeline.
> 
> Any comments and suggestions are very welcome.
> Dawid
> 
> 
>


injection infinite loop

2006-01-04 Thread Andy Liu
If you inject the crawldb with a url file that doesn't end with a line feed,
an infinite loop is entered.  Anybody else encounter this problem?

060104 160950 Running job: job_7uku5w
060104 160952  map 0%
060104 160954  map 50%
060104 160957  map -2631%
060104 160959  map -259756%
060104 161002  map -538552%
060104 161006  map -818413%
060104 161009  map -1098421%
060104 161011  map -1377851%
060104 161014  map -1657718%
060104 161018  map -1939534%
060104 161021  map -2218515%
060104 161023  map -2588212%
060104 161026  map -2868787%
060104 161030  map -3147637%


[jira] Created: (NUTCH-188) Add searchable mailing list links to http://lucene.apache.org/nutch/mailing_lists.html

2006-01-26 Thread Andy Liu (JIRA)
Add searchable mailing list links to 
http://lucene.apache.org/nutch/mailing_lists.html
--

 Key: NUTCH-188
 URL: http://issues.apache.org/jira/browse/NUTCH-188
 Project: Nutch
Type: Improvement
Reporter: Andy Liu
Priority: Trivial


Post links to searchable mail archives on nutch.org 

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Updated: (NUTCH-188) Add searchable mailing list links to http://lucene.apache.org/nutch/mailing_lists.html

2006-01-26 Thread Andy Liu (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-188?page=all ]

Andy Liu updated NUTCH-188:
---

Attachment: mailing_list.patch

> Add searchable mailing list links to 
> http://lucene.apache.org/nutch/mailing_lists.html
> --
>
>  Key: NUTCH-188
>  URL: http://issues.apache.org/jira/browse/NUTCH-188
>  Project: Nutch
> Type: Improvement
> Reporter: Andy Liu
> Priority: Trivial
>  Attachments: mailing_list.patch
>
> Post links to searchable mail archives on nutch.org 

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Created: (NUTCH-189) Injection infinite loop

2006-01-26 Thread Andy Liu (JIRA)
Injection infinite loop
---

 Key: NUTCH-189
 URL: http://issues.apache.org/jira/browse/NUTCH-189
 Project: Nutch
Type: Bug
 Environment: Linux
Reporter: Andy Liu
Priority: Minor


If you inject the crawldb with a url file that doesn't end with a line feed, an 
infinite loop is entered.

060104 160950 Running job: job_7uku5w
060104 160952  map 0%
060104 160954  map 50%
060104 160957  map -2631%
060104 160959  map -259756%
060104 161002  map -538552%
060104 161006  map -818413%
060104 161009  map -1098421%
060104 161011  map -1377851%
060104 161014  map -1657718%
060104 161018  map -1939534%
060104 161021  map -2218515%
060104 161023  map -2588212%
060104 161026  map -2868787%
060104 161030  map -3147637%


-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira