Re: writing a metadata content tag

2006-03-08 Thread Howie Wang
> What I want to do is add some header info in parse-filter which will be used by index-filter to add my own new FIELD. Rgds Prabhu
I would recommend doing it at the index phase if possible. If the end goal is to have it searchable from the index, ask if you really need to h

Re: writing a metadata content tag

2006-03-08 Thread Raghavendra Prabhu
Hi Howie, what you have mentioned is in the indexing fields. I am speaking about content. I thought there are three steps: parse-filter, index-filter, query-filter. I think you are referring to the second step, index-filter. I want more on the first step, parse-filter. What I want to do is add

RE: writing a metadata content tag

2006-03-08 Thread Howie Wang
You need to write your own indexing filter plugin. Take a look at index-basic. In BasicIndexingFilter.java there are a whole bunch of lines that do something like: doc.add(Field.Text("myfield", myFieldValue)); Just add your own field. You have access to title, anchor, and page text in this funct
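For context, the pattern Howie points at in BasicIndexingFilter.java is simply adding one more field to the document being indexed. The sketch below is a runnable stand-in, not Nutch code: a plain map replaces the real Lucene Document class so the shape of the change is visible without the Nutch/Lucene jars, and the field names other than "title" and "content" are illustrative.

```java
import java.util.*;

// Hypothetical stand-in for the pattern in BasicIndexingFilter.java:
// add one custom field next to the standard ones. A plain map replaces
// the real Lucene Document class, so no Nutch or Lucene jars are needed.
public class IndexFilterSketch {

    static Map<String, String> buildDoc(String title, String text, String myFieldValue) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("title", title);          // standard field
        doc.put("content", text);         // standard field
        doc.put("myfield", myFieldValue); // the custom field you add
        return doc;
    }

    public static void main(String[] args) {
        System.out.println(buildDoc("Home", "page text", "sports"));
        // prints {title=Home, content=page text, myfield=sports}
    }
}
```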

Re: help with creating a directory ie front page menu of common terms

2006-03-08 Thread David Wallace
This is true. What I do is I have Nutch log all the searches. Every few weeks, I grab the most common search terms out of the log and turn them into my "common searches" menu. Although having a manual process is not desirable, it does remove the possibility that a spammer will sabotage my men
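The log-mining step David describes can be sketched in a few lines of Java. This is a hypothetical stand-in, not Nutch code: the class and method names are illustrative, and a real version would read the logged queries from the search log on disk instead of an in-memory list.

```java
import java.util.*;
import java.util.stream.*;

// Hypothetical sketch of "grab the most common search terms out of the
// log". Names are illustrative; a real version would read a log file.
public class CommonSearches {

    // Return the n most frequent queries, most frequent first.
    static List<String> topTerms(List<String> loggedQueries, int n) {
        Map<String, Long> counts = loggedQueries.stream()
            .map(q -> q.trim().toLowerCase())
            .collect(Collectors.groupingBy(q -> q, Collectors.counting()));
        return counts.entrySet().stream()
            .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
            .limit(n)
            .map(Map.Entry::getKey)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> log = Arrays.asList(
            "dog food", "cat", "dog food", "vet", "dog food", "cat");
        System.out.println(topTerms(log, 2)); // prints [dog food, cat]
    }
}
```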

Re: Adaptive Refetching

2006-03-08 Thread Doug Cutting
Andrzej Bialecki wrote: Doug Cutting wrote: are refetched, their links are processed again. I think the easiest way to fix this would be to change ParseOutputFormat to not generate STATUS_LINKED crawldata when a page has been refetched. That way scores would only be adjusted for links in the

"already exists" error in indexing

2006-03-08 Thread Teruhiko Kurosaka
I'm using Nutch 0.7.1. on Windows. My crawling&indexing task ended with this Java IO Exception: java.io.IOException: already exists: C:\nutch-0.7\intranet_0308\db\webdb.new\pagesByURL at org.apache.nutch.io.MapFile$Writer.(MapFile.java:86) at org.apache.nutch.db.WebDBWriter$Clo

Search speed - resolution/summary

2006-03-08 Thread Insurance Squared Inc.
Seems we've found the problem that was causing our search delays. We had some indexes that were 32 bytes; apparently they'd crashed somehow (not yet determined how). The existence of these segments was the source of the problem. We removed those segments and the search is running along much

Re: Adaptive Refetching

2006-03-08 Thread Andrzej Bialecki
Doug Cutting wrote: The OPIC algorithm is not really designed for re-fetching. It assumes that each link is seen only once. When pages Ummm. well, this is definitely not our case. are refetched, their links are processed again. I think the easiest way to fix this would be to change ParseO

writing a metadata content tag

2006-03-08 Thread Raghavendra Prabhu
Hi guys, sorry for the follow-up mail. My requirement, as I was mentioning previously, should let me stamp documents with some kind of type. How do I do it? For example, add "sports" to a field TYPEFIELD on seeing football, tennis in the extracted text. For example, add "technology" to the same field TYPEFIEL

adding content

2006-03-08 Thread Raghavendra Prabhu
Hi, I am planning to write a parse filter which should add a header on finding a keyword in the extracted text. For example, if the extracted text contains football, tennis or baseball, I will add a header called "sports". If the extracted text contains internet, language, I will add a header called tech
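A minimal sketch of the keyword-to-header mapping described above, assuming the plan is simple substring matching on the extracted text. The class name, method name and keyword lists are illustrative; in Nutch this logic would live inside a parse-filter plugin, not a standalone class.

```java
import java.util.*;

// Minimal sketch of the keyword-to-header logic, assuming simple
// substring matching. Names and keyword lists are illustrative; in
// Nutch this would sit inside a parse-filter plugin.
public class TypeTagger {

    private static final Map<String, List<String>> KEYWORDS = new LinkedHashMap<>();
    static {
        KEYWORDS.put("sports", Arrays.asList("football", "tennis", "baseball"));
        KEYWORDS.put("technology", Arrays.asList("internet", "language"));
    }

    // Return the set of headers whose keyword lists match the text.
    static Set<String> tag(String extractedText) {
        String text = extractedText.toLowerCase();
        Set<String> headers = new TreeSet<>();
        for (Map.Entry<String, List<String>> e : KEYWORDS.entrySet()) {
            for (String keyword : e.getValue()) {
                if (text.contains(keyword)) {
                    headers.add(e.getKey());
                    break; // one keyword match is enough for this header
                }
            }
        }
        return headers;
    }

    public static void main(String[] args) {
        System.out.println(tag("Live tennis scores on the internet"));
        // prints [sports, technology]
    }
}
```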

Re: Adaptive Refetching

2006-03-08 Thread Raghavendra Prabhu
Hi, is this not a critical problem? We right now generate segments and refetch pages, and any refetched segment will rank relatively higher, making search results irrelevant. So ultimately relevant results are not returned. Is it? Rgds Prabhu On 3/9/06, Doug Cutting <[EMAIL PROTECTED]> wro

Re: Boolean OR QueryFilter

2006-03-08 Thread Doug Cutting
David Odmark wrote: So am I correct in believing that in order to implement boolean OR using Nutch search and a QueryFilter, one must also (minimally) hack the NutchAnalysis.jj file to produce a new analyzer? Also, given that a Nutch Query object doesn't seem to have a method to add a non-requi

Re: Adaptive Refetching

2006-03-08 Thread Doug Cutting
Andrzej Bialecki wrote: What i infer is, 1. For every refetch, the score of files (but not the directory) is increasing This is curious, it should not be so. However, it's the same in the vanilla version of Nutch (without this patch), so we'll address this separately. The OPIC al

Re: help with creating a directory ie front page menu of common terms

2006-03-08 Thread Insurance Squared Inc.
Just a note that while this idea is good, displaying 'recent searches' can be abused by spammers. All they have to do is hammer your server with a bunch of queries for 'www.some-poker-site.com' and their website gets a link from yours. I'd be very leery of republishing any user input to your syst

Re: how to search data on DSF (0.8)

2006-03-08 Thread Stefan Groschupf
I just have no login and IP for the box any more. In case you send me a login, IP and the path where the sources are, I can have someone take a look tomorrow. Stefan On 08.03.2006 at 19:03, Stefan Groschupf wrote: Storing the index on a dfs works just change conf to use dfs in nutch.war/Web

Re: how to search data on DSF (0.8)

2006-03-08 Thread Stefan Groschupf
Storing the index on a DFS works: just change the conf to use dfs in nutch.war/Web-inf/classes/nutch-default.xml and set up the correct path in the property searcher.dir. However, it is slow. Anyway, since you say little search app, I strongly suggest using a local file system. Nutch 0.8 runs

Re: how to search data on DSF (0.8)

2006-03-08 Thread Olive g
Thank you! Sorry I am a newbie. I meant searching an index located on dfs for a term. I would like to run my little search app from command line on Linux. Help please! From: Stefan Groschupf <[EMAIL PROTECTED]> Reply-To: nutch-user@lucene.apache.org To: nutch-user@lucene.apache.org Subject: Re

Re: help with creating a directory ie front page menu of common terms

2006-03-08 Thread Stefan Groschupf
Have a look at the IndexReader object in the Lucene package. On 08.03.2006 at 10:07, Stephen Ensor wrote: Hi, I am using nutch to create a vertical search site and wish to create a directory type menu for my front page with all the most common terms in my index. For example say my v

how to exclude URLs with particular string in them

2006-03-08 Thread Ivaylo Georgiev
Hi. How do I exclude from fetching URLs with a particular string in them (for example "SISID=")? What regular expression do I have to put in regex-urlfilter.txt? Thank you, Ivaylo Georgiev
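A plausible answer is a single exclude line in regex-urlfilter.txt, for example `-SISID=` (assuming, as in Nutch's regex URL filter, that each line is a `+` or `-` sign followed by a regex matched anywhere in the URL, so `-` drops matching URLs). The runnable sketch below mimics that one rule with java.util.regex.

```java
import java.util.regex.Pattern;

// Sketch of the intended filter behavior. Assumption: in Nutch's
// regex-urlfilter.txt each line is "+" or "-" followed by a regex that
// is matched anywhere in the URL, so a line like "-SISID=" should drop
// any URL containing that string. This class mimics that single rule.
public class UrlFilterDemo {

    private static final Pattern EXCLUDE = Pattern.compile("SISID=");

    // Accept the URL unless the exclude pattern occurs anywhere in it.
    static boolean accept(String url) {
        return !EXCLUDE.matcher(url).find();
    }

    public static void main(String[] args) {
        System.out.println(accept("http://example.com/page?SISID=42")); // prints false
        System.out.println(accept("http://example.com/page?id=42"));    // prints true
    }
}
```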

Content of page

2006-03-08 Thread Maciej Szwajcowski
Hello, I'm using nutch, version 0.8 dev. I'm using the NUTCH API like this: Hits hits = nutchBean.search(query, numHits); HitDetails hitDetails = nutchBean.getDetails(hits.getHit(0)); byte[] content = nutchBean.getContent(hitDetails); Now I want to retrieve the content of a given URL rather than

help with creating a directory ie front page menu of common terms

2006-03-08 Thread Stephen Ensor
Hi, I am using nutch to create a vertical search site and wish to create a directory type menu for my front page with all the most common terms in my index. For example say my vertical search is pets and my index is full of pet sites and pages, the common terms would be (cat, dog, fish, food, vet,

Re: Search Speed

2006-03-08 Thread keren nutch
Hi Stefan, thank you for the reply. We have 31 segments; they total 106 GB. Keren Stefan Groschupf <[EMAIL PROTECTED]> wrote: How many segments do you have and how big are they? Try a disk IO measurement tool or script; what does it say? On 08.03.2006 at 17:38, Insurance Squared Inc. wrote:

Re: how to search data on DSF (0.8)

2006-03-08 Thread Stefan Groschupf
What do you mean by "search data"? You can do bin/hadoop dfs -ls to browse the DFS. Also, there are some JUnit tests in the hadoop project that illustrate how to use the API (TestDFS). Cheers Stefan On 08.03.2006 at 18:40, Olive g wrote: Hello, Does anyone have sample code (using the Nutch API and

how to search data on DSF (0.8)

2006-03-08 Thread Olive g
Hello, Does anyone have sample code (using the Nutch API and running from command line) to search data on DSF? I am using version 0.8. Thank you. Olive

Re: Search Speed

2006-03-08 Thread Insurance Squared Inc.
We've got about 6 or 8 segments, but we just merged our indexes in an attempt to speed things up. Total hard drive space is something like 60-80 gigs, in that neighbourhood. Nothing there strikes me as suspicious. I could look at disc i/o speeds, but I'm doubtful that's the issue. We're run

Re[4]: help - distributed crawl in 0.7.1

2006-03-08 Thread Dima Mazmanov

Re: Re[2]: help - distributed crawl in 0.7.1

2006-03-08 Thread Stefan Groschupf

Re: why TOTAL urls: 1

2006-03-08 Thread Stefan Groschupf
I guess yahoo.com has a robots.txt that blocks crawling the complete site. Also check the crawl depth you use. On 08.03.2006 at 17:53, Olive g wrote: Hello everyone, I am also running distributed crawl on .8.0 (some dev version) and somehow the stats always returned TOTAL urls as 1 while I was

Re: help - distributed crawl in 0.7.1

2006-03-08 Thread Olive g
Thanks! I saw that one too, but according to Doug, it was for 0.8 only. Does anyone have step-by-step instructions like the one for 0.8? Also, does anyone know why the URL total is always 1 when I run 0.8? 060308 064420 map 0% 060308 064427 map 100% 060308 064433 reduce 100% 060308 064433 Job compl

Re: help - distributed crawl in 0.7.1

2006-03-08 Thread TDLN
Detailed distributed crawl implementation: http://www.mail-archive.com/nutch-user@lucene.apache.org/msg02270.html I am not sure it applies to 0.7 though, but it has a lot of info. Rgrds, Thomas

Re[2]: help - distributed crawl in 0.7.1

2006-03-08 Thread Dima Mazmanov

Re: Link Farms

2006-03-08 Thread Matt Kangas
Hi folks, Offhand, I'm not aware of any slam-dunk solution to link farms either. One thing that could help mitigate the problem is a pre-built blacklist of some sort. For example: http://www.squidguard.org/blacklist/ That one is really meant for blocking user-access to porn, known warez

Re: help - distributed crawl in 0.7.1

2006-03-08 Thread Olive g
Thank you so much for your reply! I just sent another message because I am having other issues with 0.8, and somehow the TOTAL urls is always 1 when I search big sites such as www.yahoo.com. I thought 0.7.1 might be more stable? The stats: 060308 064418 Client connection to 9.2.13.8:8010 : st

Re[2]: help - distributed crawl in 0.7.1

2006-03-08 Thread Dima Mazmanov

Re: help - distributed crawl in 0.7.1

2006-03-08 Thread TDLN
You can start here http://wiki.apache.org/nutch/NutchDistributedFileSystem Also, I think there have been several posts in the mailing list that contain such a step-by-step overview. Rgrds, Thomas On 3/8/06, Olive g <[EMAIL PROTECTED]> wrote: > > Hi I am new here. > Could someone please let me kn

why TOTAL urls: 1

2006-03-08 Thread Olive g
Hello everyone, I am also running a distributed crawl on 0.8.0 (some dev version) and somehow the stats always returned TOTAL urls as 1 while I was searching some sites such as www.yahoo.com! My filter file allows everything. What might be the problem? There was no obvious error in the log files and th

Re: help - distributed crawl in 0.7.1

2006-03-08 Thread Dima Mazmanov

Re: help - distributed crawl in 0.7.1

2006-03-08 Thread Stefan Groschupf
Better to use Nutch 0.8 to run a crawl using several machines. There is some documentation in the wiki now. On 08.03.2006 at 17:49, Olive g wrote: Hi I am new here. Could someone please let me know the step-by-step instructions to set up distributed crawl in 0.7.1? Thank you.

help - distributed crawl in 0.7.1

2006-03-08 Thread Olive g
Hi I am new here. Could someone please let me know the step-by-step instructions to set up distributed crawl in 0.7.1? Thank you.

Re: Search Speed

2006-03-08 Thread Stefan Groschupf
How many segments do you have and how big are they? Try a disk IO measurement tool or script; what does it say? On 08.03.2006 at 17:38, Insurance Squared Inc. wrote: I appreciate your patience as we try to get over our search speed issues. We're getting closer - it seems we are having huge del

Search Speed

2006-03-08 Thread Insurance Squared Inc.
I appreciate your patience as we try to get over our search speed issues. We're getting closer - it seems we are having huge delays when retrieving the summaries for the various search results. Below are our logs from a search, you can see that retrieving some of the search summaries took in

Crawl crash hadoop

2006-03-08 Thread Bud Witney
What's going on with this? I tried the nightly build to see the future build and got the following error on an intranet crawl. Is there good documentation on how to set up hadoop? I used ./bin/nutch crawl urls -dir crawl.academic -depth 10 and export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework

Re: Adaptive Refetching

2006-03-08 Thread Raghavendra Prabhu
A good analysis. Even I was doing something in a similar manner. We should also have more people testing this and contributing, so that we can commit this to nutch. Rgds Prabhu On 3/8/06, Andrzej Bialecki <[EMAIL PROTECTED]> wrote: > > D.Saravanaraj wrote: > > Hi Andrzej, > > > > Thanks for your A

Crawling sites with Encoded URLs

2006-03-08 Thread sudhendra seshachala
Hi, I have been trying to crawl sites with encoded URLs, and am trying to escape characters in crawl-urlfilter.txt. For some reason it does not seem to be working... One solution is to extend the crawler; are there any other options? :) Please let me know. Thanks Sudhi Sud

Re: Adaptive Refetching

2006-03-08 Thread Andrzej Bialecki
D.Saravanaraj wrote: Hi Andrzej, Thanks for your Adaptive Refetch patch. I didn't get the working of adaptive refetch well. I examined the working of adaptive refetching by reading the crawldb. I created a folder in Windows with 2 files and tried adaptive refetching on that (URL is file:/D:/Test/).

Re: Nutch for indexing local folders, files...

2006-03-08 Thread D . Saravanaraj
Generally, for indexing local files and folders it is better to use Lucene. But it depends on your requirements. On 3/8/06, sudhendra seshachala <[EMAIL PROTECTED]> wrote: > > Question to experts.. > If I have users upload documents (WORD, PDF, PPt.. etc.) and I have to > search, I do index using

Re: Problem with searching

2006-03-08 Thread D . Saravanaraj
What is the version you are using? For 0.7.1, place your index and segments folders where your "searcher.dir" points to. For 0.8, place your entire "crawl" folder there (if I am not mistaken, the folder name should always be "crawl"). On 3/8/06, fabrizio sil

Nutch commands and exit status

2006-03-08 Thread Steven Yelton
Any thoughts on making severe errors with the nutch commands return a non-zero exit status? I see several places where an exit status of -1 is 'returned', but not in all failure cases. Steven

Re: retry later

2006-03-08 Thread mos
Hi Andrzej, thanks for going into this subject. I'm glad that this issue will be resolved in version 0.8. That makes me hopeful. :) Sure, fixing this bug in version 0.7.1 wouldn't be necessary if the new version 0.8 will be available in the next few weeks. And the workaround works for me until then

Beware of using LOG.severe in parsing filters/plugins

2006-03-08 Thread Gal Nitzan
Hi, I just noticed this behavior in Fetcher.java: if a severe log entry is written, it will silently end the task :-) And I just didn't know why my fetcher was fetching too few pages. So just pay attention to that. Gal

Adaptive Refetching

2006-03-08 Thread D . Saravanaraj
Hi Andrzej, Thanks for your Adaptive Refetch patch. I didn't get the working of adaptive refetch well. I examined the working of adaptive refetching by reading the crawldb. I created a folder in Windows with 2 files and tried adaptive refetching on that (URL is file:/D:/Test/). = Only injected t

Re: nutch and multilingualism

2006-03-08 Thread Wray Buntine
Ivan Sekulovic wrote: Jérôme Charron wrote: Would it be possible to generate ngram profiles for the LanguageIdentifier plugin from crawled content and not from a file? Here is my idea: the best source for content in one language could be wikipedia.org. We would just crawl the wikipedia in the desired

Re: nutch and multilingualism

2006-03-08 Thread Ivan Sekulovic
Jérôme Charron wrote: Would it be possible to generate ngram profiles for the LanguageIdentifier plugin from crawled content and not from a file? Here is my idea: the best source for content in one language could be wikipedia.org. We would just crawl the wikipedia in the desired language and then create

Re: Problem with searching

2006-03-08 Thread fabrizio silvestri
I inserted these lines: searcher.dir /home/paul/nutch-searcher.dir (my path to nutch's searcher dir) within the tags. What's wrong with this? f On 3/8/06, Dima Mazmanov <[EMAIL PROTECTED]> wrote: > Hi, fabrizio. > > What are your changes in nutch-site.xml? > Did you point database di

Re: retry later

2006-03-08 Thread Andrzej Bialecki
mos wrote: when you get an error while fetching, and you get the org.apache.nutch.protocol.retrylater because the max retries have been reached, nutch says it has given up and will retry later, when does that retry occur? That's an issue I reported some weeks ago and which is in my opinion

Re: Problem with searching

2006-03-08 Thread Dima Mazmanov
Hi, fabrizio. What are your changes in nutch-site.xml? Did you point to the database directory? You wrote on 8 March 2006, 13:41:26: > Hi Guys, > I have a question... > I successfully created an index using the tutorial example... now I > would like to search it using the jsp application. > I think I

Problem with searching

2006-03-08 Thread fabrizio silvestri
Hi Guys, I have a question... I successfully created an index using the tutorial example... now I would like to search it using the jsp application. I think I have correctly set up tomcat but when I try to search for something nutch always returns 0 results. I tried to start tomcat from the dir

Re: retry later

2006-03-08 Thread mos
> when you get an error while fetching, and you get the > org.apache.nutch.protocol.retrylater because the max retries have been > reached, nutch says it has given up and will retry later, when does that > retry occur? That's an issue I reported some weeks ago and which is in my opinion an annoyin

Nutch and authorization

2006-03-08 Thread Laurent Michenaud
Hi, do you know good strategies to manage authorization? I mean, a user should only see the nutch results he has rights on. Thanks for your comments.