What I want to do is add some header info in the parse filter, which
will be used by the index filter to add my own new field.
Rgds
Prabhu
I would recommend doing it at the index phase if possible. If the end
goal is to have it searchable from the index, ask if you really need to h
Hi Howie
What you have mentioned is about the indexing fields;
I am speaking about the content.
I thought there are three steps:
parse-filter
index-filter
query-filter
I think you are referring to the second step, index-filter. I want more on
the first step, parse-filter.
What I want to do is add
You need to write your own indexing filter plugin. Take a look
at index-basic. In BasicIndexingFilter.java there are a whole
bunch of lines that do something like:
doc.add(Field.Text("myfield", myFieldValue));
Just add your own field. You have access to title, anchor,
and page text in this function.
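For instance, a minimal sketch (hedged: the exact filter() method signature differs between Nutch 0.7 and 0.8, "typefield" is a made-up field name, and computeMyValue is a placeholder for whatever logic you write yourself):

// inside your indexing filter's filter(...) method -- an outline, not the exact 0.7/0.8 API
String text = parse.getText();                    // extracted page text available to the filter
String myFieldValue = computeMyValue(text);       // placeholder for your own logic
doc.add(Field.Text("typefield", myFieldValue));   // Lucene 1.x-style field, as in index-basic
return doc;

Remember to list your plugin in plugin.includes in the conf so the filter actually runs.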
This is true. What I do is I have Nutch log all the searches. Every few
weeks, I grab the most common search terms out of the log and turn them into my
"common searches" menu. Although having a manual process is not desirable, it
does remove the possibility that a spammer will sabotage my menu.
Andrzej Bialecki wrote:
Doug Cutting wrote:
are refetched, their links are processed again. I think the easiest
way to fix this would be to change ParseOutputFormat to not generate
STATUS_LINKED crawldata when a page has been refetched. That way
scores would only be adjusted for links in the
I'm using Nutch 0.7.1. on Windows.
My crawling&indexing task ended with this Java IO Exception:
java.io.IOException: already exists:
C:\nutch-0.7\intranet_0308\db\webdb.new\pagesByURL
at org.apache.nutch.io.MapFile$Writer.&lt;init&gt;(MapFile.java:86)
at
org.apache.nutch.db.WebDBWriter$Clo
Seems we've found the problem that was causing our search delays. We
had some indexes that were 32 bytes; apparently they'd crashed somehow
(not yet determined how). The existence of these segments was the
source of the problem. We removed those segments and the search is
running along much
Doug Cutting wrote:
The OPIC algorithm is not really designed for re-fetching. It assumes
that each link is seen only once. When pages
Ummm. well, this is definitely not our case.
are refetched, their links are processed again. I think the easiest
way to fix this would be to change ParseO
Hi guys
Sorry for the follow-up mail.
My requirement, as I was mentioning previously, is to stamp documents
with some kind of type.
How do I do it?
For example, add "sports" to a field TYPEFIELD on seeing football or tennis in the
extracted text, as in the sketch below.
For example, add "technology" to the same field TYPEFIELD
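Roughly what I have in mind, assuming the extracted text and the Lucene Document are both in scope (for example inside an indexing filter); TYPEFIELD and the keyword lists are just the examples from above:

// sketch: derive a type from keywords in the extracted text and stamp it on the document
String text = parse.getText().toLowerCase();
String type = null;
if (text.indexOf("football") >= 0 || text.indexOf("tennis") >= 0 || text.indexOf("baseball") >= 0) {
  type = "sports";
} else if (text.indexOf("internet") >= 0 || text.indexOf("language") >= 0) {
  type = "technology";
}
if (type != null) {
  doc.add(Field.Text("TYPEFIELD", type));   // querying the new field may also need a query-filter plugin
}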
Hi
I am planning to write a parse filter which should add a header on finding
a keyword in the extracted text.
For example, if the extracted text contains football, tennis or baseball, I
will add a header called sports.
If the extracted text contains internet or language, I will add a header called
tech.
Hi
Is this not a critical problem?
We right now generate segments and refetch pages, and any refetched segment
will rank relatively higher, making search results irrelevant.
So ultimately relevant results are not returned. Is it
Rgds
Prabhu
On 3/9/06, Doug Cutting <[EMAIL PROTECTED]> wrote:
David Odmark wrote:
So am I correct in believing that in order to implement boolean OR using
Nutch search and a QueryFilter, one must also (minimally) hack the
NutchAnalysis.jj file to produce a new analyzer? Also, given that a
Nutch Query object doesn't seem to have a method to add a non-requi
Andrzej Bialecki wrote:
What I infer is:
1. For every refetch, the score of files (but not the directory) is
increasing
This is curious, it should not be so. However, it's the same in the
vanilla version of Nutch (without this patch), so we'll address this
separately.
The OPIC al
Just a note that while this idea is good, displaying 'recent searches'
can be used by spammers. All they have to do is hammer your server with
a bunch of queries to 'www.some-poker-site.com' and their website gets a
link from yours. I'd be very leery of republishing any user inputs to
your system.
I just have no login and IP to the box any more.
In case you send me a login, IP and the path where the sources are, I
can have someone take a look tomorrow.
Stefan
Am 08.03.2006 um 19:03 schrieb Stefan Groschupf:
Storing the index on a dfs works just change conf to use dfs in
nutch.war/Web
Storing the index on a DFS works: just change the conf to use DFS in
nutch.war/Web-inf/classes/nutch-default.xml and set up the correct
path in the property searcher.dir.
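For reference, the entry would look roughly like this (the value below is just an example path; point it at your crawl/index location as seen by the configured file system):

<property>
  <name>searcher.dir</name>
  <value>/user/nutch/crawl</value>
</property>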
However, it is slow.
Anyway, in case you mean a little search app, then I strongly suggest
using a local file system.
Nutch 0.8 runs
Thank you! Sorry I am a newbie. I meant searching an index located on dfs
for a term.
I would like to run my little search app from command line on Linux.
Help please!
From: Stefan Groschupf <[EMAIL PROTECTED]>
Reply-To: nutch-user@lucene.apache.org
To: nutch-user@lucene.apache.org
Subject: Re
Have a look at the IndexReader object in the Lucene package.
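A rough sketch with the Lucene 1.x API that ships with Nutch (the index path is just an example, and you would still sort the collected frequencies yourself to pick the top terms for your menu):

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;

public class CommonTerms {
  public static void main(String[] args) throws Exception {
    IndexReader reader = IndexReader.open("crawl/index");   // example path to the Nutch index
    TermEnum terms = reader.terms();
    while (terms.next()) {
      Term t = terms.term();
      if ("content".equals(t.field())) {                     // only look at terms from the body text
        System.out.println(t.text() + "\t" + terms.docFreq());
      }
    }
    terms.close();
    reader.close();
  }
}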
Am 08.03.2006 um 10:07 schrieb Stephen Ensor:
Hi, I am using nutch to create a vertical search site and wish to
create a
directory type menu for my front page with all the most common
terms in my
index.
For example say my v
Hi.
How do I exclude from fetching URLs with a particular string in them (for
example "SISID=")? What regular expression do I have to put in
regex-urlfilter.txt?
Thank you,
Ivaylo Georgiev
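Would something like the following work? My understanding is that the rules are applied top to bottom and match substrings, so it would have to go before the final catch-all line:

-SISID=

Or, if the match is anchored in my version, the more defensive form -.*SISID=.*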
Hello,
I'm using nutch, version 0.8 dev.
I'm using NUTCH API like this:
Hits hits = nutchBean.search(query, numHits);
HitDetails hitDetails = nutchBean.getDetails(hits.getHit(0));
byte[] content = nutchBean.getContent(hitDetails);
Now I want to retrieve the content of the given URL rather than
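For context, a minimal end-to-end sketch of that API as it stands around 0.8 (class and method names here are from memory and may differ slightly in a given dev snapshot, so treat it as an outline rather than the exact API):

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.searcher.Hit;
import org.apache.nutch.searcher.HitDetails;
import org.apache.nutch.searcher.Hits;
import org.apache.nutch.searcher.NutchBean;
import org.apache.nutch.searcher.Query;
import org.apache.nutch.util.NutchConfiguration;

public class SearchSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = NutchConfiguration.create();   // picks up searcher.dir from the conf
    NutchBean bean = new NutchBean(conf);
    Query query = Query.parse("football", conf);         // example query string
    Hits hits = bean.search(query, 10);
    if (hits.getLength() > 0) {
      Hit hit = hits.getHit(0);
      HitDetails details = bean.getDetails(hit);
      byte[] raw = bean.getContent(details);             // raw fetched bytes, as in the snippet above
      System.out.println(details.getValue("url") + ": " + raw.length + " bytes");
    }
  }
}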
Hi, I am using nutch to create a vertical search site and wish to create a
directory type menu for my front page with all the most common terms in my
index.
For example say my vertical search is pets and my index is full of pet sites
and pages, the common terms would be (cat, dog, fish, food, vet,
Hi Stefan,
Thank you for your reply.
We have 31 segments. They are 106 GB in total.
Keren
Stefan Groschupf <[EMAIL PROTECTED]> wrote: How many segments do you have and how
big are they?
Try a disc I/O measurement tool or script; what does it say?
Am 08.03.2006 um 17:38 schrieb Insurance Squared Inc.
What do you mean by "search data"?
You can do
bin/hadoop dfs -ls to browse the DFS.
Also there are some JUnit tests in the Hadoop project that illustrate
how to use the API (TestDFS).
cheers
Stefan
Am 08.03.2006 um 18:40 schrieb Olive g:
Hello,
Does anyone have sample code (using the Nutch API and
Hello,
Does anyone have sample code (using the Nutch API and running from the command
line) to search data
on DFS? I am using version 0.8.
Thank you.
Olive
We've got about 6 or 8 segments, but we just merged our indexes in an
attempt to speed things up. Total hard drive space is something like
60-80 gigs, in that neighbourhood. Nothing there strikes me as suspicious.
I could look at disc i/o speeds, but I'm doubtful that's the issue.
We're run
I guess yahoo.com has a robots.txt that blocks crawling the complete site.
Also check the depth level you use.
Am 08.03.2006 um 17:53 schrieb Olive g:
Hello everyone,
I am also running distributed crawl on .8.0 (some dev version) and
somehow the stats always
returned TOTAL urls as 1 while I was
Thanks! I saw that one too, but according to Doug, it was for 0.8 only. Does
anyone have
step-by-step instructions like the one for 0.8?
Also, does anyone know why the URL total was always 1 when I ran 0.8?
060308 064420 map 0%
060308 064427 map 100%
060308 064433 reduce 100%
060308 064433 Job compl
Detailed distributed crawl implementation:
http://www.mail-archive.com/nutch-user@lucene.apache.org/msg02270.html
I am not sure it applies to 0.7 though, but it has a lot of info.
Rgrds, Thomas
--
Regards,
Dima mailto:[EMAIL PROTECTED]
Hi folks,
Offhand, I'm not aware of any slam-dunk solution to link farms
either. One thing that could help mitigate the problem is a pre-built
blacklist of some sort. For example:
http://www.squidguard.org/blacklist/
That one is really meant for blocking user-access to porn, known
warez
Thank you so much for your reply!
I just sent another message - because I am having other issues with 0.8 and
somehow the
TOTAL urls is always 1 when I search big sites such as www.yahoo.com. I
thought 0.7.1 might
be more stable?
The stats:
060308 064418 Client connection to 9.2.13.8:8010 : st
You can start here http://wiki.apache.org/nutch/NutchDistributedFileSystem
Also, I think there have been several posts in the mailing list that contain
such a step-by-step overview.
Rgrds, Thomas
On 3/8/06, Olive g <[EMAIL PROTECTED]> wrote:
>
> Hi I am new here.
> Could someone please let me kn
Hello everyone,
I am also running a distributed crawl on 0.8.0 (some dev version) and somehow
the stats always
returned the TOTAL urls as 1 while I was crawling some sites such as
www.yahoo.com!
My filter file allows everything. What might be the problem? There was no
obvious error
in log files and th
in 0.7.1?
> Thank you.
Better to use Nutch 0.8 to run a crawl using several machines.
There is some documentation in the wiki now.
Am 08.03.2006 um 17:49 schrieb Olive g:
Hi I am new here.
Could someone please let me know the step-by-step instructions to
set up
distributed crawl in 0.7.1?
Thank you.
Hi I am new here.
Could someone please let me know the step-by-step instructions to set up
distributed crawl in 0.7.1?
Thank you.
How many segments do you have and how big are they?
Try a disc I/O measurement tool or script; what does it say?
Am 08.03.2006 um 17:38 schrieb Insurance Squared Inc.:
I appreciate your patience as we try to get over our search speed
issues. We're getting closer - it seems we are having huge del
I appreciate your patience as we try to get over our search speed
issues. We're getting closer - it seems we are having huge delays when
retrieving the summaries for the various search results. Below are our
logs from a search; you can see that retrieving some of the search
summaries took in
What's going on with this? I tried the nightly build to see a future build and
have the following error on an intranet crawl. Is there good documentation on
how to set up Hadoop?
I used ./bin/nutch crawl urls -dir crawl.academic -depth 10
and set
export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework
A good analysis.
Even I was doing something in a similar manner.
We should also have more people testing this and contributing so that we can
commit this to Nutch.
Rgds
Prabhu
On 3/8/06, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
>
> D.Saravanaraj wrote:
> > Hi Andrzej,
> >
> > Thanks for your A
Hi
I have been trying to crawl sites with URL-encoded characters, and am trying to escape
those characters in crawl-urlfilter.txt.
For some reason, it does not seem to be working... One solution is to extend the
crawler... are there any other options? :) Please let me know.
Thanks
Sudhi
D.Saravanaraj wrote:
Hi Andrzej,
Thanks for your Adaptive Refetch patch. I didn't get the working of adaptive
refetch well, so I examined the working of adaptive refetching by reading the
crawldb. I created a folder in Windows with 2 files and tried adaptive
refetching on that (URL is file:/D:/Test/).
Generally, for indexing local files and folders, it is better to use Lucene directly.
But it depends on your requirements.
On 3/8/06, sudhendra seshachala <[EMAIL PROTECTED]> wrote:
>
> Question to experts..
> If I have users upload documents (WORD, PDF, PPt.. etc.) if I have to
> search, I do index using
What is the version you are using? In case of the 0.7.1 version, place your index and
segments folders in the place where your "searcher.dir" points to.
In case of the 0.8 version of Nutch, place your entire "crawl" folder there (if I am
not mistaken, the folder name should always be "crawl").
On 3/8/06, fabrizio sil
Any thoughts on making severe errors with the nutch commands return a
non-zero exit status? I see several places where an exit status of -1
is 'returned', but not in all failure cases.
Steven
Hi Andrzej,
Thanks for going into this subject.
I'm glad that this issue will be resolved in version 0.8. That makes
me hopeful. :)
Sure, fixing this bug in version 0.7.1 wouldn't be necessary if the new
version 0.8 becomes available in the next few weeks.
And the workaround works for me until then.
Hi,
I just noticed this behavior in Fetcher.java: if a severe log entry is written, it
will silently end the task :-)
And I just didn't know why my fetcher was fetching too few pages.
So just pay attention to that.
Gal
Hi Andrzej,
Thanks for your Adaptive Refetch patch. I didn't get the working of adaptive
refetch well, so I examined the working of adaptive refetching by reading the
crawldb. I created a folder in Windows with 2 files and tried adaptive
refetching on that (URL is file:/D:/Test/).
= Only injected t
Ivan Sekulovic wrote:
Jérôme Charron wrote:
Would it be possible to generate ngram profiles for LanguageIdentifier
plugin from crawled content and not from file? What is my idea? The
best
source for content in one language could be wikipedia.org. We would
just crawl the wikipedia in desired
Jérôme Charron wrote:
Would it be possible to generate ngram profiles for LanguageIdentifier
plugin from crawled content and not from file? What is my idea? The best
source for content in one language could be wikipedia.org. We would
just crawl the wikipedia in desired language and then create
I inserted these lines
searcher.dir
/home/paul/nutch-searcher.dir
(my path to Nutch's searcher dir)
within the <property> tags.
What's wrong with this?
On 3/8/06, Dima Mazmanov <[EMAIL PROTECTED]> wrote:
> Hi,fabrizio.
>
> What are your changes in nutch-site.xml?
> Did you point database di
mos wrote:
when you get an error while fetching, and you get the
org.apache.nutch.protocol.retrylater because the max retries have been
reached, Nutch says it has given up and will retry later. When does that
retry occur?
That's an issue I reported some weeks ago and which is in my opinion
Hi, fabrizio.
What are your changes in nutch-site.xml?
Did you point to the database directory?
You wrote on 8 March 2006, 13:41:26:
> Hi Guys,
> I have a question...
> I successfully created an index using the tutorial example... now I
> would like to search it using the jsp application.
> I think I
Hi Guys,
I have a question...
I successfully created an index using the tutorial example... now I
would like to search it using the JSP application.
I think I have correctly set up Tomcat, but when I try to search for
something, Nutch always returns 0 results.
I tried to start Tomcat from the dir
> when you get an error while fetching, and you get the
> org.apache.nutch.protocol.retrylater because the max retries have been
> reached, nutch says it has given up and will retry later, when does that
> retry occur?
That's an issue I reported some weeks ago and which is in my opinion
an annoyin
Hi,
Do you know of good strategies to manage authorization?
I mean a user should only see the Nutch results he has the rights to see.
Thanks for your comments.