I am currently trying to figure out how to deploy Nutch and Hadoop
separately. I want to configure Hadoop outside of Nutch and have
Nutch use that service, rather than configuring Hadoop within Nutch.
I would think all that Nutch should need to know is the URLs to
connect to Hadoop, but I can't figure that out
at all. Though, I needed to replace
hadoop-12.whatever.jar with the latest within the Nutch build. It
seems to be working. Yay.
Thanks.
On 7/11/07, Andrzej Bialecki [EMAIL PROTECTED] wrote:
Briggs wrote:
I am currently trying to figure out how to deploy Nutch and Hadoop
separately. I
Thanks for the answer. That was helpful.
I was sooo wrong.
On 7/7/07, Andrzej Bialecki [EMAIL PROTECTED] wrote:
Briggs wrote:
Please keep this thread going as I am also curious to know why this
has been 'forked'. I am sure that most of this lies within the
original OPIC filter but I
Please keep this thread going as I am also curious to know why this
has been 'forked'. I am sure that most of this lies within the
original OPIC filter but I still can't understand why straight forward
lucene queries have not been used within the application.
On 7/6/07, Kai_testing Middleton
On 6/20/07, Naess, Ronny [EMAIL PROTECTED] wrote:
Thanks, Briggs.
I will try to create a new NutchBean to see if that solves the reloading
issue.
By the way, your former mail does not seem to have reached the
mailing list. I can't seem to find it, anyway.
-Ronny
-Original Message-
From
By the way, I was wrong about one thing: you can't override the 'get'
method of NutchBean because it's static. Doh, that was a silly
oversight.
But again, if you are using nutch and you need to 'reload' the index,
you need only to create a new NutchBean (that is if the NutchBean is
what you are
I would say that the best thing to do is to create a new nutch bean.
I never cared much for the nutch bean containing logic to store itself
in a servlet context. I do not believe that this is the place for
such logic. It should be up to the user to place the nutch bean into
the servlet context
Yeah, you still don't have the agent configured. All your values for
the agent (each <value></value> needs a value) are blank. So, you need
to at least configure an agent name.
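A minimal nutch-site.xml sketch for that (MyTestCrawler is a placeholder name; nutch-default.xml also defines http.agent.description, http.agent.url, and http.agent.email, which are worth filling in too):

```xml
<!-- nutch-site.xml: give the crawler a non-blank agent name -->
<property>
  <name>http.agent.name</name>
  <value>MyTestCrawler</value>
</property>
```

Entries in nutch-site.xml override the blank defaults in nutch-default.xml.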
On 6/15/07, karan thakral [EMAIL PROTECTED] wrote:
I'm using crawl on Cygwin while working on Windows
but the
Oh and as for the web interface, take a look at the wiki page:
http://wiki.apache.org/nutch/NutchTutorial
The bottom of the page has a section on searching.
On 6/15/07, Briggs [EMAIL PROTECTED] wrote:
Yeah, you still don't have the agent configured. All your values for
the agent (the value
Well, the quick/simple explanation is:
If you have 5 URLs with their associated Nutch scores:
http://a.com/something1 = 5.0
http://b.com/something2 = 4.0
http://c.com/something3 = 3.0
http://d.com/something4 = 2.0
http://e.com/something5 = 1.0
Then, if you set Nutch to crawl with topN = 3, only a, b, and c get selected.
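The selection that topN does can be sketched in plain Java (a standalone toy, not Nutch's actual generator code):

```java
import java.util.*;
import java.util.stream.*;

public class TopNDemo {
    // Pick the n highest-scoring URLs, the way the -topN switch limits a fetch list.
    static List<String> topN(Map<String, Double> scores, int n) {
        return scores.entrySet().stream()
                .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
                .limit(n)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, Double> scores = new HashMap<>();
        scores.put("http://a.com/something1", 5.0);
        scores.put("http://b.com/something2", 4.0);
        scores.put("http://c.com/something3", 3.0);
        scores.put("http://d.com/something4", 2.0);
        scores.put("http://e.com/something5", 1.0);
        // Prints the three highest-scored URLs: a, b, c
        System.out.println(topN(scores, 3));
    }
}
```

Nutch's generator does this over the whole crawldb, but the effect is the same: only the n highest-scored URLs make it into the fetch list.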
Ronny, your way is probably better. See, I was only dealing with the
fetched properties. But, in your case, you don't fetch it, which gets rid
of all that wasted bandwidth.
For dealing with types that can be dealt with via the file extension, this
would probably work better.
On 6/7/07,
this unnecessary object creation.
On 6/6/07, Briggs [EMAIL PROTECTED] wrote:
FYI, I ran into the same problem. I wanted my filters to be instantiated
only once, and they not only get instantiated repeatedly, but the
classloading is flawed in that it keeps reloading the classes. So, if you
ever dump
FYI, I ran into the same problem. I wanted my filters to be instantiated
only once, and they not only get instantiated repeatedly, but the
classloading is flawed in that it keeps reloading the classes. So, if you
ever dump the stats from your app (use 'jmap -histo') you can see all the
classes
is urls/nutch a file or directory?
On 6/6/07, Martin Kammerlander [EMAIL PROTECTED] wrote:
Hi
I wanted to start a crawl like it is done in the nutch 0.8.x tutorial.
Unfortunately I get the following error:
[EMAIL PROTECTED] nutch-0.8.1]$ bin/nutch crawl urls/nutch -dir crawl.test
-depth
for nutch or 'only' mailing list?
thx
martin
Zitat von Briggs [EMAIL PROTECTED]:
is urls/nutch a file or directory?
On 6/6/07, Martin Kammerlander [EMAIL PROTECTED]
wrote:
Hi
I wanted to start a crawl like it is done in the nutch 0.8.x tutorial.
Unfortunately I get the following error
You set that up in your nutch-site.xml file. Open the
nutch-default.xml file (located in NUTCH_INSTALL_DIR/conf). Look
for this element:
<property>
  <name>plugin.includes</name>
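A sketch of overriding that property in nutch-site.xml (the pipe-separated value here is only an illustration, shaped like the stock default; list exactly the plugins you want enabled):

```xml
<!-- nutch-site.xml: this overrides the plugin.includes default -->
<property>
  <name>plugin.includes</name>
  <!-- illustrative set: HTTP protocol, regex URL filter, text/HTML parsers, basic indexing/query plugins -->
  <value>protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)</value>
</property>
```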
So, I have been having huge problems with parsing. It seems that many
URLs are being ignored because the parser plugins throw an exception
saying there is no parser found for what is, reportedly, an
unresolved contentType. So, if you look at the exception:
(Unix) DAV/1.0.3 ApacheJServ/1.1.2
Cache-Control: no-store
X-Highwire-SessionId: xlz2cgcww1.JS1
Set-Cookie: JServSessionIdroot=xlz2cgcww1.JS1; path=/
Transfer-Encoding: chunked
Content-Type: text/html
So, I'm lost.
On 6/1/07, Doğacan Güney [EMAIL PROTECTED] wrote:
Hi,
On 6/1/07, Briggs [EMAIL
PROTECTED] wrote:
Hi,
On 6/1/07, Briggs [EMAIL PROTECTED] wrote:
So, I have been having huge problems with parsing. It seems that many
URLs are being ignored because the parser plugins throw an exception
saying there is no parser found for what is, reportedly, an
unresolved contentType. So
-SessionId: nh2ukcdpv1.JS1
Set-Cookie: JServSessionIdroot=nh2ukcdpv1.JS1; path=/
Transfer-Encoding: chunked
Content-Type: text/html
So, that's it. any ideas?
On 6/1/07, Briggs [EMAIL PROTECTED] wrote:
So, here is one:
http://hea.sagepub.com/cgi/alerts
Segment Reader reports:
Content
So, when in Cygwin, if you type 'ssh' (without the quotes), do you get
the same error? If so, then you need to go back into the Cygwin setup
and install ssh.
On 5/30/07, Ilya Vishnevsky [EMAIL PROTECTED] wrote:
Hello. I'm trying to run the shell scripts that start Nutch. I use Windows XP, so I
installed
Anyone have any good configuration ideas for indexing/merging with 0.9
using hadoop on a local fs? Our segment merging is taking an
extremely long time compared with nutch 0.7. Currently, I am trying
to merge 300 segments, which amounts to about 1gig of data. It has
taken hours to merge, and
Just curious, did you happen to limit the number of urls using the
topN switch?
On 5/14/07, Annona Keene [EMAIL PROTECTED] wrote:
I recently upgraded to 0.9, and I've started encountering a problem. I began
with a single url and crawled with a depth of 10, assuming I would get every
page on
I would assume that it needs these for handling the indexing of the
link scores. Lucene puts no scoring weight on things such as URLs,
page rank, and such. Since Lucene only indexes documents, and
calculates its keyword/query relevancy based only on term vectors (or
whatever), Nutch needs to add the
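As a toy illustration of the point above (plain Java with made-up numbers, not Lucene's real scoring code): Lucene multiplies its term-based relevance by a per-document boost, and that boost is the hook where a link score can be folded in.

```java
public class BoostDemo {
    // Lucene's final score is (roughly) term relevance times the document boost,
    // so a crawler can fold link/OPIC scores in as index-time boosts.
    static double combinedScore(double termRelevance, double boost) {
        return termRelevance * boost;
    }

    public static void main(String[] args) {
        // Two pages equally relevant for a query, but one has a higher link score.
        double plain = combinedScore(0.8, 1.0);
        double boosted = combinedScore(0.8, 4.0);
        // The boosted page ranks higher despite identical term relevance.
        System.out.println(boosted > plain);
    }
}
```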
Man, I should proofread this stuff before I send them. That is all I
have to say.
On 5/1/07, Briggs [EMAIL PROTECTED] wrote:
I would assume that it needs these for handling the indexing of the
link scores. Lucene puts no scoring weight on things such as URLs,
page rank, and such. Since Lucene
somewhere else in the code.
Since Nutch handles so much of its threading, could this be causing the problem?
I am not sure if I should x-post this to the dev group or not.
Anyway, thanks.
Briggs
--
Conscious decisions by conscious minds are what make reality real
, but then another crept up.
On 4/30/07, Sami Siren [EMAIL PROTECTED] wrote:
Briggs wrote:
Version: Nutch 0.9 (but this applies to just about all versions)
I'm really in a bind.
Is anyone crawling from within a web application, or is everyone
running Nutch using the shell scripts provided
I'll look around the code to make sure I am creating only one instance
of Configuration in my classes, and will play around with the
maxpermgen settings.
Any other input from people that have attempted this sort of setup
would be appreciated.
On 4/30/07, Briggs [EMAIL PROTECTED] wrote:
Well
(or webdb I believe in your version) to include a
flag of 'good/bad' or something.
On 4/27/07, Briggs [EMAIL PROTECTED] wrote:
Isn't this what you are looking for?
org.apache.nutch.tools.PruneIndexTool.
On 4/27/07, franklinb4u [EMAIL PROTECTED] wrote:
hi Enis,
This is franklin
Well, it looks like the link I sent you goes to the 0.9 version of the
nutch api. There is a link error on the nutch project site because
the 0.7.2 doc link points to the 0.9 docs.
On 4/27/07, Briggs [EMAIL PROTECTED] wrote:
Here is the link to the docs:
http://lucene.apache.org/nutch
If you are just looking to have a seed list of domains, and would like
to mirror their content for indexing, why not just use the unix tool
'wget'? It will mirror the site on your system and then you can just
index that.
On 4/25/07, John Kleven [EMAIL PROTECTED] wrote:
Hello,
I am hoping
On the nutch wiki there is this tutorial:
http://wiki.apache.org/nutch/NutchHadoopTutorial
There is also (it is for version 0.8, but can still work with 0.9):
http://lucene.apache.org/nutch/tutorial8.html
On 4/24/07, ekoje ekoje [EMAIL PROTECTED] wrote:
Hi Guys,
I would like to create a
Perhaps someone else can chime in on this. I am not sure of exactly
what you are asking. The indexing is based on Lucene. So, if you need
to understand how the indexing works you will need to look into the
Lucene documentation. If you are only looking to add custom fields
and such to the
If you look into the BasicIndexingFilter.java plugin source you will
see that this is where those default fields get indexed. So, you can
either create a new plugin that is configurable for the properties you
want to index, or remove this plugin. Here is the snippet of code
that is in the
, Meryl Silverburgh [EMAIL PROTECTED] wrote:
Can you please tell me what the meaning of this command is? What are
the top 35 links? How does Nutch rank the top 35 links?
bin/nutch readdb crawl/crawldb -topN 35 test
On 4/19/07, Briggs [EMAIL PROTECTED] wrote:
Those links are links that were discovered
Look into org.apache.nutch.plugin. The custom plugin classloader and
the resource loader reside in there.
On 4/18/07, Antony Bowesman [EMAIL PROTECTED] wrote:
I'm looking to use the Nutch parsing framework in a separate Lucene project.
I'd like to be able to use the existing plugins
I'll add that the PluginRepository is the class that recurses through
your plugins directory, and loads each plugin's descriptor file then
loads all dependencies for each plugin within their own classloader.
On 4/19/07, Briggs [EMAIL PROTECTED] wrote:
Look into org.apache.nutch.plugin
Nutch 0.9
Anyone know if it is possible to be more granular regarding crawl
frequency? Meaning, that I would like some sites to be crawled more
often than others. Like, a news site should be crawled every day, but
your average business website should be crawled every 30 days. So, is
it possible
Cool, cool. Thanks!
On 4/19/07, Gal Nitzan [EMAIL PROTECTED] wrote:
As it is right now... You answered the question yourself :-) ...
Separate db's and the whole ceremony...
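If you do go the separate-crawldbs route, each job's conf can carry its own refetch interval; as I remember the 0.8/0.9 nutch-default.xml, the property is db.default.fetch.interval, in days (double-check the name and units in your version):

```xml
<!-- conf for the frequently-recrawled db: refetch daily instead of every 30 days -->
<property>
  <name>db.default.fetch.interval</name>
  <value>1</value>
</property>
```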
-Original Message-
From: Briggs [mailto:[EMAIL PROTECTED]
Sent: Thursday, April 19, 2007 10:02 PM
From what I have gathered, you may want to keep multiple
crawldbs for your crawls. So, you could have a crawldb for more
frequent crawls and fire off nutch and read that db with the
appropriate configs for that job. I was hoping for the same
mechanism, but it looks like we need to write
Those links are links that were discovered. It does not mean that they
were fetched; they weren't.
On 4/12/07, Meryl Silverburgh [EMAIL PROTECTED] wrote:
I think I find out the answer to my previous question by doing this:
bin/nutch readlinkdb crawl/linkdb/ -dump test
But my next question
(my
application) up to date with the latest and greatest!
Thanks for your time! And once I really get through this code I
promise to start posting answers.
Briggs.
--
Conscious decisions by conscious minds are what make reality real
? I'll just have to figure out
the format here so I can parse it. I'll probably write a wrapper that
exports to xml or something to make transformation of this easier.
Anyway, am I on the right track?
Briggs.
On 4/18/07, Briggs [EMAIL PROTECTED] wrote:
Is it possible to determine from which
Currently using 0.7.2.
We have a process that runs crawltool from within an application,
perhaps hundreds of times during the course of the day. The problem I
am seeing is that over time the log statements from my application (I
am using commons logging and Log4j) are also being logged within
where that is happening... It's either Nutch
or ActiveMQ stuff.
Anyway,
Have fun and Cheers!
On 3/23/07, Briggs [EMAIL PROTECTED] wrote:
Currently using 0.7.2.
We have a process that runs crawltool from within an application,
perhaps hundreds of times during the course of the day. The problem I
that if there are high boost values for those
fields, they will be pushed to the top.
Thanks again!
Briggs
--
Conscious decisions by conscious minds are what make reality real
So, I am having ClassLoader issues with plugins. It seems that the
PluginRepository does some weird class loading (PluginClassLoader)
when it starts up. Does this mean that my plugin will not inherit the
classpath of my web application that it is loaded within?
A simple example is that my webapp
Well, I found this:
http://wiki.apache.org/nutch/WhatsTheProblemWithPluginsAndClass-loading
Arrrgh. Well, looks like I am going to use JMX to have my plugin talk
to my application. That way I won't have to have several copies of my
business jars around.
On 1/31/07, Briggs [EMAIL PROTECTED
that was crawled and I want to
merge them all into several large segments. So, if anyone has any pointers
I would appreciate it. Has anyone else attempted to keep segments at this
granularity? This doesn't seem to work so well.
briggs /
Conscious decisions by conscious minds are what make
and index those.
Sorry for my ignorance, but not really sure how to scale nutch
correctly. Do you know of a document, or some pointers as to how
segment/index data should be stored?
briggs /
Conscious decisions by conscious minds are what make reality real
Cool, thanks for your responses!
Next time I should probably mention that we are using 0.7.2. Not
quite sure if we can even think about moving to something 'more
current' as I don't really know the reasons to.
briggs /
Most of this information is already available on the Nutch Wiki. All I