Your conclusion is incorrect. You can simply call the URL with the search
term from within a PHP script, for example. Mozdex will return
XML-formatted results; just take the results and format them within your
page. No link or other mention of Mozdex is required. It does do
exactly what you're
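For anyone wanting to try this, below is a minimal sketch of that approach
in PHP. The endpoint URL, parameter names and the RSS-style <channel>/<item>
layout are assumptions for illustration only, not the documented Mozdex API;
substitute whatever search URL and fields the service actually exposes.

<?php
// Minimal sketch: query an XML search endpoint and render the hits ourselves.
// The URL, parameters and <channel>/<item> layout below are assumptions used
// for illustration; adjust them to the actual response format you get back.
$query = urlencode('term life insurance');
$url   = "http://www.mozdex.com/opensearch?query={$query}&start=0";

$xml = @file_get_contents($url);           // fetch the raw XML results
if ($xml === false) {
    die('search backend unreachable');
}

$feed = simplexml_load_string($xml);       // parse the XML
foreach ($feed->channel->item as $item) {  // one <item> per hit (assumed layout)
    printf("<p><a href=\"%s\">%s</a><br/>%s</p>\n",
        htmlspecialchars((string) $item->link),
        htmlspecialchars((string) $item->title),
        htmlspecialchars((string) $item->description));
}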
at org.apache.nutch.crawl.Injector.inject(Injector.java:138)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:105)
Daniel Clark, President
DAC Systems, Inc.
5209 Nanticoke Court
Centreville, VA 20120
Cell - (703) 403-0340
Email - [EMAIL PROTECTED]
I've got some nutch related books I'm looking to clear off my
bookshelf. If you're interested, I've posted them on Craigslist here:
http://kitchener.craigslist.org/bks/311929785.html
Personally, I suspect a wolf in sheep's clothing. I see this as another
move intended to monetize. Nothing wrong with that, unless everyone
believes what they're doing isn't for that purpose.
Enis Soztutar wrote:
Sean Dean wrote:
I've been following it, but haven't posted anything over there.
Actually, I believe this information is available online if you do some
digging - but only for some TLD's. I'm pretty sure the .com info and
likely .org and .net is readily available. I know the .ca's are not
available. I don't recall any specific licensing issues the last time I
looked.
Michael Wechner wrote:
Insurance Squared Inc. wrote:
Make sure you don't have any empty or bad segments. We had some
serious speed issues for a long time until we realized we had some
empty segments that had been generated as we tested. Nutch would
then sit and spin on the
If I recall correctly, we just checked the segment directories for their
size on disk. The bad ones had files of only 32K or something like that.
g.
Make sure you don't have any empty or bad segments. We had some
serious speed issues for a long time until we realized we had some empty
segments that had been generated as we tested. Nutch would then sit and
spin on these bad segments for a few seconds on every search. Simply
deleting the
Interesting. There goes the premise that Wikipedia is not for profit.
e w wrote:
Haven't seen anyone mention this on the lists yet but is probably of
interest to the community:
http://www.techcrunch.com/2006/12/23/wikipedia-to-launch-searchengine-exclusive-screenshot/
Hi Bruce,
This list is not only very active - it's full of people constantly
giving helpful, instructive answers. If you've got questions, this is
the place.
I would say based on my experience that nutch is a) excellent and b) not
for the faint of heart when it comes to java - you'll need s
I've lost the thread, but someone here had recently asked for our nutch
xml configuration file. Our developer's back from holidays so I've got
the info now. Note that some of the configuration variables are not in
the default file as we've made modifications. On our dual xeon, 8gigs
of ram,
As a second indicator of the scale, IIRC Doug Cutting posted a while
ago that he downloaded and indexed 50 million pages in a day or two
with about 10 servers.
We download about 100,000 pages per hour on a dedicated 10 Mbps
connection. Nutch will definitely fill more than a 10 Mbps connection
Well, just very roughly:
4 billion pages x 20 KB per page / 1,000 KB per MB / 1,000 MB per GB =
80,000 GB of data transfer every month.
100 Mbps connection / 8 bits per byte = 12.5 MB per second; 12.5 MB/s * 60
seconds in a minute * 60 minutes in an hour * 24 hours in a day * 30 days
in a month / 1,000 MB per GB = 32,400 GB per month.
So
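If it helps to sanity-check those figures, here is the same back-of-the-envelope
calculation as a few lines of PHP (decimal units assumed: 1 MB = 1,000 KB,
1 GB = 1,000 MB):

<?php
// Back-of-the-envelope check of the figures above.
$pages       = 4000000000;                              // pages per month
$kbPerPage   = 20;                                      // average page size in KB
$downloadGb  = $pages * $kbPerPage / 1000 / 1000;       // 80,000 GB/month needed

$lineMbps    = 100;                                     // line speed in megabits/s
$mbPerSec    = $lineMbps / 8;                           // 12.5 MB/s
$capacityGb  = $mbPerSec * 60 * 60 * 24 * 30 / 1000;    // 32,400 GB/month available

printf("need %.0f GB/month, line supplies %.0f GB/month\n", $downloadGb, $capacityGb);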
I've been using a paid (but low cost) script by smarterscripts.com to
display ads. It's not the ideal solution but works for lower volume of
searches (it's a php/mysql solution). There's an opensource script
called phpadsnew that may also work but IIRC it didn't allow for third
party adverti
Well, so much for knee-jerk suspicions as to intent. No need to look for
conspiracy theories when default settings are more likely to be the
cause. That should probably be a corollary to Occam's razor or something :).
Andrzej Bialecki wrote:
Insurance Squared Inc. wrote:
The funny thing about that wiki page (and some others in that area) is
that they apparently use the nofollow tags. Given the topic of that
wiki, isn't that a bit odd? I personally dislike the nofollow tag and
think it should be used only in extreme circumstances (i.e. here's a
link to a site
Please ignore my last message. My apologies - I meant to send this off
to my local linux users group instead of this one.
Original Message
Subject:any java/tomcat experts in the crowd?
Date: Thu, 25 May 2006 18:00:18 -0400
From: Insurance Squared Inc. <[EMAIL PROTECTED]>
Are there any java/tomcat setup experts in the crowd who'd be able to
take a (paid) look at my webserver and help with some setup problems?
We've got a java application running on one website, have just installed
a second website with a slightly modified version of the same
application, and ca
Hi All,
We've been running nutch on one website on our server, and we've just added
a second website that's running nutch on a separate
index/crawl/segments. We're experiencing some difficulty getting things
running and separating the two. Not entirely sure that the difficulty
is with nutch or
Could I trouble anyone to post a link to the scoring API documentation,
as well as the "paper that underlies the current Nutch implementation"?
I've dipped into the docs in a few places and haven't bumped into either
of these documents.
Thanks,
g.
Ken Krugler wrote:
Eugen Kochuev wrote:
links', i.e. other sites within the database, as part of the ranking.
Thanks,
g.
Is there any good way to boost for the number of inbound links to a
page? I guess we can't use PR as that's patented, but I thought that we
could somehow boost based on the number of inbound links. Upon looking
at the conf though (and my developer's reply) it doesn't seem like we
can do this.
I'm trying to get rid of some spammy sites in our index.
First, I wonder if anyone has any suggestions on changes to the default
install config of Nutch that will help drive better sites to the top and
spammier sites down.
Secondly, I boosted the inbound anchor text config - but if anything
We've got a php front end for version 0.71 that starts and stops
crawling/indexing/fetching. One button starts the entire process -
updatedb/create fetchlist/fetch/index - over and over. A second button
stops the process, unless a crawl is in progress at which point it stops
after the current
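For anyone curious what such a one-button loop can look like, here is a rough
sketch. The bin/nutch sub-commands and their arguments are written from memory
of the 0.7 command-line tools and are assumptions to verify against your own
install; all paths are placeholders, and a stop flag file is one simple way to
implement the second button.

<?php
// Rough sketch of a one-button crawl loop. The bin/nutch sub-commands and
// arguments are from memory of the 0.7 tools and should be verified against
// your installation; the paths here are placeholders.
$nutch = '/home/glenn/nutch/bin/nutch';
$db    = '/home/glenn/nutch/db';
$segs  = '/home/glenn/nutch/segments';

while (!file_exists('/tmp/stop-crawl')) {                     // "stop" button touches this file
    exec("$nutch generate $db $segs");                        // create a new fetchlist
    $segment = trim(shell_exec("ls -d $segs/* | tail -1"));   // newest segment directory
    exec("$nutch fetch $segment");                            // fetch it
    exec("$nutch updatedb $db $segment");                     // fold results back into the webdb
    exec("$nutch index $segment");                            // index the fetched pages
}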
I'd prefer not to make a long-term commitment, but if a 2 Mbit connection
is good enough and this is a short-term thing, I'll step up if no one
else can. I could probably make a longer-term commitment in a few
weeks. Worst case I can host it at home.
glenn
behind the scenes in the ASP code. It knows the URL and content received.
As of right now in 0.8-dev, meta-level redirects (meta refresh tags)
don't work correctly. They did in 0.7, but I don't think that
functionality has been ported.
Dennis
How are redirects listed in version 0.7? If the crawler finds a link like:
www.domain.com/?code.aspx&redirect=445454
and that link redirects through to www.another-domain.com, which of
those two links will show up in nutch?
(I'm wondering if I can use nutch to crawl sites with a lot of
redirects
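As a side note, it's easy to check outside the crawler where such a link
actually ends up; a small cURL sketch is below (not Nutch code, and the URL is
just the example above).

<?php
// Follow HTTP redirects for a URL and report the final address (sketch only;
// cURL follows Location: headers, not meta-refresh redirects).
$ch = curl_init('http://www.domain.com/?code.aspx&redirect=445454');
curl_setopt_array($ch, array(
    CURLOPT_FOLLOWLOCATION => true,    // follow Location: headers
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_NOBODY         => true,    // a HEAD request is enough
));
curl_exec($ch);
echo curl_getinfo($ch, CURLINFO_EFFECTIVE_URL), "\n";   // prints the final URL
curl_close($ch);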
Hi All,
Two general questions:
- I'm wondering if there are any good sources of written information on
actually writing a search engine script. Things like scoring, indexing,
that kind of stuff. I bought the lucene book, but that's lucene
specific technical info. Looking for something at t
using another sort of camera called a robot :-) Nothing more, really. If
a browser maker decides to show an HTML tag in, let's say, 300
pixels, will that be a copyright or trademark violation then?
What one can do is prevent oneself from being photographed, or stop the
robots from visiting one's website :-)
FWIW, I believe all of what's been stated is the case - and I'd also
assume that since Google/MSN/Yahoo are all doing this that it's been
tested and OK.
However I know many people complain about the cache. Some people see it
as a copyright violation - technically correct or not, the cache doe
We've got a website that is causing our crawler to slow down (from
20mbits down to 3-5) - 400K pages that are basically not available,
we're just getting 404's. I'd like to remove them from the DB to get
our crawl speed back up again.
Here's what our developer told me - I'm stumped, that seem
We've got a site that is causing our crawl to slow dramatically, from
20mbits down to about 3 or 4. The basic problem is that the site seems
to consist of huge numbers of pages that aren't responding. We can
remove the site from the index, but it seems like a problem to remove
this site perma
Hi All,
We're merrily proceeding down our route of a country specific search
engine, nutch seems to be working well. However we're finding some
sites creeping in that aren't from our country. Specifically, we
automatically allow in sites that are hosted within the country. We're
finding mo
Seems we've found the problem that was causing our search delays. We
had some indexes that were 32 bytes; apparently they'd crashed somehow
(not yet determined how). The existence of these segments was the
source of the problem. We removed those segments and the search is
running along much
Just a note that while this idea is good, displaying 'recent searches'
can be used by spammers. All they have to do is hammer your server with
a bunch of queries to 'www.some-poker-site.com' and their website gets a
link from yours. I'd be very leery of republishing any user inputs to
your syst
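If you do want to show something like recent searches, one defensive sketch in
plain PHP is below: escape everything, never render user input as a live link,
and drop anything URL-shaped entirely. The variable names are illustrative;
$recentQueries stands in for wherever you log searches.

<?php
// Defensive sketch for displaying recent queries: never echo raw user input,
// never turn it into a link, and skip anything that looks like a URL.
// $recentQueries is a stand-in for wherever you log searches (assumption).
$recentQueries = array('term life insurance', 'www.some-poker-site.com');

foreach ($recentQueries as $q) {
    if (preg_match('#https?://|www\.#i', $q)) {
        continue;                                   // do not republish URL-like input
    }
    echo '<li>' . htmlspecialchars($q, ENT_QUOTES) . "</li>\n";
}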
ol or script, what does it say?
On 08.03.2006 at 17:38, Insurance Squared Inc. wrote:
I appreciate your patience as we try to get over our search speed
issues. We're getting closer - it seems we are having huge delays when
retrieving the summaries for the various search results. Below are our
logs from a search, you can see that retrieving some of the search
summaries took in
I don't think it is a slam dunk either, even Google doesn't do a super
job of detecting these. I think a lot of it's still done manually.
I think you'd have to look at detecting closed networks or mostly closed
networks (since the link farm would be relatively clustered from a link
perspectiv
I've seen it noted that a complete recrawl is necessary to migrate from
0.71 to 0.8. Is this absolutely necessary? Or could a converter be
created to migrate the data? Has anyone created this?
I expect at some point I'll have to move versions and something like
this would be very useful. I
something
at the OS or tomcat level, or with another system process that nutch is
using).
Stefan Groschupf wrote:
This is very slow!
You can expect results in less than a second from my experience.
+ check memory settings of tomcat.
+ you do not use ndfs, right?
On 06.03.2006 at 00:23, ... wrote:
Asking again for the patience of the list, we're still working on speed.
I guess what I need to know is if we still have a 'problem' or if the
following search speeds are normal for nutch.
query: 'term life insurance'; first search 25 seconds, second search 6
seconds.
query: 'stratford bed and
We've built a php frontend onto nutch. We're finding that this
interface is dreadfully slow and the problem is the interface between
the two languages.
Here's where the slow down is:
$url = 'http://localhost:8080/opensearch?query=' . $query .
'&start=' . $start_index .
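A self-contained version of the same call might look like the sketch below.
The "query" and "start" parameters appear in the fragment above; "hitsPerPage"
is an assumption to check against the OpenSearch servlet you are calling.

<?php
// Self-contained sketch of the opensearch call built above. Parameter names
// other than "query" and "start" are assumptions; verify against the servlet.
$query       = urlencode(isset($_GET['q']) ? $_GET['q'] : '');
$start_index = isset($_GET['start']) ? (int) $_GET['start'] : 0;

$url = 'http://localhost:8080/opensearch?query=' . $query
     . '&start=' . $start_index
     . '&hitsPerPage=10';

$xml  = file_get_contents($url);        // one HTTP round trip per page view;
                                        // this call dominates the PHP-side latency
$feed = simplexml_load_string($xml);
foreach ($feed->channel->item as $item) {
    echo htmlspecialchars((string) $item->title), "<br/>\n";
}

Caching the XML response (or the rendered HTML) for popular queries on the PHP
side is the usual way to hide that per-request round trip.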
All newer Tomcats require Java 1.5, which I do not
yet use for Nutch.
On 22.02.2006 at 15:59, Insurance Squared Inc. wrote:
Thanks for your help Stefan (as always).
We've fixed the problem as follows:
export CATALINA_OPTS="-Xms512m -Xmx2000m"
into /var/jakarta-tomcat-4.1.31/bin
I personally never had such problems.
How many segments / indexes do you have?
On 22.02.2006 at 15:21, Insurance Squared Inc. wrote:
We're getting an out of memory error when running a search using nutch
0.71 on a machine with 3 gigs of Ram. Here's the error:
at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:683)
at java.lang.Thread.run(Thread.java:534)
- Root Cause -
java.lang.OutOfMemoryError
Hi,
We're finding that we've got one or two domains that are providing
excessive retries - and that's drastically slowing our fetch process
down by hours.
Any general guidance on how to fix the problem? we've upped our max
retries variable to 3 from 1 I believe, still getting the problem.
I know this has been asked a number of times, but I don't think there's
been a definitive answer posted yet. Is there any way (in v0.71) to
immediately remove a site (or all the pages from a site) from the
index? Right now with our setup I think we have to wait the 30 days for
the segment to
We're finding nutch slightly slow when doing searches, I'm trying to
find the least expensive route to speed things up.
I moved the server off SCSI drives because we ran out of space.
Instead I threw in a couple of Seagate 300 GB SATA hard drives with
software RAID 0 so that I have eno
based upon the URL of the page alone.
The information which you gave was also useful, but I want to do the above.
Rgds,
Prabhu
On 2/8/06, Insurance Squared Inc. <[EMAIL PROTECTED]> wrote:
Hi Prabhu,
Below is the script we use for deleting old segments.
Regards,
Glenn
#!/bin/sh
# Remove old dirs from segments dir
# PERIOD is threshold for old dirs
#
# Created by Keren Yu Jan 31, 2006
NUTCH_DIR=/home/glenn/nutch
PERIOD=30
# put dirs which are older than PERIOD into dates.tmp
find "$NUTCH_DIR/segments" -mindepth 1 -maxdepth 1 -type d -mtime +"$PERIOD" > dates.tmp
# remove each old segment dir listed in dates.tmp
while read dir; do
    rm -rf "$dir"
done < dates.tmp
rm -f dates.tmp
Hi,
Running nutch 0.71 on Mandrake linux 2006 (P4 with 2 SATA drives on
RAID 0, 2 gigs of ram, about 4 million pages, but expecting to hit 10+),
and finding that our initial queries take up to 15-20 seconds to return
results. I'd like to get that sped up and am seeking thoughts on how
to
Would anyone care to comment on the speed of this please? Seems awfully
long to me.
20 threads, a crawl took 25 hours for about 400K URL's. It's now been
updating for 20 hours and is not yet complete.
System:
- nutch 0.7
- P4 2.8, 1 gig of ram
- No problems on the internet connection (I had
My ISP called and said my nutch crawler is chewing up 20 Mbits on a line
where we're only supposed to be using 10. Is there an easy way to tinker with
how much bandwidth we're using at once? I know we can change the number
of open threads the crawler has, but it seems to me this won't make a
huge di
Hi,
I'm trying to determine if there's a better way to whitelist a large
number of domains than just adding them as a regular expression in the
filter.
We're setting up a regional search engine and using the filter file to
determine what URL's make it into the db. We've added specific domai
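One low-tech option is to keep the whitelist as a plain list of domains and
generate the filter file from it, rather than hand-editing one giant
expression. The sketch below does that; the output file name and the
"+^http://([a-z0-9]*\.)*domain/" rule style are assumptions based on the
crawl-urlfilter format in the Nutch tutorial, so check them against your own
filter file.

<?php
// Sketch: generate one allow rule per whitelisted domain plus a final
// reject-everything rule. Output format mimics the crawl-urlfilter.txt style
// from the Nutch tutorial; verify against your own filter file.
$domains = file('whitelist.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);

$rules = '';
foreach ($domains as $domain) {
    $domain = preg_quote(trim(strtolower($domain)), '/');   // escape regex metacharacters
    $rules .= '+^http://([a-z0-9]*\.)*' . $domain . "/\n";  // allow this domain
}
$rules .= "-.\n";                                            // reject everything else

file_put_contents('conf/crawl-urlfilter.txt', $rules);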
Our nutch installation (version .7, running on Mandrake linux) continues
to freeze sporadically during fetching. Our developer has it pinned
down to the deflateBytes library.
"it looped in the native method called deflateBytes for very long time.
Some times, it took several hours."
That's a
We're just wiping down a server to install nutch in more of a production
environment. Does anyone have any thoughts on whether we should upgrade
to version 0.8, from version .7? Our developer suggested that the only
reason we would need to do that would be if we needed distributed
computing;
at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:351)
at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:488)
"VM Thread" prio=1 tid=0x08090638 nid=0x1442 runnable
"VM Periodic Task Thread" prio=1 tid=0x08099cd8 nid=0x1442 waiting on
condition
"Suspend C
Hi All,
We're experiencing problems with nutch freezing sporadically when
fetching. Not really sure where to even start investigating. Some
digging into the archives suggested memory issues, so we did the following:
TOMCAT_OPTS=" -Xmx1024M" to increase Tomcat memory and
NUTCH_HEAPSIZE=1
lter. Any general thoughts on how we might start to tackle this?
Thanks.
Insurance Squared Inc. wrote:
We're running a crawl using nutch and the last crawl seemed to be taking
a long time. Looking at the output, it seems it's gone into AOL's
search and is actually crawling search results (it's also crawling some
cgi-bin search results page on another site). This sure seems like it
could go on
What should I be using for a user agent in the crawler? We just tried
crawling a government site and if we leave the user agent set to nutch,
we get the crawl. When I change it, I'm getting blocked with an error
about the user agent not being supported. It seems that I should be
changing the
We're putting a php front end on Nutch, and have it working for the search
results page by grabbing an XML feed from Nutch. However, the cache link
is still calling Nutch directly, and my developer indicated that the
cache page doesn't seem to be available via XML.
What's the best way for us to ta
ow that Verisign makes this available for .com and .net as
"TLD zone files".
For ccTLDs like .us and .uk, you'll have to see if the TLD registrar
provides the same. The following page has some useful links to these
folks:
http://www.dnsstuff.com/info/dnslinks.htm
--matt
t that will also require ad serving, but want it to be open source and
give greater transparency to the advertisers than they get today with
Google and Overture. If you start developing one, were you thinking of
making this an open source project?
Thanks.
Has anyone had any luck with advertising/ad management systems being
integrated into nutch? Not just something for the owner to admin ads,
but to allow external advertisers to manage their accounts/bids, that
kind of thing.
I'm drawing up plans for one if none are available, but clearly
somet
We're trying to index based on a country. What I'm trying to accomplish is:
- auto crawl sites with the correct TLD
- auto crawl manually injected sites.
from this, I then only want to further follow sites that match the TLD.
This means that sites with the correct TLD extension, if found anyw
Where's the best source for documentation on the configuration of
nutch? Not finding much/anything; my developer is wasting his days
poring over the code :).
Thanks.
Along these same lines (as I'm interested in a similar country-specific
project), is there any place to get a list of all the domains for a
specific TLD to use to seed nutch? i.e. if I wanted to get a list of
all currently registered .it, .de, or .ca's?
I've looked without success. I'm thin
I'm setting up a nutch-based SE in a small niche and need some
assistance setting up the program.
I need some help, for a quick walk through on setting up the system.
Specifically I'd like a walk through on setup, configuration, how to set
up a polite crawler, how often and deep to crawl, run
Hi,
We're barely past the install stages with nutch, I'd like to ask the
more experienced a few general questions before I jump in with both feet.
I'm thinking about creating a country specific (by TLD) search engine.
- Can nutch only crawl specific TLD's? (i.e. like .it, or .uk.com). My