the database files are replaced. And you would continually get the best
URLs in your index for the space you have. I imagine that this is very
similar to how the Google dance works.
Dennis Kubes
charlie w wrote:
On 8/1/07, Dennis Kubes [EMAIL PROTECTED] wrote:
I am currently writing a python script to automate this whole process
from inject to pushing out to search servers. It should be done in a
day or two and I will post it on the wiki.
Dennis Kubes
charlie w wrote:
Thanks very much for the extended reply; lots of food for thought.
WRT
index is never down.
Hope this helps and let me know if you have any questions.
Dennis Kubes
Is anybody doing really big indexing jobs on Nutch and Hadoop, say 50M
pages or more, and seeing indexer job timeouts?
Dennis
Also, it is best to open a new JIRA issue and attach the patch inside the
JIRA if you want the patch included in Nutch releases.
Dennis Kubes
Marcin Okraszewski wrote:
If you are thinking of contributing a patch to Nutch to be included in the
sources some day, you should probably do it against head
If I am reading the message right :) then yes that problem would have
been fixed by now. I believe that problem was with an earlier version
of Nutch (0.7).
Dennis Kubes
Kai_testing Middleton wrote:
Am I correct that the 'new' mergedb and mergelinkdb commands together would
fix this problem
, say -Xmx512M (we have ours set for -Xmx1024M).
Dennis Kubes
Jason Ma wrote:
I'm running Nutch on RedHat Linux with Java 1.6.0_01. I have
successfully crawled and indexed smaller quantities of data in the
past. However, after I tried to scale up the crawling, Nutch would
give an exception
Sounds to me like you have reached the maximum number of open
connections or ran out of memory or swap space. What is the available
space on the box, how much memory do you have and how much swap?
Dennis Kubes
Fritz Bein wrote:
Hi,
after about 500'000 fetches I receive the message
version of tomcat in the 5x or 6x range.
Dennis Kubes
Jason Ma wrote:
Hi,
I'm new to Nutch and Tomcat, so there are doubtless many stupid things
that I've done. I'm running Nutch 0.8.1 and Tomcat 4.1.36, on RedHat
Linux with Java 1.6.0_01. I have uncompressed the nutch-0.8.1.war
file
index per search server I would love to hear about it.
The former suggestions of space and architecture are what we have
experienced.
Dennis Kubes
calculations may be somewhat low on the segment space.
Dennis Kubes
The other question would be what part of those 4G is taken by the index. I
think it's the majority, but I might be very wrong...
You said above that you don't want local storage. Search has to be on
local file systems. While
hard drives (less if you can find them).
Network is ideally Gigabit ethernet.
Dennis Kubes
Karol Rybak
Programmer
University of Internet Technology and Management
Andrzej Bialecki wrote:
Dennis Kubes wrote:
100 million pages = 50-100 servers and 20-40T of space distributed.
Ideally the setup would be processing machines and search servers. You
[..]
That's a very nice description - thanks, Dennis. I think it would be
useful to include
I was asking if you can ping the master from the slaves. Can you hit
the namenode from one or more of the remote datanodes? If so, in the
hadoop-site.xml files on the datanodes, is the namenode variable
pointing to the FQDN of the namenode instead of localhost?
Dennis Kubes
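For illustration, a minimal sketch of the hadoop-site.xml entries in
question, assuming the property names of that Hadoop era; the hostname
and ports are placeholders:
<property>
  <name>fs.default.name</name>
  <!-- placeholder: use the namenode's FQDN, not localhost -->
  <value>namenode.domain.tld:9000</value>
</property>
<property>
  <name>mapred.job.tracker</name>
  <value>namenode.domain.tld:9001</value>
</property>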
Bolle, Jeffrey F
fixes the
problem but it is not very robust and has no unit tests as of yet. I
have run this successfully myself. I will provide a more robust patch
when time allows but this should help you for now.
Dennis Kubes
djames wrote:
Thanks a lot for your help
I'll give you a feedback
If the hosts file on the namenode is not setup correctly it could be
listening only on localhost. Make sure your /etc/hosts file looks
something like this:
127.0.0.1 localhost localhost.localdomain
x.x.x.x yourcomputer.domain.tld
Dennis Kubes
Bolle, Jeffrey F. wrote
of the servers are open to the public.
search domain.com
nameserver 127.0.0.1
nameserver 208.67.222.222
nameserver 208.67.220.220
nameserver 4.2.2.1
nameserver 4.2.2.2
nameserver 4.2.2.3
nameserver 4.2.2.4
nameserver 4.2.2.5
Dennis Kubes
Enzo Michelangeli wrote:
- Original Message - From
, not tens of
thousands).
We are also using BIND and our current index is 52,519,267 pages so you
should be fine with this. I think djbdns is just easier to use. Are
you using any big DNS caches as backups?
Dennis Kubes
I've had positive experience with djbdns / tinydns package, with some
In the nutch-default.xml file you have the configuration option
plugin.includes. Copy that property to the nutch-site.xml file and
change the parse-(text|html|js) part to look like this:
parse-(text|html|js|pdf). This will enable the PDF parser plugin.
Dennis Kubes
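As a hedged illustration, the override in nutch-site.xml might look like
the following; the rest of the value is only a typical 0.8-style example,
the essential change being the added |pdf:
<property>
  <name>plugin.includes</name>
  <!-- example value; keep your existing plugin list and just add |pdf -->
  <value>protocol-http|urlfilter-regex|parse-(text|html|js|pdf)|index-basic|query-(basic|site|url)</value>
</property>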
Sævaldur Arnar Gunnarsson wrote
. Your second crawl will be around 54 million pages. And a depth
of 3 will give you over 300 million pages. These are the numbers that
we are currently seeing.
Dennis Kubes
bbrown wrote:
This is kind of a generic question. Are there any stats on how many pages
will get crawled based on some
It should look like this but change out domain for your domain. Try
this and let me know if it works.
127.0.0.1 dhcppc0.domain.com dhcppc0 localhost.localdomain localhost
Dennis Kubes
Reza Harditya wrote:
Hi Dennis,
Yes dhcppc0 is the machine that Nutch is on. And yes
For some reason the nutch process can't resolve the hosts. This could
be due to incorrect setup of dns on the machine or a firewall or proxy
in place. See if you can ping one of the urls (hosts) that you are
trying to fetch.
Dennis Kubes
Reza Harditya wrote:
Hi,
I'm a new nutch user
[EMAIL PROTECTED] wrote:
I have checked and confirmed that the hosts I'm trying to fetch are
actually accessible (ping requests and loading the site itself).
However, I
still get the same error.
Any other alternatives?
On 5/14/07, Dennis Kubes [EMAIL PROTECTED] wrote:
For some reason the nutch
The problem may be that the machine is listening on only the local
interface. If you do a ping myhostname from the local box you should
receive the real IP and not the loopback address.
Let me know if this was the problem or if you need more help.
Dennis Kubes
cybercouf wrote:
I'm trying to setup
What errors are you seeing in your hadoop-namenode and datanode logs?
Dennis Kubes
cybercouf wrote:
Yes it is.
Here more details:
$ cat /etc/hosts
127.0.0.1 localhost
84.x.x.x myhostname.mydomain.com myhostname
# ping myhostname
PING myhostname.mydomain.com (84.x.x.x) 56(84
Is your hadoop jar in the lib directory named
hadoop-0.4.0-patched.jar! with the exclamation point? If it is, that
may be causing the error. Also let me know if you can ping the namenode
from any of the data nodes.
Dennis Kubes
cybercouf wrote:
I tried both with localhost
Andrzej Bialecki wrote:
Dennis Kubes wrote:
So we moved 50 machines to a data center for a beta cluster of a new
search engine based on Nutch and Hadoop. We fired all of the machines
up and started fetching and almost immediately started experiencing
JVM crashes and checksum/IO errors
, but in part I hope some of this
information helps someone else to avoid having to spend a week tracking
down hardware and weird JVM problems.
Dennis Kubes
We use a substring in the JSP pages to chop off after 150 characters. Then
it shows something like this with the ellipsis.
http://www.somelongurl.com/?w=with;a;big;long;query;string...
Dennis Kubes
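A minimal sketch of the kind of JSP/Java expression described above; the
variable names are made up, and 150 matches the cutoff mentioned:
// truncate long display URLs and append an ellipsis
String display = url.length() > 150 ? url.substring(0, 150) + "..." : url;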
rubdabadub wrote:
Hi:
You have two options:
1. Don't crawl/index URLs having more than X chars
Did you set the agent name in the Nutch configuration? I think even
when crawling only the local file system the agent name still needs to
be set. If not set, I believe nothing is fetched and errors are thrown,
but you would only see this if your logging was set up for it.
Dennis Kubes
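For reference, a sketch of the agent name setting in nutch-site.xml (the
value is a placeholder; 0.8-era Nutch also has related http.agent.*
properties):
<property>
  <name>http.agent.name</name>
  <value>MyCrawler</value> <!-- placeholder agent name -->
</property>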
jim
I haven't really seen
anybody that has been active on the lists say they are going to be
involved in the project, though. What is everyone's interest level on this?
Dennis Kubes
Nutch by default will only parse the first 65536 bytes of an HTTP
response. You can change this to your desired limit by changing the
http.content.limit configuration variable.
Another question is whether some of the links are duplicates?
Dennis Kubes
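A sketch of the override in nutch-site.xml, assuming the stock property
name; the value shown is just an example that doubles the default:
<property>
  <name>http.content.limit</name>
  <value>131072</value> <!-- default is 65536 -->
</property>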
Mike Howarth wrote:
Thanks
If within Nutch:
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.util.NutchConfiguration;
Configuration conf = NutchConfiguration.create();
String value = conf.get("my.variable.name"); // or getInt, getBoolean, etc.
Dennis Kubes
djames wrote:
Thanks for your help, but where I call this method it can't be resolved.
Is there an import I must do
in the hadoop source code.
Dennis Kubes
djames wrote:
Hello,
I need to add a parameter to the Nutch conf file.
What is the method to read the XML file in Nutch?
Thanks
The error is basically stating that you wrote something out but haven't
read it back in.
Dennis Kubes
qi wu wrote:
Hi,
I am trying to modify the Fetcher code in Nutch 0.8.1, but always get the
exceptions below in the hadoop.log.
java.lang.RuntimeException: java.io.IOException: Version
|suffix)...
Then you will need the prefix-urlfilter.txt and suffix-urlfilter.txt
files in the conf directory. Below is a configuration that only crawls
http pages with specific suffixes. On the suffix we start by allowing
everything and then specifically deny certain file types.
Dennis Kubes
will need
some experience with various query types.
How do we specify the directory where our crawl results are located to the
query engine?
This is specified by the searcher.dir configuration variable.
Dennis Kubes
Is the API for Lucene the one I should use to retrieve results? How
properties are
overridden not the entire file.
Practically, you should define properties having to do with Hadoop (e.g.
the DFS, MapReduce, etc.) in hadoop-site.xml and properties having to
do with Nutch (e.g. fetcher, url-normalizers, etc.) in nutch-site.xml.
Dennis Kubes
Ricardo J. Méndez wrote
How does that happen?
Because I am trying my best to read the code but I can't get beyond parse.
I started at crawl :-)
After looking through it,
I don't want to hijack the thread; I just thought you answered the
question so clearly.
Regards
On 3/2/07, Dennis Kubes [EMAIL PROTECTED] wrote
of
Eclipse you can move the items around on the classpath and put your
favored conf directory first.
Dennis Kubes
Ricardo J. Méndez
http://ricardo.strangevistas.net/
into the CrawlDb. Or if you are writing your
own parse plugin, simply don't add the link to the Outlinks.
Dennis Kubes
Thanks in advance,
Ricardo J. Méndez
http://ricardo.strangevistas.net/
inject, generate, fetch process...don't use the same path). Then you can
merge those results using mergedb for the CrawlDb and mergesegs for the
segments. You shouldn't have to do a full recrawl unless you don't know
what pages were changed.
Dennis Kubes
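As a sketch of those two merges, assuming the 0.8-era command usage
(output location first, then the inputs); all paths and segment names are
placeholders:
bin/nutch mergedb crawl/crawldb_merged crawl1/crawldb crawl2/crawldb
bin/nutch mergesegs crawl/segments_merged crawl1/segments/20070801000000 crawl2/segments/20070802000000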
Thanks
Peter
You can use the python automation script found at:
http://wiki.apache.org/nutch/Automating_Fetches_with_Python
I almost have a new version ready. Will post it in the next couple of
days to the wiki.
Dennis Kubes
sandeep pujar wrote:
Greetings,
Are there ways we can initiate incremental
x, y, and z then I wouldn't do it through
HtmlParseFilter; I would probably go with the Lucene after-index approach.
Dennis Kubes
-Brian
through the tutorials you will have an understanding of how the
system runs. Then read the Becoming_A_Nutch_Developer document on the
wiki and follow the steps. This will get you started, when you have
questions or errors post messages to the user list to get help.
Dennis Kubes
boycanfly wrote
This may be an area that we need to add an extension point to if one
doesn't already exist. I am sure there are many more people out there
that would like to selectively store content based on the content.
Dennis Kubes
Brian Whitman wrote:
In doing whole-internet focused crawls we'd like
Fetcher is using the correct proxy but the DNS isn't getting out. Take
a look at this, it might help.
http://www.rgagnon.com/javadetails/java-0085.html
Dennis Kubes
Damian Florczyk wrote:
ekoje ekoje wrote:
Hello, I tried to modify Nutch in order to pass through a web proxy as
advice
Someone overwrote the login page of the wiki. I restored it and you
should now be able to log in normally.
Dennis Kubes
rubdabadub wrote:
On 2/12/07, Ricardo J. Méndez [EMAIL PROTECTED] wrote:
Hi,
I was checking out the plug in writing example on the Wiki at
http://wiki.apache.org/nutch
Is anybody else getting ClassNotFoundExceptions when running the
injector on the newest trunk of Hadoop?
Dennis
It means searching a specific domain such as automotive, health, etc.
How to do it is another story, short answer you could either index only
specific sites that you know are in the domain or you could create ways
to determine automatically if a page is in a domain.
Dennis Kubes
Reddeppa
believe this is default) and then you
can add a required field to the query in the search.jsp for the language
like this:
query.addRequiredTerm("en", "lang"); // substitute your language code for "en"
Many thanks,
Nes
Dennis Kubes
for the logging.conf file. Is
that file in the same directory as the JobStream.py script? At the top
of the logging file there is a section called formatters, like this:
[formatters]
keys=simple
Dennis Kubes
Justin Hartman wrote:
Hi Dennis
This is a great contribution and I personally thank you
file at all -
should i?
Regards
Justin
On 1/29/07, Dennis Kubes [EMAIL PROTECTED] wrote:
Justin,
Thanks for the update. I will update the script and the wiki to be able
to run this from a clean state with no previous fetches run. Currently it
did assume that there were at least some previous
There was some work done on this problem in hadoop a while back so my
guess is you are probably using a version of Nutch 0.8? Take a look at
HADOOP-563 in the Jira
Dennis Kubes
djames wrote:
Hello,
During the parse of a fetch of 600,000 pages on a cluster of 5 boxes, the
job failed with this
of job streams in python but that is not complete yet.
Andrzej, do you think this is something we should post to the wiki?
Dennis Kubes
Justin Hartman wrote:
Hi all
Just have a couple more questions which remain unclear to me at this stage.
1. I'm fetching urls on a P4 2.8GHz machine
+Calls%22
That being said it is important to have the time synchronized between
the machines and there are other errors (mostly stalls) that will occur
if they are not synchronized.
Dennis Kubes
djames wrote:
Thanks a lot for your response.
I'm using Nutch 0.8.1.
I will rebuild Hadoop
It is up on the wiki at the following location.
http://wiki.apache.org/nutch/Automating_Fetches_with_Python
It has also been added to the front page.
Dennis Kubes
Andrzej Bialecki wrote:
Dennis Kubes wrote:
We have a python script with logging which fully automates the
fetching
limit file types with prefix, suffix, or regex
filters. Let me know if you need to know more about how to do that.
Dennis Kubes
Steve Kallestad wrote:
I've implemented nutch as a site search to try it out.
When I crawl my own site with nutch, I end up with a strange set of links
My stupid mistake. I am using an older version, a customized 0.8 branch
which didn't have normalization. I added normalization to it, but in the
process wasn't updating the key with the normalized url for mergesegs
filtering.
Dennis
Andrzej Bialecki wrote:
Dennis Kubes wrote:
If I wrote a new
and 0.8.2:
same problems. And I also tried with the 0.9.2 version; I can't succeed.
Then I feel there is something to do with the configuration?
Dennis Kubes wrote:
Can you ping the master computer (name node) from the slave (data node)
computers? Also, is your namenode configuration
What Nutch version are you using and what is your setup? An 80K reparse
should only take a few minutes at most.
Dennis
Brian Whitman wrote:
On yesterday's nutch-nightly, following Dennis Kubes' suggestions on how to
normalize URLs, I removed the parsed folders via
rm -rf crawl_parse parse_data
I would take a look at the processes on the namenode server and see if
the namenode has started up. It doesn't look like it did. If this is a
new install, did you format the namenode?
Dennis
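A sketch of the checks described above, assuming a stock 0.x-era Hadoop
install:
# see whether the NameNode process actually came up
ps -ef | grep NameNode
# on a brand new install, format the filesystem once before start-all.sh
bin/hadoop namenode -format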
srinath wrote:
Hi,
While starting hadoop process we are getting the following error in logs
NutchBean creates a query through the [Query query =
Query.parse(args[0], conf);] call in its main method. The actual query
object is created behind the scenes by the whole Nutch analysis
mechanism. This does a lot of work that is helpful in creating general
queries but it is not the only
-parse, re-index and then dedup. Another option is a
url filter that simply removes urls with the #a as they are internal
links. Again you would need to re-parse, etc.
Let me know if you need more information on how to do this.
Dennis Kubes
Brian Whitman wrote:
I'm using Solr to search
I have not used the French analyzer...but did you use the French
analyzer for both indexing and searching?
Dennis
[EMAIL PROTECTED] wrote:
I am having a hard time implementing the French Analyzer... Any help will
be immensely appreciated. Here are the details; first I tried with the
You can use prefix and suffix filters by making sure the plugin.includes
variable in the nutch-*.xml file has the urlfilters configured with the
urlfilter variable like so:
urlfilter-(prefix|suffix)...
Then you will need the prefix-urlfilter.txt and suffix-urlfilter.txt
files in the conf
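For illustration, a prefix-urlfilter.txt along the lines described might
contain no more than the following (one allowed prefix per line; any URL
matching no prefix is rejected). The suffix file then carries the
allow-everything-then-deny-types scheme described above:
# allow only plain http URLs
http://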
(that.hash)) { // order first by
hash
return this.hash.compareTo(that.hash);
...
So, is that where I would place my similarity score and return that value
there?
Dennis Kubes wrote:
If I am understanding what you are asking, in the getRecordReader method
of the InputFormat inner
Segment is indexed as a field so you could write a query filter that
includes the segment name. You could also use an IndexReader and loop
through document by document from 0 to maxDoc() - 1 checking for the
segment field. The second option is much more resource intensive though.
Dennis
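A minimal sketch of the second, IndexReader-based option, using the
Lucene API of that era; the index path and segment name are placeholders:
import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
public class SegmentScan {
  public static void main(String[] args) throws IOException {
    IndexReader reader = IndexReader.open("crawl/index"); // placeholder path
    for (int i = 0; i < reader.maxDoc(); i++) {
      if (reader.isDeleted(i)) continue; // skip deleted docs
      Document doc = reader.document(i);
      if ("20070801000000".equals(doc.get("segment"))) { // placeholder segment
        System.out.println(doc.get("url"));
      }
    }
    reader.close();
  }
}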
You could use suffix filters to filter out any document that isn't a PDF.
Dennis
Marco Vanossi wrote:
Hi,
Do you think there is an easy way to make nutch generate a list of
only
certain document types to fetch?
For example:
If one would like to crawl only PDF docs (after some pages
I agree with Andrzej that a thread dump would be best. Also what
version of nutch are you using?
Dennis
Andrzej Bialecki wrote:
Mike Smith wrote:
Hi Dennis,
But it doesn't make sense since the reducers' keys are URLs and the
heartbeat cannot be sent when the reduce task is called. Since
I don't know exactly what you are wanting to do below. Adding a term
through a query filter would be something like this:
import org.apache.nutch.searcher.FieldQueryFilter;
public class NewQueryFilter extends FieldQueryFilter {
  // minimal completion sketch; the original message was truncated here.
  // FieldQueryFilter subclasses pass the field they handle to super();
  // "myfield" is only an example.
  public NewQueryFilter() {
    super("myfield");
  }
}
A guess would be that somewhere in your classpath you have the wrong
version of xalan.
Dennis
NG-Marketing, M.Schneider wrote:
Hello list,
when I use Java 1.4 everything works well, but if I switch to 1.5 I have the
following error:
I have seen this happen before if the box is loaded down with too many
tasks and the IO is maxed. I have also seen this happen when the regex
filters spin out. We changed our systems to use only prefix and suffix
url filters and that cleared up those types of problems for us.
Dennis
Mike
Did you set the user agent name in the nutch-site.xml file?
Dennis
kevin wrote:
Why is the crawl file so small?
Total size: 12.4 KB
I used this command:
./nutch crawl urls -dir crawled -depth 20
However, the website I crawled is not so small.
Regards!
the *.
Also, in the log file, I can not find any error regarding this
- Original Message - From: Dennis Kubes
[EMAIL PROTECTED]
To: nutch-user@lucene.apache.org
Sent: Wednesday, September 27, 2006 7:59 PM
Subject: Re: no results in nutch 0.8.1
Did you set up the user agent name in the nutch-site.xml file or
nutch-default.xml file?
Dennis
carmmello wrote:
I have followed the steps in the 0.8.1 tutorial and, also, I have been using
Nutch for some time now, without seeing the kind of problem I am
encountering now.
After I have
It depends on settings in the conf/log4j.properties file for the level
of logging. The log files are in the HADOOP_LOG_DIR directory, which can
be set in the hadoop-env.sh file in the conf directory. Usually the
file is called hadoop-phoenix-tasktracker...
Dennis
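For illustration, the relevant line in conf/hadoop-env.sh might be (the
path is only an example):
export HADOOP_LOG_DIR=/var/log/hadoop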
Mike Smith wrote:
Hi,
I
Do a search on this mailing list for fetcher slowness and you will find
a thread detailing this subject. Basically it is due to long crawl
delays. Patches have been submitted on that thread.
Dennis
Bruno Thiel wrote:
Hi all,
I have got a problem with the fetcher (nutch-0.8). The Fetcher
Run ant package. The full distribution is under the build/nutch-x.x folder.
heack wrote:
I ran ant in the Nutch base dir, and it compiled successfully. But it does
not generate nutch-0.8.jar or nutch-0.8.war, only a nutch-0.8.job file (and
other plugin classes) in the build folder. What options should I use
Because the default target is job, which creates the job file; run
package to create all.
heack wrote:
Only a nutch-0.8.job file there.
And also question what the next step should I do after I modified source
code like NutchAnalysis.jj and use ant to build it?
The search.jsp seems not use
Does it not have anything in the database or are there entries in the
index but nothing is being returned by the search?
Dennis
victor_emailbox wrote:
Can anyone help?
Thanks.
victor_emailbox wrote:
Hi,
I followed all the steps in the 0.8 tutorial except that I have only 2
urls in the
Isn't this the same problem that was happening before with the
SegmentMerger, I think, where the nutch-x.x.jar needed to be added to the
classpath on all of the task trackers? We added the following code to
our hadoop script just below the other for loops, redeployed the script,
and restarted all
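The code itself is cut off above; as a sketch under assumptions, the added
lines would mirror the lib/*.jar loops already in bin/hadoop, with the jar
location being a placeholder:
# add the deployed nutch jar(s) to the classpath
for f in $NUTCH_HOME/nutch-*.jar; do
  CLASSPATH=${CLASSPATH}:$f
done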
Besides the initial fetch is the crawl_generate folder in a segment used
anywhere else? Would it be safe to delete or not have the
crawl_generate folder while searching?
Dennis
I don't know if it is the same in 0.7.2, but in 0.8 there is a hadoop-env.sh
file where you can uncomment the JAVA_OPTS variable and give the heap
more memory. Either way the JVM must be started with more memory,
something like this VM option: -Xmx1024M for a 1 GB heap.
Dennis
Bogdan Kecman
You would need to modify Fetcher line 433 to use a text output format
like this:
job.setOutputFormat(TextOutputFormat.class);
and you would need to modify Fetcher line 307 to only collect the
information you are looking for, maybe something like this:
Outlink[] links =
How many urls are you fetching and does each machine have the same
settings as below?
Remember that the number of fetchers is the number of fetcher threads per
task per machine. So you would be running 2 tasks per machine * 12 threads *
3 machines = 72 fetchers.
Dennis
Vishal Shah wrote:
Hi,
You may be running into problems with regex stalls on filtering. Try
removing the regex filter from the nutch-site.xml plugin.includes
property. I was having similar problems before switching to just use
prefix and suffix filters as below. I attached my prefix and suffix url
filter files
Assuming you have two separate war files deployed, it should be as easy
as setting the searcher.dir property in the nutch-site.xml file in the
different web-inf directories to the separate index locations. If you
want to go the distributed searching route there is a in depth
explanation on
Are those like the shuttle boards? Smaller 1/4 size boxes?
Dennis
Zaheed Haque wrote:
Renaud:
Yes or no! I have done some testing as Dennis Kubes suggested and got
similar results to his test. In short, having 4 Nutch search servers
on one box but on 4 different disks, with in my case
You can keep the indexes separate and use the distributed search server,
one per index or you can use the mergedb and mergesegs commands to merge
the two runs into a single crawldb and a single segments then re-run the
invertlinks and index to create a single index file which can then be
in windows under control panel - system -
advanced - environment variables - system variables.
Dennis Kubes
Philip Brown wrote:
nutnoob wrote:
How do I set NUTCH_JAVA_HOME?
I have Java installed on the machine but don't know how to set it for Nutch.
Please help me.
see link: Setting
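For instance, on Linux the variable can be exported in your shell profile;
the JDK path is only an example:
export NUTCH_JAVA_HOME=/usr/java/jdk1.5.0_06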
You will also need more than 1 terabyte to get to 100 million pages. A
good rule of thumb is 2 gigs * replication factor for every 1 million pages.
Dennis
Dan Morrill wrote:
Hi,
I found that with a 3 meg DSL line I was averaging 8 pages per second with a
similar set up, to reach 100
Unfortunately you have to start over. We started breaking our crawls
into 100K to 500K runs because of this.
Dennis
Abdelhakim Diab wrote:
Hi all:
What can I do if I were crawling a big list of sites and suddenly the
crawler stopped because of some problem?
Must I rerun the whole process or I
I installed nutch, tomcat, and java fresh. All of my FC5 installs use
only the minimal amount of packages, I think just editors, admin tools
and base. I don't put x servers on them. We also use network boots and
kickstart load to get a consistent install across machines. We install
java,
But the problems that I run into are the
fetcher threads hanging, and the crawl delay/robots.txt handling (please
see Dennis Kubes' posting on this).
Yes, these are definitely problems.
Stefan has been working on a queue-based fetcher that uses NIO. Seems
very promising, but not yet ready for prime time.
-- Ken
You can add the property to the nutch-site.xml file to take precedence
over the default in the nutch-default.xml file. The value is as below. This
is for Nutch 0.8; I am not sure if this is the same for 0.7.2.
<property>
  <name>fetcher.store.content</name>
  <value>false</value>
  <description>If true, fetcher will store content.</description>
</property>
You can use a suffix filter if there are no query strings.
Dennis
Jens Martin Schubert wrote:
Hello,
is it possible to crawl e.g. http://www.domain.com,
but to skip crawling all urls matching to
(http://www.domain.com/subpage/)
I tried to achieve this with
There is also a mapred.tasktracker.tasks.maximum variable which may be
causing the task number to be different.
Dennis
Murat Ali Bayir wrote:
Hi everybody, although I change the number of mappers in
hadoop-site.xml and use the job.setNumMapTasks method, the system gives
another number as a
The name node is still running. Run the bin/stop-all.sh script first and
then do a ps -ef | grep NameNode to see if the process is still running. If
it is, it may need to be killed by hand with kill -9 processid.
The second problem is the setup of ssh keys as described in the previous
email. Also, I would