ots.txt?
We had a similar situation.
We modified the parse-html plugin, adding a configurable flag
to adhere to robots.txt or not. It works great.
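For what it's worth, a flag like that would typically be exposed through the nutch configuration; the property name below is purely an assumption, since the actual patch is not shown in this thread:

```xml
<!-- Hypothetical property name; the real flag added to parse-html
     is not shown here. -->
<property>
  <name>parser.html.obey.robots</name>
  <value>true</value>
  <description>Whether the parse-html plugin adheres to robots.txt.</description>
</property>
```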
JohnM
--
john mendenhall
j...@surfutopia.net
surf utopia
internet services
contain the successfully fetched urls
> and the redirected intermediate urls. At least that is what I think is
> happening.
>
> The final number indexed should be the successfully fetched urls, which
> would be db_fetched.
>
> Dennis
Anything I can do to help debug this?
g it to 3 and
> >your redirects should go down.
> >
> >Dennis
> >
> >John Mendenhall wrote:
> >>>We are using nutch version nutch-2008-07-22_04-01-29.
> >>>We have a crawldb with over 500k urls.
> >>>
> >>>The statu
t 8 times per day, with only small incremental
progress each round. Should topN be higher?
Or, do we need to rebuild the entire crawl database?
Please let me know if there is any information I need to
provide.
Thanks in advance for any assistance provided.
JohnM
errors, to ensure
they are not something serious?
Of course, this is not even close to the missing
numbers we should be seeing.
Thanks in advance for any assistance or pointers
to other resources or ideas.
JohnM
thout titles.
We have worked through this issue and the titles now exist,
along with the corresponding text.
JohnM
> Can u post some of the urls for which parse text is missing.
I am unable to post the actual urls. This is a private
project for which exact urls cannot be shared.
JohnM
> On Tue, Oct 21, 2008 at 6:44 AM, John Mendenhall <[EMAIL PROTECTED]>wrote:
>
> > We are usin
guarantee all
urls get a parsetext, and hopefully, a title?
Thanks in advance for any assistance or pointers
to other resources or ideas.
JohnM
directory. Then, start the hadoop processes. Once the filtering
is done, we stop the hadoop processes. Then, we unset the
NUTCH_CONF_DIR and HADOOP_CONF_DIR environment variables.
Finally, we restart the hadoop processes.
Everything works like a charm now.
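As a rough sketch of that sequence (the paths are made up, and the start/stop commands are echo placeholders for the real bin/start-all.sh and bin/stop-all.sh invocations and the filtering job):

```shell
# Point the tools at the filtering configuration.
NUTCH_CONF_DIR=/var/nutch/conf.filter;  export NUTCH_CONF_DIR   # assumed path
HADOOP_CONF_DIR=/var/nutch/conf.filter; export HADOOP_CONF_DIR  # assumed path

echo "start hadoop processes"    # placeholder: bin/start-all.sh
echo "run the filtering job"     # placeholder: the crawldb filtering step
echo "stop hadoop processes"     # placeholder: bin/stop-all.sh

# Drop back to the normal configuration.
unset NUTCH_CONF_DIR
unset HADOOP_CONF_DIR

echo "restart hadoop processes"  # placeholder: bin/start-all.sh
```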
JohnM
.
Does anyone have any thoughts or ideas for what we can do to
get this to work with the NUTCH_CONF_DIR? Thank you in
advance for any pointers.
JohnM
specific I should be looking
at first. Thanks in advance for any guidance or ideas provided.
JohnM
a box configured like Linux. Assumptions were made about
the default shell. We have had nutch running on windows,
linux, and solaris. To get it to run on any of these boxes,
changes to the basic scripts have been required.
JohnM
er. Pipe it through sort before you use tail.
You can only delete old segments after the refetch time has
passed for that segment and all entries in that segment
have been refetched.
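The sort-before-tail point can be checked with a throwaway example; the segment names below are invented, but since segment directories are named by timestamp, a plain lexical sort puts the newest one last:

```shell
# Simulate a segments directory with timestamp-named subdirectories.
tmp=$(mktemp -d)
mkdir "$tmp/20080115090000" "$tmp/20080128132506" "$tmp/20080201220906"

# Sort the listing before tailing it to get the newest segment.
newest=$(ls "$tmp" | sort | tail -1)
echo "$newest"    # 20080201220906

rm -r "$tmp"
```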
JohnM
n in the long
run.
Assuming this is the way Nutch moves forward, do we allow Nutch
to stay as-is, with plugins and all, and create a new project?
Or, do we not worry about abandoning the current setup and
changing it en masse?
JohnM
be seem to think that is the best way to
go.
Any thoughts?
Thanks!
JohnM
:50030/jobtracker.jsp, the cluster summary shows only one
> > node. ?
> >
> > Any suggestions
> >
> >
> > Maps: 0  Reduces: 2  Tasks/Node: 4  Nodes: 1  <http://ascot1:50030/machines.jsp>
Did you see all nodes listed in the output of the start-all script?
It should list
'nutch'. Fixed that and it works
like a charm.
Thanks again!
JohnM
ng it anywhere.
Is there a place where I can set the memory
footprint for tomcat to use more memory?
Or, is there another place I should be looking?
Thanks in advance for any pointers or assistance.
JohnM
unpacked.
Then, the nutch app is the default URL for your tomcat
setup.
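A sketch of that deployment step, with made-up paths standing in for the real tomcat webapps directory and the nutch war file:

```shell
# Simulate deploying the nutch webapp as tomcat's default (ROOT) app.
webapps=$(mktemp -d)            # stands in for $CATALINA_HOME/webapps
war=$(mktemp)                   # stands in for the nutch war file

# Tomcat serves webapps/ROOT at the default URL, so install the war as ROOT.war.
cp "$war" "$webapps/ROOT.war"
ls "$webapps"                   # ROOT.war
```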
I hope this helps.
JohnM
using nutch 0.9.
> Thanks !
>
> On Fri, Jan 11, 2008 at 12:57 AM, John Mendenhall <[EMAIL PROTECTED]>
> wrote:
>
> > Hello,
> >
> > I am running nutch 0.9 currently.
> > I am running on 4 nodes, one is the master, in
> > addition to being a slave.
and it fixed my
> > problem:
> >
> > http://www.mail-archive.com/[EMAIL PROTECTED]/msg01991.html
Look at NUTCH-503, not NUTCH-507. I have no experience with NUTCH-507.
JohnM
ere is a problem with the Generator. There was a change committed
after 0.9 was released. I implemented this change and it fixed my
problem:
http://www.mail-archive.com/[EMAIL PROTECTED]/msg01991.html
JohnM
kay to me. I would start looking
at the logs closely. I would try setting your log4j
properties to INFO or DEBUG level for the generator
step.
The inject is obviously working since your stats show
the urls in the crawldb as unfetched. So, debug the
generator.
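A minimal sketch of that log4j change in conf/log4j.properties, assuming the 0.9 generator class name org.apache.nutch.crawl.Generator:

```properties
# Raise generator logging to DEBUG (logger name assumed from the 0.9 sources).
log4j.logger.org.apache.nutch.crawl.Generator=DEBUG
```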
JohnM
> Any help at all would be much appreciated.
Post the command you submitted, plus a sample of the
urls in the url file, plus your filter. We can start
from there.
JohnM
s?
Thanks in advance for any pointers or rules of thumb
you can provide.
JohnM
ask: task_0018_m_02_0
-
Thanks in advance for any assistance you can provide.
JohnM
On Tue, 05 Feb 2008, John Mendenhall wrote:
> -
> Merging 14 segments to /var/nutch/crawl/mergesegs_dir/20080201220906
> SegmentMerger: adding /var/nutch/crawl/segments/20080128132506
> SegmentMerger: adding ...
> SegmentMerger: using segment data from: content crawl_gener
/nutch merge -workingdir $NUTCHTMPDIR $NEWINDEXDIR $NEWINDEXESDIR
The variable names should be self-explanatory. If not,
just let me know.
JohnM
On Tue, 05 Feb 2008, John Mendenhall wrote:
> I am running nutch 0.9.
> I have run nutch mergesegs many times before.
> The last couple times I have run, I get the following
> errors:
>
> -
> Merging 14 segments to /var/nutch/crawl/mergesegs_dir/20080201220906
> Segm
Why is log4j not finding the log4j.properties file?
The nutch script in nutch/bin already adds the conf
dir to the class path.
Thanks in advance for any assistance you can provide.
JohnM
some with 1.5gb ram,
and others with 4gb ram.
Sorry for all the questions. The fetch issue is
the wall I am currently trying to get past.
Should this be debugged in the fetch process or
is it possible the generate process is only
outputting 3%-4% of the topN value?
Thanks in advance for any poi
configuration
value is being used.
I recommend we modify Fetcher2.java to use this
value instead of requiring it to be on the command
line.
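For reference, the configuration value in question can be set in conf/nutch-site.xml like this:

```xml
<property>
  <name>fetcher.threads.fetch</name>
  <value>10</value>
  <description>The number of fetcher threads the fetcher should use.</description>
</property>
```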
JohnM
he jsp pages
are in the jsp directory. Simple, huh?
If you want to just modify what is already in
the tomcat directory, they are located in the
webapps/ROOT directory in various directories,
assuming you renamed it to ROOT.
I hope that helps.
JohnM
onfiguration files and
what you are setting.
JohnM
problem?
Thanks in advance for any assistance you can provide.
JohnM
3 pure slaves. What is the best procedure for turning off
the 3 slaves?
Should I go back to a "local" setup only, without the overhead
of hadoop dfs?
What is the best recommendation?
Thanks!
JohnM
On Fri, 25 Jan 2008, Dennis Kubes wrote:
> Yes you would need to run parsing after fetching and before updatedb.
Thanks!
JohnM
> John Mendenhall wrote:
> >On Fri, 25 Jan 2008, Dennis Kubes wrote:
> >
> >>>Is the recommendation to run fetcher in parsing mode?
>
would complete the download and if the parsing failed you would
> still have the page content and be able to try again without refetching.
To clarify, run the parsing after the fetch process
and before the updatedb process, correct?
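As a sketch of that ordering (the segment path and the -noParsing flag are assumptions about a typical 0.9 setup; the commands are only echoed here, not run):

```shell
SEGMENT=crawl/segments/20080125120000    # hypothetical segment name

# Order: fetch without parsing, then parse, then updatedb.
step1="bin/nutch fetch $SEGMENT -noParsing"
step2="bin/nutch parse $SEGMENT"
step3="bin/nutch updatedb crawl/crawldb $SEGMENT"
printf '%s\n' "$step1" "$step2" "$step3"
```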
Thanks!
JohnM
m
> the same host are assigned to the same map task.
All hosts are the same. Every one of them.
If there is no way to split them up, this seems to
imply the distributed nature of nutch is lost when
attempting to build an index for a single large
site. Please correct me if I am wrong with this
presumption.
Thanks!
JohnM
slots.
What settings do I need to modify to get the generated
topN (10) urls to be spread out amongst all map
task slots?
Thanks!
JohnM
uals(LocalFileSystem.NAME)) {
> ...
>
> because Hadoop reserves a specific URI of the local FS abstraction, no
> matter what is its implementation.
I found LocalFileSystem documentation at
http://hadoop.apache.org/core/docs/r0.14.4/api/org/apache/hadoop/fs/LocalFileSyste
On Wed, 23 Jan 2008, John Mendenhall wrote:
> I am using nutch-0.9.
>
> In the searcher.IndexSearcher class, there is a getDirectory
> method that uses the following two calls:
>
> -
> if ("local".equals(this.fs.getName())) {
> return FSDi
e just remove the boolean?
Please let me know how we are planning on modifying
this code to adhere to the APIs we are using.
Thanks!
JohnM
?
> Or, is this something else in the configuration?
>
> Is this error the cause of only doing 3% of the 100k
> urls I requested to be done?
>
> Or, is it a problem with the other 96 map tasks not doing
> anything?
>
> Thanks again for all of your help.
>
> JohnM
Does anyone have any thoughts on how I can begin
addressing the issues I am experiencing above?
Thanks in advance for any pointers anyone can
provide.
JohnM
ause of only doing 3% of the 100k
urls I requested to be done?
Or, is it a problem with the other 96 map tasks not doing
anything?
Thanks again for all of your help.
JohnM
It sends me the jobdetails.jsp page, which is what I
reported on.
It seems to me you are referring to another interface.
Can you please let me know where I should be looking
for the errors in the fetcher tasks themselves?
Thanks!
JohnM
e to check on the bandwidth available
for fetching.
Variable mapred.map.tasks is set to 97.
Variable mapred.reduce.tasks is set to 17.
Variable fetcher.threads.fetch is set to 10.
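For reference, those three settings correspond to entries like these in the hadoop/nutch site configuration files (values copied from above):

```xml
<property>
  <name>mapred.map.tasks</name>
  <value>97</value>
</property>
<property>
  <name>mapred.reduce.tasks</name>
  <value>17</value>
</property>
<property>
  <name>fetcher.threads.fetch</name>
  <value>10</value>
</property>
```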
Thanks again for any pointers you can provide.
JohnM
> John Mendenhall wrote:
> >Hello,
> >
higher than the default of 2?
Is there something in the logs I should
look for to determine the exact cause of
this problem?
Thank you in advance for any assistance
that can be provided.
If you need any additional information,
please let me know and I'll send it.
Thanks!
JohnM
ditional information,
please let me know and I'll send them.
Thanks!
JohnM
figuration.(Configuration.java:93)
at org.apache.hadoop.fs.FsShell.main(FsShell.java:910)
-
If you need me to post log excerpts from the other slaves, please
let me know and I'll put them up.
Thanks!
JohnM
d and no new
URLs will be added.
I hope that helps.
JohnM
> > update our crawldb instead of re-crawling.
> >
> > So do u have any solution that how to update crawldb which already have
> > been crawled and storing some useful information.
> >
> > It's nice if I find any solutions from u or any of ur colleagues.
> >
> > With Thanks & Regards,
> >
> > Ratnesh,V2Solutions India
ation on to System.out
-topN <nnnn> <out_dir> [<min>]  dump top <nnnn> urls sorted by score to <out_dir>
[<min>]  skip records with scores below this value.
This can significantly improve performance.
Or, you can write your own class that outputs
whatever you want from the database...
Joh
merge      merge several segment indexes
dedup      remove duplicates from a set of segment indexes
plugin     load a plugin and run one of its classes main()
server     run a search server
 or
CLASSNAME  run the class named CLASSNAME
Most commands print help
ainThread.run(libgcj.so.8rh)
> Caused by: java.lang.ClassNotFoundException: admin not found in
> gnu.gjc.runtime ...
>
> I did this a couple of weeks ago. At that point I couldn't find any
> documentation for Nutch 0.9, so I tried the
> ./bin/nutch admin db -create
>
> is that the Problem?
nutch 0.9 do
know where I should ask, or where I can
find the docs on these kinds of queries.
Thanks!
JohnM