Hi Alex,
I cannot locate the Java file you mention at
org.apache.nutch.parse.html.HtmlParser in either 1.2 or branch 1.3...
Having a quick look at org.apache.nutch.parse.HTMLMetaTags (it is identical
in both versions above) it appears that you are right: the double quotes for
meta http-equiv
Hi,
To add to Markus' comments, if you take a look at the script it is written
in such a way that, if run in safe mode, it protects us against an error which
may occur. If this is the case we can recover segments etc. and take
appropriate actions to resolve.
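The kind of safe-mode guard described can be sketched in bash. This is a hypothetical illustration rather than the actual Nutch crawl script: the function name, the crawl directory layout, and the use of a missing `parse_data` directory as the marker of an interrupted run are all assumptions.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical guard: before starting a new crawl, look for segments left
# over from a failed run so they can be recovered rather than clobbered.
find_unparsed_segments() {
  local crawl_dir="$1"
  for seg in "$crawl_dir"/segments/*/; do
    [ -d "$seg" ] || continue
    # A segment without parse output likely belongs to an interrupted run.
    if [ ! -d "${seg}parse_data" ]; then
      echo "unparsed: ${seg%/}"
    fi
  done
}

# Demo on a throwaway layout: one parsed and one interrupted segment.
demo=$(mktemp -d)
mkdir -p "$demo/segments/20110601000000/parse_data"  # completed segment
mkdir -p "$demo/segments/20110602000000"             # interrupted segment
find_unparsed_segments "$demo"
rm -rf "$demo"
```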
On Tue, Jun 7, 2011 at 9:01 PM, Markus
Hi,
I suggest that before you try to progress any further with this you read as
much of the wiki [1] as you can, in particular I would start here [2] [3]
After this, try looking through some of the source and understanding what
parameters are required to run various commands. The reason for this
Hi everyone,
Was wondering if anyone (familiar with the topics) would be interested in
sending me material for the following pages [1] [2]. The links appear to be
non-existent in our wiki and it would be nice to get some material on these
topics if they are important and required!
Hi abhayd,
In short...yes.
Although you have correctly specified an absolute path, you need to drop the
/crawldb/current/part-0
A good resource for this stuff can usually be found on the mailing lists.
On Wed, Jun 8, 2011 at 8:03 AM, abhayd ajdabhol...@hotmail.com wrote:
hi
I am using
We are a bit thin on supporting documentation for the new release at the
moment but are actively working towards producing this. Hopefully once we
have something contributed to the wiki the differences in configuration and
functionality within release 1.3 will be fully explained.
On Thu, Jun 9,
Hi Adelaida,
Assuming that you have been able to successfully crawl the top level domain
http://elcorreo.com e.g. that you have been able to crawl and create an
index, at least we know that your configuration options are OK.
I assume that you are using 1.2... can you confirm?
What does the rest
Hi,
Can you provide a use case? The reason I ask is that I can only assume that
you would be hacking some code to inject your urls from some other URL
store?
On Tue, Jun 14, 2011 at 5:18 PM, shanWDC ssar...@web.com wrote:
Is there a way to inject urls in the injector, through code, rather than
Off the top of my head one property springs to mind, which you may or may
not have configured in nutch-site:
http.content.limit
However, I think that this is not the source of the problem.
I would advise you to have a look at your hadoop log file for any obvious
warnings... how do you know he
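For reference, a nutch-site.xml fragment touching the property mentioned above might look like this (65536 bytes is the shipped default; the -1 value here is just an example):

```xml
<property>
  <name>http.content.limit</name>
  <!-- Maximum number of content bytes to download per document;
       -1 removes the limit (use with care on large pages). -->
  <value>-1</value>
</property>
```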
Hi Mohammad,
Try looking at the pre-Nutch 1.3 material on the wiki, I'm sure there must
be something in there you can build on... or that will at least point you in
the right direction
http://wiki.apache.org/nutch/Archive%20and%20Legacy
HTH
On Fri, Jun 17, 2011 at 9:27 PM, Mohammad Hassan
Have you set your crawl directory property value in nutch-site.xml when
launching the war file on tomcat?
On Tue, Jun 21, 2011 at 4:01 AM, Mohammad Hassan Pandi
pandi...@gmail.com wrote:
Following http://wiki.apache.org/nutch/NutchHadoopTutorial I crawled
lucene.apache.org with command
To give a short answer to your question, the answer is I don't know. Many of
us are not using Lucene as the indexing mechanism. I think as this is
specifically linked to Lucene you would be better asking there.
try the user list
http://lucene.apache.org/java/docs/mailinglists.html#Java User List
Hi,
Assuming that you are using 1.2, the war file should definitely be there. You
will be able to get step-by-step directions for this in the tutorial on the
Nutch site.
http://wiki.apache.org/nutch/NutchTutorial
Note that this will be getting updated soon to reflect changes incorporated
into
As this is open source I think the best way to solve your question/request
is to get down and dirty with your own configuration. Many implementation
scenarios are unique; to a new Nutch user this may offer no immediate
help, however it clearly displays the adaptability and
Hi Markus,
Can you list the steps you executed prior to the solrdedup please?
I think I encountered something similar a while back and as my work was
moving on I didn't get a chance to investigate it fully.
On Tue, Jun 21, 2011 at 1:54 PM, Markus Jelsma
markus.jel...@openindex.io wrote:
Hi,
I tried to build Nutch trunk in Eclipse circa 2 months ago. Gora built
fine and from memory it was the Ivy configuration within Nutch which had to
be altered. I'm positive the problems I was having have now been
rectified but I haven't tried since. That is why I am interested in why
JUnit
Hi Jefferson,
I cannot access either your nutch-site or nutch-default but I see that your
http.content.limit is INFO http.Http - http.content.limit = 65536
It is a fairly large page so maybe this is the cause. I'm sorry, I don't
have access to my Linux workstation so I can't test myself; can you
Hi all,
With permission from the author I managed to adapt a blog entry for the
above which can be found here.
At this stage I would ask for anyone interested to
make changes/improvements/etc. Once we can verify the integrity and accuracy
of the entry it would be nice to rebuild the website with
Can you expand on this? I am not understanding your description of the
problem.
On Fri, Jun 24, 2011 at 12:52 PM, Jefferson jeff151520...@msn.com wrote:
ready.
Now I have another problem:
I type 'phenomena' and it returns this:
-
Albert Einstein - Wikipedia, the free encyclopedia Albert
I see within your nutch-site file that you have set an http.content.limit
value of 340,671. Is there any reason for this value? I'm assuming you are
not indexing this page merely so you can search for the term phenomena, and
that there is other textual content within the page that you are
Hello list,
Do we have any suggestions we wish to discuss regarding the above?
thanks
--
*Lewis*
nutch-site.xml is empty. Perhaps it means Nutch uses the default path as
index location, right?
On Thu, Jun 23, 2011 at 10:57 PM, lewis john mcgibbney
lewis.mcgibb...@gmail.com wrote:
Have you set your crawl directory property value in nutch-site.xml when
launching the war file on tomcat
I will try to get a wiki entry for this sorted ASAP as it is a fundamental
requirement for anyone wishing to debug/understand how classes work in Nutch
1.3, when the time comes around any opinions/comments you have would be a
great addition.
Thanks
2011/6/30 Nutch User - 1 nutch.use...@gmail.com
How many threads do you have running concurrently?
Is there any log output to indicate any warnings or errors otherwise?
On Sat, Jul 2, 2011 at 7:40 AM, Markus Jelsma markus.jel...@openindex.io wrote:
Does it run out of memory? Is GC able to reclaim consumed heap space?
Have a 300K URLs
Hi,
Just finished the above, which you can find here [1], so please check out the
pages if you are having trouble passing parameters to any commands. It would
be great if you could mention any mistakes, or even better edit or add any
missing information you think would make the documentation
Absolutely...
There is a short (old) thread here on this topic [1]; from what I can see
this issue has not been addressed. Therefore it looks like implementing your
own parser plugin is what's required.
[1]
http://www.lucidimagination.com/search/document/a8d53fac1caa578c/nutch_with_nsf_files
Hi,
I am sorry that I have not been able to try to replicate the scenario and
confirm whether I get zero scores in a similar situation, as I am temporarily
unable to do so, but I would like to add this resource [1] if you have not
seen it yet. I am aware that this doesn't address the problem
Hi,
I'm curious to hear if anyone has information for configuring Nutch to crawl
an RDB such as MySQL. In my hypothetical example there are N number of
databases residing in various distributed geographical locations; to make a
worst-case scenario, say that they are NOT all the same type, and I
thanks to you both
On Tue, Jul 5, 2011 at 4:35 PM, Markus Jelsma markus.jel...@openindex.io wrote:
Hi,
About geographical search: Solr will do this for you. Built-in for 3.x+ and
using third-party plugins for 1.4.x. Both provide different features. In
Solr
it's you'd not base similarity on
Hi C.B.,
This is way too vague. We really require more information regarding roughly
what kind of results you wish to get. It would be a near impossible task for
anyone to try and specify a solution to this open-ended question.
Please elaborate
Thank you
On Thu, Jul 7, 2011 at 12:56 PM, Cam
Hi Paul,
Please see this tutorial for working with Nutch 1.3 [1]
The tutorial you were using is for Nutch 1.2 from memory.
[1] http://wiki.apache.org/nutch/RunningNutchAndSolr
Thank you
On Thu, Jul 7, 2011 at 1:17 PM, Paul van Hoven
paul.van.ho...@googlemail.com wrote:
I'm completely new
Regards,
-C.B.
On Thu, Jul 7, 2011 at 6:21 PM, lewis john mcgibbney
lewis.mcgibb...@gmail.com wrote:
Hi C.B.,
This is way too vague. We really require more information regarding roughly
what kind of results you wish to get. It would be a near impossible task for
anyone to try
Hi Serenity,
I don't know if you are aware but this message has been duplicated across
both user@ and nutch-user@.
In general, guidance on what to put in nutch-site and nutch-default can be
found here [1] and here [2]. It is not required to add the properties to
both of the conf files.
Yes this would limit the number of URLs from any one domain, but it would
not explain why one domain seems to get fetched more after recursive fetches
of some given seed set.
Can you explain more about your crawling operation? Are you executing a
crawl command? If so what arguments are you
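If the goal is to cap how many URLs a single domain contributes per generate cycle, the relevant nutch-site.xml properties are along the lines of the fragment below. Property names have varied across 1.x releases (older releases used generate.max.per.host), so treat this as an illustration and check the nutch-default.xml of your release:

```xml
<property>
  <name>generate.max.count</name>
  <!-- Illustrative cap on URLs per counting unit in one generated segment. -->
  <value>100</value>
</property>
<property>
  <name>generate.count.mode</name>
  <!-- Count per domain rather than per host. -->
  <value>domain</value>
</property>
```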
The web app was deprecated when we released Nutch 1.3. This was so we could
use the Solr interface for searching and offload the bulk associated with the
web app (amongst other things). There has been quite a lot of chat regarding
this on this list over the last while.
The last version of Nutch to
Hi C.B.,
It looks like you may have simply missed the '-dir' when you were specifying
your crawldb directory to be updated from the fetched segment. Have a look
here [1]
Can you please try and post your results.
[1] http://wiki.apache.org/nutch/bin/nutch_updatedb
On Fri, Jul 8, 2011 at 5:06
Hi Serenity,
How did you execute the crawl? With the crawl command? Have you ensured that
parsing has been done?
This looks like a different IIE than others have been getting when indexing
to Solr. So please ensure that parsing has been done on all fetched content.
On Fri, Jul 8, 2011 at 6:20 PM,
Hi C.B.,
Your description gets slightly cloudy towards the end, e.g. around 'One
difficulty with my htmlcleaner...taken from firebug'.
Are you trying to say that some of the URLs contain bad HTML, and you know
this because it is flagged up by Firebug? If this is the case are you able to
edit the HTML and
are pretty dynamic just now and there is a lot of exciting
stuff in the pipeline for the near future.
Thanks
On Thu, Jun 23, 2011 at 11:55 PM, lewis john mcgibbney
lewis.mcgibb...@gmail.com wrote:
I tried to build Nutch trunk in Eclipse circa 2 months ago. Gora
built fine and from memory
Hi Carmmello,
I would like to stress that I am only speaking from my own views on the way
the project has been moving over the last year and a half or so, but I would
like to add the following points to address your quite obvious concerns.
There has been a lot of correspondence on closely linked
Hi C.B.,
Can you please expand on this description?
On Sun, Jul 10, 2011 at 11:52 AM, Cam Bazz camb...@gmail.com wrote:
Hello All,
Is there a way to save the plain htmls from the crawl? Or is this
already stored in segments dir?
Best Regards,
-C.B.
--
*Lewis*
Hi,
For a 1.3 tutorial please see here [1]. I am in the process of overhauling
the Nutch site to accommodate new changes as per the 1.3 release.
Thank you
On Sun, Jul 10, 2011 at 3:42 PM, Paul van Hoven
paul.van.ho...@googlemail.com wrote:
I'm completely new to nutch so I downloaded version 1.3
Hi,
Please see this new tutorial [1] for configuring Nutch 1.3. If you are
familiar/comfortable working with Solr for improvements to indexing then
you will find it no problem.
If you need to stick with Lucene and the web application front end then
please stick with Nutch 1.2 or before.
[1]
Hi, please see this up-to-date 1.3 tutorial [1] on the wiki.
Please try it out and take on Markus' points regarding Nutch trunk as the
problems you are experiencing are usual with Trunk as it stands.
[1] http://wiki.apache.org/nutch/RunningNutchAndSolr
On Mon, Jul 11, 2011 at 10:50 PM,
I must admit, Markus, that I agree with you: for making ad-hoc changes to
your configuration it is usually more time efficient to use a text editor.
Hi C.B.
Is there any reason in particular you are interested in getting it up and
running with an IDE? I had contemplated getting a revised tutorial
Hi Fernando,
One point for me to mention which I did not pick up from your post. Did you
rebuild your Nutch dist after making the changes to include your new parser?
I know that this is a pretty simple suggestion but hopefully it might be the
right one.
Also can you please provide more details
of a solr / php question than a Nutch question I think.
-Original Message-
From: Markus Jelsma [mailto:markus.jel...@openindex.io]
Sent: Monday, July 11, 2011 3:19 PM
To: user@nutch.apache.org
Cc: lewis john mcgibbney
Subject: Re: Nutch Gotchas as of release 1.3
Well, now i'm
From the looks of it you need to parse all segments before
attempting to index them.
As Markus has pointed out, the specific segment hasn't been parsed. Try
parsing as per the following link
http://wiki.apache.org/nutch/bin/nutch_parse
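The advice above amounts to something like the following shell loop (paths are illustrative; run from the Nutch runtime directory):

```shell
# Parse every fetched segment so it is safe to hand to the indexing step.
for segment in crawl/segments/*; do
  bin/nutch parse "$segment"
done
```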
On Tue, Jul 12, 2011 at 1:50 PM, Paul van Hoven
be great to find
whether there is scope to file a JIRA with this.
Thank you
On Tue, Jul 12, 2011 at 2:02 PM, Nutch User - 1 nutch.use...@gmail.com wrote:
On 07/12/2011 03:42 PM, lewis john mcgibbney wrote:
Hi,
An observation is that you are using the 1.3 branch, which will now
contain
some
What plugin are you hacking away on? Your own custom one or one already
shipped with Nutch? Just so we are reading from the same page.
This, along with some further documentation for running various classes from
the command line, is definitely worth inclusion in the CommandLineOptions
page of
for fetching, exiting ...
Looks like I am missing some key step =(.
-param
On 7/12/11 1:37 PM, lewis john mcgibbney lewis.mcgibb...@gmail.com
wrote:
Hi,
I think you are maybe getting tangled here. Please see the following
tutorial for Nutch 1.3 [1]
Please also note that the URL you
Assuming you're using Nutch 1.2, the web application you point to needs to be
the exact name of the WAR file.
In my case it was therefore always
http://localhost:8080/nutch-1.2 rather than http://localhost:8080/nutch/
Also I do not understand written Spanish (I think this is Spanish) so I can't
help you out with the
I think your question should be more along the lines of: is it possible to
use data stored within a Lucene index in a Solr core for search?
Unfortunately I am unable to answer this question, my suggestion would be to
ask on solr-user@
Another option which you may wish to consider is using the
in a well tuned fashion should yield better results over
time.
Thanks again for the help (and apologies for the huge e-mail)
Chris
On 14 July 2011 10:59, lewis john mcgibbney lewis.mcgibb...@gmail.com
wrote:
Hi Chris,
Yes a Nutch 1.3 crawl and Solr index bash script is something that has
Hi Eric
Please add any comments you wish to the new tutorial that Markus mentioned
on the Wiki. I am in the process of rebuilding the Nutch site and this will
be included tomorrow, i.e. from now on the default tutorial people are
directed to from the wiki will be the RunningNutchAndSolr tutorial...
Hi C.B.,
Quite a few things here
On Fri, Jul 15, 2011 at 5:19 PM, Cam Bazz camb...@gmail.com wrote:
Hello,
Finally I got a working build environment, and I am doing some
modifications and playing around.
Good to hear; although it is off topic, can you share any hurdles you
overcame with us?
)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
<property>
  <name>searcher.dir</name>
  <value>C:/Apache/apache-nutch-1.2/crawl</value>
</property>
</configuration>
-Original Message-
From: lewis john mcgibbney [mailto:lewis.mcgibb...@gmail.com]
Sent
of anything else it could be.
-Original Message-
From: lewis john mcgibbney [mailto:lewis.mcgibb...@gmail.com]
Sent: Friday, July 15, 2011 3:19 PM
To: user@nutch.apache.org
Subject: Re: Deploying the web application in Nutch 1.2
Are you adding this to nutch-site within your webapp
Hi C.B.,
I'm in the process of overhauling PluginCentral on the wiki and have opened
a wiki page for Plugin Gotchas [1]. Would it be possible to ask you to edit
and define your understanding of the problem more specifically please. There
is also an interesting page here [2], which you may or may
Hi,
Do we have any suggestions to demystify this? I intend to look into webgraph
in more detail soon as I wish to get a much more detailed picture of its
functionality for link analysis purposes.
On Wed, Jul 13, 2011 at 9:25 AM, Nutch User - 1 nutch.use...@gmail.com wrote:
Does anyone know how
Hi Gabriele,
At first this seems like a plausible argument, however my question concerns
what Nutch would do if we wished to change the Solr core to which to index.
If we removed this functionality from the crawldb there would be no way to
determine what Nutch was to fetch and what it wasn't.
Please feel free to add this to the wiki as it is a question that will
undoubtedly arise in the future.
Lewis
On Sat, Jul 16, 2011 at 12:37 PM, Gabriele Kahlout gabri...@mysimpatico.com
wrote:
On Sat, Jul 16, 2011 at 1:29 PM, lewis john mcgibbney
lewis.mcgibb...@gmail.com wrote:
Hi
Further to this, I have been working on a JIRA ticket for this [1]
If you could, can you please test. I will also shortly and hopefully we can
get this committed soon.
Thank you
[1] https://issues.apache.org/jira/browse/NUTCH-672
On Tue, Jul 12, 2011 at 9:36 PM, lewis john mcgibbney
Hi,
Is this currently possible with Tika 0.9 in Nutch branch 1.4? I would have
thought that this would have been dealt with in Tika, however I have seen no
mention of anyone having problems extracting this from web documents when
fetching with Nutch or even discussing it.
For example say I had
Hi Markus,
I think this is a good shout, and it is not hard to understand the points
you make. Quite clearly, good practice relating to the inclusion of accurate
and useful language information (as well as other types of information) in
HTTP headers is not a reality and it wouldn't be suitable
Hi,
If you have a look at your regex-urlfilter.txt it will by default be
rejecting '?' in the URL. Please test with that line edited (or commented
out) and see if the problem fades.
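For reference, the default rule in conf/regex-urlfilter.txt that rejects such URLs looks like the fragment below; commenting it out is the quick test, and the narrower rule is just one hedged alternative:

```
# Shipped rule: skip URLs containing characters that usually mark queries.
# Commented out here so '?' URLs pass through:
# -[?*!@=]
# A narrower alternative that still filters the other characters:
-[*!@=]
```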
On Mon, Jul 18, 2011 at 10:11 AM, Anders Rask anr...@gmail.com wrote:
Hi Markus!
We are using a custom parser, but I
Hi Cheng,
Please see this wiki page for some references to optimization [1].
I can see your problem though. I think a possible solution may be to have two
seed directories, with a specifically tailored Nutch implementation ready to
crawl both. This way we guarantee top results if we take site in a
Hi Kelvin,
I see you are posting on a couple of threads with regards to the Lucene
index generated by Nutch, which you correctly point out is not there. It is
not possible to create a Lucene index from Nutch 1.3 anymore as all
searching has been shifted to Solr; therefore Nutch 1.3 has no use for a
I don't think this has anything to do with modifying the crawl src. It
doesn't in fact have anything to do with optimization either. Try using your
URLFilters, e.g. regex.
It is important to try and understand what type of pages we can filter out
from a Nutch crawl using the filters provided.
HTH
I don't know if you are still pursuing this, and as you haven't had any
response I will give some tips.
It sounds like you're using <= Nutch 1.2; therefore unless you are comfortable
working with JSPs I wouldn't bother with the hassle. It might be
better to try and use Solr for indexing and
errors if I use 'crawl' and to
prove
that I do not have any faults in the conf files or the directories.
I still get the errors if I use the individual commands inject,
generate, fetch
Cheers,
Leo
On Tue, 2011-07-19 at 22:09 +0100, lewis john mcgibbney wrote
Hi Marek,
As we're talking about automating the task we're immediately looking at
implementing a bash script. In the situation we have described, we wish
Nutch to adopt a breadth-first search (BFS) behaviour when crawling. Between
us can we suggest any methods for best practice relating to BFS?
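A breadth-first crawl of the kind described is usually automated as a generate/fetch/parse/updatedb loop; the sketch below is a hedged outline (directory names and the DEPTH value are assumptions, and error handling is omitted):

```shell
#!/usr/bin/env bash
set -eu
CRAWL=crawl     # illustrative crawl directory
DEPTH=3         # number of BFS levels to expand

bin/nutch inject "$CRAWL/crawldb" urls/

for ((i = 1; i <= DEPTH; i++)); do
  bin/nutch generate "$CRAWL/crawldb" "$CRAWL/segments"
  segment=$(ls -d "$CRAWL"/segments/* | tail -1)   # newest segment
  bin/nutch fetch "$segment"
  bin/nutch parse "$segment"
  bin/nutch updatedb "$CRAWL/crawldb" "$segment"
done
```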
As
Hi Chip,
I would try running your scripts after setting the environment variable
NUTCH_HOME to your nutch/runtime/local directory.
On Wed, Jul 20, 2011 at 4:01 PM, Chip Calhoun ccalh...@aip.org wrote:
I've been working with $NUTCH_HOME/runtime/local/conf/nutch-site.xml, and
I'm pretty sure that's
the
google map js code in solr ?
Thanks again,
On Wed, Jul 20, 2011 at 1:51 PM, lewis john mcgibbney
lewis.mcgibb...@gmail.com wrote:
I don't know if you are still pursuing this, and as you haven't had any
response I will give some tips.
It sounds like you're using <= Nutch 1.2
: Merging segment data into db.
CrawlDb update: finished at 2011-07-21 12:28:04, elapsed: 00:00:01
On Wed, 2011-07-20 at 21:58 +0100, lewis john mcgibbney wrote:
There is no documentation for individual
Specifically I would mention that you would get community input if this
question was directed towards the Solr user list; however, I think you are
looking for the VelocityResponseWriter.
Have a search on the Solr wiki you will find info there.
In addition there are various other well
://evolvingweb.github.com/ajax-solr/
you gave me .
But I have some questions about that.
Where should I add the javascript code file ? Is it in some subdirectory
in apache-solr directory?
Can you explain a little bit more?
Thanks,
On Wed, Jul 20, 2011 at 2:28 PM, lewis john mcgibbney
Hi Alexander,
I don't want to state the obvious here but this will depend directly on what
type of loading your Nutch implementation deals with...
You are correct in stating that we store data in segments, namely
/crawl_fetch
/content
/crawl_parse
/parse_data
/crawl_generate
/parse_text
I
documents. Am I misremembering that there was a total file size value
somewhere in Nutch or Solr that needs to be increased?
-Original Message-
From: lewis john mcgibbney [mailto:lewis.mcgibb...@gmail.com]
Sent: Wednesday, July 20, 2011 5:23 PM
To: user@nutch.apache.org
Subject: Re
Hi Markus,
I follow you until the last parts of your comments.
'cope with non-edited'... edited by whom? And for what purpose? To give a
better relative tf score...
To comment on the first part, and please ignore or correct me if I am wrong,
but do we not give each page and therefore each
Hi Cheng Li,
Please experiment with this. We have been gradually getting the
PluginCentral section of the wiki updated as it needed a total face lift, so
we would appreciate any additional input you may have for updating the
WritingPluginExample which is already there. Apart from being completely out
Hi Marseld,
I'm just putting my thoughts out here, however Hadoop is not shipped with
Nutch 1.3 anymore, therefore I don't know where you would set this specific
property within your Nutch instances...
How are you running Hadoop?
What version of Nutch?
What mode are you running Nutch in?
On Tue,
, automatically or is there a command to do it?
Thanks
Chris
On 27 July 2011 17:14, lewis john mcgibbney lewis.mcgibb...@gmail.com
wrote:
Hi Alexander,
I don't want to state the obvious here but this will depend directly on
what
type of loading your Nutch implementation deals
Which version of Nutch are you using?
Is chat a plain text file, with URLs listed one per line? If this is the case
there is no need to add it to your crawl command. Additionally, there is no
point in trying to read what is happening in your crawldb if your generator
log output indicates that
Hi Kiks,
What kind of changes have you made to your schema when transferring to your
Solr instance?
You ask about the stored parsed text content; well, the default Nutch schema
sets this to stored=false as it is not always required for all
content to be stored. Generally speaking terms that
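As an illustration, the relevant line of the Nutch-shipped Solr schema.xml looks roughly like this; flipping stored to true keeps the parsed text retrievable (the field type name may differ by release):

```xml
<!-- Parsed page text; not stored by default to keep the index small. -->
<field name="content" type="text" stored="false" indexed="true"/>
```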
Sorry
http://wiki.apache.org/nutch/RunNutchInEclipse
On Wed, Aug 3, 2011 at 2:12 PM, Dr.Ibrahim A Alkharashi
khara...@kacst.edu.sa wrote:
thanks for the info, would you please post a pointer to the page.
Regards
Ibrahim
On Aug 3, 2011, at 3:13 PM, lewis john mcgibbney
lewis.mcgibb
Hi Zhanibek,
I would like to refer specifically to Markus' thread which he initiated a
short time ago [1] sharing close similarity to your own questions. I think
the main question to be answered now is how do we extract tf-idf from a
crawled website? And as we now refer to Nutch as an independent
Hi Alex,
Did you get anywhere with this?
What condition led to you seeing unknown host exception?
Unless segment gets corrupted, I would assume you could fetch again.
Hopefully you can confirm this.
On Tue, Aug 16, 2011 at 9:23 PM, alx...@aim.com wrote:
Hello,
After running bin/nutch fetch
Correct
There should be comprehensive documentation on the wiki for these parameters
(and many more)
On Fri, Aug 19, 2011 at 6:46 PM, Markus Jelsma
markus.jel...@openindex.io wrote:
addDays is not a crawl switch but a generator switch. You cannot use the
crawl
command.
But if I use
Hi
Small suggestion, but I do not see any -dir argument passed alongside your
initial invertlinks command. I understand that you have multiple segment
directories, which have been fetched over a recent number of days, and that
the output would also suggest the process was properly executed,
If you could post your crawldb dump then we could see the structure of your
crawldb and may be able to begin pinpointing the issue.
It should not be required for you to undertake another crawl after inverting
links for these URLs to be indexed when calling solrindex command... there
must be
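For context, the sequence under discussion is roughly the following (Nutch 1.3-era commands; the Solr URL and paths are illustrative):

```shell
bin/nutch invertlinks crawl/linkdb -dir crawl/segments
bin/nutch solrindex http://localhost:8983/solr/ \
    crawl/crawldb crawl/linkdb crawl/segments/*
```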
Hi
Can you explain how you tried to save raw html obtained during a crawl to a
local drive? I am not entirely sure what you mean here and why you would
want to do so given that we already have an array of alternative options
available. Can you please expand on this.
Thank you
On Wed, Aug 24,
Hi Adam,
My initial thoughts are that you are correct. It is very unusual for your
files to be located at a URL in the same domain which is not referenced by
the top-level or a subsequent-level URL within the domain.
What I would suggest is that you have a look through your hadoop.log as well
Hi JB,
We have recently finished a complete plugin tutorial which fully explains
the functionality of the urlmeta plugin on the wiki. It can be found here
[1]; could I ask you to have a thorough look at it and the code, and if you
still have questions then please reinforce them.
[1]
Apart from looking through the list archives, as far as I'm aware nothing has
been specifically documented on this topic.
In the mean time you may find this helpful
http://geekswithblogs.net/brcraju/articles/235.aspx
On Fri, Aug 26, 2011 at 9:22 AM, Kaiwii Ho kaiwi...@gmail.com wrote:
I'm gonna
If you only wish to serve crawls to that one page, I'm sure this could
easily be set up by writing a bash script specifying the -adddays argument
with your commands. This could then be set and run as a cron job.
Please someone correct me if I am wrong.
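As a hedged sketch of that suggestion (the script path, crawl paths, and the 31-day value are all assumptions):

```shell
#!/usr/bin/env bash
# recrawl.sh -- regenerate with -adddays so previously fetched URLs become
# eligible again, then fetch/parse/update. Example crontab entry (02:00 daily):
#   0 2 * * * cd /opt/nutch/runtime/local && ./recrawl.sh
bin/nutch generate crawl/crawldb crawl/segments -adddays 31
segment=$(ls -d crawl/segments/* | tail -1)
bin/nutch fetch "$segment"
bin/nutch parse "$segment"
bin/nutch updatedb crawl/crawldb "$segment"
```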
On Fri, Aug 26, 2011 at 10:22 PM, Radim
Hi,
As the title suggests, I'm in the process of getting some comprehensive
documentation sorted out for Nutch, this obviously starts at wiki level. I'm
currently working on the IndexStructure page [1]. I would appreciate it if
some guys could have a quick look and correct where they see fit.
In
Hi Gabriele, can you expand on your last comment... are you running in deploy
mode?
And to reply to your first point, yes you are correct, the FAQs need
extensive updating. Please feel free to change anything you feel necessary;
however, as a matter of retaining knowledge for the legacy of Nutch
Hi Zhao,
Do you have any more verbose log info from hadoop.log? I have never worked
with Nutch 0.9 but if you could at least indicate whether you get something
like
like
LOG: info Dedup: starting ... blah blah blah
Taking this to a larger context I am not particularly happy with the
verboseness of
If it complains about SSH errors then I would ensure that you are logged
into your SSH client e.g. ssh -v localhost, prior to executing any hadoop
scripts. This would make sense.
Further to this, unless you are actually experiencing Nutch related problems
on a pseudo or cluster setup then
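A quick check along those lines, assuming a standard pseudo-distributed setup:

```shell
# Confirm passwordless SSH to localhost works before starting Hadoop scripts.
ssh localhost 'echo ok'
```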