Hi Danicela,
Have a look here [1]. Although your problem is not directly linked to
fetching, the symptoms and the subsequent solution are the same.
Unfortunately this is quite a messy one but will hopefully get you going in
the right direction again.
[1]
Hi José,
If you look at what is generated when you have built Nutch using ant
runtime you will see correctly the runtime/local and runtime/deploy
folders. To run in deploy mode, it is necessary to specify all of your
nutch-site.xml (and any other configuration e.g. filters, plugins etc etc)
Hi Mina,
You can pick this type of stuff up more easily from the mailing lists [1]. It
might save you some time rather than waiting for some replies from folks.
Thanks
[1] http://lucene.472066.n3.nabble.com/SocketTimeoutException-td604882.html
On Mon, Dec 5, 2011 at 11:36 PM, mina
Not got a clue. One thing I must say is to be wary of any out-of-date
code with these books. When reading around I found the Lucene API to be
somewhat different and outdated. I am not saying that it is the same with
the book you quoted, but it definitely is with this one [1]. On the positive
side,
Thanks Tim.
In addition Chip, the tutorial has now been updated to include Tim's
comments and to cover latest Nutch 1.4.
Thanks
Lewis
On Wed, Dec 7, 2011 at 10:45 PM, Tim Pease tim.pe...@gmail.com wrote:
On Dec 7, 2011, at 3:17 PM, Chip Calhoun wrote:
This is probably just down to my not
Hi Remi Markus,
Yeah, I can replicate this, good catch Remi.
lewis@lewis-desktop:~/ASF/trunk/runtime/local$ bin/nutch
org.apache.nutch.net.URLFilterChecker
http://www.heraldscotland.com -filterName regex-urlfilter.txt
Checking combination of all URLFilters available
^Z
[2]+ Stopped
Hi,
Can anyone confirm if this is an issue?
If so I think we should log it before it goes unnoticed.
Thanks
Lewis
On Fri, Dec 9, 2011 at 3:21 PM, Lewis John Mcgibbney
lewis.mcgibb...@gmail.com wrote:
If you look at the output I posted, even when I specified a particular
filter, the checkAll
, December 13, 2011, Lewis John Mcgibbney
lewis.mcgibb...@gmail.com wrote:
Heres my output from URLFilterChecker [1]
lewis@lewis-01:~/ASF/trunk/runtime/local$ bin/nutch
org.apache.nutch.net.URLFilterChecker -filterName urlfilter-regex
Exception in thread main java.lang.RuntimeException: Filter
Hi Remi,
This is a compatibility issue with conflicting versions of Solrj [1]
[1]
http://lucene.472066.n3.nabble.com/Invalid-version-or-the-data-in-not-in-javabin-format-td1460495.html
On Wed, Dec 14, 2011 at 1:57 PM, remi tassing tassingr...@gmail.com wrote:
Hello guys,
After crawling with
Can anyone confirm this?
If this is the case it would be great to get it fired on to the wiki
for future reference.
Thank you
On Wed, Dec 14, 2011 at 10:48 PM, Whitman, John jwhit...@ea.com wrote:
Hi Lewis –
I believe I know what this issue was – I would bet that the user has set an
It looks like it's the parsing of these segments that is taking time... no?
On Thu, Dec 15, 2011 at 9:57 PM, Bai Shen baishen.li...@gmail.com wrote:
On Thu, Dec 15, 2011 at 12:47 PM, Lewis John Mcgibbney
lewis.mcgibb...@gmail.com wrote:
This is overwhelmingly weighted towards Hadoop
Hi Dean,
What version are you on?
On Sun, Dec 18, 2011 at 2:20 PM, Dean Pullen dean.pul...@gmail.com wrote:
(can't access work email, so posting via this account!)
I've tried absolutely everything to resolve this issue, and have scoured
the web over the weekend in an attempt to rectify this
Hi Patrick,
So you must remove the NUTCH_HOME/src which by default will be added
as the src folder.
Instead add
NUTCH_HOME/src/java
every occurrence of
NUTCH_HOME/src/plugin/plugin_name/src/java
NUTCH_HOME/src/plugin/plugin_name/src/test
then
NUTCH_HOME/src/test/
Hopefully you can follow the
Hi Mina,
Can you please check out the page now, I've edited this and would like
you to confirm this has been clarified.
Thank you
On Mon, Dec 26, 2011 at 9:08 PM, mina tahereganji...@gmail.com wrote:
I can solve this problem. I read the Nutch doc for solrindex in:
Hi,
I'm interested in crawling twitter feeds and haven't tried any
implementation yet. Does anyone know if this is possible? I haven't
seen anything on our archives to suggest that people are having
problems with this.
Thanks and happy NY to everyone when it comes around.
--
Lewis
Thanks guys.
All the best when the bells come around.
Lewis
On Fri, Dec 30, 2011 at 7:06 PM, Ken Krugler
kkrugler_li...@transpac.com wrote:
Note that for polite and efficient fetching, you want to resolve shortened
links first, then treat some set (e.g. over a 5-10 minute interval) as
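For illustration, one hedged way to resolve a shortened link to its final target before queueing it (assuming curl is available; the URL is illustrative only, not from the thread):

```shell
# Sketch: resolve a shortened link to its final target before fetching.
resolve() {
  # -s silent, -I HEAD only, -L follow redirects,
  # -o discard body, -w print the final effective URL
  curl -sIL -o /dev/null -w '%{url_effective}' "$1"
}

resolve "http://bit.ly/example"
```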
Hi Guys,
Just to confirm, this has been addressed and committed. You can see
the changes here [1]
[1] http://nutch.apache.org/old_downloads.html
On Tue, Nov 29, 2011 at 6:29 AM, Mattmann, Chris A (388J)
chris.a.mattm...@jpl.nasa.gov wrote:
Hi Tim,
Thanks for your message.
You can find
John Mcgibbney wrote:
Hi Dean,
Depending on the size of the segments you're fetching, in most cases I
would advise you to separate out fetching and parsing into individual
steps. This becomes self-explanatory as your segments increase in size
and the possibility of something going wrong
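A rough sketch of separating the two steps (commands are from the Nutch 1.x bin/nutch script; the paths and segment selection are illustrative assumptions):

```shell
# Generate a fetch list and pick the newly created segment.
bin/nutch generate crawl/crawldb crawl/segments
SEGMENT=crawl/segments/$(ls -t crawl/segments | head -1)

# Fetch without parsing...
bin/nutch fetch $SEGMENT -noParsing
# ...then parse as a separate step, so a parser failure
# does not cost you the fetched content.
bin/nutch parse $SEGMENT
bin/nutch updatedb crawl/crawldb $SEGMENT
```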
Hi Dean,
Without discussing any of your configuration properties can you please try
6) MERGE SEGMENTS:
/opt/nutch_1_4/bin/nutch mergesegs
/opt/nutch_1_4/data/crawl/MERGEDsegments/ -dir
/opt/nutch_1_4/data/crawl/segments/* -filter -normalize
paying attention to the wildcard /* in -dir
:
LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input Pattern
file:/opt/nutch_1_4/data/crawl/segments/*/parse_data matches 0 files
And yes, your assumption was correct - it's a different segment directory
each loop.
Many thanks,
Dean.
On 06/01/2012 15:43, Lewis John Mcgibbney
that the directories exist after fetching and
parsing?
On Fri, Jan 6, 2012 at 4:24 PM, Dean Pullen dean.pul...@semantico.com wrote:
Good spot because all of that was meant to be removed! No, I'm afraid that's
just a copy/paste problem.
Dean
On 06/01/2012 16:17, Lewis John Mcgibbney wrote:
Ok then,
How
on ONE crawl also removes the
parse_data dir etc!
Dean.
On 06/01/2012 16:28, Lewis John Mcgibbney wrote:
How about merging segs after every subsequent iteration of the crawl
cycle... surely this is a problem with producing the specific
parse_data directory. If it doesn't work after two
/nutch_invertlinks
(However it is in the solrindex docs)
Adding it makes no difference to invertlinks.
I think the problem is definitely with mergesegs, as opposed to
invertlinks etc.
Thanks again,
Dean.
On 06/01/2012 17:53, Lewis John Mcgibbney wrote:
OK so now I think we're at the bottom of it. If you
/2012 14:26, Dean Pullen wrote:
No Lewis, -linkdb was already been used for the solrindex command, so we
still have the same problem.
Many thanks,
Dean
On 08/01/2012 14:08, Lewis John Mcgibbney wrote:
Hi Dean, is this sorted?
On Saturday, January 7, 2012, Dean Pullendean.pul
How are you running Nutch, in local or deploy mode? Which Hadoop version
are you using, 0.20.2? This appears to be an open issue with this
version [1].
Also please have a look here [2] for a similar frustrating situation.
[1] https://issues.apache.org/jira/browse/HADOOP-6958
[2]
the update to solr when SolrIndexer runs.
Matt Wilson
-Original Message-
From: lewis john mcgibbney [mailto:lewis.mcgibb...@gmail.com]
Sent: Monday, September 26, 2011 3:04 PM
To: user@nutch.apache.org
Subject: Re: Indexing specific metadata tags with urlmeta
Hi
/browse/NUTCH-809),
hope this helps.
On 11.01.2012 22:44, Dean Del Ponte wrote:
Thank-you for your response.
My goal is to get Nutch to index meta tags. It's been quite an adventure
so far!
On Wed, Jan 11, 2012 at 3:30 PM, Lewis John Mcgibbney
lewis.mcgibb...@gmail.com wrote:
Hi Dean
Hi Remi,
WRT fixing Nutch 1.2 I can't comment; this version is no longer supported
or actively maintained. However, please keep an eye on the issue (and
related issues mentioned on the NUTCH-566 thread) and you may
be able to back port some of the changes to Nutch 1.2 (fingers
Is it possible for you to fetch smaller segments, parse them, then merge
incrementally rather than attempting to merge several larger segments at
once?
Are you getting any IO problems when parsing the segments? If so this may
be an early warning light to attack the problem from another angle.
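The incremental approach above might look like the following sketch (Nutch 1.x commands; the loop count, -topN value, and paths are illustrative assumptions, not from the thread):

```shell
# Fetch and parse small segments one at a time, merging as you go,
# rather than merging several large segments at once.
for i in 1 2 3; do
  bin/nutch generate crawl/crawldb crawl/segments -topN 1000
  SEGMENT=crawl/segments/$(ls -t crawl/segments | head -1)
  bin/nutch fetch $SEGMENT -noParsing
  bin/nutch parse $SEGMENT
  bin/nutch updatedb crawl/crawldb $SEGMENT
  # Merge everything fetched so far into a fresh merged segment.
  bin/nutch mergesegs crawl/MERGEDsegments -dir crawl/segments
done
```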
On
Hi Bowen,
I completely agree with Chris' comments; there have been a few guys popping
up from time to time asking about ES, therefore any contrib in this area
would be excellent.
In the meantime I'll check your code out on Github.
Thanks for letting us in the loop.
Lewis
On Fri, Jan 13, 2012
Is anyone planning on heading to Berlin Buzzwords this year?
I missed last year, but was fortunate enough to catch up with a lot of the
stuff online. Even if I don't get the opportunity to get something prepared, I
would still really like to make the event.
Anyone?
On Fri, Jan 13, 2012 at 9:33 AM,
Hi Vijith,
We are happy to help however you need to be more specific in terms of what
you want to achieve?
What are you required to obtain from the Web or some domain...
What nature of Nutch installation...
WRT restricting/focussing your crawl, what for?...
Thanks
On Fri, Jan 13, 2012 at 10:45
I haven't been working on this, but how does your schema configure these
fields? Have you configured it to store and index the new metadata
field(s)? Also you may wish to set it to some kind of custom setting via
conf/solr-mapping.xml
Only thoughts so please ignore if out of context.
Lewis
On
:19 PM, Lewis John Mcgibbney
lewis.mcgibb...@gmail.com wrote:
Hi Bowen,
I completely agree with Chris' comments, there have been a few guys
popping up from time to time asking about ES therefore any contrib in this
area would be excellent.
In the meantime I'll check your code out on Github
Mmmm, I am not using Nutch on Windows at all, generally don't know too much
about configuring Cygwin, and really hope there is some more help out there.
The main problem here seems to be that the relative path to
/cygdrive/c/server/nutch/urls is not being interpreted correctly.
You mention
<meta name="keywords" content="plugin" />
So I believe giving a query for 'plugin' should give me this page in
results. (the page content is nothing related to plugins)
Please correct me if I am wrong.
On Fri, Jan 13, 2012 at 6:09 PM, Lewis John Mcgibbney
lewis.mcgibb...@gmail.com wrote:
I
Hi Dennis,
Would it be possible for you to open an issue on our Jira as this sounds
like we need to document and catch it.
Thanks very much for reporting.
Kind Regards
Lewis
On Tue, Jan 17, 2012 at 3:16 PM, Dennis Spathis dspat...@gmail.com wrote:
Hi,
The Nutch 1.4 distribution includes
Hi Remi,
This also looks like we need to document and address it.
Can you log a Jira issue and we will try to get on to it. Can you also have
a look through some of the existing issues in case there is something
similar, possibly relate them.
Thank you in advance
Lewis
On Tue, Jan 17, 2012 at
, 2012 at 6:01 PM, Lewis John Mcgibbney
lewis.mcgibb...@gmail.com wrote:
Hi Vijith,
We are happy to help however you need to be more specific in terms
of
what you want to achieve?
What are you required to obtain from the Web or some domain...
What nature of Nutch
Well this might be the file that you edit if you are using the
urlfilter-automaton plugin for urlfiltering.
Markus was indicating that you may wish to begin by looking at
urlfilter-regex and subsequently regex-urlfilter.txt; I would go as far as to
say that this is the most commonly implemented
It depends where you are wanting to remove the urls from... your Nutch
crawldb or your Solr index?
We offer and maintain quite a number of tools to enable you to maintain a
healthy crawldb, e.g. purge, filtering, etc. We also maintain some tools to
help you maintain your Solr index, e.g. delete
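As a hedged sketch of those two routes (flag names are from Nutch 1.x and the Solr 3.x XML update handler of that era; paths, core URL, and the query are illustrative assumptions):

```shell
# CrawlDb side: re-filter the db through your current URL filters,
# writing a cleaned copy alongside the original.
bin/nutch mergedb crawl/crawldb_filtered crawl/crawldb -filter

# Solr side: delete documents directly by query.
curl "http://localhost:8983/solr/update?commit=true" \
  -H "Content-Type: text/xml" \
  --data-binary "<delete><query>url:*example.com*</query></delete>"
```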
:00 PM, Lewis John Mcgibbney
lewis.mcgibb...@gmail.com wrote:
It depends where you are wanting to remove the urls from... your Nutch
crawldb or your Solr index?
We offer and maintain quite a number of tools to enable you to maintain a
healthy crawldb e.g. purge, filtering, etc, we also
Hi Waleed,
Can you please read through the following section on our wiki and post a
more verbose description of your problem, I'm struggling to help at the
moment due to lack of information
http://wiki.apache.org/nutch/Becoming_A_Nutch_Developer#Becoming_a_Nutch_Developer
On Wed, Jan 18, 2012
, Jan 19, 2012 at 8:26 PM, remi tassing tassingr...@gmail.com wrote:
The main purpose is to remove urls matching a certain pattern from the
Nutch segments(or database).
Remi
On Thursday, January 19, 2012, Lewis John Mcgibbney
lewis.mcgibb...@gmail.com wrote:
Maintenance tool for what? You
in accordance with release
- 1.4 (which i am using).
Don't know whether I did it right, but it works with 1.4. Here is the new
patch file attached.
On Wed, Jan 18, 2012 at 3:34 PM, Lewis John Mcgibbney
lewis.mcgibb...@gmail.com wrote:
Apply the patch from $NUTCH_HOME
Use 828-2 (newest by date
I believe some people have recently had success using 809, and Elisabeth
has done an excellent job of providing comprehensive documentation which
you will find linked to from the 809 issue. I can't comment fully on which
one to use over the other but 809 seems to be working for people.
Thanks
On
I'm not sure if I'm understanding you here. You are not wanting to index
the documents, but merely wanting to have stored documents in your hard
disk? What is the reasoning behind this?
Thanks
On Fri, Jan 20, 2012 at 9:48 AM, Adriana Farina
adriana.farin...@gmail.comwrote:
I forgot to write
Hi Marek,
What happens with the data in the segments? I guess the data for the
crawled URLs is still in the segments after filtering them out of the
crawldb, isn't it?
Well additionally you can merge several segments and again use urlfilters
to get rid of urls you don't wish to have. Again
Hi Marek,
If you have a look at the FetchSchedule class, you'll see that fetchTime
relates to the date and time when a page was successfully fetched;
subsequently we configure fetchInterval to determine when we wish to
revisit the successfully fetched URL.
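For reference, the revisit interval is driven by a nutch-site.xml override along these lines (property name is from the Nutch 1.x defaults; the value is in seconds, and 2592000 is the shipped 30-day default):

```xml
<!-- Sketch (nutch-site.xml): how long to wait before refetching a page. -->
<property>
  <name>db.fetch.interval.default</name>
  <value>2592000</value>
</property>
```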
hth
On Fri, Jan 20, 2012 at 4:07 PM,
On Fri, Jan 20, 2012 at 5:10 PM, Marek Bachmann m.bachm...@uni-kassel.dewrote:
Hello again,
I was inspecting the generator because it doesn't deliver all URLs for the
fetch list from the crawldb even if I set the addDays attribute to a value
much higher than the max fetch interval.
How
I think it's been updated to work with trunk, so unless you can back port it
(if there have been changes to the codebase) it will not work.
That being said, I haven't used it, so trying won't hurt.
Thanks
On Fri, Jan 20, 2012 at 2:16 PM, abhayd ajdabhol...@hotmail.com wrote:
Thanks Lewis
Does
on a hard disk, but
at the moment I can't figure out what I could do.
Can you help me?
Thank you very much!
2012/1/20 Lewis John Mcgibbney lewis.mcgibb...@gmail.com
I'm not sure if I'm understanding you here. You are not wanting to index
the documents, but merely wanting to have stored
The best method is to read or dump the contents of your crawldb and work
based on this.
Please have a look on the wiki for using the readdb tool.
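A hedged sketch of the readdb tool mentioned above (flags are from the Nutch 1.x bin/nutch script; the crawldb path and lookup URL are illustrative):

```shell
# Summary statistics: counts per status (fetched, unfetched, gone...).
bin/nutch readdb crawl/crawldb -stats
# Dump every entry to plain text for offline inspection.
bin/nutch readdb crawl/crawldb -dump crawldb_dump
# Look up a single URL's record.
bin/nutch readdb crawl/crawldb -url http://www.example.com/
```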
On Sun, Jan 22, 2012 at 10:51 AM, Sameendra Samarawickrama
smsa...@googlemail.com wrote:
Hi,
I am using Nutch to generate a small dataset of web;
Hi Ian,
What fetching depth are you using?
Lewis
On Mon, Jan 23, 2012 at 7:46 AM, Ian Piper ianpi...@tellura.co.uk wrote:
Hi all,
I'd appreciate some guidance... can't seem to find much useful stuff on
the web on this. I have set up a Nutch and Solr service that is crawling a
client's
Hi Kaveh,
I'm not sure if your problem is the same at all.
Your problem stems from the solr mapping configuration used by
AnchorIndexingFilter in the index-anchor plugin.
If this works properly then you should see a list of all of the source ->
destination field mappings; this unfortunately is
Hi Remi,
1. Are the segments backward compatible? I tried updatedb but I get
skipping invalid segment
In all honesty I've not tried this!
Is it possible to use readseg -dump to get a text file then use freegen to
generate new segments to fetch???
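That idea might be sketched as follows (readseg and freegen both exist in the Nutch 1.x bin/nutch script; the segment name, paths, and the dump-parsing step are illustrative assumptions and depend on the exact dump format):

```shell
# Dump the old segment's records to text (skip content/parse data)...
bin/nutch readseg -dump crawl/segments/20111201000000 segdump -nocontent -noparse
# ...extract the URLs into a seed file (dump format dependent)...
grep '^URL::' segdump/dump | sed 's/^URL:: //' > urls/seeds.txt
# ...then generate a fresh fetchable segment directly from that list.
bin/nutch freegen urls crawl/segments
```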
2. With the same configuration, it
-02-01 16:05:52,927 WARN mapred.LocalJobRunner - job_local_0006
java.lang.NumberFormatException: null
at java.lang.Integer.parseInt(Integer.java:417)
at java.lang.Integer.parseInt(Integer.java:499)
...
what could be the problem ??
On Fri, Jan 20, 2012 at 4:38 PM, Lewis John Mcgibbney
Hi,
Off-list, Chris and I have briefly discussed the possibility of hosting a
competition between teams of PG students from the universities here in
Glasgow and @ USC. The idea is to have a competition aimed at creating a
new Nutch 2.0 webapp taking into consideration the Jira tickets [1] [2].
Looks like you're using an old version of Nutch here.
Please try upgrading to 1.4, Dean
hth
On Thu, Feb 2, 2012 at 5:22 PM, Dean Pullen dean.pul...@semantico.comwrote:
What I see in logs/userlogs/myfetchjobxx/syslog is:
2012-02-02 17:15:25,045 INFO org.apache.nutch.fetcher.Fetcher: fetch of
There's no log files attached
On Fri, Feb 3, 2012 at 10:06 AM, tiagorcs dasilva-ti...@mitsue.co.jpwrote:
Forgot to mention I am using Nutch 1.4, and that I have no problems with
the
exact same setup for Nutch 1.3.
--
View this message in context:
Hi,
On Fri, Feb 3, 2012 at 10:33 AM, Vijith vijithkv...@gmail.com wrote:
OK. It worked. There were some typos in the patch code. Also I
forgot to change the 'fetcher.parse' property to true.
I will attach the updated patch to the issue once I have a complete check.
Great
Also
Is
Hi Marek,
I really don't think so. We stripped all of the Lucene stuff @ 1.3 as you
know. There was however an interesting thread initiated by Adriana [1]
which began down the same route here...
[1] http://www.mail-archive.com/user@nutch.apache.org/msg05268.html
On Fri, Feb 3, 2012 at 6:55 PM,
Hey Peyman,
Do you care to discuss your experiences using Solandra, what was required
to get a Nutch -> Solandra workflow working? This is also a module I think
would be great in Gora and pluggable to Nutch trunk/nutchgora.
Thanks for any comments.
Lewis
On Wed, Feb 8, 2012 at 3:43 PM, Peyman
Hi,
On Thu, Feb 9, 2012 at 7:26 AM, Sudip Datta pid...@gmail.com wrote:
While, this indicates that a reattempt will be made in 1 day, the
'url' never really gets the state db_fetched. On the other hand, if I
set generate.max.count = -1, the page is indeed crawled but the crawl
is
In all honesty this is strange. We can assure you that 1.4 DOES work for
protocol-http!
Any cygwin users out there that can lend a hand?
On Mon, Feb 6, 2012 at 4:37 AM, tiagorcs dasilva-ti...@mitsue.co.jp wrote:
Also, this is what I got inside my *plugins* folder
creativecommons
I see a Jira ticket coming up here ...
I'll open one up.
Thanks
Lewis
On Sat, Feb 11, 2012 at 10:58 PM, Markus Jelsma mar...@apache.org wrote:
The xsl, xsd and dtd files are not used by Nutch anymore.
Hi,
When specifying configurations for Hadoop, we are actually for using
:
Is it really worth bothering?
On 12 February 2012 17:04, Lewis John Mcgibbney
lewis.mcgibb...@gmail.comwrote:
I see a Jira ticket coming up here ...
I'll open one up.
Thanks
Lewis
On Sat, Feb 11, 2012 at 10:58 PM, Markus Jelsma mar...@apache.org
wrote:
The xsl, xsd and dtd
Hi Puneet,
On Tue, Feb 14, 2012 at 5:12 AM, Puneet Pandey puneet...@gmail.com wrote:
I have started using nutch recently.
As I understand nutch crawling is a cyclic process
inject-generate-fetch-parse-update
Yes this is typically what you would execute.
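One iteration of that cycle might be sketched as follows (Nutch 1.x commands; the crawl/ paths and segment selection are illustrative assumptions):

```shell
bin/nutch inject crawl/crawldb urls            # seed the db (first run only)
bin/nutch generate crawl/crawldb crawl/segments
SEGMENT=crawl/segments/$(ls -t crawl/segments | head -1)
bin/nutch fetch $SEGMENT
bin/nutch parse $SEGMENT
bin/nutch updatedb crawl/crawldb $SEGMENT      # fold results back into the db
```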
1. When does parse start when I
Hi,
On Tue, Feb 14, 2012 at 3:13 PM, conta...@complexityintelligence.comwrote:
IMHO the Nutch ecosystem requires a more integrated
vision ;)
In what respect? Can you please be more verbose?
Thanks
Lewis
Hi Danicela,
Before I try this, have you configured any other overrides for generating
or fetching in nutch-site.xml?
Thanks
On Tue, Feb 14, 2012 at 3:10 PM, Danicela nutch danicela-nu...@mail.comwrote:
Hi,
I have in my nutch-site.xml the value fetcher.max.crawl.delay = -1.
When I try
John Mcgibbney wrote:
Hi,
On Wed, Feb 8, 2012 at 8:44 AM, Michael Kazekin
Michael.Kazekin@mediainsight.info michael.kaze...@mediainsight.info
wrote:
I tried your solution and got rid of the "doesn't claim to support
contentType" error indeed.
Maybe we should submit a patch for this indeed
IIRC the timestamp represents when the page was last fetched. Yes, you should
be able to specify this value in your schema and get it mapped to the Solr
index. Last-Modified is when the actual page was last modified, e.g. when
there was a change to the page source or something.
On Wed, Feb 15, 2012 at 1:26
Hi Remi,
On Wed, Feb 15, 2012 at 1:51 PM, remi tassing tassingr...@gmail.com wrote:
Thanks for the clarification!
nb
For tstamp, I can actually see it in Solr results (even though the format
is weird)
what is the format?
How can I get Last-Modified value in Solr as well? Does Nutch
Hi,
On Wed, Feb 15, 2012 at 4:00 PM, remi tassing tassingr...@gmail.com wrote:
tstamp shows a string of digits like 20020123123212
This is OK: yyyy-mm-dd-hh-mm-ssZ. It is, however, hellishly old!
Never heard of the plugin index-more and it's poorly documented.
Well it's been included in
On Wed, Feb 15, 2012 at 9:08 AM, Danicela nutch danicela-nu...@mail.comwrote:
I don't think I configured such things, how can I be sure ?
- Message d'origine -
De : Lewis John Mcgibbney
Envoyés : 14.02.12 19:18
À : user@nutch.apache.org
Objet : Re: fetcher.max.crawl.delay = -1
Yes. You are correct. Is it possible for me to add you to the wiki admin
group and you could update this for us?
It is a long outstanding task...
If you are OK to do this, then please register a username with the wiki and
I'll get you added.
Thank you in advance
Lewis
On Thu, Feb 16, 2012 at
Hi,
Gora has now graduated to TLP so change your SVN url accordingly.
Do you really want to be working from inside eclipse? Why not just operate
from the cmdline?
If you have cassandra installed it will be more straightforward to work from
the cmdline.
On Fri, Feb 17, 2012 at 4:52 AM, apachenutch
Hi,
I'm afraid you need to be more verbose about where and when you are using
this from. It looks like this is taken from the cassandra CLI?
This tutorial may be somewhat dated; it also only covers Nutchgora/Gora/
Cassandra use within the Eclipse IDE, which was required at that time because
Gora was
Hi,
cassandra-mapping.xml should already be in your $nutchgora/conf directory,
as per here [1]. When you build the project it will be copied over to
runtime/local/conf and will then be on your classpath.
On Fri, Feb 17, 2012 at 6:53 PM, apachenutch poojasw...@gmail.com wrote:
/Please place
Hi Hadi,
Please see here
http://wiki.apache.org/nutch/Becoming_A_Nutch_Developer#Becoming_a_Nutch_Developer
If you wish to post back with your question we will try and help.
Thanks
Lewis
On Sat, Feb 18, 2012 at 1:37 PM, hadi md.anb...@gmail.com wrote:
I have problem with some pdf,when i
Hi Hadi,
On Sat, Feb 18, 2012 at 1:05 PM, hadi md.anb...@gmail.com wrote:
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: java.io.IOException: Job failed!
at
Can you please paste how you have specified your Hadoop temp dir? This
seems to be the cause of such stack trace errors.
Thanks
On Sun, Feb 19, 2012 at 7:04 AM, hadi md.anb...@gmail.com wrote:
yes, there is a hadoop log:
I searched for this error but everyone says this error is about low space
Hi,
On Sun, Feb 19, 2012 at 4:23 PM, apachenutch poojasw...@gmail.com wrote:
I see the web page keyspace created, but I dont see any records after
configuring the crawler :(
How are you verifying this? Please be as verbose as possible when
discussing Nutchgora branch.
I performed a nutch
Hi,
On Sun, Feb 19, 2012 at 8:44 PM, apachenutch poojasw...@gmail.com wrote:
ERROR 12:37:59,770 Fatal configuration error
org.apache.cassandra.config.ConfigurationException: localhost/
127.0.0.1:7000
is in use by another process. Change listen_address:storage_port in
cassandra.yaml to
So maybe try hacking CrawlDb#createJob() so that when you create the new
NutchJob object you pass in the uri parameter:
JobConf job = new NutchJob(uri, config); as suggested in the thrown
stack trace.
Please get back to us with results. I've not been using anything like
Amazon EMR and would
Hi,
Can you post your nutch-site.xml and I will give it a spin.
Thank you
Lewis
On Thu, Feb 23, 2012 at 5:07 AM, xuyuanme xuyua...@gmail.com wrote:
Just checked the latest code in 1.4 but it's the same. See code line 138 in
below link:
doc.removeField("title"), just before
doc.add("title", result.group(1));.
Should a bug be opened, or am I misunderstanding the function of this
plugin?
ShlomiJ
On Sun, Feb 19, 2012 at 2:06 PM, Lewis John Mcgibbney
lewis.mcgibb...@gmail.com wrote:
Hi Shlomi,
On Sun, Feb 19, 2012 at 10:15 AM
would usually not set it so)
Lewis
On Thu, Feb 23, 2012 at 3:18 PM, Lewis John Mcgibbney
lewis.mcgibb...@gmail.com wrote:
Additionally in your nutch-site.xml we don't maintain any query-(plugins),
and there is no parse-text plugin either.
On Thu, Feb 23, 2012 at 3:13 PM, Lewis John Mcgibbney
On Thu, Feb 23, 2012 at 1:59 PM, Lewis John Mcgibbney
lewis.mcgibb...@gmail.com wrote:
Hi Christopher,
I don't think Nutch 1.2 could be used with a Solr server running on basic
https authentication.
Markus committed a nice section of work which addresses this in 1.3 IIRC,
or
maybe 1.4
Hi,
On Thu, Feb 23, 2012 at 7:27 PM, Christopher Gross cogr...@gmail.comwrote:
Unless -- is 1.2 able to crawl https sites? If it can't do that then
I may have to upgrade
You should be able to get https sites yes, however I'm not overly familiar
with the protocol-httpclient plugin.
If
Hi James,
On Fri, Feb 24, 2012 at 2:47 PM, Spadez james_will...@hotmail.com wrote:
However, having found nutch, it seems like this might be something worth
looking at. Firstly, is Nutch simply a web scraper or does it integrate
other aspects of Lucene as well? I'm wondering if I would need to
Mmmm... this is really a Tika question, which probably explains why you have
received very little response from the community, unfortunately.
So the problem is that you are always getting back isEmpty indicating that
_nothing_ is being produced as an output from your parser.
I would add in a try
Can you please provide one such URL so I can try.
Thanks
On Tue, Feb 28, 2012 at 9:02 AM, remi tassing tassingr...@gmail.com wrote:
Same question here...
I have similar issues where (redirection)links are given through JavaScript
I hope I haven't hijacked your post as I see these issues
Tiny chunk of info on this topic
https://developers.google.com/webmasters/ajax-crawling/
On Tue, Feb 28, 2012 at 9:39 AM, Lewis John Mcgibbney
lewis.mcgibb...@gmail.com wrote:
Can you please provide one such URL so I can try.
Thanks
On Tue, Feb 28, 2012 at 9:02 AM, remi tassing tassingr
As far as I know, Elisabeth Adler contributed a patch exactly for this on
NUTCH-585 [0].
If you wish to get cracking with it please check out the latest trunk code
[1] patch it using the blacklist_whitelist_plugin.patch Elisabeth attached
to the issue.
Would be excellent if you could provide
issue is related to this specific site
http://www.accessdata.fda.gov/scripts/cder/drugsatfda/index.cfm?fuseaction=Search.SearchResults_BrowseDrugInitial=B
lewis john mcgibbney wrote
I've checked working with redirects and everything seems to work fine for
me.
The site I checked
Hi James,
Your seed URLs are more than likely being filtered out by
your settings in conf/regex-urlfilter.txt. Have a good read through the
urlfilter documentation [0] and the basic examples that are provided in other
urlfilters; also it might help to do a bit of reading regarding
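For illustration, entries in conf/regex-urlfilter.txt look roughly like this (the domain is an illustrative assumption; the suffix rule and catch-all follow the shape of the shipped defaults):

```
# Rules are applied top-down; the first matching rule wins.
# Skip URLs with common non-page suffixes.
-\.(gif|jpg|png|css|js|zip|gz)$
# Accept anything within the target domain...
+^http://([a-z0-9]*\.)*example.com/
# ...and reject everything else.
-.
```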
What makes you think that?
On Fri, Mar 2, 2012 at 12:07 PM, James Ford simon.fo...@gmail.com wrote:
But it seems that the solution to my problem is to set
db.max.outlinks.per.page to 0?
Bearing in mind that it makes it pretty difficult to provide help if this is
not mentioned initially.
Hi Christopher,
It appears that the page is being fetched successfully. What is not
successful is the parser obtaining the page content... these fields appear
to be returning empty values when, as you have stated, this is not the case.
How large is the page content? Does your http.content.limit