strange, check if text/html is mapped to parse-tika or parse-html in
parse-plugins.xml. You may also want to check tika's plugin.xml; it must be
mapped to * or a regex of content types.
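For reference, a mapping in parse-plugins.xml looks roughly like this (a sketch;
use parse-html or parse-tika as the plugin id, whichever should handle HTML):

  <mimeType name="text/html">
    <plugin id="parse-html" />
  </mimeType>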
-Original message-
> From:Sudip Datta
> Sent: Thu 12-Jul-2012 20:36
> To: user@nutch.apache.org
>
Great!
Thanks Lewis
-Original message-
> From:lewis john mcgibbney
> Sent: Tue 10-Jul-2012 17:01
> To: user@nutch.apache.org; annou...@apache.org; d...@nutch.apache.org
> Subject: [ANNOUNCEMENT] Apache Nutch v1.5.1 Released
>
> Good Afternoon Everyone,
>
> The Apache Nutch PMC are
Hello,
The index-more plugin might run after your custom plugin. You can configure the
order in which plugins are run. Please consult the indexingfilter.order
directive's description in conf/nutch-default.xml.
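As a rough sketch, the property in nutch-site.xml could look like this
(BasicIndexingFilter and MoreIndexingFilter are stock filters;
com.example.MyIndexingFilter stands in for your custom plugin, adjust the order
to taste):

  <property>
    <name>indexingfilter.order</name>
    <value>org.apache.nutch.indexer.basic.BasicIndexingFilter
           org.apache.nutch.indexer.more.MoreIndexingFilter
           com.example.MyIndexingFilter</value>
  </property>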
Cheers,
-Original message-
> From:Jim Chandler
> Sent: Thu 05-Jul-2012
with any reasonable input... sorry.
> Lewis
>
> On Thu, Jul 5, 2012 at 8:51 AM, Markus Jelsma
> wrote:
> > Any ideas?
> >
> >
> >
> > -Original message-
> >> From:Markus Jelsma
> >> Sent: Mon 02-Jul-2012 23:05
> >> To:
Any ideas?
-Original message-
> From:Markus Jelsma
> Sent: Mon 02-Jul-2012 23:05
> To: user@nutch.apache.org
> Subject: Adaptive scheduling, but different
>
> Hi,
>
> We use an adaptive scheduler for our crawl, this works fine for most cases
> but a specific type of page is crawled
You can try the fetch filter:
https://issues.apache.org/jira/browse/NUTCH-828
-Original message-
> From:shekhar sharma
> Sent: Tue 03-Jul-2012 06:42
> To: user@nutch.apache.org
> Subject: Filtering pages during crawling
>
> Hello,
> Is it possible to define a filtering condition in Nu
Not so odd after all. I should have known it started the reducer at that time,
silly me. The parse went perfectly fine in 42 minutes. The problem lies in your
regex.
Cheers
-Original message-
> From:sidbatra
> Sent: Tue 03-Jul-2012 00:18
> To: user@nutch.apache.org
> Subject: RE: Pa
I've modified the parser to log long running records and ran your segment.
There are quite a few records that run for more than a second on one machine
with 2x2.4GHz CPU. Unfortunately, it doesn't show me the record it's waiting for.
I output a record prior to parsing and after parsing with elapsed
Regex order matters. Happy to hear the results.
Considering your hardware you should parse this number of pages in less than an
hour. And you should decrease your mapper/reducer heap size significantly; it
doesn't take 4G of RAM. 1G for the mapper and 500M for the reducer is safe enough. You can then
allocate
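For what it's worth, a sketch of those limits in mapred-site.xml, assuming a
Hadoop 1.x style setup (property names differ on newer Hadoop versions):

  <property>
    <name>mapred.map.child.java.opts</name>
    <value>-Xmx1024m</value>
  </property>
  <property>
    <name>mapred.reduce.child.java.opts</name>
    <value>-Xmx512m</value>
  </property>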
Hi,
We use an adaptive scheduler for our crawl; this works fine for most cases but
a specific type of page is crawled more often than it should. These are usually
news or article archives such as news/archive/12345. Most websites generate
these pages dynamically. The problem is that whenever a
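For context, the adaptive scheduler mentioned above is enabled with properties
along these lines in nutch-site.xml (names as in nutch-default.xml; the interval
values are only illustrative):

  <property>
    <name>db.fetch.schedule.class</name>
    <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
  </property>
  <property>
    <name>db.fetch.schedule.adaptive.min_interval</name>
    <value>3600</value>
  </property>
  <property>
    <name>db.fetch.schedule.adaptive.max_interval</name>
    <value>2592000</value>
  </property>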
You already have that rule configured? Is it one of the first simple
expressions you have? How many records are you processing each time, is it
roughly the same for all segments? And are you running on Hadoop or pseudo or
local?
-Original message-
> From:sidbatra
> Sent: Mon 02-Jul
Hi
The log output doesn't tell you what the task is actually doing; it is only
Hadoop output and initialization of the URL filters. There should be no real
problem with the parser job and URL filter programming in Nutch; we crawl large
parts of the internet but the parser never stalls, at least
Check your Solr log. It's likely to trip over the absence of the versioning
field.
-Original message-
> From:Daniel
> Sent: Mon 02-Jul-2012 15:05
> To: user@nutch.apache.org
> Subject: Solr 4.x and Nutch 1.5
>
> Hey,
>
> i have Solr 4.0 (Nightly-Build) and Nutch 1.5
> And if i go t
We have done that too. The biggest problems are not having a reliable
lastModified date, and indeed inlinks, and not knowing whether the document has
changed. The inlink problem can be solved with the new Solr update semantics
where partial updates are possible.
-Original message-
> From
It's a use case for a fetch filter:
https://issues.apache.org/jira/browse/NUTCH-828
-Original message-
> From:Alexander Aristov
> Sent: Sun 01-Jul-2012 20:43
> To: user@nutch.apache.org; safdar.kurei...@gmail.com
> Subject: Re: Language-focused crawling
>
> Hi
>
> First of all you u
The API changed a bit with NUTCH-1230. What version are you using?
-Original message-
> From:Jim Chandler
> Sent: Wed 27-Jun-2012 20:45
> To: user@nutch.apache.org
> Subject: NoSuchMethodError
>
> Greetings,
>
> I am trying to create a plugin similar to Index-More which uses MimeUti
>
>
> 5. Add the following lines to runtime/local/conf/nutch-site.xml
>
>
> <property>
>   <name>tika.boilerpipe</name>
>   <value>true</value>
> </property>
>
> Thanks again!
>
> Cheers,
etConf().get("tika.boilerpipe.extractor", "ArticleExtractor"
>
> Still, I am unsure where to specify these variables. Instead I added the
> following lines to the java code (and commented the previous lines):
>
> boolean useBoilerpipe = true;
> String b
Hi René,
It seems NUTCH-961-1.5-1.patch doesn't apply cleanly to the finally released
1.5 at all; TikaParser.java has changed a bit between the patch and the
release of 1.5. Did you resolve the failed hunks? If so, are you sure Tika is
being used for (x)html pages? Nutch by default uses the o
h-1.x/
> in the package (src and bin). That's cosmetic but not blocking. Also:
> permissions of bin/nutch should be 755 (exec bits should be set).
>
> Beside: Runs (tested local mode only).
>
> Sebastian
>
> On 06/26/2012 06:32 PM, Markus Jelsma wrote:
> > Thi
t;extract here" from the file menu
>
> Not a blocker IMHO
>
> On 26 June 2012 08:04, Markus Jelsma wrote:
>
> > Hi,
> >
> > It builds and runs smoothly but there's something that didn't catch my eye
> > with 1.5 since i then used a GUI to unpack t
Hi,
It builds and runs smoothly but there's something that didn't catch my eye with
1.5 since I then used a GUI to unpack the src file: the src and bin packages
decompress everything into the cwd, meaning no apache-nutch-1.5 folder is
created. This was the case with 1.4 and earlier. I believ
Hello,
Did you add your parser to parse-plugins.xml?
Cheers
-Original message-
> From:Ake Tangkananond
> Sent: Mon 25-Jun-2012 16:56
> To: user@nutch.apache.org
> Subject: Content type config on Parser plugin work improperly
>
> Hi experts,
>
> I am experimenting with a feature to add
ma or
> something? Or I have to hack crawling code too like you wrote about protocol
> plugin?
>
>
> Markus Jelsma-2 wrote
> >
> > What you can try is to add the referrer to outlinks when parsing records.
> > This outlink can be added to CrawlDatum's MetaData
Thanks for your comments. Please consider adding it to the issue so we can keep
track of it.
-Original message-
> From:John McCormac
> Sent: Sat 23-Jun-2012 16:36
> To: user@nutch.apache.org
> Subject: Re: Near Duplicate Detection in nutch /Solr
>
> On 23/06/
and a bit easier to deal with.
-Original message-
> From:John McCormac
> Sent: Sat 23-Jun-2012 15:11
> To: user@nutch.apache.org
> Subject: Re: Near Duplicate Detection in nutch /Solr
>
> On 23/06/2012 13:17, Markus Jelsma wrote:
> > Hello,
> >
> >
e: Near Duplicate Detection in nutch /Solr
>
> On 23/06/2012 12:14, Markus Jelsma wrote:
> > Nutch now has a HostURLNormalizer capable of normalizing source hosts to a
> > target host. This prevents duplication of complete websites and bad
> > hyperlinks.
> >
>
Nutch now has a HostURLNormalizer capable of normalizing source hosts to a
target host. This prevents duplication of complete websites and bad hyperlinks.
https://issues.apache.org/jira/browse/NUTCH-1319
-Original message-
> From:John McCormac
> Sent: Sat 23-Jun-2012 13:08
> To: user@
You can use Nutch's TextProfileSignature to create a less-than-exact signature
for pages, which lets deduplication catch some near duplicates.
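Switching the signature implementation is a one-property change in
nutch-site.xml, roughly (the default is MD5Signature):

  <property>
    <name>db.signature.class</name>
    <value>org.apache.nutch.crawl.TextProfileSignature</value>
  </property>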
-Original message-
> From:parnab kumar
> Sent: Sat 23-Jun-2012 10:42
> To: user@nutch.apache.org
> Subject: Near Duplicate Detection in nutch /Solr
>
> Hi,
>
>
I am not sure but the boost field may be available. I think it was populated
with the document score but you could increase it with a custom filter or some
hacking around.
-Original message-
> From:parnab kumar
> Sent: Fri 22-Jun-2012 17:36
> To: user@nutch.apache.org
> Subject: Doc
Hi,
If Nutch finds a relative URL it will be converted to absolute. This means that
any URL that does not explicitly start with http:// is going to have the host
prefixed. Your domain.com pages produce bad URL's such as http/www. And since
this is not http://, it'll end up as
http://domain.com/
I tried debugging your problem but it doesn't seem to exist. I fixed Nutch's
RobotParser test [1] but I cannot confirm URL's being disallowed if there is NO
value for Disallow: in the robots.txt file.
https://issues.apache.org/jira/browse/NUTCH-1408
Test with:
$ bin/nutch plugin lib-http
org.a
Hi,
You can use the domainstats tool to generate counts for domain, host, suffix
and tld. There's also the readdb -stats tool that shows your overall
statistics. NUTCH-1325 provides the same as readdb -stats but for individual
hosts.
Cheers
-Original message-
> From:kaveh minooie
Hi Lewis,
You got fooled by the ampersand operator, which Unix shells use to send a command
to the background. The [] integers are the Unix process IDs of the commands you
have given.
$ a&b&c is not one but three commands, sending a and b to the background. Your
shell will output the [process ID] if
Sounds like:
https://issues.apache.org/jira/browse/NUTCH-1245
Also, with a recent Nutch you can index with a -deleteGone flag. It behaves
similarly to SolrClean but only on records you just fetched.
-Original message-
> From:webdev1977
> Sent: Tue 19-Jun-2012 21:40
> To: user@nutch.apac
> To: user@nutch.apache.org
> Subject: RE: HTTP REFERER is missing
>
>
> Markus Jelsma-2 wrote
> >
> > Nutch cannot do this by default and is tricky to make because there may
> > not be one unique referrer per page.
> >
> I don't realy need unique ref
If you're sure Nutch treats an empty string the same as / then please file an
issue in Jira so we can track and fix it.
Thanks
-Original message-
> From:Magnús Skúlason
> Sent: Wed 20-Jun-2012 18:36
> To: nutch-u...@lucene.apache.org
> Subject: robots.txt, disallow: with empty string
-Original message-
> From:Lewis John Mcgibbney
> Sent: Wed 20-Jun-2012 22:23
> To: user@nutch.apache.org
> Subject: Re: Nutch and Solr Redundancy
>
> Hi Oakage,
>
> On Wed, Jun 20, 2012 at 9:08 PM, Oakage wrote:
> > Okay I've just started researching about nutch and knows that nutch
The log you provide doesn't look like the actual mapper log. Can you check it
out? The job has output for the main class but also separate logs for each map
and reduce task.
-Original message-
> From:sidbatra
> Sent: Wed 20-Jun-2012 20:29
> To: user@nutch.apache.org
> Subject: Re: N
In a parsing fetcher iirc outlinks are processed in the mapper (at least when
outlinks are followed). If a fetcher's reducer stalls you may run out of memory
or disk space.
-Original message-
> From:kaveh minooie
> Sent: Wed 13-Jun-2012 19:28
> To: user@nutch.apache.org
> Subject: Re
Hi
This CrawlDatum's FetchTime is tomorrow in EST
Fetch time: Tue Jun 12 02:59:27 EST 2012
-Original message-
> From:Andy Xue
> Sent: Mon 11-Jun-2012 11:00
> To: user@nutch.apache.org
> Subject: Generator: 0 records selected for fetching, exiting ...
>
> Hi all:
>
> This is regardin
Hello!
Sounds very interesting. Anyway, Solr can run embedded in a Java application
via the EmbeddedSolrServer class. You do need to make some changes to the SolrIndexer
tools in Nutch.
Cheers
-Original message-
> From:Emre Çelikten
> Sent: Thu 07-Jun-2012 22:24
> To: user@nutch.apache.org
Great work Lewis, Chris, committers and contributors!
Thanks all!
-Original message-
> From:lewis john mcgibbney
> Sent: Thu 07-Jun-2012 19:01
> To: annou...@apache.org; d...@nutch.apache.org; user@nutch.apache.org
> Subject: [ANNOUNCE] Apache Nutch 1.5 Released
>
> (apologies for cr
If Nutch runs on a different machine the DNS may not be resolving the host
after all. To solve the issue you will have to find a way to resolve the host.
Take a look in the Nutch logs.
-Original message-
> From:Chethan Prasad
> Sent: Thu 07-Jun-2012 16:49
> To: Markus Jels
t it find more links on
> the root page and follow them?
>
> Thanks,
> Chethan
>
> On Thu, Jun 7, 2012 at 7:49 PM, Markus Jelsma
> wrote:
>
> > Hi,
> >
> > Nutch will fetch URL's without robots.txt, but if robots.txt throws an
Hi,
Nutch will fetch URL's without robots.txt, but if robots.txt throws an
UnknownHostException, the URL will throw it as well and fail.
Cheers
-Original message-
> From:chethan
> Sent: Thu 07-Jun-2012 16:16
> To: user@nutch.apache.org
> Subject: robots.txt UnknownHostException
>
>
Hi
Nutch cannot do this by default and it is tricky to implement because there may not be
one unique referrer per page. What you can try is to add the referrer to
outlinks when parsing records. This outlink can be added to CrawlDatum's
MetaData which you can then later use to set the referrer. To set t
What's the problem with having the seed page? Can you not inject only the /news
pages? Anyway, you can always filter it away later after the first fetch cycle.
-Original message-
> From:Shameema Umer
> Sent: Wed 06-Jun-2012 13:02
> To: user@nutch.apache.org
> Subject: How to write co
s not used in the crawldb but in the parse job, which is
input to the crawldb.
>
>
>
> On Wed, Jun 6, 2012 at 10:02 AM, Markus Jelsma
> wrote:
> >
> > -Original message-
> >> From:Matthias Paul
> >> Sent: Wed 06-Jun-2012 09:47
> >> T
-Original message-
> From:Andy Xue
> Sent: Wed 06-Jun-2012 11:11
> To: Markus Jelsma ; user@nutch.apache.org
> Subject: Re: Behaviour of "urlfilter-suffix" plug-in when dealing
> with a URL without filename extension
>
> Hi Markus:
hi
>
> Thanks f
-Original message-
> From:pepe3059
> Sent: Wed 06-Jun-2012 02:58
> To: user@nutch.apache.org
> Subject: RE: threads disminution when fetching page
>
> me again :)
>
> at the end of fetch process, is the regex-urlfilter considered?
No. At the end of the fetch the mapper output is writti
-Original message-
> From:Andy Xue
> Sent: Wed 06-Jun-2012 05:04
> To: user@nutch.apache.org
> Subject: Behaviour of "urlfilter-suffix" plug-in when dealing with
> a URL without filename extension
>
> Hi all:
hi
>
> Does the "urlfilter-suffix" plug-in prune URL which does not have a
-Original message-
> From:chethan
> Sent: Wed 06-Jun-2012 05:12
> To: user@nutch.apache.org
> Subject: Nutch topN selection
>
> Hi,
hi
>
> Does the topN threshold consider page score for the selection. If it's set
> to say 10, does Nutch queue up the 10 top scoring URLs on a page?
Ye
-Original message-
> From:Matthias Paul
> Sent: Wed 06-Jun-2012 09:47
> To: user@nutch.apache.org
> Subject: Linkdb empty
>
> Hi all,
hi
>
> I noticed that my linkdb is always empty although I use the generated
> segments from the last crawl for the generation of the linkdb.
Check th
-Original message-
> From:pepe3059
> Sent: Mon 04-Jun-2012 20:42
> To: user@nutch.apache.org
> Subject: RE: threads disminution when fetching page
>
> thank you for your answer Markus
Hi
>
> you mean, until the fetch process finishes, is information stored using hdfs
> by nutch? mean
This is normal and means the fetcher is finishing all its input URL's and
writing stuff to disk.
-Original message-
> From:pepe3059
> Sent: Sat 02-Jun-2012 22:15
> To: user@nutch.apache.org
> Subject: threads disminution when fetching page
>
> Hello, i hope you can help me
>
>
> i a
Hi,
The generator can only do it the other way around via the addDays parameter. To
make it work your way you can modify the generator to restrict to documents
younger than 48 hours.
Cheers
-Original message-
> From:Shameema Umer
> Sent: Mon 04-Jun-2012 08:33
> To: user@nutch.apac
a, there are no outlinks to
> external sites. (If you check the tinymce site, it has links to
> microsoft, facebook, etc) So I am thinking my problem is more or less
> related to the issue described
> here
>
> https://issues.apache.org/jira/browse/NUTCH-1346
No, that is a fi
> overriding the nutch-default.xml
>
>
> <property>
>   <name>db.fetch.schedule.class</name>
>   <value>com.custom.CustomEventFetchScheduler</value>
> </property>
>
>
> How do I include my custom logic so that it gets picked as a part of the
> crawl cycle.
>
> Regards | Vikas
>
> On Mon, May 21, 2012 at 6:14 PM, Markus Jelsma
ignore.limit.domain to false and the link.ignore.internal.xxx can be
> set to true? Or should I just set all of the link.ignore.xxx.xxx values
> to false?
>
> On 5/29/2012 4:43 PM, Markus Jelsma wrote:
> > Hi,
> >
> > That's a patch for the fetcher. The error you
Hi,
That's a patch for the fetcher. The error you are seeing is quite simple
actually. Because you set those two link.ignore parameters to true, no links
between the same domain and host are aggregated, only links from/to external
hosts and domains. This is a good setting for wide web crawls. If
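Roughly, the two webgraph properties in question look like this (shown with the
wide-crawl setting described above; set them to false if you do want internal
links aggregated):

  <property>
    <name>link.ignore.internal.host</name>
    <value>true</value>
  </property>
  <property>
    <name>link.ignore.internal.domain</name>
    <value>true</value>
  </property>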
Hi,
Yes, this is no problem.
Cheers
-Original message-
> From:Dustine Rene Bernasor
> Sent: Thu 24-May-2012 12:58
> To: user@nutch.apache.org
> Subject: Multiple nutch jobs on a Hadoop cluster simultaneosuly
>
> Hello
>
> I was wondering, would it be possible to run multiple nutch jo
You can inspect the CrawlDB with the readdb tool, check if it's there.
-Original message-
> From:Tolga
> Sent: Wed 23-May-2012 14:21
> To: user@nutch.apache.org
> Subject: Re: Apparently far from last question :)
>
> My colleague has just made me realize something. Is it possible tha
Great!
My +1 for a new release based on the state of the codebase.
-Original message-
> From:Julien Nioche
> Sent: Tue 22-May-2012 22:19
> To: d...@nutch.apache.org
> Cc: user@nutch.apache.org
> Subject: Re: Apache Nutch release 1.5 RC2
>
> Read http://people.apache.org/~lewismc/nutc
-Original message-
> From:Bai Shen
> Sent: Tue 22-May-2012 19:40
> To: user@nutch.apache.org
> Subject: URL filtering and normalization
>
> Somehow my crawler started fetching youtube. I'm not really sure why as I
> have db.ignore.external.links set to true.
Weird!
>
> I've since add
Please read the description.
-Original message-
> From:Tolga
> Sent: Tue 22-May-2012 11:37
> To: user@nutch.apache.org
> Subject: Re: PDF not crawled/indexed
>
> What is that value's unit? kilobytes? My PDF file is 4.7mb.
>
> On 5/22/12 12:34 PM, Lewis John Mcgibbney wrote:
> > Yes I
.MapTask.run(MapTask.java:307)
> at
> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
> ********
>
>
>
>
> - Original message -
> From: "Markus Jelsma"
> To: user@nutch.apache.org
> Sent: Monday, 21 May 201
Hi
Which version do you use? It should list the troubling URL. What's the stack
trace?
Cheers
-Original message-
> From:Ing. Eyeris Rodriguez Rueda
> Sent: Mon 21-May-2012 17:07
> To: user@nutch.apache.org
> Subject: error parsing some xml
>
> Hi all.
> When I try to crawl i have
Yes, you can pass ParseMeta keys to the FetchSchedule as part of the
CrawlDatum's meta data as I did with:
https://issues.apache.org/jira/browse/NUTCH-1024
-Original message-
> From:Vikas Hazrati
> Sent: Mon 21-May-2012 13:44
> To: user@nutch.apache.org
> Subject: Setting the Fetch ti
he Nutch 1.5 release rc #1
>
> When will Nutch 1.5 be released?
>
> Matthias
>
> On Wed, Apr 18, 2012 at 1:46 PM, Bharat Goyal
> wrote:
> > +1
> >
> >
> > On Monday 16 April 2012 12:34 PM, Markus Jelsma wrote:
> >>
> >>
-Original message-
> From:Matthias Paul
> Sent: Fri 18-May-2012 14:57
> To: user@nutch.apache.org
> Subject: Exclude certain mime-types
>
> How can I exlude certain mime-types from crawling, for example Word-documents?
> If I have parse-tika in plugin.includes it will parse them. Do
wling" there's the sentence "This also
> > > permits ... incremental crawling", as if the crawl command described
> > > before (3.1 Using the Crawl Command) couldn't do that.
> > >
> > > Could someone perhaps improve this part of the tuto
me?
>
> Regards,
>
> On 5/11/12 9:40 AM, Markus Jelsma wrote:
> > Ah, that means don't use the crawl command and do a little shell
> > scripting to execute the separate crawl cycle commands, see the nutch
> > wiki for examples. And don't do solrdedup. Sea
the meaning of "-53"
>
> If necessary ,I can provide the js files.
>
> Thank you for your help.
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Can-t-retrieve-Tika-parser-for-mime-type
> -text-javascript-tp3983599p3983627.html Sent from the Nutch - User mailing
> list archive at Nabble.com.
--
Markus Jelsma - CTO - Openindex
yes
On Tuesday 15 May 2012 12:45:28 Taeseong Kim wrote:
> is whole web content download possible?
>
> include Flash, Image, CSS, JavaScript
etrieve-Tika-parser-for-mime-type-text-javascript-tp3983599.html
Sent from the Nutch - User mailing list archive at Nabble.com.
--
Markus Jelsma - CTO - Openindex
to debug & resolve ??
--
View this message in context:
http://lucene.472066.n3.nabble.com/java-lang-NullPointerException-org-apache-hadoop-io-Text-encode-Text-java-388-tp3983600.html
Sent from the Nutch - User mailing list archive at Nabble.com.
--
Markus Jelsma - CTO - Openindex
ommitted heap usage (bytes): 26456621056.
So in fact it uses much less memory than it can.
Any idea?
--
View this message in context:
http://lucene.472066.n3.nabble.com/Heap-space-problem-when-running-nutch-on-cluster-tp3983561.html
Sent from the Nutch - User mailing list archive at Nabble.com.
--
Markus Jelsma - CTO - Openindex
e
> > > existing log.
> > > I am running nutch in eploy mode.
> > > Also I want some urls filtered by my urlfilter to be stored in
an
> > external
> > > flat file. How can I achieve this.
> > >
> > > --
> > > *Thanks & Re
How do I exactly "omit solrdedup and use Solr's internal
deduplication" instead.? I don't even know what any of that means :D
I've just used bin/nutch crawl urls -solr http://localhost:8983/solr/
-depth 3 -topN 100 to get the error. I have to use all the steps?
Regards,
On
n instead, it works similarly and uses the same signature
algorithm as Nutch has. Please consult the Solr wiki page on
deduplication.
Good luck
On Thu, 10 May 2012 22:54:37 +0300, Tolga wrote:
Hi Markus,
On 05/10/2012 09:42 AM, Markus Jelsma wrote:
Hi,
On Thu, 10 May 2012 09:10:04 +0300, Tol
incremental
indexing but I can't find it just now sorry.
Lewis
On Thu, May 10, 2012 at 5:18 PM, Matthias Paul
wrote:
Hi all,
can the crawl-command also be used for iterative crawls?
In older Nutch-versions this was not possible but in 1.5 it seems to
work?
Thanks
Matthias
--
Markus J
> Nutch is able to index to Solr 3.6.0, however if not then maybe we
> should upgrade accordingly in trunk.
>
> Thanks
>
> Lewis
>
> On Thu, May 10, 2012 at 1:56 PM, Michael Erickson
>
> wrote:
> > On May 10, 2012, at 1:42 AM, Markus Jelsma wrote:
> >&g
hi
On Thursday 10 May 2012 15:19:09 Vikas Hazrati wrote:
> Hi Markus,
>
> Thanks for your response. My responses inline
>
> On Thu, May 10, 2012 at 12:34 AM, Markus Jelsma
>
> wrote:
> > hi
> >
> >
> > On Thu, 10 May 2012 00:26:40 +0530,
76568.html
Sent from the Nutch - User mailing list archive at Nabble.com.
--
Markus Jelsma - CTO - Openindex
Sent from the Nutch - User mailing list archive at Nabble.com.
--
Markus Jelsma - CTO - Openindex
free
space.
All the best,
Igor
On Thu, May 10, 2012 at 10:35 AM, Markus Jelsma wrote:
Plenty of disk space does not mean you have enough room in your
hadoop.tmp.dir which is /tmp by default.
On Thu, 10 May 2012 10:26:00 +0200, Igor Salma wrote:
Hi, Adriana, Sebastian,
We are struggling wit
o hadoop-core-0.20.203.0.jar but then
this
is
thrown:
Exception in thread "main" java.lang.NoClassDefFoundError:
org/apache/commons/configuration/Configuration
Can someone, please, shed some light on this?
Thanks.
Igor
--
Markus Jelsma - CTO - Openindex
iling list.
/Powered by Jetty://
/What am I doing wrong?
Regards,/
/
--
Markus Jelsma - CTO - Openindex
-inc.com
[7] mailto:krist...@yahoo-inc.com
[8] mailto:krist...@yahoo-inc.com
[9] mailto:krist...@yahoo-inc.com
[10] mailto:krist...@yahoo-inc.com
[11]
http://webmail.openindex.io/cid:part1.02010906.02030606@yahoo-inc.com
[12] mailto:krist...@yahoo-inc.com
--
Markus Jelsma - CTO - Openindex
anks,
James Ford
--
View this message in context:
http://lucene.472066.n3.nabble.com/Make-Nutch-to-crawl-internal-urls-only-tp3974397.html
Sent from the Nutch - User mailing list archive at Nabble.com.
--
Markus Jelsma - CTO - Openindex
links to get inside it?
What link deduplication do you mean? CrawlDB records have a unique key
on the URL.
Regards | Vikas
www.knoldus.com
--
Markus Jelsma - CTO - Openindex
mail.com
[1] http://www8.org/w8-papers/5a-search-query/crawling/
[2] http://www.cse.iitb.ac.in/~soumen/focus/
[3]
http://nutch.apache.org/apidocs-1.3/org/apache/nutch/indexer/IndexingFilter.html
--
Markus Jelsma - CTO - Openindex
io/tel:%28408%29%20349%203300
[20] http://webmail.openindex.io/tel:%28408%29%20349%203301
[21] mailto:krist...@yahoo-inc.com
[22]
http://webmail.openindex.io/tel:%2B49%20%280%2989%20231%2097%20207
[23]
http://webmail.openindex.io/tel:%2B49%20%280%29%20162%2028899%2002
[24] http://webmail.openindex.i
custom URL Normalizer to get this to work.
But why? It doesn't seem alright.
On Tue, 08 May 2012 14:46:14 +0200, Markus Jelsma
wrote:
I'm not sure this is going to work as a lowercase flag is used on the
regular expressions.
On Tue, 08 May 2012 13:37:47 +0100, Dean Pullen
wrote:
Hi
2633&pid=1043ELE&site=191";1;"db_unfetched";Tue
May 01 17:37:56 BST 2012;Thu Jan 01 01:00:00 GMT
1970;0;2592000.0;30.0;500.0;"null"
Notice the URL starts with an L? (Thus not matching http/https in
another config). Is this some problem with the regex above?
Regards,
Dean Pullen
--
Markus Jelsma - CTO - Openindex
Hi
Nutch should parse an HTML file with a .txt extension just like a normal
HTML file; at least it does here. What does your parserchecker say? In
any case you must strip potential left-over HTML in your Solr analyzer;
left like this it's a bad XSS vulnerability.
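One way to do the stripping on the Solr side, as a sketch, is an HTMLStrip char
filter in the index analyzer of the relevant field type in schema.xml:

  <analyzer type="index">
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
  </analyzer>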
Cheers
On Tue, 8 May 2012
l how many segments of ~N records
are generated.
Markus Jelsma-2 wrote
On Mon, 7 May 2012 22:31:43 -0700 (PDT), "nutch.buddy@"
wrote:
In a previous discussion about handling of failures in nutch, it
was
mentioned that a broken segment cannot be fixed and it's urls
sh
etried by Hadoop.
Any existing way in nutch to do this?
Sure, the -topN parameter of the generator tool.
--
View this message in context:
http://lucene.472066.n3.nabble.com/Is-it-possible-to-control-the-segment-size-tp3970452.html
Sent from the Nutch - User mailing list archive at Nabble.com.
the url in the following html snippet
> as a link?
>
> http://www.example.com/link";);">...
>
>
> Thanks,
> Mohammad
--
Markus Jelsma - CTO - Openindex