Re: [Fwd: Re: 0.8 Recrawl script updated]

2006-08-08 Thread Matthew Holt
Since it wasn't really clear whether my script approached the problem of 
deleting segments correctly, I refactored it: it now generates the new 
segments, merges them into one, and then deletes the newly generated 
segments. This is not as efficient in terms of disk space, but it still 
removes a large number of segments that are not referenced by anything 
because they have not been indexed yet.
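
For anyone who wants the idea at a glance, here is a rough sketch of the 
same flow in Java (untested; the class name, paths, and segment names are 
illustrative, and it assumes Nutch 0.8's SegmentMerger -- the class behind 
"bin/nutch mergesegs" -- and the Hadoop FileSystem API; the wiki script 
does the same thing from the shell):

   import org.apache.hadoop.fs.FileSystem;
   import org.apache.hadoop.fs.Path;
   import org.apache.nutch.segment.SegmentMerger;
   import org.apache.nutch.util.NutchConfiguration;

   public class MergeThenDelete {
     public static void main(String[] args) throws Exception {
       FileSystem fs = FileSystem.get(NutchConfiguration.create());

       // The recrawl has just generated and fetched some new segments,
       // e.g. (hypothetical timestamps):
       Path[] newSegs = { new Path("crawl/segments/20060808100000"),
                          new Path("crawl/segments/20060808110000") };

       // Merge them into one segment under a separate directory,
       // equivalent to: bin/nutch mergesegs crawl/MERGEDsegments seg1 seg2
       String[] mergeArgs = new String[newSegs.length + 1];
       mergeArgs[0] = "crawl/MERGEDsegments";
       for (int i = 0; i < newSegs.length; i++) {
         mergeArgs[i + 1] = newSegs[i].toString();
       }
       SegmentMerger.main(mergeArgs);

       // Delete only the just-generated segments; their content now
       // lives in the merged segment, which is kept and indexed.
       for (int i = 0; i < newSegs.length; i++) {
         fs.delete(newSegs[i]);
       }
     }
   }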


I updated the wiki again. Unless more clarification on the issue turns up, 
hopefully I won't have to bombard your inboxes with any more emails about 
this.


Matt

Lukas Vlcek wrote:

Hi again,

I just found related discussion here:
http://www.nabble.com/NullPointException-tf2045994r1.html

I think these guys are discussing a similar problem, and if I understood
the conclusion correctly, the only solution right now is to write some
code that tests which segments are used by the index and which are not.
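
Something along these lines might do the job -- a rough, untested sketch 
against the Lucene 1.9 API that Nutch 0.8 ships with, assuming a local 
index at crawl/index whose documents carry the usual "segment" field:

   import org.apache.lucene.index.IndexReader;
   import org.apache.lucene.index.Term;
   import org.apache.lucene.index.TermEnum;

   import java.util.HashSet;
   import java.util.Iterator;
   import java.util.Set;

   public class ListReferencedSegments {
     public static void main(String[] args) throws Exception {
       // Enumerate the distinct values of the "segment" field; any
       // directory under crawl/segments/ whose name is not printed
       // here is not referenced by the index.
       IndexReader reader = IndexReader.open("crawl/index");
       TermEnum terms = reader.terms(new Term("segment", ""));
       Set segments = new HashSet();
       try {
         do {
           Term t = terms.term();
           if (t == null || !"segment".equals(t.field())) break;
           segments.add(t.text());
         } while (terms.next());
       } finally {
         terms.close();
         reader.close();
       }
       for (Iterator i = segments.iterator(); i.hasNext();) {
         System.out.println(i.next());
       }
     }
   }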

Regards,
Lukas

On 8/4/06, Lukas Vlcek [EMAIL PROTECTED] wrote:

Matthew,

In fact I didn't realize you were doing the merge step (sorry for that),
but frankly I don't know exactly how merging works, whether this strategy
would hold up over the long term, or whether it is a universal approach
across all the situations that can occur during crawling (-topN, frozen
threads, unavailable pages, a crawl dying, etc.). Maybe it is the correct
path; I would appreciate it if anybody could answer this question
precisely.

Thanks,
Lukas

On 8/4/06, Matthew Holt [EMAIL PROTECTED] wrote:
 If anyone doesn't mind taking a look...



 -- Forwarded message --
 From: Matthew Holt [EMAIL PROTECTED]
 To: nutch-user@lucene.apache.org
 Date: Fri, 04 Aug 2006 10:07:57 -0400
 Subject: Re: 0.8 Recrawl script updated
 Lukas,
   Thanks for your e-mail. I assumed I could drop the $depth oldest
 segments because I first merged them all into one segment (which I don't
 drop). Am I incorrect in my assumption, and can this cause problems in
 the future? If so, I'll go back to the original version of my script,
 which kept all the segments without merging. However, if that is the
 case, it seems it would become a problem after enough recrawls due to
 the large number of segments being kept.

  Thanks,
   Matt
 Lukas Vlcek wrote:
  Hi Matthew,
 
  I am curious about one thing. How do you know you can just drop the
  $depth oldest segments at the end? I haven't studied the nutch code on
  this topic yet, but I thought a segment could be dropped only once you
  are sure that all its content has already been crawled into some newer
  segment (which should be checked somehow via some function/script,
  which to my knowledge hasn't been implemented yet).
 
  Also, I don't think this question has been discussed on the dev/user
  lists in detail yet, so I just wanted to ask your opinion. The
  situation could get even more complicated if people add the -topN
  parameter to the script (which can happen, because some might prefer
  crawling in ten smaller batches over two huge crawls for various
  technical reasons).
 
  Anyway, never mind if you don't want to bother with my silly question
  :-)
 
  Regards,
  Lukas
 
 
  On 8/4/06, Matthew Holt [EMAIL PROTECTED] wrote:
  Last email regarding this script. I found a bug in it that is sporadic
  (I think it only affected certain setups). However, since it would be a
  problem sometimes, I refactored the script. I'd suggest you redownload
  the script if you are using it.
 
  Matt
 
  Matthew Holt wrote:
   I'm currently pretty busy at work. If I have time, I'll do it later.
  
   The version 0.8 recrawl script has a working version online now. I
   temporarily modified it on the website yesterday when I ran into some
   problems, but I have since tested it further, and the actual working
   code is posted now. So if you got it off the web site any time
   yesterday, I would redownload the script.
  
   Matt
  
  
   Lourival Júnior wrote:
   Hi Matthew!
  
   Could you update the script to version 0.7.2 with the same
   functionality? I wrote a script that does this, but it doesn't work
   very well...
  
   Regards!
  
  
   On 8/2/06, Matthew Holt [EMAIL PROTECTED] wrote:
  
   Just letting everyone know that I updated the recrawl script on the
   Wiki. It now merges the created segments, then deletes the old
   segments to prevent a lot of unneeded data remaining and growing on
   the hard drive.
  
   Matt
  
   http://wiki.apache.org/nutch/IntranetRecrawl?action=show#head-e58e25a0b9530bb6fcdfb282fd27a207fc0aff03


parse-oo plugin

2006-08-08 Thread Matthew Holt

Hey there,
 Hope all has been going well for you. I noticed a small issue with the 
parse-oo plugin. It parses documents correctly; however, when an 
OpenOffice document comes up as a search result and you click "cached", 
the page returns a NullPointerException. I looked into it, and the line 
in cached.jsp that throws the NPE is below:


String contentType = (String) metaData.get(Metadata.CONTENT_TYPE);
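
As a stopgap, a null check at that spot keeps the page from blowing up, 
though it obviously doesn't restore the missing metadata (untested):

   String contentType = (String) metaData.get(Metadata.CONTENT_TYPE);
   if (contentType == null) {
     // parse-oo stored no content type; fall back to an empty string
     // so the rest of cached.jsp can proceed instead of throwing.
     contentType = "";
   }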

So apparently the parse-oo plugin does not store the CONTENT_TYPE of the 
document. I looked at the code around line 100 and changed:


   Outlink[] links = (Outlink[]) outlinks.toArray(new Outlink[outlinks.size()]);
   ParseData parseData = new ParseData(ParseStatus.STATUS_SUCCESS, title, links, metadata);

   return new ParseImpl(text, parseData);

to:

   Outlink[] links = (Outlink[]) outlinks.toArray(new Outlink[outlinks.size()]);
   ParseData parseData = new ParseData(ParseStatus.STATUS_SUCCESS, title, links, content.getMetadata(), metadata);

   parseData.setConf(this.conf);
   return new ParseImpl(text, parseData);

This fixes the problem of cached.jsp throwing an exception, but now it 
displays every document type as either [octet-stream] or [oleobject].


So it seems it's not interpreting the mime types correctly. Do you know 
how to fix both the cached.jsp issue and the mime-type issue at the same 
time?
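
One idea I have not verified: since the fetcher already knows the real 
content type, the plugin could copy it into the parse metadata itself 
before building ParseData, something like:

   // Untested: record the content type the protocol layer reported,
   // so cached.jsp and the search UI see a real mime type instead of
   // octet-stream/oleobject. (Use put() instead of set() if the
   // metadata object in your Nutch version is Properties-based.)
   metadata.set(Metadata.CONTENT_TYPE, content.getContentType());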

Thanks,
 Matt


[Fwd: Re: 0.8 Recrawl script updated]

2006-08-04 Thread Matthew Holt

If anyone doesn't mind taking a look...
---BeginMessage---

Lukas,
  Thanks for your e-mail. I assumed I could drop the $depth oldest 
segments because I first merged them all into one segment (which I don't 
drop). Am I incorrect in my assumption, and can this cause problems in 
the future? If so, I'll go back to the original version of my script, 
which kept all the segments without merging. However, if that is the 
case, it seems it would become a problem after enough recrawls due to 
the large number of segments being kept.


Thanks,
 Matt

Lukas Vlcek wrote:

Hi Matthew,

I am curious about one thing. How do you know you can just drop the
$depth oldest segments at the end? I haven't studied the nutch code on
this topic yet, but I thought a segment could be dropped only once you
are sure that all its content has already been crawled into some newer
segment (which should be checked somehow via some function/script, which
to my knowledge hasn't been implemented yet).

Also, I don't think this question has been discussed on the dev/user
lists in detail yet, so I just wanted to ask your opinion. The situation
could get even more complicated if people add the -topN parameter to the
script (which can happen, because some might prefer crawling in ten
smaller batches over two huge crawls for various technical reasons).

Anyway, never mind if you don't want to bother with my silly question :-)


Regards,
Lukas

On 8/4/06, Matthew Holt [EMAIL PROTECTED] wrote:

Last email regarding this script. I found a bug in it that is sporadic
(I think it only affected certain setups). However, since it would be a
problem sometimes, I refactored the script. I'd suggest you redownload
the script if you are using it.

Matt

Matthew Holt wrote:
 I'm currently pretty busy at work. If I have time, I'll do it later.

 The version 0.8 recrawl script has a working version online now. I
 temporarily modified it on the website yesterday when I ran into some
 problems, but I have since tested it further, and the actual working
 code is posted now. So if you got it off the web site any time
 yesterday, I would redownload the script.

 Matt

 Lourival Júnior wrote:
 Hi Matthew!

 Could you update the script to version 0.7.2 with the same
 functionality? I wrote a script that does this, but it doesn't work
 very well...

 Regards!

 On 8/2/06, Matthew Holt [EMAIL PROTECTED] wrote:

 Just letting everyone know that I updated the recrawl script on the
 Wiki. It now merges the created segments, then deletes the old segments
 to prevent a lot of unneeded data remaining and growing on the hard
 drive.
   Matt


 
http://wiki.apache.org/nutch/IntranetRecrawl?action=show#head-e58e25a0b9530bb6fcdfb282fd27a207fc0aff03

---End Message---