parse-oo plugin
Hey there,

Hope all has been going well for you. I noticed a small issue with the parse-oo plugin. It parses the documents correctly; however, when an OpenOffice document appears in the search results and you click "cached", it fails with a NullPointerException. I looked into it, and the line in cached.jsp that throws the NPE is:

    String contentType = (String) metaData.get(Metadata.CONTENT_TYPE);

So apparently the parse-oo plugin does not store the CONTENT_TYPE of the document. I modified the code around line 100, changing:

    Outlink[] links = (Outlink[])outlinks.toArray(new Outlink[outlinks.size()]);
    ParseData parseData = new ParseData(ParseStatus.STATUS_SUCCESS, title, links, metadata);
    return new ParseImpl(text, parseData);

to:

    Outlink[] links = (Outlink[])outlinks.toArray(new Outlink[outlinks.size()]);
    ParseData parseData = new ParseData(ParseStatus.STATUS_SUCCESS, title, links, content.getMetadata(), metadata);
    parseData.setConf(this.conf);
    return new ParseImpl(text, parseData);

This fixes the exception in cached.jsp, but now every document type is displayed as either [octet-stream] or [oleobject], so it seems the MIME types are not being interpreted correctly. Do you know how to fix both the cached.jsp issue and the MIME-type issue together?

Thanks,
Matt
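As a side note to the report above, a page that reads the content type can be made tolerant of a missing entry. The following is only a minimal sketch of that defensive lookup, not the cached.jsp code or a fix for the plugin: it uses a plain java.util.Map as a stand-in for Nutch's metadata object, and the names ContentTypeSketch, contentTypeOrDefault and DEFAULT_TYPE are hypothetical.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper; a plain Map stands in for the real metadata class.
final class ContentTypeSketch {
  // Assumed key name and fallback value, chosen for illustration only.
  static final String CONTENT_TYPE = "Content-Type";
  static final String DEFAULT_TYPE = "application/octet-stream";

  /** Returns the bare media type, or DEFAULT_TYPE when none was stored. */
  static String contentTypeOrDefault(Map<String, String> metaData) {
    if (metaData == null) {
      return DEFAULT_TYPE;
    }
    String value = metaData.get(CONTENT_TYPE);
    if (value == null || value.trim().isEmpty()) {
      return DEFAULT_TYPE;
    }
    // Drop parameters such as "; charset=UTF-8" and normalize the case.
    int semi = value.indexOf(';');
    String bare = semi >= 0 ? value.substring(0, semi) : value;
    return bare.trim().toLowerCase(Locale.ROOT);
  }

  public static void main(String[] args) {
    Map<String, String> meta = new HashMap<>();
    System.out.println(contentTypeOrDefault(meta)); // nothing stored yet
    meta.put(CONTENT_TYPE, "Text/HTML; charset=UTF-8");
    System.out.println(contentTypeOrDefault(meta));
  }
}
```

A fallback like this only hides the symptom in the page; the parser would still need to record the correct type for the cached view to label documents properly.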
[jira] Resolved: (NUTCH-266) hadoop bug when doing updatedb
[ http://issues.apache.org/jira/browse/NUTCH-266?page=all ]

Sami Siren resolved NUTCH-266.
------------------------------
    Resolution: Fixed

I just updated the hadoop versions: trunk contains 0.5.0, and the 0.8 branch contains a patched 0.4.0.

> hadoop bug when doing updatedb
> ------------------------------
>
>                 Key: NUTCH-266
>                 URL: http://issues.apache.org/jira/browse/NUTCH-266
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 0.8
>         Environment: windows xp, JDK 1.4.2_04
>            Reporter: Eugen Kochuev
>             Fix For: 0.8.1, 0.9.0
>
>         Attachments: patch.diff, patch_hadoop-0.5.0.diff
>
>
> I constantly get the following error message
> 060508 230637 Running job: job_pbhn3t
> 060508 230637 c:/nutch/crawl-20060508230625/crawldb/current/part-0/data:0+245
> 060508 230637 c:/nutch/crawl-20060508230625/segments/20060508230628/crawl_fetch/part-0/data:0+296
> 060508 230637 c:/nutch/crawl-20060508230625/segments/20060508230628/crawl_parse/part-0:0+5258
> 060508 230637 job_pbhn3t
> java.io.IOException: Target /tmp/hadoop/mapred/local/reduce_qnd5sx/map_qjp7tf.out already exists
>         at org.apache.hadoop.fs.FileUtil.checkDest(FileUtil.java:162)
>         at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:62)
>         at org.apache.hadoop.fs.LocalFileSystem.renameRaw(LocalFileSystem.java:191)
>         at org.apache.hadoop.fs.FileSystem.rename(FileSystem.java:306)
>         at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:101)
> Exception in thread "main" java.io.IOException: Job failed!
>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:341)
>         at org.apache.nutch.crawl.CrawlDb.update(CrawlDb.java:54)
>         at org.apache.nutch.crawl.Crawl.main(Crawl.java:114)

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Resolved: (NUTCH-344) Fetcher threads blocked on synchronized block in cleanExpiredServerBlocks
[ http://issues.apache.org/jira/browse/NUTCH-344?page=all ]

Sami Siren resolved NUTCH-344.
------------------------------
    Fix Version/s: 0.8.1
                   0.9.0
       Resolution: Fixed

I just committed this to 0.8 branch and trunk, thanks Greg!

> Fetcher threads blocked on synchronized block in cleanExpiredServerBlocks
> --------------------------------------------------------------------------
>
>                 Key: NUTCH-344
>                 URL: http://issues.apache.org/jira/browse/NUTCH-344
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 0.8, 0.9.0, 0.8.1
>         Environment: All
>            Reporter: Greg Kim
>             Fix For: 0.8.1, 0.9.0
>
>         Attachments: cleanExpiredServerBlocks.patch
>
>
> The recent change to the following code in HttpBase.java tends to block
> fetcher threads while one thread busy-waits...
>   private static void cleanExpiredServerBlocks() {
>     synchronized (BLOCKED_ADDR_TO_TIME) {
>       while (!BLOCKED_ADDR_QUEUE.isEmpty()) {   <= LINE 3
>         String host = (String) BLOCKED_ADDR_QUEUE.getLast();
>         long time = ((Long) BLOCKED_ADDR_TO_TIME.get(host)).longValue();
>         if (time <= System.currentTimeMillis()) {
>           BLOCKED_ADDR_TO_TIME.remove(host);
>           BLOCKED_ADDR_QUEUE.removeLast();
>         }
>       }
>     }
>   }
> LINE 3: As long as there are *any* entries in the BLOCKED_ADDR_QUEUE, the
> thread that first enters this block busy-waits until it becomes empty while
> all other threads block on the synchronized block. This leads to extremely
> poor fetcher performance.
> Since the checkin to respect crawlDelay in robots.txt, we are no longer
> guaranteed that BLOCKED_ADDR_TO_TIME queue is a fifo list. The simple fix is
> to iterate the queue once rather than busy waiting...
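The "iterate the queue once" idea from the issue text can be sketched as follows. This is not the attached cleanExpiredServerBlocks.patch and not the committed HttpBase.java code; it is a self-contained illustration that reuses the two field names from the quoted snippet. The class name ServerBlockSketch, the helper methods block, blockedCount and isBlocked, the use of generics, and passing the current time as a parameter are all assumptions made so the example can run on its own.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.Map;

// Hypothetical stand-in for the relevant part of HttpBase.
final class ServerBlockSketch {
  private static final Map<String, Long> BLOCKED_ADDR_TO_TIME = new HashMap<>();
  private static final LinkedList<String> BLOCKED_ADDR_QUEUE = new LinkedList<>();

  /** Records that a host is blocked until the given time (millis). */
  static void block(String host, long expiryMillis) {
    synchronized (BLOCKED_ADDR_TO_TIME) {
      BLOCKED_ADDR_TO_TIME.put(host, expiryMillis);
      BLOCKED_ADDR_QUEUE.addFirst(host);
    }
  }

  /**
   * Makes exactly one pass over the queue and drops expired entries.
   * Entries that have not expired are skipped rather than waited on,
   * so the lock is held only for the duration of a single traversal.
   */
  static void cleanExpiredServerBlocks(long now) {
    synchronized (BLOCKED_ADDR_TO_TIME) {
      Iterator<String> it = BLOCKED_ADDR_QUEUE.iterator();
      while (it.hasNext()) {
        String host = it.next();
        Long time = BLOCKED_ADDR_TO_TIME.get(host);
        if (time == null || time <= now) {
          BLOCKED_ADDR_TO_TIME.remove(host);
          it.remove(); // safe removal while iterating
        }
      }
    }
  }

  static int blockedCount() {
    synchronized (BLOCKED_ADDR_TO_TIME) {
      return BLOCKED_ADDR_QUEUE.size();
    }
  }

  static boolean isBlocked(String host) {
    synchronized (BLOCKED_ADDR_TO_TIME) {
      return BLOCKED_ADDR_TO_TIME.containsKey(host);
    }
  }

  public static void main(String[] args) {
    long now = System.currentTimeMillis();
    block("a.example", now - 1000);   // already expired
    block("b.example", now + 60000);  // still blocked
    cleanExpiredServerBlocks(now);
    System.out.println("still blocked: " + blockedCount());
  }
}
```

Because expiry times are no longer ordered once per-host crawl delays differ, the single pass checks every entry instead of stopping at the first unexpired one.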
Re: [Fwd: Re: 0.8 Recrawl script updated]
Since it wasn't really clear whether my script handled segment deletion correctly, I refactored it so that it generates the new segments, merges them into one, and then deletes the "new" segments. This is not as efficient in terms of disk space, but it still removes a large number of the segments that are not referenced by anything because they have not been indexed yet. I updated the wiki again. Unless the issue needs further clarification, hopefully I won't have to bombard your inbox with any more emails about this.

Matt

Lukas Vlcek wrote:

Hi again,

I just found a related discussion here: http://www.nabble.com/NullPointException-tf2045994r1.html

I think these guys are discussing a similar problem, and if I understood the conclusion correctly, the only solution right now is to write some code that tests which segments are used in the index and which are not.

Regards,
Lukas

On 8/4/06, Lukas Vlcek <[EMAIL PROTECTED]> wrote:

Matthew,

In fact I didn't realize you were merging the segments (sorry for that), but frankly I don't know exactly how merging works, whether this strategy would work in the long term, or whether it is a universal approach for all the cases that may occur during crawling (-topN, frozen threads, unavailable pages, a dying crawl, etc.). Maybe it is the correct path. I would appreciate it if anybody could answer this question precisely.

Thanks,
Lukas

On 8/4/06, Matthew Holt <[EMAIL PROTECTED]> wrote:
> If anyone doesn't mind taking a look...
>
> -- Forwarded message --
> From: Matthew Holt <[EMAIL PROTECTED]>
> To: nutch-user@lucene.apache.org
> Date: Fri, 04 Aug 2006 10:07:57 -0400
> Subject: Re: 0.8 Recrawl script updated
>
> Lukas,
>    Thanks for your e-mail. I assumed I could drop the $depth number of
> oldest segments because I first merged them all into one segment (which
> I don't drop). Am I incorrect in my assumption and can this cause
> problems in the future? If so, then I'll go back to the original version
> of my script, in which I kept all the segments without merging. However, it
> just seemed that if that is the case, it will be a problem after enough
> recrawls due to the large number of segments being kept.
>
> Thanks,
> Matt
>
> Lukas Vlcek wrote:
> > Hi Matthew,
> >
> > I am curious about one thing. How do you know you can just drop $depth
> > number of the oldest segments in the end? I haven't studied nutch
> > code regarding this topic yet but I thought that a segment can be
> > dropped once you are sure that all its content is already crawled in
> > some newer segments (which should be checked somehow via some
> > function/script - which hasn't been implemented yet to my knowledge).
> >
> > Also I don't think this question has been discussed on dev/user lists
> > in detail yet so I just wanted to ask you about your opinion. The
> > situation could get even more complicated if people add the -topN
> > parameter to the script (which can happen because some might prefer
> > crawling in ten smaller batches over two huge crawls due to various
> > technical reasons).
> >
> > Anyway, never mind if you don't want to bother about my silly question
> > :-)
> >
> > Regards,
> > Lukas
> >
> > On 8/4/06, Matthew Holt <[EMAIL PROTECTED]> wrote:
> >> Last email regarding this script. I found a bug in it that is sporadic
> >> (I think it only affected some setups). However, since it would be
> >> a problem sometimes, I refactored the script. I'd suggest you redownload
> >> the script if you are using it.
> >>
> >> Matt
> >>
> >> Matthew Holt wrote:
> >> > I'm currently pretty busy at work. If I have time, I'll do it later.
> >> >
> >> > The version 0.8 recrawl script has a working version online now. I
> >> > temporarily modified it on the website yesterday when I ran into some
> >> > problems, but I tested it further and the actual working code is
> >> > up now. So if you got it off the web site any time yesterday, I
> >> > would redownload the script.
> >> >
> >> > Matt
> >> >
> >> > Lourival Júnior wrote:
> >> >> Hi Matthew!
> >> >>
> >> >> Could you update the script for version 0.7.2 with the same
> >> >> functionality? I wrote a script that does this, but it doesn't work
> >> >> very well...
> >> >>
> >> >> Regards!
> >> >>
> >> >> On 8/2/06, Matthew Holt <[EMAIL PROTECTED]> wrote:
> >> >>>
> >> >>> Just letting everyone know that I updated the recrawl script on the
> >> >>> Wiki. It now merges the created segments, then deletes the old
> >> >>> segs to prevent a lot of unneeded data remaining/growing on the
> >> >>> hard drive.
> >> >>> Matt
> >> >>>
> >> >>> http://wiki.apache.org/nutch/IntranetRecrawl?action=show#head-e58e25a0b9530bb6fcdfb282fd27a207fc0aff03
[jira] Commented: (NUTCH-330) command line tool to search a Lucene index
[ http://issues.apache.org/jira/browse/NUTCH-330?page=comments#action_12426629 ]

Renaud Richardet commented on NUTCH-330:
----------------------------------------

This bug is obsolete. I just found out that Nutch already allows searching from the command line via bin/nutch org.apache.nutch.searcher.NutchBean [searchterm]. It assumes that you call it from the base of your crawl directory.

> command line tool to search a Lucene index
> ------------------------------------------
>
>                 Key: NUTCH-330
>                 URL: http://issues.apache.org/jira/browse/NUTCH-330
>             Project: Nutch
>          Issue Type: Improvement
>          Components: searcher
>    Affects Versions: 0.8
>         Environment: ubuntu
>            Reporter: Renaud Richardet
>            Priority: Minor
>         Attachments: clSearch.diff, clSearch.diff
>
>
> Tool to allow to search a Lucene index from the command line, makes
> development and testing faster
> usage: bin/nutch searchindex [index dir] [searchkeyword]
> example: bin/nutch searchindex crawl/index flowers
[jira] Commented: (NUTCH-266) hadoop bug when doing updatedb
[ http://issues.apache.org/jira/browse/NUTCH-266?page=comments#action_12426579 ]

Renaud Richardet commented on NUTCH-266:
----------------------------------------

KuroSaka, yes you can download the hadoop jar, release 0.5.0 from the project website: http://lucene.apache.org/hadoop/ and http://www.apache.org/dyn/closer.cgi/lucene/hadoop/

> hadoop bug when doing updatedb
> ------------------------------
>
>                 Key: NUTCH-266
>                 URL: http://issues.apache.org/jira/browse/NUTCH-266
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 0.8
>         Environment: windows xp, JDK 1.4.2_04
>            Reporter: Eugen Kochuev
>             Fix For: 0.9.0, 0.8.1
>
>         Attachments: patch.diff, patch_hadoop-0.5.0.diff
>
>
> I constantly get the following error message
> 060508 230637 Running job: job_pbhn3t
> 060508 230637 c:/nutch/crawl-20060508230625/crawldb/current/part-0/data:0+245
> 060508 230637 c:/nutch/crawl-20060508230625/segments/20060508230628/crawl_fetch/part-0/data:0+296
> 060508 230637 c:/nutch/crawl-20060508230625/segments/20060508230628/crawl_parse/part-0:0+5258
> 060508 230637 job_pbhn3t
> java.io.IOException: Target /tmp/hadoop/mapred/local/reduce_qnd5sx/map_qjp7tf.out already exists
>         at org.apache.hadoop.fs.FileUtil.checkDest(FileUtil.java:162)
>         at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:62)
>         at org.apache.hadoop.fs.LocalFileSystem.renameRaw(LocalFileSystem.java:191)
>         at org.apache.hadoop.fs.FileSystem.rename(FileSystem.java:306)
>         at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:101)
> Exception in thread "main" java.io.IOException: Job failed!
>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:341)
>         at org.apache.nutch.crawl.CrawlDb.update(CrawlDb.java:54)
>         at org.apache.nutch.crawl.Crawl.main(Crawl.java:114)
Re: Patch: deflate encoding
On 8/8/06, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
> Forgot to say - attachments get stripped. Please put them in JIRA.

Done that, see https://issues.apache.org/jira/browse/NUTCH-345

Cheers
Jan-Pascal
[jira] Created: (NUTCH-345) Add support for Content-Encoding: deflated
Add support for Content-Encoding: deflated
------------------------------------------

                 Key: NUTCH-345
                 URL: http://issues.apache.org/jira/browse/NUTCH-345
             Project: Nutch
          Issue Type: Improvement
          Components: fetcher
    Affects Versions: 0.9.0
            Reporter: Pascal Beis
            Priority: Minor
         Attachments: nutch-deflate.patch

Add support for the "deflated" content-encoding, next to the already implemented GZIP content-encoding. Patch attached. See also the "Patch: deflate encoding" thread on nutch-dev on August 7/8 2006.
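For readers unfamiliar with the encoding, here is a minimal sketch of how a deflate-compressed response body can be decoded with the standard java.util.zip classes. It is not the attached nutch-deflate.patch, and it does not use any Nutch API; the class name DeflateSketch and the methods inflate and inflateBestEffort are hypothetical. Note that the HTTP content-coding token is "deflate", which nominally means zlib-wrapped data, although some servers send a raw deflate stream instead, so the sketch tries the zlib form first and falls back to the raw form.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;
import java.util.zip.Inflater;
import java.util.zip.InflaterInputStream;

// Hypothetical helper, independent of the fetcher code.
final class DeflateSketch {

  /** Inflates the data; nowrap=false expects a zlib header, nowrap=true a raw stream. */
  static byte[] inflate(byte[] compressed, boolean nowrap) throws IOException {
    // The Inflater documentation asks for one extra dummy input byte in nowrap mode.
    byte[] input = nowrap ? Arrays.copyOf(compressed, compressed.length + 1) : compressed;
    Inflater inflater = new Inflater(nowrap);
    try (InputStream in = new InflaterInputStream(new ByteArrayInputStream(input), inflater)) {
      ByteArrayOutputStream out = new ByteArrayOutputStream();
      byte[] buf = new byte[4096];
      int n;
      while ((n = in.read(buf)) != -1) {
        out.write(buf, 0, n);
      }
      return out.toByteArray();
    } finally {
      inflater.end(); // release native resources
    }
  }

  /** Tries the zlib-wrapped form first, then falls back to raw deflate. */
  static byte[] inflateBestEffort(byte[] compressed) throws IOException {
    try {
      return inflate(compressed, false);
    } catch (IOException zlibFailure) {
      return inflate(compressed, true);
    }
  }
}
```

A caller would apply inflateBestEffort to the response bytes only when the Content-Encoding header names the deflate coding, in the same place where gzip-encoded bodies are already unpacked.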