date:20060602

[jira] Created: (NUTCH-294) Topic-maps of related searchwords

2006-06-02 Thread Stefan Neufeind (JIRA)

Topic-maps of related searchwords
-

 Key: NUTCH-294
 URL: http://issues.apache.org/jira/browse/NUTCH-294
 Project: Nutch
Type: New Feature

  Components: searcher  
Reporter: Stefan Neufeind


Would it be possible to offer users topic-maps? That is, when you search for 
something, you also get topic-related words that might be of interest to you. I 
wonder if that's somehow possible with the ngram-index for "did you mean" (see 
the separate feature-enhancement-bug for this), but we'd need to have a relation 
between words (in what context do they occur).
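
One simple way to get such a relation between words is to count which words occur together in the same document. The sketch below only illustrates that idea in plain Java; the class `RelatedWords` and its methods are made-up names, and it does not use the Nutch or Lucene index at all.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of Nutch: counts how often two words
// appear in the same document and suggests the most frequent partners.
final class RelatedWords {
  private final Map<String, Map<String, Integer>> counts = new HashMap<>();

  // Register one document; words are split on non-letter characters.
  void addDocument(String text) {
    Set<String> words = new HashSet<>();
    for (String w : text.toLowerCase().split("[^\\p{L}]+")) {
      if (!w.isEmpty()) words.add(w);
    }
    for (String a : words) {
      for (String b : words) {
        if (!a.equals(b)) {
          counts.computeIfAbsent(a, k -> new HashMap<>()).merge(b, 1, Integer::sum);
        }
      }
    }
  }

  // Return up to 'limit' words that co-occur most often with 'word'.
  List<String> related(String word, int limit) {
    Map<String, Integer> partners =
        counts.getOrDefault(word.toLowerCase(), new HashMap<>());
    List<String> result = new ArrayList<>(partners.keySet());
    // Highest count first; ties are broken alphabetically for a stable order.
    result.sort(Comparator.<String>comparingInt(w -> -partners.get(w))
        .thenComparing(Comparator.naturalOrder()));
    return result.subList(0, Math.min(limit, result.size()));
  }
}
```

Counting per document is the crudest possible notion of "context"; a real implementation would probably want a smaller window (sentence or paragraph) and some weighting against very common words.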

For the web frontend, trees are usually used, which for some users offer quite 
impressive eye-candy :-) E.g. see this advertisement by Novell, where I've just 
seen a similar topic-map as well:
http://www.novell.com/de-de/company/advertising/defineyouropen.html

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira

[jira] Commented: (NUTCH-282) Showing too few results on a page (Paging not correct)

2006-06-02 Thread Stefan Groschupf (JIRA)

[ 
http://issues.apache.org/jira/browse/NUTCH-282?page=comments#action_12414435 ] 

Stefan Groschupf commented on NUTCH-282:


Is that related to the host grouping we discussed? If so, can we close this 
bug?

 Showing too few results on a page (Paging not correct)
 --

  Key: NUTCH-282
  URL: http://issues.apache.org/jira/browse/NUTCH-282
  Project: Nutch
 Type: Bug

   Components: web gui
 Versions: 0.8-dev
 Reporter: Stefan Neufeind


 I did a search and got back the  value itemsPerPage from opensearch. But 
 the output shows results 1-8 and I have a total of 46 searchresults.
 Same happens for the webinterface.
 Why aren't enough results fetched?
 The problem might be somewhere in the area where Nutch should only display 
 a certain number of pages per site.
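
To illustrate the suspicion above: if a per-site limit is applied after one page of raw hits has already been fetched, the page can end up shorter than requested. The sketch below is a hypothetical illustration in plain Java, not the actual Nutch search code; `PerSiteLimit` and its parameters are made-up names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration, not Nutch code: applies a per-site limit
// to one page of raw hits that has already been cut to the page size.
final class PerSiteLimit {
  // Keep at most hitsPerSite entries for each host, preserving order.
  static List<String> limit(List<String> hostsOfRawHits, int hitsPerSite) {
    Map<String, Integer> seen = new HashMap<>();
    List<String> kept = new ArrayList<>();
    for (String host : hostsOfRawHits) {
      int n = seen.merge(host, 1, Integer::sum);
      if (n <= hitsPerSite) kept.add(host);
    }
    return kept;
  }
}
```

With ten raw hits where two hosts contribute four hits each and a limit of three per site, only eight hits survive, even though more matching documents exist in the index.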


[jira] Commented: (NUTCH-286) Handling common error-pages as 404

2006-06-02 Thread Stefan Groschupf (JIRA)

[ 
http://issues.apache.org/jira/browse/NUTCH-286?page=comments#action_12414439 ] 

Stefan Groschupf commented on NUTCH-286:


This is difficult to realize, since the HTTP error code is read from the 
response in the fetcher and set in the protocol status, whereas content 
analysis can only be done during parsing. 
Also, such pages normally do not get a high OPIC score and should not be in the 
top search results. 
However, this is a wrongly configured HTTP server response, so you may want to 
open a bug in the typo3 issue tracker. 
Should we close this issue?

 Handling common error-pages as 404
 --

  Key: NUTCH-286
  URL: http://issues.apache.org/jira/browse/NUTCH-286
  Project: Nutch
 Type: Improvement

 Reporter: Stefan Neufeind


 Idea: Some pages from some software-packages/scripts report an HTTP 200 OK 
 even though a specific page could not be found. An example I just found is:
 http://www.deteimmobilien.de/unternehmen/nbjmup;Uipnbt/IfsctuAefufjnnpcjmjfo/ef
 That's a typo3-page explaining in its standard-layout and wording: "The 
 requested page did not exist or was inaccessible."
 So I had the idea that somebody might create a plugin that could find commonly 
 used formulations for "page does not exist" etc. and turn the page into a 404 
 before feeding it into the nutch-index, although the server responded 
 with status 200 OK.
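
The phrase matching proposed above could look roughly like the following. This is only a sketch under assumptions: `SoftNotFoundDetector` is a made-up class, not an existing Nutch plugin or extension point, and the phrase list is an example that would need tuning for each kind of site software.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch, not an existing Nutch plugin: flags pages whose
// text contains a well-known "page not found" phrase.
final class SoftNotFoundDetector {
  // Example phrases only; the first one is the typo3 wording quoted above.
  private static final List<String> PHRASES = Arrays.asList(
      "the requested page did not exist or was inaccessible",
      "page not found",
      "page does not exist");

  // Returns true if the page text contains one of the known phrases.
  static boolean looksLikeNotFound(String pageText) {
    if (pageText == null) return false;
    String text = pageText.toLowerCase(Locale.ROOT);
    for (String phrase : PHRASES) {
      if (text.contains(phrase)) return true;
    }
    return false;
  }
}
```

A check like this is a heuristic: a legitimate page that merely quotes one of the phrases would be flagged as well, which is one reason to treat the result as a hint rather than as a real status code.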


[jira] Commented: (NUTCH-292) OpenSearchServlet: OutOfMemoryError: Java heap space

2006-06-02 Thread Stefan Groschupf (JIRA)

[ 
http://issues.apache.org/jira/browse/NUTCH-292?page=comments#action_12414443 ] 

Stefan Groschupf commented on NUTCH-292:


+1, Can someone create a clean patch file?

 OpenSearchServlet: OutOfMemoryError: Java heap space
 

  Key: NUTCH-292
  URL: http://issues.apache.org/jira/browse/NUTCH-292
  Project: Nutch
 Type: Bug

   Components: web gui
 Versions: 0.8-dev
 Reporter: Stefan Neufeind
 Priority: Critical
  Attachments: summarizer.diff

 java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap space
   
 org.apache.nutch.searcher.FetchedSegments.getSummary(FetchedSegments.java:203)
   org.apache.nutch.searcher.NutchBean.getSummary(NutchBean.java:329)
   
 org.apache.nutch.searcher.OpenSearchServlet.doGet(OpenSearchServlet.java:155)
   javax.servlet.http.HttpServlet.service(HttpServlet.java:689)
   javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
 The URL I use is:
 [...]something[...]/opensearch?query=mysearch&start=0&hitsPerSite=3&hitsPerPage=20&sort=url
 It seems to be a problem specific to the date I'm working with. Moving the 
 start from 0 to 10 or changing the query works fine.
 Or maybe it doesn't have to do with sorting but it's just that I hit one bad 
 search-result that has a broken summary?
 !! The problem is repeatable. So if anybody has an idea where to search / 
 what to fix, I can easily try that out !!


[jira] Commented: (NUTCH-291) OpenSearchServlet should return date as well as lastModified

2006-06-02 Thread Stefan Groschupf (JIRA)

[ 
http://issues.apache.org/jira/browse/NUTCH-291?page=comments#action_12414445 ] 

Stefan Groschupf commented on NUTCH-291:


lastModified will only be indexed if you switch on the index-more plugin.
If you think the way lastModified and date are stored in the index should 
change, please submit a patch for MoreIndexingFilter.

 OpenSearchServlet should return date as well as lastModified
 

  Key: NUTCH-291
  URL: http://issues.apache.org/jira/browse/NUTCH-291
  Project: Nutch
 Type: Improvement

   Components: web gui
 Versions: 0.8-dev
 Reporter: Stefan Neufeind
  Attachments: NUTCH-291-unfinished.patch

 Currently lastModified is provided by OpenSearchServlet - but only if 
 the lastModified date is known.
 Since you can sort by date (which is lastModified or, if that is not present, 
 the fetch date), it might be useful if OpenSearchServlet could provide date as 
 well.
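
The fallback described above (date is lastModified or, if that is missing, the fetch date) amounts to the following. This is a minimal sketch with made-up names, not the MoreIndexingFilter or OpenSearchServlet code.

```java
// Hypothetical sketch: picks the value a "date" field would carry.
final class DateField {
  // lastModified is null when the web server did not send the header;
  // in that case the time of the fetch is used instead.
  static long effectiveDate(Long lastModified, long fetchTime) {
    return lastModified != null ? lastModified : fetchTime;
  }
}
```

Returning this value as a separate date element would leave lastModified untouched for the cases where the server really reported it.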


[jira] Commented: (NUTCH-290) parse-pdf: Garbage indexed when text-extraction not allowed

2006-06-02 Thread Stefan Groschupf (JIRA)

[ 
http://issues.apache.org/jira/browse/NUTCH-290?page=comments#action_12414448 ] 

Stefan Groschupf commented on NUTCH-290:


If a parser throws an exception (Fetcher, line 261):

  try {
    parse = this.parseUtil.parse(content);
    parseStatus = parse.getData().getStatus();
  } catch (Exception e) {
    parseStatus = new ParseStatus(e);
  }
  if (!parseStatus.isSuccess()) {
    LOG.warning("Error parsing: " + key + ": " + parseStatus);
    parse = parseStatus.getEmptyParse(getConf());
  }

then we use the empty parse object, and an empty parse contains no text at 
all (see getText):

  private static class EmptyParseImpl implements Parse {

    private ParseData data = null;

    public EmptyParseImpl(ParseStatus status, Configuration conf) {
      data = new ParseData(status, "", new Outlink[0],
                           new Metadata(), new Metadata());
      data.setConf(conf);
    }

    public ParseData getData() {
      return data;
    }

    public String getText() {
      return "";
    }
  }

So the problem should be somewhere else.

 parse-pdf: Garbage indexed when text-extraction not allowed
 ---

  Key: NUTCH-290
  URL: http://issues.apache.org/jira/browse/NUTCH-290
  Project: Nutch
 Type: Bug

   Components: indexer
 Versions: 0.8-dev
 Reporter: Stefan Neufeind
  Attachments: NUTCH-290-canExtractContent.patch

 It seems that garbage (or undecoded text?) is indexed when text-extraction 
 for a PDF is not allowed.
 Example-PDF:
 http://www.task-switch.nl/Dutch/articles/Management_en_Architectuur_v3.pdf


[jira] Updated: (NUTCH-292) OpenSearchServlet: OutOfMemoryError: Java heap space

2006-06-02 Thread Stefan Neufeind (JIRA)

 [ http://issues.apache.org/jira/browse/NUTCH-292?page=all ]

Stefan Neufeind updated NUTCH-292:
--

Attachment: NUTCH-292-summarizer08.diff

As per demand, here is the patch.

Please note that I have not tested it thoroughly myself. But the patch 
looks fine and makes sense :-) Oh, and it compiles cleanly ...

 OpenSearchServlet: OutOfMemoryError: Java heap space
 

  Key: NUTCH-292
  URL: http://issues.apache.org/jira/browse/NUTCH-292
  Project: Nutch
 Type: Bug

   Components: web gui
 Versions: 0.8-dev
 Reporter: Stefan Neufeind
 Priority: Critical
  Attachments: NUTCH-292-summarizer08.diff, summarizer.diff

 java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap space
   
 org.apache.nutch.searcher.FetchedSegments.getSummary(FetchedSegments.java:203)
   org.apache.nutch.searcher.NutchBean.getSummary(NutchBean.java:329)
   
 org.apache.nutch.searcher.OpenSearchServlet.doGet(OpenSearchServlet.java:155)
   javax.servlet.http.HttpServlet.service(HttpServlet.java:689)
   javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
 The URL I use is:
 [...]something[...]/opensearch?query=mysearch&start=0&hitsPerSite=3&hitsPerPage=20&sort=url
 It seems to be a problem specific to the date I'm working with. Moving the 
 start from 0 to 10 or changing the query works fine.
 Or maybe it doesn't have to do with sorting but it's just that I hit one bad 
 search-result that has a broken summary?
 !! The problem is repeatable. So if anybody has an idea where to search / 
 what to fix, I can easily try that out !!


[jira] Closed: (NUTCH-287) Exception when searching with sort

2006-06-02 Thread Stefan Groschupf (JIRA)

 [ http://issues.apache.org/jira/browse/NUTCH-287?page=all ]
 
Stefan Groschupf closed NUTCH-287:
--

Resolution: Won't Fix

http://www.mail-archive.com/nutch-user%40lucene.apache.org/msg04696.html

 Exception when searching with sort
 --

  Key: NUTCH-287
  URL: http://issues.apache.org/jira/browse/NUTCH-287
  Project: Nutch
 Type: Bug

   Components: searcher
 Versions: 0.8-dev
 Reporter: Stefan Neufeind
 Priority: Critical


 Running a search with sort=url works.
 But when using sort=title I get the following exception.
 2006-05-25 14:04:25 StandardWrapperValve[jsp]: Servlet.service() for servlet 
 jsp threw exception
 java.lang.RuntimeException: Unknown sort value type!
 at 
 org.apache.nutch.searcher.IndexSearcher.translateHits(IndexSearcher.java:157)
 at 
 org.apache.nutch.searcher.IndexSearcher.search(IndexSearcher.java:95)
 at org.apache.nutch.searcher.NutchBean.search(NutchBean.java:239)
 at org.apache.jsp.search_jsp._jspService(search_jsp.java:257)
 at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:94)
 at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
 at 
 org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:324)
 at 
 org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:292)
 at org.apache.jasper.servlet.JspServlet.service(JspServlet.java:236)
 at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
 at 
 org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:252)
 at 
 org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:173)
 at 
 org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:214)
 at 
 org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104)
 at 
 org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520)
 at 
 org.apache.catalina.core.StandardContextValve.invokeInternal(StandardContextValve.java:198)
 at 
 org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:152)
 at 
 org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104)
 at 
 org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520)
 at 
 org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:137)
 at 
 org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104)
 at 
 org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:118)
 at 
 org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:102)
 at 
 org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520)
 at 
 org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
 at 
 org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104)
 at 
 org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520)
 at 
 org.apache.catalina.core.ContainerBase.invoke(ContainerBase.java:929)
 at 
 org.apache.coyote.tomcat5.CoyoteAdapter.service(CoyoteAdapter.java:160)
 at 
 org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:799)
 at 
 org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.processConnection(Http11Protocol.java:705)
 at 
 org.apache.tomcat.util.net.TcpWorkerThread.runIt(PoolTcpEndpoint.java:577)
 at 
 org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:684)
 at java.lang.Thread.run(Thread.java:595)
 What is in those lines is:
   WritableComparable sortValue;   // convert value to writable
   if (sortField == null) {
     sortValue = new FloatWritable(scoreDocs[i].score);
   } else {
     Object raw = ((FieldDoc)scoreDocs[i]).fields[0];
     if (raw instanceof Integer) {
       sortValue = new IntWritable(((Integer)raw).intValue());
     } else if (raw instanceof Float) {
       sortValue = new FloatWritable(((Float)raw).floatValue());
     } else if (raw instanceof String) {
       sortValue = new UTF8((String)raw);
     } else {
       throw new RuntimeException("Unknown sort value type!");
     }
   }
 So I thought that maybe raw is an instance of something strange and tried 
 raw.getClass().getName() and also raw.toString() to track the cause down - but 
 that always resulted in a NullPointerException. So it seems that raw is 
 null for some strange reason.
 When I try with title2 (or something non-existing) I get a different error 
 that title2 is unknown / not indexed. So I suspect that title

[jira] Closed: (NUTCH-284) NullPointerException during index

2006-06-02 Thread Stefan Groschupf (JIRA)

 [ http://issues.apache.org/jira/browse/NUTCH-284?page=all ]
 
Stefan Groschupf closed NUTCH-284:
--

Resolution: Won't Fix

Yes, I was missing index-basic.

 NullPointerException during index
 -

  Key: NUTCH-284
  URL: http://issues.apache.org/jira/browse/NUTCH-284
  Project: Nutch
 Type: Bug

   Components: indexer
 Versions: 0.8-dev
 Reporter: Stefan Neufeind


 For  quite a few this reduce  sort has been going on. Then it fails. What 
 could be wrong with this?
 060524 212613 reduce  sort
 060524 212614 reduce  sort
 060524 212615 reduce  sort
 060524 212615 found resource common-terms.utf8 at 
 file:/home/mm/nutch-nightly-prod/conf/common-terms.utf8
 060524 212615 found resource common-terms.utf8 at 
 file:/home/mm/nutch-nightly-prod/conf/common-terms.utf8
 060524 212619 Optimizing index.
 060524 212619 job_jlbhhm
 java.lang.NullPointerException
 at 
 org.apache.nutch.indexer.Indexer$OutputFormat$1.write(Indexer.java:111)
 at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:269)
 at org.apache.nutch.indexer.Indexer.reduce(Indexer.java:253)
 at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:282)
 at 
 org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:114)
 Exception in thread main java.io.IOException: Job failed!
 at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:341)
 at org.apache.nutch.indexer.Indexer.index(Indexer.java:287)
 at org.apache.nutch.indexer.Indexer.main(Indexer.java:304)


[jira] Commented: (NUTCH-284) NullPointerException during index

2006-06-02 Thread Stefan Groschupf (JIRA)

[ 
http://issues.apache.org/jira/browse/NUTCH-284?page=comments#action_12414453 ] 

Stefan Groschupf commented on NUTCH-284:


Please try to discuss such things on the user mailing list first, before 
opening an issue. 
Maintaining the issue tracker is very time consuming. But if there is a bug, 
please continue to open bug reports. :)
Thanks.




[jira] Commented: (NUTCH-281) cached.jsp: base-href needs to be outside comments

2006-06-02 Thread Stefan Groschupf (JIRA)

[ 
http://issues.apache.org/jira/browse/NUTCH-281?page=comments#action_12414454 ] 

Stefan Groschupf commented on NUTCH-281:


Can you submit a patch file?

 cached.jsp: base-href needs to be outside comments
 --

  Key: NUTCH-281
  URL: http://issues.apache.org/jira/browse/NUTCH-281
  Project: Nutch
 Type: Bug

   Components: web gui
 Reporter: Stefan Neufeind
 Priority: Trivial


 see cached.jsp
 <base href=...>
 does not take effect when showing a cached page because of the comments 
 around it


[jira] Commented: (NUTCH-274) Empty row in/at end of URL-list results in error

2006-06-02 Thread Stefan Groschupf (JIRA)

[ 
http://issues.apache.org/jira/browse/NUTCH-274?page=comments#action_12414457 ] 

Stefan Groschupf commented on NUTCH-274:


Should we fix this in Hadoop's TextInputFormat, so that it ignores empty lines, 
or in the Injector?
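
Whichever place is chosen, the fix itself is small. The sketch below shows the filtering step in isolation; `SeedListCleaner` is a made-up helper for illustration and is neither the Injector nor the TextInputFormat code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not the actual Injector or TextInputFormat code:
// drops blank lines from a URL list before it is handed on.
final class SeedListCleaner {
  // Returns the trimmed, non-empty lines in their original order.
  static List<String> nonEmptyUrls(List<String> lines) {
    List<String> urls = new ArrayList<>();
    for (String line : lines) {
      String trimmed = line.trim();
      if (!trimmed.isEmpty()) urls.add(trimmed);
    }
    return urls;
  }
}
```

Filtering in the Injector would keep the change local to Nutch, while filtering in the input format would change what every job reading text input sees.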

 Empty row in/at end of URL-list results in error
 

  Key: NUTCH-274
  URL: http://issues.apache.org/jira/browse/NUTCH-274
  Project: Nutch
 Type: Bug

 Versions: 0.8-dev
  Environment: nightly-2006-05-20
 Reporter: Stefan Neufeind
 Priority: Minor


 This is minor - but it's a little unclean :-)
 Reproduce: Have a URL-file with one URL followed by a newline, thus producing 
 an empty line.
 Outcome: Fetcher-threads try to fetch two URLs at the same time. The first one 
 is fine - but the second is empty and therefore fails proper protocol-detection.
 60521 022639   Nutch Analysis (org.apache.nutch.analysis.NutchAnalyzer)
 060521 022639   Nutch Query Filter (org.apache.nutch.searcher.QueryFilter)
 060521 022639 found resource parse-plugins.xml at 
 file:/home/mm/nutch-nightly/conf/parse-plugins.xml
 060521 022639 Using URL normalizer: org.apache.nutch.net.BasicUrlNormalizer
 060521 022639 fetching http://www.bild.de/
 060521 022639 fetching 
 060521 022639 fetch of  failed with: 
 org.apache.nutch.protocol.ProtocolNotFound: java.net.MalformedURLException: 
 no protocol: 
 060521 022639 http.proxy.host = null
 060521 022639 http.proxy.port = 8080
 060521 022639 http.timeout = 1
 060521 022639 http.content.limit = 65536
 060521 022639 http.agent = NutchCVS/0.8-dev (Nutch; 
 http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
 060521 022639 fetcher.server.delay = 1000
 060521 022639 http.max.delays = 1000
 060521 022640 ParserFactory:Plugin: org.apache.nutch.parse.text.TextParser 
 mapped to contentType text/xml via parse-plugins.xml, but
  its plugin.xml file does not claim to support contentType: text/xml
 060521 022640 ParserFactory:Plugin: org.apache.nutch.parse.html.HtmlParser 
 mapped to contentType text/xml via parse-plugins.xml, but
  its plugin.xml file does not claim to support contentType: text/xml
 060521 022640 ParserFactory: Plugin: org.apache.nutch.parse.rss.RSSParser 
 mapped to contentType text/xml via parse-plugins.xml, but 
 not enabled via plugin.includes in nutch-default.xml
 060521 022640 Using Signature impl: org.apache.nutch.crawl.MD5Signature
 060521 022640  map 0%  reduce 0%
 060521 022640 1 pages, 1 errors, 1.0 pages/s, 40 kb/s, 
 060521 022640 1 pages, 1 errors, 1.0 pages/s, 40 kb/s, 


[jira] Commented: (NUTCH-290) parse-pdf: Garbage indexed when text-extraction not allowed

2006-06-02 Thread Stefan Neufeind (JIRA)

[ 
http://issues.apache.org/jira/browse/NUTCH-290?page=comments#action_12414458 ] 

Stefan Neufeind commented on NUTCH-290:
---

But if one plugin fails in 0.8-dev, isn't the next one used? I understand that 
in the default-config the text-parser would be used as the last-resort fallback.

Also I'm not sure where the summary-text comes from if I use the patch above to 
avoid throwing an exception and return empty parse-data instead.



[jira] Commented: (NUTCH-291) OpenSearchServlet should return date as well as lastModified

2006-06-02 Thread Stefan Neufeind (JIRA)

[ 
http://issues.apache.org/jira/browse/NUTCH-291?page=comments#action_12414466 ] 

Stefan Neufeind commented on NUTCH-291:
---

Which way is preferable: always setting lastModified even when it was not 
returned by the webserver (maybe unclean), or always returning date as well 
(cleaner?)?



[jira] Commented: (NUTCH-290) parse-pdf: Garbage indexed when text-extraction not allowed

2006-06-02 Thread Stefan Groschupf (JIRA)

[ 
http://issues.apache.org/jira/browse/NUTCH-290?page=comments#action_12414469 ] 

Stefan Groschupf commented on NUTCH-290:


As far as I understand the code, the next parser is only used if the previous 
parser returns with an unsuccessful parsing status. If the parser throws an 
exception, this exception is not caught in the parseutil at all.
So, to solve this problem, the pdf parser should throw an exception and not 
report an unsuccessful status, shouldn't it?
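
The behaviour described in this comment can be pictured with a small stand-alone sketch. The types below are hypothetical and only mimic that description (a `null` result stands in for an unsuccessful parse status); they are not the real ParseUtil or Parser API.

```java
import java.util.List;

// Hypothetical types for illustration; they only mimic the behaviour
// described in the comment and are not Nutch's ParseUtil API.
final class ParserChain {
  interface Parser {
    // Returns the extracted text, or null to report an unsuccessful parse.
    String parse(String content) throws Exception;
  }

  // Tries each parser in order. A null result moves on to the next parser;
  // an exception is not caught here and therefore stops the whole chain.
  static String parse(List<Parser> parsers, String content) throws Exception {
    for (Parser p : parsers) {
      String text = p.parse(content);
      if (text != null) return text;
    }
    return null;
  }
}
```

Under this model, a PDF parser that reports failure lets a later text parser run over the raw bytes, whereas one that throws leaves it to the caller to substitute an empty parse.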




[jira] Closed: (NUTCH-286) Handling common error-pages as 404

2006-06-02 Thread Stefan Groschupf (JIRA)

 [ http://issues.apache.org/jira/browse/NUTCH-286?page=all ]
 
Stefan Groschupf closed NUTCH-286:
--

Resolution: Won't Fix

I hope everybody agrees with the statement: we cannot detect HTTP response 
codes based on the returned HTML content.
Pruning the index is a good way to solve the problem.



[jira] Commented: (NUTCH-290) parse-pdf: Garbage indexed when text-extraction not allowed

2006-06-02 Thread Stefan Neufeind (JIRA)

[ 
http://issues.apache.org/jira/browse/NUTCH-290?page=comments#action_12414477 ] 

Stefan Neufeind commented on NUTCH-290:
---

But as I understand the plugin, it still extracts as much as possible 
(meta-data) from the PDF. So if indexing is not allowed but this is a PDF, then 
returning empty text as the document-body should be fine - shouldn't it? 
Nothing except a PDF-plugin will be able to handle a PDF correctly in this 
case.

Stefan G., can you point out why I see binary data as the summary for a PDF, 
and whether there is a possible fix for it in the context of this current 
bug here?



[jira] Created: (NUTCH-295) More description for fetcher.threads.fetch property

2006-06-02 Thread Dennis Kubes (JIRA)

More description for fetcher.threads.fetch property
---

 Key: NUTCH-295
 URL: http://issues.apache.org/jira/browse/NUTCH-295
 Project: Nutch
Type: Improvement

  Components: fetcher  
Versions: 0.8-dev
Reporter: Dennis Kubes
Priority: Minor


Added some description to the fetcher.threads.fetch property to explain the 
number of threads running in a cluster. Patch is attached.


[jira] Updated: (NUTCH-295) More description for fetcher.threads.fetch property

2006-06-02 Thread Dennis Kubes (JIRA)

 [ http://issues.apache.org/jira/browse/NUTCH-295?page=all ]

Dennis Kubes updated NUTCH-295:
---

Attachment: fetcher_threads_desc.patch

More description for fetcher.threads.fetch property as relating to running in 
distributed mode.

 More description for fetcher.threads.fetch property
 ---

  Key: NUTCH-295
  URL: http://issues.apache.org/jira/browse/NUTCH-295
  Project: Nutch
 Type: Improvement

   Components: fetcher
 Versions: 0.8-dev
 Reporter: Dennis Kubes
 Priority: Minor
  Attachments: fetcher_threads_desc.patch

 Added some description to the fetcher.threads.fetch property to explain the 
 number of threads running in a cluster. Patch is attached.

