[jira] Commented: (NUTCH-377) Add possibility to search for multiple values

2006-10-01 Thread Stefan Neufeind (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-377?page=comments#action_12439018 ] 

Stefan Neufeind commented on NUTCH-377:
---

Hmm, I'm not sure I understand how to do that. There is one part that adds 
prohibited or required phrases, but ...

To my understanding, the above example is parsed "as is" into one string for 
the whole "site:...|..." clause, isn't it? If so, could the split maybe be done 
where the site-command is evaluated? I had a look at query-site - but there 
doesn't seem to be much code over there ...

What would be a good syntax that the Nutch community could agree on? And could 
you maybe put together an initial patch for that?
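
Purely as an illustration of the string-parsing/query-building idea (not a 
patch; apart from the "site" field name, all names here are made up):

  import org.apache.lucene.index.Term;
  import org.apache.lucene.search.BooleanClause;
  import org.apache.lucene.search.BooleanQuery;
  import org.apache.lucene.search.TermQuery;

  // Sketch: split the value of a site: clause on '|' and OR the parts
  // together as SHOULD-clauses of a Lucene BooleanQuery.
  public class SiteOrQuerySketch {
      public static BooleanQuery buildSiteQuery(String value) {
          BooleanQuery siteQuery = new BooleanQuery();
          String[] sites = value.split("\\|");
          for (int i = 0; i < sites.length; i++) {
              // SHOULD acts as OR between the listed sites
              siteQuery.add(new TermQuery(new Term("site", sites[i])),
                            BooleanClause.Occur.SHOULD);
          }
          return siteQuery;
      }
  }

For "someword site:www.example.org|www.apache.org" this sub-query would then 
be ANDed (MUST) with the clauses for the ordinary query terms.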

> Add possibility to search for multiple values
> -
>
> Key: NUTCH-377
> URL: http://issues.apache.org/jira/browse/NUTCH-377
> Project: Nutch
>  Issue Type: Improvement
>  Components: searcher
>Reporter: Stefan Neufeind
>
> Searches with boolean operators (AND or OR) are not (yet) possible. All 
> search-items are always combined with AND.
> But it would be nice to have the possibility to allow multiple values for a 
> certain field. Maybe that could be done using a separator?
> As an example you might want to search for:
> someword site:www.example.org|www.apache.org
> Which (to my understanding) would allow searching for one or more words with 
> a restriction to those two sites. It would avoid having to implement AND and 
> OR fully (maybe even including brackets) but would cover a few often-used 
> cases, imho.
> Easy/hard to do? To my understanding Lucene itself allows AND/OR-searches. So 
> it might basically be a matter of string-parsing and query-building towards 
> Lucene?

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Created: (NUTCH-377) Add possibility to search for multiple values

2006-10-01 Thread Stefan Neufeind (JIRA)
Add possibility to search for multiple values
-

 Key: NUTCH-377
 URL: http://issues.apache.org/jira/browse/NUTCH-377
 Project: Nutch
  Issue Type: Improvement
  Components: searcher
Reporter: Stefan Neufeind


Searches with boolean operators (AND or OR) are not (yet) possible. All 
search-items are always combined with AND.

But it would be nice to have the possibility to allow multiple values for a 
certain field. Maybe that could be done using a separator?

As an example you might want to search for:

someword site:www.example.org|www.apache.org

Which (to my understanding) would allow searching for one or more words with a 
restriction to those two sites. It would avoid having to implement AND and OR 
fully (maybe even including brackets) but would cover a few often-used cases, 
imho.

Easy/hard to do? To my understanding Lucene itself allows AND/OR-searches. So 
it might basically be a matter of string-parsing and query-building towards 
Lucene?





[jira] Commented: (NUTCH-334) I am using the search technique

2006-07-31 Thread Stefan Neufeind (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-334?page=comments#action_12424554 ] 

Stefan Neufeind commented on NUTCH-334:
---

Right, that's a real bug :-)

> I am using the search technique
> ---
>
> Key: NUTCH-334
> URL: http://issues.apache.org/jira/browse/NUTCH-334
> Project: Nutch
>  Issue Type: Bug
>Reporter: Siddharudh nadgeri
>






[jira] Commented: (NUTCH-335) Pdf summary corrupt issue

2006-07-31 Thread Stefan Neufeind (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-335?page=comments#action_12424553 ] 

Stefan Neufeind commented on NUTCH-335:
---

The problem is that in most cases I've come across, the PDF is protected and 
does not allow text-extraction. Though this could theoretically be worked 
around, it's not really allowed, afaik.

Also there is an issue for this already:
http://issues.apache.org/jira/browse/NUTCH-290

The problem that still needs to be worked around, imho, is that no text should 
be shown instead - and I'd like a clarification of why the raw binary data is 
currently taken as the summary.

PS: Next time please search before opening a new issue. (Meant just as 
information, not to make anybody angry ...)

> Pdf summary corrupt issue
> -
>
> Key: NUTCH-335
> URL: http://issues.apache.org/jira/browse/NUTCH-335
> Project: Nutch
>  Issue Type: Bug
> Environment: As it is a web application it is not necessary
>Reporter: Siddharudh nadgeri
>
> I am using the Nutch search, but for PDF it is giving the summary as some 
> garbage like
> "!!"#"#"#"#"#"#"#"!$%$%$#&##'$$ ("$$$
> please provide the solution





[jira] Commented: (NUTCH-271) Meta-data per URL/site/section

2006-07-19 Thread Stefan Neufeind (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-271?page=comments#action_1246 ] 

Stefan Neufeind commented on NUTCH-271:
---

Does somebody have an existing demo-plugin for this that would read 
URL-prefixes from a file and, in case matches are found, add certain tags? I 
don't yet fully get how to do it "the elegant way" :-)
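
For illustration, here is a standalone sketch of the prefix-matching core such 
a demo-plugin would need (all names are made up; a real plugin would load the 
table from a file and implement Nutch's indexing-filter extension point to add 
the tag as an extra field):

  import java.util.Iterator;
  import java.util.LinkedHashMap;
  import java.util.Map;

  // Sketch: map URL-prefixes (read from a config file in practice) to
  // meta-tag values; the first matching prefix wins.
  public class PrefixTagMatcherSketch {
      private final Map prefixToTag = new LinkedHashMap();

      public void addRule(String urlPrefix, String tag) {
          prefixToTag.put(urlPrefix, tag);
      }

      public String tagFor(String url) {
          for (Iterator it = prefixToTag.entrySet().iterator(); it.hasNext();) {
              Map.Entry e = (Map.Entry) it.next();
              if (url.startsWith((String) e.getKey())) {
                  return (String) e.getValue(); // e.g. "companybranch1"
              }
          }
          return null; // no meta-tag configured for this URL
      }
  }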

> Meta-data per URL/site/section
> --
>
> Key: NUTCH-271
> URL: http://issues.apache.org/jira/browse/NUTCH-271
> Project: Nutch
>  Issue Type: New Feature
>Affects Versions: 0.7.2
>Reporter: Stefan Neufeind
>
> We have the need to index sites and attach additional meta-data-tags to them. 
> Afaik this is not yet possible, or is there a "workaround" I don't see? What 
> I think of is using meta-tags per start-url, only indexing content below that 
> URL, and having the ability to limit searches by those meta-tags. E.g.
> http://www.example1.com/something1/   -> meta-tag "companybranch1"
> http://www.example2.com/something2/   -> meta-tag "companybranch2"
> http://www.example3.com/something3/   -> meta-tag "companybranch1"
> http://www.example4.com/something4/   -> meta-tag "companybranch3"
> search for everything in companybranch1 or across 1 and 3 or similar





[jira] Updated: (NUTCH-279) Additions for regex-normalize

2006-07-09 Thread Stefan Neufeind (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-279?page=all ]

Stefan Neufeind updated NUTCH-279:
--

Attachment: regex-normalize2.patch

New patch with just one session-ID regex extended (now also including . - ,), 
since I came across those extra chars while using it on a common German website 
(www.bahn.de).
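
To illustrate the change, a small standalone sketch of the extended pattern 
(the real rule lives in regex-normalize.xml; the class and the exact pattern 
here are illustrative, not the patch itself):

  import java.util.regex.Pattern;

  // Illustrative only: the character class now also admits '.', '-' and ','
  // in session-id values, as seen on www.bahn.de URLs.
  public class SessionIdNormalizeSketch {
      private static final Pattern SESSION_ID = Pattern.compile(
          "(?i)[;&?]?(sid|jsessionid|phpsessid|sessionid)=[0-9a-zA-Z.,-]*");

      public static String normalize(String url) {
          return SESSION_ID.matcher(url).replaceAll("");
      }

      public static void main(String[] args) {
          // prints http://www.example.org/page?foo=1
          System.out.println(normalize(
              "http://www.example.org/page?foo=1&sessionid=ab.cd-ef,12"));
      }
  }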

> Additions for regex-normalize
> -
>
>  Key: NUTCH-279
>  URL: http://issues.apache.org/jira/browse/NUTCH-279
>  Project: Nutch
> Type: Improvement

> Versions: 0.8-dev
> Reporter: Stefan Neufeind
>  Attachments: regex-normalize.patch, regex-normalize2.patch
>
> Imho needed:
> 1) Extend the normalize-rules to cover commonly used session-IDs etc.
> 2) Ship a checker to test rules easily by hand




[jira] Commented: (NUTCH-48) "Did you mean" query enhancement/refinement feature request

2006-06-13 Thread Stefan Neufeind (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-48?page=comments#action_12415970 ] 

Stefan Neufeind commented on NUTCH-48:
--

Could somebody please have a look? I currently lack a test-system to try that 
...

> "Did you mean"  query enhancement/refignment feature request
> 
>
>  Key: NUTCH-48
>  URL: http://issues.apache.org/jira/browse/NUTCH-48
>  Project: Nutch
> Type: New Feature

>   Components: web gui
>  Environment: All platforms
> Reporter: byron miller
> Assignee: Sami Siren
> Priority: Minor
>  Attachments: did-you-mean-combined08.patch, rss-spell.patch, 
> spell-check.patch
>
> Looking to implement a "Did you mean" feature for query result pages that 
> return <= x results, invoking a response that recommends a fixed/related or 
> spell-checked query to try.
> Note from Doug to users list:
> David Spencer has worked on this some.
> http://www.searchmorph.com/weblog/index.php?id=23
> I think the code on his site might be more recent than what's committed
> to the lucene/contrib directory.




[jira] Updated: (NUTCH-305) Update crawl and url filter lists to exclude jpeg|JPEG|bmp|BMP

2006-06-09 Thread Stefan Neufeind (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-305?page=all ]

Stefan Neufeind updated NUTCH-305:
--

Attachment: suffix-urlfilter.txt

Find attached a suffix-urlfilter.txt that might be interesting to some people 
(a sample excerpt follows below). More contributions are welcome at any time. 
Maybe we should ship such a list and use the suffix-filter instead of regex to 
filter by document-extension?
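
For illustration, a hypothetical excerpt of what such a suffix-urlfilter.txt 
could contain (the actual attachment may differ; assuming the usual 
one-suffix-per-line format with '#' starting a comment):

  # image formats that rarely carry indexable text
  .jpg
  .JPG
  .jpeg
  .JPEG
  .bmp
  .BMP
  .gif
  .png
  # binary downloads
  .exe
  .zip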

> Update crawl and url filter lists to exclude jpeg|JPEG|bmp|BMP
> --
>
>  Key: NUTCH-305
>  URL: http://issues.apache.org/jira/browse/NUTCH-305
>  Project: Nutch
> Type: Bug

> Versions: 0.8-dev
> Reporter: chris finne
>  Attachments: suffix-urlfilter.txt
>





[jira] Commented: (NUTCH-294) Topic-maps of related searchwords

2006-06-06 Thread Stefan Neufeind (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-294?page=comments#action_12414962 ] 

Stefan Neufeind commented on NUTCH-294:
---

1) I enabled it in plugin.includes and restarted Tomcat - but there is no 
checkbox for me.

2) My "idea" was that maybe an index of top keywords (possibly from the "did 
you mean" plugin?) could be used, and a query could be run on it along the 
lines of "the current search appeared in NNN pages, where the top-10 keywords 
are ...". Wouldn't that work as a topic-map?

> Topic-maps of related searchwords
> -
>
>  Key: NUTCH-294
>  URL: http://issues.apache.org/jira/browse/NUTCH-294
>  Project: Nutch
> Type: New Feature

>   Components: searcher
> Reporter: Stefan Neufeind

>
> Would it be possible to offer a user "topic-maps"? It's when you search for 
> something and get topic-related words that might also be of interest to you. 
> I wonder if that's somehow possible with the ngram-index for "did you mean" 
> (see the separate feature-enhancement bug for this), but we'd need to have a 
> relation between words (in what context do they occur).
> For the webfrontend, trees are usually used - which for some users offer 
> quite impressive eye-candy :-) E.g. see this advertisement by Novell where 
> I've just seen a similar "topic-map" as well:
> http://www.novell.com/de-de/company/advertising/defineyouropen.html




[jira] Commented: (NUTCH-294) Topic-maps of related searchwords

2006-06-04 Thread Stefan Neufeind (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-294?page=comments#action_12414653 ] 

Stefan Neufeind commented on NUTCH-294:
---

I'm not sure. On a quick run I wasn't able to get the "clustering-carrot2" 
plugin to work - though I thought I'd simply need to include it.
Maybe somebody else has already worked with it and could comment on whether 
that plugin is within the scope of this feature-request.
From what I found about carrot2, it's also used to cluster data from multiple 
search-engines - not sure how that relates to topic-clusters.

> Topic-maps of related searchwords
> -
>
>  Key: NUTCH-294
>  URL: http://issues.apache.org/jira/browse/NUTCH-294
>  Project: Nutch
> Type: New Feature

>   Components: searcher
> Reporter: Stefan Neufeind

>
> Would it be possible to offer a user "topic-maps"? It's when you search for 
> something and get topic-related words that might also be of interest to you. 
> I wonder if that's somehow possible with the ngram-index for "did you mean" 
> (see the separate feature-enhancement bug for this), but we'd need to have a 
> relation between words (in what context do they occur).
> For the webfrontend, trees are usually used - which for some users offer 
> quite impressive eye-candy :-) E.g. see this advertisement by Novell where 
> I've just seen a similar "topic-map" as well:
> http://www.novell.com/de-de/company/advertising/defineyouropen.html




[jira] Commented: (NUTCH-298) if a 404 for a robots.txt is returned no page is fetched at all from the host

2006-06-04 Thread Stefan Neufeind (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-298?page=comments#action_12414647 ] 

Stefan Neufeind commented on NUTCH-298:
---

Is the description-line of this bug correct? I've been indexing pages without 
robots.txt, and I just checked that those hosts give a 404 since robots.txt 
does not exist.

> if a 404 for a robots.txt is returned no page is fetched at all from the host
> -
>
>  Key: NUTCH-298
>  URL: http://issues.apache.org/jira/browse/NUTCH-298
>  Project: Nutch
> Type: Bug

> Reporter: Stefan Groschupf
>  Fix For: 0.8-dev
>  Attachments: fixNpeRobotRuleSet.patch
>
> What happens:
> If no RobotRuleSet is in the cache for a host, we try to fetch the 
> robots.txt.
> In case the http response code is not 200 or 403 but for example 404, we do 
> "robotRules = EMPTY_RULES;" (line 402).
> EMPTY_RULES is a RobotRuleSet created with the default constructor.
> tmpEntries and entries are null and will never be changed.
> If we now try to fetch a page from that host, EMPTY_RULES is used and we 
> call isAllowed on the RobotRuleSet.
> In this case an NPE is thrown in this line:
>   if (entries == null) {
>     entries = new RobotsEntry[tmpEntries.size()];
> Possible solution:
> We can initialize tmpEntries by default and also remove the other null 
> checks and initialisations.
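
For illustration, a minimal sketch of the proposed fix (names follow the 
snippet above; RobotsEntry is stubbed and the real rule-matching is omitted):

  import java.util.ArrayList;
  import java.util.List;

  // Sketch: initialize tmpEntries in the field declaration so the
  // EMPTY_RULES instance created by the default constructor can never
  // NPE on tmpEntries.size().
  public class RobotRuleSetSketch {
      static class RobotsEntry { }

      private List tmpEntries = new ArrayList(); // never null anymore
      private RobotsEntry[] entries = null;

      public boolean isAllowed(String path) {
          if (entries == null) {
              entries = (RobotsEntry[]) tmpEntries.toArray(new RobotsEntry[0]);
          }
          if (entries.length == 0) {
              return true; // empty rules (the 404 case): everything allowed
          }
          return matches(path);
      }

      private boolean matches(String path) {
          return true; // placeholder for the real per-entry check
      }
  }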




[jira] Commented: (NUTCH-258) Once Nutch logs a SEVERE log item, Nutch fails forevermore

2006-06-04 Thread Stefan Neufeind (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-258?page=comments#action_12414646 ] 

Stefan Neufeind commented on NUTCH-258:
---

Agreed. The root cause of the loop should be identified. So I'd suggest 
turning this into a won't-fix bug - and if it occurs again somewhere, we should 
try to track down the root cause.

> Once Nutch logs a SEVERE log item, Nutch fails forevermore
> --
>
>  Key: NUTCH-258
>  URL: http://issues.apache.org/jira/browse/NUTCH-258
>  Project: Nutch
> Type: Bug

>   Components: fetcher
> Versions: 0.8-dev
>  Environment: All
> Reporter: Scott Ganyo
> Priority: Critical
>  Attachments: dumbfix.patch
>
> Once a SEVERE log item is written, Nutch shuts down any fetching forevermore. 
>  This is from the run() method in Fetcher.java:
> public void run() {
>   synchronized (Fetcher.this) {activeThreads++;} // count threads
>   try {
>     UTF8 key = new UTF8();
>     CrawlDatum datum = new CrawlDatum();
>     while (true) {
>       if (LogFormatter.hasLoggedSevere()) // something bad happened
>         break;                            // exit
>
> Notice the last two lines. They will prevent Nutch from ever fetching again 
> once this is hit, as LogFormatter stores this data in a static field.
> (Also note that "LogFormatter.hasLoggedSevere()" is also checked in 
> org.apache.nutch.net.URLFilterChecker and will disable this class as well.)
> This must be fixed or Nutch cannot be run as any kind of long-running 
> service.  Furthermore, I believe it is a poor decision to rely on a logging 
> event to determine the state of the application - this could have any number 
> of side-effects that would be extremely difficult to track down.  (As it has 
> already for me.)




[jira] Commented: (NUTCH-299) Bittorrent Parser

2006-06-04 Thread Stefan Neufeind (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-299?page=comments#action_12414643 ] 

Stefan Neufeind commented on NUTCH-299:
---

Could you briefly explain what it does? Extract meta-data and index the comment 
as "content of that page"? Or does it also follow the URL to the tracker 
(maybe) to discover other torrents etc.?

> Bittorrent Parser
> -
>
>  Key: NUTCH-299
>  URL: http://issues.apache.org/jira/browse/NUTCH-299
>  Project: Nutch
> Type: New Feature

> Reporter: Hasan Diwan
> Priority: Minor
>  Attachments: BitTorrent.jar
>
> BitTorrent information file parser




[jira] Commented: (NUTCH-290) parse-pdf: Garbage indexed when text-extraction not allowed

2006-06-02 Thread Stefan Neufeind (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-290?page=comments#action_12414477 ] 

Stefan Neufeind commented on NUTCH-290:
---

But to my understanding of the plugin, it still extracts as much as possible 
(meta-data) from the PDF. So if text-extraction is not allowed but this is a 
PDF, then returning empty text as the document-body should be fine - shouldn't 
it? Nothing except a PDF-plugin will be able to handle PDF correctly in this 
case.

Stefan G., can you point out why I see binary data as the summary for a PDF, 
and whether there is a possible fix for it in the context of this current bug?

> parse-pdf: Garbage indexed when text-extraction not allowed
> ---
>
>  Key: NUTCH-290
>  URL: http://issues.apache.org/jira/browse/NUTCH-290
>  Project: Nutch
> Type: Bug

>   Components: indexer
> Versions: 0.8-dev
> Reporter: Stefan Neufeind
>  Attachments: NUTCH-290-canExtractContent.patch
>
> It seems that garbage (or undecoded text?) is indexed when text-extraction 
> for a PDF is not allowed.
> Example-PDF:
> http://www.task-switch.nl/Dutch/articles/Management_en_Architectuur_v3.pdf




[jira] Commented: (NUTCH-275) Fetcher not parsing XHTML-pages at all

2006-06-02 Thread Stefan Neufeind (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-275?page=comments#action_12414476 ] 

Stefan Neufeind commented on NUTCH-275:
---

Maybe XHTML is just something special in this case? In general I guess 
mime-magic is a good idea. But could it be extended to differentiate between 
xml and xhtml?

> Fetcher not parsing XHTML-pages at all
> --
>
>  Key: NUTCH-275
>  URL: http://issues.apache.org/jira/browse/NUTCH-275
>  Project: Nutch
> Type: Bug

> Versions: 0.8-dev
>  Environment: problem with nightly-2006-05-20; worked fine with same website 
> on 0.7.2
> Reporter: Stefan Neufeind

>
> The server reports the page as "text/html" - so I thought it would be 
> processed as html.
> But something, I guess, evaluated the headers of the document and re-labeled 
> it as "text/xml" (why not text/xhtml?).
> For some reason there is no plugin to be found for indexing text/xml (why 
> does TextParser not feel responsible?).
> Links inside this document are NOT indexed at all - so digging this website 
> actually stops here.
> Funny thing: for some magical reason the dtd-files referenced in the header 
> seem to be valid links for the fetcher and as such are indexed in the next 
> round (if the urlfilter allows).
> 060521 025018 fetching http://www.secreturl.something/
> 060521 025018 http.proxy.host = null
> 060521 025018 http.proxy.port = 8080
> 060521 025018 http.timeout = 1
> 060521 025018 http.content.limit = 65536
> 060521 025018 http.agent = NutchCVS/0.8-dev (Nutch; 
> http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
> 060521 025018 fetcher.server.delay = 1000
> 060521 025018 http.max.delays = 1000
> 060521 025018 ParserFactory:Plugin: org.apache.nutch.parse.text.TextParser 
> mapped to contentType text/xml via parse-plugins.xml, but
>  its plugin.xml file does not claim to support contentType: text/xml
> 060521 025018 ParserFactory: Plugin: org.apache.nutch.parse.rss.RSSParser 
> mapped to contentType text/xml via parse-plugins.xml, but 
> not enabled via plugin.includes in nutch-default.xml
> 060521 025019 Using Signature impl: org.apache.nutch.crawl.MD5Signature
> 060521 025019  map 0%  reduce 0%
> 060521 025019 1 pages, 0 errors, 1.0 pages/s, 40 kb/s, 
> 060521 025019 1 pages, 0 errors, 1.0 pages/s, 40 kb/s, 




[jira] Commented: (NUTCH-291) OpenSearchServlet should return "date" as well as "lastModified"

2006-06-02 Thread Stefan Neufeind (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-291?page=comments#action_12414466 ] 

Stefan Neufeind commented on NUTCH-291:
---

Which way is most favorable: to always set lastModified although it was not 
returned by the webserver (maybe unclean), or to always return date as well 
(cleaner?)?

> OpenSearchServlet should return "date" as well as "lastModified"
> 
>
>  Key: NUTCH-291
>  URL: http://issues.apache.org/jira/browse/NUTCH-291
>  Project: Nutch
> Type: Improvement

>   Components: web gui
> Versions: 0.8-dev
> Reporter: Stefan Neufeind
>  Attachments: NUTCH-291-unfinished.patch
>
> Currently lastModified is provided by OpenSearchServlet - but only in case 
> the lastModified-date is known.
> Since you can sort by "date" (which is lastModified or, if not present, the 
> fetch-date), it might be useful if OpenSearchServlet could provide "date" as 
> well.




[jira] Commented: (NUTCH-286) Handling common error-pages as 404

2006-06-02 Thread Stefan Neufeind (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-286?page=comments#action_12414464 ] 

Stefan Neufeind commented on NUTCH-286:
---

Well, we _could_ close it, though the question still remains for me. The 
problem, imho, is that you say it's hard to do.
For sure you could always write searches to prune those pages from the index - 
but I wonder if that's a clean solution, or if it would be better to have a way 
of excluding certain pages (like these common error-pages whose header is 
wrong). I guess it's the typical problem when crawling the web: technicians 
will say "that webserver/typo3 is wrong and is to be fixed" - but management 
will not care, and you will have to solve the problem one way or another.

> Handling common error-pages as 404
> --
>
>  Key: NUTCH-286
>  URL: http://issues.apache.org/jira/browse/NUTCH-286
>  Project: Nutch
> Type: Improvement

> Reporter: Stefan Neufeind

>
> Idea: Some pages from some software-packages/scripts report an "http 200 ok" 
> even though a specific page could not be found. An example I just found is:
> http://www.deteimmobilien.de/unternehmen/nbjmup;Uipnbt/IfsctuAefufjnnpcjmjfo/ef
> That's a typo3-page explaining in its standard layout and wording: "The 
> requested page did not exist or was inaccessible."
> So I had the idea that somebody might create a plugin that could find 
> commonly used formulations for "page does not exist" etc. and turn the page 
> into a 404 before feeding it into the nutch-index - although the server 
> responded with status 200 ok.
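
For illustration, a hypothetical sketch of such a heuristic (not an existing 
plugin; the phrase list is just an example):

  // Sketch: scan the parsed text for common soft-404 formulations and flag
  // the page so it can be treated like a 404 despite the HTTP 200.
  public class Soft404DetectorSketch {
      private static final String[] PHRASES = {
          "the requested page did not exist",
          "page not found",
          "die angeforderte seite wurde nicht gefunden"
      };

      public static boolean looksLikeErrorPage(String parseText) {
          String text = parseText.toLowerCase();
          for (int i = 0; i < PHRASES.length; i++) {
              if (text.indexOf(PHRASES[i]) >= 0) {
                  return true;
              }
          }
          return false;
      }
  }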




[jira] Commented: (NUTCH-282) Showing too few results on a page (Paging not correct)

2006-06-02 Thread Stefan Neufeind (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-282?page=comments#action_12414461 ] 

Stefan Neufeind commented on NUTCH-282:
---

Sorry for not getting back to this. Actually it had to do with per-site dedup. 
I had a page-navigation built on the total number of pages, and the first page 
I saw was already the "last" result-page. When moving to page 2 I got no 
results; when moving to a later page, I got exceptions. For me it was fixed 
simply by using the pagination correctly :-) and applying the fix from 
NUTCH-288 to not fetch results when out of bounds.

> Showing too few results on a page (Paging not correct)
> --
>
>  Key: NUTCH-282
>  URL: http://issues.apache.org/jira/browse/NUTCH-282
>  Project: Nutch
> Type: Bug

>   Components: web gui
> Versions: 0.8-dev
> Reporter: Stefan Neufeind

>
> I did a search and got back the value "itemsPerPage" from opensearch. But 
> the output shows "results 1-8" and I have a total of 46 search-results.
> The same happens for the webinterface.
> Why aren't "enough" results fetched?
> The problem might be somewhere in the area where Nutch should only display a 
> certain number of results per site.




[jira] Commented: (NUTCH-290) parse-pdf: Garbage indexed when text-extraction not allowed

2006-06-02 Thread Stefan Neufeind (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-290?page=comments#action_12414458 ] 

Stefan Neufeind commented on NUTCH-290:
---

But if one plugin fails in 0.8-dev, isn't the next one used? I understand that 
in the default config the text-parser would be used as the last-resort 
fallback.

Also, I'm not sure where the summary-text comes from if I use the patch above 
to prevent generating an exception and return empty parse-data instead.

> parse-pdf: Garbage indexed when text-extraction not allowed
> ---
>
>  Key: NUTCH-290
>  URL: http://issues.apache.org/jira/browse/NUTCH-290
>  Project: Nutch
> Type: Bug

>   Components: indexer
> Versions: 0.8-dev
> Reporter: Stefan Neufeind
>  Attachments: NUTCH-290-canExtractContent.patch
>
> It seems that garbage (or undecoded text?) is indexed when text-extraction 
> for a PDF is not allowed.
> Example-PDF:
> http://www.task-switch.nl/Dutch/articles/Management_en_Architectuur_v3.pdf




[jira] Updated: (NUTCH-292) OpenSearchServlet: OutOfMemoryError: Java heap space

2006-06-02 Thread Stefan Neufeind (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-292?page=all ]

Stefan Neufeind updated NUTCH-292:
--

Attachment: NUTCH-292-summarizer08.diff

As per demand, here is the patch.

Please note that it has not been thoroughly tested by myself. But the patch 
looks fine and makes sense :-) Oh, and it compiles cleanly ...

> OpenSearchServlet: OutOfMemoryError: Java heap space
> 
>
>  Key: NUTCH-292
>  URL: http://issues.apache.org/jira/browse/NUTCH-292
>  Project: Nutch
> Type: Bug

>   Components: web gui
> Versions: 0.8-dev
> Reporter: Stefan Neufeind
> Priority: Critical
>  Attachments: NUTCH-292-summarizer08.diff, summarizer.diff
>
> java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap space
>   
> org.apache.nutch.searcher.FetchedSegments.getSummary(FetchedSegments.java:203)
>   org.apache.nutch.searcher.NutchBean.getSummary(NutchBean.java:329)
>   
> org.apache.nutch.searcher.OpenSearchServlet.doGet(OpenSearchServlet.java:155)
>   javax.servlet.http.HttpServlet.service(HttpServlet.java:689)
>   javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
> The URL I use is:
> [...]something[...]/opensearch?query=mysearch&start=0&hitsPerSite=3&hitsPerPage=20&sort=url
> It seems to be a problem specific to the data I'm working with. Moving the 
> start from 0 to 10 or changing the query works fine.
> Or maybe it doesn't have to do with sorting but it's just that I hit one "bad 
> search-result" that has a broken summary?
> !! The problem is repeatable. So if anybody has an idea where to search / 
> what to fix, I can easily try that out !!




[jira] Created: (NUTCH-294) Topic-maps of related searchwords

2006-06-01 Thread Stefan Neufeind (JIRA)
Topic-maps of related searchwords
-

 Key: NUTCH-294
 URL: http://issues.apache.org/jira/browse/NUTCH-294
 Project: Nutch
Type: New Feature

  Components: searcher  
Reporter: Stefan Neufeind


Would it be possible to offer a user "topic-maps"? It's when you search for 
something and get topic-related words that might also be of interest to you. I 
wonder if that's somehow possible with the ngram-index for "did you mean" (see 
the separate feature-enhancement bug for this), but we'd need to have a 
relation between words (in what context do they occur).

For the webfrontend, trees are usually used - which for some users offer quite 
impressive eye-candy :-) E.g. see this advertisement by Novell where I've just 
seen a similar "topic-map" as well:
http://www.novell.com/de-de/company/advertising/defineyouropen.html




[jira] Commented: (NUTCH-292) OpenSearchServlet: OutOfMemoryError: Java heap space

2006-05-30 Thread Stefan Neufeind (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-292?page=comments#action_12413778 ] 

Stefan Neufeind commented on NUTCH-292:
---

That patch is for the 0.7-branch, right? In 0.8-dev you'd want to do that in 
BasicSummarizer.java. But to me it looks like something similar is already in 
place:

// Iterate through as long as we're before the end of
// the document and we haven't hit the max-number-of-items
// -in-a-summary.
//
while ((j < endToken) && (j - startToken < sumLength)) {

But I also suspect it might have something to do with tokens. What I've 
experienced is that several search-results currently contain arbitrary binary 
data. Those are the cases where a parser-plugin has "failed" and parse-text was 
used as a fallback. If I'm right, this might lead to quite large tokens, 
because no whitespace is found in a long run of characters.
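
If that hypothesis is right, one conceivable guard would be to cap token 
length before summarization (purely a sketch; these are not BasicSummarizer's 
actual names):

  // Sketch: bound the size of any single "token" so runs of binary data
  // without whitespace cannot blow up the summary buffers.
  public class TokenCapSketch {
      private static final int MAX_TOKEN_LEN = 256; // example cap, arbitrary

      public static String capToken(String token) {
          return token.length() <= MAX_TOKEN_LEN
              ? token
              : token.substring(0, MAX_TOKEN_LEN);
      }
  }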

@Marcel: Thank you for the fix anyway ... your help is very much appreciated.

> OpenSearchServlet: OutOfMemoryError: Java heap space
> 
>
>  Key: NUTCH-292
>  URL: http://issues.apache.org/jira/browse/NUTCH-292
>  Project: Nutch
> Type: Bug

>   Components: web gui
> Versions: 0.8-dev
> Reporter: Stefan Neufeind
> Priority: Critical
>  Attachments: summarizer.diff
>
> java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap space
>   
> org.apache.nutch.searcher.FetchedSegments.getSummary(FetchedSegments.java:203)
>   org.apache.nutch.searcher.NutchBean.getSummary(NutchBean.java:329)
>   
> org.apache.nutch.searcher.OpenSearchServlet.doGet(OpenSearchServlet.java:155)
>   javax.servlet.http.HttpServlet.service(HttpServlet.java:689)
>   javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
> The URL I use is:
> [...]something[...]/opensearch?query=mysearch&start=0&hitsPerSite=3&hitsPerPage=20&sort=url
> It seems to be a problem specific to the data I'm working with. Moving the 
> start from 0 to 10 or changing the query works fine.
> Or maybe it doesn't have to do with sorting but it's just that I hit one "bad 
> search-result" that has a broken summary?
> !! The problem is repeatable. So if anybody has an idea where to search / 
> what to fix, I can easily try that out !!




[jira] Commented: (NUTCH-290) parse-pdf: Garbage indexed when text-extraction not allowed

2006-05-30 Thread Stefan Neufeind (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-290?page=comments#action_12413780 ] 

Stefan Neufeind commented on NUTCH-290:
---

The plugin itself imho works fine now. It does not throw an exception anymore 
and, if allowed, outputs text correctly.
However, I still get the "garbage output" for a PDF. Could that be due to the 
fact that, in case no extraction is allowed (empty parse-text returned), the 
parser will still fall back to using the raw text for indexing?

What I did was delete crawl_parse and parse_* from the segments-directory, run 
"nutch parse" and reindex everything. However, the raw chars in the 
search-output (summary) remain. :-((

> parse-pdf: Garbage indexed when text-extraction not allowed
> ---
>
>  Key: NUTCH-290
>  URL: http://issues.apache.org/jira/browse/NUTCH-290
>  Project: Nutch
> Type: Bug

>   Components: indexer
> Versions: 0.8-dev
> Reporter: Stefan Neufeind
>  Attachments: NUTCH-290-canExtractContent.patch
>
> It seems that garbage (or undecoded text?) is indexed when text-extraction 
> for a PDF is not allowed.
> Example-PDF:
> http://www.task-switch.nl/Dutch/articles/Management_en_Architectuur_v3.pdf




[jira] Updated: (NUTCH-290) parse-pdf: Garbage indexed when text-extraction not allowed

2006-05-28 Thread Stefan Neufeind (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-290?page=all ]

Stefan Neufeind updated NUTCH-290:
--

Summary: parse-pdf: Garbage indexed when text-extraction not allowed  (was: 
parse-pdf: Garbage (?) indexed when text-extraction now allowed)

> parse-pdf: Garbage indexed when text-extraction not allowed
> ---
>
>  Key: NUTCH-290
>  URL: http://issues.apache.org/jira/browse/NUTCH-290
>  Project: Nutch
> Type: Bug

>   Components: indexer
> Versions: 0.8-dev
> Reporter: Stefan Neufeind
>  Attachments: NUTCH-290-canExtractContent.patch
>
> It seems that garbage (or undecoded text?) is indexed when text-extraction 
> for a PDF is not allowed.
> Example-PDF:
> http://www.task-switch.nl/Dutch/articles/Management_en_Architectuur_v3.pdf




[jira] Updated: (NUTCH-290) parse-pdf: Garbage (?) indexed when text-extraction now allowed

2006-05-28 Thread Stefan Neufeind (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-290?page=all ]

Stefan Neufeind updated NUTCH-290:
--

Attachment: NUTCH-290-canExtractContent.patch

This patch adds a check to first see if text-extraction is allowed - and only 
in that case tries to extract text (this prevents the above-mentioned 
exception and a parse-fail).

Note: The line

  ((PDStandardEncryption) encDict).setCanExtractContent(true);

is imho up for discussion. It only sets a bit on "encrypted" documents. Since 
I've read in several places that many people seem to set this to "false" for 
no good reason, I believe we don't really "break encryption" with this line - 
and as such should try to index as much data as possible.
Does anybody have "problems" with this line? If yes, maybe it could be a 
config-option that's false by default?
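
For discussion, a sketch of the check this patch effectively adds. Only 
setCanExtractContent(true) is taken from the patch itself; isEncrypted(), 
getEncryptionDictionary() and canExtractContent() are assumptions about the 
old org.pdfbox API and may not match its real method names:

  // Hedged fragment, not verbatim patch code:
  private boolean mayExtractText(PDDocument pdf) {
      if (!pdf.isEncrypted()) {               // assumed encryption check
          return true;                        // unencrypted: no restrictions
      }
      PDStandardEncryption encDict = (PDStandardEncryption)
          pdf.getDocument().getEncryptionDictionary(); // assumed accessor
      encDict.setCanExtractContent(true);     // the debatable bit-flip above
      return encDict.canExtractContent();     // assumed getter for the setter
  }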

> parse-pdf: Garbage (?) indexed when text-extraction now allowed
> ---
>
>  Key: NUTCH-290
>  URL: http://issues.apache.org/jira/browse/NUTCH-290
>  Project: Nutch
> Type: Bug

>   Components: indexer
> Versions: 0.8-dev
> Reporter: Stefan Neufeind
>  Attachments: NUTCH-290-canExtractContent.patch
>
> It seems that garbage (or undecoded text?) is indexed when text-extraction 
> for a PDF is not allowed.
> Example-PDF:
> http://www.task-switch.nl/Dutch/articles/Management_en_Architectuur_v3.pdf




[jira] Commented: (NUTCH-290) parse-pdf: Garbage (?) indexed when text-extraction now allowed

2006-05-28 Thread Stefan Neufeind (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-290?page=comments#action_12413637 ] 

Stefan Neufeind commented on NUTCH-290:
---

This one here fires in the PDF-parser:

} catch (Exception e) { // run time exception
  LOG.warning("General exception in PDF parser: " + e.getMessage());
  e.printStackTrace();
  return new ParseStatus(ParseStatus.FAILED,
      "Can't be handled as pdf document. " + e).getEmptyParse(getConf());
}

The exception is:

060522 001010 General exception in PDF parser: You do not have permission to 
extract text
java.io.IOException: You do not have permission to extract text
at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:189)
at org.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:140)
at org.apache.nutch.parse.pdf.PdfParser.getParse(PdfParser.java:120)
at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:77)
at 
org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:257)
at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:143)


Could it be that, maybe as a fallback, when the document can't be parsed and 
no "description" is returned, the document itself is used as the "description" 
in the search-output? If yes: in the case of binary files this seems to lead 
to problems.

> parse-pdf: Garbage (?) indexed when text-extraction now allowed
> ---
>
>  Key: NUTCH-290
>  URL: http://issues.apache.org/jira/browse/NUTCH-290
>  Project: Nutch
> Type: Bug

>   Components: indexer
> Versions: 0.8-dev
> Reporter: Stefan Neufeind

>
> It seems that garbage (or undecoded text?) is indexed when text-extraction 
> for a PDF is not allowed.
> Example-PDF:
> http://www.task-switch.nl/Dutch/articles/Management_en_Architectuur_v3.pdf




[jira] Created: (NUTCH-292) OpenSearchServlet: OutOfMemoryError: Java heap space

2006-05-28 Thread Stefan Neufeind (JIRA)
OpenSearchServlet: OutOfMemoryError: Java heap space


 Key: NUTCH-292
 URL: http://issues.apache.org/jira/browse/NUTCH-292
 Project: Nutch
Type: Bug

  Components: web gui  
Versions: 0.8-dev
Reporter: Stefan Neufeind
Priority: Critical


java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap space

org.apache.nutch.searcher.FetchedSegments.getSummary(FetchedSegments.java:203)
org.apache.nutch.searcher.NutchBean.getSummary(NutchBean.java:329)

org.apache.nutch.searcher.OpenSearchServlet.doGet(OpenSearchServlet.java:155)
javax.servlet.http.HttpServlet.service(HttpServlet.java:689)
javax.servlet.http.HttpServlet.service(HttpServlet.java:802)

The URL I use is:

[...]something[...]/opensearch?query=mysearch&start=0&hitsPerSite=3&hitsPerPage=20&sort=url

It seems to be a problem specific to the data I'm working with. Moving the 
start from 0 to 10 or changing the query works fine.
Or maybe it doesn't have to do with sorting, but it's just that I hit one "bad 
search-result" that has a broken summary?

!! The problem is repeatable. So if anybody has an idea where to search / what 
to fix, I can easily try that out !!




[jira] Updated: (NUTCH-291) OpenSearchServlet should return "date" as well as "lastModified"

2006-05-28 Thread Stefan Neufeind (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-291?page=all ]

Stefan Neufeind updated NUTCH-291:
--

Attachment: NUTCH-291-unfinished.patch

I tried implementing this in OpenSearchServlet.java (see patch). The idea for 
this patch is based on more.jsp. However, I receive:

java.lang.NumberFormatException: null
java.lang.Long.parseLong(Long.java:372)
java.lang.Long.<init>(Long.java:671)

org.apache.nutch.searcher.OpenSearchServlet.doGet(OpenSearchServlet.java:230)
javax.servlet.http.HttpServlet.service(HttpServlet.java:689)
javax.servlet.http.HttpServlet.service(HttpServlet.java:802)

I guess that has to do with date not being present here?!? I've tried hunting 
down the "problem", and it seems that in 
java/org/apache/nutch/searcher/IndexSearcher.java the field also needs to be 
provided. But I assume that the Lucene engine here correctly provides the 
date-field.

Maybe somebody could fix up my patch and then maybe commit it as well. I guess 
always knowing the date from the RSS-feed might be good.
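
For illustration, the missing null-guard could look like this (standalone 
sketch; parseDateField is made up, not OpenSearchServlet code):

  // Sketch: Long.parseLong(null) throws exactly the NumberFormatException
  // quoted above, so fall back to a default when no "date" field came back.
  public class DateFieldGuard {
      public static long parseDateField(String dateValue, long fallback) {
          if (dateValue == null || dateValue.length() == 0) {
              return fallback; // field absent in the hit details
          }
          return Long.parseLong(dateValue);
      }
  }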

> OpenSearchServlet should return "date" as well as "lastModified"
> 
>
>  Key: NUTCH-291
>  URL: http://issues.apache.org/jira/browse/NUTCH-291
>  Project: Nutch
> Type: Improvement

>   Components: web gui
> Versions: 0.8-dev
> Reporter: Stefan Neufeind
>  Attachments: NUTCH-291-unfinished.patch
>
> Currently lastModified is provided by OpenSearchServlet - but only in case 
> the lastModified-date is known.
> Since you can sort by "date" (which is lastModified or, if not present, the 
> fetch-date), it might be useful if OpenSearchServlet could provide "date" as 
> well.




[jira] Created: (NUTCH-291) OpenSearchServlet should return "date" as well as "lastModified"

2006-05-28 Thread Stefan Neufeind (JIRA)
OpenSearchServlet should return "date" as well as "lastModified"


 Key: NUTCH-291
 URL: http://issues.apache.org/jira/browse/NUTCH-291
 Project: Nutch
Type: Improvement

  Components: web gui  
Versions: 0.8-dev
Reporter: Stefan Neufeind


Currently lastModified is provided by OpenSearchServlet - but only in case the 
lastModified-date is known.

Since you can sort by "date" (which is lastModified or, if not present, the 
fetch-date), it might be useful if OpenSearchServlet could provide "date" as 
well.




[jira] Created: (NUTCH-290) parse-pdf: Garbage (?) indexed when text-extraction now allowed

2006-05-28 Thread Stefan Neufeind (JIRA)
parse-pdf: Garbage (?) indexed when text-extraction now allowed
---

 Key: NUTCH-290
 URL: http://issues.apache.org/jira/browse/NUTCH-290
 Project: Nutch
Type: Bug

  Components: indexer  
Versions: 0.8-dev
Reporter: Stefan Neufeind


It seems that garbage (or undecoded text?) is indexed when text-extraction for 
a PDF is not allowed.

Example-PDF:
http://www.task-switch.nl/Dutch/articles/Management_en_Architectuur_v3.pdf




[jira] Updated: (NUTCH-288) hitsPerSite-functionality "flawed": problems writing a page-navigation

2006-05-25 Thread Stefan Neufeind (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-288?page=all ]

Stefan Neufeind updated NUTCH-288:
--

Attachment: NUTCH-288-OpenSearch-fix.patch

This patch includes Doug's one-line fix to prevent an exception.
It also goes back page by page until you get to the last result-page. The 
start-value returned in the RSS-feed is correct afterwards(!). This easily 
allows you to check whether the requested result-start and the one received 
are identical - otherwise you are on the last page and were "redirected" - and 
you then know that you don't need to display any further pages in your 
page-navigation :-)

Applies and works fine for me.
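
For illustration, the client-side check this enables (names are made up):

  // Sketch: with the patch applied the feed reports the corrected start
  // value, so a mismatch with the requested start marks the last page.
  public class LastPageCheck {
      public static boolean onLastPage(int requestedStart, int returnedStart) {
          return returnedStart != requestedStart; // "redirected" => last page
      }
  }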

> hitsPerSite-functionality "flawed": problems writing a page-navigation
> --
>
>  Key: NUTCH-288
>  URL: http://issues.apache.org/jira/browse/NUTCH-288
>  Project: Nutch
> Type: Bug

>   Components: web gui
> Versions: 0.8-dev
> Reporter: Stefan Neufeind
>  Attachments: NUTCH-288-OpenSearch-fix.patch
>
> The deduplication-functionality on a per-site-basis (hitsPerSite = 3) leads 
> to problems when trying to offer a page-navigation (e.g. allow the user to 
> jump to page 10). This is because dedup is done after fetching.
> RSS shows a maximum number of 7763 documents (that is without dedup!), I set 
> it to display 10 items per page. My "naive" approach was to estimate I have 
> 7763/10 = 777 pages. But already when moving to page 3 I got no more 
> searchresults (I guess because of dedup). And when moving to page 10 I  got 
> an exception (see below).
> 2006-05-25 16:24:43 StandardWrapperValve[OpenSearch]: Servlet.service() for 
> servlet OpenSearch threw exception
> java.lang.NegativeArraySizeException
> at org.apache.nutch.searcher.Hits.getHits(Hits.java:65)
> at 
> org.apache.nutch.searcher.OpenSearchServlet.doGet(OpenSearchServlet.java:149)
> at javax.servlet.http.HttpServlet.service(HttpServlet.java:689)
> at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
> at 
> org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:252)
> at 
> org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:173)
> at 
> org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:214)
> at 
> org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104)
> at 
> org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520)
> at 
> org.apache.catalina.core.StandardContextValve.invokeInternal(StandardContextValve.java:198)
> at 
> org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:152)
> at 
> org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104)
> at 
> org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520)
> at 
> org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:137)
> at 
> org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104)
> at 
> org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:118)
> at 
> org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:102)
> at 
> org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520)
> at 
> org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
> at 
> org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104)
> at 
> org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520)
> at 
> org.apache.catalina.core.ContainerBase.invoke(ContainerBase.java:929)
> at 
> org.apache.coyote.tomcat5.CoyoteAdapter.service(CoyoteAdapter.java:160)
> at 
> org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:799)
> at 
> org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.processConnection(Http11Protocol.java:705)
> at 
> org.apache.tomcat.util.net.TcpWorkerThread.runIt(PoolTcpEndpoint.java:577)
> at 
> org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:684)
> at java.lang.Thread.run(Thread.java:595)
> Only workaround I see for the moment: Fetching RSS without duplication, dedup 
> myself and cache the RSS-result to improve performance. But a cleaner 
> solution would imho be nice. Is there a performant way of doing deduplication 
> and knowing for sure how many documents are available to view? For sure this 
> would mean to dedup all search-results first ...


[jira] Commented: (NUTCH-288) hitsPerSite-functionality "flawed": problems writing a page-navigation

2006-05-25 Thread Stefan Neufeind (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-288?page=comments#action_12413275 ] 

Stefan Neufeind commented on NUTCH-288:
---

How do they do that? Right, I'm transferred to page 16. But if I click on page 
14, that also seems to be the last page in order? Something looks strange 
there, too ...

And using Nutch: how should I know (using the RSS-feed) which page I am on? I'm 
getting the above exception - no reply, and no new "start"-value from which I 
could compute which page I'm actually on. Is a quickfix possible somehow?

> hitsPerSite-functionality "flawed": problems writing a page-navigation
> --
>
>  Key: NUTCH-288
>  URL: http://issues.apache.org/jira/browse/NUTCH-288
>  Project: Nutch
> Type: Bug

>   Components: web gui
> Versions: 0.8-dev
> Reporter: Stefan Neufeind

>
> The deduplication-functionality on a per-site-basis (hitsPerSite = 3) leads 
> to problems when trying to offer a page-navigation (e.g. allow the user to 
> jump to page 10). This is because dedup is done after fetching.
> RSS shows a maximum number of 7763 documents (that is without dedup!), I set 
> it to display 10 items per page. My "naive" approach was to estimate I have 
> 7763/10 = 777 pages. But already when moving to page 3 I got no more 
> searchresults (I guess because of dedup). And when moving to page 10 I  got 
> an exception (see below).
> 2006-05-25 16:24:43 StandardWrapperValve[OpenSearch]: Servlet.service() for 
> servlet OpenSearch threw exception
> java.lang.NegativeArraySizeException
> at org.apache.nutch.searcher.Hits.getHits(Hits.java:65)
> at 
> org.apache.nutch.searcher.OpenSearchServlet.doGet(OpenSearchServlet.java:149)
> at javax.servlet.http.HttpServlet.service(HttpServlet.java:689)
> at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
> at 
> org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:252)
> at 
> org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:173)
> at 
> org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:214)
> at 
> org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104)
> at 
> org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520)
> at 
> org.apache.catalina.core.StandardContextValve.invokeInternal(StandardContextValve.java:198)
> at 
> org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:152)
> at 
> org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104)
> at 
> org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520)
> at 
> org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:137)
> at 
> org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104)
> at 
> org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:118)
> at 
> org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:102)
> at 
> org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520)
> at 
> org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
> at 
> org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104)
> at 
> org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520)
> at 
> org.apache.catalina.core.ContainerBase.invoke(ContainerBase.java:929)
> at 
> org.apache.coyote.tomcat5.CoyoteAdapter.service(CoyoteAdapter.java:160)
> at 
> org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:799)
> at 
> org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.processConnection(Http11Protocol.java:705)
> at 
> org.apache.tomcat.util.net.TcpWorkerThread.runIt(PoolTcpEndpoint.java:577)
> at 
> org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:684)
> at java.lang.Thread.run(Thread.java:595)
> Only workaround I see for the moment: Fetching RSS without duplication, dedup 
> myself and cache the RSS-result to improve performance. But a cleaner 
> solution would imho be nice. Is there a performant way of doing deduplication 
> and knowing for sure how many documents are available to view? For sure this 
> would mean to dedup all search-results first ...




[jira] Updated: (NUTCH-110) OpenSearchServlet outputs illegal xml characters

2006-05-25 Thread Stefan Neufeind (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-110?page=all ]

Stefan Neufeind updated NUTCH-110:
--

Attachment: fixIllegalXmlChars08.patch

Since the original patch didn't apply cleanly for me on 0.8-dev 
(nightly-2006-05-20), I re-did it for 0.8 ...

With this patch the XML is fine. Without it, I had big trouble parsing the 
RSS-feed in another application.
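
For illustration, the kind of filtering such a patch performs (a standalone 
sketch, not the patch itself), keeping only characters legal in XML 1.0:

  // Sketch: drop characters outside the Char production of XML 1.0
  // (http://www.w3.org/TR/2000/REC-xml-20001006#NT-Char). Surrogate pairs
  // are dropped too, which is acceptable for this illustration.
  public class XmlCharFilter {
      public static String stripIllegalXmlChars(String s) {
          StringBuffer out = new StringBuffer(s.length());
          for (int i = 0; i < s.length(); i++) {
              char c = s.charAt(i);
              if (c == 0x9 || c == 0xA || c == 0xD
                  || (c >= 0x20 && c <= 0xD7FF)
                  || (c >= 0xE000 && c <= 0xFFFD)) {
                  out.append(c);
              } // e.g. the form-feed (0x0C) from the report is dropped here
          }
          return out.toString();
      }
  }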

> OpenSearchServlet outputs illegal xml characters
> 
>
>  Key: NUTCH-110
>  URL: http://issues.apache.org/jira/browse/NUTCH-110
>  Project: Nutch
> Type: Bug

>   Components: searcher
> Versions: 0.7
>  Environment: linux, jdk 1.5
> Reporter: [EMAIL PROTECTED]
>  Attachments: NUTCH-110-version2.patch, fixIllegalXmlChars.patch, 
> fixIllegalXmlChars08.patch
>
> OpenSearchServlet does not check text-to-output for illegal xml characters; 
> depending on the search result, it's possible for OSS to output xml that is 
> not well-formed.  For example, if the text has the FF character in it -- 
> i.e. the ascii character at position (decimal) 12 -- the produced XML will 
> show the FF character as '&#12;'. The character/entity '&#12;' is not legal 
> in XML according to http://www.w3.org/TR/2000/REC-xml-20001006#NT-Char.
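For illustration, a minimal sketch of such a check (not the attached patch; it 
follows the Char production linked above and ignores supplementary characters 
for brevity):

public class XmlCharFilter {
  // Keep only characters legal in XML 1.0 (the Char production);
  // treats the string as UTF-16 units, ignoring supplementary planes.
  public static String stripIllegalXmlChars(String s) {
    StringBuffer out = new StringBuffer(s.length());
    for (int i = 0; i < s.length(); i++) {
      char c = s.charAt(i);
      boolean legal = c == 0x9 || c == 0xA || c == 0xD
          || (c >= 0x20 && c <= 0xD7FF)
          || (c >= 0xE000 && c <= 0xFFFD);
      if (legal) out.append(c);
    }
    return out.toString();
  }
}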

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Created: (NUTCH-288) hitsPerSite-functionality "flawed": problems writing a page-navigation

2006-05-25 Thread Stefan Neufeind (JIRA)
hitsPerSite-functionality "flawed": problems writing a page-navigation
--

 Key: NUTCH-288
 URL: http://issues.apache.org/jira/browse/NUTCH-288
 Project: Nutch
Type: Bug

  Components: web gui  
Versions: 0.8-dev
Reporter: Stefan Neufeind


The deduplication-functionality on a per-site-basis (hitsPerSite = 3) leads to 
problems when trying to offer a page-navigation (e.g. allow the user to jump to 
page 10). This is because dedup is done after fetching.

RSS reports a maximum of 7763 documents (that is without dedup!), and I set it 
to display 10 items per page. My "naive" approach was to estimate that I have 
7763/10 = 777 pages. But already when moving to page 3 I got no more 
search results (I guess because of dedup). And when moving to page 10 I got an 
exception (see below).

2006-05-25 16:24:43 StandardWrapperValve[OpenSearch]: Servlet.service() for 
servlet OpenSearch threw exception
java.lang.NegativeArraySizeException
at org.apache.nutch.searcher.Hits.getHits(Hits.java:65)
at 
org.apache.nutch.searcher.OpenSearchServlet.doGet(OpenSearchServlet.java:149)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:689)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
at 
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:252)
at 
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:173)
at 
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:214)
at 
org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104)
at 
org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520)
at 
org.apache.catalina.core.StandardContextValve.invokeInternal(StandardContextValve.java:198)
at 
org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:152)
at 
org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104)
at 
org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520)
at 
org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:137)
at 
org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104)
at 
org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:118)
at 
org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:102)
at 
org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520)
at 
org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
at 
org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104)
at 
org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520)
at org.apache.catalina.core.ContainerBase.invoke(ContainerBase.java:929)
at 
org.apache.coyote.tomcat5.CoyoteAdapter.service(CoyoteAdapter.java:160)
at 
org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:799)
at 
org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.processConnection(Http11Protocol.java:705)
at 
org.apache.tomcat.util.net.TcpWorkerThread.runIt(PoolTcpEndpoint.java:577)
at 
org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:684)
at java.lang.Thread.run(Thread.java:595)

Only workaround I see for the moment: fetching RSS without deduplication, 
deduping it myself and caching the RSS result to improve performance. But a 
cleaner solution would imho be nice. Is there a performant way of doing 
deduplication while knowing for sure how many documents are available to view? 
For sure this would mean deduping all search results first ...
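For illustration, a hedged sketch of a defensive page window on the caller's 
side - the Hits accessors follow the 0.8 searcher API seen in the trace above, 
but treat the names as assumptions:

// Clamp the requested slice to the hits actually available after
// per-site dedup, instead of trusting the pre-dedup total from the
// RSS output.
public static Hit[] pageOf(Hits hits, int page, int hitsPerPage) {
  int available = hits.getLength();                  // hits actually present
  int from = Math.min(page * hitsPerPage, available);
  int count = Math.max(0, Math.min(hitsPerPage, available - from));
  return hits.getHits(from, count);                  // never a negative length
}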

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Created: (NUTCH-287) Exception when searching with sort

2006-05-25 Thread Stefan Neufeind (JIRA)
Exception when searching with sort
--

 Key: NUTCH-287
 URL: http://issues.apache.org/jira/browse/NUTCH-287
 Project: Nutch
Type: Bug

  Components: searcher  
Versions: 0.8-dev
Reporter: Stefan Neufeind
Priority: Critical


Running a search with &sort=url works.
But when using &sort=title I get the following exception.

2006-05-25 14:04:25 StandardWrapperValve[jsp]: Servlet.service() for servlet 
jsp threw exception
java.lang.RuntimeException: Unknown sort value type!
at 
org.apache.nutch.searcher.IndexSearcher.translateHits(IndexSearcher.java:157)
at org.apache.nutch.searcher.IndexSearcher.search(IndexSearcher.java:95)
at org.apache.nutch.searcher.NutchBean.search(NutchBean.java:239)
at org.apache.jsp.search_jsp._jspService(search_jsp.java:257)
at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:94)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
at 
org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:324)
at 
org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:292)
at org.apache.jasper.servlet.JspServlet.service(JspServlet.java:236)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
at 
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:252)
at 
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:173)
at 
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:214)
at 
org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104)
at 
org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520)
at 
org.apache.catalina.core.StandardContextValve.invokeInternal(StandardContextValve.java:198)
at 
org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:152)
at 
org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104)
at 
org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520)
at 
org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:137)
at 
org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104)
at 
org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:118)
at 
org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:102)
at 
org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520)
at 
org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
at 
org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104)
at 
org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520)
at org.apache.catalina.core.ContainerBase.invoke(ContainerBase.java:929)
at 
org.apache.coyote.tomcat5.CoyoteAdapter.service(CoyoteAdapter.java:160)
at 
org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:799)
at 
org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.processConnection(Http11Protocol.java:705)
at 
org.apache.tomcat.util.net.TcpWorkerThread.runIt(PoolTcpEndpoint.java:577)
at 
org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:684)
at java.lang.Thread.run(Thread.java:595)

What is at those lines (IndexSearcher.translateHits) is:

  WritableComparable sortValue;   // convert value to writable
  if (sortField == null) {
    sortValue = new FloatWritable(scoreDocs[i].score);
  } else {
    Object raw = ((FieldDoc)scoreDocs[i]).fields[0];
    if (raw instanceof Integer) {
      sortValue = new IntWritable(((Integer)raw).intValue());
    } else if (raw instanceof Float) {
      sortValue = new FloatWritable(((Float)raw).floatValue());
    } else if (raw instanceof String) {
      sortValue = new UTF8((String)raw);
    } else {
      throw new RuntimeException("Unknown sort value type!");
    }
  }


So I thought that maybe raw is an instance of something "strange" and tried 
raw.getClass().getName() and also raw.toString() to track down the cause - but 
that always resulted in a NullPointerException. So it seems raw is null for 
some strange reason.

When I try with "title2" (or something non-existing) I get a different error, 
that title2 is unknown / not indexed. So I suspect that title should be fine 
here ...
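If raw can really come back null, a defensive variant of the branch quoted 
above might look like this (just a sketch of the idea, not a tested fix):

Object raw = ((FieldDoc)scoreDocs[i]).fields[0];
if (raw == null) {
  // sort field not stored for this document - fall back instead of
  // hitting a RuntimeException (or an NPE while inspecting raw)
  sortValue = new UTF8("");
} else if (raw instanceof Integer) {
  sortValue = new IntWritable(((Integer)raw).intValue());
} else if (raw instanceof Float) {
  sortValue = new FloatWritable(((Float)raw).floatValue());
} else if (raw instanceof String) {
  sortValue = new UTF8((String)raw);
} else {
  throw new RuntimeException("Unknown sort value type: " + raw.getClass());
}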

If there is any information I can help out with, let me know.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira

[jira] Commented: (NUTCH-284) NullPointerException during index

2006-05-25 Thread Stefan Neufeind (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-284?page=comments#action_12413240 ] 

Stefan Neufeind commented on NUTCH-284:
---

Yes, I was missing index-basic. My apologies. I needed the extra fields of 
index-more and thought it would handle the basic fields as well.
The same thing occurred in NUTCH-51.

Would it be possible to demand that index-basic is loaded (similar to the 
"well, you need a scoring-plugin" check)? If somebody writes his own 
index-basic2 plugin, he'd have to be able to put a "provides index-basic" 
declaration into his plugin to signal that he indexes the basic fields. Maybe 
something like this could save people like me some trouble / searching :-)
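As a rough illustration of that check - the plugin.includes key is real, the 
rest is made up, not the Nutch plugin API:

// Hypothetical startup check in the spirit of the existing
// "you need a scoring-plugin" errors; conf is a Nutch/Hadoop
// configuration object, everything else is illustrative.
String includes = conf.get("plugin.includes", "");
if (includes.indexOf("index-basic") < 0) {
  throw new RuntimeException("No plugin providing index-basic is enabled; "
      + "add index-basic (or a replacement declaring \"provides "
      + "index-basic\") to plugin.includes");
}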

> NullPointerException during index
> -
>
>  Key: NUTCH-284
>  URL: http://issues.apache.org/jira/browse/NUTCH-284
>  Project: Nutch
> Type: Bug

>   Components: indexer
> Versions: 0.8-dev
> Reporter: Stefan Neufeind

>
> For quite a while this "reduce > sort" has been going on. Then it fails. What 
> could be wrong with this?
> 060524 212613 reduce > sort
> 060524 212614 reduce > sort
> 060524 212615 reduce > sort
> 060524 212615 found resource common-terms.utf8 at 
> file:/home/mm/nutch-nightly-prod/conf/common-terms.utf8
> 060524 212615 found resource common-terms.utf8 at 
> file:/home/mm/nutch-nightly-prod/conf/common-terms.utf8
> 060524 212619 Optimizing index.
> 060524 212619 job_jlbhhm
> java.lang.NullPointerException
> at 
> org.apache.nutch.indexer.Indexer$OutputFormat$1.write(Indexer.java:111)
> at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:269)
> at org.apache.nutch.indexer.Indexer.reduce(Indexer.java:253)
> at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:282)
> at 
> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:114)
> Exception in thread "main" java.io.IOException: Job failed!
> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:341)
> at org.apache.nutch.indexer.Indexer.index(Indexer.java:287)
> at org.apache.nutch.indexer.Indexer.main(Indexer.java:304)

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Created: (NUTCH-286) Handling common error-pages as 404

2006-05-24 Thread Stefan Neufeind (JIRA)
Handling common error-pages as 404
--

 Key: NUTCH-286
 URL: http://issues.apache.org/jira/browse/NUTCH-286
 Project: Nutch
Type: Improvement

Reporter: Stefan Neufeind


Idea: some pages from some software packages/scripts report an "HTTP 200 OK" 
even though the specific page could not be found. An example I just found is:
http://www.deteimmobilien.de/unternehmen/nbjmup;Uipnbt/IfsctuAefufjnnpcjmjfo/ef
That's a TYPO3 page explaining in its standard layout and wording: "The 
requested page did not exist or was inaccessible."

So I had the idea that somebody might create a plugin that finds commonly used 
formulations for "page does not exist" etc. and turns such pages into 404s 
before they are fed into the Nutch index - even though the server responded 
with status 200 OK.
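A rough sketch of such a heuristic (class name and phrase list purely 
illustrative; a real plugin would need per-language tuning):

public class ErrorPageHeuristic {
  // Commonly used "page does not exist" wordings (illustrative).
  private static final String[] NOT_FOUND_PHRASES = {
    "the requested page did not exist",
    "page not found",
    "seite nicht gefunden"
  };

  // Returns true if the parsed text looks like a soft-404 page that
  // should be treated as a 404 despite the HTTP 200 status.
  public static boolean looksLikeErrorPage(String parsedText) {
    String t = parsedText.toLowerCase();
    for (int i = 0; i < NOT_FOUND_PHRASES.length; i++) {
      if (t.indexOf(NOT_FOUND_PHRASES[i]) >= 0) return true;
    }
    return false;
  }
}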


-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Created: (NUTCH-284) NullPointerException during index

2006-05-24 Thread Stefan Neufeind (JIRA)
NullPointerException during index
-

 Key: NUTCH-284
 URL: http://issues.apache.org/jira/browse/NUTCH-284
 Project: Nutch
Type: Bug

  Components: indexer  
Versions: 0.8-dev
Reporter: Stefan Neufeind


For quite a while this "reduce > sort" has been going on. Then it fails. What 
could be wrong with this?


060524 212613 reduce > sort
060524 212614 reduce > sort
060524 212615 reduce > sort
060524 212615 found resource common-terms.utf8 at 
file:/home/mm/nutch-nightly-prod/conf/common-terms.utf8
060524 212615 found resource common-terms.utf8 at 
file:/home/mm/nutch-nightly-prod/conf/common-terms.utf8
060524 212619 Optimizing index.
060524 212619 job_jlbhhm
java.lang.NullPointerException
at 
org.apache.nutch.indexer.Indexer$OutputFormat$1.write(Indexer.java:111)
at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:269)
at org.apache.nutch.indexer.Indexer.reduce(Indexer.java:253)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:282)
at 
org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:114)
Exception in thread "main" java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:341)
at org.apache.nutch.indexer.Indexer.index(Indexer.java:287)
at org.apache.nutch.indexer.Indexer.main(Indexer.java:304)


-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Commented: (NUTCH-70) duplicate pages - virtual hosts in db.

2006-05-24 Thread Stefan Neufeind (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-70?page=comments#action_12413169 ] 

Stefan Neufeind commented on NUTCH-70:
--

Is the content exactly the same? Maybe the page could be checked against an 
already existing one by an MD5 on the content? But I'm not sure there is a 
clean way to work around the problem - what if all pages are the same except 
one on the other vhost? You'd have to crawl them all anyway, wouldn't you?
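For illustration, a self-contained sketch of the MD5 idea (not Nutch code; the 
in-memory Set stands in for whatever store a real dedup would use):

import java.security.MessageDigest;
import java.util.HashSet;
import java.util.Set;

public class ContentDedup {
  private final Set seen = new HashSet();

  // Returns true if content with the same MD5 digest was seen before,
  // e.g. the same page served under a different virtual host.
  public boolean isDuplicate(byte[] content) throws Exception {
    byte[] d = MessageDigest.getInstance("MD5").digest(content);
    StringBuffer hex = new StringBuffer();
    for (int i = 0; i < d.length; i++)
      hex.append(Integer.toHexString((d[i] & 0xff) | 0x100).substring(1));
    return !seen.add(hex.toString());   // add() is false on a repeat
  }
}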

> duplicate pages - virtual hosts in db.
> --
>
>  Key: NUTCH-70
>  URL: http://issues.apache.org/jira/browse/NUTCH-70
>  Project: Nutch
> Type: Bug

>  Environment: 0,7 dev
> Reporter: YourSoft

>
> Dear Developers,
> I have a problem with nutch:
> - There are many sites duplicates in the webdb and in the segments.
> The source of this problem is:
> - If the site make 'virtual hosts' (like Apache), e.g. www.origo.hu, 
> origo.hu, origo.matav.hu, origo.matavnet.hu etc.: the result pages are the 
> same, only the inlinks are differents.
> - The ip address is the same.
> - When search, all virtualhosts are in the results.
> Google only show one of these virtual hosts, the nutch show all. The result 
> nutch db is larger, and this case slower, than google.
> Have any idea, how to remove these duplicates?
> Regards,
> Ferenc

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Commented: (NUTCH-44) too many search results

2006-05-24 Thread Stefan Neufeind (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-44?page=comments#action_12413155 ] 

Stefan Neufeind commented on NUTCH-44:
--

hi,
any progress on this?

> too many search results
> ---
>
>  Key: NUTCH-44
>  URL: http://issues.apache.org/jira/browse/NUTCH-44
>  Project: Nutch
> Type: Bug

>   Components: web gui
>  Environment: web environment
> Reporter: Emilijan Mirceski

>
> There should be a limitation (user defined) on the number of results the 
> search engine can return. 
> For example, if one modifies the search url as:
> http:///search.jsp?query=&hitsPerPage=2&hitsPerSite=0
> The search will try to return 20,000 pages which isn't good for the server 
> side performance. 
> Is it possible to have a setting in the config xml files to control this?
> Thanks,
> Emilijan

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Created: (NUTCH-282) Showing too few results on a page (Paging not correct)

2006-05-23 Thread Stefan Neufeind (JIRA)
Showing too few results on a page (Paging not correct)
--

 Key: NUTCH-282
 URL: http://issues.apache.org/jira/browse/NUTCH-282
 Project: Nutch
Type: Bug

  Components: web gui  
Versions: 0.8-dev
Reporter: Stefan Neufeind


I did a search and got back the value "itemsPerPage" from OpenSearch. But the 
output shows "results 1-8" although I have a total of 46 search results.
The same happens for the web interface.

Why aren't "enough" results fetched?

The problem might be somewhere in the area where Nutch should only display a 
certain number of pages per site.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Updated: (NUTCH-281) cached.jsp: base-href needs to be outside comments

2006-05-23 Thread Stefan Neufeind (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-281?page=all ]

Stefan Neufeind updated NUTCH-281:
--

Component: web gui
 Priority: Trivial  (was: Major)

> cached.jsp: base-href needs to be outside comments
> --
>
>  Key: NUTCH-281
>  URL: http://issues.apache.org/jira/browse/NUTCH-281
>  Project: Nutch
> Type: Bug

>   Components: web gui
> Reporter: Stefan Neufeind
> Priority: Trivial

>
> see cached.jsp: the <base href=...> tag
> does not take effect when showing a cached page because of the comments 
> around it

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Created: (NUTCH-281) cached.jsp: base-href needs to be outside comments

2006-05-23 Thread Stefan Neufeind (JIRA)
cached.jsp: base-href needs to be outside comments
--

 Key: NUTCH-281
 URL: http://issues.apache.org/jira/browse/NUTCH-281
 Project: Nutch
Type: Bug

  Components: web gui  
Reporter: Stefan Neufeind


see cached.jsp: the <base href=...> tag
does not take effect when showing a cached page because of the comments around 
it

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Commented: (NUTCH-255) Regular Expression for RegexUrlNormalizer to remove jsessionid

2006-05-22 Thread Stefan Neufeind (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-255?page=comments#action_12412777 ] 

Stefan Neufeind commented on NUTCH-255:
---

You might want to have a / right after the .com in the example - but that's 
not too important here :-)
You can also omit the (.*) at the beginning/end of the expression, as they're 
not needed for this task.

NUTCH-279 includes a modified version of your patch.
PS: Thanks for the contribution.

> Regular Expression for RegexUrlNormalizer to remove jsessionid
> --
>
>  Key: NUTCH-255
>  URL: http://issues.apache.org/jira/browse/NUTCH-255
>  Project: Nutch
> Type: Improvement

>   Components: fetcher
> Versions: 0.8-dev
>  Environment: Windows XP Media Center 2005, 2 Gigs RAM, 3.0 Ghz Pentium 4 
> Hyperthreaded, Eclipse 3.2.0
> Reporter: Dennis Kubes
> Priority: Trivial
>  Attachments: urlnormalize_jessionid.patch
>
> Some URLs are filtered out by the crawl url filter for special characters (by 
> default).  One of these is the jsessionid urls such as:
> http://www.somesite.com;jsessionid=A8D7D812B5EFD3099F099A760F779E3B?query=string
> We want to get rid of the jsessionid and keep everything else so that it looks 
> like this:
> http://www.somesite.com?query=string
> Below is a regular expression for the regex-normalize.xml file used by the 
> RegexUrlNormalizer that successfully removes jsessionid strings while leaving 
> the hostname and querystring.  I have also attached a patch for the 
> regex-normalize.xml.template file that adds the following expression.
> <regex>
>   <pattern>(.*)(;jsessionid=[a-zA-Z0-9]{32})(.*)</pattern>
>   <substitution>$1$3</substitution>
> </regex>
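A quick plain-Java check of the rule above (equivalent to what the 
RegexUrlNormalizer applies; the / after .com is added as suggested in the 
comment):

public class JsessionidTest {
  public static void main(String[] args) {
    String url = "http://www.somesite.com/"
        + ";jsessionid=A8D7D812B5EFD3099F099A760F779E3B?query=string";
    // same effect as the <regex> rule; the surrounding (.*) groups
    // are not needed when using replaceAll
    System.out.println(url.replaceAll(";jsessionid=[a-zA-Z0-9]{32}", ""));
    // prints: http://www.somesite.com/?query=string
  }
}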

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Updated: (NUTCH-279) Additions for regex-normalize

2006-05-22 Thread Stefan Neufeind (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-279?page=all ]

Stefan Neufeind updated NUTCH-279:
--

Attachment: regex-normalize.patch

1) Incorporates jsessionid-normalization from NUTCH-255
2) Adds further normalizations
3) Adds a commandline-checker. Start with:
bin/nutch org.apache.nutch.net.RegexUrlNormalizerChecker

> Additions for regex-normalize
> -
>
>  Key: NUTCH-279
>  URL: http://issues.apache.org/jira/browse/NUTCH-279
>  Project: Nutch
> Type: Improvement

> Versions: 0.8-dev
> Reporter: Stefan Neufeind
>  Attachments: regex-normalize.patch
>
> Imho needed:
> 1) Extend normalize-rules to commonly used session-id's etc.
> 2) Ship a checker to check rules easily by hand

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Created: (NUTCH-279) Additions for regex-normalize

2006-05-22 Thread Stefan Neufeind (JIRA)
Additions for regex-normalize
-

 Key: NUTCH-279
 URL: http://issues.apache.org/jira/browse/NUTCH-279
 Project: Nutch
Type: Improvement

Versions: 0.8-dev
Reporter: Stefan Neufeind


Imho needed:
1) Extend normalize-rules to commonly used session-id's etc.
2) Ship a checker to check rules easily by hand

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Created: (NUTCH-278) Fetcher-status might need clarification: kbit/s instead of kb/s shown

2006-05-21 Thread Stefan Neufeind (JIRA)
Fetcher-status might need clarification: kbit/s instead of kb/s shown
-

 Key: NUTCH-278
 URL: http://issues.apache.org/jira/browse/NUTCH-278
 Project: Nutch
Type: Improvement

  Components: fetcher  
Versions: 0.8-dev
Reporter: Stefan Neufeind
Priority: Trivial


In Fetcher.java, method reportStatus(), there is

+ Math.round(((((float)bytes)*8)/1024)/elapsed)+" kb/s, ";

Isn't that a bit misleading? A user reading the status might assume 
"kilobytes" (kb), whereas "kbit/s" would be clearer in this case.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Commented: (NUTCH-277) Fetcher dies because of "max. redirects" (avoiding infinite loop)

2006-05-21 Thread Stefan Neufeind (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-277?page=comments#action_12412706 ] 

Stefan Neufeind commented on NUTCH-277:
---

The problem was reproducible with the URL set we had here.

After moving from protocol-httpclient to protocol-http the problem is gone and 
crawling is fine. Could there be a problem in the httpclient interface, maybe 
with redirects?

PS: Too bad we're missing https support for now - but it works for the moment 
...
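If the redirect handling in protocol-httpclient is indeed the culprit, one 
knob worth checking is commons-httpclient's per-client redirect limit. A 
sketch against the 3.x API (whether the plugin exposes this setting is an 
assumption):

import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.params.HttpClientParams;

public class RedirectCap {
  public static void main(String[] args) {
    HttpClient client = new HttpClient();
    // fail fast on broken redirect chains instead of walking
    // all the way to the default limit of 100
    client.getParams().setParameter(HttpClientParams.MAX_REDIRECTS,
        new Integer(10));
  }
}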

> Fetcher dies because of "max. redirects" (avoiding infinite loop)
> -
>
>  Key: NUTCH-277
>  URL: http://issues.apache.org/jira/browse/NUTCH-277
>  Project: Nutch
> Type: Bug

>   Components: fetcher
> Versions: 0.8-dev
>  Environment: nightly-2006-05-20
> Reporter: Stefan Neufeind
> Priority: Critical

>
> Error in the logs is:
> 060521 213401 SEVERE Narrowly avoided an infinite loop in execute
> org.apache.commons.httpclient.RedirectException: Maximum redirects (100) 
> exceeded
> at 
> org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:183)
> at 
> org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:396)
> at 
> org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:324)
> at 
> org.apache.nutch.protocol.httpclient.HttpResponse.(HttpResponse.java:87)
> at org.apache.nutch.protocol.httpclient.Http.getResponse(Http.java:97)
> at 
> org.apache.nutch.protocol.http.api.RobotRulesParser.isAllowed(RobotRulesParser.java:394)
> at 
> org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:173)
> at 
> org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:135)
> This happens during normal crawling. Unfortunately I don't know how to 
> further track this down. But it's problematic, since it actually makes the 
> fetcher die.
> Workaround (for the symptom) is in NUTCH-258 (avoid dying on a SEVERE 
> log entry). That works for me, crawling works fine and it does not 
> hang/crash. However this works around the problem rather than solving it - I 
> know. But it helps for the moment ...
> Hope somebody can help - this looks quite important to track down to me.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Commented: (NUTCH-258) Once Nutch logs a SEVERE log item, Nutch fails forevermore

2006-05-21 Thread Stefan Neufeind (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-258?page=comments#action_12412705 ] 

Stefan Neufeind commented on NUTCH-258:
---

Beware of simply silencing the error! It helped me in one place - but in 
another it really let an infinite loop run forever.

> Once Nutch logs a SEVERE log item, Nutch fails forevermore
> --
>
>  Key: NUTCH-258
>  URL: http://issues.apache.org/jira/browse/NUTCH-258
>  Project: Nutch
> Type: Bug

>   Components: fetcher
> Versions: 0.8-dev
>  Environment: All
> Reporter: Scott Ganyo
> Priority: Critical
>  Attachments: dumbfix.patch
>
> Once a SEVERE log item is written, Nutch shuts down any fetching forevermore. 
>  This is from the run() method in Fetcher.java:
> public void run() {
>   synchronized (Fetcher.this) {activeThreads++;} // count threads
>
>   try {
>     UTF8 key = new UTF8();
>     CrawlDatum datum = new CrawlDatum();
>
>     while (true) {
>       if (LogFormatter.hasLoggedSevere())  // something bad happened
>         break;                             // exit
>
> Notice the last 2 lines.  This will prevent Nutch from ever fetching again 
> once this is hit, as LogFormatter stores this data as a static.
> (Also note that "LogFormatter.hasLoggedSevere()" is also checked in 
> org.apache.nutch.net.URLFilterChecker and will disable this class as well.)
> This must be fixed or Nutch cannot be run as any kind of long-running 
> service.  Furthermore, I believe it is a poor decision to rely on a logging 
> event to determine the state of the application - this could have any number 
> of side-effects that would be extremely difficult to track down.  (As it has 
> already for me.)
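A sketch of the alternative argued for above: replace the static logging check 
in the quoted run() loop with explicit per-fetcher state (field name 
illustrative, not a proposed patch):

private volatile boolean fatalError = false;   // per-fetcher, resettable

public void run() {
  synchronized (Fetcher.this) {activeThreads++;} // count threads
  try {
    UTF8 key = new UTF8();
    CrawlDatum datum = new CrawlDatum();

    while (true) {
      if (fatalError)   // scoped to this run, not a process-wide static
        break;          // exit
      // ... fetch the next entry ...
    }
  } finally {
    synchronized (Fetcher.this) {activeThreads--;}
  }
}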

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Updated: (NUTCH-258) Once Nutch logs a SEVERE log item, Nutch fails forevermore

2006-05-21 Thread Stefan Neufeind (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-258?page=all ]

Stefan Neufeind updated NUTCH-258:
--

Attachment: dumbfix.patch

I know this is a dumb fix :-) But it solves the problem for the moment ...

> Once Nutch logs a SEVERE log item, Nutch fails forevermore
> --
>
>  Key: NUTCH-258
>  URL: http://issues.apache.org/jira/browse/NUTCH-258
>  Project: Nutch
> Type: Bug

>   Components: fetcher
> Versions: 0.8-dev
>  Environment: All
> Reporter: Scott Ganyo
> Priority: Critical
>  Attachments: dumbfix.patch
>
> Once a SEVERE log item is written, Nutch shuts down any fetching forevermore. 
>  This is from the run() method in Fetcher.java:
> public void run() {
>   synchronized (Fetcher.this) {activeThreads++;} // count threads
>
>   try {
>     UTF8 key = new UTF8();
>     CrawlDatum datum = new CrawlDatum();
>
>     while (true) {
>       if (LogFormatter.hasLoggedSevere())  // something bad happened
>         break;                             // exit
>
> Notice the last 2 lines.  This will prevent Nutch from ever fetching again 
> once this is hit, as LogFormatter stores this data as a static.
> (Also note that "LogFormatter.hasLoggedSevere()" is also checked in 
> org.apache.nutch.net.URLFilterChecker and will disable this class as well.)
> This must be fixed or Nutch cannot be run as any kind of long-running 
> service.  Furthermore, I believe it is a poor decision to rely on a logging 
> event to determine the state of the application - this could have any number 
> of side-effects that would be extremely difficult to track down.  (As it has 
> already for me.)

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Updated: (NUTCH-277) Fetcher dies because of "max. redirects" (avoiding infinite loop)

2006-05-21 Thread Stefan Neufeind (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-277?page=all ]

Stefan Neufeind updated NUTCH-277:
--

Component: fetcher
  Version: 0.8-dev

> Fetcher dies because of "max. redirects" (avoiding infinite loop)
> -
>
>  Key: NUTCH-277
>  URL: http://issues.apache.org/jira/browse/NUTCH-277
>  Project: Nutch
> Type: Bug

>   Components: fetcher
> Versions: 0.8-dev
>  Environment: nightly-2006-05-20
> Reporter: Stefan Neufeind
> Priority: Critical

>
> Error in the logs is:
> 060521 213401 SEVERE Narrowly avoided an infinite loop in execute
> org.apache.commons.httpclient.RedirectException: Maximum redirects (100) 
> exceeded
> at 
> org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:183)
> at 
> org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:396)
> at 
> org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:324)
> at 
> org.apache.nutch.protocol.httpclient.HttpResponse.(HttpResponse.java:87)
> at org.apache.nutch.protocol.httpclient.Http.getResponse(Http.java:97)
> at 
> org.apache.nutch.protocol.http.api.RobotRulesParser.isAllowed(RobotRulesParser.java:394)
> at 
> org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:173)
> at 
> org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:135)
> This happens during normal crawling. Unfortunately I don't know how to 
> further track this down. But it's problematic, since it actually makes the 
> fetcher die.
> Workaround (for the symptom) is in NUTCH-258 (avoid dying on a SEVERE 
> log entry). That works for me, crawling works fine and it does not 
> hang/crash. However this works around the problem rather than solving it - I 
> know. But it helps for the moment ...
> Hope somebody can help - this looks quite important to track down to me.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Created: (NUTCH-277) Fetcher dies because of "max. redirects" (avoiding infinite loop)

2006-05-21 Thread Stefan Neufeind (JIRA)
Fetcher dies because of "max. redirects" (avoiding infinite loop)
-

 Key: NUTCH-277
 URL: http://issues.apache.org/jira/browse/NUTCH-277
 Project: Nutch
Type: Bug

 Environment: nightly-2006-05-20
Reporter: Stefan Neufeind
Priority: Critical


Error in the logs is:
060521 213401 SEVERE Narrowly avoided an infinite loop in execute
org.apache.commons.httpclient.RedirectException: Maximum redirects (100) 
exceeded
at 
org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:183)
at 
org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:396)
at 
org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:324)
at 
org.apache.nutch.protocol.httpclient.HttpResponse.(HttpResponse.java:87)
at org.apache.nutch.protocol.httpclient.Http.getResponse(Http.java:97)
at 
org.apache.nutch.protocol.http.api.RobotRulesParser.isAllowed(RobotRulesParser.java:394)
at 
org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:173)
at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:135)

This happens during normal crawling. Unfortunately I don't know how to further 
track this down. But it's problematic, since it actually makes the fetcher die.

Workaround (for the symptom) is in NUTCH-258 (avoid dying on a SEVERE log 
entry). That works for me, crawling works fine and it does not hang/crash. 
However this works around the problem rather than solving it - I know. But it 
helps for the moment ...

Hope somebody can help - this looks quite important to track down to me.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Commented: (NUTCH-254) Fetcher throws NullPointer if redirect URL is filtered

2006-05-21 Thread Stefan Neufeind (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-254?page=comments#action_12412684 ] 

Stefan Neufeind commented on NUTCH-254:
---

Looks fine and applies cleanly for me - could this be merged into the dev trunk?

> Fetcher throws NullPointer if redirect URL is filtered
> --
>
>  Key: NUTCH-254
>  URL: http://issues.apache.org/jira/browse/NUTCH-254
>  Project: Nutch
> Type: Bug

>   Components: fetcher
> Versions: 0.8-dev
>  Environment: Tested on Windows XP Media Center 2005, 2Gigs RAM, 3.0 Ghz 
> Pentium 4 Hyperthreaded.  Should be on any platform.
> Reporter: Dennis Kubes
> Priority: Minor
>  Attachments: fetcher_filter_url_patch.txt
>
> Inside the Fetcher class, if a redirect URL is filtered - for example, 
> jsessionid pages are filtered with the default URL filter - then a 
> NullPointerException is thrown when Fetcher tries to print out that the url 
> was skipped for being an identical url.  It is not an identical URL but a 
> filtered url.  So what we really need is two different checks: one for a 
> null url and one for an identical url.  I have included a patch that handles 
> this.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Updated: (NUTCH-48) "Did you mean" query enhancement/refignment feature request

2006-05-21 Thread Stefan Neufeind (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-48?page=all ]

Stefan Neufeind updated NUTCH-48:
-

Attachment: did-you-mean-combined08.patch

Here are both patches combined into one, built against 0.8-dev (namely 
nightly-2006-05-20).

- The necessary API changes in 0.8-dev are incorporated in the patch.
- Some smaller things are also fixed, e.g.:
--- missing ../ in front of the link to search.jsp
--- missing tag at the end of the did-you-mean part

Small to-do left: maybe put the text "Did you mean" into the template to make 
it translatable to other languages. But I guess that can be done when finally 
merging this into the dev tree.

Patch tested and proved to work.

> "Did you mean"  query enhancement/refignment feature request
> 
>
>  Key: NUTCH-48
>  URL: http://issues.apache.org/jira/browse/NUTCH-48
>  Project: Nutch
> Type: New Feature

>   Components: web gui
>  Environment: All platforms
> Reporter: byron miller
> Assignee: Sami Siren
> Priority: Minor
>  Attachments: did-you-mean-combined08.patch, rss-spell.patch, 
> spell-check.patch
>
> Looking to implement a "Did you mean" feature for query result pages that 
> return < = x amount of results to invoke a response that would recommend a 
> fixed/related or spell checked query to try.
> Note from Doug to users list:
> David Spencer has worked on this some.
> http://www.searchmorph.com/weblog/index.php?id=23
> I think the code on his site might be more recent than what's committed
> to the lucene/contrib directory.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Updated: (NUTCH-275) Fetcher not parsing XHTML-pages at all

2006-05-20 Thread Stefan Neufeind (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-275?page=all ]

Stefan Neufeind updated NUTCH-275:
--

Description: 
Server reports page as "text/html" - so I thought it would be processed as html.
But something I guess evaluated the headers of the document and re-labeled it 
as "text/xml" (why not text/xhtml?).

For some reason there is no plugin to be found for indexing text/xml (why does 
TextParser not feel responsible?).

Links inside this document are NOT indexed at all - so digging into this 
website actually stops here.
Funny thing: For some magical reasons the dtd-files referenced in the header 
seem to be valid links for the fetcher and as such are indexed in the next 
round (if urlfilter allows).


060521 025018 fetching http://www.secreturl.something/
060521 025018 http.proxy.host = null
060521 025018 http.proxy.port = 8080
060521 025018 http.timeout = 1
060521 025018 http.content.limit = 65536
060521 025018 http.agent = NutchCVS/0.8-dev (Nutch; 
http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
060521 025018 fetcher.server.delay = 1000
060521 025018 http.max.delays = 1000
060521 025018 ParserFactory:Plugin: org.apache.nutch.parse.text.TextParser 
mapped to contentType text/xml via parse-plugins.xml, but
 its plugin.xml file does not claim to support contentType: text/xml
060521 025018 ParserFactory: Plugin: org.apache.nutch.parse.rss.RSSParser 
mapped to contentType text/xml via parse-plugins.xml, but 
not enabled via plugin.includes in nutch-default.xml
060521 025019 Using Signature impl: org.apache.nutch.crawl.MD5Signature
060521 025019  map 0%  reduce 0%
060521 025019 1 pages, 0 errors, 1.0 pages/s, 40 kb/s, 
060521 025019 1 pages, 0 errors, 1.0 pages/s, 40 kb/s, 

  was:
Server reports page as "text/html" - so I thought it would be processed as html.
But something I guess evaluated the headers of the document and re-labeled it 
as "text/xml" (why not text/xhtml?).

For some reason there is no plugin to be found for indexing text/xml (why does 
TextParser not feel responsible?).

Links inside this document are NOT indexed at all - so digging into this 
website actually stops here.
Funny thing: For some magical reasons the dtd-files referenced in the header 
seem to be valid links for the fetcher and as such are indexed in the next 
round (if urlfilter allows).


060521 025018 fetching http://www.speedpartner.de/
060521 025018 http.proxy.host = null
060521 025018 http.proxy.port = 8080
060521 025018 http.timeout = 1
060521 025018 http.content.limit = 65536
060521 025018 http.agent = NutchCVS/0.8-dev (Nutch; 
http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
060521 025018 fetcher.server.delay = 1000
060521 025018 http.max.delays = 1000
060521 025018 ParserFactory:Plugin: org.apache.nutch.parse.text.TextParser 
mapped to contentType text/xml via parse-plugins.xml, but
 its plugin.xml file does not claim to support contentType: text/xml
060521 025018 ParserFactory: Plugin: org.apache.nutch.parse.rss.RSSParser 
mapped to contentType text/xml via parse-plugins.xml, but 
not enabled via plugin.includes in nutch-default.xml
060521 025019 Using Signature impl: org.apache.nutch.crawl.MD5Signature
060521 025019  map 0%  reduce 0%
060521 025019 1 pages, 0 errors, 1.0 pages/s, 40 kb/s, 
060521 025019 1 pages, 0 errors, 1.0 pages/s, 40 kb/s, 


> Fetcher not parsing XHTML-pages at all
> --
>
>  Key: NUTCH-275
>  URL: http://issues.apache.org/jira/browse/NUTCH-275
>  Project: Nutch
> Type: Bug

> Versions: 0.8-dev
>  Environment: problem with nightly-2006-05-20; worked fine with same website 
> on 0.7.2
> Reporter: Stefan Neufeind

>
> Server reports page as "text/html" - so I thought it would be processed as 
> html.
> But something I guess evaluated the headers of the document and re-labeled it 
> as "text/xml" (why not text/xhtml?).
> For some reason there is no plugin to be found for indexing text/xml (why 
> does TextParser not feel responsible?).
> Links inside this document are NOT indexed at all - so digging into this 
> website actually stops here.
> Funny thing: For some magical reasons the dtd-files referenced in the header 
> seem to be valid links for the fetcher and as such are indexed in the next 
> round (if urlfilter allows).
> 060521 025018 fetching http://www.secreturl.something/
> 060521 025018 http.proxy.host = null
> 060521 025018 http.proxy.port = 8080
> 060521 025018 http.timeout = 1
> 060521 025018 http.content.limit = 65536
> 060521 025018 http.agent = NutchCVS/0.8-dev (Nutch; 
> http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
> 060521 025018 fetcher.server.delay = 1000
> 060521 025018 http.max.delays = 1000
> 060521 025018 ParserFactory:Plugin: org.apache.nutch.parse.text.TextParser 
> mapped to contentType text/xml via parse-plugins.xml, but
>  its plugin.xml file does not claim to support contentType: tex

[jira] Commented: (NUTCH-275) Fetcher not parsing XHTML-pages at all

2006-05-20 Thread Stefan Neufeind (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-275?page=comments#action_12412659 ] 

Stefan Neufeind commented on NUTCH-275:
---

I've found out that the first line actually leads to the problems. Without it, 
the file is parsed as html.
- But why can't XML be parsed at all (not even by TextParser)?
- And afaik that header is valid as is - I've been told so, and the W3C 
validator does not complain either.



<?xml version="1.0"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="de" lang="de">


> Fetcher not parsing XHTML-pages at all
> --
>
>  Key: NUTCH-275
>  URL: http://issues.apache.org/jira/browse/NUTCH-275
>  Project: Nutch
> Type: Bug

> Versions: 0.8-dev
>  Environment: problem with nightly-2006-05-20; worked fine with same website 
> on 0.7.2
> Reporter: Stefan Neufeind

>
> Server reports page as "text/html" - so I thought it would be processed as 
> html.
> But something I guess evaluated the headers of the document and re-labeled it 
> as "text/xml" (why not text/xhtml?).
> For some reason there is no plugin to be found for indexing text/xml (why 
> does TextParser not feel responsible?).
> Links inside this document are NOT indexed at all - so digging into this 
> website actually stops here.
> Funny thing: For some magical reasons the dtd-files referenced in the header 
> seem to be valid links for the fetcher and as such are indexed in the next 
> round (if urlfilter allows).
> 060521 025018 fetching http://www.speedpartner.de/
> 060521 025018 http.proxy.host = null
> 060521 025018 http.proxy.port = 8080
> 060521 025018 http.timeout = 1
> 060521 025018 http.content.limit = 65536
> 060521 025018 http.agent = NutchCVS/0.8-dev (Nutch; 
> http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
> 060521 025018 fetcher.server.delay = 1000
> 060521 025018 http.max.delays = 1000
> 060521 025018 ParserFactory:Plugin: org.apache.nutch.parse.text.TextParser 
> mapped to contentType text/xml via parse-plugins.xml, but
>  its plugin.xml file does not claim to support contentType: text/xml
> 060521 025018 ParserFactory: Plugin: org.apache.nutch.parse.rss.RSSParser 
> mapped to contentType text/xml via parse-plugins.xml, but 
> not enabled via plugin.includes in nutch-default.xml
> 060521 025019 Using Signature impl: org.apache.nutch.crawl.MD5Signature
> 060521 025019  map 0%  reduce 0%
> 060521 025019 1 pages, 0 errors, 1.0 pages/s, 40 kb/s, 
> 060521 025019 1 pages, 0 errors, 1.0 pages/s, 40 kb/s, 

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Created: (NUTCH-275) Fetcher not parsing XHTML-pages at all

2006-05-20 Thread Stefan Neufeind (JIRA)
Fetcher not parsing XHTML-pages at all
--

 Key: NUTCH-275
 URL: http://issues.apache.org/jira/browse/NUTCH-275
 Project: Nutch
Type: Bug

Versions: 0.8-dev
 Environment: problem with nightly-2006-05-20; worked fine with same website on 
0.7.2
Reporter: Stefan Neufeind


Server reports page as "text/html" - so I thought it would be processed as html.
But something I guess evaluated the headers of the document and re-labeled it 
as "text/xml" (why not text/xhtml?).

For some reason there is no plugin to be found for indexing text/xml (why does 
TextParser not feel responsible?).

Links inside this document are NOT indexed at all - so digging into this 
website actually stops here.
Funny thing: For some magical reasons the dtd-files referenced in the header 
seem to be valid links for the fetcher and as such are indexed in the next 
round (if urlfilter allows).


060521 025018 fetching http://www.speedpartner.de/
060521 025018 http.proxy.host = null
060521 025018 http.proxy.port = 8080
060521 025018 http.timeout = 1
060521 025018 http.content.limit = 65536
060521 025018 http.agent = NutchCVS/0.8-dev (Nutch; 
http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
060521 025018 fetcher.server.delay = 1000
060521 025018 http.max.delays = 1000
060521 025018 ParserFactory:Plugin: org.apache.nutch.parse.text.TextParser 
mapped to contentType text/xml via parse-plugins.xml, but
 its plugin.xml file does not claim to support contentType: text/xml
060521 025018 ParserFactory: Plugin: org.apache.nutch.parse.rss.RSSParser 
mapped to contentType text/xml via parse-plugins.xml, but 
not enabled via plugin.includes in nutch-default.xml
060521 025019 Using Signature impl: org.apache.nutch.crawl.MD5Signature
060521 025019  map 0%  reduce 0%
060521 025019 1 pages, 0 errors, 1.0 pages/s, 40 kb/s, 
060521 025019 1 pages, 0 errors, 1.0 pages/s, 40 kb/s, 

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Created: (NUTCH-274) Empty row in/at end of URL-list results in error

2006-05-20 Thread Stefan Neufeind (JIRA)
Empty row in/at end of URL-list results in error


 Key: NUTCH-274
 URL: http://issues.apache.org/jira/browse/NUTCH-274
 Project: Nutch
Type: Bug

Versions: 0.8-dev
 Environment: nightly-2006-05-20
Reporter: Stefan Neufeind
Priority: Minor


This is minor - but it's a little unclean :-)

Reproduce: have a URL file with one URL followed by a newline, thus producing 
an empty line.

Outcome: fetcher threads try to fetch two URLs at the same time. The first one 
is fine - but the second is empty and therefore fails protocol detection.
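The guard itself would be trivial; a minimal standalone sketch of the idea 
(illustrative, not the actual Injector code):

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.List;

public class SeedReader {
  // Read seed URLs from a file, skipping blank lines so an empty
  // trailing row never reaches the fetcher.
  public static List readSeeds(String file) throws Exception {
    List urls = new ArrayList();
    BufferedReader in = new BufferedReader(new FileReader(file));
    String line;
    while ((line = in.readLine()) != null) {
      line = line.trim();
      if (line.length() == 0) continue;   // skip the empty row
      urls.add(line);
    }
    in.close();
    return urls;
  }
}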


060521 022639   Nutch Analysis (org.apache.nutch.analysis.NutchAnalyzer)
060521 022639   Nutch Query Filter (org.apache.nutch.searcher.QueryFilter)
060521 022639 found resource parse-plugins.xml at 
file:/home/mm/nutch-nightly/conf/parse-plugins.xml
060521 022639 Using URL normalizer: org.apache.nutch.net.BasicUrlNormalizer
060521 022639 fetching http://www.bild.de/
060521 022639 fetching 
060521 022639 fetch of  failed with: 
org.apache.nutch.protocol.ProtocolNotFound: java.net.MalformedURLException: no 
protocol: 
060521 022639 http.proxy.host = null
060521 022639 http.proxy.port = 8080
060521 022639 http.timeout = 1
060521 022639 http.content.limit = 65536
060521 022639 http.agent = NutchCVS/0.8-dev (Nutch; 
http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
060521 022639 fetcher.server.delay = 1000
060521 022639 http.max.delays = 1000
060521 022640 ParserFactory:Plugin: org.apache.nutch.parse.text.TextParser 
mapped to contentType text/xml via parse-plugins.xml, but
 its plugin.xml file does not claim to support contentType: text/xml
060521 022640 ParserFactory:Plugin: org.apache.nutch.parse.html.HtmlParser 
mapped to contentType text/xml via parse-plugins.xml, but
 its plugin.xml file does not claim to support contentType: text/xml
060521 022640 ParserFactory: Plugin: org.apache.nutch.parse.rss.RSSParser 
mapped to contentType text/xml via parse-plugins.xml, but 
not enabled via plugin.includes in nutch-default.xml
060521 022640 Using Signature impl: org.apache.nutch.crawl.MD5Signature
060521 022640  map 0%  reduce 0%
060521 022640 1 pages, 1 errors, 1.0 pages/s, 40 kb/s, 
060521 022640 1 pages, 1 errors, 1.0 pages/s, 40 kb/s, 

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Updated: (NUTCH-173) PerHost Crawling Policy ( crawl.ignore.external.links )

2006-05-20 Thread Stefan Neufeind (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-173?page=all ]

Stefan Neufeind updated NUTCH-173:
--

Attachment: patch08-new.patch

Here is the 0.8 patch, corrected to work against the nightly from 2006-05-20.
Also, fromHost is now only generated if really needed, and nutch-default.xml is 
patched as well. By the way: where should a property for "crawl" be located in 
the config file? In the "fetcher" section? If so, please somebody move it 
up/down or rename the property before including it in the dev tree.

But could somebody please review it quickly? I'm not sure it's 100% correct. 
Still investigating on my side ...
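For reviewers, the core of the policy boils down to something like this hedged 
sketch (method name illustrative, not the patch itself):

import java.net.URL;

public class ExternalLinkPolicy {
  // With crawl.ignore.external.links=true, keep an outlink only if it
  // stays on the host of the page it was found on.
  public static boolean keepOutlink(String fromUrl, String toUrl,
                                    boolean ignoreExternal) throws Exception {
    if (!ignoreExternal) return true;
    String fromHost = new URL(fromUrl).getHost();
    String toHost = new URL(toUrl).getHost();
    return fromHost.equalsIgnoreCase(toHost);
  }
}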

> PerHost Crawling Policy ( crawl.ignore.external.links )
> ---
>
>  Key: NUTCH-173
>  URL: http://issues.apache.org/jira/browse/NUTCH-173
>  Project: Nutch
> Type: New Feature

>   Components: fetcher
> Versions: 0.7.1, 0.7, 0.8-dev
> Reporter: Philippe EUGENE
> Priority: Minor
>  Attachments: patch.txt, patch08-new.patch, patch08.txt
>
> There are two major ways of crawling in Nutch.
> Intranet crawl: forbid all, allow some few hosts
> Whole-web crawl: allow all, forbid a few things
> I propose a third type of crawl.
> Directory crawl: The purpose of this crawl is to manage a few thousand 
> hosts without managing rule patterns in UrlFilterRegexp.
> I made two patches, for 0.7/0.7.1 and 0.8-dev.
> I propose a new boolean property in nutch-site.xml: 
> crawl.ignore.external.links, with false as the default value.
> By default this new feature doesn't modify the behavior of the Nutch crawler.
> When you set this property to true, the crawler doesn't fetch external links 
> of the host.
> So the crawl is limited to the hosts that you inject at the beginning of the 
> crawl.
> I know there are some proposals for a new crawl policy using the CrawlDatum 
> in the 0.8-dev branch.
> This feature could be an easy way to quickly add a new crawl feature to 
> Nutch, while waiting for a better way to improve the crawl policy.
> I post two patches.
> Sorry for my very poor English.
> --
> Philippe

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Commented: (NUTCH-175) No input directories specified in: while crawing in nightly build from the 14.1.2006: sh ./nutch crawl urllist.txt -dir tmpdir

2006-05-20 Thread Stefan Neufeind (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-175?page=comments#action_12412644 ] 

Stefan Neufeind commented on NUTCH-175:
---

My bad - I didn't pay close attention when moving from 0.7 to 0.8. But I'd 
like to stress in this bug entry that "urls" in the example call to "nutch 
crawl" is no longer a file but actually a directory containing files with URLs 
in them.

RTFM - and now it works :-)

> No input directories specified in: while crawing in nightly build from the 
> 14.1.2006: sh ./nutch crawl urllist.txt -dir tmpdir
> --
>
>  Key: NUTCH-175
>  URL: http://issues.apache.org/jira/browse/NUTCH-175
>  Project: Nutch
> Type: Bug

>  Environment: SUSE Linux 9.3
> Reporter: Matthias Günter
> Priority: Trivial

>
> [EMAIL PROTECTED]:~/workspace/lucene/nutch-nightly/bin> sh ./nutch crawl 
> urllist.txt -dir tmpdir
> 060114 205612 parsing 
> file:/home/guenter/workspace/lucene/nutch-nightly/conf/nutch-default.xml
> 060114 205612 parsing 
> file:/home/guenter/workspace/lucene/nutch-nightly/conf/crawl-tool.xml
> 060114 205612 parsing 
> file:/home/guenter/workspace/lucene/nutch-nightly/conf/mapred-default.xml
> 060114 205612 parsing 
> file:/home/guenter/workspace/lucene/nutch-nightly/conf/nutch-site.xml
> 060114 205612 crawl started in: tmpdir
> 060114 205612 rootUrlDir = urllist.txt
> 060114 205612 threads = 10
> 060114 205612 depth = 5
> 060114 205612 parsing 
> file:/home/guenter/workspace/lucene/nutch-nightly/conf/nutch-default.xml
> 060114 205612 parsing 
> file:/home/guenter/workspace/lucene/nutch-nightly/conf/crawl-tool.xml
> 060114 205612 parsing 
> file:/home/guenter/workspace/lucene/nutch-nightly/conf/nutch-site.xml
> 060114 205612 Injector: starting
> 060114 205612 Injector: crawlDb: tmpdir/crawldb
> 060114 205612 Injector: urlDir: urllist.txt
> 060114 205612 Injector: Converting injected urls to crawl db entries.
> 060114 205612 parsing 
> file:/home/guenter/workspace/lucene/nutch-nightly/conf/nutch-default.xml
> 060114 205612 parsing 
> file:/home/guenter/workspace/lucene/nutch-nightly/conf/crawl-tool.xml
> 060114 205612 parsing 
> file:/home/guenter/workspace/lucene/nutch-nightly/conf/mapred-default.xml
> 060114 205612 parsing 
> file:/home/guenter/workspace/lucene/nutch-nightly/conf/mapred-default.xml
> 060114 205612 parsing 
> file:/home/guenter/workspace/lucene/nutch-nightly/conf/nutch-site.xml
> 060114 205612 Running job: job_n0o7ps
> 060114 205612 parsing 
> file:/home/guenter/workspace/lucene/nutch-nightly/conf/nutch-default.xml
> 060114 205613 parsing 
> file:/home/guenter/workspace/lucene/nutch-nightly/conf/mapred-default.xml
> 060114 205613 parsing /tmp/nutch/mapred/local/localRunner/job_n0o7ps.xml
> 060114 205613 parsing 
> file:/home/guenter/workspace/lucene/nutch-nightly/conf/nutch-site.xml
> java.io.IOException: No input directories specified in: NutchConf: 
> nutch-default.xml , mapred-default.xml , 
> /tmp/nutch/mapred/local/localRunner/job_n0o7ps.xml , nutch-site.xml
> at 
> org.apache.nutch.mapred.InputFormatBase.listFiles(InputFormatBase.java:85)
> at 
> org.apache.nutch.mapred.InputFormatBase.getSplits(InputFormatBase.java:95)
> at 
> org.apache.nutch.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:63)
> 060114 205613  map 0%
> Exception in thread "main" java.io.IOException: Job failed!
> at org.apache.nutch.mapred.JobClient.runJob(JobClient.java:308)
> at org.apache.nutch.crawl.Injector.inject(Injector.java:102)
> at org.apache.nutch.crawl.Crawl.main(Crawl.java:105)
> urllist.txt contains
>   http://www.mentor.ch
> PS: Is there a committer or developer (near Switzerland) who can provide 
> (paid) support with a mixed index for an intranet, some internet sites, and 
> scanning of local drives (P:\ , S:\ etc.)?




[jira] Commented: (NUTCH-272) Max. pages to crawl/fetch per site (emergency limit)

2006-05-19 Thread Stefan Neufeind (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-272?page=comments#action_12412620 ]

Stefan Neufeind commented on NUTCH-272:
---

Oh, I just discovered this new parameter was added in 0.8-dev :-)

But to my understanding of the description in nutch-default.xml, this only 
applies "per fetchlist", which would mean "for one run", right? So if I set 
this to 100 and fetch 10 rounds, I'd have at most 1000 documents? But what if 
there is (theoretically) one document on the first level with 200 links in it? 
In that case I suspect they are all written to the webdb as "to do" in the 
first run; in the next run the first 100 are fetched and the rest skipped, and 
in yet another round the next 100 are fetched. Is that right?

My idea was also to have this as a "per host" or "per site" setting - or to be 
able to override the value for a certain host ...
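For reference, a sketch of what such a setting looks like in nutch-site.xml, 
assuming the parameter under discussion is generate.max.per.host from 0.8-dev 
(the value 100 is only an example; as noted above, it caps URLs per host per 
generated fetchlist, not across the whole crawl):

  <property>
    <name>generate.max.per.host</name>
    <value>100</value>
    <description>Maximum number of URLs per host in a single fetchlist;
    -1 means no limit.</description>
  </property>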

> Max. pages to crawl/fetch per site (emergency limit)
> 
>
>  Key: NUTCH-272
>  URL: http://issues.apache.org/jira/browse/NUTCH-272
>  Project: Nutch
> Type: Improvement

> Reporter: Stefan Neufeind

>
> If I'm right, there is currently no way to set an "emergency limit", i.e. a 
> maximum number of pages to fetch per site. Is there an "easy" way to 
> implement such a limit, maybe as a plugin?




[jira] Commented: (NUTCH-173) PerHost Crawling Policy ( crawl.ignore.external.links )

2006-05-19 Thread Stefan Neufeind (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-173?page=comments#action_12412530 ]

Stefan Neufeind commented on NUTCH-173:
---

Applies fine and works for me on 0.7.2.

> PerHost Crawling Policy ( crawl.ignore.external.links )
> ---
>
>  Key: NUTCH-173
>  URL: http://issues.apache.org/jira/browse/NUTCH-173
>  Project: Nutch
> Type: New Feature

>   Components: fetcher
> Versions: 0.7.1, 0.7, 0.8-dev
> Reporter: Philippe EUGENE
> Priority: Minor
>  Attachments: patch.txt, patch08.txt
>
> There are two major ways to crawl in Nutch:
> Intranet crawl: forbid everything, allow a few hosts.
> Whole-web crawl: allow everything, forbid a few things.
> I propose a third type of crawl.
> Directory crawl: the purpose of this crawl is to manage a few thousand hosts 
> without maintaining rule patterns in UrlFilterRegexp.
> I made two patches: one for 0.7/0.7.1 and one for 0.8-dev.
> I propose a new boolean property in nutch-site.xml, 
> crawl.ignore.external.links, with a default value of false.
> By default this new feature does not modify the behavior of the Nutch crawler.
> When you set this property to true, the crawler does not fetch links external 
> to the host, so the crawl is limited to the hosts that you inject at the 
> beginning of the crawl.
> I know there are some proposals for a new crawl policy using the CrawlDatum 
> in the 0.8-dev branch.
> This feature could be an easy way to quickly add a new crawl feature to 
> Nutch while waiting for a better way to improve the crawl policy.
> I have posted two patches.
> Sorry for my very poor English
> --
> Philippe
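A minimal sketch of turning the proposed switch on in nutch-site.xml (the 
property name comes from the patch description above; it is not a stock 
0.7/0.8 setting until the patch is applied):

  <property>
    <name>crawl.ignore.external.links</name>
    <value>true</value>
    <description>If true, the fetcher only follows links pointing to the
    same host as the page they were found on (proposed in NUTCH-173).</description>
  </property>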




[jira] Created: (NUTCH-272) Max. pages to crawl/fetch per site (emergency limit)

2006-05-19 Thread Stefan Neufeind (JIRA)
Max. pages to crawl/fetch per site (emergency limit)


 Key: NUTCH-272
 URL: http://issues.apache.org/jira/browse/NUTCH-272
 Project: Nutch
Type: Improvement

Reporter: Stefan Neufeind


If I'm right, there is currently no way to set an "emergency limit", i.e. a 
maximum number of pages to fetch per site. Is there an "easy" way to implement 
such a limit, maybe as a plugin?
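One plugin-shaped way to approximate such a limit would be a URL filter that 
counts URLs per host and rejects everything past a cap. A rough sketch against 
the 0.8-dev URLFilter extension point follows (the class name and LIMIT value 
are made up, and the in-memory counting lives in one JVM, so this gives a 
coarse cap rather than an exact global limit):

  package org.example.urlfilter;               // hypothetical plugin package

  import java.net.MalformedURLException;
  import java.net.URL;
  import java.util.HashMap;
  import java.util.Map;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.nutch.net.URLFilter;

  public class MaxPerHostURLFilter implements URLFilter {

    private static final int LIMIT = 1000;     // illustrative hard cap per host

    private final Map<String, Integer> seen = new HashMap<String, Integer>();
    private Configuration conf;

    public synchronized String filter(String urlString) {
      try {
        String host = new URL(urlString).getHost();
        Integer count = seen.get(host);
        int c = (count == null) ? 0 : count.intValue();
        if (c >= LIMIT) {
          return null;                         // null tells Nutch to drop the URL
        }
        seen.put(host, Integer.valueOf(c + 1));
        return urlString;                      // keep the URL
      } catch (MalformedURLException e) {
        return null;                           // drop unparsable URLs
      }
    }

    // Configurable boilerplate expected by 0.8-dev plugin interfaces.
    public void setConf(Configuration conf) { this.conf = conf; }
    public Configuration getConf() { return conf; }
  }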




[jira] Created: (NUTCH-271) Meta-data per URL/site/section

2006-05-18 Thread Stefan Neufeind (JIRA)
Meta-data per URL/site/section
--

 Key: NUTCH-271
 URL: http://issues.apache.org/jira/browse/NUTCH-271
 Project: Nutch
Type: New Feature

Versions: 0.7.2
Reporter: Stefan Neufeind


We need to index sites and attach additional meta-data tags to them. As far as 
I know this is not yet possible - or is there a "workaround" I don't see? What 
I have in mind is assigning meta-tags per start URL, indexing only content 
below that URL, and being able to restrict searches by those meta-tags. E.g.

http://www.example1.com/something1/   -> meta-tag "companybranch1"
http://www.example2.com/something2/   -> meta-tag "companybranch2"
http://www.example3.com/something3/   -> meta-tag "companybranch1"
http://www.example4.com/something4/   -> meta-tag "companybranch3"

One could then search for everything in companybranch1, or across branches 1 
and 3, or similar.
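One conceivable route is an indexing-filter plugin that maps URL prefixes to a 
tag field at index time. A rough sketch against the 0.8-dev IndexingFilter 
interface (the class name, the "branch" field, and the hard-coded prefix table 
are purely illustrative; a real plugin would read the mapping from 
configuration):

  package org.example.indexer;                 // hypothetical plugin package

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.io.UTF8;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;
  import org.apache.nutch.crawl.CrawlDatum;
  import org.apache.nutch.crawl.Inlinks;
  import org.apache.nutch.indexer.IndexingException;
  import org.apache.nutch.indexer.IndexingFilter;
  import org.apache.nutch.parse.Parse;

  public class BranchTagIndexingFilter implements IndexingFilter {

    private Configuration conf;

    public Document filter(Document doc, Parse parse, UTF8 url,
                           CrawlDatum datum, Inlinks inlinks)
        throws IndexingException {
      String u = url.toString();
      // Illustrative prefix-to-tag table, mirroring the examples above.
      if (u.startsWith("http://www.example1.com/something1/")
          || u.startsWith("http://www.example3.com/something3/")) {
        doc.add(new Field("branch", "companybranch1",
                          Field.Store.YES, Field.Index.UN_TOKENIZED));
      } else if (u.startsWith("http://www.example2.com/something2/")) {
        doc.add(new Field("branch", "companybranch2",
                          Field.Store.YES, Field.Index.UN_TOKENIZED));
      }
      return doc;
    }

    // Configurable boilerplate expected by 0.8-dev plugin interfaces.
    public void setConf(Configuration conf) { this.conf = conf; }
    public Configuration getConf() { return conf; }
  }

On the search side, a small query-filter plugin (e.g. one extending 
org.apache.nutch.searcher.RawFieldQueryFilter for the same field) should then 
allow queries like "branch:companybranch1" to restrict results, though I have 
not verified the exact wiring.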
