Re: Issues to work on
+1 on this.

Dennis Kubes <[EMAIL PROTECTED]> wrote:
> What would be good issues to tackle, or bugs to fix, in either the Nutch
> or the Hadoop code base? I looked through JIRA but can't really tell
> whether things are being worked on or not.
> Dennis
Re: http chunked content
Okay, saw the code in the protocol-http plugin. I remember looking at this about a year ago. RFC 2616 (HTTP/1.1) does say, as Jerome pointed out: "A server MUST NOT send transfer-codings to an HTTP/1.0 client." Regardless, I can attest that there are servers out there that return chunked content no matter what the client sends.

We had a socket implementation akin to HttpResponse.java in the protocol-http plugin and were stumped on how to reliably identify whether a response was chunked, since we could not trust the Transfer-Encoding header. The only other way we could see was to use the initial hex characters denoting the size of the first chunk. More from RFC 2616: "The chunk-size field is a string of hex digits indicating the size of the chunk. The chunked encoding is ended by any chunk whose size is zero, followed by the trailer, which is terminated by an empty line." In practice this was error-prone.

Switching over to Apache HttpClient eliminated the problem, as it transparently handles both chunked and un-chunked content. But HttpClient is much more heavyweight, so the conversion could only be done after implementing some basic resource pooling on the primary HttpClient object. It does look like this would be a serious refactoring job, as Nutch uses java.net classes throughout. On the other hand, it might simplify some areas of the Nutch protocol classes, and HttpClient has some interesting built-in support for multi-threading and performance tuning of requests.

I hope this helps towards a solution.

Best Regards,
Chris

--- Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
> Chris Fellows wrote:
> > Just remembered, got around it by using HttpClient, which handles
> > reading the response (chunked or not) transparently. Haven't looked at
> > the Nutch code, but if we were to use HttpClient 3.0.x or later, that
> > should take care of it.
>
> Take a look at protocol-httpclient. This discussion is on whether/how to
> fix protocol-http. The other plugin already supports this.
>
> --
> Best regards,
> Andrzej Bialecki
> http://www.sigram.com  Contact: info at sigram dot com
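For reference, a minimal sketch (plain java.io only - not the actual Nutch or HttpClient code) of what decoding a chunked body by hand involves: read the hex chunk-size line, read that many bytes, repeat until a zero-size chunk, then skip the trailer, per RFC 2616:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

public class ChunkedDecoder {

  // Reads a chunked-encoded body from the stream and returns the decoded bytes.
  // Chunk extensions (";...") and trailer headers are skipped, per RFC 2616.
  public static byte[] decode(InputStream in) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    while (true) {
      String sizeLine = readLine(in);
      int semi = sizeLine.indexOf(';');           // strip any chunk extension
      if (semi >= 0) sizeLine = sizeLine.substring(0, semi);
      int size = Integer.parseInt(sizeLine.trim(), 16);
      if (size == 0) break;                       // last chunk
      for (int i = 0; i < size; i++) {
        int b = in.read();
        if (b < 0) throw new IOException("truncated chunk");
        out.write(b);
      }
      readLine(in);                               // consume CRLF after chunk data
    }
    // skip the optional trailer: read header lines until an empty line
    while (readLine(in).length() > 0) { }
    return out.toByteArray();
  }

  private static String readLine(InputStream in) throws IOException {
    StringBuilder sb = new StringBuilder();
    int b;
    while ((b = in.read()) >= 0 && b != '\n') {
      if (b != '\r') sb.append((char) b);
    }
    return sb.toString();
  }

  public static void main(String[] args) throws IOException {
    byte[] body = "4\r\nWiki\r\n5\r\npedia\r\n0\r\n\r\n".getBytes("US-ASCII");
    System.out.println(new String(decode(new ByteArrayInputStream(body)), "US-ASCII"));
  }
}
```

Note this is the easy half of the problem - the hard part, as described above, is deciding whether to run this decoder at all when the headers lie.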
Re: http chunked content
Just remembered, got around it by using HttpClient, which handles reading the response (chunked or not) transparently. Haven't looked at the Nutch code, but if we were to use HttpClient 3.0.x or later, that should take care of it.

--- Chris Fellows <[EMAIL PROTECTED]> wrote:
> > Furthermore, we can read in the HTTP/1.1 specification that "A server
> > MUST NOT send transfer-codings to an HTTP/1.0 client".
>
> I once did a socket implementation against Anonymizer. This is a
> well-established proxy service that services $100K+ government and
> private contracts. Their server always sent chunked content despite all
> headers. I'm pretty sure there are other well-established servers that
> send chunked content despite the RFC.
>
> Guessing that it might have something to do with wanting to control
> content compression. All the browsers can handle it, and that's probably
> all Apple is concerned with - even though they're overriding an RFC
> requirement.
>
> Chris
>
> --- Jérôme Charron <[EMAIL PROTECTED]> wrote:
> > > http://www.apple.com for example answers with chunked content even
> > > if you request with an HTTP 1.0 header.
> >
> > Stefan,
> >
> > I don't see any "Transfer-Encoding: chunked" header in responses from
> > www.apple.com. Furthermore, we can read in the HTTP/1.1 specification
> > that "A server MUST NOT send transfer-codings to an HTTP/1.0 client".
> >
> > Jérôme
> >
> > --
> > http://motrech.free.fr/
> > http://www.frutch.org/
Re: http chunked content
> Furthermore, we can read in the HTTP/1.1 specification that "A server
> MUST NOT send transfer-codings to an HTTP/1.0 client".

I once did a socket implementation against Anonymizer. This is a well-established proxy service that services $100K+ government and private contracts. Their server always sent chunked content despite all headers. I'm pretty sure there are other well-established servers that send chunked content despite the RFC.

Guessing that it might have something to do with wanting to control content compression. All the browsers can handle it, and that's probably all Apple is concerned with - even though they're overriding an RFC requirement.

Chris

--- Jérôme Charron <[EMAIL PROTECTED]> wrote:
> > http://www.apple.com for example answers with chunked content even if
> > you request with an HTTP 1.0 header.
>
> Stefan,
>
> I don't see any "Transfer-Encoding: chunked" header in responses from
> www.apple.com. Furthermore, we can read in the HTTP/1.1 specification
> that "A server MUST NOT send transfer-codings to an HTTP/1.0 client".
>
> Jérôme
>
> --
> http://motrech.free.fr/
> http://www.frutch.org/
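Since the header can't be trusted against servers like that, the only client-side alternative is sniffing the body itself - e.g. checking whether it starts with something shaped like an RFC 2616 chunk-size line. A hypothetical sketch (plain Java, not Nutch code), which also shows why the approach is fragile - any body that happens to open with hex digits and a CRLF is a false positive:

```java
public class ChunkSniffer {

  // Heuristic only: returns true if the first line of the body looks like an
  // RFC 2616 chunk-size line (hex digits, optional ";ext" chunk extension).
  // This is exactly the kind of guess that proved error-prone in practice.
  public static boolean looksChunked(String body) {
    int eol = body.indexOf("\r\n");
    if (eol <= 0) return false;
    String line = body.substring(0, eol);
    int semi = line.indexOf(';');
    if (semi >= 0) line = line.substring(0, semi);
    line = line.trim();
    if (line.isEmpty()) return false;
    for (int i = 0; i < line.length(); i++) {
      if (Character.digit(line.charAt(i), 16) < 0) return false;
    }
    return true;
  }

  public static void main(String[] args) {
    System.out.println(looksChunked("1a; ext\r\n...payload..."));      // true
    System.out.println(looksChunked("<html>\r\n<body>hi</body>"));     // false
  }
}
```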
Re: Merging segments
That's great. Well, my follow-up to that then is: will the new tool allow any form of "diffing" segments? In practice this would let you run a crawl on a series of sites one week, run another crawl on the same sites a week or so later, diff the two segments, and allow users to search on what changed within the search domain.

--- Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
> Chris Fellows wrote:
> > Hello,
> >
> > So the last discussion on merging segments was back in Jan. Has there
> > been any progress in this direction? What would be the benefit of
> > being able to merge segments? Would being able to merge segments open
> > up new functionality options, or is merging just a convenience? Also,
> > what's the estimate for how involved merge functionality development
> > is?
>
> Relief is on the way. Fine folks at houxou.com have sponsored the
> development of a brand-new SegmentMerger + slicer, and decided to donate
> it to the project - big thanks!
>
> I'm running some final tests, and will commit it today/tomorrow.
>
> --
> Best regards,
> Andrzej Bialecki
> http://www.sigram.com  Contact: info at sigram dot com
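For what it's worth, the diff idea could be approximated today, outside any merger tool, by dumping a URL -> content-checksum map from each crawl and comparing them. A sketch with hypothetical inputs (plain java.util maps, not the SegmentMerger API):

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class SegmentDiff {

  // Given URL -> content-checksum maps dumped from two crawls of the same
  // sites, return the URLs that are new or whose content changed.
  public static Set<String> changedUrls(Map<String, String> oldCrawl,
                                        Map<String, String> newCrawl) {
    Set<String> changed = new HashSet<String>();
    for (Map.Entry<String, String> e : newCrawl.entrySet()) {
      String before = oldCrawl.get(e.getKey());
      if (before == null || !before.equals(e.getValue())) {
        changed.add(e.getKey());   // new page, or checksum differs
      }
    }
    return changed;
  }

  public static void main(String[] args) {
    Map<String, String> week1 = new HashMap<String, String>();
    week1.put("http://a.example/1", "aaa");
    week1.put("http://a.example/2", "bbb");
    Map<String, String> week2 = new HashMap<String, String>();
    week2.put("http://a.example/1", "aaa");   // unchanged
    week2.put("http://a.example/2", "ccc");   // changed
    week2.put("http://a.example/3", "ddd");   // new
    System.out.println(changedUrls(week1, week2));
  }
}
```

Searching only over the changed subset would then just mean indexing those URLs into their own index.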
Merging segments
Hello,

So the last discussion on merging segments was back in Jan. Has there been any progress in this direction? What would be the benefit of being able to merge segments? Would being able to merge segments open up new functionality options, or is merging just a convenience? Also, what's the estimate for how involved merge functionality development is?

Regards,
- Chris
[jira] Commented: (NUTCH-134) Summarizer doesn't select the best snippets
[ http://issues.apache.org/jira/browse/NUTCH-134?page=comments#action_12377866 ]

Chris Fellows commented on NUTCH-134:
-------------------------------------

Jerome,

Let me know if you could use a hand with the implementation. I'd like to get to know the Nutch and Lucene code bases better for my project. This looks like a good area to start in, so any opportunity to jump in would be great.

chris

> Summarizer doesn't select the best snippets
> -------------------------------------------
>
>          Key: NUTCH-134
>          URL: http://issues.apache.org/jira/browse/NUTCH-134
>      Project: Nutch
>         Type: Bug
>   Components: searcher
>     Versions: 0.7.2, 0.7.1, 0.7, 0.8-dev
>     Reporter: Andrzej Bialecki
>
> Summarizer.java tries to select the best fragments from the input text, where the frequency of query terms is the highest. However, the logic in line 223 is flawed in that the excerptSet.add() operation will add new excerpts only if they are not already present - the test is performed using a Comparator that compares only the numUniqueTokens. This means that if there are two or more excerpts which score equally high, only the first of them will be retained, and the rest of the equally-scoring excerpts will be discarded in favor of other excerpts (possibly lower-scoring).
> To fix this, the Set should be replaced with a List + a sort operation. To keep the relative position of excerpts in the original order, the Excerpt class should be extended with an "int order" field, and the collected excerpts should be sorted in that order prior to adding them to the summary.

--
This message is automatically generated by JIRA.
- If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
- For more information on JIRA, see: http://www.atlassian.com/software/jira
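The fix described in the issue - a List sorted by score instead of a Set keyed on the score, plus an "int order" field to restore original positions - might look roughly like this (a hypothetical stand-in for the Excerpt class, not the actual Summarizer code):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class ExcerptSelection {

  // Minimal stand-in for Summarizer's Excerpt: a score plus original position.
  static class Excerpt {
    final int numUniqueTokens;  // the score
    final int order;            // position in the original text
    final String text;
    Excerpt(int numUniqueTokens, int order, String text) {
      this.numUniqueTokens = numUniqueTokens;
      this.order = order;
      this.text = text;
    }
  }

  // Keep the top-n excerpts by score (ties are retained, unlike with a Set
  // whose Comparator compares only the score), then restore original document
  // order for the summary.
  static List<Excerpt> select(List<Excerpt> all, int n) {
    List<Excerpt> byScore = new ArrayList<Excerpt>(all);
    Collections.sort(byScore, new Comparator<Excerpt>() {
      public int compare(Excerpt a, Excerpt b) {
        return b.numUniqueTokens - a.numUniqueTokens;   // highest score first
      }
    });
    List<Excerpt> top =
        new ArrayList<Excerpt>(byScore.subList(0, Math.min(n, byScore.size())));
    Collections.sort(top, new Comparator<Excerpt>() {
      public int compare(Excerpt a, Excerpt b) {
        return a.order - b.order;                       // original order
      }
    });
    return top;
  }

  public static void main(String[] args) {
    List<Excerpt> all = new ArrayList<Excerpt>();
    all.add(new Excerpt(2, 0, "A"));
    all.add(new Excerpt(3, 1, "B"));
    all.add(new Excerpt(3, 2, "C"));  // same score as B: both survive now
    all.add(new Excerpt(1, 3, "D"));
    for (Excerpt e : select(all, 3)) System.out.print(e.text);
    System.out.println();
  }
}
```

With the Set-based version, B and C (equal scores) could not coexist; here both make the cut and the summary comes out in document order.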
[jira] Commented: (NUTCH-134) Summarizer doesn't select the best snippets
[ http://issues.apache.org/jira/browse/NUTCH-134?page=comments#action_12377654 ]

Chris Fellows commented on NUTCH-134:
-------------------------------------

byron,

Did you ever get a chance to run a CPU perf test on using lucene/contrib/highlighter for extracting summaries?

chris
[jira] Commented: (NUTCH-25) needs 'character encoding' detector
[ http://issues.apache.org/jira/browse/NUTCH-25?page=comments#action_12376611 ]

Chris Fellows commented on NUTCH-25:
------------------------------------

This was last updated May '05. Has this charset and language detection been integrated into Nutch yet? If not, at what point should the detection happen - fetching, parsing, etc.? If this hasn't been fixed, any leads on where to insert the detection would be helpful.

> needs 'character encoding' detector
> -----------------------------------
>
>          Key: NUTCH-25
>          URL: http://issues.apache.org/jira/browse/NUTCH-25
>      Project: Nutch
>         Type: Wish
>     Reporter: Stefan Groschupf
>     Priority: Trivial
>
> transferred from: http://sourceforge.net/tracker/index.php?func=detail&aid=995730&group_id=59548&atid=491356
> submitted by: Jungshik Shin
>
> This is a follow-up to bug 993380 (figure out 'charset' from the meta tag).
> Although we can cover a lot of ground using the 'C-T' header field in the HTTP header and the corresponding meta tag in html documents (and in the case of XML, we have to use a similar but different 'parsing'), in the wild there are a lot of documents without any information about the character encoding used. Browsers like Mozilla and search engines like Google use character encoding detectors to deal with these 'unlabelled' documents.
> Mozilla's character encoding detector is GPL/MPL'd and we might be able to port it to Java. Unfortunately, it's not fool-proof. However, along with some other heuristics used by Mozilla and elsewhere, it should be possible to achieve a high rate of detection. The following page has links to some other related pages: http://trainedmonkey.com/week/2004/26
> In addition to character encoding detection, we also need to detect the language of a document, which is even harder and should be a separate bug (although it's related).
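As a crude illustration of where such detection could slot in after fetching - and nothing like Mozilla's statistical detector, just a fallback heuristic using only java.nio - one can attempt a strict UTF-8 decode of the fetched bytes and fall back to ISO-8859-1 when it fails:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

public class CharsetGuesser {

  // If the bytes are valid UTF-8, report UTF-8; otherwise fall back to
  // ISO-8859-1, which accepts any byte sequence. A real detector would use
  // statistical models over byte distributions, as Mozilla's does.
  public static String guess(byte[] content) {
    try {
      Charset.forName("UTF-8").newDecoder()
          .onMalformedInput(CodingErrorAction.REPORT)
          .onUnmappableCharacter(CodingErrorAction.REPORT)
          .decode(ByteBuffer.wrap(content));
      return "UTF-8";
    } catch (CharacterCodingException e) {
      return "ISO-8859-1";
    }
  }

  public static void main(String[] args) throws Exception {
    System.out.println(guess("ç".getBytes("UTF-8")));       // valid UTF-8
    System.out.println(guess(new byte[] { (byte) 0xE7 }));  // bare 0xE7: not UTF-8
  }
}
```

This only distinguishes "valid UTF-8" from "something else", but it shows the shape of a detector that a parser plugin could consult when neither the C-T header nor the meta tag says anything.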
[jira] Commented: (NUTCH-18) Windows servers include illegal characters in URLs
[ http://issues.apache.org/jira/browse/NUTCH-18?page=comments#action_12376601 ]

Chris Fellows commented on NUTCH-18:
------------------------------------

So, checking out other search engines: Google and Yahoo use decoded display URLs, i.e. en.wiktionary.org/wiki/ç, whereas AltaVista uses encoded URLs, i.e. en.wiktionary.org/wiki/%C3%A7.

I would say that human-readable, decoded URLs are the way to go, especially since Google and Yahoo both support this. It's a small item, but it's one that many users will experience. The code that controls this is in search.jsp: <%=Entities.encode(url)%>

I need the decoded form for my project. If any contributors want the change, I'll submit the one-file patch for decoded URLs. If any contributors want the URL completely encoded per RFC 1738 for use in fetching and searching, then I can submit that patch as well. This last item is what I believe this bug was opened for in the first place, though after the research posted above, it doesn't look like it's required.

> Windows servers include illegal characters in URLs
> --------------------------------------------------
>
>          Key: NUTCH-18
>          URL: http://issues.apache.org/jira/browse/NUTCH-18
>      Project: Nutch
>         Type: Bug
>   Components: fetcher
>     Reporter: Stefan Groschupf
>     Priority: Minor
>
> Transferred from: http://sourceforge.net/tracker/index.php?func=detail&aid=1110243&group_id=59548&atid=491356
> submitted by: Ken Meltsner
>
> While spidering our intranet, I found that IIS may include illegal characters in URLs -- specifically, characters with the high bit set to produce non-English letters. In addition, both Firefox and IE will accept URLs with high-bit characters, but Java won't.
> While this may not be Nutch's (or Java's) fault, it would help if high-bit characters (and other illegal characters) in URLs could be escaped (using percent-hex notation) as part of the URL fix-up process, probably right after the hostname lower-case conversion.
> Example document name in Portuguese (with high-bit characters) taken from a longer URL:
> Nota%20tecnica%20-%20Alteração%20de%20escopo.doc
> and with percent-escaped characters:
> Nota%20tecnica%20-%20Altera%e7%e3o%20de%20escopo.doc
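The percent-hex escaping step suggested in the report (escape high-bit and other illegal bytes in the URL fix-up pass, leaving existing %XX escapes and ordinary ASCII alone) could be sketched like this - a hypothetical helper, not the actual Nutch fix-up code:

```java
public class UrlFixup {

  private static final String HEX = "0123456789ABCDEF";

  // Percent-escape any character outside printable ASCII in the path portion
  // of a URL, leaving already-escaped %XX sequences and ordinary characters
  // untouched. Characters are treated as single ISO-8859-1 bytes (one char ==
  // one byte), matching the IIS examples from this issue; a multi-byte
  // encoding would need the bytes of each char escaped individually.
  public static String escapeHighBit(String path) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < path.length(); i++) {
      char c = path.charAt(i);
      if (c <= 0x20 || c >= 0x7F) {
        sb.append('%')
          .append(HEX.charAt((c >> 4) & 0xF))
          .append(HEX.charAt(c & 0xF));
      } else {
        sb.append(c);
      }
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    // The Portuguese example from the report: ç (0xE7) and ã (0xE3)
    System.out.println(escapeHighBit("Nota%20tecnica%20-%20Alteração%20de%20escopo.doc"));
  }
}
```

Run on the example above, this produces the percent-escaped form from the report (with uppercase hex): Nota%20tecnica%20-%20Altera%E7%E3o%20de%20escopo.doc.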
[jira] Commented: (NUTCH-18) Windows servers include illegal characters in URLs
[ http://issues.apache.org/jira/browse/NUTCH-18?page=comments#action_12376554 ]

Chris Fellows commented on NUTCH-18:
------------------------------------

Was looking into the NUTCH-18 bug, which revolves around illegal, non-ASCII characters in a URL. An example of a high-bit character is 'ç'. Before applying any fix, I did a brief test with the 0.8 trunk. After fetching and indexing http://en.wiktionary.org/wiki/ç, I was able to search on ç and got the following result in the browser:

ç - Wiktionary ... Letter [ edit ] Translingual [ edit ] Letter Ç , ç C with a cedilla ... visit IRC or Wiktionary:AOL . ç ... http://en.wiktionary.org/wiki/%C3%A7 (cached) (explain) (anchors) (more from en.wiktionary.org)

So it looks like Nutch will fetch and parse URLs with high-bit characters. Additionally, the display URL has the ç encoded correctly as %C3%A7. Is this really a bug?

Doing a similar test on Google with the keywords "ç" wiktionary yields:

ç - Wiktionary AOL users can access Wiktionary through this link after accepting the CACERT certificate. ... Ç, ç. "tʃə", the fourth letter of the Albanian alphabet. ... en.wiktionary.org/wiki/ç - 15k - Cached - Similar pages [ More results from en.wiktionary.org ]

Nearly identical, but note that the ç is in its decoded form, not %C3%A7. I'd say if anything, the bug is that the display URLs are in encoded form and not human-readable.
Nutch-18 illegal chars in urls: Not sure what the problem is
Hello,

Was looking into the NUTCH-18 bug, which revolves around illegal, non-ASCII characters in a URL. An example of a high-bit character is 'ç'. Before applying any fix, I did a brief test with the 0.8 trunk. After fetching and indexing http://en.wiktionary.org/wiki/ç, I was able to search on ç and got the following result in the browser:

ç - Wiktionary ... Letter [ edit ] Translingual [ edit ] Letter Ç , ç C with a cedilla ... visit IRC or Wiktionary:AOL . ç ... http://en.wiktionary.org/wiki/%C3%A7 (cached) (explain) (anchors) (more from en.wiktionary.org)

So it looks like Nutch will fetch and parse URLs with high-bit characters. Additionally, the display URL has the ç encoded correctly as %C3%A7. Is this really a bug?

Doing a similar test on Google with the keywords "ç" wiktionary yields:

ç - Wiktionary AOL users can access Wiktionary through this link after accepting the CACERT certificate. ... Ç, ç. "tʃə", the fourth letter of the Albanian alphabet. ... en.wiktionary.org/wiki/ç - 15k - Cached - Similar pages [ More results from en.wiktionary.org ]

Nearly identical, but note that the ç is in its decoded form, not %C3%A7.

If there's any interest in this issue, let me know.

Chris
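The decoded display behaviour seen on Google amounts to a UTF-8 percent-decode of the URL before rendering, and the stdlib already covers both directions (illustrated with the ç <-> %C3%A7 pair from the test above; note URLDecoder/URLEncoder are really form-encoding classes, so '+' handling differs slightly from strict URL escaping, which doesn't matter for this example):

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

public class DisplayUrl {

  public static void main(String[] args) throws UnsupportedEncodingException {
    String encoded = "en.wiktionary.org/wiki/%C3%A7";

    // Decode %C3%A7 back to the human-readable ç for display purposes.
    String display = URLDecoder.decode(encoded, "UTF-8");
    System.out.println(display);                          // en.wiktionary.org/wiki/ç

    // And the reverse, for fetching: ç -> %C3%A7
    System.out.println(URLEncoder.encode("ç", "UTF-8"));  // %C3%A7
  }
}
```

So a one-file patch to search.jsp along these lines would give the Google/Yahoo-style display without touching how URLs are stored or fetched.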