Re: Issues to work on

2006-05-10 Thread Chris Fellows
+1 on this

Dennis Kubes <[EMAIL PROTECTED]> wrote: What would be good issues to tackle, 
or bugs to fix, in either the Nutch or the Hadoop code base? I looked through 
JIRA but can't really tell whether things are being worked on or not.

Dennis



Re: http chunked content

2006-05-08 Thread Chris Fellows
Okay, saw the code in the http-protocol plugin. I
remember looking at this about a year ago. RFC 2616
(HTTP/1.1) does say, as Jerome pointed out:

"A server MUST NOT send transfer-codings to an
HTTP/1.0 client."

Regardless of the spec, I can attest that there are
servers out there that return chunked content no
matter what the client sends.

We had a socket implementation akin to the
HttpResponse.java in the http-protocol plugin and were
stumped on how to identify whether the response was
chunked or not, as we could not reliably use the
Transfer-Encoding header. The only way we could see
was to use the initial hex characters denoting the
size of the first chunk.

"The chunk-size field is a string of hex digits
indicating the size of the chunk. The chunked encoding
is ended by any chunk whose size is zero, followed by
the trailer, which is terminated by an empty line." -
more from RFC 2616
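
To make that approach concrete, here is a rough sketch of the kind of chunk
decoding we attempted (a toy reader written for this mail, not our production
code; trailer headers and malformed input are barely handled):

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

public class ChunkedBodyReader {

    // Read one header line, tolerating CRLF or bare LF line endings.
    static String readLine(InputStream in) throws IOException {
        StringBuffer sb = new StringBuffer();
        int c;
        while ((c = in.read()) != -1 && c != '\n') {
            if (c != '\r') sb.append((char) c);
        }
        return sb.toString();
    }

    static byte[] readChunkedBody(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while (true) {
            String sizeLine = readLine(in);
            int semi = sizeLine.indexOf(';');                 // drop chunk extensions
            if (semi >= 0) sizeLine = sizeLine.substring(0, semi);
            int size = Integer.parseInt(sizeLine.trim(), 16); // the hex chunk size
            if (size == 0) break;                             // zero chunk ends the body
            byte[] chunk = new byte[size];
            int read = 0;
            while (read < size) {
                int n = in.read(chunk, read, size - read);
                if (n == -1) throw new IOException("truncated chunk");
                read += n;
            }
            out.write(chunk, 0, read);
            readLine(in);                                     // CRLF after each chunk
        }
        readLine(in);                                         // empty line after trailer
        return out.toByteArray();
    }
}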

But in practice this was error prone. Switching over
to Apache HttpClient eliminated the problem, as it
transparently handles both chunked and un-chunked
content. HttpClient is much more heavyweight, though,
so the conversion could only be done after
implementing some basic resource pooling on the
primary HttpClient object. 

It does look like this would be a serious refactoring
job, as Nutch uses java.net classes throughout. On the
other hand, it might simplify some areas of the Nutch
protocol classes, and HttpClient has some interesting
built-in support for multi-threading and performance
tuning of requests.
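
For anyone weighing the refactor, a minimal sketch of Commons HttpClient 3.x
usage, including the pooling mentioned above (illustrative only - the URL and
pool sizes are placeholders, and this is not the protocol-httpclient plugin's
actual code):

import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.MultiThreadedHttpConnectionManager;
import org.apache.commons.httpclient.methods.GetMethod;

public class ChunkedFetchSketch {
    public static void main(String[] args) throws Exception {
        // Pooled connection manager: the "basic resource pooling" step,
        // and the basis of HttpClient's multi-threading support.
        MultiThreadedHttpConnectionManager manager =
            new MultiThreadedHttpConnectionManager();
        manager.getParams().setMaxTotalConnections(100);
        manager.getParams().setDefaultMaxConnectionsPerHost(10);

        HttpClient client = new HttpClient(manager);
        GetMethod get = new GetMethod("http://www.example.com/");
        try {
            int status = client.executeMethod(get);
            // Chunked or not, the body comes back already decoded.
            byte[] body = get.getResponseBody();
            System.out.println(status + ": " + body.length + " bytes");
        } finally {
            get.releaseConnection(); // return the connection to the pool
        }
    }
}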

I hope this helps towards a solution.

Best Regards,

Chris

--- Andrzej Bialecki <[EMAIL PROTECTED]> wrote:

> Chris Fellows wrote:
> > Just remembered: I got around it by using
> > HttpClient, which handles reading the response
> > (chunked or not) transparently. I haven't looked
> > at the Nutch code, but if we were to use HttpClient
> > 3.0.x or later, it should take care of it.
> >
> >   
> 
> Take a look at protocol-httpclient. This discussion
> is on whether/how to 
> fix protocol-http. The other plugin already supports
> this.
> 
> -- 
> Best regards,
> Andrzej Bialecki <><
> Information Retrieval, Semantic Web
> Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
> 
> 
> 



Re: http chunked content

2006-05-08 Thread Chris Fellows
Just remembered: I got around it by using HttpClient,
which handles reading the response (chunked or not)
transparently. I haven't looked at the Nutch code, but
if we were to use HttpClient 3.0.x or later, it should
take care of it.

--- Chris Fellows <[EMAIL PROTECTED]> wrote:

> > Furthermore, we can read in HTTP/1.1 specification
> > that "A server MUST NOT send
> > transfer-codings to an HTTP/1.0 client".
> 
> I once did a socket implementation against
> Anonymizer. This is a well-established proxy service
> that services $100K+ government and private
> contracts.
> 
> Their server always sent chunked content despite all
> headers. I'm pretty sure that there are other
> well-established servers that send chunked content
> despite the RFC.
> 
> I'm guessing that it might have something to do with
> wanting to control content compression. All the
> browsers can handle it, and that's probably all
> Apple is concerned with - even though they're
> overriding an RFC requirement.
> 
> Chris
> 
> --- Jérôme Charron <[EMAIL PROTECTED]> wrote:
> 
> > > http://www.apple.com for example answers with
> > > chunked content even if you request with an
> > > HTTP/1.0 header.
> > 
> > Stefan,
> > 
> > I don't see any "Transfer-Encoding: chunked" header
> > in responses from www.apple.com
> > Furthermore, we can read in HTTP/1.1 specification
> > that "A server MUST NOT send
> > transfer-codings to an HTTP/1.0 client".
> > 
> > Jérôme
> > 
> > --
> > http://motrech.free.fr/
> > http://www.frutch.org/
> 
> 



Re: http chunked content

2006-05-08 Thread Chris Fellows
> Furthermore, we can read in HTTP/1.1 specification
> that "A server MUST NOT send
> transfer-codings to an HTTP/1.0 client".

I once did a socket implementation against
Anonymizer. This is a well-established proxy service
that services $100K+ government and private contracts.

Their server always sent chunked content despite all
headers. I'm pretty sure that there are other
well-established servers that send chunked content
despite the RFC.

I'm guessing that it might have something to do with
wanting to control content compression. All the
browsers can handle it, and that's probably all Apple
is concerned with - even though they're overriding an
RFC requirement.

Chris

--- Jérôme Charron <[EMAIL PROTECTED]> wrote:

> > http://www.apple.com for example answers with
> > chunked content even if you request with an
> > HTTP/1.0 header.
> 
> 
> Stefan,
> 
> I don't see any "Transfer-Encoding: chunked" header
> in responses from
> www.apple.com
> > Furthermore, we can read in HTTP/1.1 specification
> > that "A server MUST NOT send
> > transfer-codings to an HTTP/1.0 client".
> 
> Jérôme
> 
> --
> http://motrech.free.fr/
> http://www.frutch.org/
> 



Re: Merging segments

2006-05-08 Thread Chris Fellows
That's great.

Well, my follow-up to that, then, is: 

Will the new tool allow any form of diffing segments?
In practice this would let you run a crawl on a series
of sites one week, run another crawl on the same sites
a week or so later, diff the segments, and let users
search on changes within the search domain.

--- Andrzej Bialecki <[EMAIL PROTECTED]> wrote:

> Chris Fellows wrote:
> > Hello,
> >
> > So the last discussion on merging segments was
> > back in Jan. Has there been any progress in this
> > direction? What would be the benefit of being able
> > to merge segments? Would being able to merge
> > segments open up new functionality options, or is
> > merging just a convenience? Also, what's the
> > estimate for how involved merge functionality
> > development is?
> >   
> 
> Relief is on the way. Fine folks at houxou.com have
> sponsored the 
> development of a brand-new SegmentMerger + slicer,
> and decided to donate 
> it to the project - big thanks!
> 
> I'm running some final tests, and will commit it
> today/tomorrow.
> 
> -- 
> Best regards,
> Andrzej Bialecki <><
> Information Retrieval, Semantic Web
> Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
> 
> 
> 



Merging segments

2006-05-05 Thread Chris Fellows
Hello,

So the last discussion on merging segments was back in
Jan. Has there been any progress in this direction?
What would be the benefit of being able to merge
segments? Would being able to merge segments open up
new functionality options, or is merging just a
convenience? Also, what's the estimate for how involved
merge functionality development is?

Regards,

- Chris



[jira] Commented: (NUTCH-134) Summarizer doesn't select the best snippets

2006-05-04 Thread Chris Fellows (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-134?page=comments#action_12377866 ] 

Chris Fellows commented on NUTCH-134:
-

Jerome,

Let me know if you could use a hand with the implementation. I'd like to get to 
know the Nutch and Lucene code bases better for my project. This looks like a 
good area to start in, so any opportunity to jump in would be great.

chris

> Summarizer doesn't select the best snippets
> ---
>
>  Key: NUTCH-134
>  URL: http://issues.apache.org/jira/browse/NUTCH-134
>  Project: Nutch
> Type: Bug

>   Components: searcher
> Versions: 0.7.2, 0.7.1, 0.7, 0.8-dev
> Reporter: Andrzej Bialecki 

>
> Summarizer.java tries to select the best fragments from the input text, where 
> the frequency of query terms is the highest. However, the logic in line 223 
> is flawed in that the excerptSet.add() operation will add new excerpts only 
> if they are not already present - the test is performed using the Comparator 
> that compares only the numUniqueTokens. This means that if there are two or 
> more excerpts, which score equally high, only the first of them will be 
> retained, and the rest of equally-scoring excerpts will be discarded, in 
> favor of other excerpts (possibly lower scoring).
> To fix this the Set should be replaced with a List + a sort operation. To 
> keep the relative position of excerpts in the original order the Excerpt 
> class should be extended with an "int order" field, and the collected 
> excerpts should be sorted in that order prior to adding them to the summary.
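
For what it's worth, a minimal sketch of the List + sort fix described above
(hypothetical class and field names taken from the description, not the actual
Summarizer code):

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for the real Excerpt, per the description above.
class Excerpt {
    int numUniqueTokens; // the score the Comparator was (mis)used for
    int order;           // position in the original text - the proposed new field
}

public class ExcerptSelector {
    // Collect all excerpts in a List so equal scores survive, sort by score
    // (best first), take the top few, then restore original document order.
    public static List select(List all, int max) {
        List sorted = new ArrayList(all);
        Collections.sort(sorted, new Comparator() {
            public int compare(Object a, Object b) {
                return ((Excerpt) b).numUniqueTokens - ((Excerpt) a).numUniqueTokens;
            }
        });
        List top = new ArrayList(sorted.subList(0, Math.min(max, sorted.size())));
        Collections.sort(top, new Comparator() {
            public int compare(Object a, Object b) {
                return ((Excerpt) a).order - ((Excerpt) b).order;
            }
        });
        return top;
    }
}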




[jira] Commented: (NUTCH-134) Summarizer doesn't select the best snippets

2006-05-03 Thread Chris Fellows (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-134?page=comments#action_12377654 ] 

Chris Fellows commented on NUTCH-134:
-

byron,

Did you ever get a chance to run a cpu perf test on using 
lucene/contrib/highlighter for extracting summaries?

chris

> Summarizer doesn't select the best snippets
> ---
>
>  Key: NUTCH-134
>  URL: http://issues.apache.org/jira/browse/NUTCH-134
>  Project: Nutch
> Type: Bug

>   Components: searcher
> Versions: 0.7.2, 0.7.1, 0.7, 0.8-dev
> Reporter: Andrzej Bialecki 

>
> Summarizer.java tries to select the best fragments from the input text, where 
> the frequency of query terms is the highest. However, the logic in line 223 
> is flawed in that the excerptSet.add() operation will add new excerpts only 
> if they are not already present - the test is performed using the Comparator 
> that compares only the numUniqueTokens. This means that if there are two or 
> more excerpts, which score equally high, only the first of them will be 
> retained, and the rest of equally-scoring excerpts will be discarded, in 
> favor of other excerpts (possibly lower scoring).
> To fix this the Set should be replaced with a List + a sort operation. To 
> keep the relative position of excerpts in the original order the Excerpt 
> class should be extended with an "int order" field, and the collected 
> excerpts should be sorted in that order prior to adding them to the summary.




[jira] Commented: (NUTCH-25) needs 'character encoding' detector

2006-04-26 Thread Chris Fellows (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-25?page=comments#action_12376611 ] 

Chris Fellows commented on NUTCH-25:


This was last updated May '05. Has this charset and language detection been 
integrated into Nutch yet? 

If not, at what point should the detection happen - fetching, parsing, etc.? If 
this hasn't been fixed, any leads on where to insert the detection would be 
helpful.

> needs 'character encoding' detector
> ---
>
>  Key: NUTCH-25
>  URL: http://issues.apache.org/jira/browse/NUTCH-25
>  Project: Nutch
> Type: Wish

> Reporter: Stefan Groschupf
> Priority: Trivial

>
> transferred from:
> http://sourceforge.net/tracker/index.php?func=detail&aid=995730&group_id=59548&atid=491356
> submitted by:
> Jungshik Shin
> this is a follow-up to bug 993380 (figure out 'charset'
> from the meta tag).
> Although we can cover a lot of ground using the 'C-T'
> header field in the HTTP header and the
> corresponding meta tag in HTML documents (and in case
> of XML, we have to use a similar but different
> 'parsing'), in the wild, there are a lot of documents
> without any information about the character encoding
> used. Browsers like Mozilla and search engines like
> Google use character encoding detectors to deal with
> these 'unlabelled' documents. 
> Mozilla's character encoding detector is GPL/MPL'd and
> we might be able to port it to Java. Unfortunately,
> it's not fool-proof. However, along with some other
> heuristics used by Mozilla and elsewhere, it should be
> possible to achieve a high detection rate. 
> The following page has links to some other related pages.
> http://trainedmonkey.com/week/2004/26
> In addition to the character encoding detection, we
> also need to detect the language of a document, which
> is even harder and should be a separate bug (although
> it's related).
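
As a possible starting point, a minimal sketch of detection using ICU4J's
CharsetDetector (just one candidate library, not something Nutch integrates
today; where to hook it in - fetcher vs. parser - is the open question above):

import com.ibm.icu.text.CharsetDetector;
import com.ibm.icu.text.CharsetMatch;

public class EncodingGuesser {
    // Guess the charset of raw fetched bytes when neither the HTTP header
    // nor a meta tag declares one. Returns null below a confidence cutoff.
    public static String guessCharset(byte[] content, int minConfidence) {
        CharsetDetector detector = new CharsetDetector();
        detector.setText(content);
        CharsetMatch match = detector.detect();
        if (match != null && match.getConfidence() >= minConfidence) {
            return match.getName(); // e.g. "UTF-8", "ISO-8859-1"
        }
        return null; // fall back to the configured default encoding
    }
}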




[jira] Commented: (NUTCH-18) Windows servers include illegal characters in URLs

2006-04-26 Thread Chris Fellows (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-18?page=comments#action_12376601 ] 

Chris Fellows commented on NUTCH-18:


So checking out other SEs: Google and Yahoo use decoded display URLs, i.e. 
en.wiktionary.org/wiki/ç, whereas AltaVista uses encoded URLs, i.e. 
en.wiktionary.org/wiki/%C3%A7.

I would say that human-readable, decoded URLs are the way to go, especially 
since Google and Yahoo both support this. It's a small item, but it's one that 
many users will experience.

The code that controls this is in search.jsp:

<%=Entities.encode(url)%>

I need the decoded forms for my project. If any contributors want the change, 
I'll submit the one-file patch for the decoded URLs.

If any contributors want the URL completely encoded per RFC 1738 for use in 
fetching and searching, then I can submit that patch as well. This last item is 
what I believe this bug was opened for in the first place, though after the 
research posted above, it doesn't look like it's required.
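
The display-side patch would amount to something like this (a sketch only,
with a hypothetical helper name; search.jsp's surrounding code is not shown):

import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

public class DisplayUrl {
    // Decode percent-escapes for display, assuming UTF-8 (which the
    // wiktionary example uses). Caveat: URLDecoder also turns '+' into a
    // space, which is wrong for URL paths, so a production version would
    // need its own decoder.
    public static String toDisplayForm(String url) {
        try {
            return URLDecoder.decode(url, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            return url; // UTF-8 is always available; keep the encoded form
        } catch (IllegalArgumentException e) {
            return url; // malformed escape sequence; keep the encoded form
        }
    }
}

// toDisplayForm("http://en.wiktionary.org/wiki/%C3%A7")
//   -> "http://en.wiktionary.org/wiki/ç"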

> Windows servers include illegal characters in URLs
> --
>
>  Key: NUTCH-18
>  URL: http://issues.apache.org/jira/browse/NUTCH-18
>  Project: Nutch
> Type: Bug

>   Components: fetcher
> Reporter: Stefan Groschupf
> Priority: Minor

>
> Transferred from:
> http://sourceforge.net/tracker/index.php?func=detail&aid=1110243&group_id=59548&atid=491356
> submitted by:
> Ken Meltsner
> While spidering our intranet, I found that IIS may include 
> illegal characters in URLs -- specifically, characters with 
> the high bit set to produce non-English letters. In 
> addition, both Firefox and IE will accept URLs with high-
> bit characters, but Java won't.
> While this may not be Nutch's (or Java's) fault, it would 
> help if high-bit characters (and other illegal characters) 
> in URLs could be escaped (using percent-hex notation) 
> as part of the URL fix-up process, probably right after 
> the hostname lower-case conversion.
> Example document name in Portuguese (with high-bit 
> characters) taken from a longer URL:
> Nota%20tecnica%20-%20Alteração%20de%
> 20escopo.doc
> and with percent-escaped characters:
> Nota%20tecnica%20-%20Altera%e7%e3o%20de%
> 20escopo.doc
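
If the escaping patch is wanted, the fix-up step could look roughly like this
(a sketch assuming bytes >= 0x80 and spaces are the characters to escape; the
real set of illegal characters is larger):

import java.io.UnsupportedEncodingException;

public class UrlFixup {
    // Percent-escape high-bit bytes (and spaces) in a URL, leaving plain
    // ASCII characters and existing %XX escapes untouched.
    public static String escapeHighBit(String url) {
        byte[] bytes;
        try {
            bytes = url.getBytes("UTF-8");
        } catch (UnsupportedEncodingException e) {
            return url; // UTF-8 is always available
        }
        StringBuffer sb = new StringBuffer();
        for (int i = 0; i < bytes.length; i++) {
            int c = bytes[i] & 0xff;
            if (c >= 0x80 || c == ' ') {
                sb.append('%');
                if (c < 0x10) sb.append('0');
                sb.append(Integer.toHexString(c).toUpperCase());
            } else {
                sb.append((char) c);
            }
        }
        return sb.toString();
    }
}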




[jira] Commented: (NUTCH-18) Windows servers include illegal characters in URLs

2006-04-26 Thread Chris Fellows (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-18?page=comments#action_12376554 ] 

Chris Fellows commented on NUTCH-18:


Was looking into the NUTCH-18 bug, which revolves
around illegal, non-ASCII characters in a URL. An
example of a high-bit character is 'ç', which maps to
bytes with the high bit set.

Before applying any fix, I did a brief test with the
0.8 trunk. After fetching and indexing
http://en.wiktionary.org/wiki/ç, I was able to search
on ç and got the following result in the browser:

ç - Wiktionary
... Letter [ edit ] Translingual [ edit ] Letter Ç , ç
C with a cedilla ... visit IRC or Wiktionary:AOL . ç
...
http://en.wiktionary.org/wiki/%C3%A7 (cached)
(explain) (anchors) (more from en.wiktionary.org) 

So it looks like it will fetch and parse URLs with
high-bit characters. Additionally, the display URL
has the ç encoded correctly as %C3%A7. 

Is this really a bug?

Doing a similar test on Google with the keywords "ç"
wiktionary yields:

ç - Wiktionary
AOL users can access Wiktionary through this link
after accepting the CACERT certificate. ... Ç, ç.
"tʃə", the fourth letter of the Albanian
alphabet. ...
en.wiktionary.org/wiki/ç - 15k - Cached - Similar
pages
[ More results from en.wiktionary.org ]

Nearly identical, but note that the ç is in its
decoded form, not %C3%A7.

I'd say that if anything, the bug is that the display URLs are in encoded form 
and not human-readable.

> Windows servers include illegal characters in URLs
> --
>
>  Key: NUTCH-18
>  URL: http://issues.apache.org/jira/browse/NUTCH-18
>  Project: Nutch
> Type: Bug

>   Components: fetcher
> Reporter: Stefan Groschupf
> Priority: Minor

>
> Transferred from:
> http://sourceforge.net/tracker/index.php?func=detail&aid=1110243&group_id=59548&atid=491356
> submitted by:
> Ken Meltsner
> While spidering our intranet, I found that IIS may include 
> illegal characters in URLs -- specifically, characters with 
> the high bit set to produce non-English letters. In 
> addition, both Firefox and IE will accept URLs with high-
> bit characters, but Java won't.
> While this may not be Nutch's (or Java's) fault, it would 
> help if high-bit characters (and other illegal characters) 
> in URLs could be escaped (using percent-hex notation) 
> as part of the URL fix-up process, probably right after 
> the hostname lower-case conversion.
> Example document name in Portuguese (with high-bit 
> characters) taken from a longer URL:
> Nota%20tecnica%20-%20Alteração%20de%
> 20escopo.doc
> and with percent-escaped characters:
> Nota%20tecnica%20-%20Altera%e7%e3o%20de%
> 20escopo.doc




Nutch-18 illegal chars in urls: Not sure what the problem is

2006-04-26 Thread Chris Fellows
Hello,

Was looking into the NUTCH-18 bug, which revolves
around illegal, non-ASCII characters in a URL. An
example of a high-bit character is 'ç', which maps to
bytes with the high bit set.

Before applying any fix, I did a brief test with the
0.8 trunk. After fetching and indexing
http://en.wiktionary.org/wiki/ç, I was able to search
on ç and got the following result in the browser:

ç - Wiktionary
... Letter [ edit ] Translingual [ edit ] Letter Ç , ç
C with a cedilla ... visit IRC or Wiktionary:AOL . ç
...
http://en.wiktionary.org/wiki/%C3%A7 (cached)
(explain) (anchors) (more from en.wiktionary.org) 

So it looks like it will fetch and parse URLs with
high-bit characters. Additionally, the display URL
has the ç encoded correctly as %C3%A7. 

Is this really a bug?

Doing a similar test on Google with the keywords "ç"
wiktionary yields:

ç - Wiktionary
AOL users can access Wiktionary through this link
after accepting the CACERT certificate. ... Ç, ç.
"tʃə", the fourth letter of the Albanian
alphabet. ...
en.wiktionary.org/wiki/ç - 15k - Cached - Similar
pages
[ More results from en.wiktionary.org ]

Nearly identical, but note that the ç is in its
decoded form, not %C3%A7.

If there's any interest in this issue, let me know.

Chris