[jira] Commented: (NUTCH-537) TestMP3Parser.java, TestRTFParser.java, TestMSWordParser.java compile

2007-08-08 Thread JIRA

[ 
https://issues.apache.org/jira/browse/NUTCH-537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12518346
 ] 

Doğacan Güney commented on NUTCH-537:
-

Plugins parse-mp3 and parse-rtf don't work anyway: they are not built by 
default and they don't compile without extra jars. What is the point of keeping 
them in the Nutch source? I think it makes more sense to remove them from the 
source and put them in http://wiki.apache.org/nutch/PluginCentral .

PS: I don't see TestMSWordParser in your patch, and it compiles and passes 
tests for me.

PPS: Please use svn diff to generate your patches.


 TestMP3Parser.java, TestRTFParser.java, TestMSWordParser.java compile
 -

 Key: NUTCH-537
 URL: https://issues.apache.org/jira/browse/NUTCH-537
 Project: Nutch
  Issue Type: Test
Reporter: Hasan Diwan
 Attachments: nutch-tests.pat


 Append .get(content.getUrl()) to the assignment parse = new 
 ParseUtil(conf).parseByExtensionId(foo, content), as the working 
 TestRSSParser.java does.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-537) TestMP3Parser.java, TestRTFParser.java, TestMSWordParser.java compile

2007-08-08 Thread Hasan Diwan (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hasan Diwan updated NUTCH-537:
--

Attachment: (was: nutch-tests.pat)

 TestMP3Parser.java, TestRTFParser.java, TestMSWordParser.java compile
 -

 Key: NUTCH-537
 URL: https://issues.apache.org/jira/browse/NUTCH-537
 Project: Nutch
  Issue Type: Test
Reporter: Hasan Diwan

 Append .get(content.getUrl()) to the assignment parse = new 
 ParseUtil(conf).parseByExtensionId(foo, content), as the working 
 TestRSSParser.java does.




[jira] Updated: (NUTCH-537) TestMP3Parser.java, TestRTFParser.java, TestMSWordParser.java compile

2007-08-08 Thread Hasan Diwan (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hasan Diwan updated NUTCH-537:
--

Attachment: nutch-tests.pat

 TestMP3Parser.java, TestRTFParser.java, TestMSWordParser.java compile
 -

 Key: NUTCH-537
 URL: https://issues.apache.org/jira/browse/NUTCH-537
 Project: Nutch
  Issue Type: Test
Reporter: Hasan Diwan
 Attachments: nutch-tests.pat


 Append .get(content.getUrl()) to the assignment parse = new 
 ParseUtil(conf).parseByExtensionId(foo, content), as the working 
 TestRSSParser.java does.




[jira] Updated: (NUTCH-522) Use URLValidator in the Injector

2007-08-08 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/NUTCH-522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doğacan Güney updated NUTCH-522:


Attachment: NUTCH_522_v4.patch

This is the patch that I am going to commit (very soon) if there are no 
objections.

* Run the normalizers first, then run the validator.

* Update TestInjector to use valid URLs.

* Remove UrlValidator's main method.
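
The ordering in the first bullet can be illustrated with a minimal sketch. The class below is a hypothetical stand-in, not the Nutch Injector, its normalizer plugins, or its UrlValidator: normalize() is assumed to only trim whitespace and lowercase the scheme, and isValid() is assumed to accept only parseable http/https URLs that have a host. Under those assumptions it shows why validating the raw seed would reject a URL that normalization repairs.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical stand-in for illustration only; not the Nutch Injector code.
class InjectOrderSketch {

    // Assumed normalization: trim whitespace and lowercase the scheme.
    static String normalize(String url) {
        String trimmed = url.trim();
        int colon = trimmed.indexOf(':');
        if (colon <= 0) {
            return trimmed; // no scheme to lowercase
        }
        return trimmed.substring(0, colon).toLowerCase(Locale.ROOT)
            + trimmed.substring(colon);
    }

    // Assumed validation: a parseable http/https URL that has a host.
    static boolean isValid(String url) {
        try {
            URI uri = new URI(url);
            String scheme = uri.getScheme();
            return ("http".equals(scheme) || "https".equals(scheme))
                && uri.getHost() != null;
        } catch (URISyntaxException e) {
            return false;
        }
    }

    // Normalize first, then validate the normalized form.
    static List<String> inject(List<String> seeds) {
        List<String> accepted = new ArrayList<>();
        for (String seed : seeds) {
            String normalized = normalize(seed);
            if (isValid(normalized)) {
                accepted.add(normalized);
            }
        }
        return accepted;
    }

    public static void main(String[] args) {
        List<String> seeds = Arrays.asList(" HTTP://example.com/a ", "not a url");
        System.out.println(inject(seeds));
    }
}
```

With the order reversed, the first seed would fail validation because of its surrounding whitespace and uppercase scheme, even though its normalized form is acceptable.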

 Use URLValidator in the Injector
 

 Key: NUTCH-522
 URL: https://issues.apache.org/jira/browse/NUTCH-522
 Project: Nutch
  Issue Type: Improvement
  Components: injector
Reporter: Emmanuel Joke
Assignee: Emmanuel Joke
Priority: Minor
 Fix For: 1.0.0

 Attachments: NUTCH-522.patch, NUTCH-522_v2.patch, NUTCH-522_v3.patch, 
 NUTCH_522_v4.patch


 As in NUTCH-505, we should use the UrlValidator to check URLs in the Injector.




[jira] Resolved: (NUTCH-535) ParseData's contentMeta accumulates unnecessary values during parse

2007-08-08 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/NUTCH-535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doğacan Güney resolved NUTCH-535.
-

Resolution: Fixed

Fixed in rev. 563777.


 ParseData's contentMeta accumulates unnecessary values during parse
 ---

 Key: NUTCH-535
 URL: https://issues.apache.org/jira/browse/NUTCH-535
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.0.0
Reporter: Doğacan Güney
Assignee: Doğacan Güney
 Fix For: 1.0.0

 Attachments: NUTCH-535.patch, NUTCH_535_v2.patch


 After NUTCH-506, if you run parse on a segment, parseData's contentMeta 
 accumulates the metadata of every content parsed so far. This is because 
 NUTCH-506 changed the constructor to create a new metadata object (before 
 NUTCH-506, a new metadata object was created on every call to readFields). It 
 seems Hadoop somehow caches the Content instance, so each new call to 
 Content.readFields during ParseSegment increases the size of the metadata. 
 Because of this, one can end up with a *huge* parse_data directory (something 
 like 10 times larger than the content directory).
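
The accumulation described above can be reproduced with a minimal sketch. ReusedRecordSketch is a hypothetical stand-in, not the real Content class or Hadoop's Writable machinery: it only assumes that one instance is reused for several readFields calls and that the metadata map is created once, in the constructor. The clearOnRead flag shows one possible remedy (resetting the map on every read); that is an assumption for illustration, and the committed patch may do something different.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a Writable-style record; not the Nutch Content class.
class ReusedRecordSketch {
    // Created once, in the constructor, as the issue describes after NUTCH-506.
    private final Map<String, String> metadata = new HashMap<>();
    private final boolean clearOnRead;

    ReusedRecordSketch(boolean clearOnRead) {
        this.clearOnRead = clearOnRead;
    }

    // Deserializes one record into this (possibly reused) instance.
    void readFields(DataInput in) throws IOException {
        if (clearOnRead) {
            metadata.clear(); // drop whatever the previous record left behind
        }
        int entries = in.readInt();
        for (int i = 0; i < entries; i++) {
            metadata.put(in.readUTF(), in.readUTF());
        }
    }

    // Serializes a single key/value pair as one record.
    static byte[] record(String key, String value) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeInt(1);
        out.writeUTF(key);
        out.writeUTF(value);
        out.flush();
        return bytes.toByteArray();
    }

    // Reads `count` distinct records into ONE instance; returns the final map size.
    static int sizeAfterReading(int count, boolean clearOnRead) {
        try {
            ReusedRecordSketch rec = new ReusedRecordSketch(clearOnRead);
            for (int i = 0; i < count; i++) {
                byte[] data = record("key" + i, "value" + i);
                rec.readFields(new DataInputStream(new ByteArrayInputStream(data)));
            }
            return rec.metadata.size();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("without reset: " + sizeAfterReading(3, false));
        System.out.println("with reset:    " + sizeAfterReading(3, true));
    }
}
```

Without the reset, the third record carries the keys of the first two as well, which is the kind of growth that would inflate parse_data.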




[jira] Closed: (NUTCH-535) ParseData's contentMeta accumulates unnecessary values during parse

2007-08-08 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/NUTCH-535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doğacan Güney closed NUTCH-535.
---


Resolved and committed.

 ParseData's contentMeta accumulates unnecessary values during parse
 ---

 Key: NUTCH-535
 URL: https://issues.apache.org/jira/browse/NUTCH-535
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.0.0
Reporter: Doğacan Güney
Assignee: Doğacan Güney
 Fix For: 1.0.0

 Attachments: NUTCH-535.patch, NUTCH_535_v2.patch


 After NUTCH-506, if you run parse on a segment, parseData's contentMeta 
 accumulates the metadata of every content parsed so far. This is because 
 NUTCH-506 changed the constructor to create a new metadata object (before 
 NUTCH-506, a new metadata object was created on every call to readFields). It 
 seems Hadoop somehow caches the Content instance, so each new call to 
 Content.readFields during ParseSegment increases the size of the metadata. 
 Because of this, one can end up with a *huge* parse_data directory (something 
 like 10 times larger than the content directory).




[jira] Updated: (NUTCH-536) Reduce number of warnings in nutch core

2007-08-08 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/NUTCH-536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doğacan Güney updated NUTCH-536:


Attachment: NUTCH-536_v2.patch

New patch.

* Remove unused variables/labels/methods in o.a.n.analysis.Nutch* packages.

 Reduce number of warnings in nutch core
 ---

 Key: NUTCH-536
 URL: https://issues.apache.org/jira/browse/NUTCH-536
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.0.0
Reporter: Doğacan Güney
Assignee: Doğacan Güney
Priority: Minor
 Fix For: 1.0.0

 Attachments: NUTCH-536_v2.patch, NUTCH_536.patch


 Nutch core (code under src/java) gives around 600 warnings. Most of them are 
 unused variables/imports and Java5 generics style warnings. This issue is to 
 track changes to reduce number of warnings.




[jira] Updated: (NUTCH-538) Delete unused classes under o.a.n.util

2007-08-08 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/NUTCH-538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doğacan Güney updated NUTCH-538:


Attachment: delete_unused.patch

Trivial patch to delete ThreadPool and FibonacciHeap.

 Delete unused classes under o.a.n.util
 --

 Key: NUTCH-538
 URL: https://issues.apache.org/jira/browse/NUTCH-538
 Project: Nutch
  Issue Type: Improvement
Reporter: Doğacan Güney
Priority: Trivial
 Fix For: 1.0.0

 Attachments: delete_unused.patch


 ThreadPool and FibonacciHeap are not used by anything else in the Nutch source 
 so, IMHO, there is no point in keeping them. Java 5 has really nice 
 alternatives to ThreadPool under java.util.concurrent. I don't know if 
 FibonacciHeap is faster than Java's priority queue, but we don't make huge 
 priority queues anyway, so even if it is faster it is probably not noticeable.
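
As a rough illustration of the standard-library alternatives mentioned above, here is a minimal sketch using java.util.concurrent.ExecutorService in place of a hand-written thread pool and java.util.PriorityQueue in place of a custom heap. The class and method names are hypothetical and are not taken from the Nutch source or from the attached patch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch; shows JDK classes only, nothing from o.a.n.util.
class ConcurrentAlternativesSketch {

    // Builds tasks that compute 1*1, 2*2, ..., count*count.
    static List<Callable<Integer>> squares(int count) {
        List<Callable<Integer>> tasks = new ArrayList<>();
        for (int i = 1; i <= count; i++) {
            final int n = i;
            tasks.add(() -> n * n);
        }
        return tasks;
    }

    // Runs every task on a fixed-size pool and sums the results.
    static int sumOnPool(List<Callable<Integer>> tasks, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            int sum = 0;
            for (Future<Integer> result : pool.invokeAll(tasks)) {
                sum += result.get();
            }
            return sum;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown(); // let the worker threads exit
        }
    }

    // Returns the smallest value via a binary-heap priority queue.
    // Requires at least one value.
    static int smallest(int... values) {
        PriorityQueue<Integer> queue = new PriorityQueue<>();
        for (int v : values) {
            queue.add(v);
        }
        return queue.poll();
    }

    public static void main(String[] args) {
        System.out.println(sumOnPool(squares(4), 2));
        System.out.println(smallest(5, 2, 9));
    }
}
```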




[jira] Created: (NUTCH-538) Delete unused classes under o.a.n.util

2007-08-08 Thread JIRA
Delete unused classes under o.a.n.util
--

 Key: NUTCH-538
 URL: https://issues.apache.org/jira/browse/NUTCH-538
 Project: Nutch
  Issue Type: Improvement
Reporter: Doğacan Güney
Priority: Trivial
 Fix For: 1.0.0
 Attachments: delete_unused.patch

ThreadPool and FibonacciHeap are not used by anything else in the Nutch source 
so, IMHO, there is no point in keeping them. Java 5 has really nice 
alternatives to ThreadPool under java.util.concurrent. I don't know if 
FibonacciHeap is faster than Java's priority queue, but we don't make huge 
priority queues anyway, so even if it is faster it is probably not noticeable.





[jira] Closed: (NUTCH-522) Use URLValidator in the Injector

2007-08-08 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/NUTCH-522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doğacan Güney closed NUTCH-522.
---


 Use URLValidator in the Injector
 

 Key: NUTCH-522
 URL: https://issues.apache.org/jira/browse/NUTCH-522
 Project: Nutch
  Issue Type: Improvement
  Components: injector
Reporter: Emmanuel Joke
Assignee: Emmanuel Joke
Priority: Minor
 Fix For: 1.0.0

 Attachments: NUTCH-522.patch, NUTCH-522_v2.patch, NUTCH-522_v3.patch, 
 NUTCH_522_v4.patch


 As in NUTCH-505, we should use the UrlValidator to check URLs in the Injector.




[jira] Updated: (NUTCH-25) needs 'character encoding' detector

2007-08-08 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doğacan Güney updated NUTCH-25:
---

Attachment: NUTCH-25_v3.patch

Here is a new version.

* Code style cleanups (use '} else {' instead of else on next line).

* Add confidences from different clues pointing to same encoding.

* Check if encoding passes the threshold.

* Add a utility getThreshold(String charset) method that returns the
  global threshold if charset-specific threshold is unavailable.

* Make clues a ThreadLocal variable for thread-safety.
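
The bullets above can be sketched roughly as follows. This is a hypothetical illustration, not the code in NUTCH-25_v3.patch: the class name, the integer confidence scale, the threshold values, and the choice of ">=" for "passes the threshold" are all assumptions. Only the ideas come from the list: confidences for the same encoding are summed, the total is checked against a threshold, getThreshold falls back to a global value, and the clue list is thread-local.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch; names and numbers are illustrative, not from the patch.
class EncodingCluesSketch {
    private static final int GLOBAL_THRESHOLD = 50; // assumed value
    private static final Map<String, Integer> CHARSET_THRESHOLDS = new HashMap<>();
    static {
        CHARSET_THRESHOLDS.put("utf-8", 30); // assumed charset-specific override
    }

    private static final class Clue {
        final String charset;
        final int confidence;

        Clue(String charset, int confidence) {
            this.charset = charset;
            this.confidence = confidence;
        }
    }

    // One clue list per thread, so concurrent parses do not share state.
    private static final ThreadLocal<List<Clue>> CLUES =
        ThreadLocal.withInitial(ArrayList::new);

    static void addClue(String charset, int confidence) {
        CLUES.get().add(new Clue(charset.toLowerCase(Locale.ROOT), confidence));
    }

    static void clearClues() {
        CLUES.get().clear();
    }

    // Charset-specific threshold if one is configured, else the global one.
    static int getThreshold(String charset) {
        Integer specific = CHARSET_THRESHOLDS.get(charset.toLowerCase(Locale.ROOT));
        return specific != null ? specific : GLOBAL_THRESHOLD;
    }

    // Sums confidences per charset and returns the highest-scoring charset
    // that reaches its threshold, or the fallback if none does.
    static String guess(String fallback) {
        Map<String, Integer> totals = new HashMap<>();
        for (Clue clue : CLUES.get()) {
            totals.merge(clue.charset, clue.confidence, Integer::sum);
        }
        String best = fallback;
        int bestScore = -1;
        for (Map.Entry<String, Integer> entry : totals.entrySet()) {
            int score = entry.getValue();
            if (score >= getThreshold(entry.getKey()) && score > bestScore) {
                best = entry.getKey();
                bestScore = score;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        addClue("ISO-8859-1", 30);   // e.g. from an HTTP header
        addClue("iso-8859-1", 25);   // e.g. from a meta tag
        addClue("windows-1252", 40); // e.g. from a detector
        System.out.println(guess("utf-8"));
    }
}
```

In this sketch neither ISO-8859-1 clue would pass the assumed global threshold alone, but their sum does, while the single windows-1252 clue does not.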


 needs 'character encoding' detector
 ---

 Key: NUTCH-25
 URL: https://issues.apache.org/jira/browse/NUTCH-25
 Project: Nutch
  Issue Type: New Feature
Reporter: Stefan Groschupf
Assignee: Doğacan Güney
 Fix For: 1.0.0

 Attachments: EncodingDetector.java, EncodingDetector_additive.java, 
 NUTCH-25.patch, NUTCH-25_draft.patch, NUTCH-25_v2.patch, NUTCH-25_v3.patch, 
 patch


 transferred from:
 http://sourceforge.net/tracker/index.php?func=detail&aid=995730&group_id=59548&atid=491356
 submitted by:
 Jungshik Shin
 this is a follow-up to bug 993380 (figure out 'charset'
 from the meta tag).
 Although we can cover a lot of ground using the 'C-T'
 header field in the HTTP header and the
 corresponding meta tag in html documents (and in case
 of XML, we have to use a similar but a different
 'parsing'), in the wild, there are a lot of documents
 without any information about the character encoding
 used. Browsers like Mozilla and search engines like
 Google use character encoding detectors to deal with
 these 'unlabelled' documents. 
 Mozilla's character encoding detector is GPL/MPL'd and
 we might be able to port it to Java. Unfortunately,
 it's not fool-proof. However, along with some other
 heuristics used by Mozilla and elsewhere, it'll be
 possible to achieve a high detection rate. 
 The following page has links to some other related pages.
 http://trainedmonkey.com/week/2004/26
 In addition to the character encoding detection, we
 also need to detect the language of a document, which
 is even harder and should be a separate bug (although
 it's related).




[jira] Resolved: (NUTCH-536) Reduce number of warnings in nutch core

2007-08-08 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/NUTCH-536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doğacan Güney resolved NUTCH-536.
-

Resolution: Fixed

Committed in rev. 563894.


 Reduce number of warnings in nutch core
 ---

 Key: NUTCH-536
 URL: https://issues.apache.org/jira/browse/NUTCH-536
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.0.0
Reporter: Doğacan Güney
Assignee: Doğacan Güney
Priority: Minor
 Fix For: 1.0.0

 Attachments: NUTCH-536_v2.patch, NUTCH_536.patch


 Nutch core (code under src/java) gives around 600 warnings. Most of them are 
 unused variables/imports and Java5 generics style warnings. This issue is to 
 track changes to reduce number of warnings.




[jira] Created: (NUTCH-539) HttpClient plugin does not work with BasicAuthentication

2007-08-08 Thread Ravi Chintakunta (JIRA)
HttpClient plugin does not work with BasicAuthentication


 Key: NUTCH-539
 URL: https://issues.apache.org/jira/browse/NUTCH-539
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 0.8
Reporter: Ravi Chintakunta
Priority: Minor


For Nutch to fetch pages with basic authentication, the HttpClient should be 
configured with the username and password credentials. 

For this to work:

1. Add the username and password credentials to nutch-site.xml as below:

<property>
  <name>http.auth.basic.username</name>
  <value>myusername</value>
  <description>
    username for http basic auth
  </description>
</property>

<property>
  <name>http.auth.basic.password</name>
  <value>mypassword</value>
  <description>
    password for http basic auth
  </description>
</property>

2. Configure httpclient with these credentials by applying the attached patch 
to 
nutch/src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/Http.java







Is there any chance that my patches will be considered?

2007-08-08 Thread Marcin Okraszewski
Hello Nutch Developers,
On May 22 I contributed two patches, NUTCH-487 and NUTCH-490. The next 
release is probably coming soon. I would be really pleased if they were merged 
by then, so that I don't have to patch it myself. Is there any chance they will 
be merged into the source code? I can port them to the current head, but so far 
nobody has asked for this. 

I would also like to draw your attention to one thing. It has already been two 
and a half months since I added the patches, and there is not even a single 
comment on them. This is really discouraging for me as a contributor. I know 
that merging patches is not something developers love to do, but you are the 
only ones who can do it. Of course I don't mean you should say thanks for every 
contribution, but please take it into account. Having one's work ignored, and 
that is how it looks to me, really discourages further work. Reviewing a patch 
and saying you won't merge it for some reason would be much better than leaving 
it without a single comment. This may reduce your active community.

Think of this.

Best regards,
Marcin Okraszewski


Re: Is there any chance that my patches will be considered?

2007-08-08 Thread Doğacan Güney
On 8/8/07, Marcin Okraszewski [EMAIL PROTECTED] wrote:
 Hello Nutch Developers,
 On May 22 I contributed two patches, NUTCH-487 and NUTCH-490. The next 
 release is probably coming soon. I would be really pleased if they were merged 
 by then, so that I don't have to patch it myself. Is there any chance they 
 will be merged into the source code? I can port them to the current head, but 
 so far nobody has asked for this.

NUTCH-487 is mostly a duplicate of NUTCH-369. I merged patches from
both issues in NUTCH-25 since NUTCH-25 needs those patches to work
correctly (and I gave you and Renaud Richardet credit in comments).

As for NUTCH-490, I haven't taken an in-depth look at it, but I don't
see the point of it. Why not just use HtmlParseFilters since you have
access to the DOM object? What advantage do neko filters have? Also,
having an extension point for a library possibly used by a possibly
used plugin looks really, really wrong from a design point of view.


 I would also like to draw your attention to one thing. It has already been 
 two and a half months since I added the patches, and there is not even a 
 single comment on them. This is really discouraging for me as a contributor. 
 I know that merging patches is not something developers love to do, but you 
 are the only ones who can do it. Of course I don't mean you should say thanks 
 for every contribution, but please take it into account. Having one's work 
 ignored, and that is how it looks to me, really discourages further work. 
 Reviewing a patch and saying you won't merge it for some reason would be much 
 better than leaving it without a single comment. This may reduce your active 
 community.

 Think of this.

 Best regards,
 Marcin Okraszewski



-- 
Doğacan Güney


Re: Re: Is there any chance that my patches will be considered?

2007-08-08 Thread Marcin Okraszewski
Thanks for a quick answer.

 As for NUTCH-490, I haven't taken an in-depth look at it, but I don't
 see the point of it. Why not just use HtmlParseFilters since you have
 access to the DOM object? What advantage do neko filters have? Also,
 having an extension point for a library possibly used by a possibly
 used plugin looks really, really wrong from a design point of view.

In my case I want to achieve two things:
1. Ensure there is always TBODY element.
2. Drop all SELECT elements (I don't want it to be in 

You are right, I could manipulate the DOM for this. But filters seem to be a 
less costly operation, which is why I took this approach first. I haven't done 
any tests, though; maybe I'm too concerned and it doesn't matter that much.
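
For comparison, the DOM-manipulation route could look roughly like the sketch below. It is a hypothetical illustration using only the JDK's XML parser and org.w3c.dom on well-formed, lowercase markup; it does not use NekoHTML or the HtmlParseFilter extension point, and a real HTML DOM may report element names in a different case.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical sketch; assumes well-formed markup with lowercase element names.
class DomCleanupSketch {

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Removes every select element, together with its children.
    static void dropSelects(Document doc) {
        NodeList live = doc.getElementsByTagName("select");
        List<Node> snapshot = new ArrayList<>(); // the NodeList is live, so copy first
        for (int i = 0; i < live.getLength(); i++) {
            snapshot.add(live.item(i));
        }
        for (Node select : snapshot) {
            if (select.getParentNode() != null) {
                select.getParentNode().removeChild(select);
            }
        }
    }

    // Wraps the direct tr children of each table in a tbody if none exists.
    static void ensureTbody(Document doc) {
        NodeList tables = doc.getElementsByTagName("table");
        for (int i = 0; i < tables.getLength(); i++) {
            Element table = (Element) tables.item(i);
            List<Node> rows = new ArrayList<>();
            boolean hasTbody = false;
            for (Node c = table.getFirstChild(); c != null; c = c.getNextSibling()) {
                if ("tbody".equals(c.getNodeName())) {
                    hasTbody = true;
                } else if ("tr".equals(c.getNodeName())) {
                    rows.add(c);
                }
            }
            if (hasTbody) {
                continue;
            }
            Element tbody = doc.createElement("tbody");
            for (Node row : rows) {
                tbody.appendChild(row); // appendChild moves the row out of the table
            }
            table.appendChild(tbody);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<html><body><table><tr><td>x</td></tr></table>"
            + "<select><option>a</option></select></body></html>");
        dropSelects(doc);
        ensureTbody(doc);
        System.out.println(doc.getElementsByTagName("select").getLength());
        System.out.println(doc.getElementsByTagName("tbody").getLength());
    }
}
```

Whether this is measurably slower than doing the same work in a parser filter is exactly the question that would need testing.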

Or perhaps my suggestion to make an extension point for the parser is the best 
one? Then if you want to modify parsing itself, you can do whatever you want. 
You also wouldn't need any switch for the parser implementation, as there is 
now; you would simply modify the plugin inclusion. I could do it, if you find 
it a good idea.

Anyway, as I said, even if you find that none of this makes much sense, just 
close the issue with this comment. I really prefer that over a hanging request, 
because then I know I should think of a different solution for my case.

Thanks,
Marcin Okraszewski