[jira] Commented: (NUTCH-537) TestMP3Parser.java, TestRTFParser.java, TestMSWordParser.java compile
[ https://issues.apache.org/jira/browse/NUTCH-537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12518346 ]

Doğacan Güney commented on NUTCH-537:

Plugins parse-mp3 and parse-rtf don't work anyway, they are not built by default and they don't compile without extra jars. What is the point of keeping them in nutch source? I think it makes more sense to remove them from source and put them in http://wiki.apache.org/nutch/PluginCentral .

PS: I don't see TestMSWordParser in your patch, and it compiles and passes tests for me.
PPS: Please use svn diff to generate your patches.

TestMP3Parser.java, TestRTFParser.java, TestMSWordParser.java compile
  Key: NUTCH-537
  URL: https://issues.apache.org/jira/browse/NUTCH-537
  Project: Nutch
  Issue Type: Test
  Reporter: Hasan Diwan
  Attachments: nutch-tests.pat

add .get(content.getUrl()); to parse = new ParseUtil(conf).parseByExtensionId(foo, content), per the working TestRSSParser.java

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-537) TestMP3Parser.java, TestRTFParser.java, TestMSWordParser.java compile
[ https://issues.apache.org/jira/browse/NUTCH-537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hasan Diwan updated NUTCH-537:
  Attachment: (was: nutch-tests.pat)

TestMP3Parser.java, TestRTFParser.java, TestMSWordParser.java compile
  Key: NUTCH-537
  URL: https://issues.apache.org/jira/browse/NUTCH-537
  Project: Nutch
  Issue Type: Test
  Reporter: Hasan Diwan

add .get(content.getUrl()); to parse = new ParseUtil(conf).parseByExtensionId(foo, content), per the working TestRSSParser.java
[jira] Updated: (NUTCH-537) TestMP3Parser.java, TestRTFParser.java, TestMSWordParser.java compile
[ https://issues.apache.org/jira/browse/NUTCH-537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hasan Diwan updated NUTCH-537:
  Attachment: nutch-tests.pat

TestMP3Parser.java, TestRTFParser.java, TestMSWordParser.java compile
  Key: NUTCH-537
  URL: https://issues.apache.org/jira/browse/NUTCH-537
  Project: Nutch
  Issue Type: Test
  Reporter: Hasan Diwan
  Attachments: nutch-tests.pat

add .get(content.getUrl()); to parse = new ParseUtil(conf).parseByExtensionId(foo, content), per the working TestRSSParser.java
[jira] Updated: (NUTCH-522) Use URLValidator in the Injector
[ https://issues.apache.org/jira/browse/NUTCH-522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doğacan Güney updated NUTCH-522:
  Attachment: NUTCH_522_v4.patch

This is the patch that I am going to commit (very soon) if there are no objections.
* Run normalizers first then run validator.
* Update TestInjector to use valid urls.
* Removed UrlValidator's main method.

Use URLValidator in the Injector
  Key: NUTCH-522
  URL: https://issues.apache.org/jira/browse/NUTCH-522
  Project: Nutch
  Issue Type: Improvement
  Components: injector
  Reporter: Emmanuel Joke
  Assignee: Emmanuel Joke
  Priority: Minor
  Fix For: 1.0.0
  Attachments: NUTCH-522.patch, NUTCH-522_v2.patch, NUTCH-522_v3.patch, NUTCH_522_v4.patch

Same as NUTCH-505, we should use the UrlValidator to check url in the Injector
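The ordering in the first bullet (normalize, then validate) can be illustrated with a small stand-alone sketch. This is not the code from NUTCH_522_v4.patch: the class name, the `normalize` and `isValid` helpers, and the rules they apply are all assumptions made up for illustration, using only `java.net.URI` from the JDK.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch: none of these names come from Nutch.
class InjectOrderSketch {

    // Stand-in normalizer: trims whitespace and lower-cases scheme and host.
    static String normalize(String url) {
        String trimmed = url.trim();
        try {
            URI uri = new URI(trimmed);
            if (uri.getScheme() == null || uri.getHost() == null) {
                return trimmed; // nothing to normalize; the validator decides
            }
            return new URI(uri.getScheme().toLowerCase(Locale.ROOT), uri.getUserInfo(),
                    uri.getHost().toLowerCase(Locale.ROOT), uri.getPort(),
                    uri.getPath(), uri.getQuery(), uri.getFragment()).toString();
        } catch (URISyntaxException e) {
            return trimmed; // unparsable input is left for the validator to reject
        }
    }

    // Stand-in validator: accepts only http/https URLs that have a host.
    static boolean isValid(String url) {
        try {
            URI uri = new URI(url);
            String scheme = uri.getScheme();
            return uri.getHost() != null
                    && ("http".equals(scheme) || "https".equals(scheme));
        } catch (URISyntaxException e) {
            return false;
        }
    }

    // Normalize first, then validate the normalized form.
    static List<String> inject(List<String> seeds) {
        List<String> accepted = new ArrayList<>();
        for (String seed : seeds) {
            String normalized = normalize(seed);
            if (isValid(normalized)) {
                accepted.add(normalized);
            }
        }
        return accepted;
    }

    public static void main(String[] args) {
        System.out.println(inject(Arrays.asList(" HTTP://Example.COM/a ", "not a url")));
    }
}
```

With these made-up rules, the seed " HTTP://Example.COM/a " would be rejected if it were validated as-is, but is accepted once normalized, which is one plausible reason for running the normalizers first.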
[jira] Resolved: (NUTCH-535) ParseData's contentMeta accumulates unnecessary values during parse
[ https://issues.apache.org/jira/browse/NUTCH-535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doğacan Güney resolved NUTCH-535.
  Resolution: Fixed

Fixed in rev. 563777.

ParseData's contentMeta accumulates unnecessary values during parse
  Key: NUTCH-535
  URL: https://issues.apache.org/jira/browse/NUTCH-535
  Project: Nutch
  Issue Type: Bug
  Affects Versions: 1.0.0
  Reporter: Doğacan Güney
  Assignee: Doğacan Güney
  Fix For: 1.0.0
  Attachments: NUTCH-535.patch, NUTCH_535_v2.patch

After NUTCH-506, if you run parse on a segment, ParseData's contentMeta accumulates the metadata of every content parsed so far. This is because NUTCH-506 changed the constructor to create a new metadata object (before NUTCH-506, a new one was created on every call to readFields). It seems Hadoop somehow caches the Content instance, so each new call to Content.readFields during ParseSegment increases the size of the metadata. Because of this, one can end up with a *huge* parse_data directory (something like 10 times larger than the content directory).
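The failure mode described above (one instance reused across reads, with state allocated only in the constructor) can be reproduced in a few lines. The sketch below is a hypothetical stand-in, not Nutch's Content class and not necessarily how rev. 563777 fixed it; `Record`, `readFieldsLeaky` and `readFieldsClean` are invented names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the accumulation problem; not Nutch code.
class ReusedRecordSketch {

    // Stand-in for a record that a framework reuses for every read.
    static class Record {
        // Allocated once, in the constructor, as described in the issue.
        private final List<String> metadata = new ArrayList<>();

        // Leaky variant: appends to whatever the previous read left behind.
        void readFieldsLeaky(List<String> incoming) {
            metadata.addAll(incoming);
        }

        // Clean variant: resets the state before loading the new values.
        void readFieldsClean(List<String> incoming) {
            metadata.clear();
            metadata.addAll(incoming);
        }

        int size() {
            return metadata.size();
        }
    }

    // Reads `reads` records of two entries each into ONE reused instance.
    static int sizeAfterReads(boolean leaky, int reads) {
        Record reused = new Record();
        for (int i = 0; i < reads; i++) {
            List<String> incoming =
                    Arrays.asList("Content-Type=text/html", "Content-Length=" + i);
            if (leaky) {
                reused.readFieldsLeaky(incoming);
            } else {
                reused.readFieldsClean(incoming);
            }
        }
        return reused.size();
    }

    public static void main(String[] args) {
        System.out.println("leaky: " + sizeAfterReads(true, 5));  // prints "leaky: 10"
        System.out.println("clean: " + sizeAfterReads(false, 5)); // prints "clean: 2"
    }
}
```

The leaky variant grows linearly with the number of records read through the same instance, which matches the kind of growth the issue reports for parse_data.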
[jira] Closed: (NUTCH-535) ParseData's contentMeta accumulates unnecessary values during parse
[ https://issues.apache.org/jira/browse/NUTCH-535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doğacan Güney closed NUTCH-535.

Resolved and committed.

ParseData's contentMeta accumulates unnecessary values during parse
  Key: NUTCH-535
  URL: https://issues.apache.org/jira/browse/NUTCH-535
  Project: Nutch
  Issue Type: Bug
  Affects Versions: 1.0.0
  Reporter: Doğacan Güney
  Assignee: Doğacan Güney
  Fix For: 1.0.0
  Attachments: NUTCH-535.patch, NUTCH_535_v2.patch

After NUTCH-506, if you run parse on a segment, ParseData's contentMeta accumulates the metadata of every content parsed so far. This is because NUTCH-506 changed the constructor to create a new metadata object (before NUTCH-506, a new one was created on every call to readFields). It seems Hadoop somehow caches the Content instance, so each new call to Content.readFields during ParseSegment increases the size of the metadata. Because of this, one can end up with a *huge* parse_data directory (something like 10 times larger than the content directory).
[jira] Updated: (NUTCH-536) Reduce number of warnings in nutch core
[ https://issues.apache.org/jira/browse/NUTCH-536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doğacan Güney updated NUTCH-536:
  Attachment: NUTCH-536_v2.patch

New patch.
* Remove unused variables/labels/methods in o.a.n.analysis.Nutch* packages.

Reduce number of warnings in nutch core
  Key: NUTCH-536
  URL: https://issues.apache.org/jira/browse/NUTCH-536
  Project: Nutch
  Issue Type: Improvement
  Affects Versions: 1.0.0
  Reporter: Doğacan Güney
  Assignee: Doğacan Güney
  Priority: Minor
  Fix For: 1.0.0
  Attachments: NUTCH-536_v2.patch, NUTCH_536.patch

Nutch core (code under src/java) gives around 600 warnings. Most of them are unused variables/imports and Java5 generics style warnings. This issue is to track changes to reduce number of warnings.
[jira] Updated: (NUTCH-538) Delete unused classes under o.a.n.util
[ https://issues.apache.org/jira/browse/NUTCH-538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doğacan Güney updated NUTCH-538:
  Attachment: delete_unused.patch

Trivial patch to delete ThreadPool and FibonacciHeap.

Delete unused classes under o.a.n.util
  Key: NUTCH-538
  URL: https://issues.apache.org/jira/browse/NUTCH-538
  Project: Nutch
  Issue Type: Improvement
  Reporter: Doğacan Güney
  Priority: Trivial
  Fix For: 1.0.0
  Attachments: delete_unused.patch

ThreadPool and FibonacciHeap are not used by anything else in the nutch source so, IMHO, there is no point keeping them. Java 5 has really nice alternatives to ThreadPool under java.util.concurrent. I don't know if FibonacciHeap is faster than Java's priority queue, but we don't make huge priority queues anyway, so even if it is faster the difference is probably not noticeable.
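For readers unfamiliar with the JDK classes the description alludes to, here is a minimal sketch of `java.util.concurrent.ExecutorService` standing in for a hand-rolled thread pool and `java.util.PriorityQueue` standing in for a custom heap. The class and method names are invented for illustration, and the lambda syntax is newer than the Java 5 mentioned in the issue; nothing here is taken from the Nutch source.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch of the JDK alternatives; not Nutch code.
class JdkReplacementsSketch {

    // Runs n small tasks on a fixed-size pool and sums their results.
    static int sumOfSquares(int n) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            List<Future<Integer>> results = new ArrayList<>();
            for (int i = 1; i <= n; i++) {
                final int value = i;
                results.add(pool.submit(() -> value * value)); // submitted as a Callable<Integer>
            }
            int sum = 0;
            for (Future<Integer> result : results) {
                sum += result.get(); // blocks until that task has finished
            }
            return sum;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown(); // stop accepting work so the JVM can exit
        }
    }

    // Drains a PriorityQueue, which yields elements in natural (ascending) order.
    static List<Integer> drainInOrder(List<Integer> items) {
        PriorityQueue<Integer> queue = new PriorityQueue<>(items);
        List<Integer> ordered = new ArrayList<>();
        while (!queue.isEmpty()) {
            ordered.add(queue.poll());
        }
        return ordered;
    }

    public static void main(String[] args) {
        System.out.println(sumOfSquares(4));                   // prints 30
        System.out.println(drainInOrder(Arrays.asList(5, 1, 3))); // prints [1, 3, 5]
    }
}
```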
[jira] Created: (NUTCH-538) Delete unused classes under o.a.n.util
Delete unused classes under o.a.n.util
  Key: NUTCH-538
  URL: https://issues.apache.org/jira/browse/NUTCH-538
  Project: Nutch
  Issue Type: Improvement
  Reporter: Doğacan Güney
  Priority: Trivial
  Fix For: 1.0.0
  Attachments: delete_unused.patch

ThreadPool and FibonacciHeap are not used by anything else in the nutch source so, IMHO, there is no point keeping them. Java 5 has really nice alternatives to ThreadPool under java.util.concurrent. I don't know if FibonacciHeap is faster than Java's priority queue, but we don't make huge priority queues anyway, so even if it is faster the difference is probably not noticeable.
[jira] Closed: (NUTCH-522) Use URLValidator in the Injector
[ https://issues.apache.org/jira/browse/NUTCH-522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doğacan Güney closed NUTCH-522.

Use URLValidator in the Injector
  Key: NUTCH-522
  URL: https://issues.apache.org/jira/browse/NUTCH-522
  Project: Nutch
  Issue Type: Improvement
  Components: injector
  Reporter: Emmanuel Joke
  Assignee: Emmanuel Joke
  Priority: Minor
  Fix For: 1.0.0
  Attachments: NUTCH-522.patch, NUTCH-522_v2.patch, NUTCH-522_v3.patch, NUTCH_522_v4.patch

Same as NUTCH-505, we should use the UrlValidator to check url in the Injector
[jira] Updated: (NUTCH-25) needs 'character encoding' detector
[ https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doğacan Güney updated NUTCH-25:
  Attachment: NUTCH-25_v3.patch

Here is a new version.
* Code style cleanups (use '} else {' instead of else on next line).
* Add confidences from different clues pointing to same encoding.
* Check if encoding passes the threshold.
* Add a utility getThreshold(String charset) method that returns the global threshold if charset-specific threshold is unavailable.
* Make clues a ThreadLocal variable for thread-safety.

needs 'character encoding' detector
  Key: NUTCH-25
  URL: https://issues.apache.org/jira/browse/NUTCH-25
  Project: Nutch
  Issue Type: New Feature
  Reporter: Stefan Groschupf
  Assignee: Doğacan Güney
  Fix For: 1.0.0
  Attachments: EncodingDetector.java, EncodingDetector_additive.java, NUTCH-25.patch, NUTCH-25_draft.patch, NUTCH-25_v2.patch, NUTCH-25_v3.patch, patch

transferred from: http://sourceforge.net/tracker/index.php?func=detail&aid=995730&group_id=59548&atid=491356
submitted by: Jungshik Shin

this is a follow-up to bug 993380 (figure out 'charset' from the meta tag).

Although we can cover a lot of ground using the 'C-T' header field in the HTTP header and the corresponding meta tag in html documents (and in case of XML, we have to use a similar but different 'parsing'), in the wild, there are a lot of documents without any information about the character encoding used. Browsers like Mozilla and search engines like Google use character encoding detectors to deal with these 'unlabelled' documents.

Mozilla's character encoding detector is GPL/MPL'd and we might be able to port it to Java. Unfortunately, it's not fool-proof. However, along with some other heuristics used by Mozilla and elsewhere, it'll be possible to achieve a high detection rate.

The following page has links to some other related pages.
http://trainedmonkey.com/week/2004/26

In addition to the character encoding detection, we also need to detect the language of a document, which is even harder and should be a separate bug (although it's related).
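The bullets in the update above (sum the confidences of clues that point to the same encoding, compare the total to a threshold, fall back to a global threshold when no charset-specific one exists) can be sketched as follows. This is only an illustration under assumptions: the `Clue` class, the threshold values, the `>=` comparison and the `guess` method are invented here, and only the name `getThreshold(String charset)` comes from the comment. The ThreadLocal handling is left out.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch; not the code from NUTCH-25_v3.patch.
class EncodingCluesSketch {

    // Assumed values, chosen only so the example has something to compare against.
    static final int GLOBAL_THRESHOLD = 50;
    static final Map<String, Integer> CHARSET_THRESHOLDS = new HashMap<>();
    static {
        CHARSET_THRESHOLDS.put("utf-8", 30);
    }

    // One piece of evidence: a charset name and how much we trust it.
    static final class Clue {
        final String charset;
        final int confidence;

        Clue(String charset, int confidence) {
            this.charset = charset;
            this.confidence = confidence;
        }
    }

    // Charset-specific threshold if one is configured, else the global one.
    static int getThreshold(String charset) {
        Integer specific = CHARSET_THRESHOLDS.get(charset.toLowerCase(Locale.ROOT));
        return specific != null ? specific : GLOBAL_THRESHOLD;
    }

    // Adds up confidences per charset and returns the best one that passes
    // its threshold, or the fallback if none does.
    static String guess(List<Clue> clues, String fallback) {
        Map<String, Integer> totals = new LinkedHashMap<>();
        for (Clue clue : clues) {
            totals.merge(clue.charset.toLowerCase(Locale.ROOT), clue.confidence, Integer::sum);
        }
        String best = fallback;
        int bestScore = Integer.MIN_VALUE;
        for (Map.Entry<String, Integer> entry : totals.entrySet()) {
            int score = entry.getValue();
            if (score >= getThreshold(entry.getKey()) && score > bestScore) {
                best = entry.getKey();
                bestScore = score;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<Clue> clues = Arrays.asList(
                new Clue("windows-1252", 40), // single strong clue, below the global threshold
                new Clue("UTF-8", 20),        // two weaker clues that agree
                new Clue("utf-8", 15));
        System.out.println(guess(clues, "iso-8859-1")); // prints utf-8
    }
}
```

In this example neither utf-8 clue would pass on its own, but their sum (35) clears the assumed utf-8 threshold of 30, while the single windows-1252 clue (40) stays below the assumed global threshold of 50.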
[jira] Resolved: (NUTCH-536) Reduce number of warnings in nutch core
[ https://issues.apache.org/jira/browse/NUTCH-536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doğacan Güney resolved NUTCH-536.
  Resolution: Fixed

Committed in rev. 563894.

Reduce number of warnings in nutch core
  Key: NUTCH-536
  URL: https://issues.apache.org/jira/browse/NUTCH-536
  Project: Nutch
  Issue Type: Improvement
  Affects Versions: 1.0.0
  Reporter: Doğacan Güney
  Assignee: Doğacan Güney
  Priority: Minor
  Fix For: 1.0.0
  Attachments: NUTCH-536_v2.patch, NUTCH_536.patch

Nutch core (code under src/java) gives around 600 warnings. Most of them are unused variables/imports and Java5 generics style warnings. This issue is to track changes to reduce number of warnings.
[jira] Created: (NUTCH-539) HttpClient plugin does not work with BasicAuthentication
HttpClient plugin does not work with BasicAuthentication
  Key: NUTCH-539
  URL: https://issues.apache.org/jira/browse/NUTCH-539
  Project: Nutch
  Issue Type: Bug
  Components: fetcher
  Affects Versions: 0.8
  Reporter: Ravi Chintakunta
  Priority: Minor

For Nutch to fetch pages with basic authentication, the HttpClient should be configured with the username and password credentials. For this to work:

1. Add the username and password credentials to nutch-site.xml as below:

  <property>
    <name>http.auth.basic.username</name>
    <value>myusername</value>
    <description>username for http basic auth</description>
  </property>
  <property>
    <name>http.auth.basic.password</name>
    <value>mypassword</value>
    <description>password for http basic auth</description>
  </property>

2. Configure httpclient with these credentials by applying the attached patch to nutch/src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/Http.java
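The attached patch itself is not reproduced in this message, so the following is not its content. As a rough, JDK-only illustration of what the two properties end up being used for, the sketch below reads them from a `java.util.Properties` object (an assumed stand-in for Nutch's configuration) and builds the `Authorization` header value that HTTP basic authentication sends. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.Properties;

// Hypothetical sketch; the real patch presumably configures the HTTP client library instead.
class BasicAuthSketch {

    // Returns the value for the "Authorization" request header,
    // or null when either credential is missing.
    static String authorizationHeader(Properties conf) {
        String user = conf.getProperty("http.auth.basic.username");
        String pass = conf.getProperty("http.auth.basic.password");
        if (user == null || pass == null) {
            return null;
        }
        // Basic auth is "Basic " + base64("username:password").
        byte[] raw = (user + ":" + pass).getBytes(StandardCharsets.UTF_8);
        return "Basic " + Base64.getEncoder().encodeToString(raw);
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty("http.auth.basic.username", "myusername");
        conf.setProperty("http.auth.basic.password", "mypassword");
        System.out.println("Authorization: " + authorizationHeader(conf));
    }
}
```

Note that basic authentication only encodes the credentials, it does not encrypt them, so storing them in nutch-site.xml and sending them over plain HTTP both expose the password.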
Is there any chance that my patches will be considered?
Hello Nutch Developers,

On May 22 I contributed two patches, NUTCH-487 and NUTCH-490. The next release is probably coming soon. I would be really pleased if they were merged by then, so that I don't have to patch it myself. Is there any chance they will be merged into the source code? I can port them to the current head, but so far nobody has asked for this.

I would also like to draw your attention to one point. It has already been two and a half months since I added the patches. There is not even a single comment on them. This is really discouraging for me as a contributor. I know that merging patches is not the thing that developers love to do, but you are the only ones who can do it. Of course I don't mean you should say thanks for every contribution, but do take it into account. Having someone's work ignored, and this is what it looks like to me, really discourages further work. Reviewing it and saying you won't merge it for some reason would be much better than leaving it without a single comment. This may reduce your active community. Think about this.

Best regards,
Marcin Okraszewski
Re: Is there any chance that my patches will be considered?
On 8/8/07, Marcin Okraszewski [EMAIL PROTECTED] wrote:
> Hello Nutch Developers,
>
> On May 22 I contributed two patches, NUTCH-487 and NUTCH-490. The next release is probably coming soon. I would be really pleased if they were merged by then, so that I don't have to patch it myself. Is there any chance they will be merged into the source code? I can port them to the current head, but so far nobody has asked for this.

NUTCH-487 is mostly a duplicate of NUTCH-369. I merged patches from both issues in NUTCH-25, since NUTCH-25 needs those patches to work correctly (and I gave you and Renaud Richardet credit in comments).

As for NUTCH-490, I haven't taken an in-depth look at it, but I don't see the point of it. Why not just use HtmlParseFilters, since you have access to the DOM object? What advantage do neko filters have? Also, having an extension point for a library possibly used by a possibly used plugin looks really, really wrong from a design point of view.

> I would also like to draw your attention to one point. It has already been two and a half months since I added the patches. There is not even a single comment on them. This is really discouraging for me as a contributor. I know that merging patches is not the thing that developers love to do, but you are the only ones who can do it. Of course I don't mean you should say thanks for every contribution, but do take it into account. Having someone's work ignored, and this is what it looks like to me, really discourages further work. Reviewing it and saying you won't merge it for some reason would be much better than leaving it without a single comment. This may reduce your active community. Think about this.
>
> Best regards,
> Marcin Okraszewski

--
Doğacan Güney
Re: Re: Is there any chance that my patches will be considered?
Thanks for a quick answer.

> As for NUTCH-490, I haven't taken an in-depth look at it, but I don't see the point of it. Why not just use HtmlParseFilters, since you have access to the DOM object? What advantage do neko filters have? Also, having an extension point for a library possibly used by a possibly used plugin looks really, really wrong from a design point of view.

In my case I want to achieve two things:
1. Ensure there is always a TBODY element.
2. Drop all SELECT elements (I don't want it to be in

You are right, I could manipulate the DOM for this. But filters seem to be a less costly operation, which is why I took this approach first. I haven't done any tests, though, so maybe I'm too concerned and it doesn't matter that much.

Or possibly my suggestion to make an extension point for the parser is the best one? Then if you want to modify parsing itself, you can do whatever you want. You also wouldn't need any switch for the parsing implementation, as there is now: simply modify the plugin inclusion. I could do it, if you find it a good idea.

Anyway, as I said, even if you find that none of this makes much sense, just close the issue with this comment. I really prefer that over a hanging request, because then I know I should think of a different solution for my case.

Thanks,
Marcin Okraszewski
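For reference, the DOM-manipulation route discussed in this thread (as opposed to parser-level filters) could look roughly like the sketch below. It uses only the JDK's `org.w3c.dom` API on a small well-formed sample; it is not a Nutch HtmlParseFilter, the class and method names are invented, and it assumes lower-case element names, whereas an HTML parser may report them in upper case.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical sketch of post-parse DOM cleanup; not Nutch code.
class DomCleanupSketch {

    // Parses well-formed markup; a real HTML page would need an HTML parser.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Copies the live NodeList so the tree can be modified while looping.
    static List<Element> elements(Document doc, String tag) {
        NodeList live = doc.getElementsByTagName(tag);
        List<Element> copy = new ArrayList<>();
        for (int i = 0; i < live.getLength(); i++) {
            copy.add((Element) live.item(i));
        }
        return copy;
    }

    // Goal 2 from the mail: drop every select element.
    static void dropSelects(Document doc) {
        for (Element select : elements(doc, "select")) {
            Node parent = select.getParentNode();
            if (parent != null) {
                parent.removeChild(select);
            }
        }
    }

    // Goal 1 from the mail: wrap the direct tr children of a table in a tbody
    // when the table has none. Tables that already have a tbody are left alone.
    static void ensureTbody(Document doc) {
        for (Element table : elements(doc, "table")) {
            boolean hasTbody = false;
            List<Node> rows = new ArrayList<>();
            for (Node child = table.getFirstChild(); child != null; child = child.getNextSibling()) {
                if (child instanceof Element) {
                    if ("tbody".equals(child.getNodeName())) hasTbody = true;
                    if ("tr".equals(child.getNodeName())) rows.add(child);
                }
            }
            if (hasTbody || rows.isEmpty()) {
                continue;
            }
            Element tbody = doc.createElement("tbody");
            table.insertBefore(tbody, rows.get(0));
            for (Node row : rows) {
                tbody.appendChild(row); // appendChild moves the row under tbody
            }
        }
    }

    static void clean(Document doc) {
        dropSelects(doc);
        ensureTbody(doc);
    }
}
```

Whether this is measurably more expensive than filtering during parsing is exactly the open question in the mail; the sketch makes no claim either way.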