[jira] Updated: (NUTCH-677) Segment merge filering based on segment content
[ https://issues.apache.org/jira/browse/NUTCH-677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Marcin Okraszewski updated NUTCH-677:
    Attachment: SegmentMergeFilter.java

Added Apache License.

Segment merge filering based on segment content
---
Key: NUTCH-677
URL: https://issues.apache.org/jira/browse/NUTCH-677
Project: Nutch
Issue Type: Improvement
Affects Versions: 0.9.0
Reporter: Marcin Okraszewski
Fix For: 1.1
Attachments: MergeFilter.patch, MergeFilter_for_1.0.patch, SegmentMergeFilter.java, SegmentMergeFilter.java, SegmentMergeFilters.java

I needed segment filtering based on metadata detected during the parse phase. Unfortunately the current URL-based filtering does not allow for this. So I have created a new SegmentMergeFilter extension, which receives the segment entry being merged and decides whether it should be included or not. Even though I needed only ParseData for my purpose, I have made it a bit more general-purpose, so the filter receives all merged data.

The attached patch is for version 0.9, which I use. Unfortunately I didn't have time to check how it fits the trunk version. Sorry :(

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-677) Segment merge filering based on segment content
[ https://issues.apache.org/jira/browse/NUTCH-677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Marcin Okraszewski updated NUTCH-677:
    Attachment: SegmentMergeFilters.java

Added Apache license header.

Segment merge filering based on segment content
---
Key: NUTCH-677
URL: https://issues.apache.org/jira/browse/NUTCH-677
Project: Nutch
Issue Type: Improvement
Affects Versions: 0.9.0
Reporter: Marcin Okraszewski
Fix For: 1.1
Attachments: MergeFilter.patch, MergeFilter_for_1.0.patch, SegmentMergeFilter.java, SegmentMergeFilter.java, SegmentMergeFilters.java, SegmentMergeFilters.java
[jira] Commented: (NUTCH-677) Segment merge filering based on segment content
[ https://issues.apache.org/jira/browse/NUTCH-677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12763681#action_12763681 ]

Marcin Okraszewski commented on NUTCH-677:

Sorry, I didn't notice the request for the license header. I've just uploaded files with the header.

Segment merge filering based on segment content
---
Key: NUTCH-677
URL: https://issues.apache.org/jira/browse/NUTCH-677
Project: Nutch
Issue Type: Improvement
Affects Versions: 0.9.0
Reporter: Marcin Okraszewski
Fix For: 1.1
Attachments: MergeFilter.patch, MergeFilter_for_1.0.patch, SegmentMergeFilter.java, SegmentMergeFilter.java, SegmentMergeFilters.java, SegmentMergeFilters.java
[jira] Updated: (NUTCH-740) Configuration option to override default language for fetched pages.
[ https://issues.apache.org/jira/browse/NUTCH-740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Marcin Okraszewski updated NUTCH-740:
    Attachment: AcceptLanguage_trunk_2009-06-09.patch

It does apply, but with the fuzz factor set to 2. Here is the ported patch.

Configuration option to override default language for fetched pages.
---
Key: NUTCH-740
URL: https://issues.apache.org/jira/browse/NUTCH-740
Project: Nutch
Issue Type: Improvement
Components: fetcher
Affects Versions: 1.0.0
Reporter: Marcin Okraszewski
Assignee: Otis Gospodnetic
Priority: Minor
Fix For: 1.1
Attachments: AcceptLanguage.patch, AcceptLanguage_trunk_2009-06-09.patch

By default the Accept-Language HTTP request header is set to English. Unfortunately this value is hard-coded and it seems there is no way to override it. As a result you may index the English version of pages even though you would prefer a different language.
[jira] Created: (NUTCH-740) Configuration option to override default language for fetched pages.
Configuration option to override default language for fetched pages.
---
Key: NUTCH-740
URL: https://issues.apache.org/jira/browse/NUTCH-740
Project: Nutch
Issue Type: Improvement
Components: fetcher
Affects Versions: 1.0.0, 0.9.0
Reporter: Marcin Okraszewski
Fix For: 1.0.0

By default the Accept-Language HTTP request header is set to English. Unfortunately this value is hard-coded and it seems there is no way to override it. As a result you may index the English version of pages even though you would prefer a different language.
[jira] Updated: (NUTCH-740) Configuration option to override default language for fetched pages.
[ https://issues.apache.org/jira/browse/NUTCH-740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Marcin Okraszewski updated NUTCH-740:
    Attachment: AcceptLanguage.patch

The patch which allows overriding of Accept-Language header. The patch is done on 1.0 code.

Configuration option to override default language for fetched pages.
---
Key: NUTCH-740
URL: https://issues.apache.org/jira/browse/NUTCH-740
Project: Nutch
Issue Type: Improvement
Components: fetcher
Affects Versions: 0.9.0, 1.0.0
Reporter: Marcin Okraszewski
Fix For: 1.0.0
Attachments: AcceptLanguage.patch
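
To illustrate the change NUTCH-740 asks for, here is a minimal Java sketch of reading the Accept-Language value from configuration instead of hard-coding it. The property name `http.accept.language`, the default string, and the use of `java.util.Properties` as a stand-in for Nutch's configuration object are assumptions made for illustration; none of them is taken from the attached patch.

```java
import java.util.Properties;

class AcceptLanguageSketch {
    // Hypothetical property name and default value, for illustration only.
    static final String KEY = "http.accept.language";
    static final String DEFAULT_VALUE = "en-us,en-gb,en;q=0.7,*;q=0.3";

    /** Returns the configured Accept-Language value, or the default when unset or blank. */
    static String acceptLanguage(Properties conf) {
        String value = conf.getProperty(KEY);
        if (value == null || value.trim().isEmpty()) {
            return DEFAULT_VALUE;
        }
        return value.trim();
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        // Nothing configured: the previously hard-coded behaviour is kept.
        System.out.println(acceptLanguage(conf));
        // A site operator who prefers Polish pages overrides the header.
        conf.setProperty(KEY, "pl,en;q=0.5");
        System.out.println(acceptLanguage(conf));
    }
}
```

Falling back to the old value when the property is missing keeps existing crawls unchanged, which is probably what a minor improvement like this should aim for.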
[jira] Updated: (NUTCH-677) Segment merge filering based on segment content
[ https://issues.apache.org/jira/browse/NUTCH-677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Marcin Okraszewski updated NUTCH-677:
    Attachment: MergeFilter_for_1.0.patch

The patch ported to Nutch 1.0. The Java files remain unchanged, only patch has changed.

Segment merge filering based on segment content
---
Key: NUTCH-677
URL: https://issues.apache.org/jira/browse/NUTCH-677
Project: Nutch
Issue Type: Improvement
Affects Versions: 0.9.0
Reporter: Marcin Okraszewski
Fix For: 1.1
Attachments: MergeFilter.patch, MergeFilter_for_1.0.patch, SegmentMergeFilter.java, SegmentMergeFilters.java
[jira] Updated: (NUTCH-490) Extension point with filters for Neko HTML parser (with patch)
[ https://issues.apache.org/jira/browse/NUTCH-490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Marcin Okraszewski updated NUTCH-490:
    Attachment: NekoFilters_for_1.0.patch

Patch ported to Nutch 1.0. It includes the two previous patches.

Extension point with filters for Neko HTML parser (with patch)
---
Key: NUTCH-490
URL: https://issues.apache.org/jira/browse/NUTCH-490
Project: Nutch
Issue Type: Improvement
Components: fetcher
Affects Versions: 0.9.0
Environment: Any
Reporter: Marcin Okraszewski
Priority: Minor
Attachments: HtmlParser.java.diff, NekoFilters_for_1.0.patch, nutch-extensionpoins_plugin.xml.diff

In my project I need to set filters for the Neko HTML parser. So instead of hard-coding them, I made an extension point to define filters for Neko. I was following the code for HtmlParser filters. In fact, I think the method to get filters could be generalized to handle both cases, but I didn't want to make too big a mess.

The attached patch is for Nutch 0.9. This part of the code hasn't changed in trunk, so it should apply easily.

BTW, I wonder if it wouldn't be best to have HTML DOM parsing defined by an extension point itself. Now there are options for Neko and TagSoup. But if someone would like to use something else, or give the parser different settings, they would need to modify the HtmlParser class instead of replacing a plugin.
[jira] Created: (NUTCH-677) Segment merge filering based on segment content
Segment merge filering based on segment content
---
Key: NUTCH-677
URL: https://issues.apache.org/jira/browse/NUTCH-677
Project: Nutch
Issue Type: Improvement
Affects Versions: 0.9.0
Reporter: Marcin Okraszewski
Fix For: 0.9.0

I needed segment filtering based on metadata detected during the parse phase. Unfortunately the current URL-based filtering does not allow for this. So I have created a new SegmentMergeFilter extension, which receives the segment entry being merged and decides whether it should be included or not. Even though I needed only ParseData for my purpose, I have made it a bit more general-purpose, so the filter receives all merged data.

The attached patch is for version 0.9, which I use. Unfortunately I didn't have time to check how it fits the trunk version. Sorry :(
[jira] Updated: (NUTCH-677) Segment merge filering based on segment content
[ https://issues.apache.org/jira/browse/NUTCH-677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Marcin Okraszewski updated NUTCH-677:
    Attachment: MergeFilter.patch

The patch for 0.9

Segment merge filering based on segment content
---
Key: NUTCH-677
URL: https://issues.apache.org/jira/browse/NUTCH-677
Project: Nutch
Issue Type: Improvement
Affects Versions: 0.9.0
Reporter: Marcin Okraszewski
Fix For: 0.9.0
Attachments: MergeFilter.patch, SegmentMergeFilter.java
[jira] Updated: (NUTCH-677) Segment merge filering based on segment content
[ https://issues.apache.org/jira/browse/NUTCH-677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Marcin Okraszewski updated NUTCH-677:
    Attachment: SegmentMergeFilter.java

The filter interface (referred to by the patch).

Segment merge filering based on segment content
---
Key: NUTCH-677
URL: https://issues.apache.org/jira/browse/NUTCH-677
Project: Nutch
Issue Type: Improvement
Affects Versions: 0.9.0
Reporter: Marcin Okraszewski
Fix For: 0.9.0
Attachments: MergeFilter.patch, SegmentMergeFilter.java
[jira] Updated: (NUTCH-677) Segment merge filering based on segment content
[ https://issues.apache.org/jira/browse/NUTCH-677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Marcin Okraszewski updated NUTCH-677:
    Attachment: SegmentMergeFilters.java

Merge filter aggregation which hides the extension point, etc. It is referred to by the patch.

Segment merge filering based on segment content
---
Key: NUTCH-677
URL: https://issues.apache.org/jira/browse/NUTCH-677
Project: Nutch
Issue Type: Improvement
Affects Versions: 0.9.0
Reporter: Marcin Okraszewski
Fix For: 0.9.0
Attachments: MergeFilter.patch, SegmentMergeFilter.java, SegmentMergeFilters.java
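
The two attachments described for NUTCH-677 (a filter interface and an aggregation class in front of it) follow a common chain-of-filters shape, which can be sketched roughly as below. This is not the attached code: the type names `MergeFilter` and `MergeFilters`, the simplified `filter(url, parseMetadata)` signature, and the `robots`/`noindex` example are hypothetical stand-ins. According to the issue text, the real filter receives all merged segment data, not just parse metadata.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

class SegmentMergeFilterSketch {
    /** Hypothetical stand-in for the filter extension: true keeps the entry, false drops it. */
    interface MergeFilter {
        boolean filter(String url, Map<String, String> parseMetadata);
    }

    /** Aggregates several filters; an entry survives only if every filter accepts it. */
    static final class MergeFilters {
        private final List<MergeFilter> filters;

        MergeFilters(List<MergeFilter> filters) {
            this.filters = filters;
        }

        boolean filter(String url, Map<String, String> parseMetadata) {
            for (MergeFilter f : filters) {
                if (!f.filter(url, parseMetadata)) {
                    return false; // first rejection wins, remaining filters are skipped
                }
            }
            return true;
        }
    }

    /** Example filter: drop entries whose (made-up) "robots" metadata value is "noindex". */
    static final MergeFilter DROP_NOINDEX =
        (url, meta) -> !"noindex".equals(meta.get("robots"));

    public static void main(String[] args) {
        MergeFilters chain = new MergeFilters(Arrays.asList(DROP_NOINDEX));
        System.out.println(chain.filter("http://example.com/a",
            Collections.singletonMap("robots", "index")));
        System.out.println(chain.filter("http://example.com/b",
            Collections.singletonMap("robots", "noindex")));
    }
}
```

Keeping the aggregation in its own class means the merge code only ever asks one object whether to keep an entry, which matches the stated goal of hiding the extension point.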
Re: Commit Times for Issues
I can say something from a contributor's point of view. I've contributed two rather trivial patches and ... I'm discouraged. The process was simply far too long; I actually had to ask for someone to take a look at them. Once someone invests their time to create a patch, write a Jira entry, etc., they rather expect it to be reviewed and possibly committed. If at least one person needs something so much that they are willing to develop it, there may well be others who need it too.

Just to add: I've made several contributions to other projects, but this is the first time I have had a feeling like this.

As for looking for perfection, it must be balanced, in my opinion. If something trivial is not done perfectly but does not break the architecture ... well, it might be acceptable. But if something would produce spaghetti code, I wouldn't be so much for it. So my rule of thumb would be: if it breaks good design or introduces too much complexity, it shouldn't be accepted. If it doesn't affect those and does what it should, maybe in a slightly clumsy way, why not? It still solves someone's problem or need.

Regards,
Marcin

On 16 November 2007 at 18:45, Dennis Kubes [EMAIL PROTECTED] wrote:

> So a few years ago I started a dating site called oneforever.com. Good technology, but it took us 9 months to develop the first version, mostly because we wanted everything to be perfect. So we would work on something, change it if it was not perfect, and so on. We never did get it perfect; we just got it to the point where we had to launch it.
>
> A few months ago I started a different project focused on social networking and search. With this project I took the viewpoint of consistent progress every day. I would make some improvement to it every day, no matter how small. No such thing as perfect, just better. This project developed much quicker and I think it is actually a better code base. And what is more, it was fun to work on.
> All of this is to say that I don't think there is any such thing as perfection. I do think there is better, continuously better. And since we all enjoy programming (I hope), making something better (not perfect or best) is the fun part (or at least should be). I can only talk from my experience, but I think the best part of programming is when I have found the solution to the problem and it just works.
>
> So as we are developing this *standard* for committers, I agree with Chris that we should make this fun and casual and not be worried about breaking the trunk. After all, it's only code (I know, to some people that is heresy :))
>
> I actually think we are all in agreement about this. I would love to hear from some of the other committers or members of the community before we put these thoughts down on a wiki. Oh, and I am OK with minor issues having a longer wait time or 1 or more +1.
>
> Dennis Kubes
>
> Chris Mattmann wrote:
>
> > Hi Guys,
> >
> > I'd like to chime in here on this one. My +1 for shortening the time to commit for issues. I fear that development effort on Nutch has teetered on the dwindling side of things for the last year or so, and there (in my opinion, so feel free to disagree) is certainly a stigma to the trunk and its sacred nature that discourages people (including myself) from introducing new code there.
> >
> > I would like to propose even extending Dennis's idea below and developing a new philosophy towards the Nutch CM. To me, the big-picture change is the following statement: the trunk is something that can be broken. Let's just accept that it's possible. If it's broken, someone will report it. Nutch now has a big enough user base that plays around with new builds and revisions that this will get caught. Guess what: if the trunk is broken, then it can be fixed.
> >
> > I'll tell you guys a story about one of my bosses here at JPL. He used to work for a civil defense contractor in the U.S. with a very rigorous design and software development process.
> > A unit-tests-for-each-line-of-code type of place. In any case, my boss used to break his company's equivalent of the trunk daily build process all the time. Well, one day he gets called in to speak with the vice president of engineering at the company, who proceeds to tell him: "You're really good at breaking the code, eh?" My boss immediately jumps up to defend himself, citing the fact that it wasn't a big problem and that he has fixed it already, but the vice president cuts him off and says, "You probably think I'm mad. Well, let me tell you: I'm not. You can break the code all you want, because you know what it tells me? That you're actually *DOING WORK*, unlike the rest of these people who work here and do very little."
> >
> > The above story has stuck with me and made me feel a lot better about situations such as those, in that it gives me the belief that waiting until everything is perfect before acting
[jira] Updated: (NUTCH-488) Avoid parsing uneccessary links and get a more relevant outlink list
[ https://issues.apache.org/jira/browse/NUTCH-488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Marcin Okraszewski updated NUTCH-488:
    Attachment: ignore_tags_v3.patch

OK, yet another approach based on Doğacan's comments. Sorry for the delay, but I didn't notice the comment earlier.

- I didn't notice the conf.getStrings() method. Thanks for the hint :)
- I did keep backward compatibility with the use_action param, but it works a bit differently now when no value is set. Now the default is that forms are used, but they can be dropped with the ignore_tags setting if use_action is not specified. If someone has use_action explicitly set to true, then it cannot be overridden by ignore_tags. It is still a bit inconsistent, but it is understandable that the specific setting (use_action) has precedence. If the default were false, then with use_action undefined and form not added to ignore_tags, one would expect forms to be used, but they wouldn't be. Keeping backward compatibility makes the code a bit clumsy :( ... and I think I've made it over-flexible, but that was the cleanest solution here.
- As for the repeating if: I agree it is error-prone, but on the other hand it is easy to understand. I didn't quite understand Doğacan's proposal :( but I think I did something acceptable: simply remove all specified tags from the link params.

Avoid parsing uneccessary links and get a more relevant outlink list
---
Key: NUTCH-488
URL: https://issues.apache.org/jira/browse/NUTCH-488
Project: Nutch
Issue Type: Improvement
Affects Versions: 0.9.0
Environment: Windows, Java 1.5
Reporter: Emmanuel Joke
Attachments: DOMContentUtils.patch, ignore_tags_v2.patch, ignore_tags_v3.patch, nutch-default.xml.patch

The NekoHTML parser uses a method to extract all outlinks from the HTML page. It extracts them from the HTML content based on the list of params defined in the method setConf(). This list of links is then truncated to the maximum number of outlinks that we'll process for a page, defined in nutch-default.xml (db.max.outlinks.per.page = 100 by default), and finally it goes through all the defined urlfilters.

Unfortunately it can happen that the list of outlinks is longer than 100, so the list is truncated and some relevant links could be removed. So I've added a few options to nutch-default.xml in order to enable/disable the extraction of links from specific HTML tags in this parser (SCRIPT, IMG, FORM, LINK).
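
The precedence between use_action and ignore_tags described in the comment above can be written down as a small decision function. The sketch below is one reading of that comment, not code from ignore_tags_v3.patch: the method name `useFormAction` is hypothetical, a `null` argument stands for "use_action not configured", and the handling of an explicit `false` is an assumption (the comment only spells out the explicit `true` case).

```java
import java.util.Collections;
import java.util.Set;

class FormOutlinkSketch {
    /**
     * Decides whether form actions are treated as outlinks.
     *
     * @param useAction  explicit use_action value, or null when it is not configured
     * @param ignoreTags lower-case tag names listed in the ignore_tags setting
     */
    static boolean useFormAction(Boolean useAction, Set<String> ignoreTags) {
        if (useAction != null) {
            // The specific setting has precedence over ignore_tags.
            // Assumption: an explicit false disables forms, as before the patch.
            return useAction;
        }
        // Not configured: forms are used by default unless listed in ignore_tags.
        return !ignoreTags.contains("form");
    }

    public static void main(String[] args) {
        Set<String> none = Collections.emptySet();
        Set<String> form = Collections.singleton("form");
        System.out.println(useFormAction(null, none));  // default: forms are used
        System.out.println(useFormAction(null, form));  // dropped via ignore_tags
        System.out.println(useFormAction(true, form));  // explicit true cannot be overridden
        System.out.println(useFormAction(false, none)); // explicit false (assumed behaviour)
    }
}
```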
Re: Parsing extra fields from an html page in the web. ....
In brief: you need to write an HtmlParseFilter, then an IndexingFilter and a QueryFilter. You register them through extension points. Search the USER (not dev) list; there are answers there already.

BTW, this question is asked over and over. It seems to be a good subject to write up on the wiki.

Marcin

> Hi,
>
> We are working on an Indian-language search engine and are using nutch-0.9 as the basic framework. However, when the HTML pages are parsed during the fetching phase, the HtmlParser which runs on the page extracts the title text, the meta tags and the outlinks. What do I need to do if I need to add more fields, like author, language and script, to the segments extracted from the web page? In case the data is unavailable in the page, we can load some default values.
>
> Do I need to touch the actual parser code (the parser used here is a neko-html parser, if I am not wrong), or can the additions be done right from within the Nutch code?
>
> It would be of great help if you could get me through this.
>
> --
> Pratyush Banerjee
> SPO, CLIA
> IIT Kharagpur
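
As a rough illustration of the first step only (the parse filter), the sketch below shows the kind of DOM lookup such a filter might perform: find a `<meta name="...">` element and fall back to a default value when it is missing, as the question asks. It deliberately does not use any Nutch interfaces; the class and method names are hypothetical, the document is parsed with the JDK's XML parser purely so the example is runnable, and lower-case element names are assumed (an HTML parser may report them differently).

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class MetaFieldSketch {
    /** Returns the content of the named meta tag, or defaultValue when it is absent or empty. */
    static String metaContent(Document doc, String name, String defaultValue) {
        // Assumes lower-case element names; adjust if the parser upper-cases them.
        NodeList metas = doc.getElementsByTagName("meta");
        for (int i = 0; i < metas.getLength(); i++) {
            Element meta = (Element) metas.item(i);
            if (name.equalsIgnoreCase(meta.getAttribute("name"))) {
                String content = meta.getAttribute("content");
                if (!content.isEmpty()) {
                    return content;
                }
            }
        }
        return defaultValue;
    }

    /** Parses a well-formed XHTML string with the JDK parser (stand-in for the crawler's DOM). */
    static Document parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse sample document", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<html><head><meta name=\"author\" content=\"Jane\"/></head><body/></html>");
        System.out.println(metaContent(doc, "author", "unknown"));
        System.out.println(metaContent(doc, "language", "unknown"));
    }
}
```

In a real plugin the extracted value would then be stored in the parse metadata, so that the indexing and query filters mentioned above can pick it up.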
Limiting outlink tags.
Hi,

I have noticed that Nutch considers img/@src an outlink. I suppose in many cases people do not want to treat an image as an outlink. At least I don't. The same is true of script/@src. But it seems there is no way to limit the outlink tags. DOMContentUtils.getOutlinks() takes links from all of a, area, form, frame, iframe, script, link and img. Only the form element can be turned off, using the parser.html.form.use_action parameter.

I would suggest introducing a new configuration parameter which could be used to turn certain elements on or off. It could be done simply with a single parameter containing a comma-separated list of tags to be turned off.

What is your opinion? If you think it is a valid issue, I can make a patch for this.

Regards,
Marcin
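
A minimal sketch of the proposal, assuming a single comma-separated setting: start from the tag list quoted above and remove whatever the setting names. The class and method names are hypothetical, and the setting's own name is left out on purpose because the message does not propose one.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class OutlinkTagsSketch {
    // Tags that currently contribute outlinks, as listed in the message above.
    static final List<String> DEFAULT_TAGS =
        Arrays.asList("a", "area", "form", "frame", "iframe", "script", "link", "img");

    /** Returns the tags still used for outlinks after removing the comma-separated ignore list. */
    static Set<String> activeTags(String ignoreList) {
        Set<String> active = new LinkedHashSet<>(DEFAULT_TAGS);
        if (ignoreList != null) {
            for (String tag : ignoreList.split(",")) {
                // Tolerate whitespace and upper-case names in the setting.
                active.remove(tag.trim().toLowerCase(Locale.ROOT));
            }
        }
        return active;
    }

    public static void main(String[] args) {
        System.out.println(activeTags(null));
        System.out.println(activeTags("img, SCRIPT"));
    }
}
```

An unset or empty value leaves all eight tags active, so existing configurations would behave exactly as before.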
Is there any chance that my patches will be considered?
Hello Nutch Developers,

On May 22 I contributed two patches: NUTCH-487 and NUTCH-490. The next release is probably coming soon, and I would be really pleased if they were merged by then, so that I don't have to patch it myself. Is there any chance they will be merged into the source code? I can port them to the current head, but so far nobody has asked for this.

I would also like to draw your attention to one point. It is already two and a half months since I added the patches, and there is not even a single comment on them. This is really discouraging for me as a contributor. I know that merging patches is not something developers love to do, but you are the only ones who can do it. Of course I don't mean you should say thanks for every contribution, but please take it into account. Having one's work ignored, and that is how it looks to me, really discourages further work. Reviewing it and saying you won't merge it because of something would be much better than leaving it without a single comment. This may reduce your active community. Think about this.

Best regards,
Marcin Okraszewski
Re: Re: Is there any chance that my patches will be considered?
Thanks for a quick answer.

> As for NUTCH-490, I haven't taken an in-depth look at it, but I don't see the point of it. Why not just use HtmlParseFilters since you have access to the DOM object? What advantage do neko filters have? Also, having an extension point for a library possibly used by a possibly used plugin looks really really wrong from a design point.

In my case I want to achieve two things:
1. Ensure there is always a TBODY element.
2. Drop all SELECT elements (I don't want it to be in

You are right, I could manipulate the DOM for this. But filters seem to be a less costly operation, which is why I took this approach first. I haven't done any tests, though; maybe I'm too concerned and it doesn't matter that much.

Or possibly my suggestion to make an extension point for the parser is the best one? Then if you want to modify parsing itself, you can do whatever you want. You also wouldn't need any switch for the parsing implementation, as there is now; you would simply modify the plugin inclusion. I could do it, if you find it a good idea.

Anyway, as I said, even if you find that none of this makes much sense, just close the issue with this comment. I really prefer that over a hanging request, because then I know I should think of a different solution for my case.

Thanks,
Marcin Okraszewski
Re: How to create patch?
http://wiki.apache.org/nutch/HowToContribute

On 6/1/07, Manoharam Reddy [EMAIL PROTECTED] wrote:

> I have seen some patches being exchanged on the list. I want to know how such a patch is created and how it is applied. Any pointers to tutorials on the net or wiki, or a plain reply here, would be helpful.
Re: running nutch without http proxy
It seems this is the default: the default configuration is without a proxy. You should rather expect problems if you want to use a proxy.

Cheers,
Marcin

On 5/29/07, prem kumar [EMAIL PROTECTED] wrote:

> Is it possible to run nutch without using an http proxy to search the internet? If so, what configuration is needed? I don't want to use a socks proxy either. All I have is a direct connection to the internet.
>
> Thanks
> Prem
> --
> http://premsden.blogspot.com/
[jira] Updated: (NUTCH-490) Extension point with filters for Neko HTML parser (with patch)
[ https://issues.apache.org/jira/browse/NUTCH-490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Marcin Okraszewski updated NUTCH-490:
    Attachment: HtmlParser.java.diff

Patch for HtmlParser.

Extension point with filters for Neko HTML parser (with patch)
---
Key: NUTCH-490
URL: https://issues.apache.org/jira/browse/NUTCH-490
Project: Nutch
Issue Type: Improvement
Components: fetcher
Affects Versions: 0.9.0
Environment: Any
Reporter: Marcin Okraszewski
Priority: Minor
Attachments: HtmlParser.java.diff
[jira] Updated: (NUTCH-490) Extension point with filters for Neko HTML parser (with patch)
[ https://issues.apache.org/jira/browse/NUTCH-490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Marcin Okraszewski updated NUTCH-490:
    Attachment: nutch-extensionpoins_plugin.xml.diff

Patch for plugin.xml in nutch-extensionpoints.

BTW, why are extension points declared in this plugin? Normally I would define this extension point in the plugin.xml of the parse-html plugin. But I saw all extension points defined here, so I followed this policy.

Extension point with filters for Neko HTML parser (with patch)
---
Key: NUTCH-490
URL: https://issues.apache.org/jira/browse/NUTCH-490
Project: Nutch
Issue Type: Improvement
Components: fetcher
Affects Versions: 0.9.0
Environment: Any
Reporter: Marcin Okraszewski
Priority: Minor
Attachments: HtmlParser.java.diff, nutch-extensionpoins_plugin.xml.diff
Bug (with fix): Neko HTML parser goes on defaults.
Hi,

The Neko HTML parser setup is done in a silent try / catch statement (Nutch 0.9: HtmlParser.java:248-259). The problem is that the first feature being set throws an exception, so the whole setup block is skipped. The catch statement does nothing, so probably nobody noticed this.

I attach a patch which fixes this. It was done on Nutch 0.9, but SVN trunk contains the same code. The patch:
1. Fixes the augmentations feature.
2. Removes the include-comments feature, because I couldn't find anything similar at http://people.apache.org/~andyc/neko/doc/html/settings.html
3. Prints a warning message when an exception is caught.

Please note that a lot of messages now go to the console (not the log4j log), because the report-errors feature is being set. Shouldn't it be removed?

Cheers,
Marcin

--- nutch-0.9/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java 2007-04-03 05:44:21.0 +0200
+++ ../nutch-0.9/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java 2007-05-21 12:33:46.0 +0200
@@ -246,9 +246,9 @@
     DOMFragmentParser parser = new DOMFragmentParser();
     // some plugins, e.g., creativecommons, need to examine html comments
     try {
-      parser.setFeature("http://apache.org/xml/features/include-comments",
-              true);
-      parser.setFeature("http://apache.org/xml/features/augmentations",
+//      parser.setFeature("http://apache.org/xml/features/include-comments",
+//              true);
+      parser.setFeature("http://cyberneko.org/html/features/augmentations",
               true);
       parser.setFeature("http://cyberneko.org/html/features/balance-tags/ignore-outside-content",
               false);
@@ -256,7 +256,9 @@
               true);
       parser.setFeature("http://cyberneko.org/html/features/report-errors",
               true);
-    } catch (SAXException e) {}
+    } catch (SAXException e) {
+      LOG.warn("Exception while setting Neko parser settings.", e);
+    }
     // convert Document to DocumentFragment
     HTMLDocumentImpl doc = new HTMLDocumentImpl();
     doc.setErrorChecking(false);
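
The failure mode described above (one rejected feature silently skipping every later one) can also be avoided by giving each feature its own try / catch. The sketch below shows that variation; it is not what the attached patch does, which keeps a single block and adds a warning. `FeatureTarget`, `applyFeatures` and `strictParser` are hypothetical names, and the fake parser stands in for the Neko parser so the example runs with only the JDK.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import org.xml.sax.SAXException;
import org.xml.sax.SAXNotRecognizedException;

class FeatureSetupSketch {
    /** Minimal stand-in for a parser that accepts feature flags. */
    interface FeatureTarget {
        void setFeature(String name, boolean value) throws SAXException;
    }

    /** Applies every feature independently and returns the names the parser rejected. */
    static List<String> applyFeatures(FeatureTarget parser, Map<String, Boolean> features) {
        List<String> rejected = new ArrayList<>();
        for (Map.Entry<String, Boolean> feature : features.entrySet()) {
            try {
                parser.setFeature(feature.getKey(), feature.getValue());
            } catch (SAXException e) {
                // Report the problem instead of swallowing it, then keep going.
                System.err.println("WARN could not set " + feature.getKey() + ": " + e.getMessage());
                rejected.add(feature.getKey());
            }
        }
        return rejected;
    }

    /** Fake parser that only recognises feature names under http://cyberneko.org/. */
    static FeatureTarget strictParser() {
        return (name, value) -> {
            if (!name.startsWith("http://cyberneko.org/")) {
                throw new SAXNotRecognizedException(name);
            }
        };
    }

    public static void main(String[] args) {
        Map<String, Boolean> features = new LinkedHashMap<>();
        features.put("http://apache.org/xml/features/augmentations", true);
        features.put("http://cyberneko.org/html/features/report-errors", true);
        System.out.println("Rejected: " + applyFeatures(strictParser(), features));
    }
}
```

The trade-off is a little more code in exchange for a misspelled feature name no longer disabling the settings that follow it.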
[jira] Created: (NUTCH-487) Neko HTML parser goes on default settings.
Neko HTML parser goes on default settings.
---
Key: NUTCH-487
URL: https://issues.apache.org/jira/browse/NUTCH-487
Project: Nutch
Issue Type: Bug
Components: fetcher
Affects Versions: 0.9.0
Environment: Linux, Java 1.5.0.
Reporter: Marcin Okraszewski
Attachments: neko_setup.patch

The Neko HTML parser setup is done in a silent try / catch statement (Nutch 0.9: HtmlParser.java:248-259). The problem is that the first feature being set throws an exception, so the whole setup block is skipped. The catch statement does nothing, so probably nobody noticed this.

I attach a patch which fixes this. It was done on Nutch 0.9, but SVN trunk contains the same code. The patch:
1. Fixes the augmentations feature.
2. Removes the include-comments feature, because I couldn't find anything similar at http://people.apache.org/~andyc/neko/doc/html/settings.html
3. Prints a warning message when an exception is caught.

Please note that a lot of messages now go to the console (not the log4j log), because the report-errors feature is being set. Shouldn't it be removed?
Re: Bug (with fix): Neko HTML parser goes on defaults.
> I would suggest that you open a JIRA issue and attach the patch there. For this case, there is a similar issue (with patch) at NUTCH-369.

Done: NUTCH-487

Marcin