[jira] Updated: (NUTCH-677) Segment merge filtering based on segment content

2009-10-08 Thread Marcin Okraszewski (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcin Okraszewski updated NUTCH-677:
-

Attachment: SegmentMergeFilter.java

Added Apache License.

 Segment merge filtering based on segment content
 ---

 Key: NUTCH-677
 URL: https://issues.apache.org/jira/browse/NUTCH-677
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 0.9.0
Reporter: Marcin Okraszewski
 Fix For: 1.1

 Attachments: MergeFilter.patch, MergeFilter_for_1.0.patch, 
 SegmentMergeFilter.java, SegmentMergeFilter.java, SegmentMergeFilters.java


 I needed segment filtering based on metadata detected during the parse phase.
 Unfortunately, the current URL-based filtering does not allow for this, so I
 have created a new SegmentMergeFilter extension which receives the segment
 entry being merged and decides whether it should be included. Even though I
 needed only ParseData for my purpose, I have made it a bit more general, so
 the filter receives all of the merged data.
 The attached patch is for version 0.9, which I use. Unfortunately I didn't
 have time to check how it fits the trunk version. Sorry :(

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-677) Segment merge filtering based on segment content

2009-10-08 Thread Marcin Okraszewski (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcin Okraszewski updated NUTCH-677:
-

Attachment: SegmentMergeFilters.java

Added Apache license header.

 Segment merge filtering based on segment content
 ---

 Key: NUTCH-677
 URL: https://issues.apache.org/jira/browse/NUTCH-677
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 0.9.0
Reporter: Marcin Okraszewski
 Fix For: 1.1

 Attachments: MergeFilter.patch, MergeFilter_for_1.0.patch, 
 SegmentMergeFilter.java, SegmentMergeFilter.java, SegmentMergeFilters.java, 
 SegmentMergeFilters.java


 I needed segment filtering based on metadata detected during the parse phase.
 Unfortunately, the current URL-based filtering does not allow for this, so I
 have created a new SegmentMergeFilter extension which receives the segment
 entry being merged and decides whether it should be included. Even though I
 needed only ParseData for my purpose, I have made it a bit more general, so
 the filter receives all of the merged data.
 The attached patch is for version 0.9, which I use. Unfortunately I didn't
 have time to check how it fits the trunk version. Sorry :(

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-677) Segment merge filtering based on segment content

2009-10-08 Thread Marcin Okraszewski (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12763681#action_12763681
 ] 

Marcin Okraszewski commented on NUTCH-677:
--

Sorry, I didn't notice the request for the license header. I've just uploaded 
the files with the header.

 Segment merge filtering based on segment content
 ---

 Key: NUTCH-677
 URL: https://issues.apache.org/jira/browse/NUTCH-677
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 0.9.0
Reporter: Marcin Okraszewski
 Fix For: 1.1

 Attachments: MergeFilter.patch, MergeFilter_for_1.0.patch, 
 SegmentMergeFilter.java, SegmentMergeFilter.java, SegmentMergeFilters.java, 
 SegmentMergeFilters.java


 I needed segment filtering based on metadata detected during the parse phase.
 Unfortunately, the current URL-based filtering does not allow for this, so I
 have created a new SegmentMergeFilter extension which receives the segment
 entry being merged and decides whether it should be included. Even though I
 needed only ParseData for my purpose, I have made it a bit more general, so
 the filter receives all of the merged data.
 The attached patch is for version 0.9, which I use. Unfortunately I didn't
 have time to check how it fits the trunk version. Sorry :(

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-740) Configuration option to override default language for fetched pages.

2009-06-09 Thread Marcin Okraszewski (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcin Okraszewski updated NUTCH-740:
-

Attachment: AcceptLanguage_trunk_2009-06-09.patch

It does apply, but only with the fuzz factor set to 2. Here is the ported patch.

 Configuration option to override default language for fetched pages.
 

 Key: NUTCH-740
 URL: https://issues.apache.org/jira/browse/NUTCH-740
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Affects Versions: 1.0.0
Reporter: Marcin Okraszewski
Assignee: Otis Gospodnetic
Priority: Minor
 Fix For: 1.1

 Attachments: AcceptLanguage.patch, 
 AcceptLanguage_trunk_2009-06-09.patch


 By default, the Accept-Language HTTP request header is set to English.
 Unfortunately this value is hard-coded and there seems to be no way to override
 it. As a result you may index the English version of pages even though you
 would prefer them in a different language.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (NUTCH-740) Configuration option to override default language for fetched pages.

2009-05-28 Thread Marcin Okraszewski (JIRA)
Configuration option to override default language for fetched pages.


 Key: NUTCH-740
 URL: https://issues.apache.org/jira/browse/NUTCH-740
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Affects Versions: 1.0.0, 0.9.0
Reporter: Marcin Okraszewski
 Fix For: 1.0.0


By default, the Accept-Language HTTP request header is set to English. 
Unfortunately this value is hard-coded and there seems to be no way to override 
it. As a result you may index the English version of pages even though you would 
prefer them in a different language.
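
For illustration only, a minimal sketch of what the requested option could look
like on the fetcher side; the property name http.accept.language and the default
value below are assumptions, not the contents of the attached patch.

    import org.apache.hadoop.conf.Configuration;

    // Hypothetical sketch: read a configurable Accept-Language value instead of
    // a hard-coded English one. Property name and default are assumptions.
    public class AcceptLanguageHeader {

      /** Build the Accept-Language header value, falling back to English. */
      public static String acceptLanguage(Configuration conf) {
        return conf.get("http.accept.language", "en-us,en-gb,en;q=0.7,*;q=0.3");
      }
    }

A protocol plugin would then append "Accept-Language: " plus this value to its
request headers in place of the hard-coded string.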

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-740) Configuration option to override default language for fetched pages.

2009-05-28 Thread Marcin Okraszewski (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcin Okraszewski updated NUTCH-740:
-

Attachment: AcceptLanguage.patch

A patch which allows overriding the Accept-Language header. The patch is 
against the 1.0 code.

 Configuration option to override default language for fetched pages.
 

 Key: NUTCH-740
 URL: https://issues.apache.org/jira/browse/NUTCH-740
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Affects Versions: 0.9.0, 1.0.0
Reporter: Marcin Okraszewski
 Fix For: 1.0.0

 Attachments: AcceptLanguage.patch


 By default, the Accept-Language HTTP request header is set to English.
 Unfortunately this value is hard-coded and there seems to be no way to override
 it. As a result you may index the English version of pages even though you
 would prefer them in a different language.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-677) Segment merge filtering based on segment content

2009-05-27 Thread Marcin Okraszewski (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcin Okraszewski updated NUTCH-677:
-

Attachment: MergeFilter_for_1.0.patch

The patch ported to Nutch 1.0. The Java files remain unchanged; only the patch 
has changed.

 Segment merge filtering based on segment content
 ---

 Key: NUTCH-677
 URL: https://issues.apache.org/jira/browse/NUTCH-677
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 0.9.0
Reporter: Marcin Okraszewski
 Fix For: 1.1

 Attachments: MergeFilter.patch, MergeFilter_for_1.0.patch, 
 SegmentMergeFilter.java, SegmentMergeFilters.java


 I needed segment filtering based on metadata detected during the parse phase.
 Unfortunately, the current URL-based filtering does not allow for this, so I
 have created a new SegmentMergeFilter extension which receives the segment
 entry being merged and decides whether it should be included. Even though I
 needed only ParseData for my purpose, I have made it a bit more general, so
 the filter receives all of the merged data.
 The attached patch is for version 0.9, which I use. Unfortunately I didn't
 have time to check how it fits the trunk version. Sorry :(

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-490) Extension point with filters for Neko HTML parser (with patch)

2009-05-27 Thread Marcin Okraszewski (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcin Okraszewski updated NUTCH-490:
-

Attachment: NekoFilters_for_1.0.patch

Patch ported to Nutch 1.0. It includes the two previous patches.

 Extension point with filters for Neko HTML parser (with patch)
 --

 Key: NUTCH-490
 URL: https://issues.apache.org/jira/browse/NUTCH-490
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Affects Versions: 0.9.0
 Environment: Any
Reporter: Marcin Okraszewski
Priority: Minor
 Attachments: HtmlParser.java.diff, NekoFilters_for_1.0.patch, 
 nutch-extensionpoins_plugin.xml.diff


 In my project I need to set filters for the Neko HTML parser. So instead of
 adding them hard-coded, I made an extension point to define filters for Neko. I
 was following the code for the HtmlParser filters. In fact, I think the method
 to get filters could be generalized to handle both cases, but I didn't want to
 make too big a mess.
 The attached patch is for Nutch 0.9. This part of the code wasn't changed in
 trunk, so it should apply easily.
 BTW, I wonder if it wouldn't be best to have HTML DOM parsing defined by an
 extension point itself. Now there are options for Neko and TagSoup. But if
 someone would like to use something else or give different settings to the
 parser, they would need to modify the HtmlParser class instead of replacing a
 plugin.
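
As a rough, hedged illustration of the idea (not the attached patch): filters
gathered from such an extension point could be handed to the Neko parser through
its filters property. The use of the stock Purifier filter and the property-based
hookup below are assumptions for illustration only.

    import org.apache.xerces.xni.parser.XMLDocumentFilter;
    import org.cyberneko.html.filters.Purifier;
    import org.cyberneko.html.parsers.DOMFragmentParser;
    import org.xml.sax.SAXException;

    // Sketch: install plugin-provided filters (a stock Purifier stands in here)
    // on the NekoHTML parser. Assumed wiring, not the attached patch.
    public class NekoFilterSetup {

      public static DOMFragmentParser newParser() throws SAXException {
        DOMFragmentParser parser = new DOMFragmentParser();
        XMLDocumentFilter[] filters = { new Purifier() };
        parser.setProperty("http://cyberneko.org/html/properties/filters", filters);
        return parser;
      }
    }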

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (NUTCH-677) Segment merge filtering based on segment content

2009-01-08 Thread Marcin Okraszewski (JIRA)
Segment merge filtering based on segment content
---

 Key: NUTCH-677
 URL: https://issues.apache.org/jira/browse/NUTCH-677
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 0.9.0
Reporter: Marcin Okraszewski
 Fix For: 0.9.0


I needed segment filtering based on metadata detected during the parse phase. 
Unfortunately, the current URL-based filtering does not allow for this, so I have 
created a new SegmentMergeFilter extension which receives the segment entry being 
merged and decides whether it should be included. Even though I needed only 
ParseData for my purpose, I have made it a bit more general, so the filter 
receives all of the merged data.

The attached patch is for version 0.9, which I use. Unfortunately I didn't have 
time to check how it fits the trunk version. Sorry :(

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-677) Segment merge filtering based on segment content

2009-01-08 Thread Marcin Okraszewski (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcin Okraszewski updated NUTCH-677:
-

Attachment: MergeFilter.patch

The patch for 0.9

 Segment merge filtering based on segment content
 ---

 Key: NUTCH-677
 URL: https://issues.apache.org/jira/browse/NUTCH-677
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 0.9.0
Reporter: Marcin Okraszewski
 Fix For: 0.9.0

 Attachments: MergeFilter.patch, SegmentMergeFilter.java


 I needed segment filtering based on metadata detected during the parse phase.
 Unfortunately, the current URL-based filtering does not allow for this, so I
 have created a new SegmentMergeFilter extension which receives the segment
 entry being merged and decides whether it should be included. Even though I
 needed only ParseData for my purpose, I have made it a bit more general, so
 the filter receives all of the merged data.
 The attached patch is for version 0.9, which I use. Unfortunately I didn't
 have time to check how it fits the trunk version. Sorry :(

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-677) Segment merge filtering based on segment content

2009-01-08 Thread Marcin Okraszewski (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcin Okraszewski updated NUTCH-677:
-

Attachment: SegmentMergeFilter.java

The filter interface (referenced by the patch).
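
For readers without the attachment, a hypothetical sketch of what such a filter
interface could look like; the parameter list is an assumption based on the issue
description, and the attached SegmentMergeFilter.java may differ.

    import org.apache.hadoop.io.Text;

    import org.apache.nutch.crawl.CrawlDatum;
    import org.apache.nutch.parse.ParseData;
    import org.apache.nutch.parse.ParseText;
    import org.apache.nutch.protocol.Content;

    // Hypothetical sketch of the extension interface described in this issue.
    public interface SegmentMergeFilter {

      /** Extension point ID used to register implementations as plugins. */
      String X_POINT_ID = SegmentMergeFilter.class.getName();

      /**
       * Called for every entry being merged. Return true to keep the entry in
       * the merged segment, false to drop it. All merged parts are passed so a
       * filter can decide on ParseData metadata alone if it wishes.
       */
      boolean filter(Text key, CrawlDatum generateData, CrawlDatum fetchData,
          Content content, ParseData parseData, ParseText parseText);
    }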

 Segment merge filtering based on segment content
 ---

 Key: NUTCH-677
 URL: https://issues.apache.org/jira/browse/NUTCH-677
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 0.9.0
Reporter: Marcin Okraszewski
 Fix For: 0.9.0

 Attachments: MergeFilter.patch, SegmentMergeFilter.java


 I needed segment filtering based on metadata detected during the parse phase.
 Unfortunately, the current URL-based filtering does not allow for this, so I
 have created a new SegmentMergeFilter extension which receives the segment
 entry being merged and decides whether it should be included. Even though I
 needed only ParseData for my purpose, I have made it a bit more general, so
 the filter receives all of the merged data.
 The attached patch is for version 0.9, which I use. Unfortunately I didn't
 have time to check how it fits the trunk version. Sorry :(

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-677) Segment merge filtering based on segment content

2009-01-08 Thread Marcin Okraszewski (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcin Okraszewski updated NUTCH-677:
-

Attachment: SegmentMergeFilters.java

The merge filter aggregation, which hides the extension point, etc. It is 
referenced by the patch.
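
A simplified illustration of that aggregation role (the lookup through the Nutch
plugin repository is deliberately omitted and the filters are simply passed in,
so this is not the attached SegmentMergeFilters.java):

    import org.apache.hadoop.io.Text;

    import org.apache.nutch.crawl.CrawlDatum;
    import org.apache.nutch.parse.ParseData;
    import org.apache.nutch.parse.ParseText;
    import org.apache.nutch.protocol.Content;

    // Run every registered filter; drop the entry as soon as one rejects it.
    public class SegmentMergeFilters {

      private final SegmentMergeFilter[] filters;

      public SegmentMergeFilters(SegmentMergeFilter[] filters) {
        this.filters = filters;
      }

      /** Returns true only if every registered filter accepts the entry. */
      public boolean filter(Text key, CrawlDatum generateData, CrawlDatum fetchData,
          Content content, ParseData parseData, ParseText parseText) {
        for (SegmentMergeFilter f : filters) {
          if (!f.filter(key, generateData, fetchData, content, parseData, parseText)) {
            return false;
          }
        }
        return true;
      }
    }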

 Segment merge filtering based on segment content
 ---

 Key: NUTCH-677
 URL: https://issues.apache.org/jira/browse/NUTCH-677
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 0.9.0
Reporter: Marcin Okraszewski
 Fix For: 0.9.0

 Attachments: MergeFilter.patch, SegmentMergeFilter.java, 
 SegmentMergeFilters.java


 I needed segment filtering based on metadata detected during the parse phase.
 Unfortunately, the current URL-based filtering does not allow for this, so I
 have created a new SegmentMergeFilter extension which receives the segment
 entry being merged and decides whether it should be included. Even though I
 needed only ParseData for my purpose, I have made it a bit more general, so
 the filter receives all of the merged data.
 The attached patch is for version 0.9, which I use. Unfortunately I didn't
 have time to check how it fits the trunk version. Sorry :(

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: Commit Times for Issues

2007-11-16 Thread Marcin Okraszewski
I can say something from a contributor's point of view. I've contributed two 
rather trivial patches and ... I'm discouraged. Simply put, the process was far 
too long. I actually had to ask that someone take a look at them. Once someone 
invests their time to create a patch, write a Jira entry, etc., you rather expect 
it to be reviewed and possibly committed. If at least one person needs it badly 
enough to be willing to develop it, it may mean there are others who need it as 
well.

Just to add: I've made several contributions to other projects, but this is the 
first time I have had a feeling like this.

As for striving for perfection, it must be balanced in my opinion. If something 
trivial is not done perfectly but does not break the architecture ... well, it 
might be acceptable. But if something would create spaghetti code, I wouldn't be 
so much for it. So my rule of thumb would be: once it breaks good design or 
introduces too much complexity, it shouldn't be accepted. If it doesn't affect 
those, but does what it should, maybe in a slightly clumsy way - why not? It 
still solves someone's problem or need.

Regards,
Marcin


On 16 November 2007 at 18:45, Dennis Kubes [EMAIL PROTECTED] wrote:

 So a few years ago I started a dating site called oneforever.com.  Good 
 technology, but it took us 9 months to develop the first version. 
 Mostly because we wanted everything to be perfect.  So we would work on 
 something; if it was not perfect, we would change it, and so on.  We never did get 
 it perfect, we just got it to the point where we had to launch it.
 
 A few months ago I started a different project focused on social 
 networking and search.  With this project I took the viewpoint of 
 consistent progress every day.  I would make some improvement to it 
 every day, no matter how small.  No such thing as perfect, just better. 
 This project developed much quicker and I think is actually a better 
 code base.  And what was more it was fun to work on.
 
 All of this is to say that I don't think there is any such thing as 
 perfection.  I do think there is better, continuously better.  And since 
 we all enjoy programming (I hope), making something better (not 
 perfect or best) is the fun part (or at least should be).  I can only 
 talk from my experience but I think the best part of programming is when 
 I have found the solution to the problem and it just works.
 
 So as we are developing this *standard* for committers I agree with 
 Chris that we should make this fun and casual and not be worried about 
 breaking the trunk.  After all, it's only code (I know, to some people 
 that is heresy :)) I actually think we are all in agreement about this. 
   I would love to hear from some of the other committers or members of 
 the community before we put these thoughts down on a wiki.
 
 Oh, and I am OK with minor issues having a longer wait time or requiring one or more +1s.
 
 Dennis Kubes
 
 Chris Mattmann wrote:
  Hi Guys,
  
   I'd like to chime in here on this one. My +1 for shortening the time to
  commit for issues. I fear that development effort on Nutch has teetered on
  the dwindling side of things for the last year or so, and there (in my
  opinion, so feel free to disagree) is certainly a stigma to the trunk and
  its sacred nature that discourages people (including myself) from
  introducing new code there.
  
   I would like to propose even extending Dennis's idea below and developing a
  new philosophy towards the Nutch CM. To me, the big picture change is the
  following statement: the trunk is something that can be broke. Let's just
  accept that it's possible. If it's broke, someone will report it. Nutch has
  a big enough user base now that plays around with new builds and revisions
  that this will get caught. Guess what. If the trunk is broke, then it can be
  fixed. 
  
   I'll tell you guys a story about one of my bosses here at JPL. He used to work
  for a civil defense contractor in the U.S., with a very rigorous design and
  software development process - a unit-tests-for-each-line-of-code type of
  place. In any case, my boss used to break his company's equivalent of the
  trunk daily build process all the time. Well, one day he gets called in to
  speak with the vice president of engineering at the company, who proceeds to
  tell him: "You're really good at breaking the code, eh?" My boss
  immediately jumps up to defend himself, citing the fact that it wasn't a big
  problem and that he has fixed it already, but the vice president cuts him
  off and says, "You probably think I'm mad. Well let me tell you: I'm not.
  You can break the code all you want, because you know what it tells me? That
  you're actually *DOING WORK*, unlike the rest of these people who work here
  and do very little."
  
   The above story has stuck with me and made me feel a lot better about
  situations such as those in that it gives me the belief that waiting until
  everything is perfect before acting 

[jira] Updated: (NUTCH-488) Avoid parsing uneccessary links and get a more relevant outlink list

2007-10-15 Thread Marcin Okraszewski (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcin Okraszewski updated NUTCH-488:
-

Attachment: ignore_tags_v3.patch

OK, yet another approach based on Doğacan's comments. Sorry for the delay, but I 
didn't notice the comment earlier.

- I didn't notice the conf.getStrings() method. Thanks for the hint :)
- I did keep backward compatibility with the use_action param, but it works a 
bit differently now if no value is set. Now the default is that forms should be 
used, but they can be dropped with the ignore_tags setting if use_action is not 
specified. If someone has set use_action to true explicitly, then it cannot be 
overridden by ignore_tags. It is still a bit inconsistent, but it is 
understandable that the specific setting (use_action) takes precedence. If the 
default were false, then with use_action undefined and form not added to 
ignore_tags, one could expect that forms are taken - but they wouldn't be. 
Keeping the backward compatibility makes the code a bit clumsy :( ... and I 
think I've made it overly flexible, but that was the cleanest solution here.
- As for the repeated if: I agree it is error prone, but on the other hand it is 
easy to understand. I didn't quite understand Doğacan's proposal :( but I think 
I did something acceptable - simply remove all specified tags from the link 
parameters.
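
A hedged sketch of the precedence rule described in the second point; the
property names parser.html.ignore_tags and parser.html.form.use_action are taken
from this discussion, but the code below is an illustration, not the attached
patch.

    import java.util.HashSet;
    import java.util.Set;

    import org.apache.hadoop.conf.Configuration;

    // Illustration of the described precedence: ignore_tags lists tags to skip,
    // but an explicit use_action=true keeps form links regardless.
    public class OutlinkTagConfig {

      /** Tags whose links should be skipped during outlink extraction. */
      public static Set<String> ignoredTags(Configuration conf) {
        Set<String> ignored = new HashSet<String>();
        String[] tags = conf.getStrings("parser.html.ignore_tags");
        if (tags != null) {
          for (String tag : tags) {
            ignored.add(tag.trim().toLowerCase());
          }
        }
        // An explicitly set use_action=true overrides ignore_tags for forms.
        if (conf.get("parser.html.form.use_action") != null
            && conf.getBoolean("parser.html.form.use_action", false)) {
          ignored.remove("form");
        }
        return ignored;
      }
    }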




 Avoid parsing uneccessary links and get a more relevant outlink list
 

 Key: NUTCH-488
 URL: https://issues.apache.org/jira/browse/NUTCH-488
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 0.9.0
 Environment: Windows, Java 1.5
Reporter: Emmanuel Joke
 Attachments: DOMContentUtils.patch, ignore_tags_v2.patch, 
 ignore_tags_v3.patch, nutch-default.xml.patch


 The NekoHTML parser uses a method to extract all outlinks from the HTML page.
 It extracts them from the HTML content based on the list of params defined in
 the setConf() method. This list of links is then truncated to the maximum
 number of outlinks that will be processed for a page, defined in
 nutch-default.xml (db.max.outlinks.per.page = 100 by default), and finally it
 goes through all of the defined URL filters.
 Unfortunately, it can happen that the list of outlinks contains more than 100
 entries, so the list is truncated and some relevant links could be removed.
 So I've added a few options in nutch-default.xml to enable/disable the
 extraction of links from specific HTML tags in this parser (SCRIPT, IMG, FORM,
 LINK).
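
For illustration, a hedged sketch of the truncation step the description refers
to; where exactly this happens in the Nutch code base is simplified away here.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.nutch.parse.Outlink;

    // Simplified view of the db.max.outlinks.per.page cap described above.
    public class OutlinkCap {

      /** Cap the collected outlinks at db.max.outlinks.per.page (100 by default). */
      public static Outlink[] cap(Outlink[] outlinks, Configuration conf) {
        int max = conf.getInt("db.max.outlinks.per.page", 100);
        if (max < 0 || outlinks.length <= max) {
          return outlinks;
        }
        Outlink[] capped = new Outlink[max];
        System.arraycopy(outlinks, 0, capped, 0, max);
        return capped;
      }
    }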

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: Parsing extra fields from an html page in the web. ....

2007-09-27 Thread Marcin Okraszewski
In brief: you need to write an HtmlParseFilter, then an IndexingFilter and a 
QueryFilter. You register them through extension points. Search the USER (not dev) 
list; there are answers there already.
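
A rough sketch of the extraction step such a parse filter would perform - here
pulling <meta name="..." content="..."> pairs out of the parsed DOM so an
indexing filter can later add them (with defaults where missing) to the document.
The exact HtmlParseFilter and IndexingFilter signatures differ between Nutch
versions, so only the version-independent DOM walk is shown.

    import java.util.HashMap;
    import java.util.Map;

    import org.w3c.dom.Element;
    import org.w3c.dom.Node;
    import org.w3c.dom.NodeList;

    // Collect meta tags such as <meta name="author" content="..."> from the
    // parsed fragment; a parse filter would store the result in parse metadata.
    public class MetaFieldExtractor {

      public static Map<String, String> extract(Node root) {
        Map<String, String> fields = new HashMap<String, String>();
        collect(root, fields);
        return fields;
      }

      private static void collect(Node node, Map<String, String> fields) {
        if (node.getNodeType() == Node.ELEMENT_NODE
            && "meta".equalsIgnoreCase(node.getNodeName())) {
          Element meta = (Element) node;
          String name = meta.getAttribute("name");
          String content = meta.getAttribute("content");
          if (name.length() > 0 && content.length() > 0) {
            fields.put(name.toLowerCase(), content);
          }
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
          collect(children.item(i), fields);
        }
      }
    }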

BTW, this question is asked over and over. It seems a good subject to write up 
on the wiki.

Marcin

 Hi,
 We are working on an Indian language search engine and are using
 nutch-0.9 as the basic framework.
 
 However, when the HTML pages are parsed during the fetching phase, the
 HtmlParser which runs on the page extracts the title text, meta tags and
 the outlinks.
 What do I need to do if I need to add more fields like author,
 language, or script to the segments extracted from the web page? In case
 the data is unavailable in the page, we can load in some default values.
 
 Do I need to touch the actual parser code (the parser used here is a neko-html
 parser, if I am not wrong), or can the additions be done right from within the
 Nutch code?
 
 It would be of great help if you could walk me through this.
 
 -- 
 Pratyush Banerjee
 SPO, CLIA
 IIT Kharagpur



Limiting outlink tags.

2007-09-06 Thread Marcin Okraszewski
Hi,
I have noticed that Nutch considers img/@src an outlink. I suppose in many 
cases people do not want to treat an image as an outlink; at least I don't. 
The same goes for script/@src. But it seems there is no way to limit the 
outlink tags. DOMContentUtils.getOutlinks() takes links from all of 
a, area, form, frame, iframe, script, link and img. Only the form element can be 
turned off, via the parser.html.form.use_action parameter.

I would suggest introducing a new configuration parameter which could be used 
to turn certain elements on or off. It could simply be a single parameter 
containing a comma-separated list of tags to be turned off.

What is your opinion? If you think it is a valid issue, I can make a patch for 
this.

Regards,
Marcin



Is there any chance that my patches will be considered?

2007-08-08 Thread Marcin Okraszewski
Hello Nutch Developers,
On May 22 I contributed two patches - NUTCH-487 and NUTCH-490. The next 
release is probably coming soon. I would be really pleased if they were merged 
by then, so I don't have to keep patching. Is there any chance they will be 
merged into the source code? I can port them to the current head, but so far 
nobody has asked for this.

I would also like to draw your attention to one point. It is already two and a 
half months since I added the patches, and there is not even a single comment 
on them. This is really discouraging for me as a contributor. I know that 
merging patches is not the thing developers love to do, but you are the only 
ones who can do it. Of course I don't mean you should give thanks for every 
contribution, but do take it into account. Having someone's work ignored - and 
that is how it looks to me - really discourages further work. Reviewing it and 
saying you won't merge it for some reason would be much better than leaving it 
without a single comment. This may shrink your active community.

Think about it.

Best regards,
Marcin Okraszewski


Re: Re: Is there any chance that my patches will be considered?

2007-08-08 Thread Marcin Okraszewski
Thanks for a quick answer.

 As for NUTCH-490, I haven't taken an in-depth look at it, but I don't
 see the point of it. Why not just use HtmlParseFilters since you have
 access to the DOM object? What advantage do neko filters have? Also,
 having an extension point for a library possibly used by a possibly
 used plugin looks really really wrong from a design point.

In my case I want to achieve two things:
1. Ensure there is always a TBODY element.
2. Drop all SELECT elements (I don't want it to be in 

You are right, I could manipulate the DOM for this. But filters seem to be a 
less costly operation, which is why I took this approach first. Though I haven't 
done any tests - maybe I'm too concerned and it doesn't matter that much.
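
For the second point, a hedged sketch of that DOM-based alternative - walking
the parsed fragment and detaching every SELECT element with nothing but the
standard org.w3c.dom API (so not the Neko filter hook this issue proposes):

    import org.w3c.dom.Node;
    import org.w3c.dom.NodeList;

    // Detach every SELECT element from the parsed document (fragment).
    public class SelectStripper {

      public static void stripSelects(Node node) {
        NodeList children = node.getChildNodes();
        // Walk backwards so removals do not disturb the iteration.
        for (int i = children.getLength() - 1; i >= 0; i--) {
          Node child = children.item(i);
          if (child.getNodeType() == Node.ELEMENT_NODE
              && "select".equalsIgnoreCase(child.getNodeName())) {
            node.removeChild(child);
          } else {
            stripSelects(child);
          }
        }
      }
    }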

Or possibly my suggestion to make an extension point for the parser itself is 
the best one? Then if you want to modify the parsing itself, you can do whatever 
you want, and you also wouldn't need any switch for the parsing implementation 
as there is now - simply modify the plugin inclusion. I could do it, if you find 
it a good idea.

Anyway, as I said, even if you find that it doesn't make much sense, just close 
the issue with that comment. I really prefer it to a hanging request, because 
then I know I should think of a different solution for my case.

Thanks,
Marcin Okraszewski


Re: How to create patch?

2007-06-01 Thread Marcin Okraszewski

http://wiki.apache.org/nutch/HowToContribute



On 6/1/07, Manoharam Reddy [EMAIL PROTECTED] wrote:

I have seen some patches being exchanged on the list.

I want to know how such a patch is created and how it is applied. Any
pointers to tutorials on the net, the wiki, or a plain reply here would be
helpful.



Re: running nutch without http proxy

2007-05-30 Thread Marcin Okraszewski

Seems like this is the default. You would rather expect some problems if you
want to use a proxy - the default configuration is without a proxy.

Cheers,
Marcin

On 5/29/07, prem kumar [EMAIL PROTECTED] wrote:

Is it possible to run Nutch without using an HTTP proxy to search the
internet? If so, what configuration is needed?
I don't want to use a SOCKS proxy either. All I have is a direct connection
to the internet.

Thanks
Prem


--
http://premsden.blogspot.com/



[jira] Updated: (NUTCH-490) Extension point with filters for Neko HTML parser (with patch)

2007-05-22 Thread Marcin Okraszewski (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcin Okraszewski updated NUTCH-490:
-

Attachment: HtmlParser.java.diff

Patch for HtmlParser.

 Extension point with filters for Neko HTML parser (with patch)
 --

 Key: NUTCH-490
 URL: https://issues.apache.org/jira/browse/NUTCH-490
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Affects Versions: 0.9.0
 Environment: Any
Reporter: Marcin Okraszewski
Priority: Minor
 Attachments: HtmlParser.java.diff


 In my project I need to set filters for the Neko HTML parser. So instead of
 adding them hard-coded, I made an extension point to define filters for Neko. I
 was following the code for the HtmlParser filters. In fact, I think the method
 to get filters could be generalized to handle both cases, but I didn't want to
 make too big a mess.
 The attached patch is for Nutch 0.9. This part of the code wasn't changed in
 trunk, so it should apply easily.
 BTW, I wonder if it wouldn't be best to have HTML DOM parsing defined by an
 extension point itself. Now there are options for Neko and TagSoup. But if
 someone would like to use something else or give different settings to the
 parser, they would need to modify the HtmlParser class instead of replacing a
 plugin.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-490) Extension point with filters for Neko HTML parser (with patch)

2007-05-22 Thread Marcin Okraszewski (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcin Okraszewski updated NUTCH-490:
-

Attachment: nutch-extensionpoins_plugin.xml.diff

Patch for plugin.xml in nutch-extensionpoints.

BTW, why are extension points declared in this plugin? Normally I would define 
this extension point in the plugin.xml of the parse-html plugin. But I saw all 
extension points defined here, so I followed this policy.

 Extension point with filters for Neko HTML parser (with patch)
 --

 Key: NUTCH-490
 URL: https://issues.apache.org/jira/browse/NUTCH-490
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Affects Versions: 0.9.0
 Environment: Any
Reporter: Marcin Okraszewski
Priority: Minor
 Attachments: HtmlParser.java.diff, 
 nutch-extensionpoins_plugin.xml.diff


 In my project I need to set filters for the Neko HTML parser. So instead of
 adding them hard-coded, I made an extension point to define filters for Neko. I
 was following the code for the HtmlParser filters. In fact, I think the method
 to get filters could be generalized to handle both cases, but I didn't want to
 make too big a mess.
 The attached patch is for Nutch 0.9. This part of the code wasn't changed in
 trunk, so it should apply easily.
 BTW, I wonder if it wouldn't be best to have HTML DOM parsing defined by an
 extension point itself. Now there are options for Neko and TagSoup. But if
 someone would like to use something else or give different settings to the
 parser, they would need to modify the HtmlParser class instead of replacing a
 plugin.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Bug (with fix): Neko HTML parser goes on defaults.

2007-05-21 Thread Marcin Okraszewski
Hi,
The Neko HTML parser setup is done in a silent try/catch statement (Nutch 0.9: 
HtmlParser.java:248-259). The problem is that the first feature being set 
throws an exception, so the whole setup block is skipped. The catch statement 
does nothing, so probably nobody noticed this.

I attach a patch which fixes this. It was done against Nutch 0.9, but the SVN 
trunk contains the same code.

The patch:
1. Fixes the augmentations feature.
2. Removes the include-comments feature, because I couldn't find anything 
similar at http://people.apache.org/~andyc/neko/doc/html/settings.html
3. Prints a warning message when an exception is caught.

Please note that a lot of messages now go to the console (not the log4j log), 
because the report-errors feature is being set. Shouldn't it be removed?

Cheers,
Marcin

--- nutch-0.9/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java  2007-04-03 05:44:21.0 +0200
+++ ../nutch-0.9/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java  2007-05-21 12:33:46.0 +0200
@@ -246,9 +246,9 @@
     DOMFragmentParser parser = new DOMFragmentParser();
     // some plugins, e.g., creativecommons, need to examine html comments
     try {
-      parser.setFeature("http://apache.org/xml/features/include-comments",
-          true);
-      parser.setFeature("http://apache.org/xml/features/augmentations",
+//      parser.setFeature("http://apache.org/xml/features/include-comments",
+//          true);
+      parser.setFeature("http://cyberneko.org/html/features/augmentations",
           true);
       parser.setFeature("http://cyberneko.org/html/features/balance-tags/ignore-outside-content",
           false);
@@ -256,7 +256,9 @@
           true);
       parser.setFeature("http://cyberneko.org/html/features/report-errors",
           true);
-    } catch (SAXException e) {}
+    } catch (SAXException e) {
+      LOG.warn("Exception while setting Neko parser settings.", e);
+    }
     // convert Document to DocumentFragment
     HTMLDocumentImpl doc = new HTMLDocumentImpl();
     doc.setErrorChecking(false);


[jira] Created: (NUTCH-487) Neko HTML parser goes on default settings.

2007-05-21 Thread Marcin Okraszewski (JIRA)
Neko HTML parser goes on default settings.
--

 Key: NUTCH-487
 URL: https://issues.apache.org/jira/browse/NUTCH-487
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 0.9.0
 Environment: Linux, Java 1.5.0.
Reporter: Marcin Okraszewski
 Attachments: neko_setup.patch

The Neko HTML parser setup is done in a silent try/catch statement (Nutch 0.9: 
HtmlParser.java:248-259). The problem is that the first feature being set 
throws an exception, so the whole setup block is skipped. The catch statement 
does nothing, so probably nobody noticed this.

I attach a patch which fixes this. It was done against Nutch 0.9, but the SVN 
trunk contains the same code.

The patch:
1. Fixes the augmentations feature.
2. Removes the include-comments feature, because I couldn't find anything 
similar at http://people.apache.org/~andyc/neko/doc/html/settings.html
3. Prints a warning message when an exception is caught.

Please note that a lot of messages now go to the console (not the log4j log), 
because the report-errors feature is being set. Shouldn't it be removed?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: Bug (with fix): Neko HTML parser goes on defaults.

2007-05-21 Thread Marcin Okraszewski
 I would suggest that you open a JIRA issue and attach the patch there.
 For this case, there is a similar issue (with patch) at NUTCH-369.

Done - NUTCH-487

Marcin