[jira] [Commented] (NUTCH-585) [PARSE-HTML plugin] Block certain parts of HTML code from being indexed

2013-03-17 Thread Roberto Gardenier (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13604649#comment-13604649
 ] 

Roberto Gardenier commented on NUTCH-585:
-

Will this patch be included in Nutch at all? I've seen this patch / feature 
request being marked for every release from 1.4 up to 1.7 now. 
The patch works with Nutch 1.5 through 1.5.1, but I wonder whether it will 
ever become part of Nutch, [~markus17]?

 [PARSE-HTML plugin] Block certain parts of HTML code from being indexed
 ---

 Key: NUTCH-585
 URL: https://issues.apache.org/jira/browse/NUTCH-585
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 0.9.0
 Environment: All operating systems
Reporter: Andrea Spinelli
Assignee: Markus Jelsma
Priority: Minor
 Fix For: 1.7

 Attachments: blacklist_whitelist_plugin.patch, 
 nutch-585-excludeNodes.patch, nutch-585-jostens-excludeDIVs.patch


 We are using Nutch to index our own web sites. We would like to exclude 
 certain parts of our pages from the index because we know they are not 
 relevant (for instance, there are several links to change the background 
 color) and they generate spurious matches.
 We have modified the plugin so that it ignores HTML code between certain HTML 
 comments, like
 <!-- START-IGNORE -->
 ... ignored part ...
 <!-- STOP-IGNORE -->
 We feel this might be useful to someone else, perhaps with the comment 
 strings factored out as constants in the configuration files (say 
 parser.html.ignore.start and parser.html.ignore.stop in nutch-site.xml).
 We are almost ready to contribute our code snippet. We look forward to any 
 expression of interest, or to an explanation of why what we are doing is 
 plain wrong!
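
The idea described above can be sketched in a few lines. This is only an illustration in Python, not the reporter's code and not any of the attached patches: the function name `strip_ignored` and the two marker constants are hypothetical, and the markers are assumed to be the HTML comments quoted in the description.

```python
import re

# Hypothetical defaults. The issue only suggests making these configurable,
# for example as parser.html.ignore.start / parser.html.ignore.stop.
START_MARKER = "<!-- START-IGNORE -->"
STOP_MARKER = "<!-- STOP-IGNORE -->"


def strip_ignored(html, start=START_MARKER, stop=STOP_MARKER):
    """Return html with every start...stop region removed, markers included."""
    # Non-greedy match so that each start marker pairs with the nearest stop
    # marker; DOTALL lets an ignored region span several lines.
    pattern = re.compile(re.escape(start) + r".*?" + re.escape(stop), re.DOTALL)
    return pattern.sub("", html)


if __name__ == "__main__":
    page = (
        "<p>Real content</p>"
        "<!-- START-IGNORE --><a href='?bg=blue'>blue background</a><!-- STOP-IGNORE -->"
        "<p>More content</p>"
    )
    print(strip_ignored(page))
```

In this sketch a start marker without a matching stop marker leaves the text untouched, so a forgotten STOP-IGNORE does not silently drop the rest of the page. Whether that is the right behaviour is a design choice the issue does not settle.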

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (NUTCH-585) [PARSE-HTML plugin] Block certain parts of HTML code from being indexed

2012-10-29 Thread Roberto Gardenier (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Roberto Gardenier updated NUTCH-585:


Comment: was deleted

(was: I have compiled Nutch 1.5.1 with the provided plugin and used the 
configuration as described above, all without success. 
Could anyone assist me with troubleshooting?

Nutch crawls and Solr indexes successfully, but the content field still 
includes content that is supposed to be blacklisted.

Steps:
1. Patched Nutch 1.5.1 with the above blacklist_whitelist_plugin.patch.
2. Enabled the plugin in nutch-default.xml by adding 
index-blacklist-whitelist to plugin.includes.
3. Added the new field strippedContent to schema.xml (both Nutch and Solr): 
<!-- fields for the blacklist/whitelist plugin --> 
<field name="strippedContent" type="text" stored="true" indexed="true"/>
4. Configured parser.html.blacklist to blacklist div.kruimelspoor in 
nutch-default.xml.

I pointed Nutch at my site and started the crawl. I don't get warnings, 
errors, or any other showstoppers; the crawl goes well and the index is 
filled, but it still contains everything inside div.kruimelspoor.
)
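
For troubleshooting a setup like this, it can help to know what the stripped text should look like before checking the index. The following is a minimal sketch in Python, not the code from blacklist_whitelist_plugin.patch (whose behaviour is not shown in this thread). It assumes a selector of the form tag.class, as in div.kruimelspoor, and all class and function names are hypothetical.

```python
from html.parser import HTMLParser


class BlacklistTextExtractor(HTMLParser):
    """Collect page text, skipping everything inside a blacklisted element."""

    def __init__(self, selector):
        super().__init__()
        # "div.kruimelspoor" -> tag "div", class "kruimelspoor"
        self.blocked_tag, _, self.blocked_class = selector.partition(".")
        self.depth = 0      # > 0 while inside a blacklisted element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag != self.blocked_tag:
            return
        classes = (dict(attrs).get("class") or "").split()
        matches = not self.blocked_class or self.blocked_class in classes
        # Count nested elements of the same tag so the matching end tag is found.
        if self.depth or matches:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.blocked_tag and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if not self.depth and data.strip():
            self.chunks.append(data.strip())


def stripped_content(html, selector):
    """Return the text of html without the elements matched by selector."""
    parser = BlacklistTextExtractor(selector)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


if __name__ == "__main__":
    page = (
        '<div class="kruimelspoor">Home &gt; News</div>'
        '<div class="content"><p>Article text</p></div>'
    )
    print(stripped_content(page, "div.kruimelspoor"))
```

Comparing such an expected value with what ends up in the content and strippedContent fields shows whether the blacklist was applied at all, or applied but written to a different field than the one being inspected.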

 [PARSE-HTML plugin] Block certain parts of HTML code from being indexed
 ---

 Key: NUTCH-585
 URL: https://issues.apache.org/jira/browse/NUTCH-585
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 0.9.0
 Environment: All operating systems
Reporter: Andrea Spinelli
Assignee: Markus Jelsma
Priority: Minor
 Fix For: 1.6

 Attachments: blacklist_whitelist_plugin.patch, 
 nutch-585-excludeNodes.patch, nutch-585-jostens-excludeDIVs.patch





[jira] [Commented] (NUTCH-1343) Crawl sites with hashtags in url

2012-05-01 Thread Roberto Gardenier (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13265751#comment-13265751
 ] 

Roberto Gardenier commented on NUTCH-1343:
--

Markus Jelsma,

I was notified that you have closed my JIRA ticket, changing its resolution 
status to Invalid.
I wonder why you closed my ticket and marked it invalid, as I did not 
commit any changes or solutions?

With kind regards,
Roberto Gardenier 


-Original Message-
From: Markus Jelsma (JIRA) [mailto:j...@apache.org] 
Sent: Tuesday, 1 May 2012 13:40
To: r.garden...@simgroep.nl
Subject: [jira] [Closed] (NUTCH-1343) Crawl sites with hashtags in url


 [ 
https://issues.apache.org/jira/browse/NUTCH-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma closed NUTCH-1343.


Resolution: Invalid


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira






 Crawl sites with hashtags in url
 

 Key: NUTCH-1343
 URL: https://issues.apache.org/jira/browse/NUTCH-1343
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 1.4
Reporter: Roberto Gardenier
Priority: Blocker

 Hello,
 I'm currently trying to crawl a site which uses hashtags in its URLs. I don't 
 seem to get any results and I'm hoping I'm just overlooking something.
 Site structure is as follows:
 http://domain.com (landing page)
 http://domain.com/#/page1
 http://domain.com/#/page1/subpage1
 http://domain.com/#/page2
 http://domain.com/#/page2/subpage1
 and so on.
 I've pointed Nutch to http://domain.com as the start URL and in my filter 
 I've placed all kinds of rules.
 First I thought this would be sufficient:
 +http\://domain\.com\/#
 But then I realised that # is used for comments, so I escaped it:
 +http\://domain\.com\/\#
 Still no results. So I thought I could use the asterisk for it:
 +http\://domain\.com\/*
 Still no luck. So I started trying various other regular expressions, but 
 without success.
 I noticed the following message in hadoop.log:
 INFO  anchor.AnchorIndexingFilter - Anchor deduplication is: off
 I've researched this setting but I don't know for sure whether it affects 
 my problem. This property is set to false in my configs.
 I don't know if this is even related to the situation above, but maybe it 
 helps.
 Any help is very much appreciated! I've tried googling the problem but I 
 couldn't find documentation or anyone else with this problem.
 Many thanks in advance.
 With kind regards,
 Roberto Gardenier
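
One thing that may be worth checking for the site structure above: the part of a URL after # is a fragment, which a client does not send to the server, and crawlers commonly drop it when normalizing URLs. The thread does not confirm that this is what happened here, so the Python sketch below is only an illustration using the standard library. The helper names are hypothetical, and treating a +regex rule as an unanchored search is an assumption about how the URL filter applies its rules.

```python
import re
from urllib.parse import urldefrag

# The example pages from the report above.
SITE_URLS = [
    "http://domain.com/#/page1",
    "http://domain.com/#/page1/subpage1",
    "http://domain.com/#/page2",
    "http://domain.com/#/page2/subpage1",
]


def fetch_target(url):
    """Return the URL a client would actually request (fragment removed)."""
    return urldefrag(url).url


def accepted_by(rule, url):
    """Apply a '+regex' rule as an unanchored search (assumed filter behaviour)."""
    if not rule.startswith("+"):
        raise ValueError("only '+' (accept) rules are handled in this sketch")
    return re.search(rule[1:], url) is not None


# Every listed page collapses to the same request target.
targets = {fetch_target(u) for u in SITE_URLS}

if __name__ == "__main__":
    print(targets)
    print(accepted_by(r"+http\://domain\.com\/*", SITE_URLS[0]))
```

If the fragments are indeed discarded, no filter rule can bring the pages back, because all of them look like the landing page by the time the rule is evaluated. In that case the content behind the fragments is produced by client-side scripting, and the crawler only sees it if the site also exposes it under ordinary URLs.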





[jira] [Commented] (NUTCH-1343) Crawl sites with hashtags in url

2012-05-01 Thread Roberto Gardenier (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13265759#comment-13265759
 ] 

Roberto Gardenier commented on NUTCH-1343:
--

Thank you for your response. I will check the mailing list for any replies. 
Thank you very much.

