[jira] Commented: (NUTCH-586) Add option to run compiled classes w/o job file
[ https://issues.apache.org/jira/browse/NUTCH-586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12548198 ]

Enis Soztutar commented on NUTCH-586:

Can someone review this?

Add option to run compiled classes w/o job file

Key: NUTCH-586
URL: https://issues.apache.org/jira/browse/NUTCH-586
Project: Nutch
Issue Type: New Feature
Affects Versions: 1.0.0
Reporter: Enis Soztutar
Assignee: Enis Soztutar
Fix For: 1.0.0
Attachments: run-core_v1.patch

bin/nutch adds nutch-*.job files under build and base directory to the classpath. However building the job file takes a long time. We have a target compile-core which builds only the core classes w/o plugins, but we need a way to run the compiled core class files. An option to bin/nutch to run the classes compiled with ant compile-core seems enough.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-585) [PARSE-HTML plugin] Block certain parts of HTML code from being indexed
[ https://issues.apache.org/jira/browse/NUTCH-585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12548217 ]

Andrea Spinelli commented on NUTCH-585:

I absolutely agree that a more general solution is needed; however, I think that some current Nutch users might benefit from a quick fix. If there is no opposition, I could submit a patch (less than 20 lines).

On the other hand, does anybody think that blocking selected portions of text could pose serious architectural or stability risks?

About the more general solution, do you think there is a viable path from here to there?

-- andrea

[PARSE-HTML plugin] Block certain parts of HTML code from being indexed

Key: NUTCH-585
URL: https://issues.apache.org/jira/browse/NUTCH-585
Project: Nutch
Issue Type: Improvement
Affects Versions: 0.9.0
Environment: All operating systems
Reporter: Andrea Spinelli
Priority: Minor

We are using Nutch to index our own web sites; we would like not to index certain parts of our pages, because we know they are not relevant (for instance, there are several links to change the background color) and generate spurious matches.

We have modified the plugin so that it ignores HTML code between certain HTML comments, like

<!-- START-IGNORE --> ... ignored part ... <!-- STOP-IGNORE -->

We feel this might be useful to someone else, maybe factoring out the comment strings as constants in the configuration files (say parser.html.ignore.start and parser.html.ignore.stop in nutch-site.xml).

We are almost ready to contribute our code snippet. Looking forward to any expression of interest, or to an explanation of why what we are doing is plain wrong!
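For readers who want to picture the marker-based exclusion described in NUTCH-585, here is a minimal sketch. It is not the patch Andrea mentions (that code is not shown in this thread) and it is not part of Nutch: the class name IgnoreBlockStripper is hypothetical, and it assumes the page is available as a raw string before parsing. The two marker strings are passed in by the caller, so they could come from the proposed parser.html.ignore.start and parser.html.ignore.stop properties.

```java
import java.util.regex.Pattern;

/** Hypothetical helper, not part of Nutch: drops text between two marker comments. */
class IgnoreBlockStripper {
    private final Pattern block;

    IgnoreBlockStripper(String startMarker, String stopMarker) {
        // Quote both markers so they match literally; DOTALL lets '.' span line breaks.
        this.block = Pattern.compile(
            Pattern.quote(startMarker) + ".*?" + Pattern.quote(stopMarker),
            Pattern.DOTALL);
    }

    /** Returns html with every start..stop block removed, markers included. */
    String strip(String html) {
        // The reluctant quantifier keeps two separate blocks from merging into one.
        return block.matcher(html).replaceAll("");
    }
}
```

A start marker with no matching stop marker is left untouched by this sketch; a real plugin would have to decide whether to ignore the rest of the page in that case.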
[jira] Commented: (NUTCH-586) Add option to run compiled classes w/o job file
[ https://issues.apache.org/jira/browse/NUTCH-586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12548233 ]

Andrzej Bialecki commented on NUTCH-586:

+1. I think you also need to put a comment, which clarifies that this works only in the local Hadoop mode.
[jira] Commented: (NUTCH-581) DistributedSearch does not update search servers added to search-servers.txt on the fly
[ https://issues.apache.org/jira/browse/NUTCH-581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12548252 ]

Doğacan Güney commented on NUTCH-581:

This patch conflicts with my patch in NUTCH-442 (which I really want to commit sometime) but that's my problem :). So +1 from me.

DistributedSearch does not update search servers added to search-servers.txt on the fly

Key: NUTCH-581
URL: https://issues.apache.org/jira/browse/NUTCH-581
Project: Nutch
Issue Type: Improvement
Components: searcher
Affects Versions: 0.9.0
Reporter: Rohan Mehta
Priority: Minor
Fix For: 0.9.0
Attachments: NUTCH-581-2.patch, UpdateSearch.patch

DistributedSearch client updates the search servers added to the search-servers.txt file on the fly. This patch updates the search servers on the fly, so the client does not need a restart.
[jira] Updated: (NUTCH-586) Add option to run compiled classes w/o job file
[ https://issues.apache.org/jira/browse/NUTCH-586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Enis Soztutar updated NUTCH-586:

Attachment: run-core_v2.patch

> I think you also need to put a comment, which clarifies that this works only in the local Hadoop mode.

Agreed. This patch addresses that.
[jira] Created: (NUTCH-588) Help Need
Help Need

Key: NUTCH-588
URL: https://issues.apache.org/jira/browse/NUTCH-588
Project: Nutch
Issue Type: Task
Components: indexer
Affects Versions: 0.7.2
Environment: Linux
Reporter: Teccon Ingenieros

Hello,

We are trying to index a Word file. If we use the static URL (/servlet/jsp/documento.doc) it works OK, but if we try the same with a dynamic URL that generates that file (/servlet/jsp/leerFichero.jspid=112) it doesn't work: the URL is not indexed. What can we do?

Regards,
[jira] Resolved: (NUTCH-581) DistributedSearch does not update search servers added to search-servers.txt on the fly
[ https://issues.apache.org/jira/browse/NUTCH-581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dennis Kubes resolved NUTCH-581.

Resolution: Fixed

Patch committed. This patch checks the modified time of the search-servers.txt file and automatically reloads the file if it has changed. This allows adding and removing search servers on the fly. Thanks Rohan.
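The reload-on-change idea in the resolution note can be sketched roughly as follows. This is not the committed patch: the class ReloadingServerList is hypothetical, it uses java.nio.file rather than whatever the patch uses, and it treats each non-blank line of the file as one opaque server entry, which is an assumption about the file format.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.FileTime;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: re-reads a server list whenever the file's modified time changes. */
class ReloadingServerList {
    private final Path file;
    private FileTime lastLoaded;                        // null until the first load
    private List<String> servers = new ArrayList<>();

    ReloadingServerList(Path file) {
        this.file = file;
    }

    /** Returns the current entries, reloading first if the file changed since the last call. */
    synchronized List<String> get() throws IOException {
        FileTime current = Files.getLastModifiedTime(file);
        if (lastLoaded == null || !current.equals(lastLoaded)) {
            List<String> fresh = new ArrayList<>();
            for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
                String trimmed = line.trim();
                if (!trimmed.isEmpty()) {
                    fresh.add(trimmed);                 // one entry per non-blank line (assumed format)
                }
            }
            servers = fresh;
            lastLoaded = current;
        }
        return new ArrayList<>(servers);                // copy, so callers cannot mutate our state
    }
}
```

Checking the timestamp on each call keeps the sketch simple. On file systems with coarse timestamp granularity, two writes in quick succession may not be noticed until a later modification.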
[jira] Commented: (NUTCH-585) [PARSE-HTML plugin] Block certain parts of HTML code from being indexed
[ https://issues.apache.org/jira/browse/NUTCH-585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12548420 ]

Matt Kangas commented on NUTCH-585:

Simplest path forward... that I can think of:

1) Add a new indexing plugin extension-point for filtering page content.
2) Put your a priori marked-up content exclusion logic into a plugin.
3) Someone else figures out a more general-purpose solution later, and swaps out your plugin at that time.

Ergo, you generalize the interface, and lazy-load the more general implementation. :-)
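One way to read Matt's three steps is as a small interface plus a chain that runs whatever implementations are registered. The sketch below is only an illustration of that shape: ContentFilter and ContentFilterChain are hypothetical names, Nutch defines no extension point with this signature, and real plugin registration would go through Nutch's plugin system rather than an add() call.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical extension point: decides which page text goes on to indexing. */
interface ContentFilter {
    /** Returns the text that should remain after filtering. */
    String filter(String url, String content);
}

/** Hypothetical chain that applies every registered filter in registration order. */
class ContentFilterChain {
    private final List<ContentFilter> filters = new ArrayList<>();

    ContentFilterChain add(ContentFilter f) {
        filters.add(f);
        return this;                                    // allow chained registration
    }

    String apply(String url, String content) {
        String result = content;
        for (ContentFilter f : filters) {
            result = f.filter(url, result);             // each filter sees the previous output
        }
        return result;
    }
}
```

With this shape, the marker-based exclusion becomes one ContentFilter implementation, and a later general-purpose solution can replace it without callers changing.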
[jira] Created: (NUTCH-589) Hierarchical Classloaders
Hierarchical Classloaders

Key: NUTCH-589
URL: https://issues.apache.org/jira/browse/NUTCH-589
Project: Nutch
Issue Type: Improvement
Affects Versions: 0.9.0
Reporter: Ryan Levering
Priority: Minor

Currently the Nutch plugin classloader flattens all the jars from a plugin's dependencies and instantiates a new classloader for each plugin. I think it would be better to create a hierarchical classloader chain. Currently plugins can't pass objects from a common plugin to one another because the objects are created using different classloaders. Nutch currently avoids this by only using interfaces from a common classloader to pass objects between plugins, but I can't see the harm in improving the plugin classloader.

It would require a change to PluginDescription and PluginClassLoader in order to override ClassLoader to maintain the export filter functionality that currently exists.
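To make the proposal concrete, here is a minimal sketch of a two-level loader chain built from the JDK's URLClassLoader. It is not Nutch's PluginClassLoader and it does not reproduce the export filter mentioned above; PluginLoaders and its method names are hypothetical. The point it illustrates is that when two plugin loaders share one parent, a class found by the parent is the same Class object in both plugins, so instances can be passed between them.

```java
import java.net.URL;
import java.net.URLClassLoader;

/** Hypothetical sketch of a hierarchical loader chain; not Nutch's PluginClassLoader. */
class PluginLoaders {

    /** Loader for jars shared by several plugins; its parent is the application loader. */
    static URLClassLoader common(URL[] sharedJars) {
        return new URLClassLoader(sharedJars, PluginLoaders.class.getClassLoader());
    }

    /**
     * Loader for one plugin's private jars. URLClassLoader asks its parent first,
     * so classes from the shared jars are resolved once, by the common loader.
     */
    static URLClassLoader forPlugin(URL[] pluginJars, ClassLoader common) {
        return new URLClassLoader(pluginJars, common);
    }
}
```

In the flattened arrangement the issue describes, each plugin loader would load its own copy of a shared class, and the two copies would be incompatible types even though they come from the same jar.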
Nutch\nutch-0.9\build.xml:61: Specify at least one source--a file or resource collection.
I want to develop a simple plugin, but when I run ant I get the error:

build.xml:61: Specify at least one source--a file or resource collection.

Could anyone tell me how to fix it? Thank you very much.

quxy 2007-12-05
[jira] Commented: (NUTCH-581) DistributedSearch does not update search servers added to search-servers.txt on the fly
[ https://issues.apache.org/jira/browse/NUTCH-581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12548541 ]

Hudson commented on NUTCH-581:

Integrated in Nutch-Nightly #285 (See [http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/285/])