[jira] Commented: (NUTCH-563) Include custom fields in BasicQueryFilter
[ https://issues.apache.org/jira/browse/NUTCH-563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12651208#action_12651208 ]

Jasper Kamperman commented on NUTCH-563:

Hi Davide,

My laptop, which has nutch-0.9 on it, is in the shop, so I can't verify where that file is, but I think it is quite possible that nutch-0.8 doesn't yet have a BasicQueryFilter.java file. Sorry I can't be of more help.

I'm CC'ing the original author of the patch, but he just became a father, so it might be a while until you hear from him :-).

Jasper

> Include custom fields in BasicQueryFilter
> -----------------------------------------
>
> Key: NUTCH-563
> URL: https://issues.apache.org/jira/browse/NUTCH-563
> Project: Nutch
> Issue Type: New Feature
> Components: searcher
> Reporter: julien nioche
> Priority: Minor
> Fix For: 0.9.0
>
> Attachments: diff.BasicQueryFilter.dynamicFields.txt
>
> This patch allows additional fields to be included in the BasicQueryFilter
> by specifying runtime parameters. Any parameter matching the regular
> expression ("query\\.basic\\.(.+)\\.boost") will be added to the list of
> fields used by the BQF, and the specified float value will be used as the
> boost.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
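The parameter matching described in the NUTCH-563 summary can be illustrated with a small sketch. This is not the attached patch: it uses a plain Map as a stand-in for the Nutch configuration, and the class name, method name, and the "author" and "title" keys are hypothetical. Only the regular expression is taken from the issue text.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not the NUTCH-563 patch itself.
class BoostFieldSketch {
    // The pattern quoted in the issue description, as a Java string literal.
    private static final Pattern BOOST_KEY =
        Pattern.compile("query\\.basic\\.(.+)\\.boost");

    /** Returns field name -> boost for every property whose key matches the pattern. */
    static Map<String, Float> extractBoosts(Map<String, String> props) {
        Map<String, Float> boosts = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : props.entrySet()) {
            Matcher m = BOOST_KEY.matcher(e.getKey());
            if (m.matches()) {
                // Group 1 is the field name; the property value is the boost.
                boosts.put(m.group(1), Float.parseFloat(e.getValue()));
            }
        }
        return boosts;
    }

    public static void main(String[] args) {
        Map<String, String> props = new LinkedHashMap<>();
        props.put("query.basic.author.boost", "2.5"); // hypothetical custom field
        props.put("query.basic.title.boost", "1.5");  // hypothetical custom field
        props.put("searcher.dir", "crawl");           // ignored: key does not match
        System.out.println(extractBoosts(props));
    }
}
```

A real filter would presumably read these keys from the Nutch configuration object rather than a Map; that part is left out here because the patch contents are not shown in this thread.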
Pending Commits for Nutch Issues
If nobody has a problem with them, I would like to commit the following issues in the next day or two:

NUTCH-663: Upgrade Nutch to the most recent Hadoop version (0.19)
NUTCH-662: Upgrade Nutch to the most recent Lucene version (2.4)
NUTCH-647: Resolve URLs tool
NUTCH-665: Search Load Testing Tool
NUTCH-667: Input Format for working with Content in Hadoop Streaming

And I would like to commit these in < a week:

NUTCH-635: LinkAnalysis Tool for Nutch
NUTCH-646: New Indexing framework for Nutch
NUTCH-594: Serve Nutch search results in XML and JSON
NUTCH-666: Analysis plugins and new language identifier.

There are others too, but these are the ones I am trying to get moved into trunk right now.

Dennis
[jira] Updated: (NUTCH-646) New Indexing Framework for Nutch
[ https://issues.apache.org/jira/browse/NUTCH-646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dennis Kubes updated NUTCH-646:

Attachment: NUTCH-646-2-20081126.patch

Updated indexing patch.

> New Indexing Framework for Nutch
> --------------------------------
>
> Key: NUTCH-646
> URL: https://issues.apache.org/jira/browse/NUTCH-646
> Project: Nutch
> Issue Type: New Feature
> Components: indexer
> Affects Versions: 0.9.0
> Environment: All
> Reporter: Dennis Kubes
> Assignee: Dennis Kubes
> Fix For: 0.9.0, 1.0.0
>
> Attachments: arity-1.3.2.jar, NUTCH-646-1-20080818.patch,
> NUTCH-646-2-20081126.patch
>
> New indexing framework for Nutch that provides a more generic field
> abstraction consistent with Lucene index semantics. Allows multiple MR jobs
> to be created for different fields and those fields to be aggregated and
> indexed in the end. Overcomes a limitation of the current indexer, which
> restricts what databases can be passed into the indexer. Also creates a new
> extension point for field filters, for manipulation of fields during the
> indexing process.
[jira] Commented: (NUTCH-563) Include custom fields in BasicQueryFilter
[ https://issues.apache.org/jira/browse/NUTCH-563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12651152#action_12651152 ]

Davide commented on NUTCH-563:

Hi Jasper,

could you explain to me how to apply it? I can't find the right file to apply the diff to.

Thank you very much!

> Include custom fields in BasicQueryFilter
> -----------------------------------------
>
> Key: NUTCH-563
> URL: https://issues.apache.org/jira/browse/NUTCH-563
> Project: Nutch
> Issue Type: New Feature
> Components: searcher
> Reporter: julien nioche
> Priority: Minor
> Fix For: 0.9.0
>
> Attachments: diff.BasicQueryFilter.dynamicFields.txt
>
> This patch allows additional fields to be included in the BasicQueryFilter
> by specifying runtime parameters. Any parameter matching the regular
> expression ("query\\.basic\\.(.+)\\.boost") will be added to the list of
> fields used by the BQF, and the specified float value will be used as the
> boost.
[Nutch Wiki] Update of "PluginCentral" by johnroman
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The following page has been changed by johnroman:
http://wiki.apache.org/nutch/PluginCentral

--
   * WritingPluginExample - A step-by-step example of how to write a plugin for the 0.7 branch. - updated by LucasBoullosa
   * [http://wiki.media-style.com/display/nutchDocu/Write+a+plugin Writing Plugins] - by Stefan

- == Plugins that Come with Nutch (0.7) ==
+ == Plugins that Come with Nutch (0.9) ==
  In order to get Nutch to use any of these plugins, you just need to edit your conf/nutch-site.xml file and add the name of the plugin to the list of plugin.includes.

@@ -24, +24 @@
   * '''parse-html''' - Parses HTML documents
   * '''parse-js''' - Parses Java``Script
   * '''parse-mp3''' - Parses MP3s
+  * '''parse-zip''' - Parses ZIP archives
+  * '''parse-mspowerpoint''' - Parses Microsoft Powerpoint files
   * '''parse-msword''' - Parses MS Word documents
+  * '''parse-msexcel''' - Parses MS Excel documents
   * '''parse-pdf''' - Parses PDFs
   * '''parse-rss''' - Parses RSS feeds
+  * '''parse-oo''' - Parses OpenOffice files
+  * '''parse-swf''' - Parses Shockwave Flash
   * '''parse-rtf''' - Parses RTF files
   * '''parse-text''' - Parses text documents
   * '''protocol-file''' - Retreives documents from the filesystem

@@ -47, +52 @@
   * '''lib-commons-httpclient'''
   * '''lib-http'''
   * '''lib-jakarta-poi'''
-  * '''lib-log4j'''
+  * '''lib-log4j'''
-  * '''lib-lucene-analyzers'''
+  * '''lib-lucene-analyzers''' - Lucene analyzers
-  * '''lib-nekohtml'''
-  * '''lib-parsems'''
+  * '''lib-nekohtml''' - automatic tag balancer
+  * '''lib-parsems''' - parse ms documents framework
   * '''parse-msexcel''' - Parses MS Excel documents
   * '''parse-mspowerpoint''' - Parses MS Powerpoint documents
   * '''parse-oo''' - Parses Open Office and Star Office documents (Extentsions: ODT, OTT, ODH, ODM, ODS, OTS, ODP, OTP, SXW, STW, SXC, STC, SXI, STI)
[jira] Commented: (NUTCH-563) Include custom fields in BasicQueryFilter
[ https://issues.apache.org/jira/browse/NUTCH-563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12651093#action_12651093 ]

Jasper Kamperman commented on NUTCH-563:

Hi Davide,

I never tried to apply it to 0.8, sorry.

Jasper

> Include custom fields in BasicQueryFilter
> -----------------------------------------
>
> Key: NUTCH-563
> URL: https://issues.apache.org/jira/browse/NUTCH-563
> Project: Nutch
> Issue Type: New Feature
> Components: searcher
> Reporter: julien nioche
> Priority: Minor
> Fix For: 0.9.0
>
> Attachments: diff.BasicQueryFilter.dynamicFields.txt
>
> This patch allows additional fields to be included in the BasicQueryFilter
> by specifying runtime parameters. Any parameter matching the regular
> expression ("query\\.basic\\.(.+)\\.boost") will be added to the list of
> fields used by the BQF, and the specified float value will be used as the
> boost.
Troubles while creating a plugin
Hello,

I am creating a plugin for Nutch that extends the QueryFilter. I get a successful compilation with "ant" and "ant war", but when I do a search, I get the following exception:

26/11/2008 18:50:07 org.apache.catalina.core.StandardWrapperValve invoke
SEVERE: Servlet.service() for servlet jsp threw exception
java.lang.NoClassDefFoundError: org/apache/commons/codec/DecoderException
        at org.apache.tika.mime.MimeTypesReader.readMatch(MimeTypesReader.java:272)
        at org.apache.tika.mime.MimeTypesReader.readMatches(MimeTypesReader.java:221)
        at org.apache.tika.mime.MimeTypesReader.readMagic(MimeTypesReader.java:201)
        at org.apache.tika.mime.MimeTypesReader.readMimeType(MimeTypesReader.java:164)
        at org.apache.tika.mime.MimeTypesReader.read(MimeTypesReader.java:138)
        at org.apache.tika.mime.MimeTypesReader.read(MimeTypesReader.java:121)
        at org.apache.tika.mime.MimeTypesFactory.create(MimeTypesFactory.java:56)
        at org.apache.nutch.util.MimeUtil.<init>(MimeUtil.java:62)
        at org.apache.nutch.protocol.Content.<init>(Content.java:85)
        at org.apache.nutch.personalizedsearch.searcher.context.ContextQueryFilter.filter(ContextQueryFilter.java:55)
        at org.apache.nutch.searcher.QueryFilters.filter(QueryFilters.java:111)
        at org.apache.nutch.searcher.IndexSearcher.search(IndexSearcher.java:96)
        at org.apache.nutch.searcher.NutchBean.search(NutchBean.java:251)
        at org.apache.jsp.search_jsp._jspService(search_jsp.java:284)
        at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:70)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:803)
        at org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:393)
        at org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:320)
        at org.apache.jasper.servlet.JspServlet.service(JspServlet.java:266)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:803)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:290)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
        at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
        at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:175)
        at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128)
        at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
        at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
        at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:263)
        at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:844)
        at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:584)
        at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:447)
        at java.lang.Thread.run(Thread.java:619)

The DecoderException class is in commons-codec-1.3.jar, so I added the jar file to my plugin.xml:

But the same error appears. Any idea on what I may be doing wrong?

Thanks.
[jira] Updated: (NUTCH-667) Input Format for working with Content in Hadoop Streaming
[ https://issues.apache.org/jira/browse/NUTCH-667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dennis Kubes updated NUTCH-667:

Summary: Input Format for working with Content in Hadoop Streaming (was: Input Forma for working with Content in Hadoop Streaming)

> Input Format for working with Content in Hadoop Streaming
> ---------------------------------------------------------
>
> Key: NUTCH-667
> URL: https://issues.apache.org/jira/browse/NUTCH-667
> Project: Nutch
> Issue Type: Improvement
> Affects Versions: 1.0.0
> Environment: All
> Reporter: Dennis Kubes
> Assignee: Dennis Kubes
> Priority: Minor
> Fix For: 1.0.0
>
> Attachments: NUTCH-667-1-20081126.patch
>
> This is a ContextAsText input format that replaces line endings with
> spaces, which allows Nutch content to be used more effectively inside
> Hadoop streaming jobs. Streaming allows MapReduce jobs to be written in any
> language that can communicate through stdin and stdout.
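As a rough illustration of why the line endings matter: Hadoop streaming hands each record to the external program as one line of text, so content that spans several lines has to be flattened first. The sketch below shows only that flattening step. It is not the code in the attached patch, the class and method names are hypothetical, and using the URL as the key is an assumption.

```java
// Hypothetical sketch of the line-flattening idea, not the NUTCH-667 patch.
class ContentLineSketch {
    /** Replaces every line ending (\r\n, \r or \n) with a single space. */
    static String flatten(String content) {
        return content.replaceAll("\\r\\n|\\r|\\n", " ");
    }

    /** Builds one streaming record: key, a tab, then the flattened content. */
    static String toStreamingLine(String url, String content) {
        return url + "\t" + flatten(content);
    }

    public static void main(String[] args) {
        String html = "<html>\n<body>hi</body>\r\n</html>";
        System.out.println(toStreamingLine("http://example.com/", html));
    }
}
```

The tab between key and value matches the default separator used by Hadoop streaming; a script reading stdin can then split each line once on the first tab.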
[jira] Updated: (NUTCH-667) Input Forma for working with Content in Hadoop Streaming
[ https://issues.apache.org/jira/browse/NUTCH-667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dennis Kubes updated NUTCH-667:

Attachment: NUTCH-667-1-20081126.patch

Input format for working with Hadoop streaming.

> Input Forma for working with Content in Hadoop Streaming
> --------------------------------------------------------
>
> Key: NUTCH-667
> URL: https://issues.apache.org/jira/browse/NUTCH-667
> Project: Nutch
> Issue Type: Improvement
> Affects Versions: 1.0.0
> Environment: All
> Reporter: Dennis Kubes
> Assignee: Dennis Kubes
> Priority: Minor
> Fix For: 1.0.0
>
> Attachments: NUTCH-667-1-20081126.patch
>
> This is a ContextAsText input format that replaces line endings with
> spaces, which allows Nutch content to be used more effectively inside
> Hadoop streaming jobs. Streaming allows MapReduce jobs to be written in any
> language that can communicate through stdin and stdout.
[jira] Created: (NUTCH-667) Input Forma for working with Content in Hadoop Streaming
Input Forma for working with Content in Hadoop Streaming

Key: NUTCH-667
URL: https://issues.apache.org/jira/browse/NUTCH-667
Project: Nutch
Issue Type: Improvement
Affects Versions: 1.0.0
Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
Priority: Minor
Fix For: 1.0.0

This is a ContextAsText input format that replaces line endings with spaces, which allows Nutch content to be used more effectively inside Hadoop streaming jobs. Streaming allows MapReduce jobs to be written in any language that can communicate through stdin and stdout.
[jira] Updated: (NUTCH-635) LinkAnalysis Tool for Nutch
[ https://issues.apache.org/jira/browse/NUTCH-635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dennis Kubes updated NUTCH-635:

Attachment: (was: NUTCH-635-8-20080818.patch)

> LinkAnalysis Tool for Nutch
> ---------------------------
>
> Key: NUTCH-635
> URL: https://issues.apache.org/jira/browse/NUTCH-635
> Project: Nutch
> Issue Type: New Feature
> Affects Versions: 1.0.0
> Environment: All
> Reporter: Dennis Kubes
> Assignee: Dennis Kubes
> Fix For: 1.0.0
>
> Attachments: NUTCH-635-1-20080612.patch, NUTCH-635-2-20080613.patch,
> NUTCH-635-3-20080614.patch, NUTCH-635-4-20080615.patch,
> NUTCH-635-5-20080620.patch, NUTCH-635-6-20080725.patch,
> NUTCH-635-7-20080808.patch, NUTCH-635-9-20081126.patch
>
> This is a basic PageRank-type link analysis tool for Nutch which simulates
> a sparse matrix using inlinks and outlinks and converges after a given
> number of iterations. This tool is meant to replace the current scoring
> system in Nutch with a system that converges instead of exponentially
> increasing scores. Also includes a tool to create an outlinkdb.
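For readers unfamiliar with the idea, the following is a minimal in-memory sketch of a textbook PageRank iteration. It is not the algorithm from the NUTCH-635 patches, which run as MapReduce jobs over inlinks and outlinks; the class name, the damping parameter, and the handling of pages without outlinks are all assumptions made for illustration.

```java
import java.util.Arrays;

// Hypothetical textbook PageRank sketch, not the NUTCH-635 implementation.
class LinkRankSketch {
    /**
     * outlinks[i] lists the pages that page i links to.
     * Runs a fixed number of iterations and returns one score per page.
     */
    static double[] rank(int[][] outlinks, int iterations, double damping) {
        int n = outlinks.length;
        double[] scores = new double[n];
        Arrays.fill(scores, 1.0 / n); // start from a uniform distribution
        for (int it = 0; it < iterations; it++) {
            double[] next = new double[n];
            Arrays.fill(next, (1.0 - damping) / n);
            for (int page = 0; page < n; page++) {
                if (outlinks[page].length == 0) {
                    // Assumption: a page with no outlinks spreads its score evenly.
                    for (int j = 0; j < n; j++) {
                        next[j] += damping * scores[page] / n;
                    }
                } else {
                    double share = damping * scores[page] / outlinks[page].length;
                    for (int target : outlinks[page]) {
                        next[target] += share;
                    }
                }
            }
            scores = next;
        }
        return scores;
    }

    public static void main(String[] args) {
        int[][] graph = { {1, 2}, {2}, {0} }; // tiny hypothetical link graph
        System.out.println(Arrays.toString(rank(graph, 20, 0.85)));
    }
}
```

Because each iteration redistributes a fixed total score, the values stay bounded, which is the property the issue contrasts with exponentially increasing scores.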
[jira] Updated: (NUTCH-635) LinkAnalysis Tool for Nutch
[ https://issues.apache.org/jira/browse/NUTCH-635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dennis Kubes updated NUTCH-635:

Attachment: NUTCH-635-9-20081126.patch

Updated final patch for new link analysis framework. I am also going to write up some documentation on the wiki for how this new process works.

> LinkAnalysis Tool for Nutch
> ---------------------------
>
> Key: NUTCH-635
> URL: https://issues.apache.org/jira/browse/NUTCH-635
> Project: Nutch
> Issue Type: New Feature
> Affects Versions: 1.0.0
> Environment: All
> Reporter: Dennis Kubes
> Assignee: Dennis Kubes
> Fix For: 1.0.0
>
> Attachments: NUTCH-635-1-20080612.patch, NUTCH-635-2-20080613.patch,
> NUTCH-635-3-20080614.patch, NUTCH-635-4-20080615.patch,
> NUTCH-635-5-20080620.patch, NUTCH-635-6-20080725.patch,
> NUTCH-635-7-20080808.patch, NUTCH-635-9-20081126.patch
>
> This is a basic PageRank-type link analysis tool for Nutch which simulates
> a sparse matrix using inlinks and outlinks and converges after a given
> number of iterations. This tool is meant to replace the current scoring
> system in Nutch with a system that converges instead of exponentially
> increasing scores. Also includes a tool to create an outlinkdb.
[jira] Updated: (NUTCH-663) Upgrade Nutch to use Hadoop 0.19
[ https://issues.apache.org/jira/browse/NUTCH-663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dennis Kubes updated NUTCH-663:

Attachment: (was: NUTCH-663-1-20081126.patch)

> Upgrade Nutch to use Hadoop 0.19
> --------------------------------
>
> Key: NUTCH-663
> URL: https://issues.apache.org/jira/browse/NUTCH-663
> Project: Nutch
> Issue Type: Improvement
> Affects Versions: 1.0.0
> Environment: All
> Reporter: Dennis Kubes
> Assignee: Dennis Kubes
> Fix For: 1.0.0
>
> Attachments: hadoop-0.19-native.tar.gz, hadoop-0.19.0-core.jar,
> NUTCH-663-1-20081126.patch
>
> Upgrade Nutch to use a newer hadoop, version 0.18.2. This includes
> performance improvements, bug fixes, and new functionality. Changes some
> current APIs.
[jira] Updated: (NUTCH-663) Upgrade Nutch to use Hadoop 0.19
[ https://issues.apache.org/jira/browse/NUTCH-663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dennis Kubes updated NUTCH-663:

Attachment: NUTCH-663-1-20081126.patch

Updated patch to include API changes in Nutch classes.

> Upgrade Nutch to use Hadoop 0.19
> --------------------------------
>
> Key: NUTCH-663
> URL: https://issues.apache.org/jira/browse/NUTCH-663
> Project: Nutch
> Issue Type: Improvement
> Affects Versions: 1.0.0
> Environment: All
> Reporter: Dennis Kubes
> Assignee: Dennis Kubes
> Fix For: 1.0.0
>
> Attachments: hadoop-0.19-native.tar.gz, hadoop-0.19.0-core.jar,
> NUTCH-663-1-20081126.patch
>
> Upgrade Nutch to use a newer hadoop, version 0.18.2. This includes
> performance improvements, bug fixes, and new functionality. Changes some
> current APIs.
[jira] Updated: (NUTCH-666) Analysis plugins for multiple language and new Language Identifier Tool
[ https://issues.apache.org/jira/browse/NUTCH-666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dennis Kubes updated NUTCH-666:

Attachment: NUTCH-666-1-20081126.patch

Fixed patch. Now includes the changes to AnalyzerFactory to allow multiple languages per plugin.

> Analysis plugins for multiple language and new Language Identifier Tool
> -----------------------------------------------------------------------
>
> Key: NUTCH-666
> URL: https://issues.apache.org/jira/browse/NUTCH-666
> Project: Nutch
> Issue Type: Improvement
> Affects Versions: 1.0.0
> Environment: All
> Reporter: Dennis Kubes
> Assignee: Dennis Kubes
> Fix For: 1.0.0
>
> Attachments: NUTCH-666-1-20081126.patch
>
> Add analysis plugins for Czech, Greek, Japanese, Chinese, Korean, Dutch,
> Russian, and Thai. Also includes a new Language Identifier tool that uses
> the new indexing framework in NUTCH-646.
[jira] Updated: (NUTCH-666) Analysis plugins for multiple language and new Language Identifier Tool
[ https://issues.apache.org/jira/browse/NUTCH-666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dennis Kubes updated NUTCH-666:

Attachment: (was: NUTCH-666-1-20081126.patch)

> Analysis plugins for multiple language and new Language Identifier Tool
> -----------------------------------------------------------------------
>
> Key: NUTCH-666
> URL: https://issues.apache.org/jira/browse/NUTCH-666
> Project: Nutch
> Issue Type: Improvement
> Affects Versions: 1.0.0
> Environment: All
> Reporter: Dennis Kubes
> Assignee: Dennis Kubes
> Fix For: 1.0.0
>
> Attachments: NUTCH-666-1-20081126.patch
>
> Add analysis plugins for Czech, Greek, Japanese, Chinese, Korean, Dutch,
> Russian, and Thai. Also includes a new Language Identifier tool that uses
> the new indexing framework in NUTCH-646.
[jira] Updated: (NUTCH-663) Upgrade Nutch to use Hadoop 0.19
[ https://issues.apache.org/jira/browse/NUTCH-663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dennis Kubes updated NUTCH-663:

Summary: Upgrade Nutch to use Hadoop 0.19 (was: Upgrade Nutch to use Hadoop 0.18.2)

change to 0.19 instead of 0.18.2

> Upgrade Nutch to use Hadoop 0.19
> --------------------------------
>
> Key: NUTCH-663
> URL: https://issues.apache.org/jira/browse/NUTCH-663
> Project: Nutch
> Issue Type: Improvement
> Affects Versions: 1.0.0
> Environment: All
> Reporter: Dennis Kubes
> Assignee: Dennis Kubes
> Fix For: 1.0.0
>
> Attachments: hadoop-0.19-native.tar.gz, hadoop-0.19.0-core.jar,
> NUTCH-663-1-20081126.patch
>
> Upgrade Nutch to use a newer hadoop, version 0.18.2. This includes
> performance improvements, bug fixes, and new functionality. Changes some
> current APIs.
[jira] Updated: (NUTCH-663) Upgrade Nutch to use Hadoop 0.18.2
[ https://issues.apache.org/jira/browse/NUTCH-663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dennis Kubes updated NUTCH-663:

Attachment: hadoop-0.19.0-core.jar

Hadoop core jar

> Upgrade Nutch to use Hadoop 0.18.2
> ----------------------------------
>
> Key: NUTCH-663
> URL: https://issues.apache.org/jira/browse/NUTCH-663
> Project: Nutch
> Issue Type: Improvement
> Affects Versions: 1.0.0
> Environment: All
> Reporter: Dennis Kubes
> Assignee: Dennis Kubes
> Fix For: 1.0.0
>
> Attachments: hadoop-0.19-native.tar.gz, hadoop-0.19.0-core.jar,
> NUTCH-663-1-20081126.patch
>
> Upgrade Nutch to use a newer hadoop, version 0.18.2. This includes
> performance improvements, bug fixes, and new functionality. Changes some
> current APIs.
[jira] Updated: (NUTCH-663) Upgrade Nutch to use Hadoop 0.18.2
[ https://issues.apache.org/jira/browse/NUTCH-663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dennis Kubes updated NUTCH-663:

Attachment: NUTCH-663-1-20081126.patch

Updates jar and native files

> Upgrade Nutch to use Hadoop 0.18.2
> ----------------------------------
>
> Key: NUTCH-663
> URL: https://issues.apache.org/jira/browse/NUTCH-663
> Project: Nutch
> Issue Type: Improvement
> Affects Versions: 1.0.0
> Environment: All
> Reporter: Dennis Kubes
> Assignee: Dennis Kubes
> Fix For: 1.0.0
>
> Attachments: hadoop-0.19-native.tar.gz, NUTCH-663-1-20081126.patch
>
> Upgrade Nutch to use a newer hadoop, version 0.18.2. This includes
> performance improvements, bug fixes, and new functionality. Changes some
> current APIs.
[jira] Updated: (NUTCH-663) Upgrade Nutch to use Hadoop 0.18.2
[ https://issues.apache.org/jira/browse/NUTCH-663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dennis Kubes updated NUTCH-663:

Attachment: hadoop-0.19-native.tar.gz

Native files

> Upgrade Nutch to use Hadoop 0.18.2
> ----------------------------------
>
> Key: NUTCH-663
> URL: https://issues.apache.org/jira/browse/NUTCH-663
> Project: Nutch
> Issue Type: Improvement
> Affects Versions: 1.0.0
> Environment: All
> Reporter: Dennis Kubes
> Assignee: Dennis Kubes
> Fix For: 1.0.0
>
> Attachments: hadoop-0.19-native.tar.gz, NUTCH-663-1-20081126.patch
>
> Upgrade Nutch to use a newer hadoop, version 0.18.2. This includes
> performance improvements, bug fixes, and new functionality. Changes some
> current APIs.
[jira] Updated: (NUTCH-666) Analysis plugins for multiple language and new Language Identifier Tool
[ https://issues.apache.org/jira/browse/NUTCH-666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dennis Kubes updated NUTCH-666:

Attachment: NUTCH-666-1-20081126.patch

Part one of patch. This includes the new analyzers for different languages. Part two will include the new language identifier tool.

> Analysis plugins for multiple language and new Language Identifier Tool
> -----------------------------------------------------------------------
>
> Key: NUTCH-666
> URL: https://issues.apache.org/jira/browse/NUTCH-666
> Project: Nutch
> Issue Type: Improvement
> Affects Versions: 1.0.0
> Environment: All
> Reporter: Dennis Kubes
> Assignee: Dennis Kubes
> Fix For: 1.0.0
>
> Attachments: NUTCH-666-1-20081126.patch
>
> Add analysis plugins for Czech, Greek, Japanese, Chinese, Korean, Dutch,
> Russian, and Thai. Also includes a new Language Identifier tool that uses
> the new indexing framework in NUTCH-646.
[jira] Created: (NUTCH-666) Analysis plugins for multiple language and new Language Identifier Tool
Analysis plugins for multiple language and new Language Identifier Tool

Key: NUTCH-666
URL: https://issues.apache.org/jira/browse/NUTCH-666
Project: Nutch
Issue Type: Improvement
Affects Versions: 1.0.0
Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
Fix For: 1.0.0

Add analysis plugins for Czech, Greek, Japanese, Chinese, Korean, Dutch, Russian, and Thai. Also includes a new Language Identifier tool that uses the new indexing framework in NUTCH-646.
[jira] Updated: (NUTCH-647) Resolve URLs tool
[ https://issues.apache.org/jira/browse/NUTCH-647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dennis Kubes updated NUTCH-647:

Attachment: NUTCH-647-2-20081126.patch

Updated patch.

> Resolve URLs tool
> -----------------
>
> Key: NUTCH-647
> URL: https://issues.apache.org/jira/browse/NUTCH-647
> Project: Nutch
> Issue Type: New Feature
> Environment: All
> Reporter: Dennis Kubes
> Assignee: Dennis Kubes
>
> Attachments: NUTCH-647-1-20080818.patch, NUTCH-647-2-20081126.patch
>
> A tool that takes a list of URLs and attempts to resolve their IP
> addresses. Useful for running after the fetcher has run to determine
> whether DNS problems exist.
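The basic check such a tool performs can be sketched with the standard java.net classes. This is only an illustration of resolving a URL's host to an IP address, not the code in the NUTCH-647 patch; the class name, method name, and output format are hypothetical.

```java
import java.net.InetAddress;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.UnknownHostException;

// Hypothetical sketch, not the NUTCH-647 tool.
class ResolveUrlsSketch {
    /** Returns the IP address of the URL's host, or null if it cannot be resolved. */
    static String resolve(String url) {
        try {
            String host = new URI(url).getHost();
            if (host == null) {
                return null; // no host component in this URL
            }
            return InetAddress.getByName(host).getHostAddress();
        } catch (URISyntaxException | UnknownHostException e) {
            return null; // malformed URL or DNS lookup failure
        }
    }

    public static void main(String[] args) {
        // Each command-line argument is treated as one URL to check.
        for (String url : args) {
            String ip = resolve(url);
            System.out.println(url + "\t" + (ip == null ? "UNRESOLVED" : ip));
        }
    }
}
```

URLs reported as unresolved after a fetch would be the candidates to inspect for DNS problems.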
[jira] Updated: (NUTCH-665) Search Load Testing Tool
[ https://issues.apache.org/jira/browse/NUTCH-665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dennis Kubes updated NUTCH-665:

Attachment: NUTCH-665-20081126-1.patch

Search load testing tool.

> Search Load Testing Tool
> ------------------------
>
> Key: NUTCH-665
> URL: https://issues.apache.org/jira/browse/NUTCH-665
> Project: Nutch
> Issue Type: New Feature
> Components: searcher
> Affects Versions: 1.0.0
> Environment: All
> Reporter: Dennis Kubes
> Assignee: Dennis Kubes
> Priority: Minor
> Fix For: 1.0.0
>
> Attachments: NUTCH-665-20081126-1.patch
>
> A tool which spawns a number of threads and executes searches against
> configured search servers. This is used for light load testing of search
> servers.
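The general shape of such a tool, several threads each issuing a number of searches, might look like the sketch below. It is not the attached patch: the search call is abstracted as a Runnable supplied by the caller, and the class name, method name, and parameters are hypothetical.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch, not the NUTCH-665 tool.
class SearchLoadSketch {
    /**
     * Starts `threads` workers that each run `search` `queriesPerThread` times.
     * Returns the number of searches that completed.
     */
    static long run(int threads, int queriesPerThread, Runnable search) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        AtomicLong completed = new AtomicLong();
        for (int t = 0; t < threads; t++) {
            pool.submit(() -> {
                for (int q = 0; q < queriesPerThread; q++) {
                    search.run(); // stand-in for a query against a search server
                    completed.incrementAndGet();
                }
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
        }
        return completed.get();
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        long done = run(4, 25, () -> { /* issue one search here */ });
        long millis = (System.nanoTime() - start) / 1_000_000;
        System.out.println(done + " searches in " + millis + " ms");
    }
}
```

Passing the search as a Runnable keeps the threading logic independent of how a query is actually sent, which is the part that depends on the configured search servers.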
[jira] Created: (NUTCH-665) Search Load Testing Tool
Search Load Testing Tool

Key: NUTCH-665
URL: https://issues.apache.org/jira/browse/NUTCH-665
Project: Nutch
Issue Type: New Feature
Components: searcher
Affects Versions: 1.0.0
Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
Priority: Minor
Fix For: 1.0.0

A tool which spawns a number of threads and executes searches against configured search servers. This is used for light load testing of search servers.
[jira] Commented: (NUTCH-563) Include custom fields in BasicQueryFilter
[ https://issues.apache.org/jira/browse/NUTCH-563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12651009#action_12651009 ]

Davide commented on NUTCH-563:

Hi,

is it possible to apply this code to Nutch 0.8.1 as well? Can you explain to me how?

Thanks

> Include custom fields in BasicQueryFilter
> -----------------------------------------
>
> Key: NUTCH-563
> URL: https://issues.apache.org/jira/browse/NUTCH-563
> Project: Nutch
> Issue Type: New Feature
> Components: searcher
> Reporter: julien nioche
> Priority: Minor
> Fix For: 0.9.0
>
> Attachments: diff.BasicQueryFilter.dynamicFields.txt
>
> This patch allows additional fields to be included in the BasicQueryFilter
> by specifying runtime parameters. Any parameter matching the regular
> expression ("query\\.basic\\.(.+)\\.boost") will be added to the list of
> fields used by the BQF, and the specified float value will be used as the
> boost.
[jira] Commented: (NUTCH-663) Upgrade Nutch to use Hadoop 0.18.2
[ https://issues.apache.org/jira/browse/NUTCH-663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12650982#action_12650982 ]

Dennis Kubes commented on NUTCH-663:

Hadoop 0.19 was released. I am integrating it and should have a patch shortly.

> Upgrade Nutch to use Hadoop 0.18.2
> ----------------------------------
>
> Key: NUTCH-663
> URL: https://issues.apache.org/jira/browse/NUTCH-663
> Project: Nutch
> Issue Type: Improvement
> Affects Versions: 1.0.0
> Environment: All
> Reporter: Dennis Kubes
> Assignee: Dennis Kubes
> Fix For: 1.0.0
>
> Upgrade Nutch to use a newer hadoop, version 0.18.2. This includes
> performance improvements, bug fixes, and new functionality. Changes some
> current APIs.
[jira] Commented: (NUTCH-664) Possibility to update already stored documents.
[ https://issues.apache.org/jira/browse/NUTCH-664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12650912#action_12650912 ]

Sergey Khilkov commented on NUTCH-664:

Yes, it would be great to have a changeDocument() method in the IndexWriter class. Hope it's possible )

> Possibility to update already stored documents.
> -----------------------------------------------
>
> Key: NUTCH-664
> URL: https://issues.apache.org/jira/browse/NUTCH-664
> Project: Nutch
> Issue Type: Wish
> Reporter: Sergey Khilkov
> Priority: Minor
>
> We have a huge index of stored documents. It is a costly procedure to
> fetch the page and merge indexes every time we update some information
> about a page. The information can change 1-3 times per day. At the moment
> we have to store the changed info in a database, but in this case we have
> lots of problems with sorting, search restrictions and so on. Lucene itself
> allows deleting a single document and adding a new one to an existing
> index. But there is a problem with Hadoop... As I understand it, the Hadoop
> filesystem cannot write at random positions. But it would be a great
> feature if Nutch were able to update an already created index.
[jira] Updated: (NUTCH-664) Possibility to update already stored documents.
[ https://issues.apache.org/jira/browse/NUTCH-664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrzej Bialecki updated NUTCH-664:

Priority: Minor (was: Major)
Issue Type: Wish (was: New Feature)

There is no proposed design, so this is a Wish.

> Possibility to update already stored documents.
> -----------------------------------------------
>
> Key: NUTCH-664
> URL: https://issues.apache.org/jira/browse/NUTCH-664
> Project: Nutch
> Issue Type: Wish
> Reporter: Sergey Khilkov
> Priority: Minor
>
> We have a huge index of stored documents. It is a costly procedure to
> fetch the page and merge indexes every time we update some information
> about a page. The information can change 1-3 times per day. At the moment
> we have to store the changed info in a database, but in this case we have
> lots of problems with sorting, search restrictions and so on. Lucene itself
> allows deleting a single document and adding a new one to an existing
> index. But there is a problem with Hadoop... As I understand it, the Hadoop
> filesystem cannot write at random positions. But it would be a great
> feature if Nutch were able to update an already created index.