unable to build 2.x
Hi nutch-dev, I took a *fresh* checkout of 2.x and tried to build it (ant clean runtime). I get a lot of compilation errors. When I first saw that on the terminal, I said to my laptop: "Are you kidding me?" I retried it two more times and the same thing still happens. I am looking into why the build fails on my machine. Curious to know: is it just me, or is everybody else seeing this?
[jira] [Updated] (NUTCH-356) Plugin repository cache can lead to memory leak
[ https://issues.apache.org/jira/browse/NUTCH-356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel updated NUTCH-356:

    Fix Version/s: 1.8

Plugin repository cache can lead to memory leak

Key: NUTCH-356
URL: https://issues.apache.org/jira/browse/NUTCH-356
Project: Nutch
Issue Type: Bug
Affects Versions: 0.8
Reporter: Enrico Triolo
Fix For: 2.3, 1.8
Attachments: ASF.LICENSE.NOT.GRANTED--NutchTest.java, ASF.LICENSE.NOT.GRANTED--patch.txt, cache_classes.patch

While I was trying to solve a problem I reported a while ago (see NUTCH-314), I found out that the problem was actually related to the plugin cache used in class PluginRepository.java. As I said in NUTCH-314, I think I somehow 'force' the way nutch is meant to work, since I need to frequently submit new urls and append their contents to the index; I don't (and I can't) have an urls.txt file with all urls I'm going to fetch, but I recreate it each time a new url is submitted. Thus, I think in the majority of cases you won't have problems using nutch as-is, since the problem I found occurs only if nutch is used in a way similar to the one I use.

To simplify your test I'm attaching a class that performs something similar to what I need. It fetches and indexes some sample urls; to avoid webmasters' complaints I left the sample urls list empty, so you should modify the source code and add some urls. Then you only have to run it and watch your memory consumption with top. In my experience I get an OutOfMemoryError after a couple of minutes, but it clearly depends on your heap settings and on the plugins you are using (I'm using 'protocol-file|protocol-http|parse-(rss|html|msword|pdf|text)|language-identifier|index-(basic|more)|query-(basic|more|site|url)|urlfilter-regex|summary-basic|scoring-opic').

The problem is bound to the PluginRepository 'singleton' instance, since it never gets released. It seems that some class maintains a reference to it and this class is never released since it is cached somewhere in the configuration. So I modified the PluginRepository's 'get' method so that it never uses the cache and always returns a new instance (you can find the patch in attachment). This way the memory consumption is always stable and I get no OOM anymore. Clearly this is not the solution, since I guess there are many performance issues involved, but for the moment it works.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-840) Port tests from parse-html to parse-tika
[ https://issues.apache.org/jira/browse/NUTCH-840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel updated NUTCH-840:

    Fix Version/s: 1.8

Port tests from parse-html to parse-tika

Key: NUTCH-840
URL: https://issues.apache.org/jira/browse/NUTCH-840
Project: Nutch
Issue Type: Task
Components: parser
Affects Versions: 1.1, 1.6
Reporter: Julien Nioche
Assignee: Julien Nioche
Fix For: 2.3, 1.8
Attachments: NUTCH-840.patch, NUTCH-840.patch, NUTCH-840-trunk.patch, NUTCH-840v2.patch

We don't have tests for HTML in parse-tika, so I'll copy them from the old parse-html plugin.
[jira] [Updated] (NUTCH-410) Faster RegexNormalize with more features
[ https://issues.apache.org/jira/browse/NUTCH-410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel updated NUTCH-410:

    Fix Version/s: 1.8

Faster RegexNormalize with more features

Key: NUTCH-410
URL: https://issues.apache.org/jira/browse/NUTCH-410
Project: Nutch
Issue Type: Improvement
Components: fetcher
Affects Versions: 0.8
Environment: Tested on MacOS X 10.4.7/10.4.8
Reporter: Doug Cook
Priority: Minor
Fix For: 2.3, 1.8
Attachments: betterRegexNorm.patch

The patch associated with this is backwards-compatible and has several improvements over the stock 0.8 RegexURLNormalizer:

1) About a 34% performance improvement, from only executing the superclass (BasicURLNormalizer) once in most cases, instead of twice as the stock version did.

2) Support for expensive host-specific normalizations with good performance. Each regex block optionally takes a list of hosts to which to apply the associated regex. If supplied, the regex will only be applied to these hosts. This should have scalable performance; the comparison is O(1) regardless of the number of hosts. The format is:

    <regex>
      <host>www.host1.com</host>
      <host>host2.site2.com</host>
      <pattern> my pattern here </pattern>
      <substitution> my substitution here </substitution>
    </regex>

3) Support for decoding URLs with escaped character encodings (e.g. %20, etc.). This is useful, for example, to decode "jump redirects" which have the target URL encoded within the source, as on Yahoo. I tried to create an extensible notion of "options", the first of which is "unescape". The unescape function is applied *after* the substitution and *only* if the substitution pattern matches. A simple pattern to unescape Yahoo directory redirects would be something like:

    <regex>
      <pattern>^http://[a-z\.]*\.yahoo\.com/.*/\*+(http[^&amp;]+)</pattern>
      <substitution>$1</substitution>
      <options>unescape</options>
    </regex>

4) Added the notion of iterating the pattern chain. This is useful when the result of a normalization can itself be normalized. While some of this can be handled in the stock version by repeating patterns, or by careful ordering of patterns, the notion of iterating is cleaner and more powerful. The chain is defined to iterate only when the previous iteration changes the input, up to a configurable maximum number of iterations. The config parameter to change is: urlnormalizer.regex.maxiterations, which defaults to 1 (previous behavior). The change is performance-neutral when disabled, and has a relatively small performance cost when enabled.

Pardon any potentially unconventional Java on my part. I've got lots of C/C++ search engine experience, but Nutch is my first large Java app. I welcome any feedback, and hope this is useful.

Doug
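The iteration described in point 4 of NUTCH-410 can be illustrated with a small sketch. This is not the code from betterRegexNorm.patch: the class name `IteratingRegexNormalizer`, its `addRule` method, and the sample rule are all hypothetical, and only the "repeat while the previous pass changed the input, up to a maximum" behavior is taken from the description.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical sketch of an iterating pattern chain; not the patch's class. */
final class IteratingRegexNormalizer {

    /** One pattern/substitution pair of the chain. */
    private static final class Rule {
        final Pattern pattern;
        final String substitution;

        Rule(String regex, String substitution) {
            this.pattern = Pattern.compile(regex);
            this.substitution = substitution;
        }
    }

    private final List<Rule> rules = new ArrayList<>();
    private final int maxIterations;

    /** maxIterations plays the role of urlnormalizer.regex.maxiterations (1 = single pass). */
    IteratingRegexNormalizer(int maxIterations) {
        this.maxIterations = Math.max(1, maxIterations);
    }

    IteratingRegexNormalizer addRule(String regex, String substitution) {
        rules.add(new Rule(regex, substitution));
        return this;
    }

    /** Runs the whole chain repeatedly, stopping early once a pass changes nothing. */
    String normalize(String url) {
        String current = url;
        for (int i = 0; i < maxIterations; i++) {
            String next = current;
            for (Rule rule : rules) {
                next = rule.pattern.matcher(next).replaceAll(rule.substitution);
            }
            if (next.equals(current)) {
                break; // fixed point reached
            }
            current = next;
        }
        return current;
    }
}
```

With a single rule that collapses one "dir/../" segment, one pass leaves a nested "a/b/../../c" only half normalized, while a second pass finishes the job. That is the kind of case where iterating is simpler than duplicating patterns.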
[jira] [Updated] (NUTCH-1253) Incompatible neko and xerces versions
[ https://issues.apache.org/jira/browse/NUTCH-1253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel updated NUTCH-1253:

    Fix Version/s: 1.8

Incompatible neko and xerces versions

Key: NUTCH-1253
URL: https://issues.apache.org/jira/browse/NUTCH-1253
Project: Nutch
Issue Type: Bug
Affects Versions: 1.4
Environment: Ubuntu 10.04
Reporter: Dennis Spathis
Assignee: Lewis John McGibbney
Fix For: 2.3, 1.8
Attachments: NUTCH-1253-2.x-v2.patch, NUTCH-1253-nutchgora.patch, NUTCH-1253.patch, TEST-org.apache.nutch.parse.html.TestDOMContentUtils.txt

The Nutch 1.4 distribution includes
- nekohtml-0.9.5.jar (under .../runtime/local/plugins/lib-nekohtml)
- xercesImpl-2.9.1.jar (under .../runtime/local/lib)

These two JARs appear to be incompatible versions. When the HtmlParser (configured to use neko) is invoked during a local-mode crawl, the parse fails due to an AbstractMethodError. (Note: To see the AbstractMethodError, rebuild the HtmlParser plugin and add a catch(Throwable) clause in the getParse method to log the stacktrace.)

I found that substituting a later, compatible version of nekohtml (1.9.11) fixes the problem.

Curiously, and in support of the above, the nekohtml plugin.xml file in Nutch 1.4 contains the following:

    <plugin
       id="lib-nekohtml"
       name="CyberNeko HTML Parser"
       version="1.9.11"
       provider-name="org.cyberneko">
       <runtime>
          <library name="nekohtml-0.9.5.jar">
             <export name="*"/>
          </library>
       </runtime>
    </plugin>

Note the conflicting version numbers (version tag is 1.9.11 but the specified library is nekohtml-0.9.5.jar). Was the 0.9.5 version included by mistake? Was the intention rather to include 1.9.11?
[jira] [Updated] (NUTCH-1190) MoreIndexingFilter refactor: move data formats used to parse lastModified to a config file.
[ https://issues.apache.org/jira/browse/NUTCH-1190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel updated NUTCH-1190:

    Fix Version/s: 1.8

MoreIndexingFilter refactor: move data formats used to parse lastModified to a config file.

Key: NUTCH-1190
URL: https://issues.apache.org/jira/browse/NUTCH-1190
Project: Nutch
Issue Type: Improvement
Components: indexer
Affects Versions: 1.4
Environment: jdk6
Reporter: Zhang JinYan
Fix For: 2.3, 1.8
Attachments: date-styles.txt, MoreIndexingFilter.patch, NUTCH-1190-trunk.patch

There are many issues about missing date formats:
[NUTCH-871|https://issues.apache.org/jira/browse/NUTCH-871]
[NUTCH-912|https://issues.apache.org/jira/browse/NUTCH-912]
[NUTCH-1015|https://issues.apache.org/jira/browse/NUTCH-1015]

The date formats can be diverse, so why not move those date formats to an extra config file? I moved all the date formats from MoreIndexingFilter.java to a file named date-styles.txt (placed in conf), which is loaded on startup.

{code}
public void setConf(Configuration conf) {
  this.conf = conf;
  MIME = new MimeUtil(conf);

  URL res = conf.getResource("date-styles.txt");
  if (res == null) {
    LOG.error("Can't find resource: date-styles.txt");
  } else {
    try {
      List lines = FileUtils.readLines(new File(res.getFile()));
      for (int i = 0; i < lines.size(); i++) {
        String dateStyle = (String) lines.get(i);
        // drop blank lines
        if (StringUtils.isBlank(dateStyle)) {
          lines.remove(i);
          i--;
          continue;
        }
        dateStyle = StringUtils.trim(dateStyle);
        // drop comment lines
        if (dateStyle.startsWith("#")) {
          lines.remove(i);
          i--;
          continue;
        }
        lines.set(i, dateStyle);
      }
      dateStyles = new String[lines.size()];
      lines.toArray(dateStyles);
    } catch (IOException e) {
      LOG.error("Failed to load resource: date-styles.txt");
    }
  }
}
{code}

Then parse lastModified like this (sample):

{code}
private long getTime(String date, String url) {
  ..
  Date parsedDate = DateUtils.parseDate(date, dateStyles);
  time = parsedDate.getTime();
  ..
  return time;
}
{code}

This patch also contains the patch of [NUTCH-1140|https://issues.apache.org/jira/browse/NUTCH-1140].
Find more details in the patch file.
[jira] [Updated] (NUTCH-797) parse-tika is not properly constructing URLs when the target begins with a ?
[ https://issues.apache.org/jira/browse/NUTCH-797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel updated NUTCH-797:

    Fix Version/s: 1.8

parse-tika is not properly constructing URLs when the target begins with a ?

Key: NUTCH-797
URL: https://issues.apache.org/jira/browse/NUTCH-797
Project: Nutch
Issue Type: Bug
Components: parser
Affects Versions: 1.1, nutchgora
Environment: Win 7, Java(TM) SE Runtime Environment (build 1.6.0_16-b01). Also repro's on RHEL and java 1.4.2
Reporter: Robert Hohman
Assignee: Andrzej Bialecki
Priority: Minor
Fix For: 2.3, 1.8
Attachments: NUTCH-797.patch, pureQueryUrl-2.patch, pureQueryUrl.patch

This is my first bug and patch on nutch, so apologies if I have not provided enough detail.

In crawling the page at http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0 there are links in the page that look like this:

    <a href="?co=0&sk=0&p=2&pi=1">2</a></td><td><a href="?co=0&sk=0&p=3&pi=1">3</a>

In org.apache.nutch.parse.tika.DOMContentUtils rev 916362 (trunk), as getOutlinks looks for links, it comes across this link, and constructs a new url with a base URL class built from "http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0", and a target of "?co=0&sk=0&p=2&pi=1".

The URL class, per RFC 3986 at http://labs.apache.org/webarch/uri/rfc/rfc3986.html#relative-merge, defines how to merge these two, and per the RFC, the URL class merges these to: http://careers3.accenture.com/Careers/ASPX/?co=0&sk=0&p=2&pi=1 because the RFC explicitly states that the rightmost url segment (the Search.aspx in this case) should be ripped off before combining.

While this is compliant with the RFC, it means the URLs which are created for the next round of fetching are incorrect. Modern browsers seem to handle this case (I checked IE8 and Firefox 3.5), so I'm guessing this is an obscure exception or handling of what is a poorly formed url on accenture's part.

I have fixed this by modifying DOMContentUtils to look for the case where a ? begins the target, and then pulling the rightmost component out of the base and inserting it into the target before the ?, so the target in this example becomes: Search.aspx?co=0&sk=0&p=2&pi=1

The URL class then properly constructs the new url as: http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0&p=2&pi=1

If it is agreed that this solution works, I believe the other html parsers in nutch would need to be modified in a similar way. Can I get feedback on this proposed solution? Specifically I'm worried about unforeseen side effects.

Much thanks

Here is the patch info:

Index: src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java
===================================================================
--- src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java (revision 916362)
+++ src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java (working copy)
@@ -299,6 +299,50 @@
     return false;
   }
 
+  private URL fixURL(URL base, String target) throws MalformedURLException
+  {
+    // handle params that are embedded into the base url - move them to target
+    // so URL class constructs the new url class properly
+    if (base.toString().indexOf(';') > 0)
+      return fixEmbeddedParams(base, target);
+
+    // handle the case that there is a target that is a pure query.
+    // Strictly speaking this is a violation of RFC 2396 section 5.2.2 on how to assemble
+    // URLs but I've seen this in numerous places, for example at
+    // http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0
+    // It has urls in the page of the form href="?co=0&sk=0&pg=1", and by default
+    // URL constructs the base+target combo as
+    // http://careers3.accenture.com/Careers/ASPX/?co=0&sk=0&pg=1, incorrectly
+    // dropping the Search.aspx target
+    //
+    // Browsers handle these just fine, they must have an exception similar to this
+    if (target.startsWith("?"))
+    {
+      return fixPureQueryTargets(base, target);
+    }
+
+    return new URL(base, target);
+  }
+
+  private URL fixPureQueryTargets(URL base, String target) throws MalformedURLException
+  {
+    if (!target.startsWith("?"))
+      return new URL(base, target);
+
+    String basePath = base.getPath();
+    String baseRightMost = "";
+    int baseRightMostIdx = basePath.lastIndexOf("/");
+    if (baseRightMostIdx != -1)
+    {
+      baseRightMost = basePath.substring(baseRightMostIdx+1);
+    }
+
+    if
[jira] [Updated] (NUTCH-566) Sun's URL class has bug in creation of relative query URLs
[ https://issues.apache.org/jira/browse/NUTCH-566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel updated NUTCH-566:

    Fix Version/s: 1.8

Sun's URL class has bug in creation of relative query URLs

Key: NUTCH-566
URL: https://issues.apache.org/jira/browse/NUTCH-566
Project: Nutch
Issue Type: Bug
Components: fetcher
Affects Versions: 0.8, 0.8.1, 0.9.0
Environment: MacOS X and Linux (CentOS 4.5) both
Reporter: Doug Cook
Priority: Minor
Fix For: 2.3, 1.8
Attachments: RelativeURL.java

I'm using 0.8.1, but this will affect all other versions as well. Relative links of the form "?blah" are resolved incorrectly. For example, with a base URL of "http://www.fleurie.org/entreprise.asp", and a relative link of "?id_entrep=111", Nutch will resolve this pair to the link "http://www.fleurie.org/?id_entrep=111". No such URL exists, and all browsers I tried will resolve the pair to "http://www.fleurie.org/entreprise.asp?id_entrep=111".

I tracked this down to what could be called a bug in Sun's URL class. According to Sun's spec, they parse the relative URL according to RFC 2396. But the original RFC for relative links was RFC 1808, and the two RFCs differ in how they handle relative links beginning with "?". Most browsers (Netscape/Mozilla, IE, Safari) implemented RFC 1808, and stuck with it (for compatibility and also because the behavior makes more sense). Apparently even the people that wrote RFC 2396 recognized that this was a mistake, and the specified behavior was changed in RFC 3986 to match what browsers do. For a discussion of this, see http://gbiv.com/protocols/uri/rev-2002/issues.html#003-relative-query

Sun's URL implementation, however, still implements RFC 2396, as far as I can tell, and is out of step with the rest of the world. This breaks link extraction on a number of sites.

I implemented a simple workaround, which I'm attaching. It is a static method to create URLs which behaves exactly as new URL(URL base, String relativePath), and I use it as a drop-in replacement for that in DOMContentUtils, Javascript link extraction, etc. Obviously, it really only matters wherever links are extracted. I haven't included the calling code from DOMContentUtils, etc. because my local versions are largely rewritten, but it should be pretty obvious.

I put it in the org.apache.nutch.net directory, but obviously feel free to move it to another place if you feel it belongs there!
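The browser-style resolution that NUTCH-566 asks for can be sketched as follows. This is not the attached RelativeURL.java, whose contents are not shown in this thread. The class name `RelativeQueryUrls` and the string-based signature are assumptions made for illustration, and the pure-query case is handled by explicit string assembly, so the result does not depend on how a particular JDK resolves "?query" references.

```java
import java.net.URI;

/** Hypothetical sketch; not the RelativeURL.java attached to NUTCH-566. */
final class RelativeQueryUrls {

    /**
     * Resolves target against base. A target that starts with "?" keeps the
     * base's full path and replaces only the query, as browsers do.
     * Throws IllegalArgumentException if base (or a non-query target) is not a valid URI.
     */
    static String resolve(String base, String target) {
        URI baseUri = URI.create(base);
        if (target.startsWith("?") && baseUri.getRawAuthority() != null) {
            String path = baseUri.getRawPath();
            if (path == null || path.isEmpty()) {
                path = "/";
            }
            // scheme://authority + unchanged path + new query (old query and fragment dropped)
            return baseUri.getScheme() + "://" + baseUri.getRawAuthority() + path + target;
        }
        // everything else: defer to the standard resolver
        return baseUri.resolve(target).toString();
    }
}
```

For the example from the report, `resolve("http://www.fleurie.org/entreprise.asp", "?id_entrep=111")` returns "http://www.fleurie.org/entreprise.asp?id_entrep=111", which is the link the browsers produce.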
[jira] [Updated] (NUTCH-1250) parse-html does not parse links with empty anchor
[ https://issues.apache.org/jira/browse/NUTCH-1250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel updated NUTCH-1250:

    Fix Version/s: 1.8

parse-html does not parse links with empty anchor

Key: NUTCH-1250
URL: https://issues.apache.org/jira/browse/NUTCH-1250
Project: Nutch
Issue Type: Bug
Components: parser
Affects Versions: 1.4
Reporter: Andreas Janning
Fix For: 2.3, 1.8
Attachments: DOMContentUtils_v1.patch, DOMContentUtils_v2.patch, TestDomContentUitls_v1.patch

The parse-html plugin does not generate an outlink if the link has no anchor. For example, the following HTML code does not create an Outlink:

{code:html}
<a href="example.com"></a>
{code}

The JUnit test TestDOMContentUtils attempts to cover this case, but it does not actually exercise it because there is a comment inside the <a> tag.

{code:title=TestDOMContentUtils.java|borderStyle=solid}
new String("<html><head><title> title </title>" +
    "</head><body>" +
    "<a href=\"g\"><!--no anchor--></a>" +
    "<a href=\"g1\"> <!--whitespace--> </a>" +
    "<a href=\"g2\"> <img src=test.gif alt='bla bla'> </a>" +
    "</body></html>"),
{code}

When you remove the comment, the test fails.

{code:title=TestDOMContentUtils.java Test fails|borderStyle=solid}
new String("<html><head><title> title </title>" +
    "</head><body>" +
    "<a href=\"g\"></a>" + // no anchor
    "<a href=\"g1\"> <!--whitespace--> </a>" +
    "<a href=\"g2\"> <img src=test.gif alt='bla bla'> </a>" +
    "</body></html>"),
{code}
[jira] [Updated] (NUTCH-1562) Order of execution for scoring filters
[ https://issues.apache.org/jira/browse/NUTCH-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel updated NUTCH-1562:

    Fix Version/s: 1.8

Order of execution for scoring filters

Key: NUTCH-1562
URL: https://issues.apache.org/jira/browse/NUTCH-1562
Project: Nutch
Issue Type: Bug
Components: documentation
Affects Versions: 1.6, 2.1
Reporter: Julien Nioche
Fix For: 2.3, 1.8
Attachments: NUTCH-1562-trunk.patch

The documentation in nutch-default.xml states that:

{quote}
<property>
  <name>scoring.filter.order</name>
  <value></value>
  <description>The order in which scoring filters are applied. This may be left empty (in which case all available scoring filters will be applied in the order defined in plugin-includes and plugin-excludes), or a space separated list of implementation classes.
  </description>
</property>
{quote}

however if no order is specified the filters are ordered randomly and not in the order defined in plugin-includes. The other *order parameters (e.g. urlfilter.order) have a different documentation and are "loaded and applied in system defined order", which corresponds to what the code does.

The patch attached is for 1.x and puts the code in accordance with the documentation by ordering the filters according to the order of the plugins, which gives users more control without having to specify the classes explicitly in scoring.filter.order. We could extend the same idea to the other *order params.
[jira] [Updated] (NUTCH-409) Add short circuit notion to filters to speedup mixed site/subsite crawling
[ https://issues.apache.org/jira/browse/NUTCH-409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel updated NUTCH-409:

    Fix Version/s: 1.8

Add short circuit notion to filters to speedup mixed site/subsite crawling

Key: NUTCH-409
URL: https://issues.apache.org/jira/browse/NUTCH-409
Project: Nutch
Issue Type: Improvement
Components: fetcher
Affects Versions: 0.8
Reporter: Doug Cook
Priority: Minor
Fix For: 2.3, 1.8
Attachments: shortcircuit.patch

In the case where one is crawling a mixture of sites and sub-sites, the prefix matcher can match the sites quite quickly, but either the regex or automaton filters are considerably slower matching the sub-sites. In the current model of AND-ing all the filters together, the pattern-matching filter will be run on every site that matches the prefix matcher -- even if that entire site is to be crawled and there are no sub-site patterns. If only a small portion of the sites actually need sub-site pattern matching, this is much slower than it should be.

I propose (and attach) a simple modification allowing considerable speedup for this usage pattern. I define the notion of a "short circuit" match that means "accept this URL and don't run any of the remaining filters in the filter chain."

Though with this change, any filter plugin can in theory return a short-circuit match, I have only implemented the short-circuit match for the PrefixURLFilter. The configuration file format is backwards-compatible; shortcircuit matches just have SHORTCIRCUIT: in front of them.

One minor gotcha:
* Because the shortcircuit matches will avoid running any later filters, all of the site-independent filters need to be BEFORE the PrefixURLFilter in the chain.

I get my best performance using the following filter chain:
1) The SuffixURLFilter to throw away anything with unwanted extensions
2) The RegexURLFilter to do site-independent cleanup (ad removal, skipping mailto:, bulletin-board pages, etc.)
3) The PrefixURLFilter, with SHORTCIRCUIT: in front of every site name EXCEPT the sites needing subsite matching
4) The AutomatonURLFilter to match those sites needing subsite pattern matching.

I have tens of thousands of sites and an order of magnitude fewer subsites, so skipping step #4 90% of the time speeds things up considerably (my reduce time on a round of crawling is down from some 26 hours to less than 10).

There are only two drawbacks to the implementation, and I think they're pretty minor:
1) Because I pass a special token (_PASS_) in the place of the URL to implement the short circuit, if for some reason someone wanted to crawl a URL named _PASS_, there would be problems. I find this highly unlikely, since that's an invalid URL.
2) The correct behavior of steps #3 and #4 above depends upon coordination of the config files between the prefix and automaton filters, making an opportunity for user screwup. I thought about creating a new kind of filter which essentially combined the prefix and automaton filters' behaviors, took one config file, and internally handled the short-circuiting. But I think the approach I took is simpler, cleaner, more flexible, and avoids creating yet another kind of filter. Coordinating the config files is pretty easy (I generate them programmatically).

As this is my first contribution to Nutch I'm sure that there are things I've missed, whether in coding style or desired patch format. I welcome any feedback, suggestions, etc.

Doug
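A rough sketch of the short-circuit idea from NUTCH-409 is shown below. It is not the code in shortcircuit.patch. The class `ShortCircuitChain`, the `UrlFilter` interface and the `prefixFilter` helper are hypothetical stand-ins for Nutch's filter plugins, and the chain loop here simply stops when it sees the `_PASS_` token, which is one possible way to realize "don't run any of the remaining filters". Only the `SHORTCIRCUIT:` prefix and the `_PASS_` token come from the description.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of a short-circuiting filter chain; not the patch itself. */
final class ShortCircuitChain {

    /** Token returned in place of the URL to signal a short-circuit accept. */
    static final String PASS = "_PASS_";
    /** Marker used in the prefix config for short-circuit entries. */
    static final String MARKER = "SHORTCIRCUIT:";

    /** Returns the URL to accept it, null to reject it, or PASS to accept and stop. */
    interface UrlFilter {
        String filter(String url);
    }

    /** Builds a prefix filter from config lines; marked lines short-circuit. */
    static UrlFilter prefixFilter(List<String> configLines) {
        final List<String> shortCircuit = new ArrayList<>();
        final List<String> plain = new ArrayList<>();
        for (String line : configLines) {
            if (line.startsWith(MARKER)) {
                shortCircuit.add(line.substring(MARKER.length()));
            } else {
                plain.add(line);
            }
        }
        return url -> {
            for (String prefix : shortCircuit) {
                if (url.startsWith(prefix)) return PASS;
            }
            for (String prefix : plain) {
                if (url.startsWith(prefix)) return url;
            }
            return null; // no prefix matched: reject
        };
    }

    /** Runs filters in order; a null rejects, PASS accepts without running the rest. */
    static String apply(List<UrlFilter> chain, String url) {
        for (UrlFilter f : chain) {
            String result = f.filter(url);
            if (result == null) return null;
            if (PASS.equals(result)) return url;
        }
        return url;
    }
}
```

The ordering constraint from the report is visible here: anything placed after the prefix filter is never consulted for short-circuited sites, so site-independent cleanup filters have to come first.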
[jira] [Updated] (NUTCH-945) Indexing to multiple SOLR Servers
[ https://issues.apache.org/jira/browse/NUTCH-945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel updated NUTCH-945:

    Fix Version/s: 1.8

Indexing to multiple SOLR Servers

Key: NUTCH-945
URL: https://issues.apache.org/jira/browse/NUTCH-945
Project: Nutch
Issue Type: Improvement
Components: indexer
Affects Versions: 1.2
Reporter: Charan Malemarpuram
Fix For: 2.3, 1.8
Attachments: MurmurHashPartitioner.java, NonPartitioningPartitioner.java, patch-NUTCH-945.txt

It would be nice to have a default Indexer in Nutch which can submit docs to multiple SOLR Servers. Partitioning is always the question when writing to multiple SOLR Servers. Default partitioning can be a simple hashcode-based distribution with additional hooks for customization.
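The "simple hashcode based distribution" mentioned in NUTCH-945 could look roughly like the sketch below. It is not the attached MurmurHashPartitioner.java: the class name `HashPartitioner` is made up, and it uses `String.hashCode()` instead of MurmurHash purely to stay self-contained.

```java
/** Hypothetical sketch; uses String.hashCode(), not the MurmurHash of the attachment. */
final class HashPartitioner {

    /** Maps a document id to a server index in the range [0, numServers). */
    static int partition(String docId, int numServers) {
        if (numServers <= 0) {
            throw new IllegalArgumentException("numServers must be positive");
        }
        // floorMod keeps the result non-negative even for negative hash codes
        return Math.floorMod(docId.hashCode(), numServers);
    }
}
```

The same id always lands on the same server, which matters for updates and deletes. Changing the number of servers remaps most ids, which is one reason a customization hook is useful.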
[jira] [Updated] (NUTCH-1531) URL filtering takes long time for very long URLs
[ https://issues.apache.org/jira/browse/NUTCH-1531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel updated NUTCH-1531:

    Fix Version/s: 1.8

URL filtering takes long time for very long URLs

Key: NUTCH-1531
URL: https://issues.apache.org/jira/browse/NUTCH-1531
Project: Nutch
Issue Type: Improvement
Affects Versions: 1.6, 2.1, 1.7, 2.2
Reporter: Fırat KÜÇÜK
Priority: Minor
Fix For: 2.3, 1.8
Attachments: max_url_length.diff, test_case.txt

Filtering very long URLs (such as those from base64 image generators) takes a long time (hours). In the reduce phase it locks up the whole system for hours. Therefore some URL length limit is needed. We attached a small patch for this improvement.
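The length limit proposed in NUTCH-1531 amounts to a cheap check that runs before any expensive pattern matching. The sketch below is only an illustration of that idea. The contents of max_url_length.diff are not shown in this thread, so the class name `MaxLengthUrlFilter` and its constructor parameter are assumptions.

```java
/** Hypothetical sketch of a length guard; not the code from max_url_length.diff. */
final class MaxLengthUrlFilter {

    private final int maxLength;

    MaxLengthUrlFilter(int maxLength) {
        this.maxLength = maxLength;
    }

    /** Returns the URL unchanged if it is short enough, or null to reject it. */
    String filter(String url) {
        if (url == null || url.length() > maxLength) {
            return null;
        }
        return url;
    }
}
```

Placed first in a filter chain, such a guard rejects oversized URLs in constant time, so the regex filters never see them.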
[jira] [Updated] (NUTCH-351) Protocol forward proxy
[ https://issues.apache.org/jira/browse/NUTCH-351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel updated NUTCH-351:

    Fix Version/s: 1.8

Protocol forward proxy

Key: NUTCH-351
URL: https://issues.apache.org/jira/browse/NUTCH-351
Project: Nutch
Issue Type: New Feature
Components: fetcher
Affects Versions: 0.8, 0.8.1, 0.9.0
Reporter: Sami Siren
Assignee: Sami Siren
Priority: Minor
Fix For: 2.3, 1.8
Attachments: protocol-http-proxy-adapter.txt

The protocol proxy adapter takes advantage of the protocols known to an http forward proxy. Usually there are at least http, https and ftp. You must configure nutch to use this plugin and to use an http proxy before use.
[jira] [Updated] (NUTCH-490) Extension point with filters for Neko HTML parser (with patch)
[ https://issues.apache.org/jira/browse/NUTCH-490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel updated NUTCH-490:

    Fix Version/s: 1.8

Extension point with filters for Neko HTML parser (with patch)

Key: NUTCH-490
URL: https://issues.apache.org/jira/browse/NUTCH-490
Project: Nutch
Issue Type: Improvement
Components: fetcher
Affects Versions: 0.9.0
Environment: Any
Reporter: Marcin Okraszewski
Priority: Minor
Fix For: 2.3, 1.8
Attachments: HtmlParser.java.diff, NekoFilters_for_1.0.patch, nutch-extensionpoins_plugin.xml.diff

In my project I need to set filters for the Neko HTML parser. So instead of adding them hard coded, I made an extension point to define filters for Neko. I was following the code for HtmlParser filters. In fact, I think the method to get filters could be generalized to handle both cases, but I didn't want to make too big a mess.

The attached patch is for Nutch 0.9. This part of the code wasn't changed in trunk, so it should be easily applicable.

BTW, I wonder if it wouldn't be best to have HTML DOM parsing defined by an extension point itself. Now there are options for Neko and TagSoup. But if someone would like to use something else or use different settings for the parser, he would need to modify the HtmlParser class instead of replacing a plugin.
fix version 1.7 removed in Jira
Hi,

please take care not to remove the fix version when applying bulk changes, e.g., 2.2 => 2.3. Alternative fix versions (1.7) are not kept.

Luckily Jira is quite powerful; I restored the 1.x fix version using this awful filter:

project = NUTCH AND fixVersion in (2.3) AND status = Open AND updated >= 2013-05-21 AND affectedVersion in (0.8, 0.9, 0.9.0, 1.0, 1.1, 1.2, 1.3, 1.4, 1.6, 1.7) ORDER BY issuetype ASC, priority DESC

Cheers,
Sebastian
[jira] [Updated] (NUTCH-1483) Can't crawl filesystem with protocol-file plugin
[ https://issues.apache.org/jira/browse/NUTCH-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel updated NUTCH-1483:

    Priority: Critical (was: Major)

Can't crawl filesystem with protocol-file plugin

Key: NUTCH-1483
URL: https://issues.apache.org/jira/browse/NUTCH-1483
Project: Nutch
Issue Type: Bug
Components: protocol
Affects Versions: 1.6, 2.1
Environment: OpenSUSE 12.1, OpenJDK 1.6.0, HBase 0.90.4
Reporter: Rogério Pereira Araújo
Priority: Critical
Fix For: 2.3
Attachments: NUTCH-1483.patch

I tried to follow the same steps described in this wiki page: http://wiki.apache.org/nutch/IntranetDocumentSearch

I made all required changes on regex-urlfilter.txt and added the following entry in my seed file:

file:///home/rogerio/Documents/

The permissions are ok, I'm running nutch with the same user as the folder owner, so nutch has all the required permissions. Unfortunately I'm getting the following error:

org.apache.nutch.protocol.file.FileError: File Error: 404
    at org.apache.nutch.protocol.file.File.getProtocolOutput(File.java:105)
    at org.apache.nutch.fetcher.FetcherReducer$FetcherThread.run(FetcherReducer.java:514)
fetch of file://home/rogerio/Documents/ failed with: org.apache.nutch.protocol.file.FileError: File Error: 404

Why are the logs showing file://home/rogerio/Documents/ instead of file:///home/rogerio/Documents/ ???

Note: The regex-urlfilter entry only works as expected if I add the entry +^file://home/rogerio/Documents/ instead of +^file:///home/rogerio/Documents/ as the wiki says.
[jira] [Updated] (NUTCH-1483) Can't crawl filesystem with protocol-file plugin
[ https://issues.apache.org/jira/browse/NUTCH-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1483: --- Fix Version/s: 1.7 Can't crawl filesystem with protocol-file plugin Key: NUTCH-1483 URL: https://issues.apache.org/jira/browse/NUTCH-1483 Project: Nutch Issue Type: Bug Components: protocol Affects Versions: 1.6, 2.1 Environment: OpenSUSE 12.1, OpenJDK 1.6.0, HBase 0.90.4 Reporter: Rogério Pereira Araújo Priority: Critical Fix For: 1.7, 2.3 Attachments: NUTCH-1483.patch
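A possible reading of the 404 in NUTCH-1483: in generic URL syntax the text right after the two slashes is the authority (host), so once the third slash is lost, "home" is no longer part of the path. The sketch below only illustrates that parsing difference with java.net.URI from the JDK. The class FileUrlDemo and its helper names are hypothetical and not part of Nutch, and the sketch does not show where Nutch itself drops the slash.

```java
import java.net.URI;

public class FileUrlDemo {
    /** Returns the host component of a URL string, or null if it has none. */
    static String hostOf(String url) {
        return URI.create(url).getHost();
    }

    /** Returns the path component of a URL string. */
    static String pathOf(String url) {
        return URI.create(url).getPath();
    }

    public static void main(String[] args) {
        String[] urls = {
            "file:///home/rogerio/Documents/", // three slashes: empty authority, full path
            "file://home/rogerio/Documents/"   // two slashes: "home" is parsed as the host
        };
        for (String url : urls) {
            System.out.println(url + " -> host=" + hostOf(url) + ", path=" + pathOf(url));
        }
    }
}
```

With the two-slash form the remaining path is /rogerio/Documents/, which would not exist on the reporter's machine. That is consistent with the reported error, though the sketch alone does not prove it is the cause.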
[jira] [Resolved] (NUTCH-1249) Resolve all issues flagged up by adding javac -Xlint arguement
[ https://issues.apache.org/jira/browse/NUTCH-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tejas Patil resolved NUTCH-1249. Resolution: Fixed Fix Version/s: 2.2 Assignee: Tejas Patil (was: Lewis John McGibbney) Ported the patch for trunk to 2.x. All the tests are passing (verified on Java 1.7.0_10 and 1.6.0_38). Committed to svn at rev 1485125. Resolve all issues flagged up by adding javac -Xlint arguement -- Key: NUTCH-1249 URL: https://issues.apache.org/jira/browse/NUTCH-1249 Project: Nutch Issue Type: Improvement Components: build Affects Versions: nutchgora Reporter: Lewis John McGibbney Assignee: Tejas Patil Priority: Minor Fix For: 1.7, 2.2 Attachments: NUTCH-1249.trunk.patch There are a heap of issues flagged up by NUTCH-1237; I think over time it would be great to get these addressed and resolved. What is interesting is that adding the same arguments to /src/plugin/plugin-build.xml actually breaks my build, as tests begin to fail. Some of this is documented in the link below: http://docs.oracle.com/javase/1.5.0/docs/tooldocs/windows/javac.html#options
[jira] [Resolved] (NUTCH-1275) Fix [unchecked] javac warnings
[ https://issues.apache.org/jira/browse/NUTCH-1275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tejas Patil resolved NUTCH-1275. Resolution: Fixed Fix Version/s: 2.2 Got resolved with NUTCH-1249 Fix [unchecked] javac warnings -- Key: NUTCH-1275 URL: https://issues.apache.org/jira/browse/NUTCH-1275 Project: Nutch Issue Type: Sub-task Components: build Affects Versions: nutchgora, 1.5 Reporter: Lewis John McGibbney Assignee: Tejas Patil Priority: Minor Fix For: 1.7, 2.2 We can simply suppress these warnings using {code} @SuppressWarnings("unchecked") {code} However, if there is another way of resolving these warnings, it should be implemented if deemed beneficial to code quality. Some resources: http://java.sun.com/docs/books/jls/third_edition/html/conversions.html#190772
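To make the two options in NUTCH-1275 concrete, here is a minimal sketch of suppressing an [unchecked] warning versus removing its cause with generics. The class UncheckedDemo and its methods are hypothetical illustrations and are not taken from the Nutch patch; the code sticks to Java 6 syntax, since the thread mentions Java 1.6.

```java
import java.util.ArrayList;
import java.util.List;

public class UncheckedDemo {
    /** Raw-type version: without the annotation, javac -Xlint reports [unchecked] here. */
    @SuppressWarnings({"unchecked", "rawtypes"})
    static List legacyNames() {
        List names = new ArrayList();
        names.add("nutch"); // unchecked call to add(E) on a raw type
        return names;
    }

    /** Generic version: the warning disappears, so nothing needs suppressing. */
    static List<String> typedNames() {
        List<String> names = new ArrayList<String>();
        names.add("nutch");
        return names;
    }

    /** When a cast cannot be avoided, keep the suppression on the smallest scope. */
    static List<String> fromLegacy(Object o) {
        @SuppressWarnings("unchecked")
        List<String> names = (List<String>) o; // unchecked cast, suppressed for this variable only
        return names;
    }

    public static void main(String[] args) {
        System.out.println(typedNames());
        System.out.println(fromLegacy(legacyNames()));
    }
}
```

Adding type parameters is usually the preferable fix; the annotation is better kept for places such as casts of deserialized or reflective values, where the compiler cannot verify the type.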
[jira] [Commented] (NUTCH-1249) Resolve all issues flagged up by adding javac -Xlint arguement
[ https://issues.apache.org/jira/browse/NUTCH-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13663973#comment-13663973 ] Hudson commented on NUTCH-1249: --- Integrated in Nutch-nutchgora #614 (See [https://builds.apache.org/job/Nutch-nutchgora/614/]) NUTCH-1249 and NUTCH-1275 : Resolve all issues flagged up by adding javac -Xlint argument (Revision 1485125) Result = FAILURE tejasp : http://svn.apache.org/viewvc/nutch/branches/2.x/?view=rev&rev=1485125 Files : * /nutch/branches/2.x/CHANGES.txt * /nutch/branches/2.x/build.xml * /nutch/branches/2.x/src/java/org/apache/nutch/api/ConfResource.java * /nutch/branches/2.x/src/java/org/apache/nutch/api/DbReader.java * /nutch/branches/2.x/src/java/org/apache/nutch/api/JobResource.java * /nutch/branches/2.x/src/java/org/apache/nutch/api/impl/RAMJobManager.java * /nutch/branches/2.x/src/java/org/apache/nutch/crawl/NutchWritable.java * /nutch/branches/2.x/src/java/org/apache/nutch/metadata/Metadata.java * /nutch/branches/2.x/src/java/org/apache/nutch/metadata/SpellCheckedMetadata.java * /nutch/branches/2.x/src/java/org/apache/nutch/net/URLNormalizers.java * /nutch/branches/2.x/src/java/org/apache/nutch/plugin/Extension.java * /nutch/branches/2.x/src/java/org/apache/nutch/plugin/PluginDescriptor.java * /nutch/branches/2.x/src/java/org/apache/nutch/plugin/PluginRepository.java * /nutch/branches/2.x/src/java/org/apache/nutch/storage/Host.java * /nutch/branches/2.x/src/java/org/apache/nutch/storage/ParseStatus.java * /nutch/branches/2.x/src/java/org/apache/nutch/storage/ProtocolStatus.java * /nutch/branches/2.x/src/java/org/apache/nutch/storage/WebPage.java * /nutch/branches/2.x/src/java/org/apache/nutch/tools/ResolveUrls.java * /nutch/branches/2.x/src/java/org/apache/nutch/util/GenericWritableConfigurable.java * /nutch/branches/2.x/src/java/org/apache/nutch/util/PrefixStringMatcher.java * /nutch/branches/2.x/src/java/org/apache/nutch/util/SuffixStringMatcher.java *
/nutch/branches/2.x/src/java/org/apache/nutch/util/ToolUtil.java * /nutch/branches/2.x/src/plugin/lib-regex-filter/src/java/org/apache/nutch/urlfilter/api/RegexURLFilterBase.java * /nutch/branches/2.x/src/plugin/lib-regex-filter/src/test/org/apache/nutch/urlfilter/api/RegexURLFilterBaseTest.java * /nutch/branches/2.x/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMBuilder.java * /nutch/branches/2.x/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java * /nutch/branches/2.x/src/plugin/protocol-ftp/src/java/org/apache/nutch/protocol/ftp/Client.java * /nutch/branches/2.x/src/plugin/protocol-ftp/src/java/org/apache/nutch/protocol/ftp/FtpResponse.java * /nutch/branches/2.x/src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/DummyX509TrustManager.java * /nutch/branches/2.x/src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/HttpAuthenticationException.java * /nutch/branches/2.x/src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/HttpAuthenticationFactory.java * /nutch/branches/2.x/src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/HttpBasicAuthentication.java * /nutch/branches/2.x/src/plugin/protocol-sftp/src/java/org/apache/nutch/protocol/sftp/Sftp.java * /nutch/branches/2.x/src/plugin/subcollection/src/java/org/apache/nutch/collection/CollectionManager.java * /nutch/branches/2.x/src/plugin/subcollection/src/java/org/apache/nutch/collection/Subcollection.java * /nutch/branches/2.x/src/plugin/urlfilter-prefix/src/java/org/apache/nutch/urlfilter/prefix/PrefixURLFilter.java * /nutch/branches/2.x/src/plugin/urlfilter-suffix/src/java/org/apache/nutch/urlfilter/suffix/SuffixURLFilter.java * /nutch/branches/2.x/src/plugin/urlnormalizer-regex/src/test/org/apache/nutch/net/urlnormalizer/regex/TestRegexURLNormalizer.java Resolve all issues flagged up by adding javac -Xlint arguement -- Key: NUTCH-1249 URL: 
https://issues.apache.org/jira/browse/NUTCH-1249 Project: Nutch Issue Type: Improvement Components: build Affects Versions: nutchgora Reporter: Lewis John McGibbney Assignee: Tejas Patil Priority: Minor Fix For: 1.7, 2.2 Attachments: NUTCH-1249.trunk.patch There are a heap of issues flagged up by NUTCH-1237; I think over time it would be great to get these addressed and resolved. What is interesting is that adding the same arguments to /src/plugin/plugin-build.xml actually breaks my build, as tests begin to fail. Some of this is documented in the link below: http://docs.oracle.com/javase/1.5.0/docs/tooldocs/windows/javac.html#options
[jira] [Commented] (NUTCH-1569) Upgrade 2.x to Gora 0.3
[ https://issues.apache.org/jira/browse/NUTCH-1569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13663974#comment-13663974 ] Hudson commented on NUTCH-1569: --- Integrated in Nutch-nutchgora #614 (See [https://builds.apache.org/job/Nutch-nutchgora/614/]) NUTCH-1569 Upgrade 2.x to Gora 0.3 (Revision 1485044) Result = FAILURE lewismc : http://svn.apache.org/viewvc/nutch/branches/2.x/?view=rev&rev=1485044 Files : * /nutch/branches/2.x/CHANGES.txt * /nutch/branches/2.x/build.xml * /nutch/branches/2.x/ivy/ivy.xml * /nutch/branches/2.x/src/java/org/apache/nutch/api/DbReader.java * /nutch/branches/2.x/src/java/org/apache/nutch/crawl/WebTableReader.java * /nutch/branches/2.x/src/java/org/apache/nutch/host/HostDb.java * /nutch/branches/2.x/src/java/org/apache/nutch/host/HostDbReader.java * /nutch/branches/2.x/src/test/org/apache/nutch/crawl/TestGenerator.java * /nutch/branches/2.x/src/test/org/apache/nutch/crawl/TestInjector.java * /nutch/branches/2.x/src/test/org/apache/nutch/fetcher/TestFetcher.java * /nutch/branches/2.x/src/test/org/apache/nutch/storage/TestGoraStorage.java * /nutch/branches/2.x/src/test/org/apache/nutch/util/AbstractNutchTest.java * /nutch/branches/2.x/src/test/org/apache/nutch/util/CrawlTestUtil.java Upgrade 2.x to Gora 0.3 --- Key: NUTCH-1569 URL: https://issues.apache.org/jira/browse/NUTCH-1569 Project: Nutch Issue Type: Improvement Components: build, storage Affects Versions: 2.2 Reporter: Lewis John McGibbney Assignee: Lewis John McGibbney Fix For: 2.2 Attachments: NUTCH-1569.patch, NUTCH-1569.v2.patch We just released the Maven artifacts and I would like to upgrade before we push the RC for 2.2 :) Patch coming up
Build failed in Jenkins: Nutch-nutchgora #614
See https://builds.apache.org/job/Nutch-nutchgora/614/changes Changes: [tejasp] NUTCH-1249 and NUTCH-1275 : Resolve all issues flagged up by adding javac -Xlint argument [lewismc] NUTCH-1569 Upgrade 2.x to Gora 0.3 -- [...truncated 1674 lines...] [javac] import org.apache.gora.mapreduce.GoraMapper; [javac] ^ [javac] /x1/jenkins/jenkins-slave/workspace/Nutch-nutchgora/2.x/src/java/org/apache/nutch/crawl/GeneratorMapper.java:36: error: cannot find symbol [javac] extends GoraMapper<String, WebPage, SelectorEntry, WebPage> { [javac] ^ [javac] symbol: class GoraMapper [javac] /x1/jenkins/jenkins-slave/workspace/Nutch-nutchgora/2.x/src/java/org/apache/nutch/crawl/GeneratorMapper.java:50: error: cannot find symbol [javac] Context context) throws IOException, InterruptedException { [javac] ^ [javac] symbol: class Context [javac] location: class GeneratorMapper [javac] /x1/jenkins/jenkins-slave/workspace/Nutch-nutchgora/2.x/src/java/org/apache/nutch/crawl/GeneratorMapper.java:109: error: cannot find symbol [javac] public void setup(Context context) { [javac] ^ [javac] symbol: class Context [javac] location: class GeneratorMapper [javac] /x1/jenkins/jenkins-slave/workspace/Nutch-nutchgora/2.x/src/java/org/apache/nutch/fetcher/FetcherJob.java:48: error: package org.apache.gora.mapreduce does not exist [javac] import org.apache.gora.mapreduce.GoraMapper; [javac] ^ [javac] /x1/jenkins/jenkins-slave/workspace/Nutch-nutchgora/2.x/src/java/org/apache/nutch/crawl/GeneratorReducer.java:32: error: package org.apache.gora.mapreduce does not exist [javac] import org.apache.gora.mapreduce.GoraReducer; [javac] ^ [javac] /x1/jenkins/jenkins-slave/workspace/Nutch-nutchgora/2.x/src/java/org/apache/nutch/crawl/GeneratorReducer.java:41: error: cannot find symbol [javac] extends GoraReducer<SelectorEntry, WebPage, String, WebPage> { [javac] ^ [javac] symbol: class GoraReducer [javac] /x1/jenkins/jenkins-slave/workspace/Nutch-nutchgora/2.x/src/java/org/apache/nutch/crawl/GeneratorReducer.java:52:
error: cannot find symbol [javac] Context context) throws IOException, InterruptedException { [javac] ^ [javac] symbol: class Context [javac] location: class GeneratorReducer [javac] /x1/jenkins/jenkins-slave/workspace/Nutch-nutchgora/2.x/src/java/org/apache/nutch/crawl/GeneratorReducer.java:90: error: cannot find symbol [javac] protected void setup(Context context) [javac]^ [javac] symbol: class Context [javac] location: class GeneratorReducer [javac] /x1/jenkins/jenkins-slave/workspace/Nutch-nutchgora/2.x/src/java/org/apache/nutch/crawl/InjectorJob.java:29: error: package org.apache.gora.mapreduce does not exist [javac] import org.apache.gora.mapreduce.GoraOutputFormat; [javac] ^ [javac] /x1/jenkins/jenkins-slave/workspace/Nutch-nutchgora/2.x/src/java/org/apache/nutch/crawl/InjectorJob.java:30: error: package org.apache.gora.persistency does not exist [javac] import org.apache.gora.persistency.Persistent; [javac] ^ [javac] /x1/jenkins/jenkins-slave/workspace/Nutch-nutchgora/2.x/src/java/org/apache/nutch/crawl/InjectorJob.java:31: error: package org.apache.gora.store does not exist [javac] import org.apache.gora.store.DataStore; [javac] ^ [javac] /x1/jenkins/jenkins-slave/workspace/Nutch-nutchgora/2.x/src/java/org/apache/nutch/fetcher/FetchEntry.java:28: error: package org.apache.gora.util does not exist [javac] import org.apache.gora.util.IOUtils; [javac]^ [javac] /x1/jenkins/jenkins-slave/workspace/Nutch-nutchgora/2.x/src/java/org/apache/nutch/crawl/WebTableReader.java:57: error: package org.apache.gora.mapreduce does not exist [javac] import org.apache.gora.mapreduce.GoraMapper; [javac] ^ [javac] /x1/jenkins/jenkins-slave/workspace/Nutch-nutchgora/2.x/src/java/org/apache/nutch/crawl/WebTableReader.java:58: error: package org.apache.gora.query does not exist [javac] import org.apache.gora.query.Query; [javac] ^ [javac] /x1/jenkins/jenkins-slave/workspace/Nutch-nutchgora/2.x/src/java/org/apache/nutch/crawl/WebTableReader.java:59: error: package 
org.apache.gora.query does not exist [javac] import org.apache.gora.query.Result; [javac] ^ [javac] /x1/jenkins/jenkins-slave/workspace/Nutch-nutchgora/2.x/src/java/org/apache/nutch/crawl/WebTableReader.java:60: error: package org.apache.gora.store does not exist [javac] import
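The errors above all say that the org.apache.gora packages could not be found at compile time. A quick way to check whether a given classpath contains them is a small probe like the one below. This is only a diagnostic sketch: ClasspathProbe is a hypothetical helper and not part of Nutch or Gora, the class names are the ones listed in the javac output, and it has to be run with the same classpath that the build passes to javac for the result to mean anything.

```java
import java.util.ArrayList;
import java.util.List;

public class ClasspathProbe {
    /** Returns true if the named class can be loaded, without initializing it. */
    static boolean isPresent(String className) {
        try {
            Class.forName(className, false, ClasspathProbe.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        } catch (LinkageError e) {
            // The class was found but one of its own dependencies is missing or broken.
            return false;
        }
    }

    /** Returns the subset of the given class names that cannot be loaded. */
    static List<String> missing(String... classNames) {
        List<String> result = new ArrayList<String>();
        for (String name : classNames) {
            if (!isPresent(name)) {
                result.add(name);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Class names copied from the compiler errors in the build log.
        List<String> absent = missing(
            "org.apache.gora.mapreduce.GoraMapper",
            "org.apache.gora.mapreduce.GoraReducer",
            "org.apache.gora.mapreduce.GoraOutputFormat",
            "org.apache.gora.persistency.Persistent",
            "org.apache.gora.store.DataStore",
            "org.apache.gora.query.Query",
            "org.apache.gora.query.Result",
            "org.apache.gora.util.IOUtils");
        for (String name : absent) {
            System.out.println("missing: " + name);
        }
    }
}
```

If every Gora class is reported missing, the likely place to look is dependency resolution (the commit for NUTCH-1569 touched ivy/ivy.xml) rather than the Nutch sources themselves.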
[jira] [Created] (NUTCH-1575) support solr authentication in nutch 2.x
lufeng created NUTCH-1575: - Summary: support solr authentication in nutch 2.x Key: NUTCH-1575 URL: https://issues.apache.org/jira/browse/NUTCH-1575 Project: Nutch Issue Type: Improvement Components: indexer Affects Versions: 2.1 Reporter: lufeng Assignee: lufeng Priority: Minor Fix For: 2.2 Enable Solr authentication in Nutch 2.x, as in 1.x.
[jira] [Work started] (NUTCH-1575) support solr authentication in nutch 2.x
[ https://issues.apache.org/jira/browse/NUTCH-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on NUTCH-1575 started by lufeng. support solr authentication in nutch 2.x Key: NUTCH-1575 URL: https://issues.apache.org/jira/browse/NUTCH-1575 Project: Nutch Issue Type: Improvement Components: indexer Affects Versions: 2.1 Reporter: lufeng Assignee: lufeng Priority: Minor Fix For: 2.2 Enable Solr authentication in Nutch 2.x, as in 1.x.
[jira] [Updated] (NUTCH-1575) support solr authentication in nutch 2.x
[ https://issues.apache.org/jira/browse/NUTCH-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lufeng updated NUTCH-1575: -- Attachment: NUTCH-1575.patch Add Solr authentication. support solr authentication in nutch 2.x Key: NUTCH-1575 URL: https://issues.apache.org/jira/browse/NUTCH-1575 Project: Nutch Issue Type: Improvement Components: indexer Affects Versions: 2.1 Reporter: lufeng Assignee: lufeng Priority: Minor Fix For: 2.2 Attachments: NUTCH-1575.patch Enable Solr authentication in Nutch 2.x, as in 1.x.
[jira] [Commented] (NUTCH-1563) FetchSchedule#getFields is never used by GeneraterJob
[ https://issues.apache.org/jira/browse/NUTCH-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13664408#comment-13664408 ] Tejas Patil commented on NUTCH-1563: I think this is relevant only to 2.x, and [~amuseme.lu] has pushed the patch to svn. Is there any work left here? FetchSchedule#getFields is never used by GeneraterJob - Key: NUTCH-1563 URL: https://issues.apache.org/jira/browse/NUTCH-1563 Project: Nutch Issue Type: Bug Components: generator Affects Versions: 2.1 Reporter: lufeng Assignee: lufeng Priority: Minor Fix For: 2.3 Attachments: NUTCH-1563.patch The getFields method in FetchSchedule is never used, so if a user extends FetchSchedule and wants to read some fields of WebPage, it always returns null.
[jira] [Updated] (NUTCH-1566) bin/nutch to allow whitespace in paths
[ https://issues.apache.org/jira/browse/NUTCH-1566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1566: --- Attachment: NUTCH-1566-v2-trunk.patch New patch including [~tejas.patil]'s suggestions. Also removed forgotten debug output :) bin/nutch to allow whitespace in paths -- Key: NUTCH-1566 URL: https://issues.apache.org/jira/browse/NUTCH-1566 Project: Nutch Issue Type: Bug Affects Versions: 1.6, 2.1 Reporter: Sebastian Nagel Priority: Minor Fix For: 2.4 Attachments: NUTCH-1566-trunk.patch, NUTCH-1566-v2-trunk.patch bin/nutch and bin/crawl choke if a path contains whitespace, e.g., if JAVA_HOME is {{C:\Program Files\jdk}}. If you don't have permission to change the path, it is impossible to run Nutch. This has been reported frequently ([1|http://stackoverflow.com/questions/9345629/nutch-cygwin-how-to-set-java-home], [2|http://lucene.472066.n3.nabble.com/Problem-running-Nutch-on-Win-7-Cygwin-td3487163.html], and [3|http://nutchinstall.blogspot.de/2007/07/setting-up-cygwin-and-nutch.html]), see also NUTCH-19.
[jira] [Reopened] (NUTCH-1569) Upgrade 2.x to Gora 0.3
[ https://issues.apache.org/jira/browse/NUTCH-1569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney reopened NUTCH-1569: - Upgrade 2.x to Gora 0.3 --- Key: NUTCH-1569 URL: https://issues.apache.org/jira/browse/NUTCH-1569 Project: Nutch Issue Type: Improvement Components: build, storage Affects Versions: 2.2 Reporter: Lewis John McGibbney Assignee: Lewis John McGibbney Fix For: 2.2 Attachments: NUTCH-1569.patch, NUTCH-1569.v2.patch We just released the Maven artifacts and I would like to upgrade before we push the RC for 2.2 :) Patch coming up