date:20130522

unable to build 2.x

2013-05-22 Thread Tejas Patil

Hi nutch-dev,

I took a *fresh* checkout of 2.x and tried to build it (ant clean runtime).
I get a lot of compilation errors. When I first saw that on the terminal,
I said to my laptop: Are you kidding me? I retried twice and the same
thing still happens.

I am checking why the build fails on my machine.
Curious to know: is it just me, or is everybody else seeing this?

[jira] [Updated] (NUTCH-356) Plugin repository cache can lead to memory leak

2013-05-22 Thread Sebastian Nagel (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-356:
--

Fix Version/s: 1.8

 Plugin repository cache can lead to memory leak
 ---

 Key: NUTCH-356
 URL: https://issues.apache.org/jira/browse/NUTCH-356
 Project: Nutch
  Issue Type: Bug
Affects Versions: 0.8
Reporter: Enrico Triolo
 Fix For: 2.3, 1.8

 Attachments: ASF.LICENSE.NOT.GRANTED--NutchTest.java, 
 ASF.LICENSE.NOT.GRANTED--patch.txt, cache_classes.patch


 While I was trying to solve a problem I reported a while ago (see Nutch-314), 
 I found out that actually the problem was related to the plugin cache used in 
 class PluginRepository.java.
 As I said in Nutch-314, I think I somehow 'force' the way nutch is meant to 
 work, since I need to frequently submit new urls and append their contents to 
 the index; I don't (and I can't) have a urls.txt file with all urls I'm 
 going to fetch, but I recreate it each time a new url is submitted.
 Thus, I think in the majority of cases you won't have problems using nutch 
 as-is, since the problem I found occurs only if nutch is used in a way 
 similar to the one I use.
 To simplify your test I'm attaching a class that performs something similar 
 to what I need. It fetches and indexes some sample urls; to avoid webmasters' 
 complaints I left the sample urls list empty, so you should modify the source 
 code and add some urls.
 Then you only have to run it and watch your memory consumption with top. In 
 my experience I get an OutOfMemoryError after a couple of minutes, but it 
 clearly depends on your heap settings and on the plugins you are using (I'm 
 using 
 'protocol-file|protocol-http|parse-(rss|html|msword|pdf|text)|language-identifier|index-(basic|more)|query-(basic|more|site|url)|urlfilter-regex|summary-basic|scoring-opic').
 The problem is bound to the PluginRepository 'singleton' instance, since it 
 never gets released. It seems that some class maintains a reference to it and 
 this class is never released since it is cached somewhere in the 
 configuration.
 So I modified the PluginRepository's 'get' method so that it never uses the 
 cache and always returns a new instance (you can find the patch in 
 attachment). This way the memory consumption is always stable and I get no 
 OOM anymore.
 Clearly this is not the solution, since I guess there are many performance 
 issues involved, but for the moment it works.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (NUTCH-840) Port tests from parse-html to parse-tika

2013-05-22 Thread Sebastian Nagel (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-840:
--

Fix Version/s: 1.8

 Port tests from parse-html to parse-tika
 

 Key: NUTCH-840
 URL: https://issues.apache.org/jira/browse/NUTCH-840
 Project: Nutch
  Issue Type: Task
  Components: parser
Affects Versions: 1.1, 1.6
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 2.3, 1.8

 Attachments: NUTCH-840.patch, NUTCH-840.patch, NUTCH-840-trunk.patch, 
 NUTCH-840v2.patch


 We don't have tests for HTML in parse-tika, so I'll copy them from the old 
 parse-html plugin.


[jira] [Updated] (NUTCH-410) Faster RegexNormalize with more features

2013-05-22 Thread Sebastian Nagel (JIRA)

[
https://issues.apache.org/jira/browse/NUTCH-410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Sebastian Nagel updated NUTCH-410:
--

Fix Version/s: 1.8

Faster RegexNormalize with more features

Key: NUTCH-410
URL: https://issues.apache.org/jira/browse/NUTCH-410
Project: Nutch
Issue Type: Improvement
Components: fetcher
Affects Versions: 0.8
Environment: Tested on MacOS X 10.4.7/10.4.8
Reporter: Doug Cook
Priority: Minor
Fix For: 2.3, 1.8

Attachments: betterRegexNorm.patch

The patch associated with this is backwards-compatible and has several
improvements over the stock 0.8 RegexURLNormalizer:
1) About a 34% performance improvement, from only executing the superclass
(BasicURLNormalizer) once in most cases, instead of twice as the stock
version did.
2) Support for expensive host-specific normalizations with good performance.
Each regex block optionally takes a list of hosts to which to apply the
associated regex. If supplied, the regex will only be applied to these hosts.
This should have scalable performance; the comparison is O(1) regardless of
the number of hosts. The format is:
<regex>
  <host>www.host1.com</host>
  <host>host2.site2.com</host>
  <pattern> my pattern here </pattern>
  <substitution> my substitution here </substitution>
</regex>
3) Support for decoding URLs with escaped character encodings (e.g. %20,
etc.). This is useful, for example, to decode jump redirects which have the
target URL encoded within the source, as on Yahoo. I tried to create an
extensible notion of options, the first of which is unescape. The
unescape function is applied *after* the substitution and *only* if the
substitution pattern matches. A simple pattern to unescape Yahoo directory
redirects would be something like:
<regex>
  <pattern>^http://[a-z\.]*\.yahoo\.com/.*/\*+(http[^&amp;]+)</pattern>
  <substitution>$1</substitution>
  <options>unescape</options>
</regex>
4) Added the notion of iterating the pattern chain. This is useful when the
result of a normalization can itself be normalized. While some of this can be
handled in the stock version by repeating patterns, or by careful ordering of
patterns, the notion of iterating is cleaner and more powerful. The chain is
defined to iterate only when the previous iteration changes the input, up to
a configurable maximum number of iterations. The config parameter to change
is: urlnormalizer.regex.maxiterations, which defaults to 1 (previous
behavior). The change is performance-neutral when disabled, and has a
relatively small performance cost when enabled.
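
The iteration rule in item 4 can be sketched roughly as follows. This is only an illustration and not the code from betterRegexNorm.patch: the class name, the method signature, and the use of a plain Map of patterns are assumptions, and only the "repeat while the previous pass changed the input, up to a maximum" behavior comes from the description above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical sketch; names and structure are not taken from the patch.
final class IterativeNormalizerSketch {

    /**
     * Runs the whole rule chain once, then repeats it only while the
     * previous pass changed the URL, up to maxIterations passes.
     */
    static String normalize(String url, Map<Pattern, String> rules, int maxIterations) {
        String current = url;
        for (int i = 0; i < maxIterations; i++) {
            String next = current;
            for (Map.Entry<Pattern, String> rule : rules.entrySet()) {
                next = rule.getKey().matcher(next).replaceAll(rule.getValue());
            }
            if (next.equals(current)) {
                break; // nothing changed, so another pass cannot help
            }
            current = next;
        }
        return current;
    }

    public static void main(String[] args) {
        Map<Pattern, String> rules = new LinkedHashMap<>();
        rules.put(Pattern.compile("/\\./"), "/"); // collapse "/./" path segments
        System.out.println(normalize("http://a.com/././b", rules, 1));
        System.out.println(normalize("http://a.com/././b", rules, 5));
    }
}
```

With a limit of 1 the chain runs exactly once, which corresponds to the previous behavior; a higher limit lets a rule whose output can itself be normalized converge.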
Pardon any potentially unconventional Java on my part. I've got lots of C/C++
search engine experience, but Nutch is my first large Java app. I welcome any
feedback, and hope this is useful.
Doug


[jira] [Updated] (NUTCH-1253) Incompatible neko and xerces versions

2013-05-22 Thread Sebastian Nagel (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-1253:
---

Fix Version/s: 1.8

 Incompatible neko and xerces versions
 -

 Key: NUTCH-1253
 URL: https://issues.apache.org/jira/browse/NUTCH-1253
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.4
 Environment: Ubuntu 10.04
Reporter: Dennis Spathis
Assignee: Lewis John McGibbney
 Fix For: 2.3, 1.8

 Attachments: NUTCH-1253-2.x-v2.patch, NUTCH-1253-nutchgora.patch, 
 NUTCH-1253.patch, TEST-org.apache.nutch.parse.html.TestDOMContentUtils.txt


 The Nutch 1.4 distribution includes
  - nekohtml-0.9.5.jar (under .../runtime/local/plugins/lib-nekohtml)
  - xercesImpl-2.9.1.jar (under .../runtime/local/lib)
 These two JARs appear to be incompatible versions. When the HtmlParser 
 (configured to use neko) is invoked during a local-mode crawl, the parse 
 fails due to an AbstractMethodError. (Note: To see the AbstractMethodError, 
 rebuild the HtmlParser plugin and add a
 catch(Throwable) clause in the getParse method to log the stacktrace.)
 I found that substituting a later, compatible version of nekohtml (1.9.11)
 fixes the problem.
 Curiously, and in support of the above, the nekohtml plugin.xml file in
 Nutch 1.4 contains the following:
 <plugin
    id="lib-nekohtml"
    name="CyberNeko HTML Parser"
    version="1.9.11"
    provider-name="org.cyberneko">
    <runtime>
       <library name="nekohtml-0.9.5.jar">
          <export name="*"/>
       </library>
    </runtime>
 </plugin>
 Note the conflicting version numbers (version tag is 1.9.11 but the
 specified library is nekohtml-0.9.5.jar).
 Was the 0.9.5 version included by mistake? Was the intention rather to
 include 1.9.11?


[jira] [Updated] (NUTCH-1190) MoreIndexingFilter refactor: move data formats used to parse lastModified to a config file.

2013-05-22 Thread Sebastian Nagel (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-1190:
---

Fix Version/s: 1.8

 MoreIndexingFilter refactor: move data formats used to parse lastModified 
 to a config file.
 -

 Key: NUTCH-1190
 URL: https://issues.apache.org/jira/browse/NUTCH-1190
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Affects Versions: 1.4
 Environment: jdk6
Reporter: Zhang JinYan
 Fix For: 2.3, 1.8

 Attachments: date-styles.txt, MoreIndexingFilter.patch, 
 NUTCH-1190-trunk.patch


 There are many issues about missing date formats:
 [NUTCH-871|https://issues.apache.org/jira/browse/NUTCH-871]
 [NUTCH-912|https://issues.apache.org/jira/browse/NUTCH-912]
 [NUTCH-1015|https://issues.apache.org/jira/browse/NUTCH-1015]
 The date formats can be diverse, so why not move those date formats to an 
 extra config file?
 I moved all the date formats from MoreIndexingFilter.java to a file named 
 date-styles.txt (placed in conf), which will be loaded on startup.
 {code}
   public void setConf(Configuration conf) {
     this.conf = conf;
     MIME = new MimeUtil(conf);

     // Load the date patterns from conf/date-styles.txt
     URL res = conf.getResource("date-styles.txt");
     if (res == null) {
       LOG.error("Can't find resource: date-styles.txt");
     } else {
       try {
         List lines = FileUtils.readLines(new File(res.getFile()));
         for (int i = 0; i < lines.size(); i++) {
           String dateStyle = (String) lines.get(i);
           // Drop blank lines
           if (StringUtils.isBlank(dateStyle)) {
             lines.remove(i);
             i--;
             continue;
           }
           dateStyle = StringUtils.trim(dateStyle);
           // Drop comment lines
           if (dateStyle.startsWith("#")) {
             lines.remove(i);
             i--;
             continue;
           }
           lines.set(i, dateStyle);
         }
         dateStyles = new String[lines.size()];
         lines.toArray(dateStyles);
       } catch (IOException e) {
         LOG.error("Failed to load resource: date-styles.txt");
       }
     }
   }
 {code}
 Then parse lastModified like this(sample):
 {code}
   private long getTime(String date, String url) {
 ..
 Date parsedDate = DateUtils.parseDate(date, dateStyles);
 time = parsedDate.getTime();
 ..
 return time;
   }
 {code}
 This patch also contains the patch of 
 [NUTCH-1140|https://issues.apache.org/jira/browse/NUTCH-1140].
 Find more details in the patch file.
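
 The approach described above (read patterns from a file, skip blank and comment lines, then try each pattern in turn) can be sketched in a self-contained way as follows. This is an illustration under assumptions, not the attached patch: it swaps the commons-lang DateUtils.parseDate call for the JDK's SimpleDateFormat, and the class name, method names, the fixed GMT time zone and the -1 return value for "no pattern matched" are all hypothetical.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical sketch; not the code from MoreIndexingFilter.patch.
final class DateStylesSketch {

    /** Keeps only usable patterns: trims each line, skips blanks and '#' comments. */
    static List<String> loadStyles(List<String> lines) {
        List<String> styles = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            styles.add(trimmed);
        }
        return styles;
    }

    /** Tries each pattern in order; returns epoch millis, or -1 if none matches. */
    static long parseTime(String date, List<String> styles) {
        for (String style : styles) {
            SimpleDateFormat format = new SimpleDateFormat(style, Locale.US);
            format.setTimeZone(TimeZone.getTimeZone("GMT")); // assumption: dates are GMT
            format.setLenient(false);
            try {
                return format.parse(date).getTime();
            } catch (ParseException e) {
                // this pattern did not match, try the next one
            }
        }
        return -1L;
    }
}
```

 Keeping the patterns in a list makes the order in the file significant: the first pattern that parses wins, so more specific patterns would normally go first.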


[jira] [Updated] (NUTCH-797) parse-tika is not properly constructing URLs when the target begins with a ?

2013-05-22 Thread Sebastian Nagel (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-797:
--

Fix Version/s: 1.8

 parse-tika is not properly constructing URLs when the target begins with a ?
 --

 Key: NUTCH-797
 URL: https://issues.apache.org/jira/browse/NUTCH-797
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 1.1, nutchgora
 Environment: Win 7, Java(TM) SE Runtime Environment (build 
 1.6.0_16-b01)
 Also repro's on RHEL and java 1.4.2
Reporter: Robert Hohman
Assignee: Andrzej Bialecki 
Priority: Minor
 Fix For: 2.3, 1.8

 Attachments: NUTCH-797.patch, pureQueryUrl-2.patch, pureQueryUrl.patch


 This is my first bug and patch on nutch, so apologies if I have not provided 
 enough detail.
 In crawling the page at 
 http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0 there are 
 links in the page that look like this:
 <a href="?co=0&sk=0&p=2&pi=1">2</a></td><td><a 
 href="?co=0&sk=0&p=3&pi=1">3</a>
 In org.apache.nutch.parse.tika.DOMContentUtils rev 916362 (trunk), as 
 getOutlinks looks for links, it comes across this link and constructs a new 
 url with a base URL class built from 
 "http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0", and a 
 target of "?co=0&sk=0&p=2&pi=1".
 The URL class, per RFC 3986 at 
 http://labs.apache.org/webarch/uri/rfc/rfc3986.html#relative-merge, defines 
 how to merge these two, and per the RFC, the URL class merges these to: 
 http://careers3.accenture.com/Careers/ASPX/?co=0&sk=0&p=2&pi=1
 because the RFC explicitly states that the rightmost url segment (the 
 Search.aspx in this case) should be ripped off before combining.
 While this is compliant with the RFC, it means the URLs which are created for 
 the next round of fetching are incorrect.  Modern browsers seem to handle 
 this case (I checked IE8 and Firefox 3.5), so I'm guessing this is an obscure 
 exception or handling of what is a poorly formed url on accenture's part.
 I have fixed this by modifying DOMContentUtils to look for the case where a ? 
 begins the target, and then pulling the rightmost component out of the base 
 and inserting it into the target before the ?, so the target in this example 
 becomes:
 Search.aspx?co=0&sk=0&p=2&pi=1
 The URL class then properly constructs the new url as:
 http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0&p=2&pi=1
 If it is agreed that this solution works, I believe the other html parsers in 
 nutch would need to be modified in a similar way.
 Can I get feedback on this proposed solution?  Specifically I'm worried about 
 unforeseen side effects.
 Much thanks
 Here is the patch info:
 Index: 
 src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java
 ===
 --- 
 src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java
(revision 916362)
 +++ 
 src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java
(working copy)
 @@ -299,6 +299,50 @@
  return false;
}

 +  private URL fixURL(URL base, String target) throws MalformedURLException
 +  {
 +   // handle params that are embedded into the base url - move them to target
 +   // so URL class constructs the new url class properly
 +   if  (base.toString().indexOf(';') > 0)  
 +  return fixEmbeddedParams(base, target);
 +   
 +   // handle the case that there is a target that is a pure query.
 +   // Strictly speaking this is a violation of RFC 2396 section 5.2.2 on how to assemble
 +   // URLs but I've seen this in numerous places, for example at
 +   // http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0
 +   // It has urls in the page of the form href="?co=0&sk=0&pg=1", and by default
 +   // URL constructs the base+target combo as 
 +   // http://careers3.accenture.com/Careers/ASPX/?co=0&sk=0&pg=1, incorrectly
 +   // dropping the Search.aspx target
 +   //
 +   // Browsers handle these just fine, they must have an exception similar to this
 +   if (target.startsWith("?"))
 +   {
 +   return fixPureQueryTargets(base, target);
 +   }
 +   
 +   return new URL(base, target);
 +  }
 +  
 +  private URL fixPureQueryTargets(URL base, String target) throws MalformedURLException
 +  {
 + if (!target.startsWith("?"))
 + return new URL(base, target);
 +
 + String basePath = base.getPath();
 + String baseRightMost="";
 + int baseRightMostIdx = basePath.lastIndexOf("/");
 + if (baseRightMostIdx != -1)
 + {
 + baseRightMost = basePath.substring(baseRightMostIdx+1);
 + }
 + 
 + if

[jira] [Updated] (NUTCH-566) Sun's URL class has bug in creation of relative query URLs

2013-05-22 Thread Sebastian Nagel (JIRA)

[
https://issues.apache.org/jira/browse/NUTCH-566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Sebastian Nagel updated NUTCH-566:
--

Fix Version/s: 1.8

Sun's URL class has bug in creation of relative query URLs
--

Key: NUTCH-566
URL: https://issues.apache.org/jira/browse/NUTCH-566
Project: Nutch
Issue Type: Bug
Components: fetcher
Affects Versions: 0.8, 0.8.1, 0.9.0
Environment: MacOS X and Linux (CentOS 4.5) both
Reporter: Doug Cook
Priority: Minor
Fix For: 2.3, 1.8

Attachments: RelativeURL.java

I'm using 0.8.1, but this will affect all other versions as well.
Relative links of the form "?blah" are resolved incorrectly. For example,
with a base URL of "http://www.fleurie.org/entreprise.asp", and a relative link
of "?id_entrep=111", Nutch will resolve this pair to the link
"http://www.fleurie.org/?id_entrep=111". No such URL exists, and all browsers
I tried will resolve the pair to
"http://www.fleurie.org/entreprise.asp?id_entrep=111".
I tracked this down to what could be called a bug in Sun's URL class.
According to Sun's spec, they parse the relative URL according to RFC 2396.
But the original RFC for relative links was RFC 1808, and the two RFCs differ
in how they handle relative links beginning with ?. Most browsers
(Netscape/Mozilla, IE, Safari) implemented RFC 1808, and stuck with it (for
compatibility and also because the behavior makes more sense). Apparently
even the people that wrote RFC 2396 recognized that this was a mistake, and
the specified behavior was changed in RFC 3986 to match what browsers do.
For a discussion of this, see
http://gbiv.com/protocols/uri/rev-2002/issues.html#003-relative-query
Sun's URL implementation, however, still implements RFC2396, as far as I can
tell, and is out of step with the rest of the world.
This breaks link extraction on a number of sites.
I implemented a simple workaround, which I'm attaching. It is a static method
to create URLs which behaves exactly as new URL(URL base, String
relativePath), and I use it as a drop-in replacement for that in
DOMContentUtils, Javascript link extraction, etc. Obviously, it really only
matters wherever links are extracted. I haven't included the calling code
from DOMContentUtils, etc. because my local versions are largely rewritten,
but it should be pretty obvious.
I put it in the org.apache.nutch.net directory, but obviously feel free to
move it to another place if you feel it belongs there!
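
A workaround of the kind described above might look roughly like the following sketch. It is not the attached RelativeURL.java: the class name, the String-based signature and the choice to wrap MalformedURLException in an unchecked exception are assumptions made to keep the example self-contained. The idea is to prepend the last path segment of the base to a pure-query target, so that the ordinary relative resolution of java.net.URL keeps that segment.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical sketch; not the attached RelativeURL.java.
final class RelativeQueryUrlSketch {

    /** Resolves target against base, keeping the base's last path segment for "?query" targets. */
    static String resolve(String base, String target) {
        try {
            URL baseUrl = new URL(base);
            if (target.startsWith("?")) {
                String path = baseUrl.getPath();
                // Everything after the last '/', e.g. "entreprise.asp" (may be empty)
                String lastSegment = path.substring(path.lastIndexOf('/') + 1);
                return new URL(baseUrl, lastSegment + target).toString();
            }
            // All other targets go through the normal resolution unchanged
            return new URL(baseUrl, target).toString();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }
}
```

Because only targets starting with "?" are rewritten, every other relative link still goes through the standard constructor, which keeps the change easy to reason about at the call sites that extract links.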

[jira] [Updated] (NUTCH-1250) parse-html does not parse links with empty anchor

2013-05-22 Thread Sebastian Nagel (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-1250:
---

Fix Version/s: 1.8

 parse-html does not parse links with empty anchor
 -

 Key: NUTCH-1250
 URL: https://issues.apache.org/jira/browse/NUTCH-1250
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 1.4
Reporter: Andreas Janning
 Fix For: 2.3, 1.8

 Attachments: DOMContentUtils_v1.patch, DOMContentUtils_v2.patch, 
 TestDomContentUitls_v1.patch


 The parse-html plugin does not generate an outlink if the link has no anchor
 For example the following HTML-Code does not create an Outlink:
 {code:html} 
   <a href="example.com"></a>
 {code}
 The JUnit-Test TestDOMContentUtils tries to test this but fails since there 
 is a comment inside the a-Tag.
 {code:title=TestDOMContentUtils.java|borderStyle=solid}
 new String("<html><head><title> title </title>"
 + "</head><body>"
 + "<a href=\"g\"><!--no anchor--></a>"
 + "<a href=\"g1\"> <!--whitespace-->  </a>"
 + "<a href=\"g2\">  <img src=test.gif alt='bla bla'> </a>"
 + "</body></html>"), 
 {code}
 When you remove the comment the test fails.
 {code:title=TestDOMContentUtils.java Test fails|borderStyle=solid}
 new String("<html><head><title> title </title>"
 + "</head><body>"
 + "<a href=\"g\"></a>" // no anchor
 + "<a href=\"g1\"> <!--whitespace-->  </a>"
 + "<a href=\"g2\">  <img src=test.gif alt='bla bla'> </a>"
 + "</body></html>"), 
 {code}


[jira] [Updated] (NUTCH-1562) Order of execution for scoring filters

2013-05-22 Thread Sebastian Nagel (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-1562:
---

Fix Version/s: 1.8

 Order of execution for scoring filters
 --

 Key: NUTCH-1562
 URL: https://issues.apache.org/jira/browse/NUTCH-1562
 Project: Nutch
  Issue Type: Bug
  Components: documentation
Affects Versions: 1.6, 2.1
Reporter: Julien Nioche
 Fix For: 2.3, 1.8

 Attachments: NUTCH-1562-trunk.patch


 The documentation in nutch-default.xml states that :
 {quote}
 <property>
   <name>scoring.filter.order</name>
   <value></value>
   <description>The order in which scoring filters are applied.
   This may be left empty (in which case all available scoring
   filters will be applied in the order defined in plugin-includes
   and plugin-excludes), or a space separated list of implementation
   classes.
   </description>
 </property>
 {quote}
 however if no order is specified the filters are ordered randomly and not in 
 the order defined in plugin-includes.
 The other *order parameters (e.g. urlfilter.order) have a different 
 documentation and are loaded and applied in system defined order which 
 corresponds to what the code does.
 The patch attached is for 1.x and puts the code in accordance with the 
 documentation by ordering the filters according to the order of the plugins, 
 which gives users more control without having to specify the classes 
 explicitly in scoring.filter.order.
 We could extend the same idea to the other *order params.
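
 The documented behavior can be sketched as follows. This is an illustration only and not the code from NUTCH-1562-trunk.patch: the class and method names are hypothetical, filters are represented by their class names as plain strings, and the choice to apply only the listed classes when an order is given is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; not the code from the attached patch.
final class FilterOrderSketch {

    /**
     * pluginOrder: filter class names in the order their plugins are listed.
     * orderProperty: value of scoring.filter.order (may be null or empty).
     */
    static List<String> order(List<String> pluginOrder, String orderProperty) {
        if (orderProperty == null || orderProperty.trim().isEmpty()) {
            // No explicit order: fall back to the plugin order
            return new ArrayList<>(pluginOrder);
        }
        List<String> ordered = new ArrayList<>();
        for (String name : orderProperty.trim().split("\\s+")) {
            if (pluginOrder.contains(name)) {
                ordered.add(name); // assumption: unknown class names are ignored
            }
        }
        return ordered;
    }
}
```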


[jira] [Updated] (NUTCH-409) Add short circuit notion to filters to speedup mixed site/subsite crawling

2013-05-22 Thread Sebastian Nagel (JIRA)

[
https://issues.apache.org/jira/browse/NUTCH-409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Sebastian Nagel updated NUTCH-409:
--

Fix Version/s: 1.8

Add short circuit notion to filters to speedup mixed site/subsite crawling

Key: NUTCH-409
URL: https://issues.apache.org/jira/browse/NUTCH-409
Project: Nutch
Issue Type: Improvement
Components: fetcher
Affects Versions: 0.8
Reporter: Doug Cook
Priority: Minor
Fix For: 2.3, 1.8

Attachments: shortcircuit.patch

In the case where one is crawling a mixture of sites and sub-sites, the
prefix matcher can match the sites quite quickly, but either the regex or
automaton filters are considerably slower matching the sub-sites. In the
current model of AND-ing all the filters together, the pattern-matching
filter will be run on every site that matches the prefix matcher -- even if
that entire site is to be crawled and there are no sub-site patterns. If only
a small portion of the sites actually need sub-site pattern matching, this is
much slower than it should be.
I propose (and attach) a simple modification allowing considerable speedup
for this usage pattern. I define the notion of a "short circuit" match that
means "accept this URL and don't run any of the remaining filters in the
filter chain."
Though with this change, any filter plugin can in theory return a
short-circuit match, I have only implemented the short-circuit match for the
PrefixURLFilter. The configuration file format is backwards-compatible;
shortcircuit matches just have SHORTCIRCUIT: in front of them.
One minor gotcha:
* Because the shortcircuit matches will avoid running any later filters, all
of the site-independent filters need to be BEFORE the PrefixURLFilter in the
chain.
I get my best performance using the following filter chain:
1) The SuffixURLFilter to throw away anything with unwanted extensions
2) The RegexURLFilter to do site-independent cleanup (ad removal, skipping
mailto:, bulletin-board pages, etc.)
3) The PrefixURLFilter, with SHORTCIRCUIT: in front of every site name EXCEPT
the sites needing subsite matching
4) The AutomatonURLFilter to match those sites needing subsite pattern
matching.
I have tens of thousands of sites and an order of magnitude fewer subsites,
so skipping step #4 90% of the time speeds things up considerably (my reduce
time on a round of crawling is down from some 26 hours to less than 10).
There are only two drawbacks to the implementation, and I think they're
pretty minor:
1) Because I pass a special token (_PASS_) in the place of the URL to
implement the short circuit, if for some reason someone wanted to crawl a URL
named _PASS_, there would be problems. I find this highly unlikely, since
that's an invalid URL.
2) The correct behavior of steps #3 and #4 above depends upon coordination of
the config files between the prefix and automaton filters, making an
opportunity for user screwup. I thought about creating a new kind of filter
which essentially combined the prefix and automaton filters' behaviors, took one config
file, and internally handled the short-circuiting. But I think the approach I
took is simpler, cleaner, more flexible, and avoids creating yet another kind
of filter. Coordinating the config files is pretty easy (I generate them
programmatically).
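
The short-circuit mechanism described above can be sketched as follows. This is an illustration and not the code from shortcircuit.patch: filters are modeled as plain functions that return the URL, null (reject) or the _PASS_ token, and the class name, the helper methods and the example.com-style host names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical sketch; not the code from shortcircuit.patch.
final class ShortCircuitChainSketch {
    static final String PASS = "_PASS_";

    /** Runs the chain; null rejects, PASS accepts at once and skips the remaining filters. */
    static String filter(String url, List<UnaryOperator<String>> chain) {
        String current = url;
        for (UnaryOperator<String> f : chain) {
            String result = f.apply(current);
            if (result == null) {
                return null;
            }
            if (PASS.equals(result)) {
                return current;
            }
            current = result;
        }
        return current;
    }

    /** Prefix filter whose rules may carry a SHORTCIRCUIT: marker in front of the prefix. */
    static UnaryOperator<String> prefixFilter(List<String> rules) {
        final String marker = "SHORTCIRCUIT:";
        return url -> {
            for (String rule : rules) {
                boolean shortCircuit = rule.startsWith(marker);
                String prefix = shortCircuit ? rule.substring(marker.length()) : rule;
                if (url.startsWith(prefix)) {
                    return shortCircuit ? PASS : url;
                }
            }
            return null;
        };
    }

    /** Example chain: one fully crawled site, one site restricted to a sub-site pattern. */
    static List<UnaryOperator<String>> demoChain() {
        List<String> rules = new ArrayList<>();
        rules.add("SHORTCIRCUIT:http://whole-site.example/");
        rules.add("http://partial-site.example/");
        List<UnaryOperator<String>> chain = new ArrayList<>();
        chain.add(prefixFilter(rules));
        chain.add(url -> url.contains("/docs/") ? url : null); // stand-in for the slow sub-site filter
        return chain;
    }
}
```

In this sketch the slower sub-site filter only ever sees URLs from sites without the SHORTCIRCUIT: marker, which is where the speedup would come from.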
As this is my first contribution to Nutch I'm sure that there are things I've
missed, whether in coding style or desired patch format. I welcome any
feedback, suggestions, etc.
Doug

[jira] [Updated] (NUTCH-945) Indexing to multiple SOLR Servers

2013-05-22 Thread Sebastian Nagel (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-945:
--

Fix Version/s: 1.8

 Indexing to multiple SOLR Servers
 -

 Key: NUTCH-945
 URL: https://issues.apache.org/jira/browse/NUTCH-945
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Affects Versions: 1.2
Reporter: Charan Malemarpuram
 Fix For: 2.3, 1.8

 Attachments: MurmurHashPartitioner.java, 
 NonPartitioningPartitioner.java, patch-NUTCH-945.txt


 It would be nice to have a default Indexer in Nutch which can submit docs to 
 multiple SOLR Servers.
  Partitioning is always the question when writing to multiple SOLR Servers.
  Default partitioning can be a simple hashcode based distribution with 
  additional hooks for customization.
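
 A simple hashcode based distribution could look like the sketch below. It is an illustration only, not one of the attached partitioners: it uses String.hashCode rather than MurmurHash, and the class name, method name and the use of a document id as the partitioning key are assumptions.

```java
// Hypothetical sketch; not MurmurHashPartitioner.java from the attachments.
final class SolrPartitionSketch {

    /** Maps a document id to a server index in [0, numServers). */
    static int partition(String docId, int numServers) {
        if (numServers <= 0) {
            throw new IllegalArgumentException("numServers must be positive");
        }
        // floorMod keeps the result non-negative even for negative hash codes
        return Math.floorMod(docId.hashCode(), numServers);
    }
}
```

 The same id always maps to the same server as long as the number of servers stays fixed; changing that number moves most documents, which is one reason a customization hook would be useful.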
  


[jira] [Updated] (NUTCH-1531) URL filtering takes long time for very long URLs

2013-05-22 Thread Sebastian Nagel (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-1531:
---

Fix Version/s: 1.8

 URL filtering takes long time for very long URLs
 

 Key: NUTCH-1531
 URL: https://issues.apache.org/jira/browse/NUTCH-1531
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.6, 2.1, 1.7, 2.2
Reporter: Fırat KÜÇÜK
Priority: Minor
 Fix For: 2.3, 1.8

 Attachments: max_url_length.diff, test_case.txt


 Filtering very long URLs (such as those from base64 image generators) takes a 
 long time (hours). In the reduce phase it locks up the whole system for hours. 
 Therefore some URL length limit is needed. We attached a small patch for this 
 improvement.
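
 A length limit of this kind can be sketched as below. This is an illustration and not the attached max_url_length.diff: the class name, the method signature and the convention that a negative limit disables the check are assumptions.

```java
// Hypothetical sketch; not the code from max_url_length.diff.
final class MaxUrlLengthSketch {

    /** Returns the URL if it is short enough, or null to reject it before the costly filters run. */
    static String filter(String url, int maxLength) {
        if (url == null) {
            return null;
        }
        if (maxLength >= 0 && url.length() > maxLength) {
            return null; // too long: skip regex filtering entirely
        }
        return url;
    }
}
```

 Checking the length first is cheap, so it would normally sit in front of the regex filters that are slow on very long input.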


[jira] [Updated] (NUTCH-351) Protocol forward proxy

2013-05-22 Thread Sebastian Nagel (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-351:
--

Fix Version/s: 1.8

 Protocol forward proxy
 --

 Key: NUTCH-351
 URL: https://issues.apache.org/jira/browse/NUTCH-351
 Project: Nutch
  Issue Type: New Feature
  Components: fetcher
Affects Versions: 0.8, 0.8.1, 0.9.0
Reporter: Sami Siren
Assignee: Sami Siren
Priority: Minor
 Fix For: 2.3, 1.8

 Attachments: protocol-http-proxy-adapter.txt


 The protocol proxy adapter takes advantage of the protocols known to an http 
 forward proxy. Usually there are at least http, https and ftp.
 You must configure nutch to use this plugin and to use an http proxy before use.


[jira] [Updated] (NUTCH-490) Extension point with filters for Neko HTML parser (with patch)

2013-05-22 Thread Sebastian Nagel (JIRA)

[
https://issues.apache.org/jira/browse/NUTCH-490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Sebastian Nagel updated NUTCH-490:
--

Fix Version/s: 1.8

Extension point with filters for Neko HTML parser (with patch)
--

Key: NUTCH-490
URL: https://issues.apache.org/jira/browse/NUTCH-490
Project: Nutch
Issue Type: Improvement
Components: fetcher
Affects Versions: 0.9.0
Environment: Any
Reporter: Marcin Okraszewski
Priority: Minor
Fix For: 2.3, 1.8

Attachments: HtmlParser.java.diff, NekoFilters_for_1.0.patch,
nutch-extensionpoins_plugin.xml.diff

In my project I need to set filters for the Neko HTML parser. So instead of
hard coding them, I made an extension point to define filters for Neko. I
was following the code for HtmlParser filters. In fact, I think the method to
get filters could be generalized to handle both cases, but I didn't want
to make too big a mess.
The attached patch is for Nutch 0.9. This part of the code wasn't changed in
trunk, so it should be easily applicable.
BTW, I wonder if it wouldn't be best to have HTML DOM parsing defined by an
extension point itself. Now there are options for Neko and TagSoup. But if
someone would like to use something else or set different settings for
the parser, he would need to modify the HtmlParser class instead of replacing a
plugin.

fix version 1.7 removed in Jira

2013-05-22 Thread Sebastian Nagel

Hi,

please take care not to remove the fix version
when applying bulk changes, e.g., 2.2 => 2.3
Alternative fix versions (1.7) are not kept.

Luckily Jira is quite powerful, I restored the 1.x
fix version using this awful filter:
  project = NUTCH AND fixVersion in (2.3)
  AND status = Open AND updated >= 2013-05-21
  AND affectedVersion in (0.8, 0.9, 0.9.0, 1.0, 1.1, 1.2,
1.3, 1.4, 1.6, 1.7)
  ORDER BY issuetype ASC, priority DESC

Cheers,
Sebastian

[jira] [Updated] (NUTCH-1483) Can't crawl filesystem with protocol-file plugin

2013-05-22 Thread Sebastian Nagel (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-1483:
---

Priority: Critical  (was: Major)

 Can't crawl filesystem with protocol-file plugin
 

 Key: NUTCH-1483
 URL: https://issues.apache.org/jira/browse/NUTCH-1483
 Project: Nutch
  Issue Type: Bug
  Components: protocol
Affects Versions: 1.6, 2.1
 Environment: OpenSUSE 12.1, OpenJDK 1.6.0, HBase 0.90.4
Reporter: Rogério Pereira Araújo
Priority: Critical
 Fix For: 2.3

 Attachments: NUTCH-1483.patch


 I tried to follow the same steps described in this wiki page:
 http://wiki.apache.org/nutch/IntranetDocumentSearch
 I made all required changes on regex-urlfilter.txt and added the following 
 entry in my seed file:
 file:///home/rogerio/Documents/
 The permissions are OK: I'm running Nutch as the same user that owns the 
 folder, so Nutch has all the required permissions. Unfortunately, I'm getting 
 the following error:
 org.apache.nutch.protocol.file.FileError: File Error: 404
 at 
 org.apache.nutch.protocol.file.File.getProtocolOutput(File.java:105)
 at 
 org.apache.nutch.fetcher.FetcherReducer$FetcherThread.run(FetcherReducer.java:514)
 fetch of file://home/rogerio/Documents/ failed with: 
 org.apache.nutch.protocol.file.FileError: File Error: 404
 Why are the logs showing file://home/rogerio/Documents/ instead of 
 file:///home/rogerio/Documents/?
 Note: The regex-urlfilter entry only works as expected if I add the entry 
 +^file://home/rogerio/Documents/ instead of +^file:///home/rogerio/Documents/ 
 as the wiki says.
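
 A minimal sketch of why the number of slashes matters, using only the
 standard java.net.URI class (the class name FileUrlParts and its helper
 methods are hypothetical, for illustration only). With three slashes the
 authority is empty and the whole directory is the path; with two slashes the
 first path segment is parsed as the host. This does not show where Nutch
 drops the slash; it only shows why the two-slash form cannot resolve to the
 intended directory.

```java
import java.net.URI;

public class FileUrlParts {
    // Returns the host component, or "" when the URL has no authority.
    static String host(String url) {
        String h = URI.create(url).getHost();
        return h == null ? "" : h;
    }

    // Returns the path component of the URL.
    static String path(String url) {
        return URI.create(url).getPath();
    }

    public static void main(String[] args) {
        String[] urls = {
            "file:///home/rogerio/Documents/", // form used in the seed file
            "file://home/rogerio/Documents/"   // form reported in the fetcher log
        };
        for (String u : urls) {
            System.out.println(u + " -> host='" + host(u) + "', path='" + path(u) + "'");
        }
    }
}
```

 For the two-slash form, the host is "home" and the path is
 "/rogerio/Documents/", which would be consistent with the 404 in the report.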

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (NUTCH-1483) Can't crawl filesystem with protocol-file plugin

2013-05-22 Thread Sebastian Nagel (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-1483:
---

Fix Version/s: 1.7

 Can't crawl filesystem with protocol-file plugin
 

 Key: NUTCH-1483
 URL: https://issues.apache.org/jira/browse/NUTCH-1483
 Project: Nutch
  Issue Type: Bug
  Components: protocol
Affects Versions: 1.6, 2.1
 Environment: OpenSUSE 12.1, OpenJDK 1.6.0, HBase 0.90.4
Reporter: Rogério Pereira Araújo
Priority: Critical
 Fix For: 1.7, 2.3

 Attachments: NUTCH-1483.patch


 I tried to follow the same steps described in this wiki page:
 http://wiki.apache.org/nutch/IntranetDocumentSearch
 I made all required changes on regex-urlfilter.txt and added the following 
 entry in my seed file:
 file:///home/rogerio/Documents/
 The permissions are OK: I'm running Nutch as the same user that owns the 
 folder, so Nutch has all the required permissions. Unfortunately, I'm getting 
 the following error:
 org.apache.nutch.protocol.file.FileError: File Error: 404
 at 
 org.apache.nutch.protocol.file.File.getProtocolOutput(File.java:105)
 at 
 org.apache.nutch.fetcher.FetcherReducer$FetcherThread.run(FetcherReducer.java:514)
 fetch of file://home/rogerio/Documents/ failed with: 
 org.apache.nutch.protocol.file.FileError: File Error: 404
 Why are the logs showing file://home/rogerio/Documents/ instead of 
 file:///home/rogerio/Documents/?
 Note: The regex-urlfilter entry only works as expected if I add the entry 
 +^file://home/rogerio/Documents/ instead of +^file:///home/rogerio/Documents/ 
 as the wiki says.


[jira] [Resolved] (NUTCH-1249) Resolve all issues flagged up by adding javac -Xlint arguement

2013-05-22 Thread Tejas Patil (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tejas Patil resolved NUTCH-1249.


   Resolution: Fixed
Fix Version/s: 2.2
 Assignee: Tejas Patil  (was: Lewis John McGibbney)

Ported the patch for trunk to 2.x. All the tests are passing (verified on Java 
1.7.0_10 and 1.6.0_38). Committed to svn at rev 1485125.

 Resolve all issues flagged up by adding javac -Xlint arguement
 --

 Key: NUTCH-1249
 URL: https://issues.apache.org/jira/browse/NUTCH-1249
 Project: Nutch
  Issue Type: Improvement
  Components: build
Affects Versions: nutchgora
Reporter: Lewis John McGibbney
Assignee: Tejas Patil
Priority: Minor
 Fix For: 1.7, 2.2

 Attachments: NUTCH-1249.trunk.patch


 There are a heap of issues flagged up by NUTCH-1237, I think over time it 
 would be great to get these addressed and resolved.
 What is interesting is that adding the same arguments to 
 /src/plugin/plugin-build.xml actually breaks my build as tests begin to fail.
 Some of this stuff is documented in the link below
 http://docs.oracle.com/javase/1.5.0/docs/tooldocs/windows/javac.html#options


[jira] [Resolved] (NUTCH-1275) Fix [unchecked] javac warnings

2013-05-22 Thread Tejas Patil (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tejas Patil resolved NUTCH-1275.


   Resolution: Fixed
Fix Version/s: 2.2

Got resolved with NUTCH-1249

 Fix [unchecked] javac warnings
 --

 Key: NUTCH-1275
 URL: https://issues.apache.org/jira/browse/NUTCH-1275
 Project: Nutch
  Issue Type: Sub-task
  Components: build
Affects Versions: nutchgora, 1.5
Reporter: Lewis John McGibbney
Assignee: Tejas Patil
Priority: Minor
 Fix For: 1.7, 2.2


 We can simply suppress these warnings using  
 {code}
 @SuppressWarnings("unchecked")
 {code}
 However, if there is another method for resolving these warnings, it 
 should be implemented if deemed beneficial to code quality.
 Some resources:
 http://java.sun.com/docs/books/jls/third_edition/html/conversions.html#190772
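
 The two options mentioned above can be sketched as follows. This is an
 illustration only, not code from the committed patch; the class
 UncheckedDemo and both methods are hypothetical. The first method silences
 the warning on the narrowest possible scope, the second removes its cause
 by checking every element.

```java
import java.util.ArrayList;
import java.util.List;

public class UncheckedDemo {
    // Option 1: keep the raw-typed legacy input and suppress the warning
    // on a single method rather than on the whole class.
    @SuppressWarnings({"unchecked", "rawtypes"})
    static List<String> fromLegacy(List raw) {
        return (List<String>) raw; // unchecked cast, silenced by the annotation
    }

    // Option 2: avoid the unchecked cast by verifying each element.
    static List<String> checkedCopy(List<?> raw) {
        List<String> out = new ArrayList<String>();
        for (Object o : raw) {
            out.add(String.class.cast(o)); // ClassCastException on a non-String
        }
        return out;
    }
}
```

 Option 1 defers any type error to the point where an element is read;
 option 2 fails immediately on the first element of the wrong type.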


[jira] [Commented] (NUTCH-1249) Resolve all issues flagged up by adding javac -Xlint arguement

2013-05-22 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13663973#comment-13663973
 ] 

Hudson commented on NUTCH-1249:
---

Integrated in Nutch-nutchgora #614 (See 
[https://builds.apache.org/job/Nutch-nutchgora/614/])
NUTCH-1249 and NUTCH-1275 : Resolve all issues flagged up by adding javac 
-Xlint argument (Revision 1485125)

 Result = FAILURE
tejasp : http://svn.apache.org/viewvc/nutch/branches/2.x/?view=rev&rev=1485125
Files : 
* /nutch/branches/2.x/CHANGES.txt
* /nutch/branches/2.x/build.xml
* /nutch/branches/2.x/src/java/org/apache/nutch/api/ConfResource.java
* /nutch/branches/2.x/src/java/org/apache/nutch/api/DbReader.java
* /nutch/branches/2.x/src/java/org/apache/nutch/api/JobResource.java
* /nutch/branches/2.x/src/java/org/apache/nutch/api/impl/RAMJobManager.java
* /nutch/branches/2.x/src/java/org/apache/nutch/crawl/NutchWritable.java
* /nutch/branches/2.x/src/java/org/apache/nutch/metadata/Metadata.java
* 
/nutch/branches/2.x/src/java/org/apache/nutch/metadata/SpellCheckedMetadata.java
* /nutch/branches/2.x/src/java/org/apache/nutch/net/URLNormalizers.java
* /nutch/branches/2.x/src/java/org/apache/nutch/plugin/Extension.java
* /nutch/branches/2.x/src/java/org/apache/nutch/plugin/PluginDescriptor.java
* /nutch/branches/2.x/src/java/org/apache/nutch/plugin/PluginRepository.java
* /nutch/branches/2.x/src/java/org/apache/nutch/storage/Host.java
* /nutch/branches/2.x/src/java/org/apache/nutch/storage/ParseStatus.java
* /nutch/branches/2.x/src/java/org/apache/nutch/storage/ProtocolStatus.java
* /nutch/branches/2.x/src/java/org/apache/nutch/storage/WebPage.java
* /nutch/branches/2.x/src/java/org/apache/nutch/tools/ResolveUrls.java
* 
/nutch/branches/2.x/src/java/org/apache/nutch/util/GenericWritableConfigurable.java
* /nutch/branches/2.x/src/java/org/apache/nutch/util/PrefixStringMatcher.java
* /nutch/branches/2.x/src/java/org/apache/nutch/util/SuffixStringMatcher.java
* /nutch/branches/2.x/src/java/org/apache/nutch/util/ToolUtil.java
* 
/nutch/branches/2.x/src/plugin/lib-regex-filter/src/java/org/apache/nutch/urlfilter/api/RegexURLFilterBase.java
* 
/nutch/branches/2.x/src/plugin/lib-regex-filter/src/test/org/apache/nutch/urlfilter/api/RegexURLFilterBaseTest.java
* 
/nutch/branches/2.x/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMBuilder.java
* 
/nutch/branches/2.x/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java
* 
/nutch/branches/2.x/src/plugin/protocol-ftp/src/java/org/apache/nutch/protocol/ftp/Client.java
* 
/nutch/branches/2.x/src/plugin/protocol-ftp/src/java/org/apache/nutch/protocol/ftp/FtpResponse.java
* 
/nutch/branches/2.x/src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/DummyX509TrustManager.java
* 
/nutch/branches/2.x/src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/HttpAuthenticationException.java
* 
/nutch/branches/2.x/src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/HttpAuthenticationFactory.java
* 
/nutch/branches/2.x/src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/HttpBasicAuthentication.java
* 
/nutch/branches/2.x/src/plugin/protocol-sftp/src/java/org/apache/nutch/protocol/sftp/Sftp.java
* 
/nutch/branches/2.x/src/plugin/subcollection/src/java/org/apache/nutch/collection/CollectionManager.java
* 
/nutch/branches/2.x/src/plugin/subcollection/src/java/org/apache/nutch/collection/Subcollection.java
* 
/nutch/branches/2.x/src/plugin/urlfilter-prefix/src/java/org/apache/nutch/urlfilter/prefix/PrefixURLFilter.java
* 
/nutch/branches/2.x/src/plugin/urlfilter-suffix/src/java/org/apache/nutch/urlfilter/suffix/SuffixURLFilter.java
* 
/nutch/branches/2.x/src/plugin/urlnormalizer-regex/src/test/org/apache/nutch/net/urlnormalizer/regex/TestRegexURLNormalizer.java


 Resolve all issues flagged up by adding javac -Xlint arguement
 --

 Key: NUTCH-1249
 URL: https://issues.apache.org/jira/browse/NUTCH-1249
 Project: Nutch
  Issue Type: Improvement
  Components: build
Affects Versions: nutchgora
Reporter: Lewis John McGibbney
Assignee: Tejas Patil
Priority: Minor
 Fix For: 1.7, 2.2

 Attachments: NUTCH-1249.trunk.patch


 There are a heap of issues flagged up by NUTCH-1237, I think over time it 
 would be great to get these addressed and resolved.
 What is interesting is that adding the same arguments to 
 /src/plugin/plugin-build.xml actually breaks my build as tests begin to fail.
 Some of this stuff is documented in the link below
 http://docs.oracle.com/javase/1.5.0/docs/tooldocs/windows/javac.html#options


[jira] [Commented] (NUTCH-1569) Upgrade 2.x to Gora 0.3

2013-05-22 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13663974#comment-13663974
 ] 

Hudson commented on NUTCH-1569:
---

Integrated in Nutch-nutchgora #614 (See 
[https://builds.apache.org/job/Nutch-nutchgora/614/])
NUTCH-1569 Upgrade 2.x to Gora 0.3 (Revision 1485044)

 Result = FAILURE
lewismc : http://svn.apache.org/viewvc/nutch/branches/2.x/?view=rev&rev=1485044
Files : 
* /nutch/branches/2.x/CHANGES.txt
* /nutch/branches/2.x/build.xml
* /nutch/branches/2.x/ivy/ivy.xml
* /nutch/branches/2.x/src/java/org/apache/nutch/api/DbReader.java
* /nutch/branches/2.x/src/java/org/apache/nutch/crawl/WebTableReader.java
* /nutch/branches/2.x/src/java/org/apache/nutch/host/HostDb.java
* /nutch/branches/2.x/src/java/org/apache/nutch/host/HostDbReader.java
* /nutch/branches/2.x/src/test/org/apache/nutch/crawl/TestGenerator.java
* /nutch/branches/2.x/src/test/org/apache/nutch/crawl/TestInjector.java
* /nutch/branches/2.x/src/test/org/apache/nutch/fetcher/TestFetcher.java
* /nutch/branches/2.x/src/test/org/apache/nutch/storage/TestGoraStorage.java
* /nutch/branches/2.x/src/test/org/apache/nutch/util/AbstractNutchTest.java
* /nutch/branches/2.x/src/test/org/apache/nutch/util/CrawlTestUtil.java


 Upgrade 2.x to Gora 0.3
 ---

 Key: NUTCH-1569
 URL: https://issues.apache.org/jira/browse/NUTCH-1569
 Project: Nutch
  Issue Type: Improvement
  Components: build, storage
Affects Versions: 2.2
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
 Fix For: 2.2

 Attachments: NUTCH-1569.patch, NUTCH-1569.v2.patch


 We just released the Maven artifacts and I would like to upgrade before we 
 push the RC for 2.2 :)
 Patch coming up


Build failed in Jenkins: Nutch-nutchgora #614

2013-05-22 Thread Apache Jenkins Server

See https://builds.apache.org/job/Nutch-nutchgora/614/changes

Changes:

[tejasp] NUTCH-1249 and NUTCH-1275 : Resolve all issues flagged up by adding 
javac -Xlint argument

[lewismc] NUTCH-1569 Upgrade 2.x to Gora 0.3

--
[...truncated 1674 lines...]
[javac] import org.apache.gora.mapreduce.GoraMapper;
[javac] ^
[javac] 
/x1/jenkins/jenkins-slave/workspace/Nutch-nutchgora/2.x/src/java/org/apache/nutch/crawl/GeneratorMapper.java:36:
 error: cannot find symbol
[javac] extends GoraMapper<String, WebPage, SelectorEntry, WebPage> {
[javac] ^
[javac]   symbol: class GoraMapper
[javac] 
/x1/jenkins/jenkins-slave/workspace/Nutch-nutchgora/2.x/src/java/org/apache/nutch/crawl/GeneratorMapper.java:50:
 error: cannot find symbol
[javac]   Context context) throws IOException, InterruptedException {
[javac]   ^
[javac]   symbol:   class Context
[javac]   location: class GeneratorMapper
[javac] 
/x1/jenkins/jenkins-slave/workspace/Nutch-nutchgora/2.x/src/java/org/apache/nutch/crawl/GeneratorMapper.java:109:
 error: cannot find symbol
[javac]   public void setup(Context context) {
[javac] ^
[javac]   symbol:   class Context
[javac]   location: class GeneratorMapper
[javac] 
/x1/jenkins/jenkins-slave/workspace/Nutch-nutchgora/2.x/src/java/org/apache/nutch/fetcher/FetcherJob.java:48:
 error: package org.apache.gora.mapreduce does not exist
[javac] import org.apache.gora.mapreduce.GoraMapper;
[javac] ^
[javac] 
/x1/jenkins/jenkins-slave/workspace/Nutch-nutchgora/2.x/src/java/org/apache/nutch/crawl/GeneratorReducer.java:32:
 error: package org.apache.gora.mapreduce does not exist
[javac] import org.apache.gora.mapreduce.GoraReducer;
[javac] ^
[javac] 
/x1/jenkins/jenkins-slave/workspace/Nutch-nutchgora/2.x/src/java/org/apache/nutch/crawl/GeneratorReducer.java:41:
 error: cannot find symbol
[javac] extends GoraReducer<SelectorEntry, WebPage, String, WebPage> {
[javac] ^
[javac]   symbol: class GoraReducer
[javac] 
/x1/jenkins/jenkins-slave/workspace/Nutch-nutchgora/2.x/src/java/org/apache/nutch/crawl/GeneratorReducer.java:52:
 error: cannot find symbol
[javac]   Context context) throws IOException, InterruptedException {
[javac]   ^
[javac]   symbol:   class Context
[javac]   location: class GeneratorReducer
[javac] 
/x1/jenkins/jenkins-slave/workspace/Nutch-nutchgora/2.x/src/java/org/apache/nutch/crawl/GeneratorReducer.java:90:
 error: cannot find symbol
[javac]   protected void setup(Context context)
[javac]^
[javac]   symbol:   class Context
[javac]   location: class GeneratorReducer
[javac] 
/x1/jenkins/jenkins-slave/workspace/Nutch-nutchgora/2.x/src/java/org/apache/nutch/crawl/InjectorJob.java:29:
 error: package org.apache.gora.mapreduce does not exist
[javac] import org.apache.gora.mapreduce.GoraOutputFormat;
[javac] ^
[javac] 
/x1/jenkins/jenkins-slave/workspace/Nutch-nutchgora/2.x/src/java/org/apache/nutch/crawl/InjectorJob.java:30:
 error: package org.apache.gora.persistency does not exist
[javac] import org.apache.gora.persistency.Persistent;
[javac]   ^
[javac] 
/x1/jenkins/jenkins-slave/workspace/Nutch-nutchgora/2.x/src/java/org/apache/nutch/crawl/InjectorJob.java:31:
 error: package org.apache.gora.store does not exist
[javac] import org.apache.gora.store.DataStore;
[javac] ^
[javac] 
/x1/jenkins/jenkins-slave/workspace/Nutch-nutchgora/2.x/src/java/org/apache/nutch/fetcher/FetchEntry.java:28:
 error: package org.apache.gora.util does not exist
[javac] import org.apache.gora.util.IOUtils;
[javac]^
[javac] 
/x1/jenkins/jenkins-slave/workspace/Nutch-nutchgora/2.x/src/java/org/apache/nutch/crawl/WebTableReader.java:57:
 error: package org.apache.gora.mapreduce does not exist
[javac] import org.apache.gora.mapreduce.GoraMapper;
[javac] ^
[javac] 
/x1/jenkins/jenkins-slave/workspace/Nutch-nutchgora/2.x/src/java/org/apache/nutch/crawl/WebTableReader.java:58:
 error: package org.apache.gora.query does not exist
[javac] import org.apache.gora.query.Query;
[javac] ^
[javac] 
/x1/jenkins/jenkins-slave/workspace/Nutch-nutchgora/2.x/src/java/org/apache/nutch/crawl/WebTableReader.java:59:
 error: package org.apache.gora.query does not exist
[javac] import org.apache.gora.query.Result;
[javac] ^
[javac] 
/x1/jenkins/jenkins-slave/workspace/Nutch-nutchgora/2.x/src/java/org/apache/nutch/crawl/WebTableReader.java:60:
 error: package org.apache.gora.store does not exist
[javac] import

[jira] [Created] (NUTCH-1575) support solr authentication in nutch 2.x

2013-05-22 Thread lufeng (JIRA)

lufeng created NUTCH-1575:
-

 Summary: support solr authentication in nutch 2.x
 Key: NUTCH-1575
 URL: https://issues.apache.org/jira/browse/NUTCH-1575
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Affects Versions: 2.1
Reporter: lufeng
Assignee: lufeng
Priority: Minor
 Fix For: 2.2


Support Solr authentication in Nutch 2.x, as in 1.x.


[jira] [Work started] (NUTCH-1575) support solr authentication in nutch 2.x

2013-05-22 Thread lufeng (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-1575 started by lufeng.

 support solr authentication in nutch 2.x
 

 Key: NUTCH-1575
 URL: https://issues.apache.org/jira/browse/NUTCH-1575
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Affects Versions: 2.1
Reporter: lufeng
Assignee: lufeng
Priority: Minor
 Fix For: 2.2


 Support Solr authentication in Nutch 2.x, as in 1.x.


[jira] [Updated] (NUTCH-1575) support solr authentication in nutch 2.x

2013-05-22 Thread lufeng (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lufeng updated NUTCH-1575:
--

Attachment: NUTCH-1575.patch

add solr authentication

 support solr authentication in nutch 2.x
 

 Key: NUTCH-1575
 URL: https://issues.apache.org/jira/browse/NUTCH-1575
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Affects Versions: 2.1
Reporter: lufeng
Assignee: lufeng
Priority: Minor
 Fix For: 2.2

 Attachments: NUTCH-1575.patch


 Support Solr authentication in Nutch 2.x, as in 1.x.


[jira] [Commented] (NUTCH-1563) FetchSchedule#getFields is never used by GeneraterJob

2013-05-22 Thread Tejas Patil (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13664408#comment-13664408
 ] 

Tejas Patil commented on NUTCH-1563:


I think this is relevant only to 2.x, and [~amuseme.lu] has pushed the patch to 
svn. Is there any work left here?

 FetchSchedule#getFields is never used by GeneraterJob
 -

 Key: NUTCH-1563
 URL: https://issues.apache.org/jira/browse/NUTCH-1563
 Project: Nutch
  Issue Type: Bug
  Components: generator
Affects Versions: 2.1
Reporter: lufeng
Assignee: lufeng
Priority: Minor
 Fix For: 2.3

 Attachments: NUTCH-1563.patch


 The getFields method in FetchSchedule is never used, so if a user extends 
 FetchSchedule and wants to get some fields of WebPage, those fields always 
 return null.


[jira] [Updated] (NUTCH-1566) bin/nutch to allow whitespace in paths

2013-05-22 Thread Sebastian Nagel (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-1566:
---

Attachment: NUTCH-1566-v2-trunk.patch

New patch including [~tejas.patil]'s suggestions. Also removed forgotten debug 
output :)

 bin/nutch to allow whitespace in paths
 --

 Key: NUTCH-1566
 URL: https://issues.apache.org/jira/browse/NUTCH-1566
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.6, 2.1
Reporter: Sebastian Nagel
Priority: Minor
 Fix For: 2.4

 Attachments: NUTCH-1566-trunk.patch, NUTCH-1566-v2-trunk.patch


 bin/nutch and bin/crawl choke if a path contains white space, e.g., if 
 JAVA_HOME is {{C:\Program Files\jdk}}. If you don't have permission to 
 change the path, it is impossible to run Nutch. This has been reported 
 frequently 
 ([1|http://stackoverflow.com/questions/9345629/nutch-cygwin-how-to-set-java-home],
  
 [2|http://lucene.472066.n3.nabble.com/Problem-running-Nutch-on-Win-7-Cygwin-td3487163.html],
  and 
 [3|http://nutchinstall.blogspot.de/2007/07/setting-up-cygwin-and-nutch.html]),
  see also NUTCH-19.


[jira] [Reopened] (NUTCH-1569) Upgrade 2.x to Gora 0.3

2013-05-22 Thread Lewis John McGibbney (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney reopened NUTCH-1569:
-


 Upgrade 2.x to Gora 0.3
 ---

 Key: NUTCH-1569
 URL: https://issues.apache.org/jira/browse/NUTCH-1569
 Project: Nutch
  Issue Type: Improvement
  Components: build, storage
Affects Versions: 2.2
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
 Fix For: 2.2

 Attachments: NUTCH-1569.patch, NUTCH-1569.v2.patch


 We just released the Maven artifacts and I would like to upgrade before we 
 push the RC for 2.2 :)
 Patch coming up


