Re: Suitable naming for > Nutchgora branch?

2012-04-25 Thread Mattmann, Chris A (388J)
Great work Lewis, thanks!

Cheers,
Chris

On Apr 25, 2012, at 4:01 PM, Lewis John Mcgibbney wrote:

> Hi Everyone,
> 
> As you guys will have seen I've quickly polluted our dev list again 
> (sorry!!!) with set and classify for 2.1.
> 
> The open issues for 2.0 are ones which I think we could address within the 
> 2.0 release. This is merely my opinion, based upon the assertion that they 
> all contain patches which could be up for review. With the exception of 
> NUTCH-879 which is pretty alarming. I'll test shortly.
> 
> I'm now away to bed.
> 
> Best
> 
> Lewis
> 
> On Wed, Apr 25, 2012 at 3:06 PM, Mattmann, Chris A (388J) 
>  wrote:
> Hi Guys,
> 
> 


++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattm...@nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++



Re: Suitable naming for > Nutchgora branch?

2012-04-25 Thread Lewis John Mcgibbney
Hi Everyone,

As you guys will have seen I've quickly polluted our dev list again
(sorry!!!) with set and classify for 2.1.

The open issues for 2.0 are ones which I think we could address within the
2.0 release. This is merely my opinion, based upon the assertion that they
all contain patches which could be up for review. With the exception of
NUTCH-879 which is pretty alarming. I'll test shortly.

I'm now away to bed.

Best

Lewis

On Wed, Apr 25, 2012 at 3:06 PM, Mattmann, Chris A (388J) <
chris.a.mattm...@jpl.nasa.gov> wrote:

> Hi Guys,
>
>


[jira] [Updated] (NUTCH-849) different versions of the same library in nutch-2.0-dev.job and local\lib directory

2012-04-25 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-849:
---

Affects Version/s: nutchgora
   1.4
Fix Version/s: (was: nutchgora)
   2.1
   1.6

Confirmed to affect 1.X also. My gut instinct is that what Pham describes is 
down to dependencies getting dragged in. Therefore this sounds like more of an 
Ivy review and configuration task.

Set and classify

> different versions of the same library in nutch-2.0-dev.job and local\lib 
> directory 
> 
>
> Key: NUTCH-849
> URL: https://issues.apache.org/jira/browse/NUTCH-849
> Project: Nutch
>  Issue Type: Task
>Affects Versions: 1.4, nutchgora
> Environment: Window XP SP3, Cygwin
>Reporter: Pham Tuan Minh
>Priority: Minor
> Fix For: 1.6, 2.1
>
>
> Hi,
> I found that after building the runtime, nutch-2.0-dev.job and the local\lib 
> directory contain different versions of the same library:
> ant-1.7.1.jar
> ant-1.6.5.jar
> servlet-api-2.5-20081211.jar
> servlet-api-2.5-6.1.14.jar
> I suspect these libraries come from different dependency branches. Can anyone 
> help me fix it?
> Thanks,
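
For anyone wanting to check a lib directory for this kind of clash, here is a minimal sketch. It is not part of Nutch or Ivy. The class name DuplicateJarSketch and the assumption that jars are named <artifact>-<version>.jar, with the version starting at the first "-<digit>", are mine.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: groups jar file names by artifact
// and reports artifacts that appear with more than one version.
class DuplicateJarSketch {
    // Assumed naming convention: <artifact>-<version>.jar, where the version
    // begins at the first "-" that is followed by a digit.
    private static final Pattern JAR = Pattern.compile("^(.+?)-(\\d.*)\\.jar$");

    static Map<String, List<String>> duplicates(List<String> jarNames) {
        Map<String, List<String>> byArtifact = new TreeMap<>();
        for (String name : jarNames) {
            Matcher m = JAR.matcher(name);
            if (!m.matches()) {
                continue; // skip names that do not follow the assumed convention
            }
            byArtifact.computeIfAbsent(m.group(1), k -> new ArrayList<>()).add(m.group(2));
        }
        // Keep only artifacts that were seen more than once.
        byArtifact.values().removeIf(versions -> versions.size() < 2);
        return byArtifact;
    }
}
```

Fed the four jar names from the report, this should flag "ant" and "servlet-api". It only detects the clash; deciding which version wins is still a matter of the Ivy configuration.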

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (NUTCH-979) Add support for deleting Solr documents with ProtocolStatusCodes.NOTFOUND

2012-04-25 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-979:
---

Patch Info: Patch Available

> Add support for deleting Solr documents with ProtocolStatusCodes.NOTFOUND
> -
>
> Key: NUTCH-979
> URL: https://issues.apache.org/jira/browse/NUTCH-979
> Project: Nutch
>  Issue Type: New Feature
>  Components: indexer
>Affects Versions: nutchgora
>Reporter: Markus Jelsma
>Priority: Minor
> Fix For: 2.1
>
> Attachments: SolrClean.java
>
>
> When issuing recrawls it can happen that certain urls have expired (i.e. URLs 
> that don't exist anymore and return 404).
> This issue creates a new command in the indexer that scans for WebPages with 
> ProtocolStatusCodes.NOTFOUND and issues delete commands to Solr.





[jira] [Updated] (NUTCH-979) Add support for deleting Solr documents with ProtocolStatusCodes.NOTFOUND

2012-04-25 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-979:
---

Fix Version/s: (was: nutchgora)
   2.1

Some work to be done

Set and Classify

> Add support for deleting Solr documents with ProtocolStatusCodes.NOTFOUND
> -
>
> Key: NUTCH-979
> URL: https://issues.apache.org/jira/browse/NUTCH-979
> Project: Nutch
>  Issue Type: New Feature
>  Components: indexer
>Affects Versions: nutchgora
>Reporter: Markus Jelsma
>Priority: Minor
> Fix For: 2.1
>
> Attachments: SolrClean.java
>
>
> When issuing recrawls it can happen that certain urls have expired (i.e. URLs 
> that don't exist anymore and return 404).
> This issue creates a new command in the indexer that scans for WebPages with 
> ProtocolStatusCodes.NOTFOUND and issues delete commands to Solr.





[jira] [Updated] (NUTCH-797) parse-tika is not properly constructing URLs when the target begins with a "?"

2012-04-25 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-797:
---

Affects Version/s: nutchgora
Fix Version/s: (was: nutchgora)
   2.1

Set and Classify

> parse-tika is not properly constructing URLs when the target begins with a "?"
> --
>
> Key: NUTCH-797
> URL: https://issues.apache.org/jira/browse/NUTCH-797
> Project: Nutch
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.1, nutchgora
> Environment: Win 7, Java(TM) SE Runtime Environment (build 
> 1.6.0_16-b01)
> Also repro's on RHEL and java 1.4.2
>Reporter: Robert Hohman
>Assignee: Andrzej Bialecki 
>Priority: Minor
> Fix For: 2.1
>
> Attachments: NUTCH-797.patch, pureQueryUrl-2.patch, pureQueryUrl.patch
>
>
> This is my first bug and patch on nutch, so apologies if I have not provided 
> enough detail.
> In crawling the page at 
> http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0 there are 
> links in the page that look like this:
> 2 <a href="?co=0&sk=0&p=3&pi=1">3</a>
> in org.apache.nutch.parse.tika.DOMContentUtils rev 916362 (trunk), as 
> getOutlinks looks for links, it comes across this link, and constructs a new 
> url with a base URL class built from 
> "http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0", and a 
> target of "?co=0&sk=0&p=2&pi=1"
> The URL class, per RFC 3986 at 
> http://labs.apache.org/webarch/uri/rfc/rfc3986.html#relative-merge, defines 
> how to merge these two, and per the RFC, the URL class merges these to: 
> http://careers3.accenture.com/Careers/ASPX/?co=0&sk=0&p=2&pi=1
> because the RFC explicitly states that the rightmost url segment (the 
> Search.aspx in this case) should be ripped off before combining.
> While this is compliant with the RFC, it means the URLs which are created for 
> the next round of fetching are incorrect.  Modern browsers seem to handle 
> this case (I checked IE8 and Firefox 3.5), so I'm guessing this is an obscure 
> exception or handling of what is a poorly formed url on accenture's part.
> I have fixed this by modifying DOMContentUtils to look for the case where a ? 
> begins the target, and then pulling the rightmost component out of the base 
> and inserting it into the target before the ?, so the target in this example 
> becomes:
> Search.aspx?co=0&sk=0&p=2&pi=1
> The URL class then properly constructs the new url as:
> http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0&p=2&pi=1
> If it is agreed that this solution works, I believe the other html parsers in 
> nutch would need to be modified in a similar way.
> Can I get feedback on this proposed solution?  Specifically I'm worried about 
> unforeseen side effects.
> Much thanks
> Here is the patch info:
> Index: 
> src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java
> ===
> --- 
> src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java
>(revision 916362)
> +++ 
> src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java
>(working copy)
> @@ -299,6 +299,50 @@
>  return false;
>}
>
> +  private URL fixURL(URL base, String target) throws MalformedURLException
> +  {
> +   // handle params that are embedded into the base url - move them to 
> target
> +   // so URL class constructs the new url class properly
> +   if  (base.toString().indexOf(';') > 0)  
> +  return fixEmbeddedParams(base, target);
> +   
> +   // handle the case that there is a target that is a pure query.
> +   // Strictly speaking this is a violation of RFC 2396 section 5.2.2 on 
> how to assemble
> +   // URLs but I've seen this in numerous places, for example at
> +   // http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0
> +   // It has urls in the page of the form href="?co=0&sk=0&pg=1", and by 
> default
> +   // URL constructs the base+target combo as 
> +   // http://careers3.accenture.com/Careers/ASPX/?co=0&sk=0&pg=1, 
> incorrectly
> +   // dropping the Search.aspx target
> +   //
> +   // Browsers handle these just fine, they must have an exception 
> similar to this
> +   if (target.startsWith("?"))
> +   {
> +   return fixPureQueryTargets(base, target);
> +   }
> +   
> +   return new URL(base, target);
> +  }
> +  
> +  private URL fixPureQueryTargets(URL base, String target) throws 
> MalformedURLException
> +  {
> + if (!target.startsWith("?"))
> + return new URL(base, target);
> +
> + String basePath = base.getPath();
> + String
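
The quoted patch is cut off above. What follows is not the rest of that patch, only a small standalone sketch of the idea described in the report: when the target starts with "?", prepend the rightmost path component of the base before handing it to java.net.URL. The class name PureQueryFixSketch and its method names are made up for illustration.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical illustration of the approach in the report, not the actual patch.
class PureQueryFixSketch {

    // If the target is a pure query ("?a=b"), prefix it with the last path
    // segment of the base path so that segment is not dropped on resolution.
    static String fixPureQueryTarget(String basePath, String target) {
        if (!target.startsWith("?")) {
            return target;
        }
        int slash = basePath.lastIndexOf('/');
        String rightMost = slash >= 0 ? basePath.substring(slash + 1) : basePath;
        return rightMost + target;
    }

    // Resolves the (possibly rewritten) target against the base URL.
    static URL resolve(URL base, String target) throws MalformedURLException {
        return new URL(base, fixPureQueryTarget(base.getPath(), target));
    }
}
```

With the base path /Careers/ASPX/Search.aspx and the target ?co=0&sk=0&p=2&pi=1, fixPureQueryTarget returns Search.aspx?co=0&sk=0&p=2&pi=1, which is the rewritten target given in the description. A base path ending in "/" leaves the target unchanged.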

[jira] [Updated] (NUTCH-710) Support for rel="canonical" attribute

2012-04-25 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-710:
---

Fix Version/s: (was: nutchgora)
   2.1
   1.6

Set and Classify 

> Support for rel="canonical" attribute
> -
>
> Key: NUTCH-710
> URL: https://issues.apache.org/jira/browse/NUTCH-710
> Project: Nutch
>  Issue Type: New Feature
>Affects Versions: 1.1
>Reporter: Frank McCown
>Priority: Minor
> Fix For: 1.6, 2.1
>
>
> There is a new rel="canonical" attribute which is
> now being supported by Google, Yahoo, and Live:
> http://googlewebmastercentral.blogspot.com/2009/02/specify-your-canonical.html
> Adding support for this attribute value will potentially reduce the number of 
> URLs crawled and indexed and reduce duplicate page content.





[jira] [Updated] (NUTCH-1290) crawlId not supported by all Tools

2012-04-25 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1290:


Patch Info: Patch Available

> crawlId not supported by all Tools
> --
>
> Key: NUTCH-1290
> URL: https://issues.apache.org/jira/browse/NUTCH-1290
> Project: Nutch
>  Issue Type: Bug
>  Components: indexer
>Affects Versions: nutchgora
>Reporter: Mathijs Homminga
>Priority: Minor
> Fix For: nutchgora
>
> Attachments: NUTCH-1290.patch
>
>
> See also: https://issues.apache.org/jira/browse/NUTCH-907
> The StorageUtils class exposes a createDataStore method which uses the 
> default schema for a persistent class specified in the Gora configuration. 
> This method ignores Nutch's storage.schema property and the notion of a 
> crawlId.
> Two tools use this method instead of the createWebStore method (which does 
> support the storage.schema property and a crawlId):
> o.a.n.indexer.IndexerReducer (IndexerJob)
> o.a.n.util.domain.DomainStatistics
>  
> I propose that these two start using the createWebStore method and that we 
> remove the createDataStore method from StorageUtils.
> Also, these two tools should support the crawlId command line parameter.





[jira] [Updated] (NUTCH-944) Increase the number of elements to look for URLs and add the ability to specify multiple attributes by elements

2012-04-25 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-944:
---

Fix Version/s: (was: nutchgora)
   2.1
   1.6

Set and Classify

> Increase the number of elements to look for URLs and add the ability to 
> specify multiple attributes by elements
> ---
>
> Key: NUTCH-944
> URL: https://issues.apache.org/jira/browse/NUTCH-944
> Project: Nutch
>  Issue Type: Improvement
>  Components: parser
> Environment: GNU/Linux Fedora 12
>Reporter: Jean-Francois Gingras
>Priority: Minor
> Fix For: 1.6, 2.1
>
> Attachments: DOMContentUtils.java.path-1.0, 
> DOMContentUtils.java.path-1.3
>
>
> Here is a patch for DOMContentUtils.java that increases the number of elements 
> searched for URLs. It also adds the ability to specify multiple attributes per 
> element, for example:
> linkParams.put("frame", new LinkParams("frame", "longdesc,src", 0));
> linkParams.put("object", new LinkParams("object", 
> "classid,codebase,data,usemap", 0));
> linkParams.put("video", new LinkParams("video", "poster,src", 0)); // HTML 5
> I have a patch for release-1.0 and branch-1.3
> I would love to hear your comments about this.





[jira] [Updated] (NUTCH-1025) Add option not to commit to Solr

2012-04-25 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1025:


Fix Version/s: (was: nutchgora)
   2.1

Set and Classify

> Add option not to commit to Solr
> 
>
> Key: NUTCH-1025
> URL: https://issues.apache.org/jira/browse/NUTCH-1025
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Affects Versions: nutchgora
>Reporter: Markus Jelsma
>Priority: Minor
> Fix For: 2.1
>
>
> We need an option to prevent a job from sending a commit to Solr. A commit 
> can take a lot of resources (cache warming) and it's not always necessary to 
> commit after index, dedup or clean, especially if they are run immediately 
> after the other.





[jira] [Updated] (NUTCH-1285) Debian Packaging for Nutch

2012-04-25 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1285:


Fix Version/s: (was: nutchgora)
   2.1

Set and Classify

> Debian Packaging for Nutch
> --
>
> Key: NUTCH-1285
> URL: https://issues.apache.org/jira/browse/NUTCH-1285
> Project: Nutch
>  Issue Type: New Feature
>  Components: build
>Affects Versions: nutchgora, 1.5
>Reporter: Lewis John McGibbney
>Priority: Minor
> Fix For: 1.6, 2.1
>
>
> This is a utopian type issue which will not be addressed for some time due to 
> many factors outwith our control, which exist within the Debian policy 
> ecosystem. 
> I've been in touch with Ioan over @ Apache James and they have recently 
> (after a number of years) made some real progress with this. Some links are 
> below
> [0] http://svn.apache.org/repos/asf/james/app
> [1] http://svn.apache.org/viewvc/james/app/trunk/pom.xml?view=markup
> [2] https://issues.apache.org/jira/browse/JAMES-1343
> [3] http://www.mail-archive.com/server-dev@james.apache.org/
> [4] http://www.debian.org/doc/debian-policy/
> [5] http://www.debian.org/doc/manuals/maint-guide/index.en.html





[jira] [Updated] (NUTCH-978) A Plugin for extracting certain element of a web page on html page parsing.

2012-04-25 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-978:
---

Fix Version/s: (was: nutchgora)
   2.1

Set and Classify

> A Plugin for extracting certain element of a web page on html page parsing.
> ---
>
> Key: NUTCH-978
> URL: https://issues.apache.org/jira/browse/NUTCH-978
> Project: Nutch
>  Issue Type: New Feature
>  Components: parser
>Affects Versions: 1.2
> Environment: Ubuntu Linux 10.10; JDK 1.6; Netbeans 6.9
>Reporter: Ammar Shadiq
>Assignee: Chris A. Mattmann
>Priority: Minor
>  Labels: gsoc2012, mentor
> Fix For: 2.1
>
> Attachments: 
> [Nutch-GSoC-2011-Proposal]Web_Page_Scrapper_Parser_Plugin.pdf, 
> app_guardian_ivory_coast_news_exmpl.png, 
> app_screenshoot_configuration_result.png, 
> app_screenshoot_configuration_result_anchor.png, 
> app_screenshoot_source_view.png, app_screenshoot_url_regex_filter.png, 
> for_GSoc.zip, version_alpha2.zip
>
>   Original Estimate: 1,680h
>  Remaining Estimate: 1,680h
>
> Nutch uses the parse-html plugin to parse web pages. It processes the contents 
> of a web page by removing HTML tags and components such as JavaScript and CSS, 
> leaving the extracted text to be stored in the index. By default Nutch cannot 
> select particular elements of an HTML page, such as certain tags, certain 
> content, or some part of the page.
> An HTML page has a tree-like XML structure, with HTML tags as its branches and 
> text as its nodes. These branches and nodes can be extracted using XPath. 
> XPath allows us to select a particular branch or node of an XML document, and 
> can therefore be used to extract specific information and treat it 
> differently based on its content and the user's requirements. Furthermore, a 
> web domain such as a news website usually uses the same HTML structure for 
> storing information across its pages. Pages sharing that structure can be 
> parsed using the same XPath query to retrieve the same content element. All 
> of the XPath queries for selecting various content can be stored in an XPath 
> configuration file.
> Nutch is intended for a variety of web sources, and not all pages retrieved 
> from those sources have the same HTML structure, so they have to be treated 
> differently using the correct XPath configuration. The correct XPath 
> configuration can be selected automatically by matching the URL of the web 
> page against a regex of valid URL patterns for that configuration.
> This automatic mechanism allows Nutch users to process various web pages and 
> get only the information they want, making the index more accurate and its 
> content more flexible.
> The components for this idea have been tested on Nutch 1.2 for selecting 
> certain elements on various news websites for the purpose of document 
> clustering. This includes a Configuration Editor application built using the 
> NetBeans 6.9 Application Framework, though it still needs some debugging.
> http://dl.dropbox.com/u/2642087/For_GSoC/for_GSoc.zip
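
To make the proposed mechanism concrete, here is a rough sketch that uses only the JDK's XML and regex classes. It is not the plugin from the attachments. The class name XPathByUrlSketch, the rule table, and the example.com URL pattern are all made up, and it assumes the page is already well-formed XML, which real HTML usually is not.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical illustration: choose an XPath expression by URL regex,
// then evaluate it against a well-formed (X)HTML document.
class XPathByUrlSketch {
    // Made-up rule table standing in for the "XPath configuration file".
    static final Map<Pattern, String> RULES = new LinkedHashMap<>();
    static {
        RULES.put(Pattern.compile("^https?://news\\.example\\.com/.*"), "//div[@id='headline']");
        RULES.put(Pattern.compile(".*"), "//title"); // fallback rule
    }

    // Returns the XPath of the first rule whose pattern matches the URL.
    static String xpathFor(String url) {
        for (Map.Entry<Pattern, String> rule : RULES.entrySet()) {
            if (rule.getKey().matcher(url).matches()) {
                return rule.getValue();
            }
        }
        return null;
    }

    // Extracts the text of the first node selected by the matching rule.
    static String extract(String url, String xhtml) {
        String expr = xpathFor(url);
        if (expr == null) {
            return "";
        }
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            return XPathFactory.newInstance().newXPath().evaluate(expr, doc).trim();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse or evaluate", e);
        }
    }
}
```

A real plugin would need a tolerant HTML parser in front of this and would load the rules from configuration rather than a static map.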





[jira] [Updated] (NUTCH-1249) Resolve all issues flagged up by adding javac -Xlint arguement

2012-04-25 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1249:


Fix Version/s: (was: nutchgora)
   2.1

Set and Classify

> Resolve all issues flagged up by adding javac -Xlint arguement
> --
>
> Key: NUTCH-1249
> URL: https://issues.apache.org/jira/browse/NUTCH-1249
> Project: Nutch
>  Issue Type: Improvement
>  Components: build
>Affects Versions: nutchgora
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 1.6, 2.1
>
>
> There are a heap of issues flagged up by NUTCH-1237, I think over time it 
> would be great to get these addressed and resolved.
> What is interesting is that adding the same arguments to 
> /src/plugin/plugin-build.xml actually breaks my build as tests begin to fail.
> Some of this stuff is documented in the link below
> http://docs.oracle.com/javase/1.5.0/docs/tooldocs/windows/javac.html#options





[jira] [Updated] (NUTCH-841) Nutch 2.0 webapp

2012-04-25 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-841:
---

Affects Version/s: nutchgora
Fix Version/s: (was: nutchgora)
   2.1

> Nutch 2.0 webapp
> 
>
> Key: NUTCH-841
> URL: https://issues.apache.org/jira/browse/NUTCH-841
> Project: Nutch
>  Issue Type: Improvement
>  Components: web gui
>Affects Versions: nutchgora
> Environment: Nutch 2.0
>Reporter: Chris A. Mattmann
>Assignee: Chris A. Mattmann
> Fix For: 2.1
>
>
> In light of the conversation on NUTCH-837, we are removing the old Nutch 
> webapp and will replace it with a 2.0 one that works with GORA + Solr. 





[jira] [Updated] (NUTCH-875) Port Webgraph to Nutch 2.0

2012-04-25 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-875:
---

Fix Version/s: (was: nutchgora)
   2.1

> Port Webgraph to Nutch 2.0
> --
>
> Key: NUTCH-875
> URL: https://issues.apache.org/jira/browse/NUTCH-875
> Project: Nutch
>  Issue Type: New Feature
>  Components: linkdb
>Affects Versions: nutchgora
>Reporter: Julien Nioche
> Fix For: 2.1
>
>
> The webgraph has not yet been ported to the GORA-based API.





[jira] [Updated] (NUTCH-864) Fetcher generates entries with status 0

2012-04-25 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-864:
---

Affects Version/s: nutchgora
Fix Version/s: (was: nutchgora)
   2.1

Set and Classify

> Fetcher generates entries with status 0
> ---
>
> Key: NUTCH-864
> URL: https://issues.apache.org/jira/browse/NUTCH-864
> Project: Nutch
>  Issue Type: Bug
>  Components: fetcher
>Affects Versions: nutchgora
> Environment: Gora with SQLBackend
> URL: https://svn.apache.org/repos/asf/nutch/branches/nutchbase
> Last Changed Rev: 980748
> Last Changed Date: 2010-07-30 14:19:52 +0200 (Fri, 30 Jul 2010)
>Reporter: Julien Nioche
>Assignee: Doğacan Güney
> Fix For: 2.1
>
>
> After a round of fetching which got the following protocol status :
> 10/07/30 15:11:39 INFO mapred.JobClient: ACCESS_DENIED=2
> 10/07/30 15:11:39 INFO mapred.JobClient: SUCCESS=1177
> 10/07/30 15:11:39 INFO mapred.JobClient: GONE=3
> 10/07/30 15:11:39 INFO mapred.JobClient: TEMP_MOVED=138
> 10/07/30 15:11:39 INFO mapred.JobClient: EXCEPTION=93
> 10/07/30 15:11:39 INFO mapred.JobClient: MOVED=521
> 10/07/30 15:11:39 INFO mapred.JobClient: NOTFOUND=62
> I ran : ./nutch org.apache.nutch.crawl.WebTableReader -stats
> 10/07/30 15:12:37 INFO crawl.WebTableReader: Statistics for WebTable: 
> 10/07/30 15:12:37 INFO crawl.WebTableReader: TOTAL urls:  2690
> 10/07/30 15:12:37 INFO crawl.WebTableReader: retry 0: 2690
> 10/07/30 15:12:37 INFO crawl.WebTableReader: min score:   0.0
> 10/07/30 15:12:37 INFO crawl.WebTableReader: avg score:   0.7587361
> 10/07/30 15:12:37 INFO crawl.WebTableReader: max score:   1.0
> 10/07/30 15:12:37 INFO crawl.WebTableReader: status 0 (null): 649
> 10/07/30 15:12:37 INFO crawl.WebTableReader: status 2 (status_fetched):   1177 (SUCCESS=1177)
> 10/07/30 15:12:37 INFO crawl.WebTableReader: status 3 (status_gone):  112 
> 10/07/30 15:12:37 INFO crawl.WebTableReader: status 34 (status_retry):   93 (EXCEPTION=93)
> 10/07/30 15:12:37 INFO crawl.WebTableReader: status 4 (status_redir_temp):   138 (TEMP_MOVED=138)
> 10/07/30 15:12:37 INFO crawl.WebTableReader: status 5 (status_redir_perm):   521 (MOVED=521)
> 10/07/30 15:12:37 INFO crawl.WebTableReader: WebTable statistics: done
> 10/07/30 15:12:37 INFO crawl.WebTableReader: WebTable statistics: done
> There should not be any entries with status 0 (null)
> I will investigate a bit more...





[jira] [Updated] (NUTCH-970) Injector job crashes with MySQL with table collation set to utf8_general_ci

2012-04-25 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-970:
---

Fix Version/s: (was: nutchgora)
   2.1

Set and Classify

> Injector job crashes with MySQL with table collation set to utf8_general_ci
> ---
>
> Key: NUTCH-970
> URL: https://issues.apache.org/jira/browse/NUTCH-970
> Project: Nutch
>  Issue Type: Bug
>  Components: injector
>Affects Versions: nutchgora
>Reporter: Markus Jelsma
> Fix For: 2.1
>
>
> Running the injector of trunk with an already existing database where the 
> default collation is utf8_* or ucs2_* the following GoraException is thrown:
> InjectorJob: starting
> InjectorJob: urlDir: urls
> InjectorJob: org.apache.gora.util.GoraException: java.io.IOException: 
> com.mysql.jdbc.exceptions.MySQLSyntaxErrorException: Column length too big 
> for column 'text' (max = 21845); use BLOB or TEXT instead
> at 
> org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:110)
> at 
> org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:93)
> at 
> org.apache.nutch.storage.StorageUtils.createWebStore(StorageUtils.java:43)
> at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:227)
> at org.apache.nutch.crawl.InjectorJob.inject(InjectorJob.java:252)
> at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:266)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> at org.apache.nutch.crawl.InjectorJob.main(InjectorJob.java:276)
> Caused by: java.io.IOException: 
> com.mysql.jdbc.exceptions.MySQLSyntaxErrorException: Column length too big 
> for column 'text' (max = 21845); use BLOB or TEXT instead
> at org.apache.gora.sql.store.SqlStore.createSchema(SqlStore.java:226)
> at org.apache.gora.sql.store.SqlStore.initialize(SqlStore.java:172)
> at 
> org.apache.gora.store.DataStoreFactory.initializeDataStore(DataStoreFactory.java:81)
> at 
> org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:104)
> ... 7 more
> Caused by: com.mysql.jdbc.exceptions.MySQLSyntaxErrorException: Column length 
> too big for column 'text' (max = 21845); use BLOB or TEXT instead
> at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:936)
> at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:2985)
> at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:1631)
> at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:1723)
> at com.mysql.jdbc.Connection.execSQL(Connection.java:3283)
> at 
> com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:1332)
> at 
> com.mysql.jdbc.PreparedStatement.executeUpdate(PreparedStatement.java:1604)
> at 
> com.mysql.jdbc.PreparedStatement.executeUpdate(PreparedStatement.java:1519)
> at 
> com.mysql.jdbc.PreparedStatement.executeUpdate(PreparedStatement.java:1504)
> at org.apache.gora.sql.store.SqlStore.createSchema(SqlStore.java:224)
> ... 10 more
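
A note on the "max = 21845" in the trace: it matches a 65,535-byte ceiling divided by 3 bytes per character, which is the most MySQL's utf8 charset uses. That reading is my assumption, not something stated in the report. The class name VarcharLimitSketch and the constant are made up for the arithmetic.

```java
// Hypothetical arithmetic only: how many characters fit in a byte-limited
// column for a given maximum bytes-per-character of the charset.
class VarcharLimitSketch {
    // Assumed byte ceiling for a VARCHAR column / row in MySQL.
    static final int MAX_BYTES = 65535;

    static int maxChars(int maxBytesPerChar) {
        return MAX_BYTES / maxBytesPerChar;
    }
}
```

maxChars(3) gives 21845, the figure in the exception, and maxChars(2) gives 32767 for a 2-byte charset such as ucs2. Any column declared longer than that would have to become TEXT or BLOB, as the error message suggests.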





[jira] [Updated] (NUTCH-1094) create comprehensive documentation for Nutchgora branch

2012-04-25 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1094:


Fix Version/s: (was: nutchgora)
   2.1

> create comprehensive documentation for Nutchgora branch
> ---
>
> Key: NUTCH-1094
> URL: https://issues.apache.org/jira/browse/NUTCH-1094
> Project: Nutch
>  Issue Type: Sub-task
>  Components: documentation
>Affects Versions: nutchgora
>Reporter: Lewis John McGibbney
> Fix For: 2.1
>
>
> This should shadow the core documentation for Nutch 1.4 (branch) and 
> mainstream users, however it should include fundamentals specific to Nutch 
> trunk. Until we release Nutch 2.0 this documentation should be stored in svn 
> under a /docs directory. 





[jira] [Updated] (NUTCH-1026) Strip UTF-8 non-character codepoints

2012-04-25 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1026:


Fix Version/s: (was: nutchgora)
   2.1

Set and Classify

> Strip UTF-8 non-character codepoints
> 
>
> Key: NUTCH-1026
> URL: https://issues.apache.org/jira/browse/NUTCH-1026
> Project: Nutch
>  Issue Type: Bug
>  Components: indexer
>Affects Versions: nutchgora
>Reporter: Markus Jelsma
> Fix For: 2.1
>
>
> During a very large crawl I found a few documents producing non-character 
> codepoints. When indexing to Solr this will yield the following exception:
> {code}
> SEVERE: java.lang.RuntimeException: [was class 
> java.io.CharConversionException] Invalid UTF-8 character 0x at char 
> #1142033, byte #1155068)
> at 
> com.ctc.wstx.util.ExceptionUtil.throwRuntimeException(ExceptionUtil.java:18)
> at 
> com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:731)
> {code}
> Quite annoying! Here's a quick fix for SolrWriter that'll pass the value of the 
> content field to a method to strip away non-characters. I'm not too sure 
> about this implementation but the tests I've done locally with a huge dataset 
> now pass correctly. Here's a list of codepoints to strip away: 
> http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:Noncharacter_Code_Point=True:]
> Please comment!
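
For readers following along, here is a minimal sketch of the kind of stripping method described above. It is not the attached patch: the class and method names (NonCharacterStripper, isNonCharacter, strip) are hypothetical, and it assumes "non-characters" means the 66 code points Unicode designates as noncharacters (U+FDD0..U+FDEF plus the last two code points of each plane).

```java
public final class NonCharacterStripper {
    private NonCharacterStripper() {}

    /** Returns true for the Unicode noncharacter code points. */
    static boolean isNonCharacter(int cp) {
        // U+FDD0..U+FDEF, plus U+xxFFFE and U+xxFFFF at the end of every plane.
        return (cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE;
    }

    /** Hypothetical helper: copies the input, skipping noncharacter code points. */
    static String strip(String input) {
        if (input == null) {
            return null;
        }
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            // Iterate by code point so supplementary characters are handled whole.
            int cp = input.codePointAt(i);
            if (!isNonCharacter(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

A writer could call something like strip(value) on the content field before the document is sent to Solr; where exactly that call belongs in SolrWriter is left to the patch under review.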





[jira] [Commented] (NUTCH-879) URL-s getting lost

2012-04-25 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13262171#comment-13262171
 ] 

Lewis John McGibbney commented on NUTCH-879:


This looks hellishly serious and pretty worrying actually. Ferdy (or anyone 
else), can you please run this against one of your HBase instances, I will do 
the same with Cassandra and we can determine what is going on here. 

> URL-s getting lost
> --
>
> Key: NUTCH-879
> URL: https://issues.apache.org/jira/browse/NUTCH-879
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: nutchgora
> Environment: * Ubuntu 10.4 x64, Sun JDK 1.6
> * using 1-node Hadoop + HDFS
> * trunk r983472, using MySQL store
> * branch-1.3
>Reporter: Andrzej Bialecki 
> Fix For: nutchgora
>
> Attachments: branch-1.3-bench.txt, trunk-bench.txt
>
>
> I ran the Benchmark using branch-1.3 and trunk (formerly nutchbase). With the 
> same Benchmark parameters and the same plugins branch-1.3 collects ~1.5mln 
> urls, while trunk collects ~20,000 urls. Clearly something is wrong.





[jira] [Updated] (NUTCH-992) SolrDedup is broken in trunk

2012-04-25 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-992:
---

Fix Version/s: (was: nutchgora)
   2.1

Set and Classify

> SolrDedup is broken in trunk
> 
>
> Key: NUTCH-992
> URL: https://issues.apache.org/jira/browse/NUTCH-992
> Project: Nutch
>  Issue Type: Bug
>  Components: indexer
>Affects Versions: nutchgora
>Reporter: Markus Jelsma
> Fix For: 2.1
>
>
> SolrDedup seems to have been broken for at least a few months, perhaps more. 
> It does fetch the documents from Solr but when processing the rows we get the 
> following exception:
> Exception in thread "main" java.lang.NullPointerException
> at 
> org.apache.hadoop.io.serializer.SerializationFactory.getSerializer(SerializationFactory.java:73)
> at 
> org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:899)
> at 
> org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:779)
> at org.apache.hadoop.mapreduce.Job.submit(Job.java:432)
> at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:447)
> at 
> org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:350)
> at 
> org.apache.nutch.indexer.solr.SolrDeleteDuplicates.run(SolrDeleteDuplicates.java:360)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> at 
> org.apache.nutch.indexer.solr.SolrDeleteDuplicates.main(SolrDeleteDuplicates.java:370)





[jira] [Updated] (NUTCH-956) solrindex issues

2012-04-25 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-956:
---

Fix Version/s: (was: nutchgora)
   2.1

Set and Classify 

More work needs to be done here. Also, unfortunately, Alexis' second patch is not 
ASF licensed!

> solrindex issues
> 
>
> Key: NUTCH-956
> URL: https://issues.apache.org/jira/browse/NUTCH-956
> Project: Nutch
>  Issue Type: Bug
>  Components: indexer
>Affects Versions: nutchgora
>Reporter: Alexis
> Fix For: 2.1
>
> Attachments: solr.patch, solr.patch2
>
>
> I ran into a few caveats with solrindex command trying to index documents.
> Please refer to 
> http://techvineyard.blogspot.com/2010/12/build-nutch-20.html#solrindex that 
> describes my tests.





[jira] [Commented] (NUTCH-902) Add all necessary files and configuration so that nutch can be used with different backends out-of-the-box

2012-04-25 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13262160#comment-13262160
 ] 

Lewis John McGibbney commented on NUTCH-902:


I made some commits on this to include the memory store, AvroStore, 
DataFileAvroStore and Accumulo properties to nutch-site and some rough 
properties to gora.properties. I'm not clued up on the Accumulo mappings and we 
have no mappings for *AvroStore implementations therefore this one really 
should stay open. This being said I do however feel that what is currently 
committed in Nutchgora is enough for anyone to work with. wdygt?

> Add all necessary files and configuration so that nutch can be used with 
> different backends out-of-the-box
> --
>
> Key: NUTCH-902
> URL: https://issues.apache.org/jira/browse/NUTCH-902
> Project: Nutch
>  Issue Type: New Feature
>  Components: documentation, storage
>Affects Versions: nutchbase
>Reporter: Enis Soztutar
>Assignee: Lewis John McGibbney
> Fix For: nutchgora
>
> Attachments: NUTCH-902-v2.patch, NUTCH-902-v3.patch, NUTCH-902.patch
>
>
> As per the discussion in the mailing list and 
> http://wiki.apache.org/nutch/GORA_HBase, it will be good to include all the 
> necessary files and configuration. I propose that we maintain configuration 
> for at least SQL, HBase and Cassandra. 
> The following changes are needed:
> conf/gora-sql-mapping.xml
> conf/gora-hbase-mapping.xml
> conf/gora-cassandra-mapping.xml
> comments on nutch-default and ivy.xml 
> Shall we also include jars from gora-hbase, gora-cassandra and their 
> dependencies ? 





[jira] [Updated] (NUTCH-840) Port tests from parse-html to parse-tika

2012-04-25 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-840:
---

Fix Version/s: (was: nutchgora)
   2.1

Set and Classify

> Port tests from parse-html to parse-tika
> 
>
> Key: NUTCH-840
> URL: https://issues.apache.org/jira/browse/NUTCH-840
> Project: Nutch
>  Issue Type: Task
>  Components: parser
>Affects Versions: 1.1
>Reporter: Julien Nioche
>Assignee: Julien Nioche
> Fix For: 2.1
>
> Attachments: NUTCH-840.patch, NUTCH-840.patch
>
>
> We don't have tests for HTML in parse-tika so I'll copy them from the old 
> parse-html plugin





[jira] [Updated] (NUTCH-842) AutoGenerate WebPage code

2012-04-25 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-842:
---

Affects Version/s: nutchgora
Fix Version/s: (was: nutchgora)
   2.1

Set and Classify

> AutoGenerate WebPage code
> -
>
> Key: NUTCH-842
> URL: https://issues.apache.org/jira/browse/NUTCH-842
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: nutchgora
>Reporter: Doğacan Güney
>Assignee: Doğacan Güney
> Fix For: 2.1
>
> Attachments: NUTCH-842.patch
>
>
> This issue will track the addition of an ant task that will automatically 
> generate o.a.n.storage.WebPage (and ProtocolStatus and ParseStatus) from 
> src/gora/webpage.avsc.





[jira] [Updated] (NUTCH-887) Delegate parsing of feeds to Tika

2012-04-25 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-887:
---

Fix Version/s: (was: nutchgora)
   2.1

Set and Classify

> Delegate parsing of feeds to Tika
> -
>
> Key: NUTCH-887
> URL: https://issues.apache.org/jira/browse/NUTCH-887
> Project: Nutch
>  Issue Type: Wish
>  Components: parser
>Affects Versions: nutchgora
>Reporter: Julien Nioche
> Fix For: 2.1
>
>
> [Starting a new thread from https://issues.apache.org/jira/browse/NUTCH-874]
> One of the plugins which hasn't been ported yet is the feed parser. We could 
> rely on the one we recently added to Tika, knowing that there is a 
> substantial difference in the sense that the Tika feed parser generates a 
> simple XHTML representation of the document where the feeds are simply 
> represented as anchors whereas the Nutch version created new documents for 
> each feed.
> There is also the parse-rss plugin in Nutch which is quite similar - what's 
> the difference with the feed one again? Since the Tika parser would handle 
> all sorts of feed formats why not simply rely on it? 
> Any thoughts on this?





[jira] [Updated] (NUTCH-1038) Port IndexingFiltersChecker to 2.0

2012-04-25 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1038:


Affects Version/s: nutchgora
Fix Version/s: (was: nutchgora)
   2.1

Set and Classify

> Port IndexingFiltersChecker to 2.0
> --
>
> Key: NUTCH-1038
> URL: https://issues.apache.org/jira/browse/NUTCH-1038
> Project: Nutch
>  Issue Type: New Feature
>Affects Versions: nutchgora
>Reporter: Markus Jelsma
> Fix For: 2.1
>
>






[jira] [Updated] (NUTCH-1283) Radically update all Solr configuration in Nutchgora

2012-04-25 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1283:


Fix Version/s: (was: nutchgora)
   2.1

Set and Classify

> Radically update all Solr configuration in Nutchgora
> 
>
> Key: NUTCH-1283
> URL: https://issues.apache.org/jira/browse/NUTCH-1283
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Affects Versions: nutchgora
>Reporter: Lewis John McGibbney
> Fix For: 2.1
>
>
> We're currently running with a Schema which states it's 1.4 :0| There should 
> be better support for the newer stuff going on over in Solrland. This issue 
> should track those improvements entirely.





[jira] [Commented] (NUTCH-1340) Increase scalability by only removing markers when they actually exist for DbUpdaterReducer

2012-04-25 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13262121#comment-13262121
 ] 

Lewis John McGibbney commented on NUTCH-1340:
-

Hi Ferdy. I am +1 for this going into 2.0. If you could do your usual and 
provide a small Javadoc comment for the new method you introduce that would be 
great. 

> Increase scalability by only removing markers when they actually exist for 
> DbUpdaterReducer
> ---
>
> Key: NUTCH-1340
> URL: https://issues.apache.org/jira/browse/NUTCH-1340
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Ferdy Galema
> Fix For: nutchgora
>
> Attachments: NUTCH-1340-v1.txt
>
>
> After applying GORA-120 (this already is a huge performance boost by itself) 
> one of the major bottlenecks of the DbUpdaterReducer is the deletion of the 
> markers. The update reducer simply sets every row to delete its markers. A 
> lot of rows do not actually have the markers but the deletes are fired away 
> in any case. Because the markers are already always on the input, a simple 
> check to see if they exist greatly improves performance.
> In particular it is very expensive in HBase, because every single Delete 
> immediately triggers a connection to the regionservers. (They ignore the 
> "autoflush=false" directive). Although deletes can be done in batch, this is 
> currently not supported by Gora. For one it is very difficult to implement in 
> the current HBaseStore with regard to multithreading, and secondly I noticed 
> performance did not increase significantly.
> By performance debugging on a real life cluster this currently seems to be 
> the biggest bottleneck of the DbUpdaterReducer. (Remember only after applying 
> GORA-120)
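
The idea in the description can be illustrated with a small sketch. This is not the code in NUTCH-1340-v1.txt: Row, deleteMarker and removeMarkerIfExists are hypothetical stand-ins rather than Nutch or Gora classes, and a plain map plays the part of the row's marker column. The point is only that the check happens on data already in memory, so no delete is issued for rows that have no marker.

```java
import java.util.HashMap;
import java.util.Map;

public final class MarkerCleanup {
    private MarkerCleanup() {}

    /** Hypothetical stand-in for a stored row; not a Nutch or Gora class. */
    static final class Row {
        final Map<String, String> markers = new HashMap<>();
        int deletesIssued = 0;

        /** Models a delete against the backing store, which may be expensive. */
        void deleteMarker(String name) {
            deletesIssued++;
            markers.remove(name);
        }
    }

    /** Issues a delete only when the marker is actually present on the row. */
    static boolean removeMarkerIfExists(Row row, String name) {
        if (!row.markers.containsKey(name)) {
            return false; // nothing to delete, so no call to the store
        }
        row.deleteMarker(name);
        return true;
    }
}
```

With an unconditional delete, every row processed by the reducer would count towards deletesIssued; with the check, only rows that carry the marker do.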





[jira] [Updated] (NUTCH-1104) Port issues from trunk NutchGora branch

2012-04-25 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1104:


Fix Version/s: (was: nutchgora)
   2.1

Set and Classify

> Port issues from trunk NutchGora branch
> ---
>
> Key: NUTCH-1104
> URL: https://issues.apache.org/jira/browse/NUTCH-1104
> Project: Nutch
>  Issue Type: Task
>Affects Versions: nutchgora
>Reporter: Markus Jelsma
> Fix For: 2.1
>
>
> Umbrella issue for tracking issues that should be ported from 1.x trunk to 
> the NutchGora branch. Please mark ported issues by modifying this description.
> NOT YET PORTED:
> * NUTCH-809 Parse-metatags plugin
> * NUTCH-987 Support HTTP auth for Solr communication
> * NUTCH-1028 Log parser keys
> * NUTCH-1036 Solr jobs should increment counters in Reporter
> * NUTCH-1057 Make fetcher thread time out configurable
> * NUTCH-1067 Configure minimum throughput for fetcher
> * NUTCH-1101 Options to purge db_gone records in updatedb
> * NUTCH-1102 Fetcher, rely on fetcher.parse directive only
> * NUTCH-1105 MaxContentLength option for index-basic
> * NUTCH-940 Static field plugin
> * NUTCH-1094 create comprehensive documentation for Nutch 2.0 trunk
> * NUTCH-1207 ParserChecker to output signature
> * NUTCH-1090 InvertLinks should inform when ignoring internal links
> * NUTCH-1174 Outlinks are not properly normalized
> * NUTCH-1203 ParseSegment to show number of milliseconds per parse
> * NUTCH-1173 DomainStats doesn't count db_not_modified
> * NUTCH-1155 Host/domain limit in generator is generate.max.count+1
> * NUTCH-1061 Migrate MoreIndexingFilter from Apache ORO to java.util.regex
> * NUTCH-1142 Normalization and filtering in WebGraph
> * NUTCH-1153 LinkRank not to log all keys and not to write Hadoop _SUCCESS 
> file
> * NUTCH-1195 Add Solr 4x (trunk) example schema
> * NUTCH-1141 Configurable Fetcher queue depth
> * NUTCH-1214 DomainStats tool should be named for what it's doing
> * NUTCH-1213 Pass additional SolrParams when indexing to Solr
> * NUTCH-1211 URLFilterChecker command line help doesn't inform user of STDIN 
> requirements
> * NUTCH-1231 Upgrade to Tika 1.0
> * NUTCH-1230 MimeType API deprecated and breaks with Tika 1.0
> * NUTCH-1235 Upgrade to new Hadoop 0.20.205.0
> * NUTCH-1184 Fetcher to parse and follow Nth degree outlinks
> PORTED:
> * No issues yet
> NOT GOING TO BE PORTED:
> * No issues, explain why it should not be ported





[jira] [Updated] (NUTCH-1277) Fix [fallthrough] javac warnings

2012-04-25 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1277:


Fix Version/s: (was: nutchgora)
   2.1

Set and Classify

> Fix [fallthrough] javac warnings
> 
>
> Key: NUTCH-1277
> URL: https://issues.apache.org/jira/browse/NUTCH-1277
> Project: Nutch
>  Issue Type: Sub-task
>  Components: build
>Affects Versions: nutchgora, 1.5
>Reporter: Lewis John McGibbney
> Fix For: 1.6, 2.1
>
>
> This usually occurs when we have an instance where a switch statement(s) fall 
> through (that is, one or more break statements are missing).
> We need to determine where a simple
> {code}
> @SuppressWarnings("fallthrough")
> {code}
> is required or whether we need to include the break statements in switch 
> blocks
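
To make the two options concrete, here is a small self-contained sketch. The class and method names are made up for illustration and do not come from the Nutch code base. The first method keeps an intentional fall-through and silences the warning that javac reports under -Xlint:fallthrough; the second adds the missing break statements instead.

```java
public final class FallthroughExample {
    private FallthroughExample() {}

    /** Fall-through is intended here, so the warning is suppressed. */
    @SuppressWarnings("fallthrough")
    static int withIntentionalFallthrough(int level) {
        int score = 0;
        switch (level) {
            case 1:
                score += 10;
                // deliberately continues into case 2
            case 2:
                score += 1;
                break;
            default:
                score = -1;
        }
        return score;
    }

    /** Fall-through was a bug here, so each case ends with a break. */
    static int withBreaks(int level) {
        int score = 0;
        switch (level) {
            case 1:
                score += 10;
                break;
            case 2:
                score += 1;
                break;
            default:
                score = -1;
        }
        return score;
    }
}
```

Which option applies has to be decided per switch statement: suppression only makes sense where the behaviour of the fall-through is actually wanted.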





[jira] [Updated] (NUTCH-1164) Write JUnit tests for protocol-http

2012-04-25 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1164:


Fix Version/s: (was: nutchgora)
   2.1

Set and Classify

> Write JUnit tests for protocol-http
> ---
>
> Key: NUTCH-1164
> URL: https://issues.apache.org/jira/browse/NUTCH-1164
> Project: Nutch
>  Issue Type: Sub-task
>Affects Versions: nutchgora
>Reporter: Lewis John McGibbney
>  Labels: test
> Fix For: 2.1
>
>
> This issue should provide a single Junit test as part of an effort to provide 
> JUnit tests for all nutchgora plugins





[jira] [Updated] (NUTCH-1168) Write JUnit tests for tld

2012-04-25 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1168:


Fix Version/s: (was: nutchgora)
   2.1

Set and Classify

> Write JUnit tests for tld
> -
>
> Key: NUTCH-1168
> URL: https://issues.apache.org/jira/browse/NUTCH-1168
> Project: Nutch
>  Issue Type: Sub-task
>Affects Versions: nutchgora
>Reporter: Lewis John McGibbney
>  Labels: test
> Fix For: 2.1
>
>
> This issue should provide a single Junit test as part of an effort to provide 
> JUnit tests for all nutchgora plugins





[jira] [Updated] (NUTCH-1166) Write JUnit tests for scoring-link

2012-04-25 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1166:


Fix Version/s: (was: nutchgora)
   2.1

Set and Classify

> Write JUnit tests for scoring-link
> --
>
> Key: NUTCH-1166
> URL: https://issues.apache.org/jira/browse/NUTCH-1166
> Project: Nutch
>  Issue Type: Sub-task
>  Components: linkdb
>Affects Versions: nutchgora
>Reporter: Lewis John McGibbney
>  Labels: test
> Fix For: 2.1
>
>
> This issue should provide a single Junit test as part of an effort to provide 
> JUnit tests for all nutchgora plugins





[jira] [Updated] (NUTCH-1169) Write JUnit tests for urlfilter-prefix

2012-04-25 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1169:


Fix Version/s: (was: nutchgora)
   2.1

Set and Classify

> Write JUnit tests for urlfilter-prefix
> --
>
> Key: NUTCH-1169
> URL: https://issues.apache.org/jira/browse/NUTCH-1169
> Project: Nutch
>  Issue Type: Sub-task
>Affects Versions: nutchgora
>Reporter: Lewis John McGibbney
>  Labels: test
> Fix For: 2.1
>
>
> This issue should provide a single Junit test as part of an effort to provide 
> JUnit tests for all nutchgora plugins





[jira] [Updated] (NUTCH-1161) Write JUnit tests for microformats-reltag plugin

2012-04-25 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1161:


Fix Version/s: (was: nutchgora)
   2.1

Set and Classify

> Write JUnit tests for microformats-reltag plugin
> 
>
> Key: NUTCH-1161
> URL: https://issues.apache.org/jira/browse/NUTCH-1161
> Project: Nutch
>  Issue Type: Sub-task
>Affects Versions: nutchgora
>Reporter: Lewis John McGibbney
>  Labels: test
> Fix For: 2.1
>
>
> This issue should provide a single Junit test as part of an effort to provide 
> JUnit tests for all nutchgora plugins





[jira] [Updated] (NUTCH-1160) Write JUnit tests for index-basic

2012-04-25 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1160:


Fix Version/s: (was: nutchgora)
   2.1

Set and Classify

> Write JUnit tests for index-basic
> -
>
> Key: NUTCH-1160
> URL: https://issues.apache.org/jira/browse/NUTCH-1160
> Project: Nutch
>  Issue Type: Sub-task
>  Components: indexer
>Affects Versions: nutchgora
>Reporter: Lewis John McGibbney
>  Labels: test
> Fix For: 2.1
>
>
> This issue should provide a single Junit test as part of an effort to provide 
> JUnit tests for all nutchgora plugins





[jira] [Updated] (NUTCH-1165) Write JUnit tests for protocol-sftp

2012-04-25 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1165:


Fix Version/s: (was: nutchgora)
   2.1

Set and Classify

> Write JUnit tests for protocol-sftp
> ---
>
> Key: NUTCH-1165
> URL: https://issues.apache.org/jira/browse/NUTCH-1165
> Project: Nutch
>  Issue Type: Sub-task
>Affects Versions: nutchgora
>Reporter: Lewis John McGibbney
>  Labels: test
> Fix For: 2.1
>
>
> This issue should provide a single Junit test as part of an effort to provide 
> JUnit tests for all nutchgora plugins





[jira] [Updated] (NUTCH-1163) Write JUnit tests for protocol-ftp

2012-04-25 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1163:


Fix Version/s: (was: nutchgora)
   2.1

Set and Classify

> Write JUnit tests for protocol-ftp
> --
>
> Key: NUTCH-1163
> URL: https://issues.apache.org/jira/browse/NUTCH-1163
> Project: Nutch
>  Issue Type: Sub-task
>Affects Versions: nutchgora
>Reporter: Lewis John McGibbney
>  Labels: test
> Fix For: 2.1
>
>
> This issue should provide a single Junit test as part of an effort to provide 
> JUnit tests for all nutchgora plugins





[jira] [Updated] (NUTCH-1158) Write JUnit tests for all nutchgora plugins

2012-04-25 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1158:


Fix Version/s: (was: nutchgora)
   2.1

Set and Classify

> Write JUnit tests for all nutchgora plugins
> ---
>
> Key: NUTCH-1158
> URL: https://issues.apache.org/jira/browse/NUTCH-1158
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: nutchgora
>Reporter: Lewis John McGibbney
>  Labels: test
> Fix For: 2.1
>
>
> This issue should act as a parent issue to track the development and gradual 
> integration and addition of JUnit tests to accompany all nutchgora plugins. 





[jira] [Updated] (NUTCH-1170) Write JUnit tests for urlfilter-validator

2012-04-25 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1170:


Fix Version/s: (was: nutchgora)
   2.1

Set and Classify

> Write JUnit tests for urlfilter-validator
> -
>
> Key: NUTCH-1170
> URL: https://issues.apache.org/jira/browse/NUTCH-1170
> Project: Nutch
>  Issue Type: Sub-task
>Affects Versions: nutchgora
>Reporter: Lewis John McGibbney
>  Labels: test
> Fix For: 2.1
>
>
> This issue should provide a single Junit test as part of an effort to provide 
> JUnit tests for all nutchgora plugins





Re: Suitable naming for > Nutchgora branch?

2012-04-25 Thread Mattmann, Chris A (388J)
Hi Guys,

Yep I think we've beat the dead horse here about the name :)

This is a good recent discussion/summary: http://s.apache.org/CoY
and I think it had some productive outcomes. I envision a world in
which we keep releasing the current 1.x series until we get up to 1.9,
and then hopefully in parallel release a set of 2.x (eventually release
2.9 if we get that far) and either 3.x is the merge of 1.x and 2.x, or 
1.x becomes 3.x and we leapfrog 2.x to 4.x, etc etc.

IOW, releasing from branches with active maintainers is absolutely
fine and encouraged within Apache. NutchGora right now has at least
Ferdy and Lewis (and you can count me in even though my support
for the moment is limited to RM'ing) so that's ~3, the trunk has Julien,
Markus, Lewis, myself and others so that's 4+ active peeps, so both
branches have plenty of people who care deeply about releasing Nutch
and kicking butt. So we're all good here.

Net: here's a productive next step for nutchgora. Let's simply release it.
There is nothing preventing us from doing that. If 3 +1s come in from
Nutch PMC members, we can release :) I'd be happy to RM it, as I 
stated in http://s.apache.org/CoY so let's move forward especially
now that there is a Gora 0.2 release (hat tip, Lewis).

Cheers,
Chris

P.S. Yes, and by the way, self-flails, let's release Nutch 1.5 and get
on with that too! *grin*

On Apr 25, 2012, at 6:22 AM, Julien Nioche wrote:

> 
> I must say that since the move of Nutchgora from trunk to branch it's kind of 
> odd that it's still referred to as 2.x. (For now that's okay I guess).
> 
> Moving it from the trunk made a lot of sense and has been abundantly 
> discussed on this list. We had one stable version which is actively 
> maintained and currently used by most people (1.x) and an experimental one 
> largely untested and used by a minority (2.x). Hopefully when nutchgora (for 
> which 2.x is a better name indeed) has had a couple of releases and is used 
> by a larger number of people it will naturally find its place as trunk but 
> for now since most releases are based on 1.x I think the latter should remain 
> the trunk
> 
> Julien
> 
> On Wed, Apr 25, 2012 at 10:46 AM, Lewis John Mcgibbney 
>  wrote:
> Good Morning,
> 
> Does anyone have a differing opinion on naming next development track for 
> Nutchgora branch 2.1?
> 
> Before I set and classify most issues it would be good to know.
> 
> Thank you
> 
> Lewis
> 
> -- 
> Lewis 
> 
> 
> 
> 
> 
> 
> -- 
> 
> Open Source Solutions for Text Engineering
> 
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble
> 


++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattm...@nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++



Re: Suitable naming for > Nutchgora branch?

2012-04-25 Thread Julien Nioche
> I must say that since the move of Nutchgora from trunk to branch it's kind
> of odd that it's still referred to as 2.x. (For now that's okay I guess).
>

Moving it from the trunk made a lot of sense and has been abundantly
discussed on this list. We had one stable version which is actively
maintained and currently used by most people (1.x) and an experimental one
largely untested and used by a minority (2.x). Hopefully when nutchgora
(for which 2.x is a better name indeed) has had a couple of releases and is
used by a larger number of people it will naturally find its place as trunk
but for now since most releases are based on 1.x I think the latter should
remain the trunk

Julien

On Wed, Apr 25, 2012 at 10:46 AM, Lewis John Mcgibbney <
lewis.mcgibb...@gmail.com> wrote:

> Good Morning,
>
> Does anyone have a differing opinion on naming next development track for
> Nutchgora branch 2.1?
>
> Before I set and classify most issues it would be good to know.
>
> Thank you
>
> Lewis
>
> --
> *Lewis*
>
>




-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble


Re: Suitable naming for > Nutchgora branch?

2012-04-25 Thread Ferdy Galema
Hi Lewis,

2.1 is fine with me. This is assuming 2.x is a good naming scheme in the
first place. I must say that since the move of Nutchgora from trunk to
branch it's kind of odd that it's still referred to as 2.x. (For now that's
okay I guess).

Ferdy

On Wed, Apr 25, 2012 at 10:46 AM, Lewis John Mcgibbney <
lewis.mcgibb...@gmail.com> wrote:

> Good Morning,
>
> Does anyone have a differing opinion on naming next development track for
> Nutchgora branch 2.1?
>
> Before I set and classify most issues it would be good to know.
>
> Thank you
>
> Lewis
>
> --
> *Lewis*
>
>


[jira] [Resolved] (NUTCH-946) cache.jsp does not recognize encoding conversion from content different to UTF-8

2012-04-25 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-946.


Resolution: Won't Fix

This issue is now deprecated and can't be fixed in current development.

> cache.jsp does not recognize encoding conversion from content different to 
> UTF-8
> 
>
> Key: NUTCH-946
> URL: https://issues.apache.org/jira/browse/NUTCH-946
> Project: Nutch
>  Issue Type: Bug
>  Components: web gui
>Affects Versions: 1.2
> Environment: Server version: Apache Tomcat/6.0.29
> Server built:   July 19 2010 1458
> Server number:  6.0.0.29
> OS Name:Linux
> OS Version: 2.6.18-128.7.1.el5
> Architecture:   i386
> JVM Version:1.6.0_22-b04
> JVM Vendor: Sun Microsystems Inc.
>Reporter: Enrique Berlanga
>Priority: Minor
> Attachments: cache-946.patch
>
>
> Cache view does not recognize encoding conversion needed to show properly 
> page content stored in a segment.
> The problem is that it searches for the "CharEncodingForConversion" meta in 
> content metadata, but it's stored in parse metadata.
> Here is the patch I've generated for the fixed version:
> ### Eclipse Workspace Patch 1.0
> #P branch-1.2
> Index: src/web/jsp/cached.jsp
> ===
> --- src/web/jsp/cached.jsp(revision 1027060)
> +++ src/web/jsp/cached.jsp(working copy)
> @@ -39,17 +39,18 @@
>  ResourceBundle.getBundle("org.nutch.jsp.cached", request.getLocale())
>  .getLocale().getLanguage();
>  
> -  Metadata metaData = bean.getParseData(details).getContentMeta();
> +  Metadata contentMetaData = bean.getParseData(details).getContentMeta();
> +  Metadata parseMetaData = bean.getParseData(details).getParseMeta();
>  
>String content = null;
> -  String contentType = (String) metaData.get(Metadata.CONTENT_TYPE);
> +  String contentType = (String) contentMetaData.get(Metadata.CONTENT_TYPE);
>if (contentType.startsWith("text/html")) {
>  // FIXME : it's better to emit the original 'byte' sequence 
>  // with 'charset' set to the value of 'CharEncoding',
>  // but I don't know how to emit 'byte sequence' in JSP.
>  // out.getOutputStream().write(bean.getContent(details)) may work, 
>  // but I'm not sure.
> -String encoding = (String) metaData.get("CharEncodingForConversion"); 
> +String encoding = (String) parseMetaData.get("CharEncodingForConversion"); 
>  if (encoding != null) {
>try {
>  content = new String(bean.getContent(details), encoding);
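
The change in the patch above comes down to reading the conversion charset from the parse metadata instead of the content metadata, then decoding the raw bytes with it. Below is a minimal standalone sketch of that lookup-and-decode step. It uses a plain `Map` in place of Nutch's `Metadata` class, and the class name `CachedContentDecoder` and the UTF-8 fallback are assumptions made for illustration; neither comes from the patch.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Map;

// Hypothetical helper, not part of Nutch: decodes cached page bytes using the
// charset name stored under "CharEncodingForConversion" in the parse metadata.
final class CachedContentDecoder {
    static final String ENCODING_KEY = "CharEncodingForConversion";

    static String decode(byte[] raw, Map<String, String> parseMeta) {
        // Assumption: fall back to UTF-8 when no usable charset is recorded.
        Charset charset = StandardCharsets.UTF_8;
        String encoding = parseMeta.get(ENCODING_KEY);
        if (encoding != null) {
            try {
                charset = Charset.forName(encoding);
            } catch (IllegalArgumentException e) {
                // Illegal or unsupported charset name: keep the fallback.
            }
        }
        return new String(raw, charset);
    }
}
```

Looking the key up in the content metadata would return `null` here, which is why the unpatched page never performed the conversion.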





Suitable naming for > Nutchgora branch?

2012-04-25 Thread Lewis John Mcgibbney
Good Morning,

Does anyone have a differing opinion on naming next development track for
Nutchgora branch 2.1?

Before I set and classify most issues it would be good to know.

Thank you

Lewis

-- 
*Lewis*


[jira] [Updated] (NUTCH-896) Gora-based tests need to have their own config files

2012-04-25 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-896:
---

Fix Version/s: (was: nutchgora)
   2.1

Set and classify

> Gora-based tests need to have their own config files 
> -
>
> Key: NUTCH-896
> URL: https://issues.apache.org/jira/browse/NUTCH-896
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: nutchgora
>Reporter: Julien Nioche
>Assignee: Julien Nioche
> Fix For: 2.1
>
>
> The tests extending AbstractNutchTest (Injector, Generator, Fetcher) have 
> hard-coded properties for GORA. It would be better to be able to rely on a 
> file gora.properties used only for the tests, just as we do with the 
> nutch-*.xml config files (see CrawlTestUtil). This way we wouldn't use the 
> configs set in the main /conf file as they could be specific to a given GORA 
> backend e.g. Mysql vs hsqldb. This would also help running the tests with a 
> non-default GORA backend. 
> We need to modify GORA and make the method DataStoreFactory.setProperties 
> public. 
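
As a rough illustration of the idea in this issue, the sketch below loads GORA settings from a test-only properties source instead of taking them from hard-coded values or the main /conf directory. The class `TestGoraProperties` is hypothetical, and the property key and store class name in the usage note are illustrative values rather than anything stated in the issue. The sketch deliberately stops short of calling into GORA, since the issue notes that `DataStoreFactory.setProperties` would first have to be made public.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical test helper: reads GORA settings only from the supplied
// test-specific source, never from the main configuration.
final class TestGoraProperties {
    static Properties load(Reader testOnlySource, String requiredKey) {
        Properties props = new Properties();
        try {
            props.load(testOnlySource);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Fail loudly rather than silently picking up some other backend.
        if (props.getProperty(requiredKey) == null) {
            throw new IllegalStateException("missing test property: " + requiredKey);
        }
        return props;
    }
}
```

A test base class could call, for example, `TestGoraProperties.load(new StringReader("gora.datastore.default=org.apache.gora.memory.store.MemStore"), "gora.datastore.default")`, or pass a reader over a `gora.properties` file kept on the test classpath.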





[jira] [Updated] (NUTCH-1162) Write JUnit tests for parse-js

2012-04-25 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1162:


Fix Version/s: (was: nutchgora)
   2.1

Set and classify

> Write JUnit tests for parse-js
> --
>
> Key: NUTCH-1162
> URL: https://issues.apache.org/jira/browse/NUTCH-1162
> Project: Nutch
>  Issue Type: Sub-task
>  Components: parser
>Affects Versions: nutchgora
>Reporter: Lewis John McGibbney
>  Labels: test
> Fix For: 2.1
>
>
> This issue should provide a single Junit test as part of an effort to provide 
> JUnit tests for all nutchgora plugins





[jira] [Updated] (NUTCH-1167) Write JUnit tests for scoring-opic

2012-04-25 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1167:


Fix Version/s: (was: nutchgora)
   2.1

Set and classify

> Write JUnit tests for scoring-opic
> --
>
> Key: NUTCH-1167
> URL: https://issues.apache.org/jira/browse/NUTCH-1167
> Project: Nutch
>  Issue Type: Sub-task
>Affects Versions: nutchgora
>Reporter: Lewis John McGibbney
>  Labels: test
> Fix For: 2.1
>
>
> This issue should provide a single Junit test as part of an effort to provide 
> JUnit tests for all nutchgora plugins





[jira] [Updated] (NUTCH-1159) Write JUnit tests for index-anchor

2012-04-25 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1159:


Fix Version/s: (was: nutchgora)
   2.1

Set and classify

> Write JUnit tests for index-anchor
> --
>
> Key: NUTCH-1159
> URL: https://issues.apache.org/jira/browse/NUTCH-1159
> Project: Nutch
>  Issue Type: Sub-task
>  Components: indexer
>Affects Versions: nutchgora
>Reporter: Lewis John McGibbney
>  Labels: test
> Fix For: 2.1
>
>
> This issue should provide a single Junit test as part of an effort to provide 
> JUnit tests for all nutchgora plugins 





[jira] [Updated] (NUTCH-874) Make sure all plugins in src/plugin are compatible with Nutch 2.0 and Gora

2012-04-25 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-874:
---

Affects Version/s: nutchgora
Fix Version/s: (was: nutchgora)
   2.1

Set and classify

> Make sure all plugins in src/plugin are compatible with Nutch 2.0 and Gora
> --
>
> Key: NUTCH-874
> URL: https://issues.apache.org/jira/browse/NUTCH-874
> Project: Nutch
>  Issue Type: Bug
>  Components: parser
>Affects Versions: nutchgora
> Environment: Nutch 2.0
>Reporter: Chris A. Mattmann
>Assignee: Chris A. Mattmann
>Priority: Critical
> Fix For: 2.1
>
>
> I just noticed while fixing NUTCH-564 that the ExtParser hasn't been brought 
> up to date with Nutch 2.0 trunk. We should review the plugins in src/plugin 
> to make sure they all work with Gora/Nutchbase now.





[jira] [Updated] (NUTCH-1081) ant tests fail

2012-04-25 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1081:


Fix Version/s: (was: nutchgora)
   2.1

Set and classify

> ant tests fail 
> ---
>
> Key: NUTCH-1081
> URL: https://issues.apache.org/jira/browse/NUTCH-1081
> Project: Nutch
>  Issue Type: Bug
>  Components: fetcher, generator, injector, storage
>Affects Versions: nutchgora
> Environment: Ubuntu release 11.04 (natty)
> Kernel Linux 2.6.38-10-generic
> GNOME 2.32.1
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Critical
> Fix For: 2.1
>
>
> The following tests fail when running ant test on trunk 2.0
> {code}
> [junit] Running org.apache.nutch.api.TestAPI
> [junit] Tests run: 4, Failures: 1, Errors: 0, Time elapsed: 11.028 sec
> [junit] Test org.apache.nutch.api.TestAPI FAILED
> [junit] Running org.apache.nutch.crawl.TestGenerator
> [junit] Tests run: 4, Failures: 0, Errors: 4, Time elapsed: 0.478 sec
> [junit] Test org.apache.nutch.crawl.TestGenerator FAILED
> [junit] Running org.apache.nutch.crawl.TestInjector
> [junit] Tests run: 1, Failures: 0, Errors: 1, Time elapsed: 0.474 sec
> [junit] Test org.apache.nutch.crawl.TestInjector FAILED
> [junit] Running org.apache.nutch.fetcher.TestFetcher
> [junit] Tests run: 2, Failures: 0, Errors: 2, Time elapsed: 0.526 sec
> [junit] Test org.apache.nutch.fetcher.TestFetcher FAILED
> [junit] Running org.apache.nutch.storage.TestGoraStorage
> [junit] Tests run: 2, Failures: 0, Errors: 2, Time elapsed: 0.468 sec
> [junit] Test org.apache.nutch.storage.TestGoraStorage FAILED
> {code}





[jira] [Updated] (NUTCH-882) Design a Host table in GORA

2012-04-25 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-882:
---

Patch Info: Patch Available

> Design a Host table in GORA
> ---
>
> Key: NUTCH-882
> URL: https://issues.apache.org/jira/browse/NUTCH-882
> Project: Nutch
>  Issue Type: New Feature
>Affects Versions: nutchgora
>Reporter: Julien Nioche
> Fix For: nutchgora
>
> Attachments: NUTCH-882-v1.patch, NUTCH-882-v3.txt, NUTCH-882-v3.txt, 
> hostdb.patch
>
>
> Having a separate GORA table for storing information about hosts (and 
> domains?) would be very useful for : 
> * customising the behaviour of the fetching on a host basis e.g. number of 
> threads, min time between threads etc...
> * storing stats
> * keeping metadata and possibly propagating it to the webpages 
> * keeping a copy of the robots.txt and possibly using that later to filter the 
> webtable
> * storing sitemap files and updating the webtable accordingly
> I'll try to come up with a GORA schema for such a host table but any comments 
> are of course already welcome 
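
To make the list above a little more concrete, here is one possible shape for a per-host record, written as a plain Java class. This is only a sketch: it is not a GORA schema, and every field and method name (`HostRecord`, `maxFetchThreads`, `minDelayMillis`, `incrementStat` and so on) is invented for illustration and does not come from the attached patches.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical per-host record mirroring the bullet points in the issue.
final class HostRecord {
    final String host;                       // record key, e.g. "example.org"
    int maxFetchThreads = 1;                 // per-host fetch customisation
    long minDelayMillis = 0L;                // minimum time between fetches
    final Map<String, Long> stats = new HashMap<>();       // counters per host
    final Map<String, String> metadata = new HashMap<>();  // may be propagated to pages
    String robotsTxt;                        // cached copy, null until fetched
    final List<String> sitemapUrls = new ArrayList<>();    // known sitemap files

    HostRecord(String host) {
        this.host = host;
    }

    // Adds one to the named counter, creating it on first use.
    void incrementStat(String name) {
        stats.merge(name, 1L, Long::sum);
    }
}
```

A real host table would most likely be generated from a schema in the same way as the webtable; the class here only shows which pieces of information would live side by side under one host key.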
