[jira] [Updated] (NUTCH-1213) Pass additional SolrParams when indexing to Solr

2011-11-25 Thread Andrzej Bialecki (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated NUTCH-1213:
-

Attachment: NUTCH-1213.diff

Patch that implements this functionality. SolrParams can be passed as a 
URL-like string, for example:
{code}
nutch solrindex http://localhost:8983/solr/collection1 db -linkdb linkdb 
-params 'update.chain=distrib&fmap.a=links' segments/2025105233
{code}
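
A minimal sketch of how such a URL-like string might be split into name/value pairs, assuming pairs are separated by & and names from values by =, as in a URL query string. The class and method names (ParamStringSketch, parse) are hypothetical and are not taken from the attached patch. A single-valued map is used for brevity, so a repeated name keeps only its last value.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of NUTCH-1213.diff.
final class ParamStringSketch {

  // Splits "a=1&b=2" into an insertion-ordered map {a=1, b=2}.
  static Map<String, String> parse(String params) {
    Map<String, String> result = new LinkedHashMap<>();
    if (params == null || params.isEmpty()) {
      return result;
    }
    for (String pair : params.split("&")) {
      if (pair.isEmpty()) {
        continue; // tolerate "a=1&&b=2"
      }
      int eq = pair.indexOf('=');
      // A name without '=' is treated as having an empty value (assumption).
      String name = eq < 0 ? pair : pair.substring(0, eq);
      String value = eq < 0 ? "" : pair.substring(eq + 1);
      result.put(decode(name), decode(value));
    }
    return result;
  }

  // Undoes URL escaping such as %20 in names and values.
  private static String decode(String s) {
    return URLDecoder.decode(s, StandardCharsets.UTF_8);
  }
}
```

Each resulting pair could then be set on whatever parameter object accompanies the update request. When typing the command in a shell, the parameter string usually needs quoting so that & is not interpreted by the shell.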

 Pass additional SolrParams when indexing to Solr
 

 Key: NUTCH-1213
 URL: https://issues.apache.org/jira/browse/NUTCH-1213
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Attachments: NUTCH-1213.diff


 This is a simple improvement of the SolrIndexer. It adds the ability to pass 
 additional Solr parameters that are applied to each UpdateRequest. This is 
 useful when you have to pass parameters specific to a particular indexing 
 run, which are not in Solr invariants for the update handler, and modifying 
 the Solr configuration for each different indexing run is inconvenient.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (NUTCH-1197) Add statically configured field values to solrindex-mapping.xml

2011-11-03 Thread Andrzej Bialecki (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated NUTCH-1197:
-

Attachment: NUTCH-1197.patch

Patch with the implementation. I added some javadocs, and a unit test for both 
the old and the new functionality.

 Add statically configured field values to solrindex-mapping.xml
 ---

 Key: NUTCH-1197
 URL: https://issues.apache.org/jira/browse/NUTCH-1197
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Fix For: 1.4

 Attachments: NUTCH-1197.patch


 In some cases it's useful to be able to add to every document sent to Solr a 
 set of predefined fields with static values. This could be implemented on the 
 Solr side (with a custom UpdateRequestProcessor), but it may be less 
 cumbersome to add them on the Nutch side.
 Example: let's say I have several Nutch configurations all indexing to the 
 same Solr instance, and I want each of them to add its identifier as a field 
 in all documents, e.g. origin=web_crawl_1, origin=file_crawl, 
 origin=unlimited_crawl, etc...
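
 A minimal sketch of the idea, assuming a document is modelled as a plain 
 field-name to value map. StaticFieldsSketch and withStaticFields are 
 hypothetical names and do not come from the attached patch, and letting an 
 existing document field win over a static value is an assumption made for 
 this example only.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration, not the code in NUTCH-1197.patch.
final class StaticFieldsSketch {

  // Returns a copy of doc with every static field added.
  // A field already present in doc is left untouched (assumption).
  static Map<String, Object> withStaticFields(
      Map<String, Object> doc, Map<String, String> staticFields) {
    Map<String, Object> out = new LinkedHashMap<>(doc);
    for (Map.Entry<String, String> e : staticFields.entrySet()) {
      out.putIfAbsent(e.getKey(), e.getValue());
    }
    return out;
  }
}
```

 With a static map of origin=web_crawl_1, every document passed through 
 withStaticFields would carry that origin value alongside its crawled fields.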





[jira] [Updated] (NUTCH-1195) Add Solr 4x (trunk) example schema

2011-11-02 Thread Andrzej Bialecki (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated NUTCH-1195:
-

Attachment: schema-solr4.xml

 Add Solr 4x (trunk) example schema
 --

 Key: NUTCH-1195
 URL: https://issues.apache.org/jira/browse/NUTCH-1195
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Fix For: 1.4

 Attachments: schema-solr4.xml


 The conf/schema.xml that we ship works ok for Solr 3.x, but in Solr trunk 
 some of the class names have been changed, and some field types have been 
 redefined, so if you simply drop this schema into Solr it will cause severe 
 errors and indexing won't work.
 I propose to add a version of the schema.xml file that is tailored to Solr 
 4.x so that users can deploy this schema when indexing to Solr trunk.





[jira] [Updated] (NUTCH-797) parse-tika is not properly constructing URLs when the target begins with a ?

2011-10-11 Thread Andrzej Bialecki (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated NUTCH-797:


Attachment: NUTCH-797.patch

Tentative patch, which changes the meaning of fixEmbeddedParams to 
removeEmbeddedParams.

 parse-tika is not properly constructing URLs when the target begins with a ?
 --

 Key: NUTCH-797
 URL: https://issues.apache.org/jira/browse/NUTCH-797
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 1.1
 Environment: Win 7, Java(TM) SE Runtime Environment (build 
 1.6.0_16-b01)
 Also repro's on RHEL and java 1.4.2
Reporter: Robert Hohman
Assignee: Andrzej Bialecki 
Priority: Minor
 Fix For: nutchgora

 Attachments: NUTCH-797.patch, pureQueryUrl-2.patch, pureQueryUrl.patch


 This is my first bug and patch on nutch, so apologies if I have not provided 
 enough detail.
 In crawling the page at 
 http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0 there are 
 links in the page that look like this:
 <a href="?co=0&sk=0&p=2&pi=1">2</a></td><td><a 
 href="?co=0&sk=0&p=3&pi=1">3</a>
 in org.apache.nutch.parse.tika.DOMContentUtils rev 916362 (trunk), as 
 getOutlinks looks for links, it comes across this link, and constructs a new 
 url with a base URL class built from 
 "http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0", and a 
 target of "?co=0&sk=0&p=2&pi=1"
 The URL class, per RFC 3986 at 
 http://labs.apache.org/webarch/uri/rfc/rfc3986.html#relative-merge, defines 
 how to merge these two, and per the RFC, the URL class merges these to: 
 http://careers3.accenture.com/Careers/ASPX/?co=0&sk=0&p=2&pi=1
 because the RFC explicitly states that the rightmost url segment (the 
 Search.aspx in this case) should be ripped off before combining.
 While this is compliant with the RFC, it means the URLs which are created for 
 the next round of fetching are incorrect.  Modern browsers seem to handle 
 this case (I checked IE8 and Firefox 3.5), so I'm guessing this is an obscure 
 exception or handling of what is a poorly formed url on accenture's part.
 I have fixed this by modifying DOMContentUtils to look for the case where a ? 
 begins the target, and then pulling the rightmost component out of the base 
 and inserting it into the target before the ?, so the target in this example 
 becomes:
 Search.aspx?co=0&sk=0&p=2&pi=1
 The URL class then properly constructs the new url as:
 http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0&p=2&pi=1
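
 A minimal, self-contained sketch of this approach, for illustration only. It 
 is not the attached patch: the names PureQueryTargetSketch, 
 prependLastSegment and resolve are hypothetical, and it uses java.net.URI 
 for the final resolution step instead of the java.net.URL class used in 
 DOMContentUtils.

```java
import java.net.URI;

// Hypothetical illustration of the "pure query target" workaround.
final class PureQueryTargetSketch {

  // If target starts with '?', prefix it with the rightmost segment of
  // basePath ("/Careers/ASPX/Search.aspx" gives "Search.aspx").
  // Any other target is returned unchanged.
  static String prependLastSegment(String basePath, String target) {
    if (!target.startsWith("?")) {
      return target;
    }
    int slash = basePath.lastIndexOf('/');
    String last = slash == -1 ? basePath : basePath.substring(slash + 1);
    return last + target;
  }

  // Resolves target against base after applying the workaround.
  static URI resolve(URI base, String target) {
    String path = base.getPath() == null ? "" : base.getPath();
    return base.resolve(prependLastSegment(path, target));
  }
}
```

 Because the rewritten target is an ordinary relative path with a query, the 
 resolver no longer has to decide what a query-only reference means, and the 
 Search.aspx segment survives. Targets that do not begin with ? pass through 
 untouched.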
 If it is agreed that this solution works, I believe the other html parsers in 
 nutch would need to be modified in a similar way.
 Can I get feedback on this proposed solution?  Specifically I'm worried about 
 unforeseen side effects.
 Much thanks
 Here is the patch info:
 Index: src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java
 ===================================================================
 --- src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java (revision 916362)
 +++ src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java (working copy)
 @@ -299,6 +299,50 @@
      return false;
    }
 
 +  private URL fixURL(URL base, String target) throws MalformedURLException
 +  {
 +    // handle params that are embedded into the base url - move them to target
 +    // so URL class constructs the new url class properly
 +    if (base.toString().indexOf(';') > 0)
 +      return fixEmbeddedParams(base, target);
 +
 +    // handle the case that there is a target that is a pure query.
 +    // Strictly speaking this is a violation of RFC 2396 section 5.2.2 on how to assemble
 +    // URLs but I've seen this in numerous places, for example at
 +    // http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0
 +    // It has urls in the page of the form href="?co=0&sk=0&pg=1", and by default
 +    // URL constructs the base+target combo as
 +    // http://careers3.accenture.com/Careers/ASPX/?co=0&sk=0&pg=1, incorrectly
 +    // dropping the Search.aspx target
 +    //
 +    // Browsers handle these just fine, they must have an exception similar to this
 +    if (target.startsWith("?"))
 +    {
 +      return fixPureQueryTargets(base, target);
 +    }
 +
 +    return new URL(base, target);
 +  }
 +
 +  private URL fixPureQueryTargets(URL base, String target) throws MalformedURLException
 +  {
 +    if (!target.startsWith("?"))
 +      return new URL(base, target);
 +
 +    String basePath = base.getPath();
 +    String baseRightMost = "";
 +    int baseRightMostIdx = basePath.lastIndexOf("/");
 +    if (baseRightMostIdx != -1)
 +    {
 

[jira] [Updated] (NUTCH-1154) Upgrade to Tika 0.10

2011-10-07 Thread Andrzej Bialecki (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated NUTCH-1154:
-

Attachment: NUTCH-1154.diff

Patch to upgrade to Tika 0.10. Unfortunately, TestRTFParser fails with this 
version of Tika - the extracted body of the text is empty. See TIKA-748. Still, 
I think the improvements in PDF and Office parsers are worth the upgrade.

 Upgrade to Tika 0.10
 

 Key: NUTCH-1154
 URL: https://issues.apache.org/jira/browse/NUTCH-1154
 Project: Nutch
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.4
Reporter: Andrzej Bialecki 
 Attachments: NUTCH-1154.diff


 There have been significant improvements in Tika 0.10 and it would be nice to 
 use the latest Tika in 1.4.
