[jira] Commented: (NUTCH-797) parse-tika is not properly constructing URLs when the target begins with a ?

2010-03-18 Thread Jukka Zitting (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846865#action_12846865
 ] 

Jukka Zitting commented on NUTCH-797:
-

I guess we need to apply the same logic also to other Tika parsers that may 
deal with relative URLs.

Since we in any case need this functionality in Tika, would it be useful for 
Nutch if it was made available as a public utility class or method in 
tika-core? It would be great if we could avoid duplicating the code in 
different projects.


[jira] Commented: (NUTCH-797) parse-tika is not properly constructing URLs when the target begins with a ?

2010-03-17 Thread Jukka Zitting (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846521#action_12846521
 ] 

Jukka Zitting commented on NUTCH-797:
-

Wouldn't it be easier for Nutch to pass the base URL as the CONTENT_LOCATION 
metadata to the Tika parser? Then Tika would automatically apply these fixes, 
as discussed in TIKA-287.

 parse-tika is not properly constructing URLs when the target begins with a ?
 --

 Key: NUTCH-797
 URL: https://issues.apache.org/jira/browse/NUTCH-797
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 1.1
 Environment: Win 7, Java(TM) SE Runtime Environment (build 
 1.6.0_16-b01)
 Also repro's on RHEL and java 1.4.2
Reporter: Robert Hohman
Priority: Minor
 Attachments: pureQueryUrl-2.patch, pureQueryUrl.patch


 This is my first bug and patch on Nutch, so apologies if I have not provided
 enough detail.
 In crawling the page at
 http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0 there are
 links in the page that look like this:
 <a href="?co=0&sk=0&p=2&pi=1">2</a></td><td><a
 href="?co=0&sk=0&p=3&pi=1">3</a>
 In org.apache.nutch.parse.tika.DOMContentUtils rev 916362 (trunk), as
 getOutlinks looks for links, it comes across this link and constructs a new
 URL with a base URL class built from
 "http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0" and a
 target of "?co=0&sk=0&p=2&pi=1".
 The URL class, per RFC 3986 at
 http://labs.apache.org/webarch/uri/rfc/rfc3986.html#relative-merge, defines
 how to merge these two, and per the RFC, the URL class merges these to:
 http://careers3.accenture.com/Careers/ASPX/?co=0&sk=0&p=2&pi=1
 because the RFC explicitly states that the rightmost URL segment (the
 Search.aspx in this case) should be ripped off before combining.
 While this is compliant with the RFC, it means the URLs which are created for
 the next round of fetching are incorrect.  Modern browsers seem to handle
 this case (I checked IE8 and Firefox 3.5), so I'm guessing this is an obscure
 exception or handling of what is a poorly formed URL on Accenture's part.
 I have fixed this by modifying DOMContentUtils to look for the case where a ?
 begins the target, and then pulling the rightmost component out of the base
 and inserting it into the target before the ?, so the target in this example
 becomes:
 Search.aspx?co=0&sk=0&p=2&pi=1
 The URL class then properly constructs the new URL as:
 http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0&p=2&pi=1
 If it is agreed that this solution works, I believe the other HTML parsers in
 Nutch would need to be modified in a similar way.
 Can I get feedback on this proposed solution?  Specifically I'm worried about
 unforeseen side effects.
 Much thanks
 Here is the patch info:
 Index: src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java
 ===
 --- src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java (revision 916362)
 +++ src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java (working copy)
 @@ -299,6 +299,50 @@
      return false;
    }

 +  private URL fixURL(URL base, String target) throws MalformedURLException
 +  {
 +    // handle params that are embedded into the base url - move them to target
 +    // so URL class constructs the new url class properly
 +    if (base.toString().indexOf(';') > 0)
 +      return fixEmbeddedParams(base, target);
 +
 +    // handle the case that there is a target that is a pure query.
 +    // Strictly speaking this is a violation of RFC 2396 section 5.2.2 on how to assemble
 +    // URLs but I've seen this in numerous places, for example at
 +    // http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0
 +    // It has urls in the page of the form href="?co=0&sk=0&pg=1", and by default
 +    // URL constructs the base+target combo as
 +    // http://careers3.accenture.com/Careers/ASPX/?co=0&sk=0&pg=1, incorrectly
 +    // dropping the Search.aspx target
 +    //
 +    // Browsers handle these just fine, they must have an exception similar to this
 +    if (target.startsWith("?"))
 +    {
 +      return fixPureQueryTargets(base, target);
 +    }
 +
 +    return new URL(base, target);
 +  }
 +
 +  private URL fixPureQueryTargets(URL base, String target) throws MalformedURLException
 +  {
 +    if (!target.startsWith("?"))
 +      return new URL(base, target);
 +
 +    String basePath = base.getPath();
 +    String baseRightMost = "";
 +    int baseRightMostIdx = basePath.lastIndexOf("/");
 +    if (baseRightMostIdx != -1)
 +    {

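For readers following the NUTCH-797 discussion, here is a minimal sketch of the workaround the reporter describes: when the target is a pure query, the rightmost segment of the base path is put in front of it before resolving. This is not the attached patch. The class name PureQueryResolver and its resolve method are hypothetical, and the sketch uses java.net.URI instead of the java.net.URL class that DOMContentUtils uses, so it only illustrates the idea.

```java
import java.net.URI;

/** Hypothetical helper for illustration; not part of Nutch or Tika. */
final class PureQueryResolver {

    private PureQueryResolver() {}

    /**
     * Resolves target against base. If target is a pure query such as
     * "?p=2", the last path segment of base is prepended first so that
     * the resolved URI keeps that segment.
     */
    static URI resolve(URI base, String target) {
        if (!target.startsWith("?")) {
            // Ordinary relative or absolute reference: normal resolution.
            return base.resolve(target);
        }
        String basePath = base.getRawPath();
        String lastSegment = "";
        if (basePath != null) {
            int slash = basePath.lastIndexOf('/');
            if (slash != -1) {
                // Everything after the final '/' is the rightmost segment.
                lastSegment = basePath.substring(slash + 1);
            }
        }
        // "?p=2" becomes "Search.aspx?p=2", which resolves as a plain
        // relative path plus query.
        return base.resolve(lastSegment + target);
    }
}
```

Under these assumptions, resolving "?co=0&sk=0&p=2&pi=1" against http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0 gives http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0&p=2&pi=1, while targets that do not begin with ? go through normal resolution unchanged.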
[jira] Created: (NUTCH-724) Drop the JAI libraries

2009-03-19 Thread Jukka Zitting (JIRA)
Drop the JAI libraries
--

 Key: NUTCH-724
 URL: https://issues.apache.org/jira/browse/NUTCH-724
 Project: Nutch
  Issue Type: Bug
Reporter: Jukka Zitting
Priority: Blocker
 Fix For: 1.0.0


The PDF parser plugin contains Java Advanced Imaging (JAI) libraries 
(jai_core.jar and jai_codec.jar) that are licensed under the Sun Binary Code 
License. The license is incompatible with Apache policies, so we need to drop 
those libraries.

AFAIK (see PDFBOX-381) PDFBox only uses the JAI libraries for handling page 
rotations and tiff images, so simply dropping the JAI jars shouldn't have too 
much impact. A better solution would be to switch to using Apache PDFBox that 
has a proper workaround for this issue, but the first Apache PDFBox release has 
not yet been made.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-722) Nutch contains jars that we cannot redistribute

2009-03-19 Thread Jukka Zitting (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683473#action_12683473
 ] 

Jukka Zitting commented on NUTCH-722:
-

See PDFBOX-381 for how the JAI dependency issue was solved in the currently 
incubating Apache PDFBox. Unfortunately we don't yet have an official release 
of Apache PDFBox.

 Nutch contains jars that we cannot redistribute
 ---

 Key: NUTCH-722
 URL: https://issues.apache.org/jira/browse/NUTCH-722
 Project: Nutch
  Issue Type: Bug
Reporter: Sami Siren
Priority: Blocker
 Fix For: 1.0.0


 It seems that we have some jars (as part of pdf parser) that we cannot 
 redistribute.
 Jukka's comment from email:
 
 The release contains the Java Advanced Imaging libraries (jai_core.jar and 
 jai_codec.jar) which are licensed under Sun's Binary Code License. We can't 
 redistribute those libraries.
 




[jira] Commented: (NUTCH-722) Nutch contains jars that we cannot redistribute

2009-03-19 Thread Jukka Zitting (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683474#action_12683474
 ] 

Jukka Zitting commented on NUTCH-722:
-

One acceptable alternative for now is to drop the jars and add a note to end 
users that they should explicitly get and add the JAI libraries if they want 
support for PDF documents with rotated pages or embedded TIFF images.




[jira] Commented: (NUTCH-725) NOTICE.txt is lacking info that should be there

2009-03-19 Thread Jukka Zitting (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683648#action_12683648
 ] 

Jukka Zitting commented on NUTCH-725:
-

Looks good.

 NOTICE.txt is lacking info that should be there
 ---

 Key: NUTCH-725
 URL: https://issues.apache.org/jira/browse/NUTCH-725
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.0.0
Reporter: Sami Siren

 Jukka's comment from email:
 * The NOTICE.txt file should start with the following lines:
   Apache Nutch
   Copyright 2009 The Apache Software Foundation
 * The NOTICE.txt file should contain the required copyright notices
 from all bundled libraries.




[jira] Commented: (NUTCH-723) LICENCE.txt is lacking info that should be there

2009-03-19 Thread Jukka Zitting (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683649#action_12683649
 ] 

Jukka Zitting commented on NUTCH-723:
-

Looks good to me.

PS. There's not really a need to repeat the ALv2 for all Apache components, the 
first copy at the beginning is enough to cover them all (except of course any 
non-ALv2 parts). But it's no problem to repeat the license if you think it's 
clearer to explicitly mention the full licensing terms of each bundled library.

 LICENCE.txt is lacking info that should be there
 

 Key: NUTCH-723
 URL: https://issues.apache.org/jira/browse/NUTCH-723
 Project: Nutch
  Issue Type: Bug
  Components: build
Affects Versions: 1.0.0
Reporter: Sami Siren

 Jukka's comment from email:
 * The LICENSE.txt file should have at least references to the licenses of the 
 bundled libraries.




[jira] Commented: (NUTCH-621) Nutch needs to declare it's crypto usage

2008-09-28 Thread Jukka Zitting (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12635218#action_12635218
 ] 

Jukka Zitting commented on NUTCH-621:
-

 [...] get back the email from the govt [...]

It's a one-way notification, AFAIK the government never responds to the crypto 
notifications. I guess it's just for archival purposes.

So it's better not to wait for a response. :-)

 Nutch needs to declare it's crypto usage
 

 Key: NUTCH-621
 URL: https://issues.apache.org/jira/browse/NUTCH-621
 Project: Nutch
  Issue Type: Task
Reporter: Grant Ingersoll
Assignee: Chris A. Mattmann
Priority: Blocker
 Attachments: NUTCH-621.Mattmann.091008.step3.txt, 
 NUTCH-621.step1.Mattmann.090408.patch.txt, 
 NUTCH-621.step1.Mattmann.091008.patch.txt


 Per the ASF board direction outlined at 
 http://www.apache.org/dev/crypto.html, Nutch needs to declare its use of 
 crypto libraries (i.e. BouncyCastle, via PDFBox/Tika).
 See TIKA-118.
