[jira] Commented: (NUTCH-797) parse-tika is not properly constructing URLs when the target begins with a "?"

2010-03-18 Thread Jukka Zitting (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846865#action_12846865
 ] 

Jukka Zitting commented on NUTCH-797:
-

I guess we need to apply the same logic also to other Tika parsers that may 
deal with relative URLs.

Since we in any case need this functionality in Tika, would it be useful for 
Nutch if it was made available as a public utility class or method in 
tika-core? It would be great if we could avoid duplicating the code in 
different projects.
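
A minimal sketch of what such a shared helper might look like, using only java.net.URL. The class and method names (RelativeUrlResolver, resolve) are hypothetical and are not an existing Tika or Nutch API. The sketch handles only the query-only target discussed in this issue; it does not cover base URLs with embedded ";" parameters.

```java
import java.net.MalformedURLException;
import java.net.URL;

/** Hypothetical helper for illustration; not part of Tika or Nutch. */
final class RelativeUrlResolver {

    private RelativeUrlResolver() {}

    /**
     * Resolves target against base. A target that starts with '?' keeps the
     * last path segment of the base instead of losing it.
     */
    static URL resolve(URL base, String target) throws MalformedURLException {
        if (target.startsWith("?")) {
            // getPath() excludes the base query, e.g. "/Careers/ASPX/Search.aspx"
            String path = base.getPath();
            // Last path segment, e.g. "Search.aspx" (empty if the path ends in '/')
            String lastSegment = path.substring(path.lastIndexOf('/') + 1);
            // Turn "?a=b" into "Search.aspx?a=b" before handing it to URL
            target = lastSegment + target;
        }
        return new URL(base, target);
    }

    public static void main(String[] args) throws MalformedURLException {
        URL base = new URL("http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0");
        System.out.println(resolve(base, "?co=0&sk=0&p=2&pi=1"));
    }
}
```

Targets that do not start with "?" fall through to the plain URL constructor, so existing behaviour for ordinary relative links is unchanged.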

[jira] Commented: (NUTCH-797) parse-tika is not properly constructing URLs when the target begins with a "?"

2010-03-17 Thread Jukka Zitting (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846521#action_12846521
 ] 

Jukka Zitting commented on NUTCH-797:
-

Wouldn't it be easier for Nutch to pass the base URL as the CONTENT_LOCATION 
metadata to the Tika parser? Then Tika would automatically apply these fixes, 
as discussed in TIKA-287.

> parse-tika is not properly constructing URLs when the target begins with a "?"
> --
>
> Key: NUTCH-797
> URL: https://issues.apache.org/jira/browse/NUTCH-797
> Project: Nutch
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.1
> Environment: Win 7, Java(TM) SE Runtime Environment (build 1.6.0_16-b01)
> Also repro's on RHEL and java 1.4.2
> Reporter: Robert Hohman
> Priority: Minor
> Attachments: pureQueryUrl-2.patch, pureQueryUrl.patch
>
>
> This is my first bug and patch on Nutch, so apologies if I have not provided
> enough detail.
> In crawling the page at
> http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0 there are
> links in the page that look like this:
> 2 <a href="?co=0&sk=0&p=3&pi=1">3</a>
> In org.apache.nutch.parse.tika.DOMContentUtils rev 916362 (trunk), as
> getOutlinks looks for links, it comes across this link and constructs a new
> URL from a base URL object built from
> "http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0" and a
> target of "?co=0&sk=0&p=2&pi=1".
> RFC 3986, at
> http://labs.apache.org/webarch/uri/rfc/rfc3986.html#relative-merge, defines
> how to merge these two, and per the RFC, the URL class merges them to:
> http://careers3.accenture.com/Careers/ASPX/?co=0&sk=0&p=2&pi=1
> because the RFC explicitly states that the rightmost URL segment (the
> Search.aspx in this case) should be stripped off before combining.
> While this is compliant with the RFC, it means the URLs that are created for
> the next round of fetching are incorrect.  Modern browsers seem to handle
> this case (I checked IE8 and Firefox 3.5), so I'm guessing this is an obscure
> exception, or special handling of what is a poorly formed URL on Accenture's
> part.
> I have fixed this by modifying DOMContentUtils to look for the case where a
> "?" begins the target, and then pulling the rightmost component out of the
> base and inserting it into the target before the "?", so the target in this
> example becomes:
> Search.aspx?co=0&sk=0&p=2&pi=1
> The URL class then properly constructs the new URL as:
> http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0&p=2&pi=1
> If it is agreed that this solution works, I believe the other HTML parsers in
> Nutch would need to be modified in a similar way.
> Can I get feedback on this proposed solution?  Specifically, I'm worried about
> unforeseen side effects.
> Many thanks.
> Here is the patch info:
> Index: src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java
> ===
> --- src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java (revision 916362)
> +++ src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java (working copy)
> @@ -299,6 +299,50 @@
>      return false;
>    }
>
> +  private URL fixURL(URL base, String target) throws MalformedURLException
> +  {
> +    // handle params that are embedded into the base url - move them to target
> +    // so URL class constructs the new url class properly
> +    if (base.toString().indexOf(';') > 0)
> +      return fixEmbeddedParams(base, target);
> +
> +    // handle the case that there is a target that is a pure query.
> +    // Strictly speaking this is a violation of RFC 2396 section 5.2.2 on how to assemble
> +    // URLs but I've seen this in numerous places, for example at
> +    // http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0
> +    // It has urls in the page of the form href="?co=0&sk=0&pg=1", and by default
> +    // URL constructs the base+target combo as
> +    // http://careers3.accenture.com/Careers/ASPX/?co=0&sk=0&pg=1, incorrectly
> +    // dropping the Search.aspx target
> +    //
> +    // Browsers handle these just fine, they must have an exception similar to this
> +    if (target.startsWith("?"))
> +    {
> +      return fixPureQueryTargets(base, target);
> +    }
> +
> +    return new URL(base, target);
> +  }
> +
> +  private URL fixPureQueryTargets(URL base, String target) throws MalformedURLException
> +  {
> +    if (!target.startsWith("?"))
> +      return new URL(base, target);
> +
> +    String basePath = base.getPath();
> +    String baseRightM
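
The behaviour described in the report can be reproduced with the stock java.net.URL constructor alone. The following is a small sketch for illustration; the class and method names (PureQueryDemo, resolveWithUrlClass) are hypothetical and do not come from the patch.

```java
import java.net.MalformedURLException;
import java.net.URL;

/** Hypothetical demo class; the names are not taken from the Nutch patch. */
final class PureQueryDemo {

    private PureQueryDemo() {}

    /** Resolves target against base using only the java.net.URL constructor. */
    static String resolveWithUrlClass(String base, String target) throws MalformedURLException {
        return new URL(new URL(base), target).toString();
    }

    public static void main(String[] args) throws MalformedURLException {
        String base = "http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0";
        // Query-only target: the last path segment of the base (Search.aspx) is dropped
        System.out.println(resolveWithUrlClass(base, "?co=0&sk=0&p=2&pi=1"));
        // Target rewritten as described in the report: the segment is kept
        System.out.println(resolveWithUrlClass(base, "Search.aspx?co=0&sk=0&p=2&pi=1"));
    }
}
```

For comparison, the reference-resolution examples in RFC 3986 section 5.4.1 resolve a query-only reference ("?y") without dropping the last segment of the base path, whereas the examples in the older RFC 2396 drop it. The java.net.URL result shown here therefore appears to follow the older rules, which matches the RFC 2396 reference in the patch comment.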

[jira] Commented: (NUTCH-723) LICENCE.txt is lacking info that should be there

2009-03-19 Thread Jukka Zitting (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683649#action_12683649
 ] 

Jukka Zitting commented on NUTCH-723:
-

Looks good to me.

PS. There's not really a need to repeat the ALv2 for all Apache components; the 
first copy at the beginning is enough to cover them all (except of course any 
non-ALv2 parts). But it's no problem to repeat the license if you think it's 
clearer to explicitly mention the full licensing terms of each bundled library.

> LICENCE.txt is lacking info that should be there
> 
>
> Key: NUTCH-723
> URL: https://issues.apache.org/jira/browse/NUTCH-723
> Project: Nutch
> Issue Type: Bug
> Components: build
> Affects Versions: 1.0.0
> Reporter: Sami Siren
>
> Jukka's comment from email:
> * The LICENSE.txt file should have at least references to the licenses of the 
> bundled libraries.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-725) NOTICE.txt is lacking info that should be there

2009-03-19 Thread Jukka Zitting (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683648#action_12683648
 ] 

Jukka Zitting commented on NUTCH-725:
-

Looks good.

> NOTICE.txt is lacking info that should be there
> ---
>
> Key: NUTCH-725
> URL: https://issues.apache.org/jira/browse/NUTCH-725
> Project: Nutch
> Issue Type: Bug
> Affects Versions: 1.0.0
> Reporter: Sami Siren
>
> Jukka's comment from email:
> * The NOTICE.txt file should start with the following lines:
>   Apache Nutch
>   Copyright 2009 The Apache Software Foundation
> * The NOTICE.txt file should contain the required copyright notices
> from all bundled libraries.


[jira] Commented: (NUTCH-722) Nutch contains jars that we cannot redistribute

2009-03-19 Thread Jukka Zitting (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683474#action_12683474
 ] 

Jukka Zitting commented on NUTCH-722:
-

One acceptable alternative for now is to drop the jars and add a note to end 
users that they should explicitly get and add the JAI libraries if they want 
support for PDF documents with rotated pages or embedded TIFF images.

> Nutch contains jars that we cannot redistribute
> ---
>
> Key: NUTCH-722
> URL: https://issues.apache.org/jira/browse/NUTCH-722
> Project: Nutch
> Issue Type: Bug
> Reporter: Sami Siren
> Priority: Blocker
> Fix For: 1.0.0
>
>
> It seems that we have some jars (as part of pdf parser) that we cannot 
> redistribute.
> Jukka's comment from email:
> "
> The release contains the Java Advanced Imaging libraries (jai_core.jar and 
> jai_codec.jar) which are licensed under Sun's Binary Code License. We can't 
> redistribute those libraries.
> "


[jira] Commented: (NUTCH-722) Nutch contains jars that we cannot redistribute

2009-03-19 Thread Jukka Zitting (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683473#action_12683473
 ] 

Jukka Zitting commented on NUTCH-722:
-

See PDFBOX-381 for how the JAI dependency issue was solved in the currently 
incubating Apache PDFBox. Unfortunately we don't yet have an official release 
of Apache PDFBox.


[jira] Created: (NUTCH-724) Drop the JAI libraries

2009-03-19 Thread Jukka Zitting (JIRA)
Drop the JAI libraries
--

Key: NUTCH-724
URL: https://issues.apache.org/jira/browse/NUTCH-724
Project: Nutch
Issue Type: Bug
Reporter: Jukka Zitting
Priority: Blocker
Fix For: 1.0.0


The PDF parser plugin contains Java Advanced Imaging (JAI) libraries 
(jai_core.jar and jai_codec.jar) that are licensed under the Sun Binary Code 
License. The license is incompatible with Apache policies, so we need to drop 
those libraries.

AFAIK (see PDFBOX-381) PDFBox only uses the JAI libraries for handling page 
rotations and TIFF images, so simply dropping the JAI jars shouldn't have too 
much impact. A better solution would be to switch to Apache PDFBox, which 
has a proper workaround for this issue, but the first Apache PDFBox release has 
not yet been made.


[jira] Commented: (NUTCH-621) Nutch needs to declare it's crypto usage

2008-09-28 Thread Jukka Zitting (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12635218#action_12635218
 ] 

Jukka Zitting commented on NUTCH-621:
-

> [...] get back the email from the govt [...]

It's a one-way notification; AFAIK the government never responds to the crypto 
notifications. I guess it's just for archival purposes.

So it's better not to wait for a response. :-)

> Nutch needs to declare it's crypto usage
> 
>
> Key: NUTCH-621
> URL: https://issues.apache.org/jira/browse/NUTCH-621
> Project: Nutch
> Issue Type: Task
> Reporter: Grant Ingersoll
> Assignee: Chris A. Mattmann
> Priority: Blocker
> Attachments: NUTCH-621.Mattmann.091008.step3.txt,
> NUTCH-621.step1.Mattmann.090408.patch.txt,
> NUTCH-621.step1.Mattmann.091008.patch.txt
>
>
> Per the ASF board direction outlined at 
> http://www.apache.org/dev/crypto.html, Nutch needs to declare its use of 
> crypto libraries (i.e. BouncyCastle, via PDFBox/Tika).
> See TIKA-118.
