Re: GSOC RDF Microformats Support
Hi Chris,

Thanks for your feedback. I was planning to use Any23 and Tika, but I don't have a detailed grasp of either project. I guess I'm going to need to dive into both. I would appreciate it if you could guide me. Thanks.

On Fri, Mar 27, 2015 at 4:07 PM, Mattmann, Chris A (3980) chris.a.mattm...@jpl.nasa.gov wrote:
> Hi Remzi - thanks! You may want to consider this as a Tika or Any23 project since Nutch delegates its parsing to Tika (and Any23 uses Tika [and vice versa] to handle micro formats).
> Cheers, Chris
> ++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: chris.a.mattm...@nasa.gov
> WWW: http://sunset.usc.edu/~mattmann/
> ++
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++
> -----Original Message-----
> From: Remzi Düzağaç remz...@gmail.com
> Reply-To: d...@nutch.apache.org d...@nutch.apache.org
> Date: Friday, March 27, 2015 at 5:07 AM
> To: d...@nutch.apache.org d...@nutch.apache.org
> Subject: GSOC RDF Microformats Support
>
> Hi Guys,
> I have sent a proposal to GSoC. I would like to add RDF microformat support to Nutch. I kindly ask for your support. Would anyone volunteer to be my mentor on this topic?
> Thank you very much
[jira] [Updated] (TIKA-1581) jhighlight license concerns
[ https://issues.apache.org/jira/browse/TIKA-1581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hong-Thai Nguyen updated TIKA-1581:
    Fix Version/s: 1.8

jhighlight license concerns
    Key: TIKA-1581
    URL: https://issues.apache.org/jira/browse/TIKA-1581
    Project: Tika
    Issue Type: Bug
    Affects Versions: 1.7
    Reporter: Karl Wright
    Fix For: 1.8

The jhighlight jar is a Tika dependency. The Lucene team discovered that, while it claims to be dual-licensed under CDDL/LGPL, some of its functionality is LGPL only:
{code}
Solr's contrib/extraction contains jhighlight-1.0.jar which declares itself as dual CDDL or LGPL license. However, some of its classes are distributed only under LGPL, e.g. com.uwyn.jhighlight.highlighter.
  CppHighlighter.java
  GroovyHighlighter.java
  JavaHighlighter.java
  XmlHighlighter.java
I downloaded the sources from Maven (http://search.maven.org/remotecontent?filepath=com/uwyn/jhighlight/1.0/jhighlight-1.0-sources.jar) to confirm that, and also found this SVN repo: http://svn.rifers.org/jhighlight/tags/release-1.0, though the project's website seems to not exist anymore (https://jhighlight.dev.java.net/).
I didn't find any direct usage of it in our code, so I guess it's probably needed by a 3rd party dependency, such as Tika. Therefore if we e.g. omit it, things will compile, but may fail at runtime.
{code}
Is it possible to remove this dependency for future releases, or allow only optional inclusion of this package? It is of concern to the ManifoldCF project because we distribute a binary package that includes Tika and its required dependencies, which currently include jHighlight.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (TIKA-1581) jhighlight license concerns
[ https://issues.apache.org/jira/browse/TIKA-1581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hong-Thai Nguyen resolved TIKA-1581.
    Resolution: Fixed
[jira] [Closed] (TIKA-1354) ForkParser doesn't work in OSGI container
[ https://issues.apache.org/jira/browse/TIKA-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tyler Palsulich closed TIKA-1354.
    Resolution: Fixed
    Fix Version/s: 1.7

Marking as Fixed.

ForkParser doesn't work in OSGI container
    Key: TIKA-1354
    URL: https://issues.apache.org/jira/browse/TIKA-1354
    Project: Tika
    Issue Type: Bug
    Components: parser
    Affects Versions: 1.6
    Reporter: Michal Hlavac
    Fix For: 1.7

I can't find a way to run ForkParser in an OSGi container.
[jira] [Commented] (TIKA-1581) jhighlight license concerns
[ https://issues.apache.org/jira/browse/TIKA-1581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14384025#comment-14384025 ]

Hudson commented on TIKA-1581:

FAILURE: Integrated in tika-trunk-jdk1.7 #575 (See [https://builds.apache.org/job/tika-trunk-jdk1.7/575/])
TIKA-1581 - Typo CHANGES.txt (thaichat04: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1669602)
* /tika/trunk/CHANGES.txt
* /tika/trunk/NOTICE.txt
tika-trunk-jdk1.7 - Build # 575 - Failure
The Apache Jenkins build system has built tika-trunk-jdk1.7 (build #575) Status: Failure Check console output at https://builds.apache.org/job/tika-trunk-jdk1.7/575/ to view the results.
Enabling CORS
Hi Folks, I'm trying to enable CORS on a few of Tika's Server resources. But, after adding the pom.xml dependency and a @CrossOriginResourceSharing( allowOrigins = {url} ) annotation to the resources, the Access-Control-Allow-Origin header is still not given. Is there another configuration I need to add? Tika's server doesn't currently have a bean configuration like at the bottom of the examples page http://cxf.apache.org/docs/jax-rs-cors.html#JAX-RSCORS-Examples. Thanks for any help, Tyler
[jira] [Commented] (TIKA-1583) Convert Module Level READMEs to Markdown
[ https://issues.apache.org/jira/browse/TIKA-1583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14384355#comment-14384355 ]

Hudson commented on TIKA-1583:

SUCCESS: Integrated in tika-trunk-jdk1.7 #576 (See [https://builds.apache.org/job/tika-trunk-jdk1.7/576/])
TIKA-1583. Remove old README. (tpalsulich: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1669651)
* /tika/trunk/tika-server/README
TIKA-1583. Small formatting changes for tika-server README. (tpalsulich: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1669645)
* /tika/trunk/tika-server/README.md
TIKA-1583. Convert tika-server README to markdown. (tpalsulich: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1669644)
* /tika/trunk/tika-server/README.md

Convert Module Level READMEs to Markdown
    Key: TIKA-1583
    URL: https://issues.apache.org/jira/browse/TIKA-1583
    Project: Tika
    Issue Type: Improvement
    Reporter: Tyler Palsulich
    Assignee: Tyler Palsulich
    Priority: Minor
Re: Access Control Allow Origin
Thank you, Sergey! I didn't know about that feature. I am going to try to work up a patch this weekend which enables CORS. I'll let you know if I run into any issues.

Thanks again,
Tyler

On Thu, Mar 26, 2015 at 2:39 AM, Mattmann, Chris A (3980) chris.a.mattm...@jpl.nasa.gov wrote:
> -----Original Message-----
> From: Tyler Palsulich tpalsul...@gmail.com
> Reply-To: dev@tika.apache.org dev@tika.apache.org
> Date: Tuesday, March 24, 2015 at 3:41 PM
> To: dev@tika.apache.org dev@tika.apache.org
> Subject: Access Control Allow Origin
>
> Hi Folks,
> I took a stab at creating an example website to submit a file to the form resource of our VM. See http://tpalsulich.github.io/TikaExamples/. If I try to use AJAX to submit the request to make the page prettier (see the script in the head of the page, with ev.preventDefault() commented out), I get the following error:
>
> XMLHttpRequest cannot load http://162.242.228.174:9998/tika/form. No 'Access-Control-Allow-Origin' header is present on the requested resource. Origin 'http://tpalsulich.github.io' is therefore not allowed access. The response had HTTP status code 400.
>
> We can't allow the tika-server response header to accept * in general, since that isn't secure. So, would there be interest in including this sort of site on the VM? Then, the AJAX request won't be external and we won't have this error. The version button just takes you to the version resource on the VM (doesn't do anything with the file).
>
> Tyler
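Editor's sketch: the quoted message rules out answering every request with "Access-Control-Allow-Origin: *". One common alternative is an allow-list, where the server echoes the request's Origin back only when it is explicitly trusted and otherwise omits the header. The class below is a minimal illustration of that decision using only the JDK. CorsAllowList and allowOriginHeader are hypothetical names, not part of tika-server or CXF, and a CXF-based server would normally leave this to the framework's own CORS support.

```java
import java.util.HashSet;
import java.util.Optional;
import java.util.Set;

/** Hypothetical helper for illustration only; not tika-server or CXF code. */
final class CorsAllowList {
    private final Set<String> allowedOrigins;

    CorsAllowList(Set<String> allowedOrigins) {
        // Defensive copy so later changes to the caller's set have no effect.
        this.allowedOrigins = new HashSet<>(allowedOrigins);
    }

    /**
     * Returns the value to send in Access-Control-Allow-Origin,
     * or empty when the header should be left out of the response.
     */
    Optional<String> allowOriginHeader(String requestOrigin) {
        if (requestOrigin != null && allowedOrigins.contains(requestOrigin)) {
            return Optional.of(requestOrigin); // echo the trusted origin, never "*"
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Set<String> trusted = new HashSet<>();
        trusted.add("http://tpalsulich.github.io"); // origin taken from the error message above
        CorsAllowList cors = new CorsAllowList(trusted);
        System.out.println(cors.allowOriginHeader("http://tpalsulich.github.io"));
        System.out.println(cors.allowOriginHeader("http://example.com"));
    }
}
```

With this shape, a browser request from an unlisted origin still fails the same-origin check, which is the behaviour the message asks for, while the example page's origin is let through.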
[jira] [Created] (TIKA-1583) Convert Module Level READMEs to Markdown
Tyler Palsulich created TIKA-1583:
    Summary: Convert Module Level READMEs to Markdown
    Key: TIKA-1583
    URL: https://issues.apache.org/jira/browse/TIKA-1583
    Project: Tika
    Issue Type: Improvement
    Reporter: Tyler Palsulich
    Assignee: Tyler Palsulich
    Priority: Minor
[jira] [Resolved] (TIKA-1583) Convert Module Level READMEs to Markdown
[ https://issues.apache.org/jira/browse/TIKA-1583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tyler Palsulich resolved TIKA-1583.
    Resolution: Done

Done in r1669644 and r1669645.
[jira] [Commented] (TIKA-1577) NetCDF Data Extraction
[ https://issues.apache.org/jira/browse/TIKA-1577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14384323#comment-14384323 ]

Ann Burgess commented on TIKA-1577:

This is a great idea. I'm all for not re-creating code if it already exists in good form!

NetCDF Data Extraction
    Key: TIKA-1577
    URL: https://issues.apache.org/jira/browse/TIKA-1577
    Project: Tika
    Issue Type: Improvement
    Components: handler, parser
    Affects Versions: 1.7
    Reporter: Ann Burgess
    Assignee: Ann Burgess
    Labels: features, handler
    Fix For: 1.8
    Original Estimate: 504h
    Remaining Estimate: 504h

A netCDF classic or 64-bit offset dataset is stored as a single file comprising two parts:
- a header, containing all the information about dimensions, attributes, and variables except for the variable data;
- a data part, comprising fixed-size data, containing the data for variables that don't have an unlimited dimension; and variable-size data, containing the data for variables that have an unlimited dimension.

The NetCDFParser currently extracts the header part:
-- text extracts file Dimensions and Variables
-- metadata extracts Global Attributes

We want the option to extract the data part of NetCDF files. Let's use the NetCDF test file for our dev testing:
tika/tika-parsers/src/test/resources/test-documents/sresa1b_ncar_ccsm3_0_run1_21.nc
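Editor's sketch: the header/data split described in the issue can be seen in the first bytes of a file. In the netCDF classic format, the header begins with the bytes 'C' 'D' 'F', then a version byte (1 for classic, 2 for 64-bit offset), then a 4-byte big-endian record count (numrecs) for the unlimited dimension. The class below reads only those eight bytes, using the JDK alone. NetcdfHeaderPeek is a hypothetical name, and this is not how Tika's parser or the netCDF-Java library works internally.

```java
import java.nio.ByteBuffer;

/** Hypothetical helper for illustration only; not part of Tika or netCDF-Java. */
final class NetcdfHeaderPeek {

    /** Describes the first 8 bytes of a file: 'C' 'D' 'F', a version byte, then numrecs. */
    static String describe(byte[] head) {
        if (head == null || head.length < 8
                || head[0] != 'C' || head[1] != 'D' || head[2] != 'F') {
            return "not a netCDF classic-model file";
        }
        String variant;
        switch (head[3]) {
            case 1: variant = "classic"; break;
            case 2: variant = "64-bit offset"; break;
            default: return "unrecognized netCDF version byte " + head[3];
        }
        // numrecs is a 4-byte big-endian count of records along the unlimited
        // dimension; ByteBuffer reads big-endian by default.
        int numrecs = ByteBuffer.wrap(head, 4, 4).getInt();
        return variant + ", numrecs=" + Integer.toUnsignedString(numrecs);
    }

    public static void main(String[] args) {
        // Hand-built sample bytes, not read from the test file named in the issue.
        byte[] sample = {'C', 'D', 'F', 1, 0, 0, 0, 3};
        System.out.println(describe(sample)); // prints "classic, numrecs=3"
    }
}
```

Everything after numrecs (the dimension, attribute, and variable lists, then the data part) needs a real reader, so reusing an existing library for that fits the comment above about not re-creating code.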
Re: Enabling CORS
Hi Tyler,

You need to add CrossOriginResourceSharingFilter to the list of providers for your server. Please let me know if you need help with that; the way to add it depends on how you configure the server.

Best Regards,
Andriy Redko

TP> Hi Folks,
TP> I'm trying to enable CORS on a few of Tika's Server resources. But, after
TP> adding the pom.xml dependency and a
TP> @CrossOriginResourceSharing(
TP>     allowOrigins = {url}
TP> )
TP> annotation to the resources, the Access-Control-Allow-Origin header is
TP> still not given.
TP> Is there another configuration I need to add? Tika's server doesn't
TP> currently have a bean configuration like at the bottom of the examples page
TP> http://cxf.apache.org/docs/jax-rs-cors.html#JAX-RSCORS-Examples.
TP> Thanks for any help,
TP> Tyler
Re: Enabling CORS
Hi,

That worked! Thank you! I'll let you know if I have any more issues.

Tyler

On Fri, Mar 27, 2015 at 5:00 PM, Andriy Redko drr...@gmail.com wrote:
> Hi Tyler,
> You need to add CrossOriginResourceSharingFilter to the list of providers for your server. Please let me know if you need help with that; the way to add it depends on how you configure the server.
> Best Regards,
> Andriy Redko
Re: GSOC RDF Microformats Support
Hi Remzi,

Sure! Check out this 5-minute guide to writing a parser in Tika:
https://tika.apache.org/1.7/parser_guide.html

OK, so then check out Any23:
http://any23.apache.org/

It has support for parsing RDF Microformats. So, you may want to create a MicroformatsParser in Tika; then if it’s supported in Tika, it will in turn be available in Nutch and its parse-tika plugin if you upgrade it to the latest version of Tika. You can see how to do this here:
http://s.apache.org/fsY

Cheers and best of luck - hope that’s enough to get your proposal kicked off.

Cheers,
Chris

-----Original Message-----
From: Remzi Düzağaç remz...@gmail.com
Reply-To: d...@nutch.apache.org d...@nutch.apache.org
Date: Friday, March 27, 2015 at 7:22 AM
To: dev d...@nutch.apache.org
Cc: dev@tika.apache.org dev@tika.apache.org, d...@any23.apache.org d...@any23.apache.org
Subject: Re: GSOC RDF Microformats Support

> Hi Chris,
> Thanks for your feedback. I was planning to use Any23 and Tika, but I don't have a detailed grasp of either project. I guess I'm going to need to dive into both. I would appreciate it if you could guide me. Thanks.
FW: [DEADLINE] Google Summer of Code Deadline Approaching Soon
++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++

-----Original Message-----
From: Lewis John Mcgibbney lewis.mcgibb...@gmail.com
Reply-To: d...@any23.apache.org d...@any23.apache.org
Date: Wednesday, March 25, 2015 at 9:35 PM
To: u...@nutch.apache.org u...@nutch.apache.org, d...@nutch.apache.org d...@nutch.apache.org, u...@gora.apache.org u...@gora.apache.org, d...@gora.apache.org d...@gora.apache.org, u...@any23.apache.org u...@any23.apache.org, d...@any23.apache.org d...@any23.apache.org, u...@oodt.apache.org u...@oodt.apache.org, d...@oodt.apache.org d...@oodt.apache.org
Subject: [DEADLINE] Google Summer of Code Deadline Approaching Soon

Hi All,

The deadline for this year's GSoC student submissions is approaching fast, and I would be very keen to see more proposals from the communities above. I've been involved on and off with several students from across all of the above communities, which is why I am emailing these lists. I would strongly suggest that any students still planning to submit get their submissions in ASAP.

Thanks
Lewis