Re: GSOC RDF Microformats Support

2015-03-27 Thread Remzi Düzağaç
Hi Chris,

Thanks for your feedback.
I was planning to use Any23 and Tika, but I don't have a detailed grasp of either
project. I guess I'm going to need to dive into both.
I would appreciate it if you could guide me.

thanks

On Fri, Mar 27, 2015 at 4:07 PM, Mattmann, Chris A (3980) 
chris.a.mattm...@jpl.nasa.gov wrote:

 Hi Remzi - thanks! You may want to consider this as a Tika or
 Any23 project since Nutch delegates its parsing to Tika (and
 Any23 uses Tika [and vice versa] to handle micro formats).

 Cheers,
 Chris

 ++
 Chris Mattmann, Ph.D.
 Chief Architect
 Instrument Software and Science Data Systems Section (398)
 NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
 Office: 168-519, Mailstop: 168-527
 Email: chris.a.mattm...@nasa.gov
 WWW:  http://sunset.usc.edu/~mattmann/
 ++
 Adjunct Associate Professor, Computer Science Department
 University of Southern California, Los Angeles, CA 90089 USA
 ++






 -Original Message-
 From: Remzi Düzağaç remz...@gmail.com
 Reply-To: d...@nutch.apache.org d...@nutch.apache.org
 Date: Friday, March 27, 2015 at 5:07 AM
 To: d...@nutch.apache.org d...@nutch.apache.org
 Subject: GSOC RDF Microformats Support

 Hi Guys,
 
 
 I have sent a proposal to GSoC. I would like to add RDF microformat
 support to Nutch. I kindly ask for your support. Is anyone willing to
 volunteer as my mentor on this topic?
 
 
 Thank you very much
 




[jira] [Updated] (TIKA-1581) jhighlight license concerns

2015-03-27 Thread Hong-Thai Nguyen (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen updated TIKA-1581:
---
Fix Version/s: 1.8

 jhighlight license concerns
 ---

 Key: TIKA-1581
 URL: https://issues.apache.org/jira/browse/TIKA-1581
 Project: Tika
  Issue Type: Bug
Affects Versions: 1.7
Reporter: Karl Wright
 Fix For: 1.8


 jhighlight jar is a Tika dependency.  The Lucene team discovered that, while 
 it claims to be a CDDL/LGPL dual-license, some of its functionality is LGPL 
 only:
 {code}
 Solr's contrib/extraction contains jhighlight-1.0.jar which declares itself 
 as dual CDDL or LGPL license. However, some of its classes are distributed 
 only under LGPL, e.g.
 com.uwyn.jhighlight.highlighter.
   CppHighlighter.java
   GroovyHighlighter.java
   JavaHighlighter.java
   XmlHighlighter.java
 I downloaded the sources from Maven 
 (http://search.maven.org/remotecontent?filepath=com/uwyn/jhighlight/1.0/jhighlight-1.0-sources.jar)
  to confirm that, and also found this SVN repo: 
 http://svn.rifers.org/jhighlight/tags/release-1.0, though the project's 
 website seems to not exist anymore (https://jhighlight.dev.java.net/).
 I didn't find any direct usage of it in our code, so I guess it's probably 
 needed by a 3rd party dependency, such as Tika. Therefore if we e.g. omit it, 
 things will compile, but may fail at runtime.
 {code}
 Is it possible to remove this dependency for future releases, or allow only 
 optional inclusion of this package?  It is of concern to the ManifoldCF 
 project because we distribute a binary package that includes Tika and its 
 required dependencies, which currently includes jHighlight.
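
The "compiles, but may fail at runtime" risk quoted above can be guarded against by probing for the optional class before using it. The following is only a minimal sketch in plain Java, not Tika's actual handling of jhighlight; the class name OptionalDependencyCheck and the method isPresent are hypothetical, and the probed class name is taken from the list in the report.

```java
/**
 * Hypothetical helper: checks whether an optional dependency is on the
 * classpath without initializing it, so callers can degrade gracefully.
 */
final class OptionalDependencyCheck {

    /** Returns true if the named class can be loaded, false otherwise. */
    static boolean isPresent(String className) {
        try {
            // initialize=false: only locate and load the class, run no static code
            Class.forName(className, false,
                    OptionalDependencyCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // Missing jar, or a jar whose own dependencies are missing
            return false;
        }
    }

    public static void main(String[] args) {
        // Class name as listed in the report above
        String highlighter = "com.uwyn.jhighlight.highlighter.JavaHighlighter";
        if (isPresent(highlighter)) {
            System.out.println("jhighlight available: highlighting enabled");
        } else {
            System.out.println("jhighlight missing: falling back to plain text");
        }
    }
}
```

A downstream project that omits the jar would then take the fallback branch instead of hitting a NoClassDefFoundError during parsing.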



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (TIKA-1581) jhighlight license concerns

2015-03-27 Thread Hong-Thai Nguyen (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen resolved TIKA-1581.

Resolution: Fixed



[jira] [Closed] (TIKA-1354) ForkParser doesn't work in OSGI container

2015-03-27 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich closed TIKA-1354.
-
   Resolution: Fixed
Fix Version/s: 1.7

Marking as Fixed.

 ForkParser doesn't work in OSGI container
 -

 Key: TIKA-1354
 URL: https://issues.apache.org/jira/browse/TIKA-1354
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.6
Reporter: Michal Hlavac
 Fix For: 1.7


 I can't find a way to run ForkParser in an OSGi container.





[jira] [Commented] (TIKA-1581) jhighlight license concerns

2015-03-27 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14384025#comment-14384025
 ] 

Hudson commented on TIKA-1581:
--

FAILURE: Integrated in tika-trunk-jdk1.7 #575 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/575/])
TIKA-1581 - Typo  CHANGES.txt (thaichat04: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1669602)
* /tika/trunk/CHANGES.txt
* /tika/trunk/NOTICE.txt




tika-trunk-jdk1.7 - Build # 575 - Failure

2015-03-27 Thread Apache Jenkins Server
The Apache Jenkins build system has built tika-trunk-jdk1.7 (build #575)

Status: Failure

Check console output at https://builds.apache.org/job/tika-trunk-jdk1.7/575/ to 
view the results.

Enabling CORS

2015-03-27 Thread Tyler Palsulich
Hi Folks,

I'm trying to enable CORS on a few of Tika's Server resources. But, after
adding the pom.xml dependency and a

@CrossOriginResourceSharing(
allowOrigins = {url}
)

annotation to the resources, the Access-Control-Allow-Origin header is
still not given.

Is there another configuration I need to add? Tika's server doesn't
currently have a bean configuration like at the bottom of the examples page
http://cxf.apache.org/docs/jax-rs-cors.html#JAX-RSCORS-Examples.

Thanks for any help,
Tyler


[jira] [Commented] (TIKA-1583) Convert Module Level READMEs to Markdown

2015-03-27 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14384355#comment-14384355
 ] 

Hudson commented on TIKA-1583:
--

SUCCESS: Integrated in tika-trunk-jdk1.7 #576 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/576/])
TIKA-1583. Remove old README. (tpalsulich: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1669651)
* /tika/trunk/tika-server/README
TIKA-1583. Small formatting changes for tika-server README. (tpalsulich: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1669645)
* /tika/trunk/tika-server/README.md
TIKA-1583. Convert tika-server README to markdown. (tpalsulich: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1669644)
* /tika/trunk/tika-server/README.md


 Convert Module Level READMEs to Markdown
 

 Key: TIKA-1583
 URL: https://issues.apache.org/jira/browse/TIKA-1583
 Project: Tika
  Issue Type: Improvement
Reporter: Tyler Palsulich
Assignee: Tyler Palsulich
Priority: Minor







Re: Access Control Allow Origin

2015-03-27 Thread Tyler Palsulich
Thank you, Sergey! I didn't know about that feature. I am going to try to
work up a patch this weekend which enables CORS. I'll let you know if I run
into any issues.

Thanks again,
Tyler

On Thu, Mar 26, 2015 at 2:39 AM, Mattmann, Chris A (3980) 
chris.a.mattm...@jpl.nasa.gov wrote:



 -Original Message-
 From: Tyler Palsulich tpalsul...@gmail.com
 Reply-To: dev@tika.apache.org dev@tika.apache.org
 Date: Tuesday, March 24, 2015 at 3:41 PM
 To: dev@tika.apache.org dev@tika.apache.org
 Subject: Access Control Allow Origin

 Hi Folks,
 
 I took a stab at creating an example website to submit a file to the form
 resource of our VM. See http://tpalsulich.github.io/TikaExamples/.
 
 If I try to use AJAX to submit the request to make the page prettier (see
 the script in the head of the page, with ev.preventDefault() commented
 out), I get the following error:
 
 XMLHttpRequest cannot load http://162.242.228.174:9998/tika/form. No
 'Access-Control-Allow-Origin' header is present on the requested resource.
 Origin 'http://tpalsulich.github.io' is therefore not allowed access. The
 response had HTTP status code 400.
 
 We can't allow the tika-server response header to accept * in general,
 since that isn't secure. So, would there be interest in including this sort
 of site on the VM? Then, the AJAX request won't be external and we won't
 have this error.
 
 The version button just takes you to the version resource on the VM
 (doesn't do anything with the file).
 
 Tyler
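
The trade-off described in this message (not answering every origin with *, while still letting a known page through) is commonly handled with an origin allow-list. The following is a minimal sketch in plain Java of that decision only; it is not CXF's or tika-server's API, and the names CorsAllowList and allowOriginHeader are hypothetical.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Optional;
import java.util.Set;

/** Hypothetical helper: decides which Access-Control-Allow-Origin value to send. */
final class CorsAllowList {

    private final Set<String> allowedOrigins;

    CorsAllowList(Set<String> allowedOrigins) {
        // Defensive copy so later changes by the caller have no effect
        this.allowedOrigins = new HashSet<>(allowedOrigins);
    }

    /**
     * Returns the header value to echo back for an allowed origin,
     * or empty if the header should be omitted (the browser then blocks the call).
     */
    Optional<String> allowOriginHeader(String requestOrigin) {
        if (requestOrigin != null && allowedOrigins.contains(requestOrigin)) {
            return Optional.of(requestOrigin);
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Origin taken from the error message quoted above
        CorsAllowList cors =
                new CorsAllowList(Collections.singleton("http://tpalsulich.github.io"));
        System.out.println(cors.allowOriginHeader("http://tpalsulich.github.io"));
        System.out.println(cors.allowOriginHeader("http://example.org"));
    }
}
```

Echoing back only origins from the list avoids the blanket * while still permitting the example page; hosting the page on the VM itself, as proposed above, sidesteps the check entirely because the request is then same-origin.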




[jira] [Created] (TIKA-1583) Convert Module Level READMEs to Markdown

2015-03-27 Thread Tyler Palsulich (JIRA)
Tyler Palsulich created TIKA-1583:
-

 Summary: Convert Module Level READMEs to Markdown
 Key: TIKA-1583
 URL: https://issues.apache.org/jira/browse/TIKA-1583
 Project: Tika
  Issue Type: Improvement
Reporter: Tyler Palsulich
Assignee: Tyler Palsulich
Priority: Minor








[jira] [Resolved] (TIKA-1583) Convert Module Level READMEs to Markdown

2015-03-27 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich resolved TIKA-1583.
---
Resolution: Done

Done in r1669644 and r1669645.

 Convert Module Level READMEs to Markdown
 

 Key: TIKA-1583
 URL: https://issues.apache.org/jira/browse/TIKA-1583
 Project: Tika
  Issue Type: Improvement
Reporter: Tyler Palsulich
Assignee: Tyler Palsulich
Priority: Minor







[jira] [Commented] (TIKA-1577) NetCDF Data Extraction

2015-03-27 Thread Ann Burgess (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14384323#comment-14384323
 ] 

Ann Burgess commented on TIKA-1577:
---

This is a great idea.  I'm all for not re-creating code if it already exists in 
good form!

 NetCDF Data Extraction
 --

 Key: TIKA-1577
 URL: https://issues.apache.org/jira/browse/TIKA-1577
 Project: Tika
  Issue Type: Improvement
  Components: handler, parser
Affects Versions: 1.7
Reporter: Ann Burgess
Assignee: Ann Burgess
  Labels: features, handler
 Fix For: 1.8

   Original Estimate: 504h
  Remaining Estimate: 504h

 A netCDF classic or 64-bit offset dataset is stored as a single file 
 comprising two parts:
  - a header, containing all the information about dimensions, attributes, and 
 variables except for the variable data;
  - a data part, comprising fixed-size data, containing the data for variables 
 that don't have an unlimited dimension; and variable-size data, containing 
 the data for variables that have an unlimited dimension.
 The NetCDFparser currently extracts the header part.  
  -- text extracts file Dimensions and Variables
  -- metadata extracts Global Attributes
 We want the option to extract the data part of NetCDF files.  
 Let's use the NetCDF test file for our dev testing:  
 tika/tika-parsers/src/test/resources/test-documents/sresa1b_ncar_ccsm3_0_run1_21.nc
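
As a rough illustration of the header part described above, the sketch below reads only the first eight bytes of a classic-model file: the magic bytes 'C' 'D' 'F', a version byte (1 for classic, 2 for 64-bit offset), and the big-endian record count that follows. It uses plain Java with no Tika or netCDF library, and the names NetcdfHeaderPeek, Variant, and Peek are hypothetical. Extracting the data part would require parsing the rest of the header first.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical helper: peeks at the start of a netCDF classic-model header. */
final class NetcdfHeaderPeek {

    /** Format variant named by the fourth magic byte. */
    enum Variant { CLASSIC, OFFSET_64BIT }

    /** Result of the peek: format variant and number of records. */
    static final class Peek {
        final Variant variant;
        final long numRecords;

        Peek(Variant variant, long numRecords) {
            this.variant = variant;
            this.numRecords = numRecords;
        }
    }

    static Peek peek(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(in);
        byte[] magic = new byte[4];
        data.readFully(magic);
        if (magic[0] != 'C' || magic[1] != 'D' || magic[2] != 'F') {
            throw new IOException("not a netCDF classic-model file");
        }
        Variant variant;
        switch (magic[3]) {
            case 1: variant = Variant.CLASSIC; break;
            case 2: variant = Variant.OFFSET_64BIT; break;
            default: throw new IOException("unsupported version byte: " + magic[3]);
        }
        // numrecs is a 4-byte big-endian value; mask to read it as unsigned
        long numRecords = data.readInt() & 0xFFFFFFFFL;
        return new Peek(variant, numRecords);
    }

    public static void main(String[] args) throws IOException {
        // Synthetic 8-byte prefix, not the real test document
        byte[] sample = {'C', 'D', 'F', 1, 0, 0, 0, 3};
        Peek p = peek(new ByteArrayInputStream(sample));
        System.out.println(p.variant + " records=" + p.numRecords);
    }
}
```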
  





Re: Enabling CORS

2015-03-27 Thread Andriy Redko

Hi Tyler,

You need to add CrossOriginResourceSharingFilter to the list of providers for 
your server.
Please let me know if you need help with that; the way to add it depends on 
how you configure the server.

Best Regards, 
Andriy Redko

TP Hi Folks,

TP I'm trying to enable CORS on a few of Tika's Server resources. But, after
TP adding the pom.xml dependency and a

TP @CrossOriginResourceSharing(
TP allowOrigins = {url}
TP )

TP annotation to the resources, the Access-Control-Allow-Origin header is
TP still not given.

TP Is there another configuration I need to add? Tika's server doesn't
TP currently have a bean configuration like at the bottom of the examples page
TP http://cxf.apache.org/docs/jax-rs-cors.html#JAX-RSCORS-Examples.

TP Thanks for any help,
TP Tyler



Re: Enabling CORS

2015-03-27 Thread Tyler Palsulich
Hi,

That worked! Thank you! I'll let you know if I have any more issues.

Tyler

On Fri, Mar 27, 2015 at 5:00 PM, Andriy Redko drr...@gmail.com wrote:

  Hi Tyler,

 You need to add CrossOriginResourceSharingFilter to the list of providers
 for your server.
 Please let me know if you need help with that; the way to add it depends
 on how you configure the server.

 Best Regards,
 Andriy Redko


 TP Hi Folks,

 TP I'm trying to enable CORS on a few of Tika's Server resources. But, after
 TP adding the pom.xml dependency and a

 TP @CrossOriginResourceSharing(
 TP allowOrigins = {url}
 TP )

 TP annotation to the resources, the Access-Control-Allow-Origin header is
 TP still not given.

 TP Is there another configuration I need to add? Tika's server doesn't
 TP currently have a bean configuration like at the bottom of the examples page
 TP http://cxf.apache.org/docs/jax-rs-cors.html#JAX-RSCORS-Examples.

 TP Thanks for any help,
 TP Tyler



Re: GSOC RDF Microformats Support

2015-03-27 Thread Mattmann, Chris A (3980)
Hi Remzi,

Sure!

Check out this 5-minute guide to writing a parser in Tika:

https://tika.apache.org/1.7/parser_guide.html


OK, so then check out Any23:

http://any23.apache.org/

It has support for parsing RDF Microformats. So, you
may want to create a MicroformatsParser in Tika; then
if it’s supported in Tika, it will in turn be available
in Nutch and its parse-tika plugin if you upgrade it to
the latest version of Tika.

You can see how to do this here:

http://s.apache.org/fsY

Cheers and best of luck - hope that’s enough to get
your proposal kicked off.

Cheers,
Chris

-Original Message-
From: Remzi Düzağaç remz...@gmail.com
Reply-To: d...@nutch.apache.org d...@nutch.apache.org
Date: Friday, March 27, 2015 at 7:22 AM
To: dev d...@nutch.apache.org
Cc: dev@tika.apache.org dev@tika.apache.org, d...@any23.apache.org
d...@any23.apache.org
Subject: Re: GSOC RDF Microformats Support

Hi Chris,


Thanks for your feedback.
I was planning to use Any23 and Tika, but I don't have a detailed grasp of
either project. I guess I'm going to need to dive into both.


I would appreciate it if you could guide me.


thanks

On Fri, Mar 27, 2015 at 4:07 PM, Mattmann, Chris A (3980)
chris.a.mattm...@jpl.nasa.gov wrote:

Hi Remzi - thanks! You may want to consider this as a Tika or
Any23 project since Nutch delegates its parsing to Tika (and
Any23 uses Tika [and vice versa] to handle micro formats).

Cheers,
Chris

-Original Message-
From: Remzi Düzağaç remz...@gmail.com
Reply-To: d...@nutch.apache.org d...@nutch.apache.org
Date: Friday, March 27, 2015 at 5:07 AM
To: d...@nutch.apache.org d...@nutch.apache.org
Subject: GSOC RDF Microformats Support

Hi Guys,


I have sent a proposal to GSoC. I would like to add RDF microformat
support to Nutch. I kindly ask for your support. Is anyone willing to
volunteer as my mentor on this topic?


Thank you very much

FW: [DEADLINE] Google Summer of Code Deadline Approaching Soon

2015-03-27 Thread Mattmann, Chris A (3980)


-Original Message-
From: Lewis John Mcgibbney lewis.mcgibb...@gmail.com
Reply-To: d...@any23.apache.org d...@any23.apache.org
Date: Wednesday, March 25, 2015 at 9:35 PM
To: u...@nutch.apache.org u...@nutch.apache.org,
d...@nutch.apache.org d...@nutch.apache.org, u...@gora.apache.org
u...@gora.apache.org, d...@gora.apache.org d...@gora.apache.org,
u...@any23.apache.org u...@any23.apache.org, d...@any23.apache.org
d...@any23.apache.org, u...@oodt.apache.org u...@oodt.apache.org,
d...@oodt.apache.org d...@oodt.apache.org
Subject: [DEADLINE] Google Summer of Code Deadline Approaching Soon

Hi All,
The deadline for this year's GSoC student submissions is approaching fast,
and I would be very keen to see more proposals from the communities above.
I've been involved on and off with several students from across all of the
above communities, hence the reason I am emailing these lists.
I would strongly suggest that any students still planning on submitting
get their submissions in ASAP.
Thanks
Lewis


-- 
*Lewis*