[jira] [Commented] (TIKA-1335) mime type for CSV files incorrectly detected as text/plain
[ https://issues.apache.org/jira/browse/TIKA-1335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14031681#comment-14031681 ]

Hudson commented on TIKA-1335:
------------------------------

SUCCESS: Integrated in tika-trunk-jdk1.6 #46 (See [https://builds.apache.org/job/tika-trunk-jdk1.6/46/])
Update docs for TIKA-1335 TIKA-1336. (mattmann: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1602618)
* /tika/trunk/CHANGES.txt

> mime type for CSV files incorrectly detected as text/plain
> ----------------------------------------------------------
>
>                 Key: TIKA-1335
>                 URL: https://issues.apache.org/jira/browse/TIKA-1335
>             Project: Tika
>          Issue Type: Bug
>          Components: mime
>    Affects Versions: 1.5, 1.6
>            Reporter: Kaijian Xu
>            Assignee: Chris A. Mattmann
>             Fix For: 1.6
>
>         Attachments: CDEC_WEATHER_2010_03_02, foo.csv, velocity.csv
>
>
> Mime type autodetection returns "text/plain" for CSV files, for example:
> % tika -m foo.csv
> Content-Encoding: ISO-8859-1
> Content-Length: 78
> Content-Type: text/plain; charset=ISO-8859-1
> resourceName: foo.csv
> This occurs regardless of whether the filename has the appropriate *.csv
> extension or not.

--
This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (TIKA-1336) Provide a Detector JAXRS endpoint
[ https://issues.apache.org/jira/browse/TIKA-1336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14031680#comment-14031680 ]

Hudson commented on TIKA-1336:
------------------------------

SUCCESS: Integrated in tika-trunk-jdk1.6 #46 (See [https://builds.apache.org/job/tika-trunk-jdk1.6/46/])
Update docs for TIKA-1335 TIKA-1336. (mattmann: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1602618)
* /tika/trunk/CHANGES.txt

> Provide a Detector JAXRS endpoint
> ---------------------------------
>
>                 Key: TIKA-1336
>                 URL: https://issues.apache.org/jira/browse/TIKA-1336
>             Project: Tika
>          Issue Type: Improvement
>          Components: detector, server
>    Affects Versions: 1.5
>            Reporter: Nick Burch
>            Assignee: Chris A. Mattmann
>             Fix For: 1.6
>
>
> As identified in TIKA-1335, the Tika Server now has an endpoint which will
> tell you what Detectors are available to it, but not one that will trigger
> detection. That means your only way to do detection is to request the
> metadata, and check the content type, but that isn't always as accurate as an
> explicit detection call (eg if a general parser picks up the file)
> We should therefore add in a new endpoint that just does the detection
[jira] [Commented] (TIKA-1336) Provide a Detector JAXRS endpoint
[ https://issues.apache.org/jira/browse/TIKA-1336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14031673#comment-14031673 ]

Hudson commented on TIKA-1336:
------------------------------

SUCCESS: Integrated in tika-trunk-jdk1.7 #46 (See [https://builds.apache.org/job/tika-trunk-jdk1.7/46/])
Update docs for TIKA-1335 TIKA-1336. (mattmann: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1602618)
* /tika/trunk/CHANGES.txt
- fix for TIKA-1336 This closes #10 (mattmann: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1602617)
* /tika/trunk/tika-server/README
* /tika/trunk/tika-server/src/main/java/org/apache/tika/server/DetectorResource.java
* /tika/trunk/tika-server/src/main/java/org/apache/tika/server/TikaServerCli.java
* /tika/trunk/tika-server/src/test/java/org/apache/tika/server/DetectorResourceTest.java
* /tika/trunk/tika-server/src/test/resources/CDEC_WEATHER_2010_03_02
* /tika/trunk/tika-server/src/test/resources/foo.csv
[jira] [Commented] (TIKA-1335) mime type for CSV files incorrectly detected as text/plain
[ https://issues.apache.org/jira/browse/TIKA-1335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14031674#comment-14031674 ]

Hudson commented on TIKA-1335:
------------------------------

SUCCESS: Integrated in tika-trunk-jdk1.7 #46 (See [https://builds.apache.org/job/tika-trunk-jdk1.7/46/])
Update docs for TIKA-1335 TIKA-1336. (mattmann: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1602618)
* /tika/trunk/CHANGES.txt
[jira] [Commented] (TIKA-1336) Provide a Detector JAXRS endpoint
[ https://issues.apache.org/jira/browse/TIKA-1336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14031662#comment-14031662 ]

Hudson commented on TIKA-1336:
------------------------------

SUCCESS: Integrated in tika-trunk-jdk1.6 #45 (See [https://builds.apache.org/job/tika-trunk-jdk1.6/45/])
- fix for TIKA-1336 This closes #10 (mattmann: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1602617)
* /tika/trunk/tika-server/README
* /tika/trunk/tika-server/src/main/java/org/apache/tika/server/DetectorResource.java
* /tika/trunk/tika-server/src/main/java/org/apache/tika/server/TikaServerCli.java
* /tika/trunk/tika-server/src/test/java/org/apache/tika/server/DetectorResourceTest.java
* /tika/trunk/tika-server/src/test/resources/CDEC_WEATHER_2010_03_02
* /tika/trunk/tika-server/src/test/resources/foo.csv
[jira] [Commented] (TIKA-1336) Provide a Detector JAXRS endpoint
[ https://issues.apache.org/jira/browse/TIKA-1336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14031661#comment-14031661 ]

ASF GitHub Bot commented on TIKA-1336:
--------------------------------------

Github user asfgit closed the pull request at:

    https://github.com/apache/tika/pull/10
[GitHub] tika pull request: Fix for TIKA-1336: initial working detect strea...
Github user asfgit closed the pull request at:

    https://github.com/apache/tika/pull/10

---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---
[jira] [Resolved] (TIKA-1335) mime type for CSV files incorrectly detected as text/plain
[ https://issues.apache.org/jira/browse/TIKA-1335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann resolved TIKA-1335.
-------------------------------------
       Resolution: Fixed
    Fix Version/s: 1.6

- per issue comments
[jira] [Resolved] (TIKA-1336) Provide a Detector JAXRS endpoint
[ https://issues.apache.org/jira/browse/TIKA-1336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann resolved TIKA-1336.
-------------------------------------
       Resolution: Fixed
    Fix Version/s: 1.6

Committed an initial version of the JAX-RS detect interface in r1602617 from GitHub PR #10. Also updated docs on wiki and in README file.

You have to give it a hint on CSVs by providing the filename in the Content-Disposition header, but maybe later we can improve our media type detection more on CSV files to differentiate them from text/plain.

Thanks to [~gagravarr] and [~kxu] for motivation in getting this done.
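As a rough illustration of the filename hint described in the resolution above, the following Python sketch builds (but does not send) a request to the detect resource. The host and port, the `/detect/stream` path, and the `build_detect_request` helper are assumptions that are not stated in this thread; the TikaJAXRS wiki page and the tika-server README remain the authoritative description of the resource.

```python
import urllib.request


def build_detect_request(data, filename, base_url="http://localhost:9998"):
    """Build a PUT request carrying the document bytes plus a filename hint.

    ASSUMPTION: the "/detect/stream" path and the default base_url are
    illustrative only; confirm them against the Tika server docs before use.
    """
    request = urllib.request.Request(
        base_url + "/detect/stream",
        data=data,
        method="PUT",
    )
    # The filename travels in Content-Disposition, as the thread describes.
    request.add_header("Content-Disposition", "attachment; filename=" + filename)
    return request


request = build_detect_request(b"a,b,c\n1,2,3\n", "foo.csv")
# To actually send it (needs a running server):
#   with urllib.request.urlopen(request) as response:
#       print(response.read().decode("utf-8"))
```

Building the request separately from sending it keeps the sketch runnable without a server and makes the header easy to inspect.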
[jira] [Commented] (TIKA-1335) mime type for CSV files incorrectly detected as text/plain
[ https://issues.apache.org/jira/browse/TIKA-1335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14031653#comment-14031653 ]

Chris A. Mattmann commented on TIKA-1335:
-----------------------------------------

[~kxu] see the update I committed in TIKA-1336 - and the docs on the wiki for JaxRS:

https://wiki.apache.org/tika/TikaJAXRS

I think this should take care of your detections for now, so long as you provide the filename hint or trick in the Content-Disposition header.

I'm marking this as resolved for now, please open up a new more specific issue if this doesn't deal with your fix.
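One way to picture why the filename hint helps: content sniffing on its own can only say "text/plain", and the name supplies the more specific subtype. The Python sketch below is a minimal illustration under stated assumptions; the `EXTENSION_HINTS` table and the `refine_with_filename` function are hypothetical, and this is not Tika's actual detection code.

```python
import os

# ASSUMPTION: a deliberately tiny, hypothetical extension table.
# A real detector ships a far larger media type registry.
EXTENSION_HINTS = {
    ".csv": "text/csv",
    ".tsv": "text/tab-separated-values",
    ".txt": "text/plain",
}


def refine_with_filename(content_type, filename):
    """Prefer a filename-derived text subtype when sniffing only says text/plain."""
    if not filename:
        return content_type
    extension = os.path.splitext(filename)[1].lower()
    hinted = EXTENSION_HINTS.get(extension)
    # Only specialise generic text; never override a non-text content result.
    if content_type == "text/plain" and hinted and hinted.startswith("text/"):
        return hinted
    return content_type


print(refine_with_filename("text/plain", "foo.csv"))
print(refine_with_filename("text/plain", "CDEC_WEATHER_2010_03_02"))
```

In this sketch a file with no extension, such as the CDEC_WEATHER_2010_03_02 attachment, stays text/plain because the name carries no hint.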
[jira] [Commented] (TIKA-1336) Provide a Detector JAXRS endpoint
[ https://issues.apache.org/jira/browse/TIKA-1336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14031649#comment-14031649 ]

Chris A. Mattmann commented on TIKA-1336:
-----------------------------------------

Docs updated for the resource in:

https://wiki.apache.org/tika/TikaJAXRS
[jira] [Commented] (TIKA-1336) Provide a Detector JAXRS endpoint
[ https://issues.apache.org/jira/browse/TIKA-1336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14031647#comment-14031647 ]

ASF GitHub Bot commented on TIKA-1336:
--------------------------------------

GitHub user chrismattmann opened a pull request:

    https://github.com/apache/tika/pull/10

    Fix for TIKA-1336: initial working detect stream interface, along with u...

    ...nit tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/chrismattmann/tika TIKA-1336

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/tika/pull/10.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #10

commit b315a1797f2ffb0f83abdcff051d110facf2a128
Author: Chris Mattmann
Date:   2014-06-14T18:25:21Z

    Fix for TIKA-1336: initial working detect stream interface, along with unit tests.
[GitHub] tika pull request: Fix for TIKA-1336: initial working detect strea...
GitHub user chrismattmann opened a pull request:

    https://github.com/apache/tika/pull/10

    Fix for TIKA-1336: initial working detect stream interface, along with u...

    ...nit tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/chrismattmann/tika TIKA-1336

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/tika/pull/10.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #10

commit b315a1797f2ffb0f83abdcff051d110facf2a128
Author: Chris Mattmann
Date:   2014-06-14T18:25:21Z

    Fix for TIKA-1336: initial working detect stream interface, along with unit tests.
[jira] [Commented] (TIKA-1335) mime type for CSV files incorrectly detected as text/plain
[ https://issues.apache.org/jira/browse/TIKA-1335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14031614#comment-14031614 ]

Chris A. Mattmann commented on TIKA-1335:
-----------------------------------------

well, I'll try and make some progress, either way.
[jira] [Commented] (TIKA-1335) mime type for CSV files incorrectly detected as text/plain
[ https://issues.apache.org/jira/browse/TIKA-1335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14031492#comment-14031492 ]

Nick Burch commented on TIKA-1335:
----------------------------------

I don't think we're going to be able to write mime matchers that reliably detect CSV, not least because there's so many variants of it (tab? comma? quoted? double quoted? escaped?)
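The point about CSV variants can be made concrete with a deliberately naive content heuristic. The Python sketch below is only an illustration: `guess_delimiter` and its candidate list are hypothetical and are not how Tika matches media types. It accepts a delimiter when every non-empty line contains the same non-zero number of occurrences, then shows one false negative (a quoted field containing a comma) and one false positive (ordinary prose).

```python
def guess_delimiter(text, candidates=(",", "\t", ";", "|")):
    """Return the first candidate that occurs equally often on every line.

    Naive on purpose: it ignores quoting and escaping entirely.
    """
    lines = [line for line in text.splitlines() if line.strip()]
    if len(lines) < 2:
        return None  # one line is not enough evidence either way
    for delimiter in candidates:
        counts = {line.count(delimiter) for line in lines}
        # Accept only a single, non-zero count shared by all lines.
        if len(counts) == 1 and counts != {0}:
            return delimiter
    return None


comma_separated = guess_delimiter("a,b,c\n1,2,3\n")
tab_separated = guess_delimiter("a\tb\n1\t2\n")
# False negative: the quoted comma makes the per-line counts disagree.
quoted_field = guess_delimiter('"a,b",c\n1,2\n')
# False positive: two lines of prose that each happen to hold one comma.
prose = guess_delimiter("Hello, world\nGoodbye, all\n")
```

Handling quoting and escaping properly would need a real parser for each dialect, which is the difficulty the comment raises for a signature-style matcher.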