[jira] [Commented] (TIKA-1196) JAX-RS server only responds to queries to/from http://localhost

2013-11-18 Thread Sergey Beryozkin (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13825254#comment-13825254
 ] 

Sergey Beryozkin commented on TIKA-1196:


Rian, thanks for the patch. I'd prefer going for a 'host' option only and 
keeping the default at 'localhost', as Nick also suggested. 
You are right that it does not make much sense for cases where clients are not 
collocated, but in those cases we will most likely also have to care about 
HTTPS. Making sure the server can run in secure mode is a separate issue IMHO 
(it can be done by configuring CXF Jetty connectors, or by supporting WAR 
deployments with the containers taking care of HTTPS).
Thanks, Sergey

 JAX-RS server only responds to queries to/from http://localhost
 ---

 Key: TIKA-1196
 URL: https://issues.apache.org/jira/browse/TIKA-1196
 Project: Tika
  Issue Type: Bug
  Components: server
Affects Versions: 1.4
 Environment: Mac OS X, Windows Server 2008
Reporter: Rian Stockbower
Priority: Minor
  Labels: JAXRS, hostname, web-service
 Attachments: tika-1196.patch, tika-1196b.patch, tika-1196c.patch


 I'm not sure if this is a problem with the Tika JAX-RS server, or with how it 
 uses CXF under the hood. Anyway:
 I have a large text extraction job (10-15 million documents) that I'm using 
 the web service for. It would be nice to be able to distribute this 
 horizontally across multiple nodes to speed up the processing. I had thought 
 to have a job queue with a couple consumers, farming out PUT requests across 
 several Tika web service endpoints.
 But the JAX-RS web service will only respond to queries made to 
 {{http://localhost:9998/tika}}.
 I can't call {{http://hostname:9998/tika}} -- even if it's still a local 
 operation.
 Here is a list of things I've tried:
 * I changed line 89 of TikaServerCLI.java to compute the name of the host at 
 runtime. No go: the server starts up, and immediately terminates.
 * I changed line 89 of TikaServerCLI.java to be a hostname (not a FQDN), and 
 re-compiled:
 ** {{mvn compile -rf :tika-server}} compiles successfully. Start up the 
 server, and it terminates, just like when I tried to compute the hostname at 
 runtime
 ** {{mvn install}} from the topmost Tika directory gets the service 
 responding to both {{http://hostname:9998/tika}} and 
 {{http://hostname.domain.net:9998/tika}} (Seemed weird, this is why I was 
 thinking it was further up the chain in CXF?)
 In a perfect world:
 # The server should respond to any valid calls that make sense:
 #* 127.0.0.1
 #* localhost
 #* hostname
 #* host.domain.tld
 #* ip_address
 # A {{hostname}} invocation parameter could be used to limit what the service 
 responds to when it's started up. (A very optional, nice-to-have.)
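
One plausible reading of the behaviour described above, not confirmed anywhere in this thread, is that the listener is bound to the loopback address only. The following is a minimal sketch of that distinction using plain java.net sockets; the class and method names are hypothetical and it involves neither Tika nor CXF.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;

// Hypothetical demo class; it uses only java.net, not Tika or CXF.
class BindAddressDemo {

    // Binds a listener to the given host on an OS-chosen port and reports
    // whether the bound address is a loopback address.
    static boolean isLoopbackOnly(String host) {
        try (ServerSocket socket = new ServerSocket()) {
            socket.bind(new InetSocketAddress(host, 0)); // port 0: any free port
            return socket.getInetAddress().isLoopbackAddress();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("127.0.0.1 loopback only: " + isLoopbackOnly("127.0.0.1"));
        System.out.println("0.0.0.0 loopback only: " + isLoopbackOnly("0.0.0.0"));
    }
}
```

A listener bound to a loopback address accepts connections only from the same machine, while one bound to the wildcard address 0.0.0.0 accepts them on every local interface.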



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (TIKA-1196) JAX-RS server only responds to queries to/from http://localhost

2013-11-18 Thread Rian Stockbower (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13825332#comment-13825332
 ] 

Rian Stockbower commented on TIKA-1196:
---

I can put it back to localhost, but I'm not sure _why_ that's desirable. (Other 
than that's the way it was.) What's the reasoning behind having it limited to 
loopback addresses by default? This is not the behavior I would expect as a 
user. As a user, I would expect it to work like a web service: it does 
something when I make a semantically valid call to it.

From an operational perspective, there's some added complexity as well: when I 
deploy this to N nodes, I'll have to have my invocation script compute the 
local hostname before launching the service. Admittedly this is a small 
problem, but I don't see why it needs to be a problem at all.

What am I missing here?
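
The per-node hostname computation mentioned above could look roughly like this. It is only a sketch: the class and method names are hypothetical, the fallback to "localhost" is an assumption, and the port 9998 and /tika path are taken from the URLs in the issue description.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical helper; not part of Tika.
class LocalEndpoint {

    // Returns this machine's hostname, or "localhost" if it cannot be resolved.
    static String localHostName() {
        try {
            return InetAddress.getLocalHost().getHostName();
        } catch (UnknownHostException e) {
            return "localhost"; // assumed fallback for this sketch
        }
    }

    // Builds the endpoint URL in the form used in the issue description.
    static String endpointUrl(String host, int port) {
        return "http://" + host + ":" + port + "/tika";
    }

    public static void main(String[] args) {
        System.out.println(endpointUrl(localHostName(), 9998));
    }
}
```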



[jira] [Commented] (TIKA-1196) JAX-RS server only responds to queries to/from http://localhost

2013-11-18 Thread Sergey Beryozkin (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13825357#comment-13825357
 ] 

Sergey Beryozkin commented on TIKA-1196:


IMHO what needs to be decided is what matters more for Tika Server: supporting 
all the possible host variations out of the box, or expecting users to do more 
work when the server is accessed remotely. If security is not an issue for the 
server, then it does not make much sense to keep localhost as the default; but 
if it is, then opening it up completely by default does not seem right. It 
would seem reasonable to me for users to do more work in such cases, with the 
host calculation requiring the least effort :-) and setting up the server 
certificates requiring the most.



[jira] [Comment Edited] (TIKA-1196) JAX-RS server only responds to queries to/from http://localhost

2013-11-18 Thread Sergey Beryozkin (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13825357#comment-13825357
 ] 

Sergey Beryozkin edited comment on TIKA-1196 at 11/18/13 2:43 PM:
--

IMHO what needs to be decided is what matters more for Tika Server: supporting 
all the possible host variations out of the box, or expecting users to do more 
work when the server is accessed remotely. If security is not an issue for the 
server, then it does not make much sense to keep localhost as the default; but 
if it is, then opening it up completely by default does not seem right. It 
would seem reasonable to me for users to have to do more work in such cases, 
with the host calculation requiring the least effort :-) and setting up the 
server certificates requiring the most.





[jira] [Commented] (TIKA-1196) JAX-RS server only responds to queries to/from http://localhost

2013-11-18 Thread Rian Stockbower (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13825370#comment-13825370
 ] 

Rian Stockbower commented on TIKA-1196:
---

Those are more or less my thoughts. I'll solicit comments from the Tika users 
mailing list.



[jira] [Commented] (TIKA-1193) Allow access to HtmlParser's HtmlSchema

2013-11-18 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13825823#comment-13825823
 ] 

Markus Jelsma commented on TIKA-1193:
-

Hi, are there any objections to putting this in? I know unit tests can break if 
applications incorrectly modify the schema, e.g. by removing the shape 
attribute from anchors, but that's the responsibility of the application. 
Perhaps marking it as expert would be satisfactory?

 Allow access to HtmlParser's HtmlSchema
 ---

 Key: TIKA-1193
 URL: https://issues.apache.org/jira/browse/TIKA-1193
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.4
Reporter: Markus Jelsma
 Fix For: 1.5

 Attachments: TIKA-1193-trunk.patch


 TagSoup's HTMLSchema is not really well suited for HTML5, nor is it capable of 
 correctly handling some very strange quirks, e.g. tables inside anchors. By 
 allowing access to the schema, applications can modify the schema to suit 
 their needs on the fly.
 This would also mean that we don't have to rely on TIKA-985 getting 
 committed; we can change it from our own applications.





[jira] [Commented] (TIKA-1193) Allow access to HtmlParser's HtmlSchema

2013-11-18 Thread Jukka Zitting (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13825832#comment-13825832
 ] 

Jukka Zitting commented on TIKA-1193:
-

A cleaner approach would probably be to allow the caller to pass a custom 
schema through the ParseContext object:

{code}
ParseContext context = new ParseContext();
context.set(Schema.class, ...);
parser.parse(..., context);
{code}

The {{HtmlParser}} class could then get the custom schema from the context:

{code}
Schema schema = context.get(Schema.class, HTML_SCHEMA);
parser.setProperty(org.ccil.cowan.tagsoup.Parser.schemaProperty, schema);
{code}
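
The lookup-with-default pattern in the second snippet can be illustrated with a small stand-in. This is a sketch under stated assumptions: TypedContext is a hypothetical class written for this example and is not Tika's ParseContext; it only mirrors the set(Class, value) and get(Class, default) calls shown above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a ParseContext-style container; not Tika's class.
class TypedContext {
    private final Map<Class<?>, Object> entries = new HashMap<>();

    // Registers a value under its type.
    <T> void set(Class<T> key, T value) {
        entries.put(key, value);
    }

    // Returns the registered value for the type, or the default if none is set.
    <T> T get(Class<T> key, T defaultValue) {
        Object value = entries.get(key);
        return value != null ? key.cast(value) : defaultValue;
    }

    public static void main(String[] args) {
        TypedContext context = new TypedContext();
        System.out.println(context.get(String.class, "built-in schema"));
        context.set(String.class, "custom schema");
        System.out.println(context.get(String.class, "built-in schema"));
    }
}
```

With this shape, callers that never set a schema keep the built-in default, and only callers that opt in see different behaviour.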
