[jira] [Commented] (TIKA-1196) JAX-RS server only responds to queries to/from http://localhost
[ https://issues.apache.org/jira/browse/TIKA-1196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13825254#comment-13825254 ]

Sergey Beryozkin commented on TIKA-1196:
---

Rian, thanks for the patch. I'd prefer adding only a 'host' option and keeping the default at 'localhost', as Nick also suggested. You are right that it does not make much sense when clients are not collocated with the server, but in those cases we will most likely also have to care about secure HTTPS. Making sure the server can run in secure mode is a separate issue IMHO (it can be done by configuring the CXF Jetty connectors, or by supporting WAR deployments with the container taking care of HTTPS).

Thanks, Sergey

JAX-RS server only responds to queries to/from http://localhost
---

Key: TIKA-1196
URL: https://issues.apache.org/jira/browse/TIKA-1196
Project: Tika
Issue Type: Bug
Components: server
Affects Versions: 1.4
Environment: Mac OS X, Windows Server 2008
Reporter: Rian Stockbower
Priority: Minor
Labels: JAXRS, hostname, web-service
Attachments: tika-1196.patch, tika-1196b.patch, tika-1196c.patch

I'm not sure if this is a problem with the Tika JAX-RS server, or with how it uses CXF under the hood. Anyway: I have a large text extraction job (10-15 million documents) that I'm using the web service for. It would be nice to be able to distribute this horizontally across multiple nodes to speed up the processing. I had thought to have a job queue with a couple of consumers, farming out PUT requests across several Tika web service endpoints.

But the JAX-RS web service will only respond to queries made to {{http://localhost:9998/tika}}. I can't call {{http://hostname:9998/tika}} -- even if it's still a local operation.

Here is a list of things I've tried:
* I changed line 89 of TikaServerCLI.java to compute the name of the host at runtime. No go: the server starts up, and immediately terminates.
* I changed line 89 of TikaServerCLI.java to be a hostname (not a FQDN), and re-compiled:
** {{mvn compile -rf :tika-server}} compiles successfully. Start up the server, and it terminates, just like when I tried to compute the hostname at runtime.
** {{mvn install}} from the topmost Tika directory gets the service responding to both {{http://hostname:9998/tika}} and {{http://hostname.domain.net:9998/tika}}. (Seemed weird, this is why I was thinking it was further up the chain in CXF?)

In a perfect world:
# The server should respond to any valid calls that make sense:
#* 127.0.0.1
#* localhost
#* hostname
#* host.domain.tld
#* ip_address
# A {{hostname}} invocation parameter could be used to limit what the service responds to when it's started up. (A very optional, nice-to-have.)

--
This message was sent by Atlassian JIRA (v6.1#6144)
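The 'host' option with a 'localhost' default that Sergey prefers could be sketched roughly as follows. This is only an illustration, not the code from the attached patches or from TikaServerCLI.java: the class name, the {{--host}} flag spelling, and both helper methods are hypothetical. Only the port 9998 comes from the issue description.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of a host option; not the actual Tika server code. */
final class HostOptionSketch {
    static final String DEFAULT_HOST = "localhost"; // default kept as discussed in the thread
    static final int DEFAULT_PORT = 9998;           // port used in the issue description

    /** Returns the value following "--host" (assumed flag name), or the default when absent. */
    static String resolveHost(String[] args) {
        List<String> list = Arrays.asList(args);
        int i = list.indexOf("--host");
        if (i >= 0 && i + 1 < list.size()) {
            return list.get(i + 1);
        }
        return DEFAULT_HOST;
    }

    /** Builds the base address that the server would be asked to listen on. */
    static String endpointAddress(String host, int port) {
        return "http://" + host + ":" + port + "/";
    }

    public static void main(String[] args) {
        System.out.println(endpointAddress(resolveHost(args), DEFAULT_PORT));
    }
}
```

With no arguments the sketch keeps the loopback-only behaviour; a remote deployment would have to pass a host explicitly, which is the trade-off under discussion.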
[jira] [Commented] (TIKA-1196) JAX-RS server only responds to queries to/from http://localhost
[ https://issues.apache.org/jira/browse/TIKA-1196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13825332#comment-13825332 ]

Rian Stockbower commented on TIKA-1196:
---

I can put it back to localhost, but I'm not sure _why_ that's desirable. (Other than that's the way it was.) What's the reasoning behind having it limited to loopback addresses by default?

This is not the behavior I would expect as a user. As a user, I would expect it to work like a web service: it does something when I make a semantically valid call to it.

From an operational perspective, there's some added complexity as well: when I deploy this to N nodes, I'll have to have my invocation script compute the local hostname before launching the service. Admittedly this is a small problem, but I don't see why it needs to be a problem at all.

What am I missing here?
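For reference, the hostname computation Rian describes can be done with the JDK alone. The following is a minimal sketch under assumptions: the class and method names are made up for illustration, and the fallback to "localhost" when resolution fails is a choice made here, not something the thread specifies.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

/** Hypothetical helper; the name and fallback behaviour are assumptions. */
final class LocalHostNameSketch {
    /** Returns this machine's host name, or "localhost" if it cannot be resolved. */
    static String localHostName() {
        try {
            return InetAddress.getLocalHost().getHostName();
        } catch (UnknownHostException e) {
            return "localhost";
        }
    }

    public static void main(String[] args) {
        // Port 9998 and the /tika path are taken from the issue description.
        System.out.println("http://" + localHostName() + ":9998/tika");
    }
}
```

What {{getHostName()}} returns depends on the machine's resolver configuration, so the printed URL may carry a short name or a fully qualified one.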
[jira] [Commented] (TIKA-1196) JAX-RS server only responds to queries to/from http://localhost
[ https://issues.apache.org/jira/browse/TIKA-1196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13825357#comment-13825357 ]

Sergey Beryozkin commented on TIKA-1196:
---

IMHO what needs to be decided is which matters more for Tika Server: supporting all the possible host variations out of the box, or expecting users to do more work when the server is accessed remotely. If security is not an issue for the Server, then it does not make much sense to keep localhost as the default; but if it is, then opening it up completely by default does not seem right. It would seem reasonable to me for users to do more work in such cases, with the host calculation requiring the least effort :-), and with setting up the server certificates taking the most.
[jira] [Comment Edited] (TIKA-1196) JAX-RS server only responds to queries to/from http://localhost
[ https://issues.apache.org/jira/browse/TIKA-1196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13825357#comment-13825357 ]

Sergey Beryozkin edited comment on TIKA-1196 at 11/18/13 2:43 PM:
---

IMHO what needs to be decided is which matters more for Tika Server: supporting all the possible host variations out of the box, or expecting users to do more work when the server is accessed remotely. If security is not an issue for the Server, then it does not make much sense to keep localhost as the default; but if it is, then opening it up completely by default does not seem right. It would seem reasonable to me for users to actually have to do more work in such cases, with the host calculation requiring the least effort :-), and with setting up the server certificates taking the most.
[jira] [Commented] (TIKA-1196) JAX-RS server only responds to queries to/from http://localhost
[ https://issues.apache.org/jira/browse/TIKA-1196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13825370#comment-13825370 ]

Rian Stockbower commented on TIKA-1196:
---

Those are more or less my thoughts. I'll solicit comments from the Tika users mailing list.
[jira] [Commented] (TIKA-1193) Allow access to HtmlParser's HtmlSchema
[ https://issues.apache.org/jira/browse/TIKA-1193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13825823#comment-13825823 ]

Markus Jelsma commented on TIKA-1193:
---

Hi, are there any objections to putting this in? I know unit tests can break if applications modify the schema incorrectly, e.g. by removing the shape attribute from anchors, but that's the responsibility of the application. Perhaps marking it as expert would be satisfactory?

Allow access to HtmlParser's HtmlSchema
---

Key: TIKA-1193
URL: https://issues.apache.org/jira/browse/TIKA-1193
Project: Tika
Issue Type: Improvement
Components: parser
Affects Versions: 1.4
Reporter: Markus Jelsma
Fix For: 1.5
Attachments: TIKA-1193-trunk.patch

TagSoup's HTMLSchema is not really well suited for HTML5, nor is it capable of correctly handling some very strange quirks, e.g. tables inside anchors. By allowing access to the schema, applications can modify it on the fly to suit their needs. This would also mean that we don't have to rely on TIKA-985 getting committed; we can change it from our own applications.
[jira] [Commented] (TIKA-1193) Allow access to HtmlParser's HtmlSchema
[ https://issues.apache.org/jira/browse/TIKA-1193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13825832#comment-13825832 ]

Jukka Zitting commented on TIKA-1193:
---

A cleaner approach would probably be to allow the caller to pass a custom schema through the ParseContext object:

{code}
// Caller side: register a custom schema in the parse context
ParseContext context = new ParseContext();
context.set(Schema.class, ...);
parser.parse(..., context);
{code}

The {{HtmlParser}} class could then get the custom schema from the context:

{code}
// HtmlParser side: fall back to the default HTML_SCHEMA when no custom schema was set
Schema schema = context.get(Schema.class, HTML_SCHEMA);
parser.setProperty(org.ccil.cowan.tagsoup.Parser.schemaProperty, schema);
{code}
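The lookup-with-default pattern in Jukka's snippets can be illustrated without any Tika or TagSoup dependency. The class below is a hypothetical stand-in for a parse context (a map keyed by class), and a plain String stands in for the schema object; it is a sketch of the idea, not Tika's ParseContext implementation.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a parse context: values are stored under their class. */
final class TypedContextSketch {
    private final Map<Class<?>, Object> values = new HashMap<>();

    /** Registers a value for the given type; a null value clears the entry. */
    <T> void set(Class<T> key, T value) {
        if (value != null) {
            values.put(key, value);
        } else {
            values.remove(key);
        }
    }

    /** Returns the registered value for the type, or the default when none was set. */
    <T> T get(Class<T> key, T defaultValue) {
        Object value = values.get(key);
        return value != null ? key.cast(value) : defaultValue;
    }

    public static void main(String[] args) {
        TypedContextSketch context = new TypedContextSketch();
        // No custom value registered yet, so the default is returned.
        System.out.println(context.get(String.class, "default schema"));
        // After the caller registers a value, the parser side sees it instead.
        context.set(String.class, "custom schema");
        System.out.println(context.get(String.class, "default schema"));
    }
}
```

The appeal of this design is that callers who never set a schema see no behaviour change, while the parser itself does not need to expose a mutable schema field.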