> On Jul 2, 2020, at 15:33, Andy Seaborne <a...@apache.org> wrote:
>
> There may be two things going on ... or one
>
> The WARN
>
> This is broken input data but from looking at the Jena code, I'm not seeing how an encoding error on its own can send an unset RDF_StreamRow (which is what prints as <RDF_StreamRow >). The Jena code is quite simple and by inspection I can't see a path that sends that. And in Thrift you can't send unset data anyway - the write path throws a client-side exception.
>
> So I'm wondering about the path from client to server, especially with the broken pipes earlier, which might impact a reverse proxy or tomcat itself. Maybe there is junk on the wire.
>
>
> The W-transactions
> This is TDB1 presumably.
>
> The lack of response suggests they are backing up, and that can happen in TDB1 if the database engine is blocked from writing to storage. That said, it does log if that is happening, unless I/O is blocked at the OS level.
>
>>> On 02/07/2020 18:37, Chris Tomlinson wrote:
>> Hi Andy,
>>>> On Jul 2, 2020, at 10:22 AM, Andy Seaborne <a...@apache.org> wrote:
>>> The log does not contain "Unrecognized".
>>> Is the "ThriftConvert WARN visit:" message from a different process?
>> Sorry. I should have been much clearer:
>>>> I have attached the catalina.out with the 199M or so "ThriftConvert WARN"s removed.
>
> I wasn't clear if "removed" meant stderr, another source, or not.
grep -v "ThriftConvert WARN visit: Unrecognized: <RDF_StreamRow >" catalina.out > >> The event starts at >> line 11305 "[2020-06-30 14:12:51] Fuseki INFO [120611] PUT >> http://buda1. . . .” >> with the first WARNing at 14:12:53 after the "[120611] 200 OK (2.106 s)”; >> and ends with the stacktraces resulting from “systemctl restart >> fuseki.service” at : >> line 12940 "30-Jun-2020 16:21:28.445" >> and the stacktraces end at startup: >> line 12939 "[2020-06-30 16:21:26] Fuseki INFO [121275] PUT >> http://buda1. . . . " >> I have a 55.1MB gzip’d file with all of the ~199M "ThriftConvert WARN”s. It >> is 15GB+ uncompressed. The rate of the WARNing messages is around 20K - 25K >> per second. >> If you need the entire file, where should I upload it? > > From what I can see, it (WARN) isn't database related. No it seems to me to be related to getting the payload off the wire. > >>> There are some broken pipe errors much earlier but they happen if clients >>> go away untidily. They are on query responses not PUTs. >>> The fact 120612 doesn't even print "Body:” >> I think we conclude from this that the RDFConnection became unstable >> processing the incoming payload of 7536 triples in the ontologySchema graph. >> We have significantly larger graphs that are PUT using the same pattern as I >> displayed earlier. >> The same source bits/bytes loaded w/o issue before and after this issue so >> it seems likely that there was an undetected transmission error that >> RDFConnection did not get notified of and was not able to recover from. May >> be that’s peculiar to the Thrift encoding? > > Caveat a bug in thrift, no. Either it works or it doesn't and your system > works most of the time. Yes this is a very intermittent problem > >>> You could run with RDFConnectionRemote - what RDFConnectionFuseki does is >>> switch to thrift encodings, RDFConnectionRemote uses the standards ones. >> We’ll take this under consideration. The problem occurred once at the end of >> April and now at the end of June. Not very frequent given the amount of PUTs >> that we do on a daily basis. It’s not what I would call effectively >> repeatable. >> Is there a reason that we should distrust Thrift in general? Is it somehow >> more susceptible to undetected transmission errors that the standard >> encodings? > > Not that I know of. I'm suggesting it to change the situation to get more > information. Ok. I'm not sure we'll make a change just now as we're in the middle of a bit of a crunch. > >> Do we need to consider avoiding Thrift everywhere? > > If it works for months, and it is a simple code path, suggests to me it is > working. > > Even the concurrency looks OK because it locally writes a buffer so the HTTP > length is available. So it seems to me that there may be an opportunity for some sort of robustification in RDFConnection. There is evidently a loop somewhere that doesn't terminate, retrying the parsing repeatedly or something like that. The payload is finite so there wold appear to be a test that repeatedly fails but doesn't make progress in consuming the payload. Thanks, Chris > > Andy > >> Thanks, >> Chris >>> Andy >>> On 01/07/2020 17:34, Chris Tomlinson wrote: >>>> Hi Andy, >>>>> On Jul 1, 2020, at 7:59 AM, Andy Seaborne <a...@apache.org >>>>> <mailto:a...@apache.org>> wrote: >>>>> Presumably the client is in java using RDFConnectionFactory.connectFuseki? >>>> Yes the clients are in java 1.8 also on Debian 4.9. 
>>>> Fuseki is running on one ec2 instance and some of the clients are on the same instance and others are on another ec2 instance in the same region. The clients use the following pattern:

>>>>>     fuConnBuilder = RDFConnectionFuseki.create()
>>>>>             .destination(baseUrl)
>>>>>             .queryEndpoint(baseUrl+"/query")
>>>>>             .gspEndpoint(baseUrl+"/data")
>>>>>             .updateEndpoint(baseUrl+"/update");
>>>>>     fuConn = fuConnBuilder.build();
>>>>>     if (!fuConn.isInTransaction()) {
>>>>>         fuConn.begin(ReadWrite.WRITE);
>>>>>     }
>>>>>     fuConn.put(graphName, model);
>>>>>     fuConn.commit();

>>>>> Do you have the data from 120611?

>>>> The data for the PUT of 120611 can be retrieved via http://purl.bdrc.io/graph/PersonUIShapes.ttl. The source is person.ui.shapes.ttl <https://github.com/buda-base/editor-templates/blob/master/templates/core/person.ui.shapes.ttl> which is loaded via OntDocumentManager w/ setProcessImports(true) from GH.
>>>> From the log, 120611 appears to have completed successfully with 839 triples (the expected number); however, since then there have been several reloads of that graph during restarts and so on - we're in development - so the exact bits in the 120611 PUT are not specifically available. This particular data has not changed in the past couple of weeks.
>>>> As for 120612, it never completed. The source data has not had any changes since 2020-06-29, both prior to the "failure" and after, and there have been numerous reloads w/o issue before and since. Only during the period 14:12 - 16:21 do we see the PUTs hang.

>>>>> Could the request have got truncated?

>>>> Referring to 120611, I don't see any evidence of truncation. The expected number of triples, 839, is reported in the log.

>>>>> The fact later PUTs stall suggests (does not prove) that the earlier operation broke somehow, but I'd expect a stacktrace in the log.

>>>> There may be a stacktrace at the very end when I restarted fuseki. There were quite a few threads aborted w/ stacktraces, but I don't know how to interpret which might have been associated with 120612, which appears to have never completed receiving ontologySchema, since no
>>>>     Fuseki INFO [120612] Body: . . .
>>>> appears in the log. There are another 13 occurrences of PUTs which hang during the period 14:12 - 16:21.
>>>> I have attached the catalina.out with the 199M or so "ThriftConvert WARN"s removed. It shows the stacktraces resulting from the "systemctl restart fuseki.service" at 30-Jun-2020 16:21:28.445 line# 12940 to the startup at 30-Jun-2020 16:21:34.866 line# 18856.
>>>> The log also shows that just the PUTs hung; GETs and POSTs (of queries) continued to be processed, perhaps a little less rapidly than usual.

>>>>> Overlapping should be fine - as they are the same database, one will happen before the other even though they are separate graphs.

>>>> As we expected.
>>>> It seems that this broken behavior is an unusual occurrence.
>>>> Thank you for your help. It's rather perplexing.
>>>> Chris

>>>>> Andy
>>>>> On 30/06/2020 22:29, Chris Tomlinson wrote:
>>>>>> Hello,
>>>>>> Running jena/fuseki 3.14.0, compiled and running under openjdk version "1.8.0_252".
>>>>>> Just some hours ago fuseki running on an AWS ec2 for almost a year, logged around 199M occurrences of (times are UTC):

>>>>>>> [2020-06-30 14:12:50] Fuseki INFO [120610] 200 OK (8 ms)
>>>>>>> [2020-06-30 14:12:51] Fuseki INFO [120611] PUT http://buda1.bdrc.io:13180/fuseki/corerw/data?graph=http://purl.bdrc.io/graph/PersonUIShapes
>>>>>>> [2020-06-30 14:12:52] Fuseki INFO [120612] PUT http://buda1.bdrc.io:13180/fuseki/corerw/data?graph=http://purl.bdrc.io/graph/ontologySchema
>>>>>>> [2020-06-30 14:12:53] Fuseki INFO [120611] Body: Content-Length=73984, Content-Type=application/rdf+thrift, Charset=null => RDF-THRIFT : Count=839 Triples=839 Quads=0
>>>>>>> [2020-06-30 14:12:53] Fuseki INFO [120611] 200 OK (2.106 s)
>>>>>>> [2020-06-30 14:12:57] ThriftConvert WARN visit: Unrecognized: <RDF_StreamRow >
>>>>>>> [2020-06-30 14:12:57] ThriftConvert WARN visit: Unrecognized: <RDF_StreamRow >
>>>>>>> [2020-06-30 14:12:57] ThriftConvert WARN visit: Unrecognized: <RDF_StreamRow >
>>>>>>> [2020-06-30 14:12:57] ThriftConvert WARN visit: Unrecognized: <RDF_StreamRow >
>>>>>>> [2020-06-30 14:12:57] ThriftConvert WARN visit: Unrecognized: <RDF_StreamRow >
>>>>>> . . .
>>>>>>> [2020-06-30 16:21:33] ThriftConvert WARN visit: Unrecognized: <RDF_StreamRow >
>>>>>>> [2020-06-30 16:21:33] ThriftConvert WARN visit: Unrecognized: <RDF_StreamRow >
>>>>>>> [2020-06-30 16:21:33] ThriftConvert WARN visit: Unrecognized: <RDF_StreamRow >
>>>>>>> [2020-06-30 16:21:33] ThriftConvert WARN visit: Unrecognized: <RDF_StreamRow >
>>>>>>> [2020-06-30 16:21:33] ThriftConvert WARN visit: Unrecognized: <RDF_StreamRow >
>>>>>>> [2020-06-30 16:21:33] ThriftConvert WARN visit: Unrecognized: <RDF_StreamRow >
>>>>>>> [2020-06-30 16:21:33] ThriftConvert WARN visit: Unrecognized: <RDF_StreamRow >
>>>>>>> [2020-06-30 16:21:33] ThriftConvert WARN visit: Unrecognized: <RDF_StreamRow >
>>>>>>> [2020-06-30 16:21:33] ThriftConvert WARN visit: Unrecognized: <RDF_StreamRow >
>>>>>>> [2020-06-30 16:21:33] ThriftConvert WARN visit: Unrecognized: <RDF_StreamRow >
>>>>>>> [2020-06-30 16:21:33] ThriftConvert WARN visit: Unrecognized: <RDF_StreamRow >
>>>>>>> [2020-06-30 16:21:33] ThriftConvert WARN visit: Unrecognized: <RDF_StreamRow >
>>>>>>> 30-Jun-2020 16:21:34.866 INFO [main] org.apache.catalina.startup.VersionLoggerListener.log Server version: Apache Tomcat/8.0.53
>>>>>>> 30-Jun-2020 16:21:34.868 INFO [main] org.apache.catalina.startup.VersionLoggerListener.log Server built: Jun 29 2018 14:42:45 UTC
>>>>>>> 30-Jun-2020 16:21:34.868 INFO [main] org.apache.catalina.startup.VersionLoggerListener.log Server number: 8.0.53.0
>>>>>>> 30-Jun-2020 16:21:34.868 INFO [main] org.apache.catalina.startup.VersionLoggerListener.log OS Name: Linux
>>>>>>> 30-Jun-2020 16:21:34.868 INFO [main] org.apache.catalina.startup.VersionLoggerListener.log OS Version: 4.9.0-8-amd64
>>>>>>> 30-Jun-2020 16:21:34.868 INFO [main] org.apache.catalina.startup.VersionLoggerListener.log Architecture: amd64
>>>>>>> 30-Jun-2020 16:21:34.869 INFO [main] org.apache.catalina.startup.VersionLoggerListener.log Java Home: /usr/lib/jvm/java-8-openjdk-amd64/jre
>>>>>>> 30-Jun-2020 16:21:34.869 INFO [main] org.apache.catalina.startup.VersionLoggerListener.log JVM Version:
>>>>>>> 1.8.0_252-8u252-b09-1~deb9u1-b09
>>>>>>> 30-Jun-2020 16:21:34.869 INFO [main] org.apache.catalina.startup.VersionLoggerListener.log JVM Vendor: Oracle Corporation
>>>>>>> 30-Jun-2020 16:21:34.869 INFO [main] org.apache.catalina.startup.VersionLoggerListener.log CATALINA_BASE: /usr/local/fuseki/tomcat

>>>>>> Until I saw the issue and restarted fuseki.
>>>>>> An interesting thing I see in the trace is the overlapping PUTs just as fuseki wigged out. There were no changes in the code or the graphs that were being PUT as part of a restart of an app server.
>>>>>> Fuseki didn't go completely unresponsive, except that later PUTs were indefinitely delayed - which was how I noticed the issue.
>>>>>> We saw this once before, on 29 April, and didn't delve into it at the time.
>>>>>> Has anyone seen this sort of thing before?
>>>>>> Thanks,
>>>>>> Chris
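
P.S. The local Thrift round-trip check mentioned above, roughly (just a sketch in a throwaway class; it only exercises the RDF Thrift writer/reader in one JVM, not the HTTP path or tomcat, and person.ui.shapes.ttl stands in for any of the graphs we PUT):

    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.rdf.model.ModelFactory;
    import org.apache.jena.riot.Lang;
    import org.apache.jena.riot.RDFDataMgr;

    public class ThriftRoundTrip {
        public static void main(String[] args) {
            // load the Turtle source, encode it as RDF Thrift, decode it again,
            // and check the result is isomorphic with what we started from
            Model m = RDFDataMgr.loadModel("person.ui.shapes.ttl");
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            RDFDataMgr.write(out, m, Lang.RDFTHRIFT);
            Model back = ModelFactory.createDefaultModel();
            RDFDataMgr.read(back, new ByteArrayInputStream(out.toByteArray()), Lang.RDFTHRIFT);
            System.out.println("round-trip isomorphic: " + back.isIsomorphicWith(m));
        }
    }

If that stays true across our graphs it would at least rule out the encoder/decoder themselves and point back at whatever is putting junk on the wire.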