Arrow moved to using a custom namespace in v10.0.0.
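For reference, the vendoring pattern (a simplified sketch from my reading
of the headers, not the exact Arrow patch) is to put the whole vendored
copy inside a private namespace and alias it back, so every mangled symbol
gets a distinct prefix and can no longer collide with another library's
copy at link time:

    // Illustrative sketch of the vendored-namespace pattern.
    namespace arrow_vendored_private {
    namespace flatbuffers {
    // ... the entire vendored flatbuffers implementation lives in here,
    // so e.g. vector_downward mangles as
    // arrow_vendored_private::flatbuffers::vector_downward ...
    class vector_downward { /* ... */ };
    }  // namespace flatbuffers
    }  // namespace arrow_vendored_private

    // A namespace alias keeps calling code source-compatible:
    namespace flatbuffers = arrow_vendored_private::flatbuffers;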
On Sun, Feb 25, 2024 at 5:26 PM Simon Eves <simon.e...@heavy.ai> wrote:

> Ooh, good call!
>
> That also corresponds with what I just tried, which was to leave the
> change in, but have the *size()* method return a value derived the old
> way instead of just returning *size_*, and also compare the two and log
> any mismatch. This also fails, which would seem to discount my thought
> that perhaps the math wasn't equivalent and something else was getting
> confused by a different value returned from size() and then trampling on
> memory. However, no value mismatch is reported before it fails.
>
> (pause for search)
>
> So I scanned all the static libs in our dependency bundle with *nm*, and
> whaddya know... Apache Arrow (9.0.0) also uses *flatbuffers*, and also
> with no namespace! I pulled the source, and it's v1.12.0... its
> *vector_downward* class has the same data members as the v2.0.0 one in
> GDAL, minus *size_*, which was inserted in the middle.
>
> The latest Arrow 15.0 uses the latest *flatbuffers* 23.5.26, but with a
> custom namespace. I'll look through to see when they did that. 9.0.0 is
> only 18 months old, but we could probably stand to upgrade that too.
>
> namespace arrow_vendored_private::flatbuffers {}
> namespace flatbuffers = arrow_vendored_private::flatbuffers;
>
> This also, of course, explains why we only hit the problem in the full
> server build, and why I was unable to reproduce it with the simple test
> program: that only linked GDAL, and not Arrow too.
>
> OK, so I guess we might be able to avoid it by upgrading Arrow, as long
> as that doesn't break something else. I guess you need to do the custom
> namespace thing too, though.
>
> I hate computers.
>
> Simon
>
>
> On Sun, Feb 25, 2024 at 3:43 PM Even Rouault <even.roua...@spatialys.com>
> wrote:
>
>> Not obvious why that change would have broken anything, and certainly
>> still absolutely no idea why it only happens in a full static build.
>>
>> At that point, I would slightly bet that your whole application has
>> another component using flatbuffers at a different version, one which
>> wouldn't have the new vector_downward::size_ member. Although I would
>> expect static linking to be in a better position to detect duplicated
>> symbols than dynamic linking...
>>
>> One thing we didn't do in GDAL is to add a GDAL-specific namespace
>> around its flatbuffers component (we did that in MapServer to avoid
>> potential conflicts between MapServer's flatbuffers copy and the GDAL
>> one).
>>
>> An interesting experiment would be to revert
>> https://github.com/google/flatbuffers/commit/9e4ca857b6dadf116703f612187e33b7d4bb6688
>> but add an unused size_ member, to see if that's enough to break things.
>> Or just scramble the order of the members of vector_downward a bit.
>>
>> Or try replacing the "flatbuffers" namespace by something like
>> "gdal_flatbuffers".
>>
>>
>> On Sat, Feb 24, 2024 at 5:27 PM Simon Eves <simon.e...@heavy.ai> wrote:
>>
>>> OK, so I tried a custom build of 3.7.3 with the latest *flatbuffers*
>>> (23.5.26), which was a drop-in replacement for 2.0.6 other than the
>>> version asserts.
>>>
>>> This does not exhibit the original problem either.
>>>
>>> However, while it produces files that the stock static build, the
>>> static build with the older *flatbuffers* (2.0.0), and the Ubuntu
>>> dynamic build can all read just fine, it is unable to read ANY files
>>> back in again (in the context of our server geo importer, anyway).
>>>
>>> GDAL throws a *CE_Failure* of *Header failed consistency verification
>>> (1)*, which is from *OGRFlatGeobufLayer::Open()*, and the dataset
>>> reports no layers (or at least, no vector layers).
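>>>
>>> (For context, my understanding is that this check is the standard
>>> flatbuffers verification pattern, something like the sketch below; the
>>> function name is mine and the details in the actual driver will differ.
>>> If the linker is again mixing two *flatbuffers* copies, the verifier
>>> code actually linked in may not be the one the generated helpers were
>>> written against, which could plausibly produce a spurious consistency
>>> failure.)
>>>
>>>     #include <cstdint>
>>>     #include "flatbuffers/flatbuffers.h"
>>>     #include "header_generated.h"  // generated from the FlatGeobuf schema
>>>
>>>     // Sketch of the verification step (illustrative; the real driver
>>>     // adds size limits and error handling).
>>>     bool CheckFgbHeader(const uint8_t *buf, size_t len)
>>>     {
>>>         // The Verifier walks every offset and vtable in the buffer and
>>>         // fails if anything points out of bounds.
>>>         flatbuffers::Verifier verifier(buf, len);
>>>         return FlatGeobuf::VerifySizePrefixedHeaderBuffer(verifier);
>>>     }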
>>> This also appears to be a side-effect of it being a static build, as
>>> *ogrinfo* built from the same source (with *flatbuffers* 2.0.0), but in
>>> regular shared-libs mode, can read all three files just fine. I have
>>> been unable to achieve a full-static tools build, so I can't try that
>>> right now.
>>>
>>> This either means that the problem is still there in some form in the
>>> latest *flatbuffers* but has moved, or that the higher-level FGB file
>>> schema verification can be affected by the *flatbuffers* version. Both
>>> are equally concerning.
>>>
>>> Anyway, the build with the older *flatbuffers* 2.0.0 extracted from the
>>> 3.5.3 tree (with the *Table::VerifyField* mod) seems to work fine in
>>> all ways, so we're probably gonna go with that, in the absence of
>>> anything else.
>>>
>>> One other weirdness is that, of the three files, the two produced by
>>> the static builds (*flatbuffers* 2.0.0 and *flatbuffers* 23.5.26) are
>>> 16 bytes longer than the one from the Ubuntu dynamic build. All three
>>> read just fine with *ogrinfo* and our server geo importer, and result
>>> in the same table. Here is a link to a bundle with all three files plus
>>> the GeoJSON equivalent (*MULTIPOLYGON* US states with some metadata).
>>>
>>> https://drive.google.com/file/d/1ETRuV63gvUL4aNAT_4KvjrtK1uiCrFun/view?usp=sharing
>>>
>>> As ever, happy to get into the weeds with more details of the original
>>> problem, but pretty sure that 95% of the readers of this list don't
>>> want this thread to get any longer! :)
>>>
>>> Simon
>>>
>>>
>>> On Fri, Feb 23, 2024 at 3:46 PM Simon Eves <simon.e...@heavy.ai> wrote:
>>>
>>>> Our emails crossed. I am indeed testing with the latest flatbuffers
>>>> now too.
>>>>
>>>> Agreed on the rest.
>>>>
>>>> On Fri, Feb 23, 2024 at 3:42 PM Even Rouault
>>>> <even.roua...@spatialys.com> wrote:
>>>>
>>>>> Simon,
>>>>>
>>>>> did you try updating to the latest
>>>>> https://github.com/google/flatbuffers/releases to see if that would
>>>>> solve the issue? If that worked, that would be the best way forward...
>>>>>
>>>>> Otherwise, if the issue persists with the latest flatbuffers release,
>>>>> an (admittedly rather tedious) option would be to do a git bisect on
>>>>> the flatbuffers code to identify the culprit commit. With some luck,
>>>>> the root cause might be obvious if a single culprit commit can be
>>>>> exhibited (perhaps some subtle C++ undefined behaviour being
>>>>> triggered? it is also a bit mysterious that it hits only for static
>>>>> builds); otherwise, raise it with the upstream flatbuffers project to
>>>>> ask for their expertise.
>>>>>
>>>>> Even
>>>>>
>>>>> On 23/02/2024 at 23:54, Simon Eves via gdal-dev wrote:
>>>>>
>>>>> I was able to create a fork of 3.7.3 with just the *flatbuffers*
>>>>> replaced with the pre-3.6.x version (2.0.0).
>>>>>
>>>>> This seemed to only require changes to the version asserts and adding
>>>>> an *align* parameter to *Table::VerifyField()* to match the newer
>>>>> API.
>>>>>
>>>>> https://github.com/heavyai/gdal/tree/simon.eves/release/3.7/downgrade_to_flatbuffers_2.0.0
>>>>>
>>>>> Our system works correctly and passes all GDAL I/O tests with that
>>>>> version. Obviously this isn't an ideal solution, but this is otherwise
>>>>> a release blocker for us.
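>>>>>
>>>>> (Concretely, the API difference is just an extra alignment argument
>>>>> on the field verifier, so the 2.0.0 code only needed to accept and
>>>>> ignore it. A rough sketch of the shim, from memory; see the branch
>>>>> above for the actual diff:)
>>>>>
>>>>>     // In flatbuffers 2.0.0's Table class: accept the three-argument
>>>>>     // call sites emitted by the newer generated code. 2.0.0 did no
>>>>>     // alignment checking, so the extra argument is simply ignored.
>>>>>     template<typename T>
>>>>>     bool VerifyField(const Verifier &verifier, voffset_t field,
>>>>>                      size_t /*align*/ = sizeof(T)) const {
>>>>>       // Unchanged 2.0.0 logic: a field is valid if it is absent, or
>>>>>       // present and fully readable within the buffer.
>>>>>       auto field_offset = GetOptionalFieldOffset(field);
>>>>>       return !field_offset || verifier.Verify<T>(data_, field_offset);
>>>>>     }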
>>>>> I would still very much like to discuss the original problem more
>>>>> deeply, and hopefully come up with a better solution.
>>>>>
>>>>> Yours hopefully,
>>>>>
>>>>> Simon
>>>>>
>>>>>
>>>>> On Thu, Feb 22, 2024 at 10:22 PM Simon Eves <simon.e...@heavy.ai>
>>>>> wrote:
>>>>>
>>>>>> Thank you, Robert, for the RR tip. I shall try it.
>>>>>>
>>>>>> I have new findings to report, however.
>>>>>>
>>>>>> First of all, I confirmed that a build against GDAL 3.4.1 (the
>>>>>> version we were on before) still works. I also confirmed that builds
>>>>>> against 3.7.3 and 3.8.4 still failed even with no additional library
>>>>>> dependencies (just sqlite3 and proj), in case it was a side-effect
>>>>>> of us also adding more of those. I then tried 3.5.3, with the CMake
>>>>>> build (same config as we use for 3.7.3), and that worked. I then
>>>>>> tried 3.6.4 (again, same CMake config) and that failed. These were
>>>>>> all from bundles.
>>>>>>
>>>>>> I then started delving through the GDAL repo itself. I found the
>>>>>> common root commit of 3.5.3 and 3.6.4, and then all the commits in
>>>>>> the *ogr/ogrsf_frmts/flatgeobuf* sub-project between that root and
>>>>>> the final commit of each release. For 3.5.3, there were only two. I
>>>>>> built and tested both, and they were fine. I then tried the very
>>>>>> first one that was new in the 3.6.4 chain (i.e. not in the history
>>>>>> of 3.5.3), which was actually a bulk update of the *flatbuffers*
>>>>>> sub-library, committed by Bjorn Harrtell on May 8, 2022 (SHA
>>>>>> f7d8876). That one had the issue. I then tried the
>>>>>> immediately-preceding commit (an unrelated docs change) and that one
>>>>>> was fine.
>>>>>>
>>>>>> My current hypothesis, therefore, is that the *flatbuffers* update
>>>>>> introduced the issue, or at least the susceptibility to it.
>>>>>>
>>>>>> I still cannot explain why it only occurs in an all-static build,
>>>>>> and am even less able to explain why it only occurs in our full
>>>>>> system and not with the simple test program against the very same
>>>>>> static lib build, doing the very same sequence of GDAL API calls;
>>>>>> but I repeated the build tests of the commits on either side and a
>>>>>> few other random ones a bit further away in each direction, and the
>>>>>> results were consistent. Again, it happens with both GCC 11 and
>>>>>> Clang 14 builds, Debug or Release.
>>>>>>
>>>>>> I will continue tomorrow to look at the actual changes to
>>>>>> *flatbuffers* in that update, although they are quite significant.
>>>>>> Certainly, the *vector_downward* class, which is directly involved,
>>>>>> was a new file in that update (although on inspection of that file's
>>>>>> history in the *google/flatbuffers* repo, it seems it was just split
>>>>>> out of another header).
>>>>>>
>>>>>> Bjorn, I don't mean to call you out directly, but I am CC'ing you to
>>>>>> ensure you see this, as you appear to be a significant contributor
>>>>>> to the *flatbuffers* project itself. Any insight you may have would
>>>>>> be very welcome. I am of course happy to describe my debugging
>>>>>> findings in more detail, privately if you wish, rather than spamming
>>>>>> the list.
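>>>>>>
>>>>>> (To illustrate the kind of failure mode a pure data-member change
>>>>>> can cause: if two archives in the link each carry a class with the
>>>>>> same mangled name but a different layout, the one-definition rule is
>>>>>> violated, the linker silently keeps one copy of any inline/template
>>>>>> code, and the other library then reads members at the wrong offsets.
>>>>>> A contrived two-file sketch of the mechanism; the member names are
>>>>>> made up and not the real *vector_downward* layout:)
>>>>>>
>>>>>>     // old_lib.cpp -- built against the older headers
>>>>>>     #include <cstddef>
>>>>>>     #include <cstdint>
>>>>>>     namespace flatbuffers {
>>>>>>     class vector_downward {
>>>>>>      public:
>>>>>>       std::uint8_t *buf_;      // offset 0
>>>>>>       std::uint8_t *scratch_;  // offset 8
>>>>>>     };
>>>>>>     }  // namespace flatbuffers
>>>>>>
>>>>>>     // new_lib.cpp -- built against headers with a member inserted
>>>>>>     #include <cstddef>
>>>>>>     #include <cstdint>
>>>>>>     namespace flatbuffers {
>>>>>>     class vector_downward {
>>>>>>      public:
>>>>>>       std::uint8_t *buf_;      // offset 0
>>>>>>       std::size_t size_;       // offset 8 -- shifts everything below
>>>>>>       std::uint8_t *scratch_;  // now offset 16; code compiled against
>>>>>>                                // the old layout still reads offset 8
>>>>>>                                // here and sees garbage
>>>>>>     };
>>>>>>     }  // namespace flatbuffers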
>>>>>> Simon
>>>>>>
>>>>>>
>>>>>> On Tue, Feb 20, 2024 at 1:49 PM Robert Coup
>>>>>> <robert.c...@koordinates.com> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> On Tue, 20 Feb 2024 at 21:44, Robert Coup
>>>>>>> <robert.c...@koordinates.com> wrote:
>>>>>>>
>>>>>>>> Hi Simon,
>>>>>>>>
>>>>>>>> On Tue, 20 Feb 2024 at 21:11, Simon Eves <simon.e...@heavy.ai>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Here's the stack trace for the original assert. Something is
>>>>>>>>> stepping on scratch_ to make it 0x1000000000 instead of null,
>>>>>>>>> which it starts out as when the flatbuffer object is created, but
>>>>>>>>> by the time it gets to allocating memory, it's broken.
>>>>>>>>>
>>>>>>>>
>>>>>>>> What happens if you set a watchpoint in gdb when the flatbuffer is
>>>>>>>> created?
>>>>>>>>
>>>>>>>> watch -l myfb->scratch_
>>>>>>>> or watch *0x1234c0ffee
>>>>>>>>
>>>>>>>
>>>>>>> Or I've also had success with Mozilla's rr: https://rr-project.org/
>>>>>>> — you can run to a point where scratch_ is wrong, set a watchpoint
>>>>>>> on it, and then run the program backwards to find out what touched
>>>>>>> it.
>>>>>>>
>>>>>>> Rob :)
>>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> http://www.spatialys.com
>>>>> My software is free, but my time generally not.
>>>>>
>> --
>> http://www.spatialys.com
>> My software is free, but my time generally not.
_______________________________________________
gdal-dev mailing list
gdal-dev@lists.osgeo.org
https://lists.osgeo.org/mailman/listinfo/gdal-dev