Re: [lldb-dev] [RFC] Segmented Address Space Support in LLDB

Greg Clayton via lldb-dev Mon, 26 Oct 2020 15:45:44 -0700


> On Oct 22, 2020, at 1:25 AM, Jason Molenda <[email protected]> wrote:
> 
> Hi Greg, Pavel.
> 
> I think it's worth saying that this is very early in this project.  We know 
> we're going to need the ability to track segments on addresses, but honestly 
> a lot of the finer details aren't clear yet.  It's such a fundamental change 
> that we wanted to start a discussion, even though I know it's hard to have 
> detailed discussions still.
> 
> In the envisioned environment, there will be a default segment, and most 
> addresses will be in the default segment.  DWARF, user input (lldb cmdline), 
> SB API, and clang expressions are going to be the places where segments are 
> specified --- Dump methods and ProcessGDBRemote will be the main place where 
> the segments are displayed/used.  There will be modifications to the memory 
> read/write gdb RSP packets to include these.
> 
> This early in the project, it's hard to tell what will be upstreamed to the 
> llvm.org <http://llvm.org/> monorepo, or when.  My personal opinion is that 
> we don't actually want to add segment support to llvm.org <http://llvm.org/> 
> lldb at this point.  We'd be initializing every address object with 
> LLDB_INVALID_SEGMENT or LLDB_DEFAULT_SEGMENT, and then testing that each 
> object is initialized this way?  I don't see this actually being useful.
> 
> However, changing lldb's target addresses to be strictly handled in terms of 
> objects will allow us to add a segment discriminator ivar to Address and 
> ProcessAddress on our local branch while this is in development, and minimize 
> the places where we're diverging from the llvm.org <http://llvm.org/> 
> sources.  We'll need to have local modifications at the places where a 
> segment is input (DWARF, cmdline, SB API, compiler type) or output (Dump, 
> ProcesssGDBRemote) and, hopefully, the vast majority of lldb can be 
> unmodified.
> 
> The proposal was written in terms of what we need to accomplish based on our 
> current understanding for this project, but I think there will be a lot of 
> details figured out as we get more concrete experience of how this all works. 
>  And when it's appropriate to upstream to llvm.org <http://llvm.org/>, we'll 
> be better prepared to discuss the tradeoffs of the approaches we took in 
> extending Address/ProcessAddress to incorporate a segment.
> 
> My hope is that these generic OO'ification of target addresses will not 
> change lldb beyond moving off of addr_t for now.
> 
> I included a couple of inlined comments, but I need to address more of yours 
> & Pavel's notes later, I've been dealing with a few crazy things and am way 
> behind on emails but didn't want to wait any longer to send something out.


No worries! I would vote to upstream as much as possible as soon as possible to 
avoid differences and merging issues for you guys. I would really like to see 
LLDB have support for segmented address spaces. Many comments I made were just 
my thinking out loud and trying to ease the changes in with as little 
disruption as possible.
> 
> 
> 
>> On Oct 19, 2020, at 4:11 PM, Greg Clayton via lldb-dev 
>> <[email protected]> wrote:
>> 
>> 
>> 
>>> On Oct 19, 2020, at 2:56 PM, Jonas Devlieghere via lldb-dev 
>>> <[email protected]> wrote:
>>> 
>>> We want to support segmented address spaces in LLDB. Currently, all of 
>>> LLDB’s external API, command line interface, and internals assume that an 
>>> address in memory can be addressed unambiguously as an addr_t (aka 
>>> uint64_t). To support a segmented address space we’d need to extend addr_t 
>>> with a discriminator (an aspace_t) to uniquely identify a location in 
>>> memory. This RFC outlines what would need to change and how we propose to 
>>> do that.
>>> 
>>> ### Addresses in LLDB
>>> 
>>> Currently, LLDB has two ways of representing an address:
>>> 
>>> - Address object. Mostly represents addresses as Section+offset for a 
>>> binary image loaded in the Target. An Address in this form can persist 
>>> across executions, e.g. an address breakpoint in a binary image that loads 
>>> at a different address every execution. An Address object can represent 
>>> memory not mapped to a binary image. Heap, stack, jitted items, will all be 
>>> represented as the uint64_t load address of the object, and cannot persist 
>>> across multiple executions. You must have the Target object available to 
>>> get the current load address of an Address object in the current process 
>>> run. Some parts of lldb do not have a Target available to them, so they 
>>> require that the Address can be devolved to an addr_t (aka uint64_t) and 
>>> passed in.
>>> - The addr_t (aka uint64_t) type. Primarily used when receiving input (e.g. 
>>> from a user on the command line) or when interacting with the inferior 
>>> (reading/writing memory) for addresses that need not persist across runs. 
>>> Also used when reading DWARF and in our symbol tables to represent file 
>>> offset addresses, where the size of an Address object would be 
>>> objectionable.
>>> 
>> 
>> Correction: LLDB has 3 kinds of uint64_t addresses:
>> - "file address" which are always mapped to a section + offset if put into a 
>> Address object. This value only makes sense to the lldb_private::Module that 
>> contains it. The only way to pass this around is as a lldb_private::Address. 
>> You can make queries on a file address using "image lookup --address" before 
>> you are debugging, but a single file address can result in multiple matches 
>> in multiple modules because each module might contain something at this 
>> virtual address. This object might be able to be converted to a "load 
>> address" if the section is loaded in your debug target. Since the target 
>> contains the section load list, the target is needed when converting between 
>> Address and addr_t objects.
>> - "load address" which is guaranteed to be unique in a process with no 
>> segments. It can always be put into a lldb_private::Address object, but that 
>> object won't always have a section. If there is no section, it means the 
>> memory location maps to stack, heap, or other memory that doesn't reside in 
>> a object file section. This object might be able to be converted to a 
>> section + offset address if the address matches one of the loaded sections 
>> in a target. If this can be converted to a Address object that has a 
>> section, then it can persist across debug sessions, otherwise, not.
>> - "host address" which is a pointer to memory in the LLDB process itself. 
>> Used for storing expression results and other things. You cannot convert 
>> this to/from a "file" or "load" address.
> 
> Yes, good point, host memory is a third type of address that we use.  And our 
> symbols tables, for instance, internally represent themselves as uint64_t 
> offsets into the file or section, I forget which, and we're not talking about 
> changing those uint64_t style addresses.  On our project, I do not believe 
> the symbol table will give us segment information.

You should be able to classify symbols from the symbol table to a segment 
though right?

We could add a special define for host addresses if needed.

> 
> 
> 
>> 
>>> ## Proposal
>>> 
>>> ### Address + ProcessAddress
>>> 
>>> - The Address object gains a segment discriminator member variable. 
>>> Everything that creates an Address will need to provide this segment 
>>> discriminator.
>> 
>> So an interesting thing to think about is if lldb_private::Section object 
>> should contain a segment identifier? If this is the case, then an Address 
>> object can have a Section that has a segment _and_ the Address object itself 
>> might have one that was set from the section as well. It would be good to 
>> figure out what the rules are for this case and it might lead to the need 
>> for an intelligent accessor that always prefers the section's segment if a 
>> section is available. The Address object must have one in case we have a 
>> pointer to memory in data and there is no section for this (like any heap 
>> addresses).
> 
> I don't believe a Section in this project will have a segment.  We're looking 
> purely at individual variables, primarily from debug information.

So if you have a global variable, it will have a symbol right? And it will have 
debug info. Are you saying that only the debug info would have segment info? It 
seems important to be able to view a global variable without debug info.

> 
> 
>>> - A ProcessAddress object which is a uint64_t and a segment discriminator 
>>> as a replacement for addr_t. ProcessAddress objects would not persist 
>>> across multiple executions. Similar to how you can create an addr_t from an 
>>> Address+Target today, you can create a ProcessAddress given an 
>>> Address+Target. When we pass around addr_ts today, they would be replaced 
>>> with ProcessAddress, with the exception of symbol tables where the added 
>>> space would be significant, and we do not believe we need segment 
>>> discriminators today.
>> 
>> Would SegmentedAddress be a more descriptive name here?
>> 
>> A few things I would like to see on ProcessAddress or SegmentedAddress:
>> - Have a segment definition that says "no segment" like LLDB_INVALID_SEGMENT 
>> or LLDB_NO_SEGMENT and allow these objects to be constructed with just a 
>> lldb::addr_t and the segment gets auto set to LLDB_NO_SEGMENT
> 
>> - Any code that uses these should test if there is no segment and continue 
>> to do what they used to do before
>> - like read/write memory in ProcessGDBRemote
> 
> 
> To be honest, testing this is going to be one of the tricky things I'm not 
> sure how we'll do.  we will have a default segment that addresses will use 
> unless overridden, but how we spot places that *incorrectly* failed to 
> initialize the segment of an Address/ProcessAddress is something we're going 
> to need to figure out.  

Tests I can think of:
- read function disassembly from a code segment that would have the same 
address as something from a data segment
- read a variable from a data segment that would have the same address as 
something from a code segment

It would be good to figure out where segments are going to come from. I would 
hope that some sections would be able to be mapped to certain segments so that 
we can live with a binary that has no debug info and still read say a global 
variable. I know we can do things right inside of the debug info since DWARF 
can have segment information.

> 
> 
>> - Anything that dumps one of these objects should dump just like they used 
>> to (just a uint64_t hex representation and no other notation)
>> - Add code that can convert a "load address" into a ProcessAddress or 
>> SegmentedAddress that invent the segment notation and have no changes for 
>> targets that don't support segmented address spaces
>> - 0x1000 should convert to ProcessAddress where the address is 0x1000 and 
>> segment is LLDB_INVALID_SEGMENT or LLDB_NO_SEGMENT if the process doesn't 
>> support segmented addresses
> 
> 
>> - 0x1000 would return an error on conversion for processes that do support 
>> segmented addresses as the segment must be specified? Or should there be a 
>> default segment if we run into this case?
>> - Come up with some quick way to represent segmented addresses for an 
>> address of 0x1000 in segment 2: ideas:
>>   - [2]0x1000
>>   - {2}0x1000
>>   - 0x1000[2]
>>   - 0x1000{2}
>>   - {0x1000, 2}
> 
> To be honest, we haven't thought about the UI side of this very much yet.  I 
> think there will be ABI or ArchSpec style information that maps segment 
> numbers to human-understandable names.  It's ABI style enumerated numbers - 
> the DWARF will include a number that is passed down to the remote gdb stub.

Should be fine to have named segments if needed, but we will need to come up 
with a way to specify a segment. We could always add new arguments to command 
line commands if needed.
> 
> 
>> 
>>> 
>>> ### Address Only
>>> 
>>> Extend the lldb_private::Address class to be the one representation of 
>>> locations; including file based ones valid before running, file addresses 
>>> resolved in a process, and process specific addresses (heap/stack/JIT code) 
>>> that are only valid during a run. That is attractive because it would 
>>> provide a uniform interface to any “where is something” question you would 
>>> ask, either about symbols in files, variables in stack frames, etc.
>>> 
>>> At present, when we resolve a Section+Offset Address to a “load address” we 
>>> provide a Target to the resolution API.  Providing the Target externally 
>>> makes sense because a Target knows whether the Section is present or not 
>>> and can unambiguously return a load address.    We could continue that 
>>> approach since the Target always holds only one process, or extend it to 
>>> allow passing in a Process when resolving non-file backed addresses.  But 
>>> this would make the conversion from addr_t uses to Address uses more 
>>> difficult, since we will have to push the Target or Process into all the 
>>> API’s that make use of just an addr_t.  Using a single Address class seems 
>>> less attractive when you have to provide an external entity to make sense 
>>> of it at all the use sites.
>>> 
>>> We could improve this situation by including a Process (as a weak pointer) 
>>> and fill that in on the boundaries where in the current code we go from an 
>>> Address to a process specific addr_t.  That would make the conversion 
>>> easier, but add complexity.  Since Addresses are ubiquitous, you won’t know 
>>> what any given Address you’ve been handed actually contains.  It could even 
>>> have been resolved for another process than the current one.  Making 
>>> Address usage-dependent in this way reduces the attractiveness of the 
>>> solution.
>>> 
>>> ## Approach
>>> 
>>> Replacing all the instances of addr_t by hand would be a lot of work. 
>>> Therefore we propose writing a clang-based tool to automate this menial 
>>> task. The tool would update function signatures and replace uses of addr_t 
>>> inside those functions to get the addr_t from the ProcessAddress or Address 
>>> and return the appropriate object for functions that currently return an 
>>> addr_t. The goal of this tool is to generate one big NFC patch. This tool 
>>> needs not be perfect, at some point it will be more work to improve the 
>>> tool than fixing up the remaining code by hand. After this patch LLDB would 
>>> still not really understand address spaces but it will have everything in 
>>> place to support them.
>> 
>> This won't be NFC really as each location that plays with what used to be 
>> addr_t now must check if the segment is invalid before doing what it did 
>> before _and_ return an error if the segment is something valid.
>> 
>> It might be better to look at all of the APIs that could end up using a 
>> plain "addr_t" and adding new APIs that take a ProcessAddress and call the 
>> old API if the segment is LLDB_INVALID_SEGMENT or LLDB_NO_SEGMENT, and 
>> return an error if the segment is valid. For example in the Process class we 
>> have:
>> 
>> virtual size_t Process::DoReadMemory(lldb::addr_t vm_addr, void *buf, size_t 
>> size, Status &error) = 0;
>> 
>> We could add a new overload:
>> 
>> virtual size_t Process::DoReadMemory(ProcessAddress proc_addr, void *buf, 
>> size_t size, Status &error) {
>> if (proc_addr.GetSegment() == LLDB_NO_SEGMENT)
>>   return DoReadMemory(proc_addr.GetAddress(), but, size, error);
>> error.SetErrorString("segmented addresses are not supported on this 
>> process");
>> return 0
>> }
>> 
>> Then we can start modifying the locations that need to support segmented 
>> addresses as needed. For instance, if we were to add segmented address 
>> support to ProcessGDBRemote, then we would override this function in that 
>> class.
>> 
>> I am not sure if slowly adding this functionality is better than replacing 
>> this all right away, but we can't just do a global replace without adding 
>> functionality or error checking IMHO.
>> 
>> 
>>> Once all the APIs are updated, we can start working on the functional 
>>> changes. This means actually interpreting the aspace_t values and making 
>>> sure they don’t get dropped.
>>> 
>>> Finally, when all this work is done and we’re happy with the approach, we 
>>> extend the SB API with overloads for the functions that currently take or 
>>> return addr_t . I want to do this last so we have time to iterate before 
>>> committing to a stable interface.
>> 
>> This might be one reason for doing the approach suggested above where we add 
>> new internal APIs that take a ProcessAddress and cut over to using them. As 
>> it would mean all of the current APIs in the lldb::SB layer would remain in 
>> place (they can't be removed) and would still make sense.
>> 
>>> 
>>> ## Testing
>>> 
>>> By splitting off the intrusive non-functional changes we are able to rely 
>>> on the existing tests for coverage. Smaller functional changes can be 
>>> tested in isolation, either through a unit test or a small GDB remote test. 
>>> For end-to-end testing we can run the test suite with a modified 
>>> debugserver that spoofs address spaces.
>> 
>> That makes sense. ProcessGDBRemote will need to dynamically respond with 
>> wether it supports segmented addresses by overloading the DoReadMemory that 
>> takes a ProcessAddress and do the right thing.
>> 
>> Thanks for taking this on. I hope some of the comments above help moving 
>> this forward.
>> 
>> Greg
>> 
>>> 
>>> Thanks,
>>> Jonas
>>> 
>>> _______________________________________________
>>> lldb-dev mailing list
>>> [email protected]
>>> https://lists.llvm.org/cgi-bin/mailman/listinfo/lldb-dev
>> 
>> _______________________________________________
>> lldb-dev mailing list
>> [email protected] <mailto:[email protected]>
>> https://lists.llvm.org/cgi-bin/mailman/listinfo/lldb-dev 
>> <https://lists.llvm.org/cgi-bin/mailman/listinfo/lldb-dev>

_______________________________________________
lldb-dev mailing list
[email protected]
https://lists.llvm.org/cgi-bin/mailman/listinfo/lldb-dev

Re: [lldb-dev] [RFC] Segmented Address Space Support in LLDB

Reply via email to