malaperle added inline comments.

================
Comment at: clangd/Symbol.h:37
+// The class presents a C++ symbol, e.g. class, function.
+struct Symbol {
+  // The symbol identifier, using USR.
----------------
malaperle wrote:
> sammccall wrote:
> > hokein wrote:
> > > malaperle wrote:
> > > > sammccall wrote:
> > > > > hokein wrote:
> > > > > > malaperle wrote:
> > > > > > > I think it would be nice to have methods as an interface to get 
> > > > > > > this data instead of storing them directly. So that an 
> > > > > > > index-on-disk could go fetch the data. Especially the occurrences 
> > > > > > > which can take a lot of memory (I'm working on a branch that does 
> > > > > > > that). But perhaps defining that interface is not within the 
> > > > > > > scope of this patch and could be better discussed in D40548 ?
> > > > > > I agree. We can't load all the symbol occurrences into the memory 
> > > > > > since they are too large. We need to design interface for the 
> > > > > > symbol occurrences. 
> > > > > > 
> > > > > > We could discuss the interface here, but CodeCompletion is the main 
> > > > > > thing which this patch focuses on. 
> > > > > > We can't load all the symbol occurrences into the memory since they 
> > > > > > are too large
> > > > > 
> > > > > I've heard this often, but never backed up by data :-)
> > > > > 
> > > > > Naively an array of references for a symbol could be doc ID + offset 
> > > > > + length, let's say 16 bytes.
> > > > > 
> > > > > If a source file consisted entirely of references to 1-character 
> > > > > symbols separated by punctuation (1 reference per 2 bytes) then the 
> > > > > total size of these references would be 8x the size of the source 
> > > > > file - in practice much less. That's not very big.
> > > > > 
> > > > > (Maybe there are edge cases with macros/templates, but we can keep 
> > > > > them under control)
> > > > I'd have to break down how much memory it used by what, I'll come back 
> > > > to you on that. Indexing llvm with ClangdIndexDataStorage, which is 
> > > > pretty packed is about 200MB. That's already a lot considering we want 
> > > > to index code bases many times bigger. But I'll try to come up with 
> > > > more precise numbers. I'm open to different strategies.
> > > > 
> > > I can see two points here:
> > > 
> > > 1. For all symbol occurrences of a TU, it is not quite large, and we can 
> > > keep them in memory.
> > > 2. For all symbol occurrences of a whole project, it's not a good idea to 
> > > load all of them into memory (For LLVM project, the size of YAML dataset 
> > > is ~1.2G).  
> > (This is still a sidebar - not asking for any changes)
> > 
> > The YAML dataset is not a good proxy for how big the data is (at least 
> > without an effort to estimate constant factor).
> > And "it's not a good idea" isn't an assertion that can hold without 
> > reasons, assumptions, and data.
> > If the size turns out to be, say, 120MB for LLVM, and we want to scale to 
> > 10x that, and we're spending 500MB/file for ASTs, then it might well be a 
> > good trade.
> > The YAML dataset is not a good proxy for how big the data is (at least 
> > without an effort to estimate constant factor).
> 
> Indeed. I'll try to come up with more realistic numbers. There are other 
> things not accounted for in the 16 bytes mentioned above, like storing roles 
> and relations.
> 
> > 500MB/file for ASTs
> 
> What do you mean? 500MB worth of occurrences per file? Or Preambles perhaps?
> What do you mean? 500MB worth of occurrences per file? Or Preambles perhaps?

Oh I see, the AST must be in memory for fast reparse. I just tried opening 3 
files at the same time I it was already around 500MB. Hmm, that's a bit 
alarming.


Repository:
  rCTE Clang Tools Extra

https://reviews.llvm.org/D40897



_______________________________________________
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits

Reply via email to