Re: [GSoC] Designing a faster index format - Progress report week 13
Junio C Hamano writes: > Thomas Rast writes: > >> Junio's index-v4 was a speed boost mainly because it cuts down on the >> size of the index. Do we want to throw that out? > > That's pretty much orthogonal, isn't it? > > The index-v4 is merely to show how a stupid prefix compression of > pathnames without nothing else would reduce the file size and I/O > cost, in order to set the bar for anything more clever than that. > > I thought that this discussion is about keeping, squishing, or > discarding part of the cached stat info, and nobody is suggesting > what to do with the prefix compression of pathnames. True, sorry for being so confusing. 'Throw "that" out' was meant to refer to the observation that smaller indexes are generally faster. Whether this matters in the long run is another question. Perhaps partial loading combined with something like inotify can avoid full reads in most operations. -- Thomas Rast trast@{inf,student}.ethz.ch -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [GSoC] Designing a faster index format - Progress report week 13
Thomas Rast writes: > Junio's index-v4 was a speed boost mainly because it cuts down on the > size of the index. Do we want to throw that out? That's pretty much orthogonal, isn't it? The index-v4 is merely to show how a stupid prefix compression of pathnames without nothing else would reduce the file size and I/O cost, in order to set the bar for anything more clever than that. I thought that this discussion is about keeping, squishing, or discarding part of the cached stat info, and nobody is suggesting what to do with the prefix compression of pathnames. -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [GSoC] Designing a faster index format - Progress report week 13
Robin Rosenberg writes: > Junio C Hamano skrev 2012-07-22 23.08: >> Thomas Rast writes: >> >>> What is the status quo? I take it JGit does not have any of ctime, dev, >>> ino etc., and either leaves the existing value or puts a 0 >>> an argument in favor of splitting stat_crc into its fields again? >> >> A difference is that JGit already has such code, and we would be >> adding a burden to do so yet again. It also may not just be JGit, >> but anything that wants to be "compatible" with systems whose >> filesystem interface does not give enough data by omitting fields >> the current index pays attention to. >> >> It isn't really a discussion about splitting again, but more about >> not squishing them into a new field in the first place---IIUC, even >> outside Windows, ctime is already problematic on some systems where >> background processes muck with extended attributes Git does not pay >> attention to. If the patch makes us lose the ability to selectively >> ignore changes to certain fields (e.g. changes to dev and ino are >> noticed but ctime are ignored) by squishing them into one new field, >> wouldn't removing them without adding such a useless field a simpler >> way to go? > > I wasnt't thinking of splitting, but now I read it again, I do think > it should split. Aren't you two going off in different directions? I read Junio as implying that if size/ctime/dev/ino are a pain to deal with, we should just drop them altogether. You seem to be saying the opposite: > Having size accessible is a good thing, and even > better if it a 64-bit value so we don't have the modulo-4G problem > when looking at it. Current size is 4G + 33 bytes, index says 33. Did > the > file change or not? > > Having access to size make the need for actually > invoking the racy git logic and comparing file content less likely. I don't think this is true. Racy git logic is needed every time that the file *looks* unchanged, but isn't. In the case where the file is certified (by mtime) unchanged, we don't go checking. But in the case where it *looks* changed, we still have to go and read it to know if, perhaps, the only thing the user did was hit "save" again. Not to mention that this really hurts in terms of index size. Our benchmark for index-v5 is the Webkit project, which stands at 180k files. So every 6B/entry is about an MB of final size, which needs to be loaded, hashed (or crc'd), then hashed/crc'd again and written. Junio's index-v4 was a speed boost mainly because it cuts down on the size of the index. Do we want to throw that out? -- Thomas Rast trast@{inf,student}.ethz.ch -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [GSoC] Designing a faster index format - Progress report week 13
Junio C Hamano skrev 2012-07-22 23.08: Thomas Rast writes: Hum, I'm a bit lost now. What is the status quo? I take it JGit does not have any of ctime, dev, ino etc., and either leaves the existing value or puts a 0 an argument in favor of splitting stat_crc into its fields again? A difference is that JGit already has such code, and we would be adding a burden to do so yet again. It also may not just be JGit, but anything that wants to be "compatible" with systems whose filesystem interface does not give enough data by omitting fields the current index pays attention to. It isn't really a discussion about splitting again, but more about not squishing them into a new field in the first place---IIUC, even outside Windows, ctime is already problematic on some systems where background processes muck with extended attributes Git does not pay attention to. If the patch makes us lose the ability to selectively ignore changes to certain fields (e.g. changes to dev and ino are noticed but ctime are ignored) by squishing them into one new field, wouldn't removing them without adding such a useless field a simpler way to go? I wasnt't thinking of splitting, but now I read it again, I do think it should split. Having size accessible is a good thing, and even better if it a 64-bit value so we don't have the modulo-4G problem when looking at it. Current size is 4G + 33 bytes, index says 33. Did the file change or not? Having access to size make the need for actually invoking the racy git logic and comparing file content less likely. As for ctime it is accessible in Java7, though everyone aren't using it and JGit code has to run on Java5. An idea is to make an optional component, but that doesn't make ctime available everywhere. -- robin -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [GSoC] Designing a faster index format - Progress report week 13
Thomas Rast writes: > Hum, I'm a bit lost now. > > What is the status quo? I take it JGit does not have any of ctime, dev, > ino etc., and either leaves the existing value or puts a 0 > an argument in favor of splitting stat_crc into its fields again? A difference is that JGit already has such code, and we would be adding a burden to do so yet again. It also may not just be JGit, but anything that wants to be "compatible" with systems whose filesystem interface does not give enough data by omitting fields the current index pays attention to. It isn't really a discussion about splitting again, but more about not squishing them into a new field in the first place---IIUC, even outside Windows, ctime is already problematic on some systems where background processes muck with extended attributes Git does not pay attention to. If the patch makes us lose the ability to selectively ignore changes to certain fields (e.g. changes to dev and ino are noticed but ctime are ignored) by squishing them into one new field, wouldn't removing them without adding such a useless field a simpler way to go? -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [GSoC] Designing a faster index format - Progress report week 13
Junio C Hamano writes: > Robin Rosenberg writes: > >> A note on how JGit would work here. Java has none of the fields >> that constitute statcrc. I guess we would write zero here when >> creating new entries. Git could recognize that when checking status >> and simply assume "clean" unless mtime or st_size says otherwise. > > Even though it may not be the end of the world, that is certainly > bad. Recording the constituent fields separately without the statcrc > microoptimization, thereby not shaving a handful of bytes per the > index entry, is not the end of the world either in the same sense, > which leads us to question the benefit we would be getting from such > a change. Hum, I'm a bit lost now. What is the status quo? I take it JGit does not have any of ctime, dev, ino etc., and either leaves the existing value or puts a 0. Which is not different from either leaving the stat crc in place, or putting a 0. Except that IIUC, putting a 0 in both cases means forcing a refresh once C git comes along (or some other reader that knows about the fields). So if we want to keep the safety net, a magic "I don't know" value would indeed be a good idea. But I don't see how what Robin said constitutes an argument in favor of splitting stat_crc into its fields again? -- Thomas Rast trast@{inf,student}.ethz.ch -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [GSoC] Designing a faster index format - Progress report week 13
Robin Rosenberg writes: > A note on how JGit would work here. Java has none of the fields > that constitute statcrc. I guess we would write zero here when > creating new entries. Git could recognize that when checking status > and simply assume "clean" unless mtime or st_size says otherwise. Even though it may not be the end of the world, that is certainly bad. Recording the constituent fields separately without the statcrc microoptimization, thereby not shaving a handful of bytes per the index entry, is not the end of the world either in the same sense, which leads us to question the benefit we would be getting from such a change. -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [GSoC] Designing a faster index format - Progress report week 13
A note on how JGit would work here. Java has none of the fields that constitute statcrc. I guess we would write zero here when creating new entries. Git could recognize that when checking status and simply assume "clean" unless mtime or st_size says otherwise. For existing entries JGit could either keep the old value (which is probably a lie since) or zero it. A modification to the spec would be that 0 == not set. -- robin -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [GSoC] Designing a faster index format - Progress report week 13
Thanks Junio for reading the progress report, this is just corrected version without the errors that he pointed out. == Work done in the previous 12 weeks == - Definition of a tentative index file v5 format [1]. This differs from the proposal in making it possible to bisect the directory entries and file entries, to do a binary search. The exact bits for each section were also defined. To further compress the index, along with prefix compression, the stat data is hashed, since it's only used for equality comparison, but the plain data is never used. Thanks to Michael Haggerty, Nguyen Thai Ngoc Duy, Thomas Rast and Robin Rosenberg for feedback. - Prototype of a converter from the index format v2/v3 to the index format v5. [2] The converter reads the index from a git repository, can output parts of the index (header, index entries as in git ls-files --debug, cache tree as in test-dump-cache-tree, or the reuc data). Then it writes the v5 index file format to .git/index-v5. Thanks to Michael Haggerty for the code review. - Prototype of a reader for the new index file format. [3] The reader has mainly the purpose to show the algorithm used to read the index lexicographically sorted after the full name which is required by the current internal memory format. Big thanks for reviewing this code and giving me advice on refactoring goes to Michael Haggerty. - Read the on-disk index file format and translate it to the current in memory format. This doesn't include reading any of the current extensions, which are now part of the main index. The code again is on github. [4] Thanks for reviewing the first steps to Thomas Rast. - Read the cache-tree data (formerly an extension, now it's integrated with the rest of the directory data) from the new ondisk format. There are still a few optimizations to do in this algorithm. - Started implementing the API (suggested by Duy), but it's still in the very early stages. There is one commit for this on GitHub [1], but it's a very early work in progress. - Started implementing the writer, which extracts the directories from the in-memory format, and writes the header and the directories to disk. The algorithm uses a hash-table instead of a simple list, to avoid many corner cases. - Implemented writing the file block to disk, and basic tests from the test suite are running fine, not including tests that require conflicted data or the cache-tree to work, which both are not implemented yet. - Started implementing a patch to introduce a ce_namelen field in struct cache_entry and drop the name length from the flags. [5] == Work done int the last week == - Polished the patch for the ce_namelen field. The thread for the patch can be found at [5]. Again thanks to Junio, Duy and Thomas for reviewing it and giving me suggestions for improving it. - Implemented the cache-tree and conflict data writing to the index-v5 file. == Outlook for the next week == - There are still a few bugs in the conflict writing, which will be fixed, to make the test suite pass with index-v5. - Once the test suite passes, the code still needs to be refactored and optimized. - If the two points above go well, I'll continue working on the api that Duy suggested. [1] https://github.com/tgummerer/git/wiki/Index-file-format-v5 [2] https://github.com/tgummerer/git/blob/pythonprototype/git-convert-index.py [3] https://github.com/tgummerer/git/blob/pythonprototype/git-read-index-v5.py [4] https://github.com/tgummerer/git/tree/index-v5 [5] http://thread.gmane.org/gmane.comp.version-control.git/200997 -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [GSoC] Designing a faster index format - Progress report week 13
On 07/16, Junio C Hamano wrote: > Thomas Gummerer writes: > > > == Work done in the previous 12 weeks == > > > > - Definition of a tentative index file v5 format [1]. This differs > > from the proposal in making it possible to bisect the directory > > entries and file entries, to do a binary search. The exact bits > > for each section were also defined. To further compress the index, > > along with prefix compression, the stat data is hashed, since > > it's only used for comparison, but the plain data is never used. > > s/comparison/equality comparison/ perhaps? > Exactly, thanks. > > Thanks to Michael Haggerty, Nguyen Thai Ngoc Duy, Thomas Rast > > and Robin Rosenberg for feedback. > > > - Read the index format format and translate it to the current in > > s/format format/on-disk file format/ or something? > Yes, thanks. > > memory format. This doesn't include reading any of the current > > extensions, which are now part of the main index. The code again > > is on github. [4] Thanks for reviewing the first steps to Thomas > > Rast. > > > - Started implementing the writer, which extracts the directories from > > the in-memory format, and writes the header and the directories to > > disk. > > - I found a few bugs in the algorithm for extracting the directories > > and decided to completely rewrite it, using a hash table instead of > > simple lists, since the old one would have to many corner cases to > > handle. > > What does "the algorithm" refer to? Is it the one described in the > previous bullet point, or is it the code in production? If latter, > it would help to separate out the task to fix the breakage, as > people with the current or previous versions of Git will be > negatively affected until that bug is fixed. If former, I am not > sure if this task needs to be described in two bullet points ("I did > X, X had bug so I redid X in a different way" is still a single task > to do X). It refers to the algorithm in the previous bullet point, which extracts the directories, and can be included in the above bullet point. Sorry for the confusion. > > == Work done int the last week == > > > > - Polished the patch for the ce_namelen field. The thread for the > > patch can be found at [5]. > > Thanks for this one; I think it is ready for 'next', but if you are > still not satisfied I do not mind waiting for further perfection. Thanks, I'm satisfied with it, for me it can be merged to 'next'. -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [GSoC] Designing a faster index format - Progress report week 13
Thomas Gummerer writes: > == Work done in the previous 12 weeks == > > - Definition of a tentative index file v5 format [1]. This differs > from the proposal in making it possible to bisect the directory > entries and file entries, to do a binary search. The exact bits > for each section were also defined. To further compress the index, > along with prefix compression, the stat data is hashed, since > it's only used for comparison, but the plain data is never used. s/comparison/equality comparison/ perhaps? > Thanks to Michael Haggerty, Nguyen Thai Ngoc Duy, Thomas Rast > and Robin Rosenberg for feedback. > - Read the index format format and translate it to the current in s/format format/on-disk file format/ or something? > memory format. This doesn't include reading any of the current > extensions, which are now part of the main index. The code again > is on github. [4] Thanks for reviewing the first steps to Thomas > Rast. > - Started implementing the writer, which extracts the directories from > the in-memory format, and writes the header and the directories to > disk. > - I found a few bugs in the algorithm for extracting the directories > and decided to completely rewrite it, using a hash table instead of > simple lists, since the old one would have to many corner cases to > handle. What does "the algorithm" refer to? Is it the one described in the previous bullet point, or is it the code in production? If latter, it would help to separate out the task to fix the breakage, as people with the current or previous versions of Git will be negatively affected until that bug is fixed. If former, I am not sure if this task needs to be described in two bullet points ("I did X, X had bug so I redid X in a different way" is still a single task to do X). > == Work done int the last week == > > - Polished the patch for the ce_namelen field. The thread for the > patch can be found at [5]. Thanks for this one; I think it is ready for 'next', but if you are still not satisfied I do not mind waiting for further perfection. -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[GSoC] Designing a faster index format - Progress report week 13
== Work done in the previous 12 weeks == - Definition of a tentative index file v5 format [1]. This differs from the proposal in making it possible to bisect the directory entries and file entries, to do a binary search. The exact bits for each section were also defined. To further compress the index, along with prefix compression, the stat data is hashed, since it's only used for comparison, but the plain data is never used. Thanks to Michael Haggerty, Nguyen Thai Ngoc Duy, Thomas Rast and Robin Rosenberg for feedback. - Prototype of a converter from the index format v2/v3 to the index format v5. [2] The converter reads the index from a git repository, can output parts of the index (header, index entries as in git ls-files --debug, cache tree as in test-dump-cache-tree, or the reuc data). Then it writes the v5 index file format to .git/index-v5. Thanks to Michael Haggerty for the code review. - Prototype of a reader for the new index file format. [3] The reader has mainly the purpose to show the algorithm used to read the index lexicographically sorted after the full name which is required by the current internal memory format. Big thanks for reviewing this code and giving me advice on refactoring goes to Michael Haggerty. - Read the index format format and translate it to the current in memory format. This doesn't include reading any of the current extensions, which are now part of the main index. The code again is on github. [4] Thanks for reviewing the first steps to Thomas Rast. - Read the cache-tree data (formerly an extension, now it's integrated with the rest of the directory data) from the new ondisk format. There are still a few optimizations to do in this algorithm. - Started implementing the API (suggested by Duy), but it's still in the very early stages. There is one commit for this on GitHub [1], but it's a very early work in progress. - Started implementing the writer, which extracts the directories from the in-memory format, and writes the header and the directories to disk. - I found a few bugs in the algorithm for extracting the directories and decided to completely rewrite it, using a hash table instead of simple lists, since the old one would have to many corner cases to handle. - Implemented writing the file block to disk, and basic tests from the test suite are running fine, not including tests that require conflicted data or the cache-tree to work, which both are not implemented yet. - Started implementing a patch to introduce a ce_namelen field in struct cache_entry and drop the name length from the flags. [5] == Work done int the last week == - Polished the patch for the ce_namelen field. The thread for the patch can be found at [5]. Again thanks to Junio, Duy and Thomas for reviewing it and giving me suggestions for improving it. - Implemented the cache-tree and conflict data writing to the index-v5 file. == Outlook for the next week == - There are still a few bugs in the conflict writing, which will be fixed, to make the test suite pass with index-v5. - Once the test suite passes, the code still needs to be refactored and optimized. - If the two points above go well, I'll continue working on the api that Duy suggested. [1] https://github.com/tgummerer/git/wiki/Index-file-format-v5 [2] https://github.com/tgummerer/git/blob/pythonprototype/git-convert-index.py [3] https://github.com/tgummerer/git/blob/pythonprototype/git-read-index-v5.py [4] https://github.com/tgummerer/git/tree/index-v5 [5] http://thread.gmane.org/gmane.comp.version-control.git/200997 -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html