subject:"\"\\\[GSoC\\\] Designing a faster index format \\\- Progress report week 13\""

Re: [GSoC] Designing a faster index format - Progress report week 13

2012-07-24 Thread Thomas Rast

Junio C Hamano  writes:

> Thomas Rast  writes:
>
>> Junio's index-v4 was a speed boost mainly because it cuts down on the
>> size of the index.  Do we want to throw that out?
>
> That's pretty much orthogonal, isn't it?
>
> The index-v4 is merely to show how a stupid prefix compression of
> pathnames without nothing else would reduce the file size and I/O
> cost, in order to set the bar for anything more clever than that.
>
> I thought that this discussion is about keeping, squishing, or
> discarding part of the cached stat info, and nobody is suggesting
> what to do with the prefix compression of pathnames.

True, sorry for being so confusing.  'Throw "that" out' was meant to
refer to the observation that smaller indexes are generally faster.

Whether this matters in the long run is another question.  Perhaps
partial loading combined with something like inotify can avoid full
reads in most operations.

-- 
Thomas Rast
trast@{inf,student}.ethz.ch
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [GSoC] Designing a faster index format - Progress report week 13

2012-07-24 Thread Junio C Hamano

Thomas Rast  writes:

> Junio's index-v4 was a speed boost mainly because it cuts down on the
> size of the index.  Do we want to throw that out?

That's pretty much orthogonal, isn't it?

The index-v4 is merely to show how a stupid prefix compression of
pathnames without nothing else would reduce the file size and I/O
cost, in order to set the bar for anything more clever than that.

I thought that this discussion is about keeping, squishing, or
discarding part of the cached stat info, and nobody is suggesting
what to do with the prefix compression of pathnames.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [GSoC] Designing a faster index format - Progress report week 13

2012-07-24 Thread Thomas Rast

Robin Rosenberg  writes:

> Junio C Hamano skrev 2012-07-22 23.08:
>> Thomas Rast  writes:
>>
>>> What is the status quo?  I take it JGit does not have any of ctime, dev,
>>> ino etc., and either leaves the existing value or puts a 0
>>> an argument in favor of splitting stat_crc into its fields again?
>>
>> A difference is that JGit already has such code, and we would be
>> adding a burden to do so yet again.  It also may not just be JGit,
>> but anything that wants to be "compatible" with systems whose
>> filesystem interface does not give enough data by omitting fields
>> the current index pays attention to.
>>
>> It isn't really a discussion about splitting again, but more about
>> not squishing them into a new field in the first place---IIUC, even
>> outside Windows, ctime is already problematic on some systems where
>> background processes muck with extended attributes Git does not pay
>> attention to. If the patch makes us lose the ability to selectively
>> ignore changes to certain fields (e.g. changes to dev and ino are
>> noticed but ctime are ignored) by squishing them into one new field,
>> wouldn't removing them without adding such a useless field a simpler
>> way to go?
> 
> I wasnt't thinking of splitting, but now I read it again, I do think
> it should split.

Aren't you two going off in different directions?  I read Junio as
implying that if size/ctime/dev/ino are a pain to deal with, we should
just drop them altogether.  You seem to be saying the opposite:

> Having size accessible is a good thing, and even
> better if it a 64-bit value so we don't have the modulo-4G problem
> when looking at it. Current size is 4G + 33 bytes, index says 33. Did
> the
> file change or not?
>
> Having access to size make the need for actually
> invoking the racy git logic and comparing file content less likely.

I don't think this is true.  Racy git logic is needed every time that
the file *looks* unchanged, but isn't.  In the case where the file is
certified (by mtime) unchanged, we don't go checking.  But in the case
where it *looks* changed, we still have to go and read it to know if,
perhaps, the only thing the user did was hit "save" again.

Not to mention that this really hurts in terms of index size.  Our
benchmark for index-v5 is the Webkit project, which stands at 180k
files.  So every 6B/entry is about an MB of final size, which needs to
be loaded, hashed (or crc'd), then hashed/crc'd again and written.
Junio's index-v4 was a speed boost mainly because it cuts down on the
size of the index.  Do we want to throw that out?

-- 
Thomas Rast
trast@{inf,student}.ethz.ch
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [GSoC] Designing a faster index format - Progress report week 13

2012-07-23 Thread Robin Rosenberg


Junio C Hamano skrev 2012-07-22 23.08:

Thomas Rast  writes:


Hum, I'm a bit lost now.

What is the status quo?  I take it JGit does not have any of ctime, dev,
ino etc., and either leaves the existing value or puts a 0
an argument in favor of splitting stat_crc into its fields again?


A difference is that JGit already has such code, and we would be
adding a burden to do so yet again.  It also may not just be JGit,
but anything that wants to be "compatible" with systems whose
filesystem interface does not give enough data by omitting fields
the current index pays attention to.

It isn't really a discussion about splitting again, but more about
not squishing them into a new field in the first place---IIUC, even
outside Windows, ctime is already problematic on some systems where
background processes muck with extended attributes Git does not pay
attention to. If the patch makes us lose the ability to selectively
ignore changes to certain fields (e.g. changes to dev and ino are
noticed but ctime are ignored) by squishing them into one new field,
wouldn't removing them without adding such a useless field a simpler
way to go?


I wasnt't thinking of splitting, but now I read it again, I do think
it should split. Having size accessible is a good thing, and even better 
if it a 64-bit value so we don't have the modulo-4G problem when looking 
at it. Current size is 4G + 33 bytes, index says 33. Did the

file change or not?

Having access to size make the need for actually
invoking the racy git logic and comparing file content less likely.

As for ctime it is accessible in Java7, though everyone aren't using it 
and JGit code has to run on Java5. An idea is to make an optional 
component, but that doesn't make ctime available everywhere.


-- robin

--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [GSoC] Designing a faster index format - Progress report week 13

2012-07-22 Thread Junio C Hamano

Thomas Rast  writes:

> Hum, I'm a bit lost now.
>
> What is the status quo?  I take it JGit does not have any of ctime, dev,
> ino etc., and either leaves the existing value or puts a 0
> an argument in favor of splitting stat_crc into its fields again?

A difference is that JGit already has such code, and we would be
adding a burden to do so yet again.  It also may not just be JGit,
but anything that wants to be "compatible" with systems whose
filesystem interface does not give enough data by omitting fields
the current index pays attention to.

It isn't really a discussion about splitting again, but more about
not squishing them into a new field in the first place---IIUC, even
outside Windows, ctime is already problematic on some systems where
background processes muck with extended attributes Git does not pay
attention to. If the patch makes us lose the ability to selectively
ignore changes to certain fields (e.g. changes to dev and ino are
noticed but ctime are ignored) by squishing them into one new field,
wouldn't removing them without adding such a useless field a simpler
way to go?
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [GSoC] Designing a faster index format - Progress report week 13

2012-07-22 Thread Thomas Rast

Junio C Hamano  writes:

> Robin Rosenberg  writes:
>
>> A note on how JGit would work here. Java has none of the fields
>> that constitute statcrc. I guess we would write zero here when
>> creating new entries. Git could recognize that when checking status
>> and simply assume "clean" unless mtime or st_size says otherwise.
>
> Even though it may not be the end of the world, that is certainly
> bad. Recording the constituent fields separately without the statcrc
> microoptimization, thereby not shaving a handful of bytes per the
> index entry, is not the end of the world either in the same sense,
> which leads us to question the benefit we would be getting from such
> a change.

Hum, I'm a bit lost now.

What is the status quo?  I take it JGit does not have any of ctime, dev,
ino etc., and either leaves the existing value or puts a 0.  Which is
not different from either leaving the stat crc in place, or putting a 0.
Except that IIUC, putting a 0 in both cases means forcing a refresh once
C git comes along (or some other reader that knows about the fields).

So if we want to keep the safety net, a magic "I don't know" value would
indeed be a good idea.  But I don't see how what Robin said constitutes
an argument in favor of splitting stat_crc into its fields again?

-- 
Thomas Rast
trast@{inf,student}.ethz.ch
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [GSoC] Designing a faster index format - Progress report week 13

2012-07-22 Thread Junio C Hamano

Robin Rosenberg  writes:

> A note on how JGit would work here. Java has none of the fields
> that constitute statcrc. I guess we would write zero here when
> creating new entries. Git could recognize that when checking status
> and simply assume "clean" unless mtime or st_size says otherwise.

Even though it may not be the end of the world, that is certainly
bad. Recording the constituent fields separately without the statcrc
microoptimization, thereby not shaving a handful of bytes per the
index entry, is not the end of the world either in the same sense,
which leads us to question the benefit we would be getting from such
a change.

--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [GSoC] Designing a faster index format - Progress report week 13

2012-07-22 Thread Robin Rosenberg



A note on how JGit would work here. Java has none of the fields
that constitute statcrc. I guess we would write zero here when
creating new entries. Git could recognize that when checking status and 
simply assume "clean" unless mtime or st_size says otherwise.


For existing entries JGit could either keep the old value (which is
probably a lie since) or zero it.

A modification to the spec would be that 0 == not set.

-- robin
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [GSoC] Designing a faster index format - Progress report week 13

2012-07-17 Thread Thomas Gummerer

Thanks Junio for reading the progress report, this is just
corrected version without the errors that he pointed out.

== Work done in the previous 12 weeks ==

- Definition of a tentative index file v5 format [1]. This differs
  from the proposal in making it possible to bisect the directory
  entries and file entries, to do a binary search. The exact bits
  for each section were also defined. To further compress the index,
  along with prefix compression, the stat data is hashed, since
  it's only used for equality comparison, but the plain data is
  never used.
  Thanks to Michael Haggerty, Nguyen Thai Ngoc Duy, Thomas Rast
  and Robin Rosenberg for feedback.
- Prototype of a converter from the index format v2/v3 to the index
  format v5. [2] The converter reads the index from a git repository,
  can output parts of the index (header, index entries as in
  git ls-files --debug, cache tree as in test-dump-cache-tree, or
  the reuc data). Then it writes the v5 index file format to
  .git/index-v5. Thanks to Michael Haggerty for the code review.
- Prototype of a reader for the new index file format. [3] The
  reader has mainly the purpose to show the algorithm used to read
  the index lexicographically sorted after the full name which is
  required by the current internal memory format. Big thanks for
  reviewing this code and giving me advice on refactoring goes
  to Michael Haggerty.
- Read the on-disk index file format and translate it to the current
  in memory format. This doesn't include reading any of the current
  extensions, which are now part of the main index. The code again
  is on github. [4] Thanks for reviewing the first steps to Thomas
  Rast.
- Read the cache-tree data (formerly an extension, now it's integrated
  with the rest of the directory data) from the new ondisk format.
  There are still a few optimizations to do in this algorithm.
- Started implementing the API (suggested by Duy), but it's still
  in the very early stages. There is one commit for this on GitHub [1],
  but it's a very early work in progress.
- Started implementing the writer, which extracts the directories from
  the in-memory format, and writes the header and the directories to
  disk. The algorithm uses a hash-table instead of a simple list,
  to avoid many corner cases.
- Implemented writing the file block to disk, and basic tests from the
  test suite are running fine, not including tests that require
  conflicted data or the cache-tree to work, which both are not
  implemented yet.
- Started implementing a patch to introduce a ce_namelen field in
  struct cache_entry and drop the name length from the flags. [5]

== Work done int the last week ==

- Polished the patch for the ce_namelen field. The thread for the
  patch can be found at [5]. Again thanks to Junio, Duy and Thomas
  for reviewing it and giving me suggestions for improving it.
- Implemented the cache-tree and conflict data writing to the
  index-v5 file.
  
== Outlook for the next week ==

- There are still a few bugs in the conflict writing, which will
  be fixed, to make the test suite pass with index-v5.
- Once the test suite passes, the code still needs to be refactored
  and optimized.
- If the two points above go well, I'll continue working on the api
  that Duy suggested.

[1] https://github.com/tgummerer/git/wiki/Index-file-format-v5
[2] https://github.com/tgummerer/git/blob/pythonprototype/git-convert-index.py
[3] https://github.com/tgummerer/git/blob/pythonprototype/git-read-index-v5.py
[4] https://github.com/tgummerer/git/tree/index-v5
[5] http://thread.gmane.org/gmane.comp.version-control.git/200997
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [GSoC] Designing a faster index format - Progress report week 13

2012-07-17 Thread Thomas Gummerer



On 07/16, Junio C Hamano wrote:
> Thomas Gummerer  writes:
> 
> > == Work done in the previous 12 weeks ==
> >
> > - Definition of a tentative index file v5 format [1]. This differs
> >   from the proposal in making it possible to bisect the directory
> >   entries and file entries, to do a binary search. The exact bits
> >   for each section were also defined. To further compress the index,
> >   along with prefix compression, the stat data is hashed, since
> >   it's only used for comparison, but the plain data is never used.
> 
> s/comparison/equality comparison/ perhaps?
> 

Exactly, thanks.

> >   Thanks to Michael Haggerty, Nguyen Thai Ngoc Duy, Thomas Rast
> >   and Robin Rosenberg for feedback.
> 
> > - Read the index format format and translate it to the current in
> 
> s/format format/on-disk file format/ or something?
>

Yes, thanks.

> >   memory format. This doesn't include reading any of the current
> >   extensions, which are now part of the main index. The code again
> >   is on github. [4] Thanks for reviewing the first steps to Thomas
> >   Rast.
> 
> > - Started implementing the writer, which extracts the directories from
> >   the in-memory format, and writes the header and the directories to
> >   disk.
> > - I found a few bugs in the algorithm for extracting the directories
> >   and decided to completely rewrite it, using a hash table instead of
> >   simple lists, since the old one would have to many corner cases to
> >   handle.
> 
> What does "the algorithm" refer to?  Is it the one described in the
> previous bullet point, or is it the code in production?  If latter,
> it would help to separate out the task to fix the breakage, as
> people with the current or previous versions of Git will be
> negatively affected until that bug is fixed.  If former, I am not
> sure if this task needs to be described in two bullet points ("I did
> X, X had bug so I redid X in a different way" is still a single task
> to do X).

It refers to the algorithm in the previous bullet point, which
extracts the directories, and can be included in the above bullet
point. Sorry for the confusion.

> > == Work done int the last week ==
> >
> > - Polished the patch for the ce_namelen field. The thread for the
> >   patch can be found at [5].
> 
> Thanks for this one; I think it is ready for 'next', but if you are
> still not satisfied I do not mind waiting for further perfection.

Thanks, I'm satisfied with it, for me it can be merged to 'next'.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [GSoC] Designing a faster index format - Progress report week 13

2012-07-16 Thread Junio C Hamano

Thomas Gummerer  writes:

> == Work done in the previous 12 weeks ==
>
> - Definition of a tentative index file v5 format [1]. This differs
>   from the proposal in making it possible to bisect the directory
>   entries and file entries, to do a binary search. The exact bits
>   for each section were also defined. To further compress the index,
>   along with prefix compression, the stat data is hashed, since
>   it's only used for comparison, but the plain data is never used.

s/comparison/equality comparison/ perhaps?

>   Thanks to Michael Haggerty, Nguyen Thai Ngoc Duy, Thomas Rast
>   and Robin Rosenberg for feedback.

> - Read the index format format and translate it to the current in

s/format format/on-disk file format/ or something?

>   memory format. This doesn't include reading any of the current
>   extensions, which are now part of the main index. The code again
>   is on github. [4] Thanks for reviewing the first steps to Thomas
>   Rast.

> - Started implementing the writer, which extracts the directories from
>   the in-memory format, and writes the header and the directories to
>   disk.
> - I found a few bugs in the algorithm for extracting the directories
>   and decided to completely rewrite it, using a hash table instead of
>   simple lists, since the old one would have to many corner cases to
>   handle.

What does "the algorithm" refer to?  Is it the one described in the
previous bullet point, or is it the code in production?  If latter,
it would help to separate out the task to fix the breakage, as
people with the current or previous versions of Git will be
negatively affected until that bug is fixed.  If former, I am not
sure if this task needs to be described in two bullet points ("I did
X, X had bug so I redid X in a different way" is still a single task
to do X).

> == Work done int the last week ==
>
> - Polished the patch for the ce_namelen field. The thread for the
>   patch can be found at [5].

Thanks for this one; I think it is ready for 'next', but if you are
still not satisfied I do not mind waiting for further perfection.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[GSoC] Designing a faster index format - Progress report week 13

2012-07-16 Thread Thomas Gummerer

== Work done in the previous 12 weeks ==

- Definition of a tentative index file v5 format [1]. This differs
  from the proposal in making it possible to bisect the directory
  entries and file entries, to do a binary search. The exact bits
  for each section were also defined. To further compress the index,
  along with prefix compression, the stat data is hashed, since
  it's only used for comparison, but the plain data is never used.
  Thanks to Michael Haggerty, Nguyen Thai Ngoc Duy, Thomas Rast
  and Robin Rosenberg for feedback.
- Prototype of a converter from the index format v2/v3 to the index
  format v5. [2] The converter reads the index from a git repository,
  can output parts of the index (header, index entries as in
  git ls-files --debug, cache tree as in test-dump-cache-tree, or
  the reuc data). Then it writes the v5 index file format to
  .git/index-v5. Thanks to Michael Haggerty for the code review.
- Prototype of a reader for the new index file format. [3] The
  reader has mainly the purpose to show the algorithm used to read
  the index lexicographically sorted after the full name which is
  required by the current internal memory format. Big thanks for
  reviewing this code and giving me advice on refactoring goes
  to Michael Haggerty.
- Read the index format format and translate it to the current in
  memory format. This doesn't include reading any of the current
  extensions, which are now part of the main index. The code again
  is on github. [4] Thanks for reviewing the first steps to Thomas
  Rast.
- Read the cache-tree data (formerly an extension, now it's integrated
  with the rest of the directory data) from the new ondisk format.
  There are still a few optimizations to do in this algorithm.
- Started implementing the API (suggested by Duy), but it's still
  in the very early stages. There is one commit for this on GitHub [1],
  but it's a very early work in progress.
- Started implementing the writer, which extracts the directories from
  the in-memory format, and writes the header and the directories to
  disk.
- I found a few bugs in the algorithm for extracting the directories
  and decided to completely rewrite it, using a hash table instead of
  simple lists, since the old one would have to many corner cases to
  handle.
- Implemented writing the file block to disk, and basic tests from the
  test suite are running fine, not including tests that require
  conflicted data or the cache-tree to work, which both are not
  implemented yet.
- Started implementing a patch to introduce a ce_namelen field in
  struct cache_entry and drop the name length from the flags. [5]

== Work done int the last week ==

- Polished the patch for the ce_namelen field. The thread for the
  patch can be found at [5]. Again thanks to Junio, Duy and Thomas
  for reviewing it and giving me suggestions for improving it.
- Implemented the cache-tree and conflict data writing to the
  index-v5 file.
  
== Outlook for the next week ==

- There are still a few bugs in the conflict writing, which will
  be fixed, to make the test suite pass with index-v5.
- Once the test suite passes, the code still needs to be refactored
  and optimized.
- If the two points above go well, I'll continue working on the api
  that Duy suggested.

[1] https://github.com/tgummerer/git/wiki/Index-file-format-v5
[2] https://github.com/tgummerer/git/blob/pythonprototype/git-convert-index.py
[3] https://github.com/tgummerer/git/blob/pythonprototype/git-read-index-v5.py
[4] https://github.com/tgummerer/git/tree/index-v5
[5] http://thread.gmane.org/gmane.comp.version-control.git/200997
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [GSoC] Designing a faster index format - Progress report week 13

Re: [GSoC] Designing a faster index format - Progress report week 13

Re: [GSoC] Designing a faster index format - Progress report week 13

Re: [GSoC] Designing a faster index format - Progress report week 13

Re: [GSoC] Designing a faster index format - Progress report week 13

Re: [GSoC] Designing a faster index format - Progress report week 13

Re: [GSoC] Designing a faster index format - Progress report week 13

Re: [GSoC] Designing a faster index format - Progress report week 13

Re: [GSoC] Designing a faster index format - Progress report week 13

Re: [GSoC] Designing a faster index format - Progress report week 13

Re: [GSoC] Designing a faster index format - Progress report week 13

[GSoC] Designing a faster index format - Progress report week 13

12 matches

Site Navigation

Mail list logo

Footer information