[pacman-dev] Repository management

2017-05-09 Thread Allan McRae
Hi all,

Every time I attempt to work on repo-add, I find it to be a very
difficult endeavour.  Even though it is half the size of makepkg
(without even including any of libmakepkg), it is much more convoluted
to work on.

We also have a weird repository database system.  We have:
- .db dbs with package information, signatures and delta information
- .files dbs that are the same as .db dbs but additionally include filelists

There are two reasons the .files dbs replicate all information in the
.db dbs
 - .db and .files dbs getting out of sync could cause issues
 - a complete database is useful for things like archweb, mostly to
avoid the above

I would also like to include information on source packages to these
databases.  The files information is separate due to wanting our primary
database to be small.  Likewise, source package information needs to be
separate (the signatures take most of the size in the .db dbs, so adding
source package signatures effectively doubles the size).

So two points up for discussion:


1) Sync repository layout?  I don't see any point in leaving the tar
based format, as reading of sync databases is not a bottleneck.  (The
local db format can be a bottleneck, but that is a separate discussion...)

Do we split the information in .db out of .files and add a .full db with
complete information?  Then any .src db could follow suit and just have
source package information.  How do we get around the out of sync issue
(e.g., a package is removed from .db, but we have an old .files database
with it).  Do we add timestamps, and print a warning on -F operations
when the two are out of sync?


2) Do we need a better (read "more easily maintainable") tool for
handling database generation and updates?  libalpm can already read in
information from package files, so we could add libalpm/db_write.c with the
database creation functions.   Should we unify our repo format with our
local database format which we already write?


I am looking for ideas here.  Please brainstorm to your heart's content.

Cheers,
Allan


Re: [pacman-dev] Repository management

2017-05-10 Thread Dave Reisner
On Tue, May 09, 2017 at 10:54:44PM +1000, Allan McRae wrote:
> Hi all,
> 
> Every time I attempt to work on repo-add, I find it to be a very
> difficult endeavour.  Even though it is half the size of makepkg
> (without even including any of libmakepkg), it is much more convoluted
> to work on.
> 
> We also have a weird repository database system.  We have:
> - .db dbs with package information, signatures and delta information
> - .files dbs that are the same as .db dbs but additionally include filelists
> 
> There are two reasons the .files dbs replicate all information in the
> .db dbs
>  - .db and .files dbs getting out of sync could cause issues
>  - a complete database is useful for things like archweb, mostly to
> avoid the above
> 
> I would also like to include information on source packages to these
> databases.  The files information is separate due to wanting our primary
> database to be small.  Likewise, source package information needs to be
> separate (the signatures take most of the size in the .db dbs, so adding
> source package signatures effectively doubles the size).
> 
> So two points up for discussion:
> 
> 
> 1) Sync repository layout?  I don't see any point in leaving the tar
> based format, as reading of sync databases is not a bottleneck.  (The
> local db format can be a bottleneck, but that is a separate discussion...)

Isn't this a historical reversal? IIRC, the sync DBs used to be expanded
onto disk, and we decided to leave them as tarballs to address
performance/fragmentation concerns.

> Do we split the information in .db out of .files and add a .full db with
> complete information?  Then any .src db could follow suit and just have
> source package information.  How do we get around the out of sync issue
> (e.g., a package is removed from .db, but we have an old .files database
> with it).  Do we add timestamps, and print a warning on -F operations
> when the two are out of sync?
> 
> 
> 2) Do we need a better (read "more easily maintainable") tool for
> handling database generation and updates?  libalpm already can read in
> information package files, so we could add libalpm/db_write.c with the
> database creation functions.   Should we unify our repo format with our
> local database format which we already write?
> 

I'd urge you not to make this a part of pacman. It's too far off the
beaten path for most users to make it a part of an already complicated
tool.

> 
> I am looking for ideas here.  Please brainstorm to your heart's content.

WRT replacing repo-add, I'd suggest we come up with the use cases we
want to support, design an interface to meet them, and then come up with
the implementation. Might be nice to start with the Arch Linux
repository layout as an example that we'd want to support (pooled
packages with symlinks into repo dirs).

> Cheers,
> Allan


Re: [pacman-dev] Repository management

2017-05-10 Thread Allan McRae
On 11/05/17 02:54, Dave Reisner wrote:
> On Tue, May 09, 2017 at 10:54:44PM +1000, Allan McRae wrote:
>> Hi all,
>>
>> Every time I attempt to work on repo-add, I find it to be a very
>> difficult endeavour.  Even though it is half the size of makepkg
>> (without even including any of libmakepkg), it is much more convoluted
>> to work on.
>>
>> We also have a weird repository database system.  We have:
>> - .db dbs with package information, signatures and delta information
>> - .files dbs that are the same as .db dbs but additionally include filelists
>>
>> There are two reasons the .files dbs replicate all information in the
>> .db dbs
>>  - .db and .files dbs getting out of sync could cause issues
>>  - a complete database is useful for things like archweb, mostly to
>> avoid the above
>>
>> I would also like to include information on source packages to these
>> databases.  The files information is separate due to wanting our primary
>> database to be small.  Likewise, source package information needs to be
>> separate (the signatures take most of the size in the .db dbs, so adding
>> source package signatures effectively doubles the size).
>>
>> So two points up for discussion:
>>
>>
>> 1) Sync repository layout?  I don't see any point in leaving the tar
>> based format, as reading of sync databases is not a bottleneck.  (The
>> local db format can be a bottleneck, but that is a separate discussion...)
> 
> Isn't this a historical reversal? IIRC, the sync DBs used to be expanded
> onto disk, and we decided to leave them as tarballs to address
> performance/fragmentation concerns.

To be clear, I was saying to stay tar based and not to move to something
else.

>> Do we split the information in .db out of .files and add a .full db with
>> complete information?  Then any .src db could follow suit and just have
>> source package information.  How do we get around the out of sync issue
>> (e.g., a package is removed from .db, but we have an old .files database
>> with it).  Do we add timestamps, and print a warning on -F operations
>> when the two are out of sync?
>>
>>
>> 2) Do we need a better (read "more easily maintainable") tool for
>> handling database generation and updates?  libalpm already can read in
>> information package files, so we could add libalpm/db_write.c with the
>> database creation functions.   Should we unify our repo format with our
>> local database format which we already write?
>>
> 
> I'd urge you not to make this a part of pacman. It's too far off the
> beaten path for most users to make it a part of an already complicated
> tool.
> 

Definitely not part of pacman.  I was suggesting another program with a
libalpm backend.

>>
>> I am looking for ideas here.  Please brainstorm to your heart's content.
> 
> WRT replacing repo-add, I'd suggest we come up with the use cases we
> want to support, design an interface to meet them, and then come up with
> the implementation. Might be nice to start with the Arch Linux
> repository layout as an example that we'd want to support (pooled
> packages with symlinks into repo dirs).
> 
>> Cheers,
>> Allan


Re: [pacman-dev] Repository management

2017-05-10 Thread Andrew Gregory
On 05/09/17 at 10:54pm, Allan McRae wrote:
> Hi all,
> 
> Every time I attempt to work on repo-add, I find it to be a very
> difficult endeavour.  Even though it is half the size of makepkg
> (without even including any of libmakepkg), it is much more convoluted
> to work on.
> 
> We also have a weird repository database system.  We have:
> - .db dbs with package information, signatures and delta information
> - .files dbs that are the same as .db dbs but additionally include filelists
> 
> There are two reasons the .files dbs replicate all information in the
> .db dbs
>  - .db and .files dbs getting out of sync could cause issues
>  - a complete database is useful for things like archweb, mostly to
> avoid the above
> 
> I would also like to include information on source packages to these
> databases.  The files information is separate due to wanting our primary
> database to be small.  Likewise, source package information needs to be
> separate (the signatures take most of the size in the .db dbs, so adding
> source package signatures effectively doubles the size).
> 
> So two points up for discussion:
> 
> 
> 1) Sync repository layout?  I don't see any point in leaving the tar
> based format, as reading of sync databases is not a bottleneck.  (The
> local db format can be a bottleneck, but that is a separate discussion...)
> 
> Do we split the information in .db out of .files and add a .full db with
> complete information?  Then any .src db could follow suit and just have
> source package information.  How do we get around the out of sync issue
> (e.g., a package is removed from .db, but we have an old .files database
> with it).  Do we add timestamps, and print a warning on -F operations
> when the two are out of sync?
 
What about just not including the signature in the database?  Make the
inclusion of the signature optional and have pacman (or whatever
downloads the source package) also look for a corresponding .sig file
if it's not in the db.  pacman -U already looks for a .sig file when
downloading a package and you have a feature request to download .sig
files even with -S, so code-wise this seems like a pretty clean
solution. Then you can include the source information right in the
primary DB and Arch's devtools can opt to omit the signature from the
db.
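
A minimal Python sketch of the lookup order suggested above: use the
signature embedded in the database entry when there is one, otherwise fall
back to a detached .sig file next to the package.  The function name, the
dict-shaped database entry and the "PGPSIG" key holding base64 text are
assumptions made for illustration; this is not libalpm code.

```python
import base64
from pathlib import Path
from typing import Optional


def find_signature(desc: dict, pkg_path: Path) -> Optional[bytes]:
    """Return signature bytes for a package, or None if none can be found.

    desc is a hypothetical parsed database entry; pkg_path is the
    downloaded package file.
    """
    # 1. Prefer a signature embedded in the database entry (optional).
    embedded = desc.get("PGPSIG")
    if embedded:
        return base64.b64decode(embedded)

    # 2. Otherwise look for a detached signature beside the package.
    sig_path = pkg_path.with_name(pkg_path.name + ".sig")
    if sig_path.is_file():
        return sig_path.read_bytes()

    # 3. Nothing found; the caller decides whether that is an error.
    return None
```

With this ordering a repository can omit signatures from the database
without changing how callers ask for them.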
 
> 2) Do we need a better (read "more easily maintainable") tool for
> handling database generation and updates?  libalpm already can read in
> information package files, so we could add libalpm/db_write.c with the
> database creation functions.   Should we unify our repo format with our
> local database format which we already write?

I would love to see us drop the ini-style .PKGINFO format, if that's
what you mean.  Even without adding a database writer to libalpm,
having two formats for the exact same data is unnecessary and leads to
inconsistencies between the two.

apg


Re: [pacman-dev] Repository management

2017-05-15 Thread Xyne
On 2017-05-09 22:54 +1000
Allan McRae wrote:

>I am looking for ideas here.  Please brainstorm to your heart's content.

Ok :)


>So two points up for discussion:
>
>
>1) Sync repository layout?  I don't see any point in leaving the tar
>based format, as reading of sync databases is not a bottleneck.  (The
>local db format can be a bottleneck, but that is a separate discussion...)
>
>Do we split the information in .db out of .files and add a .full db with
>complete information?  Then any .src db could follow suit and just have
>source package information.  How do we get around the out of sync issue
>(e.g., a package is removed from .db, but we have an old .files database
>with it).  Do we add timestamps, and print a warning on -F operations
>when the two are out of sync?

Add a timestamp inside each database (*.db, *.files, *.src). When pacman
downloads a database, instead of saving it as <repo>.<ext> and squashing the
previous database, save it as <repo>-<timestamp>.<ext>. Each refresh operation
(pacman -Sy, pacman -Fy) is associated with a particular database (*.db and
*.files, respectively). Create an untimestamped symlink to that database, e.g.

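$ pacman -Sy...
# retrieve <repo>.db and save as <repo>-<timestamp>.db
# ln -s <repo>-<timestamp>.db <repo>.db

$ pacman -Fy...
# retrieve <repo>.db and save as <repo>-<timestamp>.db
# retrieve <repo>.files and save as <repo>-<timestamp>.files
# ln -s <repo>-<timestamp>.files <repo>.files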

# something similar for *.src files

For operations that only involve the current <repo>.db files, no change is
needed for loading the database.

For loading <repo>.files, you will need to dereference <repo>.files first,
grab <timestamp> from <repo>-<timestamp>.files in the example above, and
then use it to load <repo>-<timestamp>.db instead of <repo>.db. Same method
for *.src files.
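
The scheme above could be sketched in Python roughly as follows.  This is
only an illustration: the <repo>-<timestamp>.<ext> naming, the function names
and the use of an integer timestamp are assumptions, and the timestamp would
really come from inside the downloaded database.

```python
import os
from pathlib import Path


def save_timestamped(sync_dir: Path, repo: str, ext: str,
                     timestamp: int, data: bytes) -> Path:
    """Store a downloaded database as <repo>-<timestamp>.<ext>."""
    target = sync_dir / f"{repo}-{timestamp}.{ext}"
    target.write_bytes(data)
    return target


def point_symlink(sync_dir: Path, repo: str, ext: str, target: Path) -> None:
    """Repoint the untimestamped <repo>.<ext> symlink at target."""
    link = sync_dir / f"{repo}.{ext}"
    tmp = sync_dir / f".{repo}.{ext}.tmp"
    if tmp.is_symlink() or tmp.exists():
        tmp.unlink()
    os.symlink(target.name, tmp)  # relative link inside sync_dir
    os.replace(tmp, link)         # rename over the old link in one step


def timestamp_of(sync_dir: Path, repo: str, ext: str) -> int:
    """Dereference <repo>.<ext> and read the timestamp from the target name."""
    name = os.path.basename(os.readlink(sync_dir / f"{repo}.{ext}"))
    stem = name[: -len(ext) - 1]  # strip the trailing ".<ext>"
    return int(stem.rsplit("-", 1)[1])


def db_for_files(sync_dir: Path, repo: str) -> Path:
    """Return the .db snapshot that matches the current .files symlink."""
    ts = timestamp_of(sync_dir, repo, "files")
    return sync_dir / f"{repo}-{ts}.db"
```

After a -Fy at one timestamp and a later -Sy, db_for_files() still resolves
to the older .db snapshot, so file searches keep seeing a consistent pair.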

For cleanup of the timestamped files, collect the valid timestamps from the
untimestamped symlinks and then remove anything that doesn't match them. This
should probably be done with each database refresh. Maybe you can use the same
function that you use to clean up the package cache with -Sc while leaving
installed packages.
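
A rough Python sketch of that cleanup pass, whichever command ends up
triggering it.  It assumes snapshots are named <repo>-<timestamp>.<ext> with
a numeric timestamp and that only db, files and src extensions exist; both
the naming and the function are hypothetical.

```python
import os
import re
from pathlib import Path

# Assumed snapshot naming: <repo>-<timestamp>.<ext>
SNAPSHOT = re.compile(r"^(?P<repo>.+)-(?P<ts>\d+)\.(?P<ext>db|files|src)$")


def cleanup_snapshots(sync_dir: Path) -> list:
    """Remove snapshots whose (repo, timestamp) no symlink refers to.

    Returns the sorted names of the files that were removed.
    """
    entries = list(sync_dir.iterdir())

    # Collect the valid (repo, timestamp) pairs from the symlink targets.
    valid = set()
    for entry in entries:
        if entry.is_symlink():
            m = SNAPSHOT.match(os.path.basename(os.readlink(entry)))
            if m:
                valid.add((m["repo"], m["ts"]))

    # Remove every timestamped file that does not match a valid pair.
    removed = []
    for entry in entries:
        if entry.is_symlink():
            continue
        m = SNAPSHOT.match(entry.name)
        if m and (m["repo"], m["ts"]) not in valid:
            entry.unlink()
            removed.append(entry.name)
    return sorted(removed)
```

Matching on the timestamp rather than on the exact link target keeps the .db
snapshot that belongs to an older .files link.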

Obviously there will be some redundancy in the up to 3 copies of
<repo>-<timestamp>.db but I think that's better than e.g. breaking pkgfile
searches after an upgrade.

With this approach you could also download the latest version of the sync
databases as <repo>-<timestamp>.db without symlinking <repo>.db to it, and then
use that to query upgradable packages and other info from the mirror.

For propagating the database to the servers, nothing changes. Whenever the
database is updated, generate <repo>.db, <repo>.files, <repo>.src and whatever
else at the same time with the same internal timestamp and then just push them
out as usual.


>2) Do we need a better (read "more easily maintainable") tool for
>handling database generation and updates?  libalpm already can read in
>information package files, so we could add libalpm/db_write.c with the
>database creation functions.   Should we unify our repo format with our
>local database format which we already write?

Yes for unification, preferably in a standardized format (e.g. yaml). Having
the functionality to read and write the files in libalpm would be useful for
third-party tool developers.





On 2017-05-10 12:54 -0400
Dave Reisner wrote:

>WRT replacing repo-add, I'd suggest we come up with the use cases we
>want to support, design an interface to meet them, and then come up with
>the implementation. Might be nice to start with the Arch Linux
>repository layout as an example that we'd want to support (pooled
>packages with symlinks into repo dirs).

What about using a relative subpath instead of a filename in the database? That
would enable transparent freeform repo layouts (e.g. pooled packages without
symlinks, package groups in different subdirs, etc.).

You could also avoid the need for subdirectories by adding the architecture
to the database filename, e.g. <repo>.<arch>.<ext>



To simplify repo-add, you could include .SRCINFO directly to avoid parsing and
reformatting/rewriting that metadata. Keep it as a separate file then add a new
one (call it PKGINFO?) for information about the *.pkg.* file itself (build
date, packager, signature, checksum, size, relative filepath, etc.). Add other
files to contain related information (e.g. INSTALLINFO with install time, file
list, install origin?). That way, each step copies existing files and adds a
new one with the new info (repo-add: collect SRCINFO, add PKGINFO; install a
package: copy SRCINFO and PKGINFO to local db, create INSTALLINFO, etc.)

A repo metadata file would also be required in the root directory with the repo
timestamp for the timestamped databases described above. The file could also
collect other metadata such as package providers and maybe replacements to
speed up some operations. 


Regards,
Xyne


Re: [pacman-dev] Repository management

2017-05-15 Thread Xyne
Xyne wrote:

>Obviously there will be some redundancy in the up to 3 copies of
><repo>-<timestamp>.db but I think that's better than e.g. breaking pkgfile
>searches after an upgrade.

Just to expand on that, the worst case scenario leads to the same level of
redundancy as we currently have with complete *.files databases, while the best
case leads to no redundancy, all the while preserving the independence of
pacman -S... and pacman -F... (and whatever else you want to add).

>With this approach you could also download the latest version of the sync
>databases as <repo>-<timestamp>.db without symlinking <repo>.db to it, and then
>use that to query upgradable packages and other info from the mirror.

To make that work with my suggestion for cleaning up old timestamped databases,
add a symlink named e.g. <repo>.future, <repo>.next or <repo>.remote. That could
be used by e.g. checkupdates or pre-emptive package downloading scripts.

There may even be cases where the cleanup is unwanted, such as for a script
that regularly downloads databases and upgradable packages to provide an
incremental upgrade path at a later date (obviously regular updates are
preferred, but this may be useful and reasonable in some rare cases).

In my previous reply, I had forgotten that pacman -Sc prompts for the database
and pkgcache cleanups independently. Forget what I said about automatic
cleanups. Offload that to pacman -Sc.

Regards,
Xyne


Re: [pacman-dev] Repository management

2017-07-29 Thread Mark Weiman
On Tue, 2017-05-09 at 22:54 +1000, Allan McRae wrote:
> Hi all,
> 
> Every time I attempt to work on repo-add, I find it to be a very
> difficult endeavour.  Even though it is half the size of makepkg
> (without even including any of libmakepkg), it is much more convoluted
> to work on.
> 
> We also have a weird repository database system.  We have:
> - .db dbs with package information, signatures and delta information
> - .files dbs that are the same as .db dbs but additionally include filelists
> 
> There are two reasons the .files dbs replicate all information in the
> .db dbs
>  - .db and .files dbs getting out of sync could cause issues
>  - a complete database is useful for things like archweb, mostly to
> avoid the above
> 
> I would also like to include information on source packages to these
> databases.  The files information is separate due to wanting our primary
> database to be small.  Likewise, source package information needs to be
> separate (the signatures take most of the size in the .db dbs, so adding
> source package signatures effectively doubles the size).
> 
> So two points up for discussion:
> 
> 
> 1) Sync repository layout?  I don't see any point in leaving the tar
> based format, as reading of sync databases is not a bottleneck.  (The
> local db format can be a bottleneck, but that is a separate discussion...)
> 
> Do we split the information in .db out of .files and add a .full db with
> complete information?  Then any .src db could follow suit and just have
> source package information.  How do we get around the out of sync issue
> (e.g., a package is removed from .db, but we have an old .files database
> with it).  Do we add timestamps, and print a warning on -F operations
> when the two are out of sync?
> 

Perhaps, instead of timestamps, how about adding a .DBINFO file and
including a hash in that file that is shared between both the .db and
.files databases (and perhaps the source db as well)? This way, when
something checks the .files, you can tell if it doesn't match the .db
(because in my opinion, the .db is more important so that's what I
would compare anything to).
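
As a rough illustration of that check, here is a Python sketch that reads a
shared identifier from a .DBINFO member of two tar-based databases and
compares them.  The "hash = <value>" line format, the function names and the
way the value is generated are all assumptions; nothing like this exists in
repo-add or libalpm today.

```python
import io
import tarfile
from typing import Optional


def make_db(shared_hash: str) -> bytes:
    """Build a tiny in-memory database holding only a .DBINFO member."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        data = f"hash = {shared_hash}\n".encode()
        info = tarfile.TarInfo(".DBINFO")
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def read_dbinfo_hash(db_bytes: bytes) -> Optional[str]:
    """Return the hash recorded in .DBINFO, or None if it is missing."""
    with tarfile.open(fileobj=io.BytesIO(db_bytes), mode="r:*") as tar:
        try:
            member = tar.extractfile(".DBINFO")
        except KeyError:  # archive has no .DBINFO member
            return None
        if member is None:
            return None
        for line in member.read().decode().splitlines():
            key, _, value = line.partition("=")
            if key.strip() == "hash":
                return value.strip()
    return None


def in_sync(db_bytes: bytes, files_bytes: bytes) -> bool:
    """True when .files carries the same hash as the .db it is compared to."""
    reference = read_dbinfo_hash(db_bytes)
    return reference is not None and reference == read_dbinfo_hash(files_bytes)
```

A -F operation could call in_sync() and print a warning rather than fail
when the result is False.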

I'm not really sure what good a .full db would do for us though. Just
seems to me like extra stuff to download.

> 
> 2) Do we need a better (read "more easily maintainable") tool for
> handling database generation and updates?  libalpm already can read in
> information package files, so we could add libalpm/db_write.c with the
> database creation functions.   Should we unify our repo format with our
> local database format which we already write?
> 

I think this would be great. Especially the part of implementing
something in libalpm to do this. It would allow projects like pyalpm or
my own php-alpm to be used to also create repos.

> 
> I am looking for ideas here.  Please brainstorm to your heart's content.

I know this is two months after the fact, but here's my take on it.

Mark