Re: [Apertium-stuff] Separate Corpus Repos

Francis Tyers Thu, 12 Dec 2019 08:55:50 -0800

On 2019-12-12 09:16, Kevin Brubeck Unhammer wrote:

Tino Didriksen <m...@tinodidriksen.com> wrote:

I would like for corpus and other indirect data to go in separate
repositories. Basically, if the data is not used during the build, it
should go elsewhere.


What if it's used during `make test`?

By the same argument, should we remove scripts that are used during
development, but not required for build (stuff that is kept in the dev/
subfolder)? If we get too strict on the requirement of "only things
necessary for build", people may start just not checking in useful
scripts, which to me seems worse. And it's already quite annoying having
to check out three repos just to work on one language pair; if
development depends on corpora repos, you have not just three, but *six*
places where you can forget to git push, or where you have to compare
git logs to review changes.

We need corpus data under Apertium's control so that we don't rely on
3rd parties. However, bundling this data in the languages' and pairs'
repos means that those repos grow unbounded, especially when the data is
changed.


I agree that "big" data shouldn't be in the regular repos, since it
slows down checking them out. But less than a few megabytes of text
won't make much difference to a repo with tens of MB's of .dix entries.

It also messes up the changelog. I use a script to generate AUTHORS from
the changelog, because nobody keeps that up to date. But this gets muddied
when unnecessary data is in the repo.


In general I would want to include annotators as authors, though I can
imagine situations where it's not clear-cut, e.g. where the dataset is
too large or is not quite relevant for developing the rest of the repo.

I think having corpus-xxx and corpus-xxx-yyy repos could be a good
thing, but I don't think we should have a hard requirement of moving
data over there, especially if the data is useful during testing and
development. I do think it makes sense to move larger corpora out, for
faster cloning.


I like the idea of not having large corpora in the git repos for
languages and language pairs.

I'm not sure if corpora-xxx on GitHub is the right way to go though.

I think it would be better to store them on a web server and either:

1) Have apertium-xxx/text that has a script that will download the corpus
   from the server and a gitignore to not have it in the repo.
2) Use something like git-annex (this is a bit more involved)
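
As a rough illustration of option 1, a download script could look something
like the sketch below. Everything in it is an assumption made for the sake of
the example: the function names, the idea of keeping the .gitignore inside the
text/ directory, and any server URL you pass in are hypothetical, not an
existing Apertium tool or layout.

```python
import shutil
import urllib.request
from pathlib import Path


def ensure_ignored(text_dir: Path, name: str) -> None:
    """Add name to text_dir/.gitignore unless it is already listed."""
    gitignore = text_dir / ".gitignore"
    lines = gitignore.read_text().splitlines() if gitignore.exists() else []
    if name not in lines:
        lines.append(name)
        gitignore.write_text("\n".join(lines) + "\n")


def fetch_corpus(url: str, text_dir: Path, name: str) -> Path:
    """Download url to text_dir/name, skipping the download if it exists.

    The file is listed in .gitignore first, so the corpus itself never
    ends up tracked in the language or pair repo.
    """
    text_dir.mkdir(parents=True, exist_ok=True)
    ensure_ignored(text_dir, name)
    target = text_dir / name
    if not target.exists():
        # Stream the response to disk so large corpora are not held in memory.
        with urllib.request.urlopen(url) as response, open(target, "wb") as out:
            shutil.copyfileobj(response, out)
    return target
```

A call would then be along the lines of
fetch_corpus(url, Path("apertium-xxx/text"), "xxx.wikipedia.txt"), where both
the URL and the file name are placeholders. Only the script and the .gitignore
would be committed; the corpus stays on the server.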

It would be great to e.g. keep updated cleaned versions of Wikipedia dumps,
and also be able to store non-redistributable stuff.

I can expand a bit on this proposal if necessary.

Fran


_______________________________________________
Apertium-stuff mailing list
Apertium-stuff@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/apertium-stuff
