Hi all,

I have been tasked with converting the repository of one of our main
products from Mercurial to Git.  I've already done some test conversions
using fast-export (https://github.com/frej/fast-export), and the results
look very good.

After running the conversion, I'll also use the BFG Repo Cleaner
(https://rtyley.github.io/bfg-repo-cleaner/) to remove some big binary
files which should never have been committed.  I've tested that as well,
and it seems to work as intended.

The conversion process should also change the file encoding of all
source code (Java) files from ISO-8859-1/15 to UTF-8, and this is where
I'm asking for advice.

Right now, I have a script which guesses the encoding of each file
using `file --mime-encoding "$file"` and converts it from the guessed
encoding to UTF-8 using `iconv`.  I'd run it on every active branch and
commit the results individually after the general hg to git conversion.
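
To make the per-file step concrete, here is a rough Python sketch of the
same idea.  It is not my actual script: instead of the `file` guess it
simply tries a strict UTF-8 decode and falls back to a legacy encoding
if that fails.  The names `to_utf8` and `convert_file` and the default
of ISO-8859-15 are assumptions made for illustration only.

```python
from pathlib import Path


def to_utf8(data: bytes, legacy_encoding: str = "iso-8859-15") -> bytes:
    """Return data as UTF-8 bytes.

    Assumption: anything that is not valid UTF-8 is in legacy_encoding.
    """
    try:
        data.decode("utf-8")  # strict decode: raises on invalid sequences
        return data           # already valid UTF-8 (includes pure ASCII)
    except UnicodeDecodeError:
        return data.decode(legacy_encoding).encode("utf-8")


def convert_file(path: Path) -> bool:
    """Rewrite path in place as UTF-8; return True if the file changed."""
    original = path.read_bytes()
    converted = to_utf8(original)
    if converted != original:
        path.write_bytes(converted)
        return True
    return False
```

A legacy byte sequence that happens to be valid UTF-8 would be left
untouched by this check, so it is a heuristic just like the `file`
guess.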

Well, the "on every active branch" part is the problematic point.  It's
time-consuming manual work with plenty of chances to shoot yourself in
the foot.  Additionally, developers might introduce encoding problems
again when switching between converted and non-converted branches,
because the IDE defines the standard encoding per project (root
directory) and not per branch...

Long story short: can I somehow manage to do the ISO-8859 to UTF-8
conversion in the process of converting from hg to git so that the end
result looks like the project has used UTF-8 straight from the
beginning?

Sadly, not every Java file has always used ISO-8859.  Some files have
been switched between ISO-8859 and UTF-8 several times due to
broken/missing editor configurations.  This has introduced encoding
errors, e.g., files containing the Unicode Replacement Character
(U+FFFD) or what you get when you save that again as ISO-8859.  Since
these errors are only in comments and string literals, they usually
didn't pop up immediately because the project still compiled...
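
To illustrate the kind of damage I mean, here is a small Python sketch
that flags lines containing the replacement character.  It is only a
sketch under assumptions: `find_damage` is a made-up name, and the
second pattern assumes the three UTF-8 bytes of U+FFFD were misread as
ISO-8859-1 and then saved as UTF-8 again.  Characters that an editor
silently replaced with a plain "?" cannot be found this way.

```python
from typing import List

# U+FFFD encoded as UTF-8: the bytes EF BF BD
REPLACEMENT_UTF8 = "\ufffd".encode("utf-8")

# Assumed second form: those three bytes misread as ISO-8859-1
# and then saved as UTF-8 again
REPLACEMENT_DOUBLE = REPLACEMENT_UTF8.decode("iso-8859-1").encode("utf-8")


def find_damage(data: bytes) -> List[int]:
    """Return the 1-based numbers of lines containing either pattern."""
    hits = []
    for number, line in enumerate(data.splitlines(), start=1):
        if REPLACEMENT_UTF8 in line or REPLACEMENT_DOUBLE in line:
            hits.append(number)
    return hits
```

Running this over each file revision would at least show which files
and lines need a manual look.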

Bonus question: I guess I can somehow configure our Git server (we use a
GitLab instance, if that matters) to reject pushes containing non-UTF-8
changes to Java files.  How would I do that?  And in case I need to take
the "convert each active branch to UTF-8 using some extra commit"
approach, is there a way to exclude a list of legacy branches from this
rule?

Thanks a lot in advance,
Tassilo
