Thank you very much for your long reply, I have never seen such a
thorough answer to a question in a mailing list.
I tried installing the latest version of pmwiki, but it raised some other
problems and the encoding problem wasn't fixed anyway. The filenames
were the same on both installations; I used rsync to copy them and
checked that both of them had exactly the same characters.
I have been trying other things on my own, and though I could not
find the cause of the problem, I managed to find a solution, or perhaps
more of a hack. Basically, I added the extended Latin1 characters to the
regular expressions involved in name recognition, and rewrote the
functions that are used to make a filename (MakePageName,
MakeUploadName). For example, originally:
$NamePattern = '[[:upper:]\\d][\\w]*(?:-\\w+)*';
And I changed it to:
$NamePattern = '[\\w\\x80-\\xfe]+(?:-[\\w\\x80-\\xfe]+)*';
So the Latin1 characters could be included. Likewise, in MakePageName(),
I changed
SDV($PageNameChars,'-[:alnum:]');
SDV($MakePageNamePatterns, array(
  "/'/" => '', # strip single-quotes
  "/[^$PageNameChars]+/" => ' ', # convert everything else to space
  '/((^|[^-\\w])\\w)/e' => "strtoupper('$1')",
  '/ /' => ''));
to:
SDV($PageNameChars,'-[:alnum:]\\x80-\\xfe');
SDV($MakePageNamePatterns, array(
  "/'/" => '', # strip single-quotes
  "/[^$PageNameChars]+/" => ' ', # convert everything else to space
  "/(?<=^| )([a-z])/e" => "strtoupper('$1')",
  "/ /" => ''));
because when a word contained a special character, MakePageName would
first change the offending letter to a space; as a consequence the word
was split at that point and the next letter was changed to uppercase, as
if it started a different word. That explains the "Documentación" =>
"DocumentaciN" conversion.
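To make the effect of the two pattern sets easier to see, here is a minimal Python sketch that imitates the replacement chain on a Latin1 byte string. It is not PmWiki code: the function name make_page_name and the patched flag are made up for illustration, A-Za-z0-9 stands in for [:alnum:] (assuming the C locale, where only ASCII counts as alphanumeric), and the lookbehind of the patched rule is written as a capturing group because Python does not accept a variable-width lookbehind.

```python
import re

def make_page_name(raw: bytes, patched: bool = False) -> bytes:
    """Rough imitation of the replacement chain on a Latin1 byte string."""
    # Stand-in for '-[:alnum:]'; the patched variant also allows bytes 0x80-0xfe.
    allowed = rb"-A-Za-z0-9" + (rb"\x80-\xfe" if patched else b"")
    s = re.sub(rb"'", b"", raw)                      # strip single-quotes
    s = re.sub(rb"[^" + allowed + rb"]+", b" ", s)   # everything else becomes a space
    if patched:
        # Patched rule: uppercase an ASCII letter at the start or after a space.
        s = re.sub(rb"(^| )([a-z])",
                   lambda m: m.group(1) + m.group(2).upper(), s)
    else:
        # Original rule: uppercase the first word character after a non-word byte.
        s = re.sub(rb"((?:^|[^-\w])\w)", lambda m: m.group(1).upper(), s)
    return s.replace(b" ", b"")                      # drop the spaces

name = "Documentaci\u00f3n".encode("latin-1")        # b'Documentaci\xf3n'
print(make_page_name(name))                          # b'DocumentaciN'
print(make_page_name(name, patched=True))            # b'Documentaci\xf3n'
```

With the original patterns the byte for "ó" becomes a space, the following "n" is capitalized, and the space is then removed; with the patched patterns the byte survives untouched.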
This is the first time I have delved into PHP, and I am not really sure
how good the regular expressions I used are (or whether I changed all
the right variables: $NamePattern is clearly used, but I also changed
$GroupPattern, $WikiWordPattern and $SuffixPattern). I took inspiration
from scripts/xlpage*.php.
These changes fixed everything that wasn't working, or so it seems so
far. The titles look correct and the files are accessed without
having to change their filenames.
Again, thank you very much for all the time you devoted to answering my
question,
Leandro.
On 09/28/2013 06:29 AM, Petko Yotov wrote:
Unfortunately there is not an easy solution to this problem, see below.
Leandro Fanzone writes:
Hello, I have an installation of pmwiki on a Fedora Core 4 server,
and I decided to migrate it to Ubuntu 12.04. As I did not want to
install pmwiki again, I just copied /var/www to the new machine and
installed Apache + PHP.
As a result, some pages that had titles with Spanish letters (á, ñ,
etc.) cannot be accessed anymore. I see that the files do exist
(albeit they have the special letters changed somehow) but when I try
to open those pages pmwiki cannot find them. For example: a page
called "Documentación" exists in the filesystem as "Documentaci?n",
but pmwiki tries to access it as "DocumentaciN". It seems an encoding
problem, apparently the contents are stored in Latin1 (ISO-8859-1),
and in the filenames sometimes the special letters were changed with
? and sometimes they keep the Latin1 letter, but for some reason
pmwiki does not generate the same filename as before to access them.
I am completely lost, I don't know if this is a configuration problem
of PHP, of Apache, of the LANG variable...
This is likely a problem of the filesystem encoding (charset). It is
possible that the older server had a different filesystem encoding
than the new one.
A charset (character set) is a set of rules defining the byte or bytes
used to represent different letters, characters and symbols. Different
charsets generally use the same bytes for the plain Roman/Latin
letters (ASCII) and the most used punctuation symbols, but for example
international letters like "ó" may be "tied" to different bytes in
different charsets. If your filenames contain such characters, there
is no guarantee that you'll be able to copy them without errors from
one filesystem to another.
PmWiki (actually PHP) doesn't care much about the charset; it just
processes the stream of bytes, whatever the charset.
So if your wiki content is in Latin1 and PmWiki creates a link to a
page "Documentación", it will look for a filename which is the stream
of bytes with positions 68, 111, 99, 117, 109, 101, 110, 116, 97, 99,
105, 243, 110, where the "ó" character is byte number 243.
If in your directory there is no such filename, PmWiki will show a
link as if the page doesn't exist.
The Unicode/UTF-8 charset defines "ó" as two consecutive bytes, 195
and 179, which are obviously not the same.
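As a quick check of the byte values discussed here, the following Python sketch encodes the same page name in both charsets and prints the resulting byte numbers. The variable names are only illustrative, and the "ó" is written as the escape \u00f3 so that the encoding of the script file itself does not matter.

```python
name = "Documentaci\u00f3n"      # "Documentación"; \u00f3 is "ó"

latin1_bytes = list(name.encode("latin-1"))
utf8_bytes = list(name.encode("utf-8"))

print(latin1_bytes)  # 13 bytes: "ó" is the single byte 243
print(utf8_bytes)    # 14 bytes: "ó" is the pair 195, 179
```

The two lists agree on every ASCII letter and differ only where the "ó" is, which is why a filename written under one charset is not found under the other.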
When you copy files from one filesystem to another, there may be two
cases: either (A) your copy program is aware of the two charsets and
recodes the actual letters to the correct byte positions, or (B) it is
not aware of the charsets and tries to copy the files and tells the
new filesystem "this file is named this string of bytes: 68, 111, 99,
117, 109, 101, 110, 116, 97, 99, 105, 243, 110" which (B1) may or (B2)
may not be accepted by the new filesystem, e.g. because that stream of
bytes is not valid UTF-8.
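The claim that the Latin1 byte stream is not valid UTF-8 can be checked with a short Python sketch. The helper name is_valid_utf8 is made up for this example, and what a real filesystem or tool does with such a name may differ; the sketch only shows what a strict UTF-8 decoder makes of it.

```python
latin1_name = "Documentaci\u00f3n".encode("latin-1")   # b'Documentaci\xf3n'

def is_valid_utf8(raw: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

print(is_valid_utf8(latin1_name))                        # False
# A lenient decoder substitutes a replacement character for the bad byte,
# which many tools then display as "?".
print(latin1_name.decode("utf-8", errors="replace"))
```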
In case of (A) you'll be able to see the correct filenames when you
browse your filesystem, but PmWiki may be unable to find the files as
it expects different byte streams/positions.
In case of (B1) PmWiki should be able to find its filenames and it
should work like before, but when you browse your filesystem, you may
see weird characters.
In case of (B2) neither you, nor PmWiki see the correct filenames with
international characters. It looks as if you are in this case.
Note, Pagelists/searches use a different approach than links. A link
to a page asks if there is such a file, while a pagelist/search will
list all files in the wiki.d directory and will try to process them -
if a file is named "Documentaci?n", the "?" character is not allowed
in a pagename so PmWiki tries to deduce an allowed pagename and it can
list "DocumentaciN".
I think I can just change every filename to match pmwiki,
Try with 1-2 files first to see if it works, because you'll have the
(A) case above and PmWiki may still not be able to locate them.
but on one hand that implies a lot of work, and on the other, the
titles that have special characters are changed as well, which looks
horrible.
What does "looks horrible" mean? If you rename a file to something
that looks OK in the filesystem, PmWiki may be able to access it and
will try to show these bytes in the Latin1 charset. If the filesystem
charset is UTF-8, pmwiki will show "DocumentaciÃ³n" because the bytes
195 and 179 ("ó" in UTF-8) are the characters "Ã" and "³" in Latin1.
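This misreading is easy to reproduce. The Python sketch below encodes the name as UTF-8 and then decodes the same bytes as Latin1, which is roughly what happens when a Latin1 wiki displays a UTF-8 filename; the variable names are only illustrative.

```python
utf8_bytes = "Documentaci\u00f3n".encode("utf-8")   # "ó" becomes bytes 195, 179
shown = utf8_bytes.decode("latin-1")                # each byte read as one Latin1 character

print(shown)        # DocumentaciÃ³n
print(len(shown))   # 14 characters instead of 13
```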
Some wiki admins restrict pagenames and filenames to ASCII characters,
which are on the same byte positions in most charsets. Then the page
is named "Documentacion" and there is a directive (:title
Documentación:) in it so that it displays correctly. This is generally
more migration-proof than allowing all international characters.
There is a recipe that converts all links to the correct plain
letters, see
http://www.pmwiki.org/wiki/Cookbook/ISO8859MakePageNamePatterns
If you want to go this way, you just write a small bash script on the
old server (!!BACKUP. Your. Files. Before!!) that will rename the
files to ASCII characters, something like this:
# dry run: only prints the proposed new names, renames nothing
for filename in * ; do
  newfilename=$(printf '%s' "$filename" | iconv -f iso8859-1 -t ascii//TRANSLIT -c -)
  echo "$filename -> $newfilename"
done
This will just show you if and how your filenames would be renamed. If
you are OK with this, change the script to actually rename the files.
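For comparison, the same dry run can be sketched in Python. This is an alternative to the iconv loop, not a reproduction of it: it strips accents through Unicode decomposition, so its output can differ from what //TRANSLIT produces, and characters without a decomposition (such as "ß") are simply dropped. The function names and the sample filenames are invented for the example.

```python
import unicodedata

def ascii_name(latin1_name: bytes) -> str:
    """Decode a Latin1 filename and strip accents to get an ASCII name."""
    text = latin1_name.decode("latin-1")
    decomposed = unicodedata.normalize("NFKD", text)   # "ó" -> "o" + combining accent
    return decomposed.encode("ascii", "ignore").decode("ascii")

def plan_renames(names):
    """Return (old, new) pairs for the names that would change; renames nothing."""
    plan = []
    for raw in names:
        new = ascii_name(raw)
        if new.encode("ascii") != raw:
            plan.append((raw, new))
    return plan

# Hypothetical wiki.d filenames, given as raw Latin1 bytes.
sample = [b"Main.Documentaci\xf3n", b"Main.HomePage", b"Main.Espa\xf1a"]
for old, new in plan_renames(sample):
    print(old, "->", new)
```

As with the shell version, it is worth reviewing the printed plan for collisions (two old names mapping to the same new name) before renaming anything.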
Then install the recipe ISO8859MakePageNamePatterns and test if the
wiki works on the old server. If it does, place (:title Correct
title:) in the pages where the accents were lost, and copy the wiki.d
directory and local/config.php to the new server.
Another note: the encoding of the config.php file also matters. If
your wiki is in iso8859-1, save your file in that encoding and not in,
e.g., UTF-8. You must use a text editor that lets you select the
encoding of the files. See
http://www.pmwiki.org/wiki/PmWiki/LocalCustomizations#encoding .
Good luck,
Petko
_______________________________________________
pmwiki-users mailing list
[email protected]
http://www.pmichaud.com/mailman/listinfo/pmwiki-users