Hi Reinier, I think this is ready to go pending some questions (marked COMMENT) below.
I really don't trust myself as a reviewer for this kind of stuff, so I would
be happier if somebody else did a second pass on it to make sure I didn't
miss anything, perhaps looking out for sins of omission (whereas I just did the
most elementary kind of review, which is "read the code and try to understand
what it's doing"), and also answering the question "how do we know this doesn't
contaminate our handling of the patch contents?".
Also, I know this is the second time I'm asking this (but this time with some
variants which you may be happier with than my first request): I'd like you to
consider a second time the possible merits of using diff+patch to submit a nice
clean bundle without the rollbacks or conflicts, and with some patches
consolidated or divided up as you see fit. To make this simpler for you,
perhaps do this only against the patch bundle context and not the current repo
state (let somebody else worry about the recent conflicts).
In any case, when we're ready to apply this, it may also be good to send a
gzipped version of the patch bundle.
Thanks!
On Sun, Nov 01, 2009 at 17:55:50 +0000, Reinier Lamers wrote:
> The patches make darcs treat old-style patches as locale-encoded.
COMMENT: So let me make sure I understand the implications of this. Before this
bundle, Darcs would treat patches as just a sequence of bytes. From the
user's perspective is that effectively equivalent to a Unicode-aware
darcs treating them as locale-encoded?
Patches listed
--------------
> * Make record store patch metadata in UTF-8
Seen in a previous message. It had two TODOs:
TODO: understand unpackPSFromUTF8 (was toString)
TODO: think about implications of this change. [re: Darcs.Utils]
As for the first TODO, I think that's addressed below. For the second
TODO, I think that's just making askUser return a proper Unicode string.
> * Add test for UTF-8 metadata
> * Add tests to check that amend-record stores metadata in UTF-8
> * Make UTF-8 test a bit more compact
> * Add tests for rollback's storing metadata as UTF-8
> * Test that patches with non-latin1 metadata can be safely moved between
> repos
> * Test that darcs tag stores metadata in UTF-8
Reviewed very rapidly by Trent (thanks for that!)
> * Resolve conflicts between UTF-8 and main branch
> * Resolve conflict with darcs-hs merge
> * Resolve conflict about the handling of the author pref file
Ignored as being boring
> * Make amend-record store metadata as UTF-8
> * Make _darcs/prefs/author be locale-encoded
> * Move locale conversion out of IO monad
> * Normalize Unicode of the patch metadata for comparison.
> * Add C wrappers for ICU functions to work around changing symbol names
> * Stop handling all command line arguments as Unicode text
> * Add tests for function of UTF-8 marker line and for NFC normalization
I think I've read through this. I did not spot anything wrong
Make amend-record store metadata as UTF-8
-----------------------------------------
Is this patch name a misnomer? These PatchInfo modifications seem quite
general, and while I could be convinced that they affect amend-record (with
no changes to the actual AmendRecord command), perhaps the name should have
reflected that generality.
> -import Codec.Binary.UTF8.Generic ( fromString )
> +import Codec.Binary.UTF8.Generic ( fromString, toString )
> just_name :: PatchInfo -> String
> -just_name pinf = if is_inverted pinf then "UNDO: " ++ BC.unpack (_pi_name pinf)
> -                 else BC.unpack (_pi_name pinf)
> +just_name pinf = if is_inverted pinf then "UNDO: " ++ toString (_pi_name pinf)
> +                 else toString (_pi_name pinf)
We now assume that we're dealing with UTF-8 bytestrings.
> just_author :: PatchInfo -> String
> hunk ./src/Darcs/Patch/Info.hs 109
> -just_author = BC.unpack . _pi_author
> +just_author = toString . _pi_author
> hunk ./src/Darcs/Patch/Info.hs 113
> -    text (friendly_d $ _pi_date pi) <> text " " <> packedString (_pi_author pi)
> - $$ hfn (_pi_name pi)
> + text (friendly_d $ _pi_date pi) <> text " " <> text (pi_author pi)
> + $$ hfn (pi_name pi)
> $$ vcat (map ((text " " <>) . packedString) (ignore_junk $ _pi_log pi))
COMMENT: Is there a reason we don't also remove these packedString lines from
the bit that prints the patch log?
I'm not 100% clear on why we remove these packedString calls for that matter.
For the interested, this comes from the Printer module:
-- | 'packedString' builds a 'Doc' from a 'B.ByteString' using 'printable'
packedString :: B.ByteString -> Doc
Does the Printer module have to be updated to deal with UTF-8 bytestrings as
well? Should we have separate types to distinguish between UTF-8 bytestrings
(used for patch metadata) and unknown-encoding-bytestrings (perhaps used for
patches themselves)?
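To make the latter concrete, here is a minimal sketch of the kind of type
distinction I have in mind (the names are invented; nothing like this is in
the bundle):

  import qualified Data.ByteString as B
  import Codec.Binary.UTF8.Generic ( toString )

  -- Metadata bytes that we know are UTF-8 encoded.
  newtype UTF8Bytes = UTF8Bytes B.ByteString

  -- Patch contents whose encoding we do not know; never decode these.
  newtype RawBytes = RawBytes B.ByteString

  unpackUTF8 :: UTF8Bytes -> String
  unpackUTF8 (UTF8Bytes bs) = toString bs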
> pi_name :: PatchInfo -> String
> -pi_name = BC.unpack . _pi_name
> +pi_name = toString . _pi_name
>
> -pi_author = BC.unpack . _pi_author
> +pi_author = toString . _pi_author
>
> -pi_log = map BC.unpack . ignore_junk . _pi_log
> +pi_log = map toString . ignore_junk . _pi_log
>
> pi_tag :: PatchInfo -> Maybe String
> pi_tag pinf =
> hunk ./src/Darcs/Patch/Info.hs 158
> if l == t
> - then Just $ BC.unpack r
> + then Just $ toString r
> else Nothing
> where (l, r) = B.splitAt (B.length t) (_pi_name pinf)
> hunk ./src/Darcs/Patch/Info.hs 161
> - t = BC.pack "TAG "
> + t = fromString "TAG "
More assuming that we have UTF-8 encoded bytestrings in our patchinfo
(this is also assuming that we do the conversion for old-style patches
somewhere else)
> pi_rename :: PatchInfo -> String -> PatchInfo
> -pi_rename x n = x { _pi_name = BC.pack n }
> +pi_rename x n = x { _pi_name = fromString n }
>
> set_pi_date :: String -> PatchInfo -> PatchInfo
> -set_pi_date date pi = pi { _pi_date = BC.pack date }
> +set_pi_date date pi = pi { _pi_date = fromString date }
Also make sure we store PatchInfo as UTF-8 bytestrings
> to_xml :: PatchInfo -> Doc
> to_xml pi =
> text "<patch"
> - <+> text "author='" <> escapeXML (just_author pi) <> text "'"
> - <+> text "date='" <> escapeXML (BC.unpack $ _pi_date pi) <> text "'"
> + <+> text "author='" <> escapeXMLByteString (_pi_author pi) <> text "'"
> + <+> text "date='" <> escapeXMLByteString (_pi_date pi) <> text "'"
>      <+> text "local_date='" <> escapeXML (friendly_d $ _pi_date pi) <> text "'"
Again: should the local date also be included in this, since it may one
day go locale-specific?
> +-- Escape XML characters in a UTF-8 encoded ByteString, and turn it into a Doc.
> +-- The data will be in the Doc as a bytestring.
> +escapeXMLByteString :: B.ByteString -> Doc
> +escapeXMLByteString = packedString . bstrReplace '\'' "&apos;"
> +                                   . bstrReplace '"' "&quot;"
> +                                   . bstrReplace '>' "&gt;"
> +                                   . bstrReplace '<' "&lt;"
> +                                   . bstrReplace '&' "&amp;"
This looks like a direct translation of escapeXML. Perhaps some sort of
refactor would be good using a higher-order function? Or is escapeXML just
going away?
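If escapeXML stays, one possible shape for the shared helper (just a sketch;
the name is mine and it leans on the strReplace/bstrReplace from this patch):

  escapeXMLWith :: (Char -> String -> a -> a) -> a -> a
  escapeXMLWith rep = rep '\'' "&apos;"
                    . rep '"'  "&quot;"
                    . rep '>'  "&gt;"
                    . rep '<'  "&lt;"
                    . rep '&'  "&amp;"

  -- escapeXML           = escapeXMLWith strReplace
  -- escapeXMLByteString = packedString . escapeXMLWith bstrReplace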
> strReplace :: Char -> String -> String -> String
> strReplace _ _ [] = []
> hunk ./src/Darcs/Patch/Info.hs 211
> | x == z = y ++ (strReplace x y zs)
> | otherwise = z : (strReplace x y zs)
>
> +bstrReplace :: Char -> String -> B.ByteString -> B.ByteString
> +bstrReplace c s bs | B.null bs = B.empty
> +                   | otherwise = if BC.head bs == c
> +                                   then B.append (BC.pack s) (bstrReplace c s (B.tail bs))
> +                                   else B.cons (B.head bs) (bstrReplace c s (B.tail bs))
> +
Yuck. I wonder if there are generic functions that can replace both strReplace
and bstrReplace.
Also, it looks like bstrReplace could do with some sharing of the head and tail,
and that we could do more to make it look like strReplace.
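Something along these lines, perhaps (a sketch of the shape I mean, using the
same B/BC imports as the patch):

  bstrReplace :: Char -> String -> B.ByteString -> B.ByteString
  bstrReplace c s bs
      | B.null bs       = B.empty
      | BC.head bs == c = B.append (BC.pack s) rest
      | otherwise       = BC.cons (BC.head bs) rest
    where rest = bstrReplace c s (B.tail bs)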
> make_alt_filename :: PatchInfo -> String
> make_alt_filename pi@(PatchInfo { is_inverted = False }) =
> hunk ./src/Darcs/Patch/Info.hs 221
> -  fix_up_fname (midtrunc (pi_name pi)++"-"++just_author pi++"-"++BC.unpack (_pi_date pi))
> +  fix_up_fname (midtrunc (pi_name pi)++"-"++just_author pi++"-"++toString (_pi_date pi))
> make_alt_filename pi@(PatchInfo { is_inverted = True}) =
> make_alt_filename (pi { is_inverted = False }) ++ "-inverted"
What's this function for? Could you haddock it? Anyway, I notice that you
later rolled this change back so I'm ignoring it.
Make _darcs/prefs/author be locale-encoded
------------------------------------------
> conflictor [
> hunk ./src/Darcs/Arguments.lhs 831
> - writeFile (darcsdir++"/prefs/author") add
> + writeFile (darcsdir ++ "/prefs/author") $
> + unlines ["# " ++ line | line <- fileHelpAuthor] ++ add
> ]
> :
> hunk ./src/Darcs/Arguments.lhs 831
> - writeFile (darcsdir++"/prefs/author") add
> + writeLocaleFile (darcsdir++"/prefs/author") add
Huh, so I wonder if this is a nice effect of the conflictor representation:
that you can easily read the actual patch as the part that follows the ':'.
Otherwise, this is as you promised in the description of this patch bundle above.
Move locale conversion out of IO monad
--------------------------------------
> This is motivated by the circumstance that GHC 6.12 will offer locale
> conversion with a pure type.
Uh, OK. I'll trust the GHC folks on that one and don't have the imagination
to see why that would be a bad thing. I guess that makes things nicer to read.
> +-- unsafePerformIO in the locale function is ratified by the fact that GHC 6.12
> +-- and above also supply locale conversion with functions with a pure type.
> +decodeLocale :: B.ByteString -> String
> +decodeLocale = unsafePerformIO . runInputT defaultSettings . decode
> -- | Encode a String to a ByteString according to the current locale
> -encodeLocale :: String -> IO B.ByteString
> -encodeLocale = runInputT defaultSettings . encode
> +encodeLocale :: String -> B.ByteString
> +encodeLocale = unsafePerformIO . runInputT defaultSettings . encode
No surprises, I guess.
> case undecodedAuthor of
> - Just a -> Just `fmap` decodeString a
> + Just a -> return (Just (decodeString a))
> Nothing -> return Nothing
Looks like you could just replace that with something like
  return (decodeString `fmap` undecodedAuthor)
The rest of this is just a straightforward consequence of the above.
Stop handling all command line arguments as Unicode text
--------------------------------------------------------
> --- | A locale-aware version of getArgs
> --- This function returns the command-line arguments, interpreted in the
> --- current locale instead of byte-by-byte.
> -getArgsLocale :: IO [String]
> -getArgsLocale = do
> - byteArgs <- getArgs >>= return . map encodeLatin1
> - return (map decodeLocale byteArgs)
Oh! This seems to roll back part of your very first patch. So now we stop
assuming that all the args are Unicode text, only a small subset of
them. Why? Is this related to filenames?
Right now the subset consists of --author, --from and -m, i.e. things which
appear in the patch metadata.
> +from_opt = DarcsArgOption [] ["from"] (Author . decodeString) "EMAIL"
> + "specify email address"
Well, if we're going to do this, why not --to, --cc and --subject as well?
Normalize Unicode of the patch metadata for comparison.
-------------------------------------------------------
Skimming http://www.unicode.org/reports/tr15/, the context behind this
patch is that even within Unicode there is often more than one way to
represent the same character (excuse me if I'm committing a horrible
abuse of language here).
For example, the c with a cedilla could be represented either as the
single 'ç' or as plain old 'c' plus a cedilla diacritic. There are
other kinds of equivalences too, for example, the order of diacritics
for characters that have more than one (eg in Vietnamese). Luckily,
Unicode also defines equivalences and normal forms for the above. In
this patch, Reinier has chosen the NFC form, which basically gives us
the composed form (eg. ç) wherever applicable.
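A tiny illustration of the two representations (plain Haskell string literals,
nothing darcs-specific):

  composed, decomposed :: String
  composed   = "\xE7"      -- U+00E7, the precomposed ç
  decomposed = "c\x0327"   -- 'c' followed by U+0327 COMBINING CEDILLA

  -- The two render identically but composed /= decomposed; NFC maps both to
  -- the precomposed form, which is the form this patch stores.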
What we want to avoid is the situation where Darcs thinks that two patch
metadata strings are different merely because they use different
representations for the same characters. I'm not sure how this
situation could arise or why we think it's necessary to protect against
it, but instinctively it does feel right to want to do this. This is
the sort of thing I'd like to pass up to some sort of Unicode guy, but I
guess John Cowan has approved it...
COMMENT: Are there any risks of this getting us into trouble for
old-style patches? Is it possible that the normalisation process would
somehow mess them up, for example, by transforming a sequence of bytes
that just so happens to look like UTF-8-encoded 'c followed by a
cedilla' into UTF-8 'ç'?
> This also includes two other changes: the change from the utf8-string
> package to the text package for encoding matters
What's the motivation behind this?
Also, why do we use the 0.3 version of text and not the latest 0.5?
> + extra-libraries: icuuc
> +
How portable a change is this? On my Mac, I saw that I had a libicucore
and not a libicuuc. MacPorts says it has a libicu package. Do we just
get used to requiring people to install that like they might libz?
Ah, I see you've also noticed the text-icu package; why doesn't it
work on major Linux distributions?
> +unpackPSFromUTF8 :: B.ByteString -> String
> +unpackPSFromUTF8 = T.unpack . decodeUtf8
> +
> +packStringToUTF8 :: String -> B.ByteString
> +packStringToUTF8 = encodeUtf8 . T.pack
From UTF-8 bytestring to UTF-16 Text to String and back. Maybe we
should just be shipping Text around whenever we know we're dealing
with Unicode instead of UTF-8 bytestrings?
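By which I mean something vaguely like this (purely hypothetical, not what
the bundle does):

  import Data.Text ( Text )

  data PatchInfo' = PatchInfo' { piName'   :: Text
                               , piAuthor' :: Text
                               , piDate'   :: Text
                               , piLog'    :: [Text]
                               }

  -- ...and only encode to a UTF-8 ByteString at the point where the patch is
  -- actually written out.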
> +-- | Convert a String to a ByteString by encoding it with UTF-8 and normalizing
> +-- it to Normalized Form C. For the why and how of normalization, see
> +-- http://www.unicode.org/reports/tr15/ .
> +packStringToUTF8NFC :: String -> B.ByteString
> +packStringToUTF8NFC = encodeUtf8 . normalizeText . T.pack
Is there a reason to prefer NFC to NFD? Is the choice entirely
arbitrary so long as we're using a normal form?
> +foreign import ccall "unicode/unorm.h unorm_normalize_3_8" raw_normalize :: Ptr Word16 -> Int32 -> CInt -> Int32 -> Ptr Word16 -> Int32 -> Ptr CInt -> IO Int32
So for the interested, unorm_normalize has this documentation.
int32_t unorm_normalize ( const UChar * source,
int32_t sourceLength,
UNormalizationMode mode,
int32_t options,
UChar * result,
int32_t resultLength,
UErrorCode * status
)
COMMENT: Do we need to care about the potential difference between
UTF-16 and ICU's UChar type (which is also 16 bits wide)?
> +-- | Normalize a Text according to Normal Form C (NFC).
> +-- This home grown binding of ICU is inefficient, so do not use it for longer
> +-- strings. Hopefully it can be removed in the 2.5 release or so when text-icu
> +-- works on the major Linux distributions.
> +normalizeText :: Text -> Text
> +normalizeText s = unsafePerformIO $ T.useAsPtr s $ \sptr slen ->
> + allocaBytes bufSizeBytes $ \tptr ->
> + let len = fromIntegral slen :: Int32
> + in do resultLength <- handleError $ \errptr ->
> +            raw_normalize sptr len normalizationType unicode32 tptr (fromIntegral bufSizeBytes) errptr
> + T.fromPtr tptr (fromIntegral resultLength)
Otherwise, I'm reading this as
- create a copy of the Text in the form of a C array
- create a new array that's 3 times the size of the original UTF-16 array
(notice the *2 below because of 16 as opposed to 8 bits)
- pass both to unorm_normalize along with size information.
> +{-|
> + Provides simple (i.e. abort-if-anything-wrong) error handling for ICU
> + functions.
> +
> + Takes as an argument a function that writes an ICU error code to a certain
> + memory address (like most ICU4C functions do).
> +
> +  This function runs the given function, giving it a memory address to write the
> + error code to. When the given function indicates an error, it aborts the
> + program. Otherwise it just returns the result.
> +-}
> +handleError :: (Ptr CInt -> IO a) -> IO a
> +handleError f = alloca $ \errptr ->
> + do poke errptr 0
> + result <- f errptr
> + errorCode <- peek errptr
> +                   when (errorCode > 0) (error (errMsg ++ show errorCode))
> + return result
> + where errMsg = "darcs ByteStringUtils: error returned by ICU4C function: "
If anything goes wrong during normalisation, our head explodes, which looks
fair enough.
Note that I tend to get nervous and want to run to a grown-up when I see this
low-level stuff. I think what I'm reading just says "allocate space for a C
int, pass f a pointer to that int, and explode if we read back a non-zero value
for that int"
> hunk ./src/Darcs/Email.hs 7
> && ord c < 128 = c2bs c
> -                  | otherwise = B.concat (map qbyte (UTF8.encode [c]))
> + | otherwise = B.concat
> + (map qbyte
> + (B.unpack
> + (packStringToUTF8 [c])))
This looks like merely replacing the utf8-string stuff with the equivalent from
Text
> patchinfo :: String -> String -> String -> [String] -> IO PatchInfo
> patchinfo date name author log =
> - add_junk $ PatchInfo { _pi_date = fromString date
> - , _pi_name = fromString name
> - , _pi_author = fromString author
> - , _pi_log = map fromString log
> + add_junk $ PatchInfo { _pi_date = BC.pack date
> + , _pi_name = packStringToUTF8NFC name
> + , _pi_author = packStringToUTF8NFC author
> + , _pi_log = map packStringToUTF8NFC log
> , is_inverted = False }
We systematically normalise patch metadata to NFC.
COMMENT: I wonder if there is any risk of manipulating the PatchInfo
using functions that don't go through packStringToUTF8NFC.
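For example, pi_rename and set_pi_date (quoted earlier) still pack with plain
fromString rather than packStringToUTF8NFC, so they would bypass the
normalisation:

  pi_rename x n = x { _pi_name = fromString n }
  set_pi_date date pi = pi { _pi_date = fromString date }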
> -- | add_junk adds a line that contains a random number to make the patch
Sounds like we also need to update this haddock
> hunk ./src/Darcs/Patch/Info.hs 87
> -    return $ pinf { _pi_log = BC.pack (head ignored++showHex x ""++" UTF-8"):
> + return $ pinf { _pi_log = BC.pack (head ignored++showHex x ""):
> + BC.pack (head ignored++ "UTF-8"):
> _pi_log pinf }
Oh! I completely missed that you had written this all on one line the
first time. Yes, I think I like the new way better.
[skipped a bunch of toString => metadataToString conversions]
> +-- | Is the metadata in this PatchInfo UTF-8-encoded?
> +isUTF8 :: PatchInfo -> Bool
> +isUTF8 = any (==(BC.pack "Ignore-this: UTF-8")) . _pi_log
No surprises. A PatchInfo is new-style if it has an "Ignore-this: UTF-8" line.
Note the use of _pi_log here instead of pi_log (which does friendly
things like stripping out the Ignore-this lines).
> +-- | Convert a metadata ByteString to a string, with the exact way of conversion
> +-- depending on the version of darcs that created the PatchInfo. In newer darcs
> +-- the metadata are UTF-8, older darcs was not encoding-aware.
> +metadataToString :: PatchInfo -> B.ByteString -> String
> +metadataToString pinf | isUTF8 pinf = unpackPSFromUTF8
> + | otherwise = decodeLocale
In contrast, we store the underlying metadata as NFC-normalised UTF-8
bytestrings.
I wonder if we should have a metadataFromString :: String -> B.ByteString.
Also, it does sound like we'd want to use Text to represent the metadata
if we're going to be doing that anyway.
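A sketch of the metadataFromString I mean (the name is mine, and it assumes
the packStringToUTF8NFC from this bundle):

  metadataFromString :: String -> B.ByteString
  metadataFromString = packStringToUTF8NFC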
Add C wrappers for ICU functions to work around changing symbol names
---------------------------------------------------------------------
I don't really understand this patch
> -foreign import ccall "unicode/unorm.h unorm_normalize_3_8" raw_normalize :: Ptr Word16 -> Int32 -> CInt -> Int32 -> Ptr Word16 -> Int32 -> Ptr CInt -> IO Int32
> +foreign import ccall "fpstring.h normalize" raw_normalize :: Ptr Word16 -> Int32 -> CInt -> Int32 -> Ptr Word16 -> Int32 -> Ptr CInt -> IO Int32
Specifically, why not just import unorm_normalize? Perhaps because it's
just a macro and we can't do that with the FFI?
> normalizeText :: Text -> Text
> -normalizeText s = unsafePerformIO $ T.useAsPtr s $ \sptr slen ->
> +normalizeText s = unsafePerformIO $ T.useAsPtr s $ \sptr slen -> do
Superfluous change?
> +/* Unicode normalization: wrap unorm_normalize because ICU encodes the library
> + * version in the symbol name. */
> +int32_t normalize(UChar *source, int32_t sourceLength, UNormalizationMode mode,
> + int32_t options, UChar *result, int32_t resultLength,
> + UErrorCode *status)
> +{
> + return unorm_normalize(source, sourceLength, mode, options,
> + result, resultLength, status);
> +}
> +
> +/* Check Unicode normalization. Again, wrap the function because ICU's
> + * symbol names are not stable across ICU versions. */
Is there a good reason for that?
Is there any reason to think it would be unwise to defeat this?
> +UBool isNormalized(const UChar *src, int32_t srcLength, UNormalizationMode mode,
> + UErrorCode *pErrorCode)
> +{
> + return unorm_isNormalized(src, srcLength, mode, pErrorCode);
> +}
This looks like it's just used by tests.
Add tests for function of UTF-8 marker line and for NFC normalization
---------------------------------------------------------------------
I think I'd be happier if the modifications to ByteStringUtils, etc. were
in a separate patch.
> in do resultLength <- handleError $ \errptr ->
> -                      raw_normalize sptr len normalizationType unicode32 tptr (fromIntegral bufSizeBytes) errptr
> +                      raw_normalize sptr len nfcType 0 tptr (fromIntegral bufSizeBytes) errptr
Is there a reason we replace 0x20 by 0?
> +-- | Check if a String is in normalized form C according to Unicode
> +isNFC :: String -> Bool
> +isNFC = isNFCText . T.pack
OK
> +-- | Integer constant to tell ICU to use NFC
> +nfcType :: CInt
> +nfcType = 4
Just factored out from a where clause
> +-- The members with names that start with '_' are not supposed to be used
> +-- directly in code that does not care how the patch info is stored.
Useful haddock. Are they even meant to be exported?
> +instance Arbitrary UnicodeString where
> + -- 0x10ffff is the highest Unicode code point ; 0xd800 - 0xdfff are
> + -- surrogates.
> +  arbitrary = UnicodeString `fmap` listOf (oneof [choose ('\0', '\xd799')
> +                                                 ,choose ('\xe000', '\x10ffff')])
http://www.unicode.org/glossary/#surrogate_code_point
http://en.wikipedia.org/wiki/Mapping_of_Unicode_characters#Surrogates
OK I'm just trusting this to mean that in the test we never bother generating
characters that are not Unicode characters per se (eg. UTF-16 artefacts don't
count)
> +instance Arbitrary MyPatchInfo where
> + arbitrary = do n <- asString `fmap` arbitrary
> + d <- arbitrary
> + a <- asString `fmap` arbitrary
> + l <- (lines . asString) `fmap` arbitrary
> + let arbPatchInfo = unsafePerformIO $ patchinfo n d a l
> + MyPatchInfo `fmap`
> + (oneof [return arbPatchInfo
> + ,return (deleteUTF8Marker arbPatchInfo)])
> + shrink mpi = flip withMyPatchInfo mpi $ \pi -> do
> + sn <- shrink (pi_name pi)
> + sa <- shrink (pi_author pi)
> + sl <- shrink (filter (not . isPrefixOf "Ignore-this:") (pi_log pi))
> + return (MyPatchInfo (unsafePerformIO $
> +                         patchinfo sn (BC.unpack (_pi_date pi)) sa sl))
Randomly generates a patchinfo either for a new-style or old-style patch
> +-- | Test that metadata in patches are decoded as UTF-8 or locale depending on
> +-- the relevant 'Ignore-this' line in the log.
> +metadataStringTest :: Test
> +metadataStringTest = testProperty "Testing patch metadata decoding" $
> + withMyPatchInfo $
> + \patchInfo -> classify (isUTF8 patchInfo) "UTF-8-encoded" $
> + let decoder | isUTF8 patchInfo = unpackPSFromUTF8
> + | otherwise = decodeLocale
> + in ( decoder (_pi_author patchInfo) == pi_author patchInfo
> + && decoder (_pi_name patchInfo) == pi_name patchInfo
> +       && map decoder (_pi_log patchInfo) `superset` pi_log patchInfo)
Is this test useful if your locale uses a UTF-8 encoding?
> +packUnpackTest :: Test
> +packUnpackTest = testProperty "Testing UTF-8 packing and unpacking" $
> +    \uString -> asString uString == (unpackPSFromUTF8 . packStringToUTF8) (asString uString)
This looks like it's more testing the low-level ByteStringUtils
> +superset :: (Eq a, Ord a) => [a] -> [a] -> Bool
> +superset a b = sorted_superset (sort a) (sort b)
> + where sorted_superset (x:xs) (y:ys) | x == y = sorted_superset xs ys
> + | x < y = sorted_superset xs (y:ys)
> + | y < x = False
> + sorted_superset [] (_:_) = False
> + sorted_superset _ [] = True
I suppose we could have a more conservative test instead, seeing whether
b can be split into some prefix and suffix of a
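Something like this, maybe (a sketch of what I mean, standard Data.List only):

  import Data.List ( inits, tails, isPrefixOf, isSuffixOf )

  -- b is some prefix of a followed by some suffix of a, i.e. a with one
  -- contiguous chunk removed.
  prefixPlusSuffixOf :: Eq a => [a] -> [a] -> Bool
  prefixPlusSuffixOf b a = or [ p `isPrefixOf` a && s `isSuffixOf` a
                              | (p, s) <- zip (inits b) (tails b) ]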
> +-- | Test that metadata strings are normalized to NFC
> +metadataNormalizationTest :: Test
> +metadataNormalizationTest = testProperty "Testing metadata normalization" $
> + withMyPatchInfo $
> + \patchInfo -> isUTF8 patchInfo ==>
> + isNFC (pi_author patchInfo)
> + && isNFC (pi_name patchInfo)
> + && all isNFC (pi_log patchInfo)
Looks good
> +-- | Generate an arbitrary list of at least one element
> unempty :: Arbitrary a => Gen [a]
> unempty = do
> hunk ./src/Darcs/Test/Patch/Test.hs 162
> + a <- arbitrary
> as <- arbitrary
> hunk ./src/Darcs/Test/Patch/Test.hs 164
> - case as of
> - [] -> unempty
> - _ -> return as
> + return (a:as)
Seems like just a minor refactor, replacing a function that generates
a non-empty list with a cleaner variation
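Just as an aside, and assuming the QuickCheck Gen/Arbitrary already in scope
there, the new definition is equivalent to:

  import Control.Monad ( liftM2 )

  unempty :: Arbitrary a => Gen [a]
  unempty = liftM2 (:) arbitrary arbitrary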
--
Eric Kow <http://www.nltg.brighton.ac.uk/home/Eric.Kow>
PGP Key ID: 08AC04F9