Hello,

Here is the latest OCaml Weekly News, for the week of December 23 to 30, 2014.

1) Uucp 0.9.1
2) Uuseg 0.8.0 Unicode Text Segmentation
3) Other OCaml News

========================================================================
1) Uucp 0.9.1
Archive: <https://sympa.inria.fr/sympa/arc/caml-list/2014-12/msg00103.html>
------------------------------------------------------------------------
** Daniel Bünzli announced:

I'd like to announce the release of Uucp 0.9.1 which should be available
shortly in opam.

This release adds a new `Uucp.Break` module with Unicode's line, grapheme
cluster, word and sentence break properties.

<https://github.com/dbuenzli/uucp/blob/0.9.1/CHANGES.md>

Uucp provides efficient access to a selection of character properties of the
Unicode character database.

Home page: <http://erratique.ch/software/uucp>
API docs: <http://erratique.ch/software/uucp/doc/Uucp>

========================================================================
2) Uuseg 0.8.0 Unicode Text Segmentation
Archive: <https://sympa.inria.fr/sympa/arc/caml-list/2014-12/msg00104.html>
------------------------------------------------------------------------
** Daniel Bünzli announced:

I'd like to announce the first release of Uuseg which should be available
shortly in OPAM. Here's the blurb:

Uuseg is an OCaml library for segmenting Unicode text. It implements the
locale independent Unicode text segmentation algorithms [1] to detect
grapheme cluster, word and sentence boundaries and the Unicode line breaking
algorithm [2] to detect line break opportunities.

The library is independent from any IO mechanism or Unicode text data
structure and it can process text without a complete in-memory
representation.

Uuseg depends on Uucp and optionally on Uutf for support on OCaml UTF-X
encoded strings. It is distributed under the BSD3 license.

[1]: <http://www.unicode.org/reports/tr29/>
[2]: <http://www.unicode.org/reports/tr14/>

Home page: <http://erratique.ch/software/uuseg>
API Docs: <http://erratique.ch/software/uuseg/doc>

This library is useful if you need to find the user-perceived characters in
Unicode data -- grapheme clusters in Unicode terminology -- e.g. for breaking
strings, cursor movement, text selection, backspace deletion, or fixed-width
layouts of Unicode data (see the end of this email for applications with
Format). It can also be used to break text into words and sentences or to
detect line break opportunities, see again the end of this email.

Note that these algorithms are locale-independent and may not work well on
all the scripts defined in Unicode or on the actual language you are dealing
with. Do not take these outputs as a silver bullet and refer to the standards
above for further information about their limitations and how to alleviate
them. For that reason the library offers a very crude, low-level API ---
basically a state machine --- to define your own segmenters so that the same
high-level API can be reused with customizations.

Because of time constraints the current implementation of the algorithms is
not particularly clever (and sometimes ugly). They were done manually in an
ad-hoc manner. There's room for improvement, e.g. to devise more mechanical
procedures to get from the rules to the implementation -- this would also
make it possible to tap into Unicode's Common Locale Data Repository, which
does include locale dependent tailoring for segmentation in (supposedly)
machine readable formats [3]. If you are interested in working on (or in
funding) this, get in touch.

I do not expect the API to change much, but as usual things may change before
a 1.0.0.

Happy grapheme clustering,

Daniel

[3] <http://www.unicode.org/reports/tr35/tr35-general.html#Segmentations>

# Uuseg and Format

Uuseg can be used to improve OCaml's Format's capabilities on Unicode's
alphabetic scripts and symbols.
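
Format needs this help because it measures a plain string by its length in
bytes, not by the number of user-perceived characters. The sketch below is
not from the announcement: it uses only the standard library (no Uuseg), and
`render` is a hypothetical helper introduced for illustration. It prints a
decomposed é (three bytes in UTF-8) four times with a margin of 10, first as
a plain string and then through `Format.pp_print_as`, which lets the caller
state the display width explicitly:

```ocaml
(* Standard library only; [render] is a hypothetical helper for this sketch. *)

(* Print [n] copies of an item, each followed by a break hint, inside a
   horizontal-or-vertical box with the right margin set to 10 columns,
   and return everything that was printed. *)
let render (pp : Format.formatter -> unit) (n : int) : string =
  let buf = Buffer.create 64 in
  let ppf = Format.formatter_of_buffer buf in
  Format.pp_set_margin ppf 10;
  Format.pp_open_hovbox ppf 0;
  for _i = 1 to n do
    pp ppf;
    Format.pp_print_cut ppf ()
  done;
  Format.pp_close_box ppf ();
  Format.pp_print_flush ppf ();
  Buffer.contents buf

(* Decomposed é: U+0065 U+0301, three bytes in UTF-8. *)
let decomposed = "e\xCC\x81"

(* Format counts 3 columns per item, so 4 items overflow 10 columns. *)
let as_bytes = render (fun ppf -> Format.pp_print_string ppf decomposed) 4

(* Format is told each item is 1 column wide, so 4 items fit on one line. *)
let as_one_column =
  render (fun ppf -> Format.pp_print_as ppf 1 decomposed) 4

let () =
  Printf.printf "plain string broke the line: %b\n"
    (String.contains as_bytes '\n');
  Printf.printf "pp_print_as broke the line: %b\n"
    (String.contains as_one_column '\n')
```

This only illustrates the width problem. Deciding where one grapheme cluster
ends and the next begins is the part that needs a segmentation library, and
the announcement does not say how `Uuseg_string` is implemented internally.
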

There are a few utility functions in the optional `Uuseg_string` library
(you'll need to depend on `Uutf`). The most basic function
`Uuseg_string.pp_utf_8` simply instructs the Format engine to treat grapheme
clusters as a single character (see below) rather than laying out their byte
or decomposed representation like a "%s" format would do:

# #require "uuseg.string";;

(* Pre-composed é (U+00E9) *)

# (* broken *)
  for i = 0 to 76 do Format.printf "%s@," "\xC3\xA9" done;;
éééééééééééééééééééééééééééééééééééééé
éééééééééééééééééééééééééééééééééééééé
é
# for i = 0 to 76 do Format.printf "%a@," Uuseg_string.pp_utf_8 "é" done;;
ééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééé

(* Decomposed é: e + ´ (U+0065, U+0301) *)

# (* broken *)
  for i = 0 to 76 do Format.printf "%s@," "\x65\xCC\x81" done;;
ééééééééééééééééééééééééé
ééééééééééééééééééééééééé
ééééééééééééééééééééééééé
éé
# for i = 0 to 76 do Format.printf "%a@," Uuseg_string.pp_utf_8 "\x65\xCC\x81" done;;
ééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééé
- : unit = ()

Then we have `Uuseg_string.pp_utf_8_text` which instructs Format about break
opportunities and mandatory line breaks according to the Unicode line
breaking algorithm. For example, here is a layout of some Georgian, the first
paragraph of OCaml's Georgian Wikipedia page (which btw could benefit from an
update if someone on the list has the ability):

<http://ka.wikipedia.org/wiki/%E1%83%9D%E1%83%91%E1%83%98%E1%83%94%E1%83%A5%E1%83%A2%E1%83%A3%E1%83%A0%E1%83%98_%E1%83%99%E1%83%90%E1%83%9B%E1%83%9A%E1%83%98>

# let g = "????????? ????, ????? (Objective Caml, Ocaml) ??????????? ???????? \
           ????????????? ????, ??????? ???? ???????? ???????? ????????????? \
           ?????????????????. ?? ??? ???????? ?????? ?????, ????? ??????, \
           ?????? ??????? ?? ??????? ???? 1996 ????. ???? ???????, ??????????? \
           ?? ?????????? ??????? ???????? ??????? ???????????.";;
# Format.set_margin 25;;
# Format.printf "@[%a@]" Uuseg_string.pp_utf_8_text g;;
????????? ????, ????? (Objective Caml, Ocaml) ??????????? ???????? ????????????? ????, ??????? ???? ???????? ???????? ????????????? ?????????????????. ?? ??? ???????? ?????? ?????, ????? ??????, ?????? ??????? ?? ??????? ???? 1996 ????. ???? ???????, ??????????? ?? ?????????? ??????? ???????? ??????? ???????????.

Soft hyphens are handled by the line breaking algorithm, however:

1. Your rendering software may sadly print them, which defeats the purpose
   (e.g. macosx's terminal).
2. It's not possible to replace the soft hyphen by a hard one if the line
   gets broken at that point, as there's no provision for this in Format
   (which would also be useful to pp outputs that have line continuation
   characters). Though I guess `Format.pp_set_formatter_output_functions`
   could be investigated for that.

# Format.set_margin 10;;
# let h = "hy\xC2\xADphen\xC2\xADat\xC2\xADed";;
# Format.printf "@[%a@]" Uuseg_string.pp_utf_8_text h;;
hyphen
ated

========================================================================
3) Other OCaml News
------------------------------------------------------------------------
** From the ocamlcore planet blog:

Thanks to Alp Mestan, we now include in the OCaml Weekly News the links to
the recent posts from the ocamlcore planet blog at <http://planet.ocaml.org/>.

2014:
  <https://gaiustech.wordpress.com/2014/12/28/2014/>

Cryptokit 1.10 released:
  <https://forge.ocamlcore.org/forum/forum.php?forum_id=919>

Uuseg 0.8.0:
  <http://erratique.ch/software/uuseg>

LBFGS 0.8.6 released:
  <https://forge.ocamlcore.org/forum/forum.php?forum_id=918>

========================================================================
Old cwn
------------------------------------------------------------------------

If you happen to miss a CWN, you can send me a message
(alan.schm...@polytechnique.org) and I'll mail it to you, or go take a look
at the archive (<http://alan.petitepomme.net/cwn/>) or the RSS feed of the
archives (<http://alan.petitepomme.net/cwn/cwn.rss>). If you also wish to
receive it every week by mail, you may subscribe online at
<http://lists.idyll.org/listinfo/caml-news-weekly/>.

========================================================================

--
OpenPGP Key ID : 040D0A3B4ED2E5C7