subject:"Re\: \[Haskell\-cafe\] Re\: String vs ByteString"

Alright, here's the results for the first three in the list (please forgive
me for being lazy- I am a Haskell programmer after all):

ifeng.com:
UTF8: 299949
UTF16: 566610

dzh.mop.com:
GBK: 1866
UTF8: 1891
UTF16: 3684

www.csdn.net:
UTF8: 122870
UTF16: 217420

Seems like UTF8 is a consistent winner versus UTF16, and not much of a loser
to the native formats.

Michael

On Wed, Aug 18, 2010 at 11:01 AM, anderson leo fireman...@gmail.com wrote:

 More typical Chinese web sites:
 www.ifeng.com (web site likes nytimes)
 dzh.mop.com   (community for fun)
 www.csdn.net  (web site for IT)
 www.sohu.com(web site like yahoo)
 www.sina.com (web site like yahoo)

 -- Andrew


 On Wed, Aug 18, 2010 at 11:40 AM, Michael Snoyman mich...@snoyman.comwrote:

 Well, I'm not certain if it counts as a typical Chinese website, but here
 are the stats;

 UTF8: 64,198
 UTF16: 113,160

 And just for fun, after gziping:

 UTF8: 17,708
 UTF16: 19,367


 On Wed, Aug 18, 2010 at 2:59 AM, anderson leo fireman...@gmail.comwrote:

 Hi michael, here is a web site http://zh.wikipedia.org/zh-cn/. It is the
 wikipedia for Chinese.

 -Andrew

 On Tue, Aug 17, 2010 at 7:00 PM, Michael Snoyman mich...@snoyman.comwrote:



 On Tue, Aug 17, 2010 at 1:50 PM, Yitzchak Gale g...@sefer.org wrote:

 Ketil Malde wrote:
  I haven't benchmarked it, but I'm fairly sure that, if you try to fit
 a
  3Gbyte file (the Human genome, say¹), into a computer with 4Gbytes of
  RAM, UTF-16 will be slower than UTF-8...

 I don't think the genome is typical text. And
 I doubt that is true if that text is in a CJK language.

  I think that *IF* we are aiming for a single, grand, unified text
  library to Rule Them All, it needs to use UTF-8.

 Given the growth rate of China's economy, if CJK isn't
 already the majority of text being processed in the world,
 it will be soon. I have seen media reports claiming CJK is
 now a majority of text data going over the wire on the web,
 though I haven't seen anything scientific backing up those claims.
 It certainly seems reasonable. I believe Google's measurements
 based on their own web index showing wide adoption of UTF-8
 are very badly skewed due to a strong Western bias.

 In that case, if we have to pick one encoding for Data.Text,
 UTF-16 is likely to be a better choice than UTF-8, especially
 if the cost is fairly low even for the special case of Western
 languages. Also, UTF-16 has become by far the dominant internal
 text format for most software and for most user platforms.
 Except on desktop Linux - and whether we like it or not, Linux
 desktops will remain a tiny minority for the foreseeable future.

  I think you are conflating two points here, and ignoring some
 important data. Regarding the data: you haven't actually quoted any
 statistics about the prevalence of CJK data, but even if the majority of 
 web
 pages served are in those three languages, a fairly high percentage of the
 content will *still* be ASCII, due simply to the HTML, CSS and Javascript
 overhead. I'd hate to make up statistics on the spot, especially when I
 don't have any numbers from you to compare them with.

 As far as the conflation, there are two questions with regard to the
 encoding choice: encoding/decoding time and space usage. I don't think
 *anyone* is asserting that UTF-16 is a common encoding for files anywhere,
 so by using UTF-16 we are simply incurring an overhead in every case. We
 can't consider a CJK encoding for text, so its prevalence is irrelevant to
 this topic. What *is* relevant is that a very large percentage of web pages
 *are*, in fact, standardizing on UTF-8, and that all 7-bit text files are 
 by
 default UTF-8.

 As far as space usage, you are correct that CJK data will take up more
 memory in UTF-8 than UTF-16. The question still remains whether the overall
 document size will be larger: I'd be interested in taking a random sampling
 of CJK-encoded pages and comparing their UTF-8 and UTF-16 file sizes. I
 think simply talking about this in the vacuum of data is pointless. If
 anyone can recommend a CJK website which would be considered representative
 (or a few), I'll do the test myself.

 Michael

 ___
 Haskell-Cafe mailing list
 Haskell-Cafe@haskell.org
 http://www.haskell.org/mailman/listinfo/haskell-cafe





___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re: [Haskell-cafe] Re: String vs ByteString

On Wed, Aug 18, 2010 at 2:12 AM, John Meacham j...@repetae.net wrote:

 ranty thing to follow
 That said, there is never a reason to use UTF-16, it is a vestigial
 remanent from the brief period when it was thought 16 bits would be
 enough for the unicode standard, any defense of it nowadays is after the
 fact justification for having accidentally standardized on it back in
 the day.


This is false. Text uses UTF-16 internally as early benchmarks indicated
that it was faster. See Tom Harper's response to the other thread that was
spawned of this thread by Ketil.

Text continues to be UTF-16 today because

* no one has written a benchmark that shows that UTF-8 would be faster
*for use in Data.Text*, and
* no one has written a patch that converts Text to use UTF-8 internally.

I'm quite frustrated by this whole discussion; there's lots of talking, no
coding, and only a little benchmarking (of web sites, not code). This will
get us nowhere.

Cheers,
Johan
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re: [Haskell-cafe] Re: String vs ByteString

2010-08-18 Thread Ivan Lazar Miljenovic

Johan Tibell johan.tib...@gmail.com writes:

 Text continues to be UTF-16 today because

 * no one has written a benchmark that shows that UTF-8 would be faster
 *for use in Data.Text*, and
 * no one has written a patch that converts Text to use UTF-8 internally.

 I'm quite frustrated by this whole discussion; there's lots of talking, no
 coding, and only a little benchmarking (of web sites, not code). This will
 get us nowhere.

This was my impression as well.  If someone desperately wants Text to
use UTF-8 internally, why not help code such a change rather than just
waving the suggestion around in the air?

-- 
Ivan Lazar Miljenovic
ivan.miljeno...@gmail.com
IvanMiljenovic.wordpress.com
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re: [Haskell-cafe] Re: String vs ByteString

On Wed, Aug 18, 2010 at 2:39 PM, Johan Tibell johan.tib...@gmail.comwrote:

 On Wed, Aug 18, 2010 at 2:12 AM, John Meacham j...@repetae.net wrote:

 ranty thing to follow
 That said, there is never a reason to use UTF-16, it is a vestigial
 remanent from the brief period when it was thought 16 bits would be
 enough for the unicode standard, any defense of it nowadays is after the
 fact justification for having accidentally standardized on it back in
 the day.


 This is false. Text uses UTF-16 internally as early benchmarks indicated
 that it was faster. See Tom Harper's response to the other thread that was
 spawned of this thread by Ketil.

 Text continues to be UTF-16 today because

 * no one has written a benchmark that shows that UTF-8 would be faster
 *for use in Data.Text*, and
 * no one has written a patch that converts Text to use UTF-8
 internally.

 I'm quite frustrated by this whole discussion; there's lots of talking, no
 coding, and only a little benchmarking (of web sites, not code). This will
 get us nowhere.

 Here's my response to the two points:

* I haven't written a patch showing that Data.Text would be faster using
UTF-8 because that would require fulfilling the second point (I'll get to in
a second). I *have* shown where there are huge performance differences
between text and ByteString/String. Unfortunately, the response has been
don't use bytestring, it's the wrong datatype, text will get fixed, which
is quite underwhelming.

* Since the prevailing attitude has been such a disregard to any facts shown
thus far, it seems that the effort required to learn the internals of the
text package and attempt a patch would be wasted. In the meanwhile, Jasper
has released blaze-builder which does an amazing job at producing UTF-8
encoded data, which for the moment is my main need. As much as I'll be
chastised by the community, I'll stick with this approach for the moment.

Now if you tell me that text would consider applying a UTF-8 patch, that
would be a different story. But I don't have the time to maintain a separate
UTF-8 version of text. For me, the whole point of this discussion was to
determine whether we should attempt porting to UTF-8, which as I understand
it would be a rather large undertaking.

Michael
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re: [Haskell-cafe] Re: String vs ByteString

2010-08-18 Thread Duncan Coutts

On 18 August 2010 15:04, Michael Snoyman mich...@snoyman.com wrote:

 For me, the whole point of this discussion was to
 determine whether we should attempt porting to UTF-8, which as I understand
 it would be a rather large undertaking.

And the answer to that is, yes but only if we have good reason to
believe it will actually be faster, and that's where we're most
interested in benchmarks rather than hand waving.

As Johan and others have said, the original choice to use UTF16 was
based on benchmarks showing it was faster (than UTF8 or UTF32). So if
we want to counter that then we need either to argue that these were
the wrong choice of benchmarks that do not reflect real usage, or that
with better implementations that the balance would shift.

Now there is an interesting argument to claim that we spend more time
shovling strings about than we do actually processing them in any
interesting way and therefore that we should pick benchmarks that
reflect that. This would then shift the balance to favour the internal
representation being identical to some particular popular external
representation --- even if that internal representation is slower for
many processing tasks.

Duncan
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re: [Haskell-cafe] Re: String vs ByteString

Hi Michael,

On Wed, Aug 18, 2010 at 4:04 PM, Michael Snoyman mich...@snoyman.comwrote:

 Here's my response to the two points:

 * I haven't written a patch showing that Data.Text would be faster using
 UTF-8 because that would require fulfilling the second point (I'll get to in
 a second). I *have* shown where there are huge performance differences
 between text and ByteString/String. Unfortunately, the response has been
 don't use bytestring, it's the wrong datatype, text will get fixed, which
 is quite underwhelming.


I went through all the emails you sent on with topic String vs ByteString
and Re: String vs ByteString and I can't find a single benchmark. I do
agree with you that

* UTF-8 is more compact than UTF-16, and
* UTF-8 is by far the most used encoding on the web.

and that establishes a reasonable *theoretical* argument for why switching
to UTF-8 might be faster.

What I'm looking for is a program that shows a big difference so we can
validate the hypothesis. As Duncan mentioned we already ran some benchmarks
early on the showed the opposite. Someone posted a benchmark earlier in this
thread and Bryan addressed the issue raised by that poster. We want more of
those.


 * Since the prevailing attitude has been such a disregard to any facts
 shown thus far, it seems that the effort required to learn the internals of
 the text package and attempt a patch would be wasted. In the meanwhile,
 Jasper has released blaze-builder which does an amazing job at producing
 UTF-8 encoded data, which for the moment is my main need. As much as I'll be
 chastised by the community, I'll stick with this approach for the moment.


I'm not sure this discussion has surfaced that many facts. What we do have
is plenty of theories. I can easily add some more:

* GHC is not doing a good job laying out the branches in the validation
code that does arithmetic on the input byte sequence, to validate the input
and compute the Unicode code point that should be streamed using fusion.

* The differences in text and bytestring's fusion framework get in the
way of some optimization in GHC (text uses a more sophisticated fusion
frameworks that handles some cases bytestring can't according to Bryan).

* Lingering space leaks is hurting performance (Bryan plugged one
already).

* The use of a polymorphic loop state in the fusion framework gets in
the way of unboxing.

* Extraneous copying in the Handle implementation slows down I/O.

All these are plausible reasons why Text might perform worse than
ByteString. We need find out why ones are true by benchmarking and looking
at the generated Core.


  Now if you tell me that text would consider applying a UTF-8 patch, that
 would be a different story. But I don't have the time to maintain a separate
 UTF-8 version of text. For me, the whole point of this discussion was to
 determine whether we should attempt porting to UTF-8, which as I understand
 it would be a rather large undertaking.


I don't see any reason why Bryan wouldn't accept an UTF-8 patch if it was
faster on some set of benchmarks (starting with the ones already in the
library) that we agree on.

Cheers,
Johan
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re: [Haskell-cafe] Re: String vs ByteString

On Wed, Aug 18, 2010 at 4:12 AM, wren ng thornton w...@freegeek.org wrote:

 There was a study recently on this. They found that there are four main
 parts of the Internet:

 * a densely connected core, where from any site you can get to any other
 * an in cone, from which you can reach the core (but not other in-cone
 members, since then you'd both be in the core)
 * an out cone, which can be reached from the core (but which cannot reach
 each other)
 * and, unconnected islands

 The surprising part is they found that all four parts are approximately the
 same size. I forget the exact numbers, but they're all 25+/-5%.

 This implies that an exhaustive crawl of the web would require having about
 50% of all websites as seeds (the in-cone plus the islands). If we're only
 interested in a representative sample, then we could get by with fewer.
 However, that depends a lot on the definition of representative. And we
 can't have an accurate definition of representative without doing the entire
 crawl at some point in order to discover the appropriate distributions. Then
 again, distributions change over time...

 Thus, I would guess that Google only has 50~75% of the net: the core, the
 out-cone, and a fraction of the islands and in-cone.


That's an interesting result.

However, if you weigh each page with its page views you'll probably find
that Google (and other search engines) probably cover much more than that
since page views on sites tend to follow a power-law distribution.

-- Johan
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re: [Haskell-cafe] Re: String vs ByteString

On Wed, Aug 18, 2010 at 6:24 PM, Johan Tibell johan.tib...@gmail.comwrote:

 Hi Michael,


 On Wed, Aug 18, 2010 at 4:04 PM, Michael Snoyman mich...@snoyman.comwrote:

 Here's my response to the two points:

 * I haven't written a patch showing that Data.Text would be faster using
 UTF-8 because that would require fulfilling the second point (I'll get to in
 a second). I *have* shown where there are huge performance differences
 between text and ByteString/String. Unfortunately, the response has been
 don't use bytestring, it's the wrong datatype, text will get fixed, which
 is quite underwhelming.


 I went through all the emails you sent on with topic String vs ByteString
 and Re: String vs ByteString and I can't find a single benchmark. I do
 agree with you that

 * UTF-8 is more compact than UTF-16, and
 * UTF-8 is by far the most used encoding on the web.

 and that establishes a reasonable *theoretical* argument for why switching
 to UTF-8 might be faster.

 What I'm looking for is a program that shows a big difference so we can
 validate the hypothesis. As Duncan mentioned we already ran some benchmarks
 early on the showed the opposite. Someone posted a benchmark earlier in this
 thread and Bryan addressed the issue raised by that poster. We want more of
 those.


Sorry, I thought I'd sent these out. While working on optimizing Hamlet I
started playing around with the BigTable benchmark. I wrote two blog posts
on the topic:

http://www.snoyman.com/blog/entry/bigtable-benchmarks/
http://www.snoyman.com/blog/entry/optimizing-hamlet/

Originally, Hamlet had been based on the text package; the huge slow-down
introduced by text convinced me to migrate to bytestrings, and ultimately
blaze-html/blaze-builder. It could be that these were flaws in text that are
correctable and have nothing to do with UTF-16; however, it will be
difficult to produce a benchmark purely on the UTF-8/UTF-16 divide. Using
UTF-16 bytestrings would probably overstate the impact since it wouldn't be
using Bryan's fusion logic.

* Since the prevailing attitude has been such a disregard to any facts shown
 thus far, it seems that the effort required to learn the internals of the
 text package and attempt a patch would be wasted. In the meanwhile, Jasper
 has released blaze-builder which does an amazing job at producing UTF-8
 encoded data, which for the moment is my main need. As much as I'll be
 chastised by the community, I'll stick with this approach for the moment.


 I'm not sure this discussion has surfaced that many facts. What we do have
 is plenty of theories. I can easily add some more:

 * GHC is not doing a good job laying out the branches in the validation
 code that does arithmetic on the input byte sequence, to validate the input
 and compute the Unicode code point that should be streamed using fusion.

 * The differences in text and bytestring's fusion framework get in the
 way of some optimization in GHC (text uses a more sophisticated fusion
 frameworks that handles some cases bytestring can't according to Bryan).

 * Lingering space leaks is hurting performance (Bryan plugged one
 already).

 * The use of a polymorphic loop state in the fusion framework gets in
 the way of unboxing.

 * Extraneous copying in the Handle implementation slows down I/O.

 All these are plausible reasons why Text might perform worse than
 ByteString. We need find out why ones are true by benchmarking and looking
 at the generated Core.


 Now if you tell me that text would consider applying a UTF-8 patch, that
 would be a different story. But I don't have the time to maintain a separate
 UTF-8 version of text. For me, the whole point of this discussion was to
 determine whether we should attempt porting to UTF-8, which as I understand
 it would be a rather large undertaking.


 I don't see any reason why Bryan wouldn't accept an UTF-8 patch if it was
 faster on some set of benchmarks (starting with the ones already in the
 library) that we agree on.

 I think that's the main issue, and one that Duncan nailed on the head: we
have to think about what are the important benchmarks. For Hamlet, I need
fast UTF-8 bytestring generation. I don't care at all about algorithmic
speed for split texts, as an example. My (probably uneducated) guess is that
UTF-16 tends to perform many operations in memory faster since almost all
characters are represented as 16 bits, while the big benefit for UTF-8 is in
reading UTF-8 data, rendering UTF-8 data and decreased memory usage. But as
I said, that's an (uneducated) guess.

Some people have been floating the idea of multiple text packages. I
personally would *not* want to go down that road, but it might be the only
approach that allows top performance for all use cases. As is, I'm quite
happy using blaze-builder for Hamlet.

Michael
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org

Re: [Haskell-cafe] Re: String vs ByteString

2010-08-18 Thread Bryan O'Sullivan

On Wed, Aug 18, 2010 at 10:12 AM, Michael Snoyman mich...@snoyman.comwrote:


 While working on optimizing Hamlet I started playing around with the
 BigTable benchmark. I wrote two blog posts on the topic:


 http://www.snoyman.com/blog/entry/bigtable-benchmarks/
 http://www.snoyman.com/blog/entry/optimizing-hamlet/


Even though your benchmark didn't explicitly come up in this thread, Johan
and I spent some time improving the performance of Text for it. As a result,
in darcs HEAD, Text is faster than String, but slower than ByteString. I'd
certainly like to close that gap more aggressively.

If the other contributors to this thread took just one minute to craft a
benchmark they cared about for every ten minutes they spend producing hot
air, we'd be a lot better off.


 It could be that these were flaws in text that are correctable and have
 nothing to do with UTF-16;


Since the internal representation used by text is completely opaque, we
could of course change it if necessary, with no user-visible consequences.
I've yet to see any data that suggests that it's specifically UTF-16 that is
related to any performance shortfalls, however.

Some people have been floating the idea of multiple text packages. I
 personally would *not* want to go down that road, but it might be the only
 approach that allows top performance for all use cases.


I'd be surprised if that proves necessary.
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re: [Haskell-cafe] Re: String vs ByteString

On Wed, Aug 18, 2010 at 7:12 PM, Michael Snoyman mich...@snoyman.comwrote:

 On Wed, Aug 18, 2010 at 6:24 PM, Johan Tibell johan.tib...@gmail.comwrote:



 Sorry, I thought I'd sent these out. While working on optimizing Hamlet I
 started playing around with the BigTable benchmark. I wrote two blog posts
 on the topic:

 http://www.snoyman.com/blog/entry/bigtable-benchmarks/
 http://www.snoyman.com/blog/entry/optimizing-hamlet/

 Originally, Hamlet had been based on the text package; the huge slow-down
 introduced by text convinced me to migrate to bytestrings, and ultimately
 blaze-html/blaze-builder. It could be that these were flaws in text that are
 correctable and have nothing to do with UTF-16; however, it will be
 difficult to produce a benchmark purely on the UTF-8/UTF-16 divide. Using
 UTF-16 bytestrings would probably overstate the impact since it wouldn't be
 using Bryan's fusion logic.


Those are great. As Bryan mentioned we've already improved performance and I
think I know how to improve it further.

I appreciate that it's difficult to show the UTF-8/UTF-16 divide. I think
the approach we're trying at the moment is looking at benchmarks, improving
performance, and repeating until we can't improve anymore. It could be the
case that we get a benchmark where the performance difference between
bytestring and text cannot be explained/fixed by factors other than changing
the internal encoding. That would be strong evidence that we should try to
switch the internal encoding. We haven't seen any such benchmarks yet.

As for blaze I'm not sure exactly how it deals with UTF-8 input. I tried to
browse through the repo but could find that input ByteStrings are actually
validated anywhere. If they're not it's a big generous to say that it deals
with UTF-8 data, as it would really just be concatenating byte sequences,
without validating them. We should ask Jasper about the current state.


 I don't see any reason why Bryan wouldn't accept an UTF-8 patch if it was
 faster on some set of benchmarks (starting with the ones already in the
 library) that we agree on.

 I think that's the main issue, and one that Duncan nailed on the head: we
 have to think about what are the important benchmarks. For Hamlet, I need
 fast UTF-8 bytestring generation. I don't care at all about algorithmic
 speed for split texts, as an example. My (probably uneducated) guess is that
 UTF-16 tends to perform many operations in memory faster since almost all
 characters are represented as 16 bits, while the big benefit for UTF-8 is in
 reading UTF-8 data, rendering UTF-8 data and decreased memory usage. But as
 I said, that's an (uneducated) guess.


I agree. Lets create some more benchmarks.

For example, lately I've been working on a benchmark, inspired by a real
world problem, where I iterate over the lines in a ~500 MBs file, encoded
using UTF-8 data, inserting each line in a Data.Map and do a bunch of
further processing on it (such as splitting the strings into words). This
tests text I/O throughput, memory overhead, performance of string
comparison, etc.

We already have benchmarks for reading files (in UTF-8) in several different
ways (lazy I/O and iteratee style folds).

Boil down the things you care about into a self contained benchmark and send
it to this list or put it somewhere were we can retrieve it.

Cheers,
Johan
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re: [Haskell-cafe] Re: String vs ByteString

On Wed, Aug 18, 2010 at 11:58 PM, Johan Tibell johan.tib...@gmail.comwrote:

 As for blaze I'm not sure exactly how it deals with UTF-8 input. I tried to
 browse through the repo but could find that input ByteStrings are actually
 validated anywhere. If they're not it's a big generous to say that it deals
 with UTF-8 data, as it would really just be concatenating byte sequences,
 without validating them. We should ask Jasper about the current state.



As far as I can tell, Blaze *never* validates input ByteStrings. The
proper approach to inserting data into blaze is either via String or Text.
I requested that Jasper provide an unsafeByteString function in Blaze for
Hamlet's usage: Hamlet does the UTF-8 encoding at compile time and is able
to gain a little extra performance boost.

If you want to properly validate bytestrings before inputing them, I believe
the best approach would be to use utf8-string or text to read in the
bytestrings, but Jasper may have a better approach.

Michael
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re: [Haskell-cafe] Re: String vs ByteString

Benedikt Huber benj...@gmx.net writes:

 Despite of all this, I think the performance of the text
 package is very promising, and hope it will improve further!

I agree, Data.Text is great.  Unfortunately, its internal use of UTF-16
makes it inefficient for many purposes.

A large fraction - probably most - textual data isn't natural language
text, but data formatted in textual form, and these formats are
typically restricted to ASCII (except for a few text fields).

For instance, a typical project for me might be 10-100GB of data, mostly
in various text formats, real text only making up a few percent of
this.  The combined (all languages) Wikipedia is 2G words, probably less
than 20GB. 

Being agnostic about string encoding - viz. treating it as bytes - works
okay, but it would be nice to allow Unicode in the bits that actually
are text, like string fields and labels and such.

Due to the sizes involved, I think that in order to efficiently process
text-formatted data, UTF-8 is the no-brainer choice for encoding --
certainly in storage, but also for in-memory processing. Unfortunately,
there is no clear Data.Text-like effort here.  There's (at least):

utf8-string - provides utf-8 encoded lazy and strict bytestrings as
  well as some other data types (and a common class) and
  System.Environment functionality.

utf8-light  - provides encoding/decoding to/from (strict?) bytestrings

regex-tdfa-utf8  - regular expressions on UTF-8 encoded lazy bytestrings
utf8-env- provides an UTF8 aware System.Environment

uhexdump   - hex dumps for UTF-8 (?)

compact-string - support for many different string encodings
compact-string-fix - indicates that the above is unmaintained

From a quick glance, it appears that utf8-string is the most complete
and well maintained of the crowd, but I could be wrong.  It'd be nice if
a similar effort as Data.Text has seen could be applied to
e.g. utf8-string, to produce a similarly efficient and effective library
and allow the deprecation of the others.  IMO, this could in time
replace .Char8 as the default ByteString string representation.
Hackathon, anyone? 

-k
-- 
If I haven't seen further, it is by standing in the footprints of giants
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re: [Haskell-cafe] Re: String vs ByteString

On Tue, Aug 17, 2010 at 10:08 AM, Ketil Malde ke...@malde.org wrote:

 Benedikt Huber benj...@gmx.net writes:

  Despite of all this, I think the performance of the text
  package is very promising, and hope it will improve further!

 I agree, Data.Text is great.  Unfortunately, its internal use of UTF-16
 makes it inefficient for many purposes.

 [..]

From a quick glance, it appears that utf8-string is the most complete
 and well maintained of the crowd, but I could be wrong.  It'd be nice if
 a similar effort as Data.Text has seen could be applied to
 e.g. utf8-string, to produce a similarly efficient and effective library
 and allow the deprecation of the others.  IMO, this could in time
 replace .Char8 as the default ByteString string representation.
 Hackathon, anyone?

 Let me ask the question a different way: what are the motivations for
having the text package use UTF-16 internaly? I know that some system APIs
in Windows use it (at least, I think they do), and perhaps it's more
efficient for certain types of processing, but overall do those benefits
outweigh all of the reasons for UTF-8 pointed out in this thread?

Michael
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re: [Haskell-cafe] Re: String vs ByteString

On Tue, Aug 17, 2010 at 9:08 AM, Ketil Malde ke...@malde.org wrote:

 Benedikt Huber benj...@gmx.net writes:

  Despite of all this, I think the performance of the text
  package is very promising, and hope it will improve further!

 I agree, Data.Text is great.  Unfortunately, its internal use of UTF-16
 makes it inefficient for many purposes.


It's not clear to me that using UTF-16 internally does make Data.Text
noticeably slower. If we could get conclusive evidence that using UTF-16
hurts performance, we could look into changing the internal representation
(a major undertaking). What Bryan and I need is benchmarks showing where
Data.Text is performing poorly, compare to String or ByteString, so we can
investigate the cause(s).

Hypothesis are a good starting point for performance improvements, but
they're not enough. We need benchmarks and people looking at profiling and
compiler output to really understand what's going on. For example, how many
know that the Handle implementations copies the input first into a mutable
buffer and then into a Text value, for reads less than the buffer size (8k
if I remember correctly). One of these copies could be avoided. How do we
know that it's using UTF-16 that's our current performance bottleneck and
not this extra copy? We need to benchmark, change the code, and then
benchmark again.

Perhaps the outcome of all the benchmarking and investigation is indeed that
UTF-16 is a problem; then we can change the internal encoding. But there are
other possibilities, like poorly laid out branches in the generated code. We
need to understand what's going on if we are to make progress.

A large fraction - probably most - textual data isn't natural language
 text, but data formatted in textual form, and these formats are
 typically restricted to ASCII (except for a few text fields).

 For instance, a typical project for me might be 10-100GB of data, mostly
 in various text formats, real text only making up a few percent of
 this.  The combined (all languages) Wikipedia is 2G words, probably less
 than 20GB.


I think this is an important observation.

Cheers,
Johan
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re: [Haskell-cafe] Re: String vs ByteString

2010-08-17 Thread Tom Harper

 I agree, Data.Text is great.  Unfortunately, its internal use of UTF-16
 makes it inefficient for many purposes.

In the first iteration of the Text package, UTF-16 was chosen because
it had a nice balance of arithmetic overhead and space.  The
arithmetic for UTF-8 started to have serious performance impacts in
situations where the entire document was outside ASCII (i.e. a Russian
or Arabic document), but UTF-16 was still relatively compact, compared
to both the UTF-32 and String alternatives.  This, however, obviously
does not represent your use case.   I don't know if your use case is
the more common one (though it seems likely).

The underlying principles of Text should work fine with UTF-8.  It has
changed a lot since its original writing (thanks to some excellent
tuning and maintenance by bos), including some more efficient binary
arithmetic.  The situation may have changed with respect to the
performance limitations of UTF-8, or there may be room for it and a
UTF-16 version.  Any takers for implementing a UTF-8 version and
comparing the two?


 A large fraction - probably most - textual data isn't natural language
 text, but data formatted in textual form, and these formats are
 typically restricted to ASCII (except for a few text fields).

 For instance, a typical project for me might be 10-100GB of data, mostly
 in various text formats, real text only making up a few percent of
 this.  The combined (all languages) Wikipedia is 2G words, probably less
 than 20GB.

 Being agnostic about string encoding - viz. treating it as bytes - works
 okay, but it would be nice to allow Unicode in the bits that actually
 are text, like string fields and labels and such.

Is your point that ASCII characters take up the same amount of space
(i.e. 16 bits) as higher code points? Do you have any comparisons that
quantify how much this affects your ability to process text in real
terms?  Does it make it too slow? Infeasible memory-wise?
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re: [Haskell-cafe] Re: String vs ByteString

Johan Tibell johan.tib...@gmail.com writes:

 It's not clear to me that using UTF-16 internally does make Data.Text
 noticeably slower. 

I haven't benchmarked it, but I'm fairly sure that, if you try to fit a
3Gbyte file (the Human genome, say¹), into a computer with 4Gbytes of
RAM, UTF-16 will be slower than UTF-8.  Many applications will get away
with streaming over data, retaining only a small part, but some won't.

In other cases (e.g. processing CJK text, and perhap also
non-Latin1 text), I'm sure it'll be faster - but my (still
unsubstantiated) guess is that the difference will be much smaller, and
it'll be a case of winning some and losing some - and I'd also
conjecture that having 3Gb real text (i.e. natural language, as
opposed to text-formatted data) is rare.

I think that *IF* we are aiming for a single, grand, unified text
library to Rule Them All, it needs to use UTF-8.  Alternatively, we
can have different libraries with different representations for
different purposes, where you'll get another few percent of juice by
switching to the most appropriate.

Currently the latter approach looks to be in favor, so if we can't have
one single library, let us at least aim for a set of libraries with
consistent interfaces and optimal performance.  Data.Text is great for
UTF-16, and I'd like to have something similar for UTF-8.  Is all I'm
trying to say.

-k
-- 
If I haven't seen further, it is by standing in the footprints of giants
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re: [Haskell-cafe] Re: String vs ByteString

Ketil Malde ke...@malde.org writes:

 Johan Tibell johan.tib...@gmail.com writes:

 It's not clear to me that using UTF-16 internally does make Data.Text
 noticeably slower. 

 I haven't benchmarked it, but I'm fairly sure that, if you try to fit a
 3Gbyte file (the Human genome, say¹), into a computer with 4Gbytes of
 RAM, UTF-16 will be slower than UTF-8.  Many applications will get away
 with streaming over data, retaining only a small part, but some won't.

Seeing as how the genome just uses 4 base letters, wouldn't it be
better to not treat it as text but use something else?  Or do you just
mean storage-wise to be able to be read in a text editor, etc. as well
(in case someone is trying to do their mad genetic manipulation by
hand)?

-- 
Ivan Lazar Miljenovic
ivan.miljeno...@gmail.com
IvanMiljenovic.wordpress.com
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re: [Haskell-cafe] Re: String vs ByteString

2010-08-17 Thread Colin Paul Adams

 Ketil == Ketil Malde ke...@malde.org writes:

Ketil Johan Tibell johan.tib...@gmail.com writes:
 It's not clear to me that using UTF-16 internally does make
 Data.Text noticeably slower.


Ketil I think that *IF* we are aiming for a single, grand, unified
Ketil text library to Rule Them All, it needs to use UTF-8.
Ketil Alternatively, we can have different libraries with different
Ketil representations for different purposes, where you'll get
Ketil another few percent of juice by switching to the most
Ketil appropriate.

Why not instead allow the programmer to decide at the function level
which internal encoding to use?
-- 
Colin Adams
Preston Lancashire
()  ascii ribbon campaign - against html e-mail
/\  www.asciiribbon.org   - against proprietary attachments
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re: [Haskell-cafe] Re: String vs ByteString

2010-08-17 Thread Yitzchak Gale

Ketil Malde wrote:
 I haven't benchmarked it, but I'm fairly sure that, if you try to fit a
 3Gbyte file (the Human genome, say¹), into a computer with 4Gbytes of
 RAM, UTF-16 will be slower than UTF-8...

I don't think the genome is typical text. And
I doubt that is true if that text is in a CJK language.

 I think that *IF* we are aiming for a single, grand, unified text
 library to Rule Them All, it needs to use UTF-8.

Given the growth rate of China's economy, if CJK isn't
already the majority of text being processed in the world,
it will be soon. I have seen media reports claiming CJK is
now a majority of text data going over the wire on the web,
though I haven't seen anything scientific backing up those claims.
It certainly seems reasonable. I believe Google's measurements
based on their own web index showing wide adoption of UTF-8
are very badly skewed due to a strong Western bias.

In that case, if we have to pick one encoding for Data.Text,
UTF-16 is likely to be a better choice than UTF-8, especially
if the cost is fairly low even for the special case of Western
languages. Also, UTF-16 has become by far the dominant internal
text format for most software and for most user platforms.
Except on desktop Linux - and whether we like it or not, Linux
desktops will remain a tiny minority for the foreseeable future.

 Alternatively, we
 can have different libraries with different representations for
 different purposes, where you'll get another few percent of juice by
 switching to the most appropriate.

 Currently the latter approach looks to be in favor, so if we can't have
 one single library, let us at least aim for a set of libraries with
 consistent interfaces and optimal performance.  Data.Text is great for
 UTF-16, and I'd like to have something similar for UTF-8.  Is all I'm
 trying to say.

I agree.

Thanks,
Yitz
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re: [Haskell-cafe] Re: String vs ByteString

Tom Harper rtomhar...@gmail.com writes:

 2010/8/17 Bulat Ziganshin bulat.zigans...@gmail.com:
 Hello Tom,

 snip

 i don't understand what you mean. are you support all 2^20 codepoints
 in Data.Text package?

 Bulat,

 Yes, its internal representation is UTF-16, which is capable of
 encoding *any* valid Unicode codepoint.

Just like Char is capable of encoding any valid Unicode codepoint.

-- 
Ivan Lazar Miljenovic
ivan.miljeno...@gmail.com
IvanMiljenovic.wordpress.com
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re: [Haskell-cafe] Re: String vs ByteString

2010-08-17 Thread Miguel Mitrofanov




Ivan Lazar Miljenovic wrote:

Tom Harper rtomhar...@gmail.com writes:


2010/8/17 Bulat Ziganshin bulat.zigans...@gmail.com:

Hello Tom,

snip


i don't understand what you mean. are you support all 2^20 codepoints
in Data.Text package?

Bulat,

Yes, its internal representation is UTF-16, which is capable of
encoding *any* valid Unicode codepoint.


Just like Char is capable of encoding any valid Unicode codepoint.



Char is not an encoding, right?
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re: [Haskell-cafe] Re: String vs ByteString

Miguel Mitrofanov miguelim...@yandex.ru writes:

 Ivan Lazar Miljenovic wrote:
 Tom Harper rtomhar...@gmail.com writes:

 2010/8/17 Bulat Ziganshin bulat.zigans...@gmail.com:
 Hello Tom,
 snip

 i don't understand what you mean. are you support all 2^20 codepoints
 in Data.Text package?
 Bulat,

 Yes, its internal representation is UTF-16, which is capable of
 encoding *any* valid Unicode codepoint.

 Just like Char is capable of encoding any valid Unicode codepoint.


 Char is not an encoding, right?

No, but in GHC at least it corresponds to a Unicode codepoint.


-- 
Ivan Lazar Miljenovic
ivan.miljeno...@gmail.com
IvanMiljenovic.wordpress.com
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re: [Haskell-cafe] Re: String vs ByteString

On Tue, Aug 17, 2010 at 1:50 PM, Yitzchak Gale g...@sefer.org wrote:

 Ketil Malde wrote:
  I haven't benchmarked it, but I'm fairly sure that, if you try to fit a
  3Gbyte file (the Human genome, say¹), into a computer with 4Gbytes of
  RAM, UTF-16 will be slower than UTF-8...

 I don't think the genome is typical text. And
 I doubt that is true if that text is in a CJK language.

  I think that *IF* we are aiming for a single, grand, unified text
  library to Rule Them All, it needs to use UTF-8.

 Given the growth rate of China's economy, if CJK isn't
 already the majority of text being processed in the world,
 it will be soon. I have seen media reports claiming CJK is
 now a majority of text data going over the wire on the web,
 though I haven't seen anything scientific backing up those claims.
 It certainly seems reasonable. I believe Google's measurements
 based on their own web index showing wide adoption of UTF-8
 are very badly skewed due to a strong Western bias.

 In that case, if we have to pick one encoding for Data.Text,
 UTF-16 is likely to be a better choice than UTF-8, especially
 if the cost is fairly low even for the special case of Western
 languages. Also, UTF-16 has become by far the dominant internal
 text format for most software and for most user platforms.
 Except on desktop Linux - and whether we like it or not, Linux
 desktops will remain a tiny minority for the foreseeable future.

  I think you are conflating two points here, and ignoring some important
data. Regarding the data: you haven't actually quoted any statistics about
the prevalence of CJK data, but even if the majority of web pages served are
in those three languages, a fairly high percentage of the content will
*still* be ASCII, due simply to the HTML, CSS and Javascript overhead. I'd
hate to make up statistics on the spot, especially when I don't have any
numbers from you to compare them with.

As far as the conflation, there are two questions with regard to the
encoding choice: encoding/decoding time and space usage. I don't think
*anyone* is asserting that UTF-16 is a common encoding for files anywhere,
so by using UTF-16 we are simply incurring an overhead in every case. We
can't consider a CJK encoding for text, so its prevalence is irrelevant to
this topic. What *is* relevant is that a very large percentage of web pages
*are*, in fact, standardizing on UTF-8, and that all 7-bit text files are by
default UTF-8.

As far as space usage, you are correct that CJK data will take up more
memory in UTF-8 than UTF-16. The question still remains whether the overall
document size will be larger: I'd be interested in taking a random sampling
of CJK-encoded pages and comparing their UTF-8 and UTF-16 file sizes. I
think simply talking about this in the vacuum of data is pointless. If
anyone can recommend a CJK website which would be considered representative
(or a few), I'll do the test myself.

Michael
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re: [Haskell-cafe] Re: String vs ByteString

On Tue, Aug 17, 2010 at 12:54, Ivan Lazar Miljenovic 
ivan.miljeno...@gmail.com wrote:

 Tom Harper rtomhar...@gmail.com writes:

  2010/8/17 Bulat Ziganshin bulat.zigans...@gmail.com:
  Hello Tom,
 
  snip
 
  i don't understand what you mean. are you support all 2^20 codepoints
  in Data.Text package?
 
  Bulat,
 
  Yes, its internal representation is UTF-16, which is capable of
  encoding *any* valid Unicode codepoint.

 Just like Char is capable of encoding any valid Unicode codepoint.


Unless a Char in Haskell is 32 bits (or at least more than 16 bits) it con
NOT encode all Unicode points.

-Tako
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re: [Haskell-cafe] Re: String vs ByteString

Ivan Lazar Miljenovic ivan.miljeno...@gmail.com writes:

 Seeing as how the genome just uses 4 base letters,   

Yes, the bulk of the data is not really text at all, but each sequence
(it's fragmented due to the molecular division into chromosomes, and
due to incompleteness) also has a textual header.  Generally, the Fasta
format looks like this:

  sequence-id some arbitrary metadata blah blah
  ACGATATACGCGCATGCGAT...
  ..lines and lines of letters...

(As an aside, although there are only four nucleotides (ACGT), there are
occasional wildcard characters, the most common being N for aNy
nucleotide, but there are defined wildcards for all subsets of the alphabet.)

 wouldn't it be better to not treat it as text but use something else?

I generally use ByteStrings, with the .Char8 interface if/when
appropriate.  This is actually a pretty good choice; even if people use
Unicode in the headers, I don't particularly want to care - as long as
it is transparent.  In some cases, I'd like to, say, search headers for
some specific string - in these cases, a nice, tidy, rich, and optimized
Data.ByteString(.Lazy).UTF8 would be nice.  (But obviously not terribly
essential at the moment, since I haven't bothered to test the available
options.  I guess for my stuff, the (human consumable) text bits are
neither very performance intensive, nor large, so I could probably and
fairly cheaply wrap relevant operations or fields with Data.Text's
{de,en}codeUtf8.  And in practice - partly due to lacking software
support, I'm sure - it's all ASCII anyway. :-) 

It'd be nice to have efficient substring searches and regular
expression, etc for the sequence data, but often this will be better
addressed by more specific algorithms, and in any case, a .Char8
implementation is likely to be more efficient than any gratuitous
Unicode encoding.

 (in case someone is trying to do their mad genetic manipulation by
 hand)?

You'd be surprised what a determined biologist can achive, armed only
with Word, Excel, and a reckless disregard for surmountability.

-k
-- 
If I haven't seen further, it is by standing in the footprints of giants
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re: [Haskell-cafe] Re: String vs ByteString

Hi Ketil,

On Tue, Aug 17, 2010 at 12:09 PM, Ketil Malde ke...@malde.org wrote:

 Johan Tibell johan.tib...@gmail.com writes:

  It's not clear to me that using UTF-16 internally does make Data.Text
  noticeably slower.

 I haven't benchmarked it, but I'm fairly sure that, if you try to fit a
 3Gbyte file (the Human genome, say¹), into a computer with 4Gbytes of
 RAM, UTF-16 will be slower than UTF-8.  Many applications will get away
 with streaming over data, retaining only a small part, but some won't.


I'm not sure if this is a great example as genome data is probably much
better stored in a vector (using a few bits per letter). I agree that
whenever one data structure will fit in the available RAM and another won't
the smaller will win. I just don't know if this case is worth spending weeks
worth of work optimizing for. That's why I'd like to see benchmarks for more
idiomatic use cases.


 In other cases (e.g. processing CJK text, and perhap also
 non-Latin1 text), I'm sure it'll be faster - but my (still
 unsubstantiated) guess is that the difference will be much smaller, and
 it'll be a case of winning some and losing some - and I'd also
 conjecture that having 3Gb real text (i.e. natural language, as
 opposed to text-formatted data) is rare.


I would like to verify this guess. In my personal experience it's really
hard to guess which changes will lead to a noticeable performance
improvement. I'm probably wrong more often than I'm right.

Cheers,
Johan
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re: [Haskell-cafe] Re: String vs ByteString

Tako Schotanus t...@codejive.org writes:

 On Tue, Aug 17, 2010 at 12:54, Ivan Lazar Miljenovic 
 ivan.miljeno...@gmail.com wrote:

 Tom Harper rtomhar...@gmail.com writes:

  2010/8/17 Bulat Ziganshin bulat.zigans...@gmail.com:
  Hello Tom,
 
  snip
 
  i don't understand what you mean. are you support all 2^20 codepoints
  in Data.Text package?
 
  Bulat,
 
  Yes, its internal representation is UTF-16, which is capable of
  encoding *any* valid Unicode codepoint.

 Just like Char is capable of encoding any valid Unicode codepoint.


 Unless a Char in Haskell is 32 bits (or at least more than 16 bits) it con
 NOT encode all Unicode points.

http://www.haskell.org/onlinereport/lexemes.html

-- 
Ivan Lazar Miljenovic
ivan.miljeno...@gmail.com
IvanMiljenovic.wordpress.com
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re: [Haskell-cafe] Re: String vs ByteString

Michael Snoyman mich...@snoyman.com writes:

 I don't think *anyone* is asserting that UTF-16 is a common encoding
 for files anywhere,

*ahem* 
http://en.wikipedia.org/wiki/UTF-16/UCS-2#Use_in_major_operating_systems_and_environments

-- 
Ivan Lazar Miljenovic
ivan.miljeno...@gmail.com
IvanMiljenovic.wordpress.com
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re: [Haskell-cafe] Re: String vs ByteString

On Tue, Aug 17, 2010 at 13:00, Michael Snoyman mich...@snoyman.com wrote:



 On Tue, Aug 17, 2010 at 1:50 PM, Yitzchak Gale g...@sefer.org wrote:

 Ketil Malde wrote:
  I haven't benchmarked it, but I'm fairly sure that, if you try to fit a
  3Gbyte file (the Human genome, say¹), into a computer with 4Gbytes of
  RAM, UTF-16 will be slower than UTF-8...

 I don't think the genome is typical text. And
 I doubt that is true if that text is in a CJK language.



  As far as space usage, you are correct that CJK data will take up more
 memory in UTF-8 than UTF-16. The question still remains whether the overall
 document size will be larger: I'd be interested in taking a random sampling
 of CJK-encoded pages and comparing their UTF-8 and UTF-16 file sizes. I
 think simply talking about this in the vacuum of data is pointless. If
 anyone can recommend a CJK website which would be considered representative
 (or a few), I'll do the test myself.


Regardless of the outcome of that investigation (which in itself is
interesting) I have to agree with Yitzchak that the human genome (or any
other ASCII based data that is not ncessarily a representation of written
human language) is not a good fir for the Text package.

A package like this should IMHO be good at handling human language, as much
of them as possible, and support the common operations as efficiently as
possible: sorting, upper/lowercase (where those exist), find word
boundaries, whatever.

Parsing some kind of file containing the human genome and the like I
think would be much better served by a package focusing on handling large
streams of bytes. No encodings to worry about, no parsing of the stream
determine code points, no calculations determine string lengths. If you need
to convert things to upper/lower case or do sorting you can just fall back
on simple ASCII processing, no need to depend on a package dedicated to
human text processing.

I do think that in-memory processing of Unicode is better served with UTF16
than UTF8 because except en very rare circumstances you can just treat the
text as an array of Char. You can't do that for UTF8 so the efficiency of
the algorithmes would suffer.

I also think that the memory problem is much easier worked around (for
example by dividing the problem in smaller parts) than sub-optimal string
processing because of increased complexity.

-Tako
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re: [Haskell-cafe] Re: String vs ByteString

On Tue, Aug 17, 2010 at 2:20 PM, Ivan Lazar Miljenovic 
ivan.miljeno...@gmail.com wrote:

 Michael Snoyman mich...@snoyman.com writes:

  I don't think *anyone* is asserting that UTF-16 is a common encoding
  for files anywhere,

 *ahem*
 http://en.wikipedia.org/wiki/UTF-16/UCS-2#Use_in_major_operating_systems_and_environments

 I was talking about the contents of the files, not the file names or how
the system calls work. I know at least on Windows, Linux and FreeBSD, if you
open up the default text editor, type in a few letters and hit save, the
file will not be in UTF-16.

Michael
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re: [Haskell-cafe] Re: String vs ByteString

On Tue, Aug 17, 2010 at 13:29, Ketil Malde ke...@malde.org wrote:

 Tako Schotanus t...@codejive.org writes:

  Just like Char is capable of encoding any valid Unicode codepoint.

  Unless a Char in Haskell is 32 bits (or at least more than 16 bits) it
 con
  NOT encode all Unicode points.

 And since it can encode (or rather, represent) any valid Unicode
 codepoint, it follows that it is 32 bits (and at least more than 16
 bits).

 :-)

 (Char is basically a 32bit value, limited valid Unicode code points, so
 it corresponds to UCS-4/UTF-32.)


Yeah, I tried looking it up but I could find the technical definition for
Char, but in the end I found that maxBound was 0x10 making it
basically 24 bits :)

I know for example that Java uses only 16 bits for its Chars and therefore
can NOT give you all Unicode code points with a single Char, with Strings
you can because of the extension points.

-Tako
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re: [Haskell-cafe] Re: String vs ByteString

Michael Snoyman mich...@snoyman.com writes:

 As far as space usage, you are correct that CJK data will take up more
 memory in UTF-8 than UTF-16. 

With the danger of sounding ... alphabetist? as well as belaboring a
point I agree is irrelevant (the storage format):

I'd point out that it seems at least as unfair to optimize for CJK at
the cost of Western languages.  UTF-16 uses two bytes for (most) CJK
ideograms, and (all, I think) characters in Western and other phonetic
scripts.  UTF-8 uses one to two bytes for a lot of Western alphabets,
but three for CJK ideograms.

Now, CJK has about 20K ideograms, which is almost 15 bits per ideogram,
while an ASCII letter is about six bits.  Thus, the information density
of CJK and ASCII is about equal for UTF-8, 5/8 vs 6/8 - compared to
15/16 vs 6/16 for UTF-16.  In other words a given document translated
between Chinese and English should occupy roughly the same space in
UTF-8, but be 2.5 times longer in English for UTF-16.

-k
-- 
If I haven't seen further, it is by standing in the footprints of giants
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re: [Haskell-cafe] Re: String vs ByteString

2010-08-17 Thread Colin Paul Adams

 Ivan == Ivan Lazar Miljenovic ivan.miljeno...@gmail.com writes:

 Char is not an encoding, right?

Ivan No, but in GHC at least it corresponds to a Unicode codepoint.

I don't think this is right, or shouldn't be right, anyway.. Surely it
stands for a character. Unicode codepoints include non-characters such
as the surrogate codepoints used by UTF-16 to map non-BMP codepoints to
pairs of 16-bit codepoints. 

I don't think you ought to be able to see a surrogate codepoint as a Char.
-- 
Colin Adams
Preston Lancashire
()  ascii ribbon campaign - against html e-mail
/\  www.asciiribbon.org   - against proprietary attachments
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re: [Haskell-cafe] Re: String vs ByteString

Yitzchak Gale g...@sefer.org writes:

 I don't think the genome is typical text.

I think the typical *large* collection of text is text-encoded data, and
not, for lack of a better word, literature.  Genomics data is just an
example.

-k
-- 
If I haven't seen further, it is by standing in the footprints of giants
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re: [Haskell-cafe] Re: String vs ByteString

On Tue, Aug 17, 2010 at 13:40, Ketil Malde ke...@malde.org wrote:

 Michael Snoyman mich...@snoyman.com writes:

  As far as space usage, you are correct that CJK data will take up more
  memory in UTF-8 than UTF-16.

 With the danger of sounding ... alphabetist? as well as belaboring a
 point I agree is irrelevant (the storage format):

 I'd point out that it seems at least as unfair to optimize for CJK at
 the cost of Western languages.


Thing is that here you're only talking about size optimizations, for
somebody having to handle a lot of international texts (and I'm not
necessarily talking about Chinese or Japanese here) it would be important
that this is handled in the most efficient way possible, because in the end
storing and retrieving you only do once each while maybe doing a lot of
processing in between. And the on-disk storage or the over-the-wire format
might very well be different than the in-memory format. Each can be selected
for what it's best at.

I'll repeat here that in my opinion a Text package should be good at
handling text, human text, from whatever country. If I need to handle large
streams of ASCII I'll use something else.

:)

Cheers,
 -Tako
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re: [Haskell-cafe] Re: String vs ByteString

2010-08-17 Thread Yitzchak Gale

Michael Snoyman wrote:
 Regarding the data: you haven't actually quoted any
 statistics about the prevalence of CJK data

True, I haven't seen any - except for Google, which
I don't believe is accurate. I would like to see some
good unbiased data.

Right now we just have our intuitions based on anecdotal
evidence and whatever years of experience we have in IT.

For the anecdotal evidence, I really wish that people from
CJK countries were better represented in this discussion.
Unfortunately, Haskell is less prevalent in CJK countries,
and there is somewhat of a language barrier.

 I'd hate to make up statistics on the spot, especially when
 I don't have any numbers from you to compare them with.

I agree, I wish we had better numbers.

 even if the majority of web pages served are
 in those three languages, a fairly high percentage
 of the content will *still* be ASCII, due simply to the HTML,
 CSS and Javascript overhead...
 As far as space usage, you are correct that CJK data will take up more
 memory in UTF-8 than UTF-16. The question still remains whether the overall
 document size will be larger: I'd be interested in taking a random sampling
 of CJK-encoded pages and comparing their UTF-8 and UTF-16 file sizes. I
 think simply talking about this in the vacuum of data is pointless. If
 anyone can recommend a CJK website which would be considered representative
 (or a few), I'll do the test myself.

Again, I agree that some real data would be great.

The problem is, I'm not sure if there is anyone in this discussion
who is qualified to come up with anything even close to a fair
random sampling or a CJK website that is representative.
As far as I can tell, most of us participating in this discussion
have absolutely zero perspective of what computing is like
in CJK countries.

 As far as the conflation, there are two questions
 with regard to the encoding choice: encoding/decoding time
 and space usage.

No, there is a third: using an API that results in robust, readable
and maintainable code even in the face of changing encoding
requirements. Unless you have proof that the difference in
performance between that API and an API with a hard-wired
encoding is the factor that is causing your particular application
to fail to meet its requirements, the hard-wired approach
is guilty of aggravated premature optimization.

So for example, UTF-8 is an important option
to have in a web toolkit. But if that's the only option, that
web toolkit shouldn't be considered a general-purpose one
in my opinion.

 I don't think *anyone* is asserting that
 UTF-16 is a common encoding for files anywhere,
 so by using UTF-16 we are simply incurring an overhead
 in every case.

Well, to start with, all MS Word documents are in UTF-16.
There are a few of those around I think. Most applications -
in some sense of most - store text in UTF-16

Again, without any data, my intuition tells me that
most of the text data stored in the world's files are in
UTF-16. There is currently not much Haskell code
that reads those formats directly, but I think that will
be changing as usage of Haskell in the real world
picks up.

 We can't consider a CJK encoding for text,

Not as a default, certainly not as the only option. But
nice to have as a choice.

 What *is* relevant is that a very large percentage of web pages
 *are*, in fact, standardizing on UTF-8,

In Western countries.

Regards,
Yitz
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re: [Haskell-cafe] Re: String vs ByteString

Colin Paul Adams co...@colina.demon.co.uk writes:

 Char is not an encoding, right?

 Ivan No, but in GHC at least it corresponds to a Unicode codepoint.

 I don't think this is right, or shouldn't be right, anyway.. Surely it
 stands for a character. Unicode codepoints include non-characters such
 as the surrogate codepoints used by UTF-16 to map non-BMP codepoints to
 pairs of 16-bit codepoints. 

  Prelude (toEnum 0xD800) :: Char
  '\55296'

 I don't think you ought to be able to see a surrogate codepoint as a Char.

This is a bit confusing.  From the Unicode glossary:

- Character. (1) The smallest component of written language that has
semantic value; refers to the abstract meaning and/or shape, rather than
a specific shape (see also glyph), though in code tables some form of
visual representation is essential for the reader’s understanding. (2)
Synonym for abstract character. (3) The basic unit of encoding for the
Unicode character encoding. (4) The English name for the ideographic
written elements of Chinese origin. [See  ideograph (2).] 

- Code Point. (1) Any value in the Unicode codespace; that is, the range
of integers from 0 to 1016. (See definition D10 in Section 3.4,
Characters and Encoding.) (2) A value, or position, for a character, in
any coded character set.

From Wikipedia on UTF-16:

Unicode and ISO/IEC 10646 do not, and will never, assign characters to
any of the code points in the U+D800–U+DFFF range, so an individual code
unit from a surrogate pair does not ever represent a character. 

So:

A Char holds a code point, that is, a value from 0 to 0x1016.  Some
of these values do not correspond to Unicode characters.

As far as I can tell, a surrogate pair in UTF-16 is both two (surrogate)
code points of two bytes each, as well as a single code point encoded as
four bytes.  Implementations seem to differ about what the length of
a string containing surrogate pairs is.

-k
-- 
If I haven't seen further, it is by standing in the footprints of giants
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re: [Haskell-cafe] Re: String vs ByteString

2010-08-17 Thread Yitzchak Gale

Ketil Malde wrote:
 I'd point out that it seems at least as unfair to optimize for CJK at
 the cost of Western languages.

Quite true.

 [...speculative calculation from which we conclude that]
 a given document translated
 between Chinese and English should occupy roughly the same space in
 UTF-8, but be 2.5 times longer in English for UTF-16.

Could be. We really need data on that.

If it's practical to maintain different backends with identical public APIs
and different internal encodings, that would be the best. After a
few years of widespread usage, would know a lot more.

Regards,
Yitz
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re: [Haskell-cafe] Re: String vs ByteString

On Tue, Aug 17, 2010 at 1:36 PM, Tako Schotanus t...@codejive.org wrote:

 Yeah, I tried looking it up but I could find the technical definition for
 Char, but in the end I found that maxBound was 0x10 making it
 basically 24 bits :)


I think that's enough to represent all the assigned Unicode code points. I
also think the Unicode consortium (or whatever it is called) made some
statement about the maximum number of bits they'll ever use.

-- Johan
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re: [Haskell-cafe] Re: String vs ByteString

2010-08-17 Thread Felipe Lessa

Hello, Ketil Malde!

On Tue, Aug 17, 2010 at 8:02 AM, Ketil Malde ke...@malde.org wrote:
Ivan Lazar Miljenovic ivan.miljeno...@gmail.com writes:

Seeing as how the genome just uses 4 base letters,

Yes, the bulk of the data is not really text at all, but each sequence
(it's fragmented due to the molecular division into chromosomes, and
due to incompleteness) also has a textual header. Generally, the Fasta
format looks like this:

sequence-id some arbitrary metadata blah blah
ACGATATACGCGCATGCGAT...
..lines and lines of letters...

(As an aside, although there are only four nucleotides (ACGT), there are
occasional wildcard characters, the most common being N for aNy
nucleotide, but there are defined wildcards for all subsets of the alphabet.)

As someone who knows and uses your bio package, I'm almost
certain that Text really isn't the right data type for
representing everything. Certainly *not* for the genomic data
itself. In fact, a representation using 4 bits per base (4
nucleotides plus 12 other characters, such as gaps as aNy) is
easy to represent using ByteStrings with two bases per byte and
should halve the space requirements.

However, the header of each sequence is text, in the sense of
human language text, and ideally should be represented using
Text. In other words, the sequence data type[1] currently is
defined as:

type SeqData = Data.ByteString.Lazy.ByteString
type QualData = Data.ByteString.Lazy.ByteString
data Sequence t = Seq !SeqData !SeqData !(Maybe QualData)

[1]
http://hackage.haskell.org/packages/archive/bio/0.4.6/doc/html/Bio-Sequence-SeqData.html#t:Sequence

where the meaning is that in 'Seq header seqdata qualdata',
'header' would be something like sequence-id some arbitrary
metadata blah blah and 'seqdata' would be ACGATATACGCGCATGCGAT.

But perhaps we should really have:

type SeqData = Data.ByteString.Lazy.ByteString
type QualData = Data.ByteString.Lazy.ByteString
type HeaderData = Data.Text.Text -- strict is prolly a good choice here
data Sequence t = Seq !HeaderData !SeqData !(Maybe QualData)

Semantically, this is the right choice, putting Text where there
is text. We can read everything with ByteStrings and then use[2]

decodeUtf8 :: ByteString - Text

[2]
http://hackage.haskell.org/packages/archive/text/0.7.2.1/doc/html/Data-Text-Encoding.html#v:decodeUtf8

only for the header bits. There is only one problem in this
approach, UTF-8 for the input FASTA file would be hardcoded.
Considering that probably nobody will be using UTF-16 or UTF-32
for the whole FASTA file, there remains only UTF-8 (from which
ASCII is just a special case) and other 8-bits encondings (such
as ISO8859-1, Shift-JIS, etc.). I haven't seen a FASTA file with
characters outside the ASCII range yet, but I guess the choice of
UTF-8 shouldn't be a big problem.

wouldn't it be better to not treat it as text but use something else?

I generally use ByteStrings, with the .Char8 interface if/when
appropriate. This is actually a pretty good choice; even if people use
Unicode in the headers, I don't particularly want to care - as long as
it is transparent. In some cases, I'd like to, say, search headers for
some specific string - in these cases, a nice, tidy, rich, and optimized
Data.ByteString(.Lazy).UTF8 would be nice. (But obviously not terribly
essential at the moment, since I haven't bothered to test the available
options. I guess for my stuff, the (human consumable) text bits are
neither very performance intensive, nor large, so I could probably and
fairly cheaply wrap relevant operations or fields with Data.Text's
{de,en}codeUtf8. And in practice - partly due to lacking software
support, I'm sure - it's all ASCII anyway. :-)

Oh, so I didn't read this paragraph closely enough :). In this
e-mail I'm basically agreeing with your thoughts here =).

And what do you think about creating a real SeqData data type
with two bases per byte? In terms of processing speed I guess
there will be a small penalty, but if you need to have large
quantities of base pairs in memory this would double your
capacity =).

Cheers,

--
Felipe.
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re: [Haskell-cafe] Re: String vs ByteString

2010-08-17 Thread Colin Paul Adams

 Johan == Johan Tibell johan.tib...@gmail.com writes:

Johan On Tue, Aug 17, 2010 at 1:36 PM, Tako Schotanus t...@codejive.org 
wrote:
Johan Yeah, I tried looking it up but I could find the
Johan technical definition for Char, but in the end I found that
Johan maxBound was 0x10 making it basically 24 bits :)


Johan I think that's enough to represent all the assigned Unicode
Johan code points. I also think the Unicode consortium (or whatever
Johan it is called) made some statement about the maximum number of
Johan bits they'll ever use.

Yes. And UTF-16 is only capable of dealing with codepoints up to this limit.
-- 
Colin Adams
Preston Lancashire
()  ascii ribbon campaign - against html e-mail
/\  www.asciiribbon.org   - against proprietary attachments
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re: [Haskell-cafe] Re: String vs ByteString

On Tue, Aug 17, 2010 at 2:23 PM, Yitzchak Gale g...@sefer.org wrote:

 Michael Snoyman wrote:
  Regarding the data: you haven't actually quoted any
  statistics about the prevalence of CJK data

 True, I haven't seen any - except for Google, which
 I don't believe is accurate. I would like to see some
 good unbiased data.


To my knowledge the data we have about prevalence of encoding on the web is
accurate. We crawl all pages we can get our hands on, by starting at some
set of seeds and then following all the links. You cannot be sure that
you've reached all web sites as there might be cliques in the web graph but
we try our best to get them all. You're unlikely to get a better estimate
anywhere else. I doubt few organizations have the machinery required to
crawl most of the web.

-- Johan
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re: [Haskell-cafe] Re: String vs ByteString

On Tue, Aug 17, 2010 at 3:23 PM, Yitzchak Gale g...@sefer.org wrote:

 Michael Snoyman wrote:
  Regarding the data: you haven't actually quoted any
  statistics about the prevalence of CJK data

 True, I haven't seen any - except for Google, which
 I don't believe is accurate. I would like to see some
 good unbiased data.

 Right now we just have our intuitions based on anecdotal
 evidence and whatever years of experience we have in IT.

 For the anecdotal evidence, I really wish that people from
 CJK countries were better represented in this discussion.
 Unfortunately, Haskell is less prevalent in CJK countries,
 and there is somewhat of a language barrier.

  I'd hate to make up statistics on the spot, especially when
  I don't have any numbers from you to compare them with.

 I agree, I wish we had better numbers.

  even if the majority of web pages served are
  in those three languages, a fairly high percentage
  of the content will *still* be ASCII, due simply to the HTML,
  CSS and Javascript overhead...
  As far as space usage, you are correct that CJK data will take up more
  memory in UTF-8 than UTF-16. The question still remains whether the
 overall
  document size will be larger: I'd be interested in taking a random
 sampling
  of CJK-encoded pages and comparing their UTF-8 and UTF-16 file sizes. I
  think simply talking about this in the vacuum of data is pointless. If
  anyone can recommend a CJK website which would be considered
 representative
  (or a few), I'll do the test myself.

 Again, I agree that some real data would be great.

 The problem is, I'm not sure if there is anyone in this discussion
 who is qualified to come up with anything even close to a fair
 random sampling or a CJK website that is representative.
 As far as I can tell, most of us participating in this discussion
 have absolutely zero perspective of what computing is like
 in CJK countries.

 I won't call this a scientific study by any stretch of the imagination, but
I did a quick test on the www.qq.com homepage. The original file encoding
was GB2312; here are the file sizes:

GB2312: 193014
UTF8: 200044
UTF16: 371938


  As far as the conflation, there are two questions
  with regard to the encoding choice: encoding/decoding time
  and space usage.

 No, there is a third: using an API that results in robust, readable
 and maintainable code even in the face of changing encoding
 requirements. Unless you have proof that the difference in
 performance between that API and an API with a hard-wired
 encoding is the factor that is causing your particular application
 to fail to meet its requirements, the hard-wired approach
 is guilty of aggravated premature optimization.

 So for example, UTF-8 is an important option
 to have in a web toolkit. But if that's the only option, that
 web toolkit shouldn't be considered a general-purpose one
 in my opinion.

 I'm not talking about API changes here; the topic at hand is the internal
representation of the stream of characters used by the text package. That is
currently UTF-16; I would argue switching to UTF8.


  I don't think *anyone* is asserting that
  UTF-16 is a common encoding for files anywhere,
  so by using UTF-16 we are simply incurring an overhead
  in every case.

 Well, to start with, all MS Word documents are in UTF-16.
 There are a few of those around I think. Most applications -
 in some sense of most - store text in UTF-16

 Again, without any data, my intuition tells me that
 most of the text data stored in the world's files are in
 UTF-16. There is currently not much Haskell code
 that reads those formats directly, but I think that will
 be changing as usage of Haskell in the real world
 picks up.

 I was referring to text files, not binary files with text embedded within
them. While we might use the text package to deal with the data from a Word
doc once in memory, we would almost certainly need to use ByteString (or
binary perhaps) to actually parse the file. But at the end of the day,
you're right: there would be an encoding penalty at a certain point, just
not on the entire file.

 We can't consider a CJK encoding for text,

 Not as a default, certainly not as the only option. But
 nice to have as a choice.

 I think you're missing the point at hand: I don't think *any* is opposed to
offering encoders/decoders for all the multitude of encoding types out
there. In fact, I believe the text-icu package already supports every
encoding type under discussion. The question is the internal representation
for text, for which a language-specific encoding is *not* a choice, since it
does not support all unicode code points.

Michael
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re: [Haskell-cafe] Re: String vs ByteString

2010-08-17 Thread Daniel Peebles

Sounds to me like we need a lazy Data.Text variation that allows UTF-8 and
UTF-16 segments in it list of strict text elements :) Then big chunks of
western text will be encoded efficiently, and same with CJK! Not sure what
to do about strict Data.Text though :)

On Tue, Aug 17, 2010 at 1:40 PM, Ketil Malde ke...@malde.org wrote:

 Michael Snoyman mich...@snoyman.com writes:

  As far as space usage, you are correct that CJK data will take up more
  memory in UTF-8 than UTF-16.

 With the danger of sounding ... alphabetist? as well as belaboring a
 point I agree is irrelevant (the storage format):

 I'd point out that it seems at least as unfair to optimize for CJK at
 the cost of Western languages.  UTF-16 uses two bytes for (most) CJK
 ideograms, and (all, I think) characters in Western and other phonetic
 scripts.  UTF-8 uses one to two bytes for a lot of Western alphabets,
 but three for CJK ideograms.

 Now, CJK has about 20K ideograms, which is almost 15 bits per ideogram,
 while an ASCII letter is about six bits.  Thus, the information density
 of CJK and ASCII is about equal for UTF-8, 5/8 vs 6/8 - compared to
 15/16 vs 6/16 for UTF-16.  In other words a given document translated
 between Chinese and English should occupy roughly the same space in
 UTF-8, but be 2.5 times longer in English for UTF-16.

 -k
 --
 If I haven't seen further, it is by standing in the footprints of giants
 ___
 Haskell-Cafe mailing list
 Haskell-Cafe@haskell.org
 http://www.haskell.org/mailman/listinfo/haskell-cafe

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re: [Haskell-cafe] Re: String vs ByteString

Felipe Lessa felipe.le...@gmail.com writes:

[-snip- I've already spent too much time on the other stuff :-]

 And what do you think about creating a real SeqData data type
 with two bases per byte?  In terms of processing speed I guess
 there will be a small penalty, but if you need to have large
 quantities of base pairs in memory this would double your
 capacity =).

Yes, this is interesting in some cases.  Obvious downsides would be a
separate data type for protein sequences (20 characters, plus some
wildcards), and more complicated string comparison (when a match is off
by one).  Oh, and lower case is sometimes used to signify less
important regions, like repeats.

Another choice is the 2bit format (used by BLAT, and supported in Bio
for input/output, but not internally), which stores the alphabet proper
directly in 2bit quantities, and uses a separate lists for gaps, lower
case masking, and Ns (and is obviously extensible to wildcards).  Too
much extending, and you're likely to lose any benefit, though.

Basically, it boils down to a set of different tradeoffs, and I think
ByteString is a fairly good choice in *most* cases, and it deals - if
not particularly elegantly, then at least fairly effectively with
various conventions, like lower-casing or wild cards.

-k
-- 
If I haven't seen further, it is by standing in the footprints of giants
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re: [Haskell-cafe] Re: String vs ByteString

2010-08-17 Thread Gábor Lehel

Someone mentioned earlier that IHHO all of this messing around with
encodings and conversions should be handled transparently, and I guess
you could do something like have the internal representation be along
the lines of Either UTF8 UTF16 (or perhaps even more encodings), and
then implement every function in the API equivalently for each
representation (with only the performance characteristics differing),
with input/output functions being specialized for each encoding, and
then only do a conversion when necessary or explicitly requested. But
I assume that would have other problems (like the implicit conversions
causing hard-to-track-down performance bugs when they're triggered
unintentionally).

On Tue, Aug 17, 2010 at 3:21 PM, Daniel Peebles pumpkin...@gmail.com wrote:
 Sounds to me like we need a lazy Data.Text variation that allows UTF-8 and
 UTF-16 segments in it list of strict text elements :) Then big chunks of
 western text will be encoded efficiently, and same with CJK! Not sure what
 to do about strict Data.Text though :)

 On Tue, Aug 17, 2010 at 1:40 PM, Ketil Malde ke...@malde.org wrote:

 Michael Snoyman mich...@snoyman.com writes:

  As far as space usage, you are correct that CJK data will take up more
  memory in UTF-8 than UTF-16.

 With the danger of sounding ... alphabetist? as well as belaboring a
 point I agree is irrelevant (the storage format):

 I'd point out that it seems at least as unfair to optimize for CJK at
 the cost of Western languages.  UTF-16 uses two bytes for (most) CJK
 ideograms, and (all, I think) characters in Western and other phonetic
 scripts.  UTF-8 uses one to two bytes for a lot of Western alphabets,
 but three for CJK ideograms.

 Now, CJK has about 20K ideograms, which is almost 15 bits per ideogram,
 while an ASCII letter is about six bits.  Thus, the information density
 of CJK and ASCII is about equal for UTF-8, 5/8 vs 6/8 - compared to
 15/16 vs 6/16 for UTF-16.  In other words a given document translated
 between Chinese and English should occupy roughly the same space in
 UTF-8, but be 2.5 times longer in English for UTF-16.

 -k
 --
 If I haven't seen further, it is by standing in the footprints of giants
 ___
 Haskell-Cafe mailing list
 Haskell-Cafe@haskell.org
 http://www.haskell.org/mailman/listinfo/haskell-cafe


 ___
 Haskell-Cafe mailing list
 Haskell-Cafe@haskell.org
 http://www.haskell.org/mailman/listinfo/haskell-cafe





-- 
Work is punishment for failing to procrastinate effectively.
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re: [Haskell-cafe] Re: String vs ByteString

2010-08-17 Thread Gábor Lehel

(Actually, this seems more like a job for a type class.)

2010/8/17 Gábor Lehel illiss...@gmail.com:
 Someone mentioned earlier that IHHO all of this messing around with
 encodings and conversions should be handled transparently, and I guess
 you could do something like have the internal representation be along
 the lines of Either UTF8 UTF16 (or perhaps even more encodings), and
 then implement every function in the API equivalently for each
 representation (with only the performance characteristics differing),
 with input/output functions being specialized for each encoding, and
 then only do a conversion when necessary or explicitly requested. But
 I assume that would have other problems (like the implicit conversions
 causing hard-to-track-down performance bugs when they're triggered
 unintentionally).

 On Tue, Aug 17, 2010 at 3:21 PM, Daniel Peebles pumpkin...@gmail.com wrote:
 Sounds to me like we need a lazy Data.Text variation that allows UTF-8 and
 UTF-16 segments in it list of strict text elements :) Then big chunks of
 western text will be encoded efficiently, and same with CJK! Not sure what
 to do about strict Data.Text though :)

 On Tue, Aug 17, 2010 at 1:40 PM, Ketil Malde ke...@malde.org wrote:

 Michael Snoyman mich...@snoyman.com writes:

  As far as space usage, you are correct that CJK data will take up more
  memory in UTF-8 than UTF-16.

 With the danger of sounding ... alphabetist? as well as belaboring a
 point I agree is irrelevant (the storage format):

 I'd point out that it seems at least as unfair to optimize for CJK at
 the cost of Western languages.  UTF-16 uses two bytes for (most) CJK
 ideograms, and (all, I think) characters in Western and other phonetic
 scripts.  UTF-8 uses one to two bytes for a lot of Western alphabets,
 but three for CJK ideograms.

 Now, CJK has about 20K ideograms, which is almost 15 bits per ideogram,
 while an ASCII letter is about six bits.  Thus, the information density
 of CJK and ASCII is about equal for UTF-8, 5/8 vs 6/8 - compared to
 15/16 vs 6/16 for UTF-16.  In other words a given document translated
 between Chinese and English should occupy roughly the same space in
 UTF-8, but be 2.5 times longer in English for UTF-16.

 -k
 --
 If I haven't seen further, it is by standing in the footprints of giants
 ___
 Haskell-Cafe mailing list
 Haskell-Cafe@haskell.org
 http://www.haskell.org/mailman/listinfo/haskell-cafe


 ___
 Haskell-Cafe mailing list
 Haskell-Cafe@haskell.org
 http://www.haskell.org/mailman/listinfo/haskell-cafe





 --
 Work is punishment for failing to procrastinate effectively.




-- 
Work is punishment for failing to procrastinate effectively.
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re: [Haskell-cafe] Re: String vs ByteString

2010-08-17 Thread John Millikin

On Tue, Aug 17, 2010 at 06:12, Michael Snoyman mich...@snoyman.com wrote:
 I'm not talking about API changes here; the topic at hand is the internal
 representation of the stream of characters used by the text package. That is
 currently UTF-16; I would argue switching to UTF8.

The Data.Text.Foreign module is part of the API, and is currently
hardcoded to use UTF-16. Any change of the internal encoding will
require breaking this module's API.

  We can't consider a CJK encoding for text,

 Not as a default, certainly not as the only option. But
 nice to have as a choice.

 I think you're missing the point at hand: I don't think *any* is opposed to
 offering encoders/decoders for all the multitude of encoding types out
 there. In fact, I believe the text-icu package already supports every
 encoding type under discussion. The question is the internal representation
 for text, for which a language-specific encoding is *not* a choice, since it
 does not support all unicode code points.
 Michael

The reason many Japanese and Chinese users reject UTF-8 isn't due to
space constraints (UTF-8 and UTF-16 are roughly equal), it's because
they reject Unicode itself. Shift-JIS and the various Chinese
encodings both contain Han characters which are missing from Unicode,
either due to the Han unification or simply were not considered
important enough to include (yet there's a codepage for Linear-B...).
Ruby, which has an enormous Japanese userbase, solved the problem by
essentially defining Text = (Encoding, ByteString), and then
re-implementing text logic for each encoding. This allows very
efficient operation with every possible encoding, at the cost of
increased complexity (caching decoded characters, multi-byte handling,
etc).
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re: [Haskell-cafe] Re: String vs ByteString

On Tue, Aug 17, 2010 at 6:19 PM, John Millikin jmilli...@gmail.com wrote:

 Ruby, which has an enormous Japanese userbase, solved the problem by
 essentially defining Text = (Encoding, ByteString), and then
 re-implementing text logic for each encoding. This allows very
 efficient operation with every possible encoding, at the cost of
 increased complexity (caching decoded characters, multi-byte handling,
 etc).


This code introduce overhead as each function call needs to dispatch on the
encoding, which is unlikely to be known statically. I don't know if this
matters or not (yet another thing that needs to be measured).

-- Johan
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re: [Haskell-cafe] Re: String vs ByteString

2010-08-17 Thread Donn Cave

Quoth John Millikin jmilli...@gmail.com,

 Ruby, which has an enormous Japanese userbase, solved the problem by
 essentially defining Text = (Encoding, ByteString), and then
 re-implementing text logic for each encoding. This allows very
 efficient operation with every possible encoding, at the cost of
 increased complexity (caching decoded characters, multi-byte handling,
 etc).

Ruby actually comes from the CJK world in a way, doesn't it?

Even if efficient per-encoding manipulation is a tough nut to crack,
it at least avoids the fixed cost of bulk decoding, so an application
designer doesn't need to  think about the pay-off for a correct text
approach vs. `binary'/ASCII, and the language/library designer doesn't
need to think about whether genome data is a representative case etc.

If Haskell had the development resources to make something like this
work, would it actually take the form of a Haskell-level type like
that - data Text = (Encoding, ByteString)?  I mean, I know that's
just a very clear and convenient way to express it for the purposes
of the present discussion, and actual design is a little premature -
... but, I think you could argue that from the Haskell level,
`Text' should be a single type, if the encoding differences aren't
semantically interesting.

Donn Cave, d...@avvanta.com
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re: [Haskell-cafe] Re: String vs ByteString

On Tue, Aug 17, 2010 at 9:30 PM, Donn Cave d...@avvanta.com wrote:

 Quoth John Millikin jmilli...@gmail.com,

  Ruby, which has an enormous Japanese userbase, solved the problem by
  essentially defining Text = (Encoding, ByteString), and then
  re-implementing text logic for each encoding. This allows very
  efficient operation with every possible encoding, at the cost of
  increased complexity (caching decoded characters, multi-byte handling,
  etc).

 Ruby actually comes from the CJK world in a way, doesn't it?

 Even if efficient per-encoding manipulation is a tough nut to crack,
 it at least avoids the fixed cost of bulk decoding, so an application
 designer doesn't need to  think about the pay-off for a correct text
 approach vs. `binary'/ASCII, and the language/library designer doesn't
 need to think about whether genome data is a representative case etc.


Remember that the cost of decoding is O(n) no matter what encoding is used
internally as you always have to validate when going from  ByteString to
Text. If the external and internal encoding don't match then you also have
to copy the bytes into a new buffer, but that is only one allocation (a
pointer increment with a semi-space collector) and the copy is cheap since
the data is in cache.

-- Johan
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re: [Haskell-cafe] Re: String vs ByteString

2010-08-17 Thread anderson leo

Hi michael, here is a web site http://zh.wikipedia.org/zh-cn/. It is the
wikipedia for Chinese.

-Andrew

On Tue, Aug 17, 2010 at 7:00 PM, Michael Snoyman mich...@snoyman.comwrote:



 On Tue, Aug 17, 2010 at 1:50 PM, Yitzchak Gale g...@sefer.org wrote:

 Ketil Malde wrote:
  I haven't benchmarked it, but I'm fairly sure that, if you try to fit a
  3Gbyte file (the Human genome, say¹), into a computer with 4Gbytes of
  RAM, UTF-16 will be slower than UTF-8...

 I don't think the genome is typical text. And
 I doubt that is true if that text is in a CJK language.

  I think that *IF* we are aiming for a single, grand, unified text
  library to Rule Them All, it needs to use UTF-8.

 Given the growth rate of China's economy, if CJK isn't
 already the majority of text being processed in the world,
 it will be soon. I have seen media reports claiming CJK is
 now a majority of text data going over the wire on the web,
 though I haven't seen anything scientific backing up those claims.
 It certainly seems reasonable. I believe Google's measurements
 based on their own web index showing wide adoption of UTF-8
 are very badly skewed due to a strong Western bias.

 In that case, if we have to pick one encoding for Data.Text,
 UTF-16 is likely to be a better choice than UTF-8, especially
 if the cost is fairly low even for the special case of Western
 languages. Also, UTF-16 has become by far the dominant internal
 text format for most software and for most user platforms.
 Except on desktop Linux - and whether we like it or not, Linux
 desktops will remain a tiny minority for the foreseeable future.

  I think you are conflating two points here, and ignoring some important
 data. Regarding the data: you haven't actually quoted any statistics about
 the prevalence of CJK data, but even if the majority of web pages served are
 in those three languages, a fairly high percentage of the content will
 *still* be ASCII, due simply to the HTML, CSS and Javascript overhead. I'd
 hate to make up statistics on the spot, especially when I don't have any
 numbers from you to compare them with.

 As far as the conflation, there are two questions with regard to the
 encoding choice: encoding/decoding time and space usage. I don't think
 *anyone* is asserting that UTF-16 is a common encoding for files anywhere,
 so by using UTF-16 we are simply incurring an overhead in every case. We
 can't consider a CJK encoding for text, so its prevalence is irrelevant to
 this topic. What *is* relevant is that a very large percentage of web pages
 *are*, in fact, standardizing on UTF-8, and that all 7-bit text files are by
 default UTF-8.

 As far as space usage, you are correct that CJK data will take up more
 memory in UTF-8 than UTF-16. The question still remains whether the overall
 document size will be larger: I'd be interested in taking a random sampling
 of CJK-encoded pages and comparing their UTF-8 and UTF-16 file sizes. I
 think simply talking about this in the vacuum of data is pointless. If
 anyone can recommend a CJK website which would be considered representative
 (or a few), I'll do the test myself.

 Michael

 ___
 Haskell-Cafe mailing list
 Haskell-Cafe@haskell.org
 http://www.haskell.org/mailman/listinfo/haskell-cafe


___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re: [Haskell-cafe] Re: String vs ByteString

2010-08-17 Thread John Meacham

On Tue, Aug 17, 2010 at 03:21:32PM +0200, Daniel Peebles wrote:
 Sounds to me like we need a lazy Data.Text variation that allows UTF-8 and
 UTF-16 segments in it list of strict text elements :) Then big chunks of
 western text will be encoded efficiently, and same with CJK! Not sure what
 to do about strict Data.Text though :)

If space is really a concern, there should be a varient that uses LZO or
some other fast compression algorithm that allows concatination as the
back end. 

ranty thing to follow
That said, there is never a reason to use UTF-16, it is a vestigial
remanent from the brief period when it was thought 16 bits would be
enough for the unicode standard, any defense of it nowadays is after the
fact justification for having accidentally standardized on it back in
the day. When people chose to use the 16 bit representation, it was
because they wanted a one-to-one mapping between codepoints and units of
computation, which has many advantages. However, this is no longer true,
if the one-to-one mapping is important then nowadays you use ucs-4,
otherwise, you use utf8. If space is very important then you work with
compressed text. In practice a mix of the two is fairly ideal.

John

-- 
John Meacham - ⑆repetae.net⑆john⑈ - http://notanumber.net/
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re: [Haskell-cafe] Re: String vs ByteString


Bulat Ziganshin wrote:

Johan wrote:

So it's not clear to me that using UTF-16 makes the program
noticeably slower or use more memory on a real program.


it's clear misunderstanding. of course, not every program holds much
text data in memory. but some does, and here you will double memory
usage


I write programs that hold onto quite a good deal of natural language 
text; a few million words at least. Getting efficient Unicode for that 
is a high priority. However, all of that text is in Japanese, Chinese, 
Arabic, Hindi, Urdu,... That's the reason I want Unicode. I'm pretty 
sure UTF-16 isn't going to be causing any special problems here.


For NLP work, any language with a vaguely ASCII format isn't a problem. 
We've been shoving English and western European languages into a subset 
of ASCII for years (heck, we don't even allow real parentheses!).


For the mostly English files on my harddrive, UTF-8 is a clear win. But 
when it comes to programming, I'm not so sure. I'd like to see some good 
benchmarks and a clear explanation of where the costs are. Relying on 
intuitions is notoriously bad for these kinds of encoding issues.


--
Live well,
~wren
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re: [Haskell-cafe] Re: String vs ByteString

2010-08-17 Thread John Millikin

On Tue, Aug 17, 2010 at 12:30, Donn Cave d...@avvanta.com wrote:
 If Haskell had the development resources to make something like this
 work, would it actually take the form of a Haskell-level type like
 that - data Text = (Encoding, ByteString)?  I mean, I know that's
 just a very clear and convenient way to express it for the purposes
 of the present discussion, and actual design is a little premature -
 ... but, I think you could argue that from the Haskell level,
 `Text' should be a single type, if the encoding differences aren't
 semantically interesting.

It should be possible to create a Ruby-style Text in Haskell, using
the existing Text API. The constructor would be something like  data
Text = Text !Encoding !ByteString , but there's no need to export
it. The only significant improvements, performance-wise, would be that
1) encoding text to its internal encoding would be O(1) and 2)
decoding text would only have to perform validation, instead of
validation+copy+stream fusion muck. Downside: lazy decoding makes it
very difficult to reason about failures, since even simple operations
like 'append' might fail if you try to append two texts with
mutually-incompatible characters.

In any case, I suspect getting Haskell itself to support non-Unicode
characters is much more difficult than writing an appropriate Text
type.
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re: [Haskell-cafe] Re: String vs ByteString


Michael Snoyman wrote:

On Tue, Aug 17, 2010 at 2:20 PM, Ivan Lazar Miljenovic 
ivan.miljeno...@gmail.com wrote:


Michael Snoyman mich...@snoyman.com writes:


I don't think *anyone* is asserting that UTF-16 is a common encoding
for files anywhere,

*ahem*
http://en.wikipedia.org/wiki/UTF-16/UCS-2#Use_in_major_operating_systems_and_environments

I was talking about the contents of the files, not the file names or how

the system calls work. I know at least on Windows, Linux and FreeBSD, if you
open up the default text editor, type in a few letters and hit save, the
file will not be in UTF-16.


OSX, TextEdit, plain text mode is UTF-16 and cannot be altered. Also, if 
you load a UTF-8 plain text file in TextEdit it will be garbled because 
it assumes UTF-16. For html files you can choose the encoding, which 
defaults to UTF-8. But for plain text, it's always UTF-16. OSX is also 
fond of UTF-16 in Cocoa...


--
Live well,
~wren
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re: [Haskell-cafe] Re: String vs ByteString


John Millikin wrote:

The reason many Japanese and Chinese users reject UTF-8 isn't due to
space constraints (UTF-8 and UTF-16 are roughly equal), it's because
they reject Unicode itself.


+1.

This is the thing Unicode advocates don't want to admit. Until Unicode 
has code points for _all_ Chinese and Japanese characters, there will be 
active resistance to adoption.


--
Live well,
~wren
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re: [Haskell-cafe] Re: String vs ByteString


Johan Tibell wrote:

To my knowledge the data we have about prevalence of encoding on the web is
accurate. We crawl all pages we can get our hands on, by starting at some
set of seeds and then following all the links. You cannot be sure that
you've reached all web sites as there might be cliques in the web graph but
we try our best to get them all. You're unlikely to get a better estimate
anywhere else. I doubt few organizations have the machinery required to
crawl most of the web.


There was a study recently on this. They found that there are four main 
parts of the Internet:


* a densely connected core, where from any site you can get to any other
* an in cone, from which you can reach the core (but not other in-cone 
members, since then you'd both be in the core)
* an out cone, which can be reached from the core (but which cannot 
reach each other)

* and, unconnected islands

The surprising part is they found that all four parts are approximately 
the same size. I forget the exact numbers, but they're all 25+/-5%.


This implies that an exhaustive crawl of the web would require having 
about 50% of all websites as seeds (the in-cone plus the islands). If 
we're only interested in a representative sample, then we could get by 
with fewer. However, that depends a lot on the definition of 
representative. And we can't have an accurate definition of 
representative without doing the entire crawl at some point in order to 
discover the appropriate distributions. Then again, distributions change 
over time...


Thus, I would guess that Google only has 50~75% of the net: the core, 
the out-cone, and a fraction of the islands and in-cone.


--
Live well,
~wren
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re: [Haskell-cafe] Re: String vs ByteString

On 18 August 2010 12:12, wren ng thornton w...@freegeek.org wrote:
 Johan Tibell wrote:

 To my knowledge the data we have about prevalence of encoding on the web
 is
 accurate. We crawl all pages we can get our hands on, by starting at some
 set of seeds and then following all the links. You cannot be sure that
 you've reached all web sites as there might be cliques in the web graph
 but
 we try our best to get them all. You're unlikely to get a better estimate
 anywhere else. I doubt few organizations have the machinery required to
 crawl most of the web.

 There was a study recently on this. They found that there are four main
 parts of the Internet:

 * a densely connected core, where from any site you can get to any other
 * an in cone, from which you can reach the core (but not other in-cone
 members, since then you'd both be in the core)
 * an out cone, which can be reached from the core (but which cannot reach
 each other)
 * and, unconnected islands

I'm guessing here that you're referring to what I've heard called the
hidden web: databases, etc. that require sign-ins, etc. (as stuff
that isn't in the core, to differing degrees: some of these databases
are indexed by google but you can't actually read them without an
account, etc.) ?

-- 
Ivan Lazar Miljenovic
ivan.miljeno...@gmail.com
IvanMiljenovic.wordpress.com
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re: [Haskell-cafe] Re: String vs ByteString

2010-08-17 Thread Richard O'Keefe


On Aug 17, 2010, at 11:51 PM, Ketil Malde wrote:

 Yitzchak Gale g...@sefer.org writes:
 
 I don't think the genome is typical text.
 
 I think the typical *large* collection of text is text-encoded data, and
 not, for lack of a better word, literature.  Genomics data is just an
 example.

I have a collection of 100,000 patents I'm working with.
5.5GB of XML, most of it (US-)English text.
After stripping out the XML markup, it's 4GB of text.
It's a random sample from some 14 million patents I could
have access to, but 100,000 was more than enough.



___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re: [Haskell-cafe] Re: String vs ByteString


Ivan Lazar Miljenovic wrote:

On 18 August 2010 12:12, wren ng thornton w...@freegeek.org wrote:

Johan Tibell wrote:

To my knowledge the data we have about prevalence of encoding on the web
is
accurate. We crawl all pages we can get our hands on, by starting at some
set of seeds and then following all the links. You cannot be sure that
you've reached all web sites as there might be cliques in the web graph
but
we try our best to get them all. You're unlikely to get a better estimate
anywhere else. I doubt few organizations have the machinery required to
crawl most of the web.

There was a study recently on this. They found that there are four main
parts of the Internet:

* a densely connected core, where from any site you can get to any other
* an in cone, from which you can reach the core (but not other in-cone
members, since then you'd both be in the core)
* an out cone, which can be reached from the core (but which cannot reach
each other)
* and, unconnected islands


I'm guessing here that you're referring to what I've heard called the
hidden web: databases, etc. that require sign-ins, etc. (as stuff
that isn't in the core, to differing degrees: some of these databases
are indexed by google but you can't actually read them without an
account, etc.) ?


Not so far as I recall. I'd have to find a copy of the paper to be sure 
though. Because the metric used was graph connectivity, if those hidden 
pages have links out into non-hidden pages (e.g., the login page), then 
they'd be counted in the same way as the non-hidden pages reachable from 
them.


--
Live well,
~wren
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re: [Haskell-cafe] Re: String vs ByteString

Well, I'm not certain if it counts as a typical Chinese website, but here
are the stats;

UTF8: 64,198
UTF16: 113,160

And just for fun, after gziping:

UTF8: 17,708
UTF16: 19,367

On Wed, Aug 18, 2010 at 2:59 AM, anderson leo fireman...@gmail.com wrote:

 Hi michael, here is a web site http://zh.wikipedia.org/zh-cn/. It is the
 wikipedia for Chinese.

 -Andrew

 On Tue, Aug 17, 2010 at 7:00 PM, Michael Snoyman mich...@snoyman.comwrote:



 On Tue, Aug 17, 2010 at 1:50 PM, Yitzchak Gale g...@sefer.org wrote:

 Ketil Malde wrote:
  I haven't benchmarked it, but I'm fairly sure that, if you try to fit a
  3Gbyte file (the Human genome, say¹), into a computer with 4Gbytes of
  RAM, UTF-16 will be slower than UTF-8...

 I don't think the genome is typical text. And
 I doubt that is true if that text is in a CJK language.

  I think that *IF* we are aiming for a single, grand, unified text
  library to Rule Them All, it needs to use UTF-8.

 Given the growth rate of China's economy, if CJK isn't
 already the majority of text being processed in the world,
 it will be soon. I have seen media reports claiming CJK is
 now a majority of text data going over the wire on the web,
 though I haven't seen anything scientific backing up those claims.
 It certainly seems reasonable. I believe Google's measurements
 based on their own web index showing wide adoption of UTF-8
 are very badly skewed due to a strong Western bias.

 In that case, if we have to pick one encoding for Data.Text,
 UTF-16 is likely to be a better choice than UTF-8, especially
 if the cost is fairly low even for the special case of Western
 languages. Also, UTF-16 has become by far the dominant internal
 text format for most software and for most user platforms.
 Except on desktop Linux - and whether we like it or not, Linux
 desktops will remain a tiny minority for the foreseeable future.

  I think you are conflating two points here, and ignoring some important
 data. Regarding the data: you haven't actually quoted any statistics about
 the prevalence of CJK data, but even if the majority of web pages served are
 in those three languages, a fairly high percentage of the content will
 *still* be ASCII, due simply to the HTML, CSS and Javascript overhead. I'd
 hate to make up statistics on the spot, especially when I don't have any
 numbers from you to compare them with.

 As far as the conflation, there are two questions with regard to the
 encoding choice: encoding/decoding time and space usage. I don't think
 *anyone* is asserting that UTF-16 is a common encoding for files anywhere,
 so by using UTF-16 we are simply incurring an overhead in every case. We
 can't consider a CJK encoding for text, so its prevalence is irrelevant to
 this topic. What *is* relevant is that a very large percentage of web pages
 *are*, in fact, standardizing on UTF-8, and that all 7-bit text files are by
 default UTF-8.

 As far as space usage, you are correct that CJK data will take up more
 memory in UTF-8 than UTF-16. The question still remains whether the overall
 document size will be larger: I'd be interested in taking a random sampling
 of CJK-encoded pages and comparing their UTF-8 and UTF-16 file sizes. I
 think simply talking about this in the vacuum of data is pointless. If
 anyone can recommend a CJK website which would be considered representative
 (or a few), I'll do the test myself.

 Michael

 ___
 Haskell-Cafe mailing list
 Haskell-Cafe@haskell.org
 http://www.haskell.org/mailman/listinfo/haskell-cafe



___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re: [Haskell-cafe] Re: String vs ByteString

2010-08-17 Thread Jinjing Wang

 John Millikin wrote:

 The reason many Japanese and Chinese users reject UTF-8 isn't due to
 space constraints (UTF-8 and UTF-16 are roughly equal), it's because
 they reject Unicode itself.

 +1.

 This is the thing Unicode advocates don't want to admit. Until Unicode has
 code points for _all_ Chinese and Japanese characters, there will be active
 resistance to adoption.

 --
 Live well,
 ~wren

For mainland chinese websites:

Most that became popular during web 1.0 (5-10 years ago) are using
utf-8 incompatible format, e.g. gb2312.

for example:

* www.sina.com.cn
* www.sohu.com

They didn't switch to utf-8 probably just because they never have to.

However, many of the popular websites started during web 2.0 are adopting utf-8

for example:

* renren.com (chinese largest facebook clone)
* www.kaixin001.com (chinese second largest facebook clone)
* t.sina.com.cn (an example of twitter clone)

These websites adopted utf-8 because (I think) most web development
tools have already standardized on utf-8, and there's little reason
change it.

I'm not aware of any (at least common) chinese characters that can be
represented by gb2312 but not in unicode. Since the range of gb2312 is
a subset of the range of gbk, which is a subset of the range of
gb18030. And gb18030 is just another encoding of unicode.

ref:

* http://en.wikipedia.org/wiki/GB_18030

-- 
jinjing
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re: [Haskell-cafe] Re: String vs ByteString

2010-08-16 Thread Daniel Fischer

Hi Bulat,

On Monday 16 August 2010 07:35:44, Bulat Ziganshin wrote:
 Hello Daniel,

 Sunday, August 15, 2010, 10:39:24 PM, you wrote:
  That's great. If that performance difference is a show stopper, one
  shouldn't go higher-level than C anyway :)

 *all* speed measurements that find Haskell is as fast as C, was
 broken.

That's a pretty bold claim, considering that you probably don't know all 
such measurements ;)

But let's get serious. Bryan posted measurements showing the text (HEAD) 
package's performance within a reasonable factor of wc's. (Okay, he didn't 
give a complete description of his test, so we can only assume that all 
participants did the same job. I'm bold enough to assume that.)
Lazy text being 7% slower than wc, strict 30%.

If you are claiming that his test was flawed (and since the numbers clearly 
showed Haskell slower thanC, just not much, I suspect you do, otherwise I 
don't see the point of your post), could you please elaborate why you think 
it's flawed?

 Let's see:

 D:\testingread MsOffice.arc
 MsOffice.arc 317mb -- Done
 Time 0.407021 seconds (timer accuracy 0.00 seconds)
 Speed 779.505632 mbytes/sec

I see nothing here, not knowing what `read' is. None of read (n), read (2), 
read (1p), read(3p) makes sense here, so it must be something else.
Since it outputs a size in bytes, I doubt that it actually counts 
characters, like wc -m and, presumably, the text programmes Bryan 
benchmarked.
Just counting bytes, wc and Data.ByteString[.Lazy] can do much faster than 
counting characters too.

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re: [Haskell-cafe] Re: String vs ByteString

On Sat, Aug 14, 2010 at 10:46 PM, Michael Snoyman mich...@snoyman.comwrote:


 When I'm writing a web app, my code is sitting on a Linux system where the
 default encoding is UTF-8, communicating with a database speaking UTF-8,
 receiving request bodies in UTF-8 and sending response bodies in UTF-8. So
 converting all of that data to UTF-16, just to be converted right back to
 UTF-8, does seem strange for that purpose.


Bear in mind that much of the data you're working with can't be readily
trusted. UTF-8 coming from the filesystem, the network, and often the
database may not be valid. The cost of validating it isn't all that
different from the cost of converting it to UTF-16.

And of course the internals of Data.Text are all fusion-based, so much of
the time you're not going to be allocating UTF-16 arrays at all, but instead
creating a pipeline of characters that are manipulated in a tight loop. This
eliminates a lot of the additional copying that bytestring has to do, for
instance.

To give you an idea of how competitive Data.Text can be compared to C code,
this is the system's wc command counting UTF-8 characters in a modestly
large file:

$ time wc -m huge.txt
32443330
real 0.728s


This is Data.Text performing the same task:

$ time ./FileRead text huge.txt
32443330
real 0.697s
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re: [Haskell-cafe] Re: String vs ByteString

Quoth John Millikin jmilli...@gmail.com,

 I don't see why [Char] is obvious -- you'd never use [Word8] for
 storing binary data, right? [Char] is popular because it's the default
 type for string literals, and due to simple inertia, but when there's
 a type based on packed arrays there's no reason to use the list
 representation.

Well, yes, string literals - and pattern matching support, maybe
that's the same thing.  And I think it's fair to say that [Char]
is a natural, elegant match for the language, I mean it leverages
your basic Haskell skills if for example you want to parse something
fairly simple.  So even if ByteString weren't the monumental hassle
it is today for simple stuff, String would have at least a little appeal.
And if packed arrays really always mattered, [Char] would be long gone.
They don't, you can do a lot of stuff with [Char] before it turns into
a problem.

 Also, despite the name, ByteString and Text are for separate purposes.
 ByteString is an efficient [Word8], Text is an efficient [Char] -- use
 ByteString for binary data, and Text for...text. Most mature languages
 have both types, though the choice of UTF-16 for Text is unusual.

Maybe most mature languages have one or more extra string types
hacked on to support wide characters.  I don't think it's necessarily
a virtue.  ByteString vs. ByteString.Char8, where you can choose
more or less indiscriminately to treat the data as Char or Word8,
seems to me like a more useful way to approach the problem.  (Of
course, ByteString.Char8 isn't a good way to deal with wide characters
correctly, I'm just saying that's where I'd like to find the answer,
not in some internal character encoding into which all text data
must be converted.)

Donn Cave, d...@avvanta.com
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re: [Haskell-cafe] Re: String vs ByteString

On Sat, Aug 14, 2010 at 10:07 PM, Donn Cave d...@avvanta.com wrote:


 Am I confused about this?  It's why I can't see Text ever being

simply the obvious choice.  [Char] will continue to be the obvious
 choice if you want a functional data type that supports pattern
 matching etc.


Actually, with view patterns, Text is pretty nice to pattern match against:

foo (uncons - Just (c,cs)) = whee

despam (prefixed spam - Just suffix) = whee `mappend` suffix

ByteString will continue to be the obvious choice
 for big data loads.


Don't confuse I have big data with I need bytes. If you are working with
bytes, use bytestring. If you are working with text, outside of a few narrow
domains you should use text.

 We'll have a three way choice between programming
 elegance, correctness and efficiency.  If Haskell were more than
 just a research language, this might be its most prominent open
 sore, don't you think?


No, that's just FUD.
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re: [Haskell-cafe] Re: String vs ByteString

2010-08-15 Thread Colin Paul Adams

 Bryan == Bryan O'Sullivan b...@serpentine.com writes:

Bryan On Sat, Aug 14, 2010 at 10:46 PM, Michael Snoyman 
mich...@snoyman.com wrote:
Bryan When I'm writing a web app, my code is sitting on a Linux
Bryan system where the default encoding is UTF-8, communicating
Bryan with a database speaking UTF-8, receiving request bodies in
Bryan UTF-8 and sending response bodies in UTF-8. So converting all
Bryan of that data to UTF-16, just to be converted right back to
Bryan UTF-8, does seem strange for that purpose.


Bryan Bear in mind that much of the data you're working with can't
Bryan be readily trusted. UTF-8 coming from the filesystem, the
Bryan network, and often the database may not be valid. The cost of
Bryan validating it isn't all that different from the cost of
Bryan converting it to UTF-16.

But UTF-16 (apart from being an abomination for creating a hole in the
codepoint space and making it impossible to ever etxend it) is slow to
process compared with UTF-32 - you can't get the nth character in
constant time, so it seems an odd choice to me.
-- 
Colin Adams
Preston Lancashire
()  ascii ribbon campaign - against html e-mail
/\  www.asciiribbon.org   - against proprietary attachments
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re: [Haskell-cafe] Re: String vs ByteString

2010-08-15 Thread Johan Tibell

Hi Colin,

On Sun, Aug 15, 2010 at 9:34 AM, Colin Paul Adams
co...@colina.demon.co.ukwrote:

 But UTF-16 (apart from being an abomination for creating a hole in the
 codepoint space and making it impossible to ever etxend it) is slow to
 process compared with UTF-32 - you can't get the nth character in
 constant time, so it seems an odd choice to me.


Aside: Getting the nth character isn't very useful when working with Unicode
text:

* Most text processing is linear.
* What we consider a character and what Unicode considers a character
differs a bit e.g. since Unicode uses combining characters.

Cheers,
Johan
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re: [Haskell-cafe] Re: String vs ByteString

2010-08-15 Thread Ivan Lazar Miljenovic

Don Stewart d...@galois.com writes:

 * Pay attention to Haskell Cafe announcements
 * Follow the Reddit Haskell news.
 * Read the quarterly reports on Hackage
 * Follow Planet Haskell

And yet there are still many packages that fall under the radar with no
announcements of any kind on initial release or even new versions :(

-- 
Ivan Lazar Miljenovic
ivan.miljeno...@gmail.com
IvanMiljenovic.wordpress.com
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re: [Haskell-cafe] Re: String vs ByteString

2010-08-15 Thread Vo Minh Thu

2010/8/15 Ivan Lazar Miljenovic ivan.miljeno...@gmail.com:
 Don Stewart d...@galois.com writes:

     * Pay attention to Haskell Cafe announcements
     * Follow the Reddit Haskell news.
     * Read the quarterly reports on Hackage
     * Follow Planet Haskell

 And yet there are still many packages that fall under the radar with no
 announcements of any kind on initial release or even new versions :(

If you're interested in a comprehensive update list, you can follow
Hackage on Twitter, or the news feed.

Cheers,
Thu
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re: [Haskell-cafe] Re: String vs ByteString


Don Stewart wrote:

So, to stay up to date, but without drowning in data. Do one of:

* Pay attention to Haskell Cafe announcements
* Follow the Reddit Haskell news.
* Read the quarterly reports on Hackage
* Follow Planet Haskell
  


Interesting. Obviously I look at Haskell Cafe from time to time 
(although there's usually far too much traffic to follow it all). I 
wasn't aware of *any* of the other resources listed.


___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re: [Haskell-cafe] Re: String vs ByteString

2010-08-15 Thread Ivan Lazar Miljenovic

Vo Minh Thu not...@gmail.com writes:

 2010/8/15 Ivan Lazar Miljenovic ivan.miljeno...@gmail.com:
 Don Stewart d...@galois.com writes:

     * Pay attention to Haskell Cafe announcements
     * Follow the Reddit Haskell news.
     * Read the quarterly reports on Hackage
     * Follow Planet Haskell

 And yet there are still many packages that fall under the radar with no
 announcements of any kind on initial release or even new versions :(

 If you're interested in a comprehensive update list, you can follow
 Hackage on Twitter, or the news feed.

Except that that doesn't tell you:

* The purpose of the library
* How a release differs from a previous one
* Why you should use it, etc.

Furthermore, several interesting discussions have arisen out of
announcement emails.

-- 
Ivan Lazar Miljenovic
ivan.miljeno...@gmail.com
IvanMiljenovic.wordpress.com
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re: [Haskell-cafe] Re: String vs ByteString

2010-08-15 Thread Vo Minh Thu

2010/8/15 Ivan Lazar Miljenovic ivan.miljeno...@gmail.com:
 Vo Minh Thu not...@gmail.com writes:

 2010/8/15 Ivan Lazar Miljenovic ivan.miljeno...@gmail.com:
 Don Stewart d...@galois.com writes:

     * Pay attention to Haskell Cafe announcements
     * Follow the Reddit Haskell news.
     * Read the quarterly reports on Hackage
     * Follow Planet Haskell

 And yet there are still many packages that fall under the radar with no
 announcements of any kind on initial release or even new versions :(

 If you're interested in a comprehensive update list, you can follow
 Hackage on Twitter, or the news feed.

 Except that that doesn't tell you:

 * The purpose of the library
 * How a release differs from a previous one
 * Why you should use it, etc.

 Furthermore, several interesting discussions have arisen out of
 announcement emails.

Sure, nor does it write a book chapter about some practical usage. I
mean (tongue in cheek) that the other ressource, nor even some proper
annoucement, provide all that.

I still remember the UHC annoucement (a (nearly) complete Haskell 98
compiler) thread where most of it was about lack of support for n+k
pattern.

But the bullet list above was to point Andrew a few places where he
could have learn about Text.

Cheers,
Thu
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re: [Haskell-cafe] Re: String vs ByteString

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

On 8/15/10 03:01 , Bryan O'Sullivan wrote:
 On Sat, Aug 14, 2010 at 10:07 PM, Donn Cave d...@avvanta.com
 mailto:d...@avvanta.com wrote:
  We'll have a three way choice between programming
 elegance, correctness and efficiency.  If Haskell were more than
 just a research language, this might be its most prominent open
 sore, don't you think?
 
 No, that's just FUD. 

More to the point, there's nothing elegant about [Char] --- its sole
advantage is requiring no thought.

- -- 
brandon s. allbery [linux,solaris,freebsd,perl]  allb...@kf8nh.com
system administrator  [openafs,heimdal,too many hats]  allb...@ece.cmu.edu
electrical and computer engineering, carnegie mellon university  KF8NH
-BEGIN PGP SIGNATURE-
Version: GnuPG v2.0.10 (Darwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAkxoBPgACgkQIn7hlCsL25WbWACgz+MXfwL6ly1Euv1X1HD7Gmg8
fO0Anj1LY6CqDyLjr0s5L2M5Okx8ie+/
=eIIs
-END PGP SIGNATURE-
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re: [Haskell-cafe] Re: String vs ByteString

2010-08-15 Thread Bill Atkins

No, not really.  Linked lists are very easy to deal with recursively and
Strings automatically work with any already-defined list functions.

On Sun, Aug 15, 2010 at 11:17 AM, Brandon S Allbery KF8NH 
allb...@ece.cmu.edu wrote:

 More to the point, there's nothing elegant about [Char] --- its sole
 advantage is requiring no thought.

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re: [Haskell-cafe] Re: String vs ByteString

Quoth Bryan O'Sullivan b...@serpentine.com,
 On Sat, Aug 14, 2010 at 10:07 PM, Donn Cave d...@avvanta.com wrote:
...
 ByteString will continue to be the obvious choice
 for big data loads.

 Don't confuse I have big data with I need bytes. If you are working with
 bytes, use bytestring. If you are working with text, outside of a few narrow
 domains you should use text.

I wonder how many ByteString users are `working with bytes', in the
sense you apparently mean where the bytes are not text characters.
My impression is that in practice, there is a sizeable contingent
out here using ByteString.Char8 and relatively few applications for
the Word8 type.  Some of it should no doubt move to Text, but the
ability to work with native packed data - minimal processing and
space requirements, interoperability with foreign code, mmap, etc. -
is attractive enough that the choice can be less than obvious.

Donn Cave, d...@avvanta.com
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re: [Haskell-cafe] Re: String vs ByteString

Quoth Bill Atkins watk...@alum.rpi.edu,

 No, not really.  Linked lists are very easy to deal with recursively and
 Strings automatically work with any already-defined list functions.

Yes, they're great - a terrible mistake, for a practical programming
language, but if you fail to recognize the attraction, you miss some of
the historical lesson on emphasizing elegance and correctness over
practical performance.

Donn Cave, d...@avvanta.com
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re: [Haskell-cafe] Re: String vs ByteString

2010-08-15 Thread Felipe Lessa

On Sun, Aug 15, 2010 at 12:50 PM, Donn Cave d...@avvanta.com wrote:
 I wonder how many ByteString users are `working with bytes', in the
 sense you apparently mean where the bytes are not text characters.
 My impression is that in practice, there is a sizeable contingent
 out here using ByteString.Char8 and relatively few applications for
 the Word8 type.  Some of it should no doubt move to Text, but the
 ability to work with native packed data - minimal processing and
 space requirements, interoperability with foreign code, mmap, etc. -
 is attractive enough that the choice can be less than obvious.

Using ByteString.Char8 doesn't mean your data isn't a stream of bytes,
it means that it is a stream of bytes but for convenience you prefer
using Char8 functions.  For example, a DNA sequence (AATCGATACATG...)
is a stream of bytes, but it is better to write 'A' than 65.

But yes, many users of ByteStrings should be using Text. =)

Cheers!

-- 
Felipe.
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re: [Haskell-cafe] Re: String vs ByteString

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

On 8/15/10 11:25 , Bill Atkins wrote:
 No, not really.  Linked lists are very easy to deal with recursively and
 Strings automatically work with any already-defined list functions.
 
 On Sun, Aug 15, 2010 at 11:17 AM, Brandon S Allbery KF8NH
 allb...@ece.cmu.edu mailto:allb...@ece.cmu.edu wrote:
 
 More to the point, there's nothing elegant about [Char] --- its sole
 advantage is requiring no thought.

Except that it seems to me that a number of functions in Data.List are
really functions on Strings and not especially useful on generic lists.
There is overlap but it's not as large as might be thought.

- -- 
brandon s. allbery [linux,solaris,freebsd,perl]  allb...@kf8nh.com
system administrator  [openafs,heimdal,too many hats]  allb...@ece.cmu.edu
electrical and computer engineering, carnegie mellon university  KF8NH
-BEGIN PGP SIGNATURE-
Version: GnuPG v2.0.10 (Darwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAkxoFt4ACgkQIn7hlCsL25V+OACfXngN6ZX5L7AL153AkRYDFnqZ
jqsAnA3Lem5LioDVS5bc0ADGzHwWsKFE
=ehkx
-END PGP SIGNATURE-
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re: [Haskell-cafe] Re: String vs ByteString


Donn Cave wrote:

I wonder how many ByteString users are `working with bytes', in the
sense you apparently mean where the bytes are not text characters.
My impression is that in practice, there is a sizeable contingent
out here using ByteString.Char8 and relatively few applications for
the Word8 type.  Some of it should no doubt move to Text, but the
ability to work with native packed data - minimal processing and
space requirements, interoperability with foreign code, mmap, etc. -
is attractive enough that the choice can be less than obvious.
  


I use ByteString for various binary-processing stuff. I also use it for 
string-processing, but that's mainly because I didn't know anything else 
existed. I'm sure lots of other people are using stuff like Data.Binary 
to serialise raw binary data using ByteString too.


___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re: [Haskell-cafe] Re: String vs ByteString


Donn Cave wrote:

Quoth Bill Atkins watk...@alum.rpi.edu,

  

No, not really.  Linked lists are very easy to deal with recursively and
Strings automatically work with any already-defined list functions.



Yes, they're great - a terrible mistake, for a practical programming
language, but if you fail to recognize the attraction, you miss some of
the historical lesson on emphasizing elegance and correctness over
practical performance.
  


And if you fail to recognise what a grave mistake placing performance 
before correctness is, you end up with things like buffer overflow 
exploits, SQL injection attacks, the Y2K bug, programs that can't handle 
files larger than 2GB or that don't understand Unicode, and so forth. 
All things that could have been almost trivially avoided if everybody 
wasn't so hung up on absolute performance at any cost.


Sure, performance is a priority. But it should never be the top 
priority. ;-)


___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re: [Haskell-cafe] Re: String vs ByteString

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

On 8/15/10 13:53 , Andrew Coppin wrote:
 injection attacks, the Y2K bug, programs that can't handle files larger than
 2GB or that don't understand Unicode, and so forth. All things that could
 have been almost trivially avoided if everybody wasn't so hung up on
 absolute performance at any cost.

Now that's a bit unfair; nobody imagined back when lseek() was enshrined in
the Unix API that it would still be in use when a (long) wasn't big enough
:)  (Remember that Unix is itself a practical example of a research platform
avoiding success at any cost gone horribly wrong.)

- -- 
brandon s. allbery [linux,solaris,freebsd,perl]  allb...@kf8nh.com
system administrator  [openafs,heimdal,too many hats]  allb...@ece.cmu.edu
electrical and computer engineering, carnegie mellon university  KF8NH
-BEGIN PGP SIGNATURE-
Version: GnuPG v2.0.10 (Darwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAkxoK2gACgkQIn7hlCsL25VaHgCcCj8T8Qqfx4Co1lXZCH7BApkW
iI8AoNcSabjLso9nXBfujeI+diC8rM78
=FwBb
-END PGP SIGNATURE-
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re: [Haskell-cafe] Re: String vs ByteString

On Sat, Aug 14, 2010 at 6:05 PM, Bryan O'Sullivan b...@serpentine.comwrote:


- If it's not good enough, and the fault lies in a library you chose,
report a bug and provide a test case.

 As a case in point, I took the string search benchmark that Daniel shared
on Friday, and boiled it down to a simple test case: how long does it take
to read a 31MB file?

GNU wc -m:

   - en_US.UTF-8: 0.701s

text 0.7.1.0:

   - lazy text: 1.959s
   - strict text: 3.527s

darcs HEAD:

   - lazy text: 0.749s
   - strict text: 0.927s
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re: [Haskell-cafe] Re: String vs ByteString


Brandon S Allbery KF8NH wrote:

(Remember that Unix is itself a practical example of a research platform
avoiding success at any cost gone horribly wrong.)
  


I haven't used Erlang myself, but I've heard it described in a similar 
way. (I don't know how true that actually is...)


___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re: [Haskell-cafe] Re: String vs ByteString

2010-08-15 Thread Daniel Fischer

On Sunday 15 August 2010 20:04:01, Bryan O'Sullivan wrote:
 On Sat, Aug 14, 2010 at 6:05 PM, Bryan O'Sullivan 
b...@serpentine.comwrote:
 - If it's not good enough, and the fault lies in a library you
  chose, report a bug and provide a test case.
 
 As a case in point, I took the string search benchmark that Daniel shared
 on Friday, and boiled it down to a simple test case: how long does it
 take to read a 31MB file?

 GNU wc -m:

- en_US.UTF-8: 0.701s

 text 0.7.1.0:

- lazy text: 1.959s
- strict text: 3.527s

 darcs HEAD:

- lazy text: 0.749s
- strict text: 0.927s

That's great. If that performance difference is a show stopper, one 
shouldn't go higher-level than C anyway :)
(doesn't mean one should stop thinking about further speed-up, though)

Out of curiosity, what kind of speed-up did your Friday fix bring to the 
searching/replacing functions?

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re: [Haskell-cafe] Re: String vs ByteString

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

On 8/15/10 14:34 , Andrew Coppin wrote:
 Brandon S Allbery KF8NH wrote:
 (Remember that Unix is itself a practical example of a research platform
 avoiding success at any cost gone horribly wrong.)
 
 I haven't used Erlang myself, but I've heard it described in a similar way.
 (I don't know how true that actually is...)

Similar case, actually:  internal research project with internal practical
uses, then got discovered and productized by a different internal group.

- -- 
brandon s. allbery [linux,solaris,freebsd,perl]  allb...@kf8nh.com
system administrator  [openafs,heimdal,too many hats]  allb...@ece.cmu.edu
electrical and computer engineering, carnegie mellon university  KF8NH
-BEGIN PGP SIGNATURE-
Version: GnuPG v2.0.10 (Darwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAkxoNoAACgkQIn7hlCsL25XSAgCgtLKTtT8YN99KsArnhW2kMDvh
oHcAnR1QrfIaq3hmzqU7yF31NZubEMsR
=zpv1
-END PGP SIGNATURE-
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re: [Haskell-cafe] Re: String vs ByteString

On Sun, Aug 15, 2010 at 11:39 AM, Daniel Fischer
daniel.is.fisc...@web.dewrote:

 Out of curiosity, what kind of speed-up did your Friday fix bring to the
 searching/replacing functions?


Quite a bit!

text 0.7.1.0 and 0.7.2.1:

   - 1.056s

darcs HEAD:

   - 0.158s
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re: [Haskell-cafe] Re: String vs ByteString