So anyway, today I found myself wanting to build a program that generates some test data. I managed to make it *work* without too much difficulty, but it wasn't exactly "elegant". I'm curious to know how higher-order minds would approach this problem. It's not an especially "hard" problem, and I'm sure there are several good solutions possible.

I have a file that contains several thousand words, separated by white space. [I gather that on Unix there's a standard location for this file?] I want to end up with a file that contains a randomly-chosen selection of words. Actually, I'd like the end result to be a LaTeX file, containing sections, subsections, paragraphs and sentences. (Although obviously the actual sentences will be gibberish.) I'd like to be able to select how big the file should be, to within a few dozen characters anyway. Exact precision is not required.

How would you do this?

The approach I came up with is to slurp up the words like so:

 -- needs: import Data.Array (listArray)
 raw <- readFile "words.txt"      -- slurp the whole word list
 let ws = words raw               -- split on whitespace
 let n  = length ws
 let wa = listArray (1, n) ws     -- array indexed 1..n

(I actually used lazy ByteStrings of characters.) So now I have an array of packed ByteStrings, and I can pick array indexes at random and use "unwords" to build my gibberish "sentences".
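
In sketch form (using plain Strings rather than the ByteStrings I actually used), picking one word looks roughly like this:

 import Data.Array (Array, bounds, (!))
 import System.Random (randomRIO)

 -- Pick one word from the array at a uniformly random index.
 randomWord :: Array Int String -> IO String
 randomWord wa = do
   i <- randomRIO (bounds wa)
   return (wa ! i)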

The fun part is making a sentence come out the right size. There are two obvious possibilities:

- Assume that all words are approximately N characters long, and estimate how many words a sentence therefore needs to contain to have the required length.
- Start with an empty list and *actually count* the characters as you add each word. (You can prepend them rather than append them for extra efficiency.)
I ended up taking the latter approach - at least at the sentence level.
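
At the sentence level, the counting version is in outline something like this (again simplified to Strings):

 import Data.Array (Array, bounds, (!))
 import System.Random (randomRIO)

 -- Keep adding randomly-chosen words until the clause is roughly the
 -- target number of characters long. Words are prepended rather than
 -- appended; the order of gibberish words doesn't matter anyway.
 randomClause :: Array Int String -> Int -> IO [String]
 randomClause wa target = go 0 []
   where
     go size acc
       | size >= target = return acc
       | otherwise      = do
           i <- randomRIO (bounds wa)
           let w = wa ! i
           go (size + length w + 1) (w : acc)   -- +1 for the space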

What I actually did was to write a function that builds lists of words. There is then another function that builds several such sublists of approximately the prescribed lengths, inserts some commas, capitalises the first letter and appends a full stop. This generates a "sentence". After that, there's a function that builds several sentences of random size, with random numbers of commas, and makes a "paragraph" out of them. Next, a function gathers several paragraphs and inserts a randomly-generated subsection heading. A similar function takes several subsections and adds a random section heading.
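
The sentence-assembly step is essentially this (simplified):

 import Data.Char (toUpper)
 import Data.List (intercalate)

 -- Glue a few clauses into a "sentence": join the clauses with commas,
 -- capitalise the first letter and append a full stop.
 makeSentence :: [[String]] -> String
 makeSentence clauses =
   capitalise (intercalate ", " (map unwords clauses)) ++ "."
   where
     capitalise []     = []
     capitalise (c:cs) = toUpper c : cs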

In my current implementation, all of this is in the IO monad (so I can pick things randomly). The whole thing also looks very repetitive; if I thought about it harder, it ought to be possible to factor it into something smaller and more elegant. The clause function builds clauses of "approximately" the right size, and each function above that (sentence, paragraph, subsection, section) becomes progressively less accurate in its sizing. On the other hand, the generated text always has, for example, exactly 4 subsections per section, and they always look visually the same size. I'd like to make it more random, but all the code works by estimating "roughly" how big a particular construct is, and therefore how many of them are required to fill N characters. For larger and larger N, actually counting this stuff would seem somewhat inefficient, so I'm estimating. But that makes it hard to add more randomness without losing overall size control.
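
To give the flavour of the estimating (the 80 below is an invented figure, purely for illustration):

 -- Sizing at each level is just this kind of arithmetic: assume an
 -- average size for the construct one level down and divide.
 avgSentenceLength :: Int
 avgSentenceLength = 80

 sentencesPerParagraph :: Int -> Int
 sentencesPerParagraph target = max 1 (target `div` avgSentenceLength)

It's exactly this fixed division that makes every paragraph and section come out the same shape.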

The final file can end up being quite a different size from what I requested, and it has an annoyingly regular internal structure. But essentially, it works.

It would be nice to modify the code to generate HTML - but since it's coded so simple-mindedly, that would be a copy & paste job.

Clearly, what I *should* have done is think more about a good abstraction before writing miles of code. ;-) So how would you guys do this?
