So anyway, today I found myself wanting to build a program that generates some test data. I managed to make it *work* without too much difficulty, but it wasn't exactly "elegant". I'm curious to know how higher-order minds would approach this problem. It's not an especially "hard" problem, and I'm sure there are several good solutions possible.

I have a file that contains several thousand words, separated by white space. [I gather that on Unix there's a standard location for this file?] I want to end up with a file that contains a randomly-chosen selection of words. Actually, I'd like the end result to be a LaTeX file, containing sections, subsections, paragraphs and sentences. (Although obviously the actual sentences will be gibberish.) I'd like to be able to select how big the file should be, to within a few dozen characters anyway. Exact precision is not required.

How would you do this?

The approach I came up with is to slurp up the words like so:

 -- needs: import Data.Array (listArray)
 raw <- readFile "words.txt"      -- slurp the whole word list
 let ws = words raw               -- split on whitespace
 let n  = length ws
 let wa = listArray (1, n) ws     -- array indexed 1..n

(I actually used lazy ByteStrings of characters.) So now I have an array of packed ByteStrings, and I can pick array indexes at random and use "unwords" to build my gibberish "sentences".
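
In sketch form (using plain Strings rather than the ByteStrings I actually used), picking one word looks roughly like this:

 import Data.Array (Array, bounds, (!))
 import System.Random (randomRIO)

 -- Pick one word from the array at a uniformly random index.
 randomWord :: Array Int String -> IO String
 randomWord wa = do
   i <- randomRIO (bounds wa)
   return (wa ! i)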

The fun part is making a sentence come out the right size. There are two obvious possibilities:

- Assume that all words are approximately N characters long, and estimate how many words a sentence therefore needs to contain to have the required length.
- Start with an empty list and *actually count* the characters as you add each word. (You can prepend them rather than append them for extra efficiency.)
I ended up taking the latter approach - at least at the sentence level.
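
At the sentence level, the counting version is in outline something like this (again simplified to Strings):

 import Data.Array (Array, bounds, (!))
 import System.Random (randomRIO)

 -- Keep adding randomly-chosen words until the clause is roughly the
 -- target number of characters long. Words are prepended rather than
 -- appended; the order of gibberish words doesn't matter anyway.
 randomClause :: Array Int String -> Int -> IO [String]
 randomClause wa target = go 0 []
   where
     go size acc
       | size >= target = return acc
       | otherwise      = do
           i <- randomRIO (bounds wa)
           let w = wa ! i
           go (size + length w + 1) (w : acc)   -- +1 for the space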

What I actually did was to write a function that builds lists of words. There is then another function that builds several such sublists of approximately the prescribed lengths, inserts some commas, capitalises the first letter and appends a full stop. This generates a "sentence". After that, there's a function that builds several sentences of random size, with random numbers of commas, and makes a "paragraph" out of them. Next, a function gathers several paragraphs and inserts a randomly-generated subsection heading. A similar function takes several subsections and adds a random section heading.
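
The sentence-assembly step is essentially this (simplified):

 import Data.Char (toUpper)
 import Data.List (intercalate)

 -- Glue a few clauses into a "sentence": join the clauses with commas,
 -- capitalise the first letter and append a full stop.
 makeSentence :: [[String]] -> String
 makeSentence clauses =
   capitalise (intercalate ", " (map unwords clauses)) ++ "."
   where
     capitalise []     = []
     capitalise (c:cs) = toUpper c : cs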

In my current implementation, all of this is in the IO monad (so I can pick things randomly). The whole thing also looks very repetitive; if I thought about it harder, it ought to be possible to factor it into something smaller and more elegant. The clause function builds clauses of "approximately" the right size, and each function above that (sentence, paragraph, subsection, section) becomes progressively less accurate in its sizing. On the other hand, the generated text always has, for example, exactly 4 subsections per section, and they always look visually the same size. I'd like to make it more random, but all the code works by estimating "roughly" how big a particular construct is, and therefore how many of them are required to fill N characters. For larger and larger N, actually counting this stuff would seem somewhat inefficient, so I'm estimating. But that makes it hard to add more randomness without losing overall size control.
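
To give the flavour of the estimating (the 80 below is an invented figure, purely for illustration):

 -- Sizing at each level is just this kind of arithmetic: assume an
 -- average size for the construct one level down and divide.
 avgSentenceLength :: Int
 avgSentenceLength = 80

 sentencesPerParagraph :: Int -> Int
 sentencesPerParagraph target = max 1 (target `div` avgSentenceLength)

It's exactly this fixed division that makes every paragraph and section come out the same shape.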

The final file can end up being quite a different size from what I requested, and it has an annoyingly regular internal structure. But essentially, it works.

It would be nice to modify the code to generate HTML - but since it's coded so simple-mindedly, that would be a copy & paste job.

Clearly, what I *should* have done is think more about a good abstraction before writing miles of code. ;-) So how would you guys do this?
