On Thu, 2011-09-15 at 22:54 +0200, Pulkit Singhal wrote:
> Has anyone ever had to create large mock/dummy datasets for test
> environments or for POCs/Demos to convince folks that Solr was the
> wave of the future?

Yes, but I did it badly. The problem is that real data are not random, so any simple random String generator is likely to produce data where the distribution of words has little in common with real-world data. Zipf's law seems like the way to go:
https://secure.wikimedia.org/wikipedia/en/wiki/Zipf%27s_law

A little searching reveals things like
https://wiki.apache.org/pig/DataGeneratorHadoop
http://diveintodata.org/2009/09/zipf-distribution-generator-in-java/

Unfortunately, most non-techies will be confused by seeing computer-generated words, so a combination of Zipf to calculate the word distribution and a dictionary to provide the words themselves might be best.

That still leaves confusing computer-generated sentences if one wants larger text fields in the index, but opting for something that generates text that looks like real sentences conflicts with keeping a proper distribution of the words.
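
The Zipf-plus-dictionary combination mentioned above could look roughly like the following. This is only a minimal sketch in Python, not the approach used by either of the linked generators: the names zipf_weights, make_sampler and WORDS are made up for illustration, the exponent s=1.0 is an assumption, and the ten-word list is a stand-in for a real dictionary file ordered from most to least frequent word.

```python
import itertools
import random


def zipf_weights(n, s=1.0):
    """Unnormalized Zipf weights: the word at rank k gets weight 1 / k**s."""
    return [1.0 / (rank ** s) for rank in range(1, n + 1)]


def make_sampler(words, s=1.0, seed=None):
    """Return a function that draws words with Zipf-distributed frequencies.

    `words` is assumed to be ordered by rank, most frequent first.
    """
    rng = random.Random(seed)
    # Cumulative weights let random.choices skip recomputing them per call.
    cum_weights = list(itertools.accumulate(zipf_weights(len(words), s)))

    def sample(count):
        return rng.choices(words, cum_weights=cum_weights, k=count)

    return sample


# Hypothetical stand-in for a real dictionary, most frequent word first.
WORDS = ["the", "of", "and", "to", "in",
         "search", "index", "query", "field", "shard"]

sample = make_sampler(WORDS, s=1.0, seed=42)

# One mock text field: real words, Zipf-like frequencies, no real grammar.
print(" ".join(sample(20)))
```

With a real word list the output still reads as word salad, which is the trade-off described above: the frequencies look plausible, the sentences do not.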