Re: Generating large datasets for Solr proof-of-concept

Toke Eskildsen Fri, 16 Sep 2011 00:54:17 -0700

On Thu, 2011-09-15 at 22:54 +0200, Pulkit Singhal wrote:
> Has anyone ever had to create large mock/dummy datasets for test
> environments or for POCs/Demos to convince folks that Solr was the
> wave of the future?


Yes, but I did it badly. The problem is that real data are not random, so
any simple random String generator is likely to produce data whose word
distribution has little in common with real-world data.


Zipf's law seems like the way to go:
https://secure.wikimedia.org/wikipedia/en/wiki/Zipf%27s_law

A little searching reveals things like
https://wiki.apache.org/pig/DataGeneratorHadoop
http://diveintodata.org/2009/09/zipf-distribution-generator-in-java/


Unfortunately, most non-techies will be confused by seeing
computer-generated words, so a combination of Zipf to calculate the word
distribution and a dictionary to provide the words themselves might be best.
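
A minimal sketch of that combination, in Python for brevity. It assumes
a word list already sorted from most to least common; the short list
below, the function name make_zipf_sampler and the exponent s=1.0 are
placeholders for illustration, not taken from any of the tools linked
above.

```python
import bisect
import itertools
import random


def make_zipf_sampler(words, s=1.0, seed=None):
    """Return a function that draws words with Zipf-like frequencies.

    The word at rank r (1-based, in list order) gets weight 1 / r**s.
    """
    rng = random.Random(seed)
    weights = [1.0 / (rank ** s) for rank in range(1, len(words) + 1)]
    cumulative = list(itertools.accumulate(weights))
    total = cumulative[-1]

    def sample():
        # Binary search over the cumulative weights selects a rank.
        i = bisect.bisect_left(cumulative, rng.random() * total)
        # min() guards against floating-point rounding at the top end.
        return words[min(i, len(words) - 1)]

    return sample


# Hypothetical stand-in for a real dictionary file, most common word first.
words = ["the", "of", "and", "to", "in", "search", "index", "solr"]
sample = make_zipf_sampler(words, s=1.0, seed=42)

# Build one mock text field of 12 words.
field_value = " ".join(sample() for _ in range(12))
print(field_value)
```

With a real dictionary of a few hundred thousand words, the same sampler
could fill every text field of a generated document; the exponent s
would need tuning against a sample of real data.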

That still leaves confusing computer-generated sentences if one wants
larger text fields in the index, but opting for something that generates
text that looks like real sentences conflicts with a proper distribution
of the words.
