If we want to test with huge amounts of data, we can feed in portions of the internet. The problem is that it takes a lot of bandwidth and a lot of computing power to get to a "reasonable" size. On the positive side, you are dealing with real text, so it's easier to tune for relevance.
I think it's easier to create a simple XML generator with mock data: prices, popularity ratings, etc. It's fast to generate millions of mock products, and once you have a large set of XML files you can easily index, test, change the config or schema, and reindex (a minimal sketch of such a generator follows the quoted question below). On the other hand, the sample data that ships with the Solr example is a good set too, since it demonstrates the concepts well, especially with the stock Velocity templates. We know Solr will handle enormous sets, but sheer quantity is not always part of a PoC.

> Hello Everyone,
>
> I have a goal of populating Solr with a million unique products in
> order to create a test environment for a proof of concept. I started
> out by using DIH with Amazon RSS feeds, but I quickly realized that
> there's no way I can glean a million products from one RSS feed. And
> I'd go mad if I just sat at my computer all day looking for feeds and
> punching them into the DIH config for Solr.
>
> Has anyone ever had to create large mock/dummy datasets for test
> environments or for POCs/demos to convince folks that Solr was the
> wave of the future? Any tips would be greatly appreciated. I suppose
> it sounds a lot like crawling, even though it started out as innocent
> DIH usage.
>
> - Pulkit
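
For what it's worth, here is a minimal sketch of the kind of generator I mean, written in Python and targeting Solr's XML update format. The field names (id, name, price, popularity, inStock) are assumed to match the stock example schema; adjust them, the vocabulary, and the value ranges to whatever your PoC schema actually looks like.

```python
#!/usr/bin/env python
# Minimal sketch: generate mock product documents in Solr's <add>/<doc> XML
# update format. Field names assume the stock example schema; change as needed.
import random
import xml.etree.ElementTree as ET

ADJECTIVES = ["Portable", "Wireless", "Compact", "Rugged", "Deluxe"]
NOUNS = ["Camera", "Router", "Headset", "Keyboard", "Monitor"]

def make_doc(i):
    """Build one <doc> element with randomized mock values."""
    doc = ET.Element("doc")
    fields = {
        "id": "PROD-%07d" % i,
        "name": "%s %s %d" % (random.choice(ADJECTIVES), random.choice(NOUNS), i),
        "price": "%.2f" % random.uniform(5, 500),
        "popularity": str(random.randint(1, 10)),
        "inStock": random.choice(["true", "false"]),
    }
    for name, value in fields.items():
        field = ET.SubElement(doc, "field", name=name)
        field.text = value
    return doc

def write_batch(path, start, count):
    """Write one <add> file containing `count` docs, ready to post to /update."""
    add = ET.Element("add")
    for i in range(start, start + count):
        add.append(make_doc(i))
    ET.ElementTree(add).write(path, encoding="utf-8", xml_declaration=True)

if __name__ == "__main__":
    # 100 files x 10,000 docs = 1,000,000 mock products.
    for batch in range(100):
        write_batch("products-%03d.xml" % batch, batch * 10000, 10000)
```

Once the files are on disk, you can post each one to the update handler, for example `curl 'http://localhost:8983/solr/update?commit=true' -H 'Content-type: text/xml' --data-binary @products-000.xml` (the exact URL depends on your Solr version and core name), then tweak the schema or config and reindex as often as the PoC requires.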