Re: Generating large datasets for Solr proof-of-concept
Thanks Hoss. I agree that the way you restated the question is better for
getting results. BTW I think you've tipped me off to exactly what I needed
with this URL: http://bbyopen.com/

Thanks!
- Pulkit

On Fri, Sep 16, 2011 at 4:35 PM, Chris Hostetter wrote:
>
> : Has anyone ever had to create large mock/dummy datasets for test
> : environments or for POCs/Demos to convince folks that Solr was the
> : wave of the future? Any tips would be greatly appreciated. I suppose
> : it sounds a lot like crawling even though it started out as innocent
> : DIH usage.
>
> The better question to ask is where you can find good sample data sets
> for building proof-of-concept implementations.
>
> If you want an example of product data, the Best Buy product catalog is
> available to developers as either an API or a bulk download of XML
> files...
>
> http://bbyopen.com/
>
> ...last time I looked (~1 year ago) there were about 1 million products
> in the data dump.
>
> -Hoss
Re: Generating large datasets for Solr proof-of-concept
: Has anyone ever had to create large mock/dummy datasets for test
: environments or for POCs/Demos to convince folks that Solr was the
: wave of the future? Any tips would be greatly appreciated. I suppose
: it sounds a lot like crawling even though it started out as innocent
: DIH usage.

The better question to ask is where you can find good sample data sets for
building proof-of-concept implementations.

If you want an example of product data, the Best Buy product catalog is
available to developers as either an API or a bulk download of XML files...

http://bbyopen.com/

...last time I looked (~1 year ago) there were about 1 million products in
the data dump.

-Hoss
Re: Generating large datasets for Solr proof-of-concept
On Thu, 2011-09-15 at 22:54 +0200, Pulkit Singhal wrote:
> Has anyone ever had to create large mock/dummy datasets for test
> environments or for POCs/Demos to convince folks that Solr was the
> wave of the future?

Yes, but I did it badly. The problem is that real data are not random, so any
simple random string generator is likely to produce data where the
distribution of words does not have much in common with real-world data.
Zipf's law seems like the way to go:
https://secure.wikimedia.org/wikipedia/en/wiki/Zipf%27s_law

A little searching reveals things like
https://wiki.apache.org/pig/DataGeneratorHadoop
http://diveintodata.org/2009/09/zipf-distribution-generator-in-java/

Unfortunately, most non-techies will be confused by seeing computer-generated
words, so a combination of Zipf's law to calculate the word distribution and a
dictionary to provide the words themselves might be best. That still leaves
confusing computer-generated sentences if one wants larger text fields in the
index, but opting for something that generates text that looks like real
sentences collides with a proper distribution of the words.
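As a rough sketch of that combination (Zipf-weighted sampling over a real word
list), something like the following would do. The dictionary path, skew value,
and document length are illustrative assumptions, not anything from the thread:

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.Random;

/** Samples dictionary words with an approximately Zipf-shaped distribution. */
public class ZipfWordGenerator {
    private final List<String> words;   // word list; rank order decides which words become "common"
    private final double[] cumulative;  // cumulative Zipf weights, weight ~ 1 / rank^skew
    private final Random random = new Random();

    public ZipfWordGenerator(List<String> words, double skew) {
        this.words = words;
        this.cumulative = new double[words.size()];
        double sum = 0;
        for (int rank = 1; rank <= words.size(); rank++) {
            sum += 1.0 / Math.pow(rank, skew);
            cumulative[rank - 1] = sum;
        }
    }

    /** Draws one word; low ranks dominate, giving a few very frequent terms and a long tail. */
    public String nextWord() {
        double target = random.nextDouble() * cumulative[cumulative.length - 1];
        for (int i = 0; i < cumulative.length; i++) {
            if (cumulative[i] >= target) return words.get(i);
        }
        return words.get(words.size() - 1);
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical dictionary file: one word per line, e.g. /usr/share/dict/words.
        // The file is alphabetical, so which words end up "common" is arbitrary;
        // for mock data only the skew of the distribution matters.
        List<String> dict = Files.readAllLines(Paths.get("/usr/share/dict/words"));
        ZipfWordGenerator gen = new ZipfWordGenerator(dict, 1.0);
        StringBuilder doc = new StringBuilder();
        for (int i = 0; i < 200; i++) doc.append(gen.nextWord()).append(' ');
        System.out.println(doc);
    }
}

Text drawn this way keeps a realistic frequency skew, which is what matters
when testing term statistics, faceting, and relevance tuning against large
generated indexes.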
Re: Generating large datasets for Solr proof-of-concept
http://aws.amazon.com/datasets

DBPedia might be the easiest to work with: http://aws.amazon.com/datasets/2319

Amazon has a lot of these things. Infochimps.com is a marketplace for free &
pay versions.

Lance

On Thu, Sep 15, 2011 at 6:55 PM, Pulkit Singhal wrote:
> Ah, missing }. Doh!
>
> BTW I still welcome any ideas on how to build an e-commerce test base.
> It doesn't have to be Amazon, that was just my approach. Anyone?
>
> - Pulkit
>
> On Thu, Sep 15, 2011 at 8:52 PM, Pulkit Singhal wrote:
> > Thanks for all the feedback thus far. Now to get a little technical
> > about it :)
> >
> > I was thinking of feeding a file with all the tags of Amazon that
> > yield close to roughly 5 results each into a file and then running
> > my RSS DIH off of that. I came up with the following config, but
> > something is amiss. Can someone please point out what is off about
> > this?
> >
> > <entity name="amazonFeeds"
> >         processor="LineEntityProcessor"
> >         url="file:///xxx/yyy/zzz/amazonfeeds.txt"
> >         rootEntity="false"
> >         dataSource="myURIreader1"
> >         transformer="RegexTransformer,DateFormatTransformer"
> >         >
> >   <entity pk="link"
> >           url="${amazonFeeds.rawLine"
> >           processor="XPathEntityProcessor"
> >           forEach="/rss/channel | /rss/channel/item"
> >           transformer="RegexTransformer,HTMLStripTransformer,DateFormatTransformer,script:skipRow">
> > ...
> >
> > The rawLine should feed into the url key but instead I get:
> >
> > Caused by: java.net.MalformedURLException: no protocol:
> > null${amazonFeeds.rawLine
> >         at org.apache.solr.handler.dataimport.URLDataSource.getData(URLDataSource.java:90)
> >
> > Sep 15, 2011 8:48:01 PM org.apache.solr.update.DirectUpdateHandler2 rollback
> > INFO: start rollback
> >
> > Sep 15, 2011 8:48:01 PM org.apache.solr.handler.dataimport.SolrWriter rollback
> > SEVERE: Exception while solr rollback.
> >
> > Thanks in advance!
> >
> > On Thu, Sep 15, 2011 at 4:12 PM, Markus Jelsma wrote:
> >> If we want to test with huge amounts of data we feed portions of the
> >> internet. The problem is it takes a lot of bandwidth and lots of
> >> computing power to get to a `reasonable` size. On the positive side,
> >> you deal with real text, so it's easier to tune for relevance.
> >>
> >> I think it's easier to create a simple XML generator with mock data,
> >> prices, popularity rates, etc. It's fast to generate millions of mock
> >> products, and once you have a large quantity of XML files, you can
> >> easily index, test, change the config or schema, and reindex.
> >>
> >> On the other hand, the sample data that comes with the Solr example is
> >> a good set as well, as it proves the concepts well, especially with
> >> the stock Velocity templates.
> >>
> >> We know Solr will handle enormous sets, but quantity is not always a
> >> part of a PoC.
> >>
> >>> Hello Everyone,
> >>>
> >>> I have a goal of populating Solr with a million unique products in
> >>> order to create a test environment for a proof of concept. I started
> >>> out by using DIH with Amazon RSS feeds but I've quickly realized that
> >>> there's no way I can glean a million products from one RSS feed. And
> >>> I'd go mad if I just sat at my computer all day looking for feeds and
> >>> punching them into DIH config for Solr.
> >>>
> >>> Has anyone ever had to create large mock/dummy datasets for test
> >>> environments or for POCs/Demos to convince folks that Solr was the
> >>> wave of the future? Any tips would be greatly appreciated. I suppose
> >>> it sounds a lot like crawling even though it started out as innocent
> >>> DIH usage.
> >>>
> >>> - Pulkit

--
Lance Norskog
goks...@gmail.com
Re: Generating large datasets for Solr proof-of-concept
Thanks for all the feedback thus far. Now to get a little technical about it :)

I was thinking of feeding a file with all the tags of Amazon that yield close
to roughly 5 results each into a file and then running my RSS DIH off of that.
I came up with the following config, but something is amiss. Can someone
please point out what is off about this?

<entity name="amazonFeeds"
        processor="LineEntityProcessor"
        url="file:///xxx/yyy/zzz/amazonfeeds.txt"
        rootEntity="false"
        dataSource="myURIreader1"
        transformer="RegexTransformer,DateFormatTransformer"
        >
  <entity pk="link"
          url="${amazonFeeds.rawLine"
          processor="XPathEntityProcessor"
          forEach="/rss/channel | /rss/channel/item"
          transformer="RegexTransformer,HTMLStripTransformer,DateFormatTransformer,script:skipRow">
...

The rawLine should feed into the url key but instead I get:

Caused by: java.net.MalformedURLException: no protocol: null${amazonFeeds.rawLine
        at org.apache.solr.handler.dataimport.URLDataSource.getData(URLDataSource.java:90)

Sep 15, 2011 8:48:01 PM org.apache.solr.update.DirectUpdateHandler2 rollback
INFO: start rollback

Sep 15, 2011 8:48:01 PM org.apache.solr.handler.dataimport.SolrWriter rollback
SEVERE: Exception while solr rollback.

Thanks in advance!

On Thu, Sep 15, 2011 at 4:12 PM, Markus Jelsma wrote:
> If we want to test with huge amounts of data we feed portions of the
> internet. The problem is it takes a lot of bandwidth and lots of computing
> power to get to a `reasonable` size. On the positive side, you deal with
> real text, so it's easier to tune for relevance.
>
> I think it's easier to create a simple XML generator with mock data, prices,
> popularity rates, etc. It's fast to generate millions of mock products, and
> once you have a large quantity of XML files, you can easily index, test,
> change the config or schema, and reindex.
>
> On the other hand, the sample data that comes with the Solr example is a
> good set as well, as it proves the concepts well, especially with the stock
> Velocity templates.
>
> We know Solr will handle enormous sets, but quantity is not always a part of
> a PoC.
>
>> Hello Everyone,
>>
>> I have a goal of populating Solr with a million unique products in
>> order to create a test environment for a proof of concept. I started
>> out by using DIH with Amazon RSS feeds but I've quickly realized that
>> there's no way I can glean a million products from one RSS feed. And
>> I'd go mad if I just sat at my computer all day looking for feeds and
>> punching them into DIH config for Solr.
>>
>> Has anyone ever had to create large mock/dummy datasets for test
>> environments or for POCs/Demos to convince folks that Solr was the
>> wave of the future? Any tips would be greatly appreciated. I suppose
>> it sounds a lot like crawling even though it started out as innocent
>> DIH usage.
>>
>> - Pulkit
>
Re: Generating large datasets for Solr proof-of-concept
Ah, missing }. Doh!

BTW I still welcome any ideas on how to build an e-commerce test base. It
doesn't have to be Amazon, that was just my approach. Anyone?

- Pulkit

On Thu, Sep 15, 2011 at 8:52 PM, Pulkit Singhal wrote:
> Thanks for all the feedback thus far. Now to get a little technical about it :)
>
> I was thinking of feeding a file with all the tags of Amazon that yield
> close to roughly 5 results each into a file and then running my RSS DIH off
> of that. I came up with the following config, but something is amiss. Can
> someone please point out what is off about this?
>
> <entity name="amazonFeeds"
>         processor="LineEntityProcessor"
>         url="file:///xxx/yyy/zzz/amazonfeeds.txt"
>         rootEntity="false"
>         dataSource="myURIreader1"
>         transformer="RegexTransformer,DateFormatTransformer"
>         >
>   <entity pk="link"
>           url="${amazonFeeds.rawLine"
>           processor="XPathEntityProcessor"
>           forEach="/rss/channel | /rss/channel/item"
>           transformer="RegexTransformer,HTMLStripTransformer,DateFormatTransformer,script:skipRow">
> ...
>
> The rawLine should feed into the url key but instead I get:
>
> Caused by: java.net.MalformedURLException: no protocol: null${amazonFeeds.rawLine
>         at org.apache.solr.handler.dataimport.URLDataSource.getData(URLDataSource.java:90)
>
> Sep 15, 2011 8:48:01 PM org.apache.solr.update.DirectUpdateHandler2 rollback
> INFO: start rollback
>
> Sep 15, 2011 8:48:01 PM org.apache.solr.handler.dataimport.SolrWriter rollback
> SEVERE: Exception while solr rollback.
>
> Thanks in advance!
>
> On Thu, Sep 15, 2011 at 4:12 PM, Markus Jelsma wrote:
>> If we want to test with huge amounts of data we feed portions of the
>> internet. The problem is it takes a lot of bandwidth and lots of computing
>> power to get to a `reasonable` size. On the positive side, you deal with
>> real text, so it's easier to tune for relevance.
>>
>> I think it's easier to create a simple XML generator with mock data,
>> prices, popularity rates, etc. It's fast to generate millions of mock
>> products, and once you have a large quantity of XML files, you can easily
>> index, test, change the config or schema, and reindex.
>>
>> On the other hand, the sample data that comes with the Solr example is a
>> good set as well, as it proves the concepts well, especially with the
>> stock Velocity templates.
>>
>> We know Solr will handle enormous sets, but quantity is not always a part
>> of a PoC.
>>
>>> Hello Everyone,
>>>
>>> I have a goal of populating Solr with a million unique products in
>>> order to create a test environment for a proof of concept. I started
>>> out by using DIH with Amazon RSS feeds but I've quickly realized that
>>> there's no way I can glean a million products from one RSS feed. And
>>> I'd go mad if I just sat at my computer all day looking for feeds and
>>> punching them into DIH config for Solr.
>>>
>>> Has anyone ever had to create large mock/dummy datasets for test
>>> environments or for POCs/Demos to convince folks that Solr was the
>>> wave of the future? Any tips would be greatly appreciated. I suppose
>>> it sounds a lot like crawling even though it started out as innocent
>>> DIH usage.
>>>
>>> - Pulkit
>>
>
Re: Generating large datasets for Solr proof-of-concept
If we want to test with huge amounts of data we feed portions of the internet.
The problem is it takes a lot of bandwidth and lots of computing power to get
to a `reasonable` size. On the positive side, you deal with real text, so it's
easier to tune for relevance.

I think it's easier to create a simple XML generator with mock data, prices,
popularity rates, etc. It's fast to generate millions of mock products, and
once you have a large quantity of XML files, you can easily index, test,
change the config or schema, and reindex.

On the other hand, the sample data that comes with the Solr example is a good
set as well, as it proves the concepts well, especially with the stock
Velocity templates.

We know Solr will handle enormous sets, but quantity is not always a part of a
PoC.

> Hello Everyone,
>
> I have a goal of populating Solr with a million unique products in
> order to create a test environment for a proof of concept. I started
> out by using DIH with Amazon RSS feeds but I've quickly realized that
> there's no way I can glean a million products from one RSS feed. And
> I'd go mad if I just sat at my computer all day looking for feeds and
> punching them into DIH config for Solr.
>
> Has anyone ever had to create large mock/dummy datasets for test
> environments or for POCs/Demos to convince folks that Solr was the
> wave of the future? Any tips would be greatly appreciated. I suppose
> it sounds a lot like crawling even though it started out as innocent
> DIH usage.
>
> - Pulkit
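As a rough sketch of that mock-XML-generator idea: the field names below follow
the stock Solr example schema (id, name, price, popularity), and the product
vocabulary, document counts, and file naming are made-up placeholders.

import java.io.PrintWriter;
import java.util.Random;

/** Writes Solr <add> XML files full of mock products for load and PoC testing. */
public class MockProductGenerator {
    private static final String[] ADJECTIVES = {"Portable", "Wireless", "Compact", "Deluxe", "Refurbished"};
    private static final String[] NOUNS = {"Camera", "Laptop", "Headphones", "Router", "Monitor"};

    public static void main(String[] args) throws Exception {
        int docsPerFile = 100000;
        int files = 10;                      // 10 x 100k = 1M products
        Random rnd = new Random(42);         // fixed seed so runs are reproducible

        for (int f = 0; f < files; f++) {
            try (PrintWriter out = new PrintWriter("products-" + f + ".xml", "UTF-8")) {
                out.println("<add>");
                for (int i = 0; i < docsPerFile; i++) {
                    long id = (long) f * docsPerFile + i;
                    String name = ADJECTIVES[rnd.nextInt(ADJECTIVES.length)] + " "
                                + NOUNS[rnd.nextInt(NOUNS.length)] + " " + id;
                    double price = 5 + rnd.nextInt(99500) / 100.0;   // 5.00 .. 999.99
                    int popularity = rnd.nextInt(11);                // 0 .. 10
                    // These generated values contain no XML-special characters;
                    // real text fields would need escaping.
                    out.println("  <doc>");
                    out.println("    <field name=\"id\">" + id + "</field>");
                    out.println("    <field name=\"name\">" + name + "</field>");
                    out.println("    <field name=\"price\">" + price + "</field>");
                    out.println("    <field name=\"popularity\">" + popularity + "</field>");
                    out.println("  </doc>");
                }
                out.println("</add>");
            }
        }
    }
}

Files produced this way can be indexed with the usual tools (for instance the
post.jar shipped in the Solr example's exampledocs directory) and regenerated
with different sizes or value distributions whenever the schema changes.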
Re: Generating large datasets for Solr proof-of-concept
I've done it using SolrJ and a *lot* of parallel processes feeding dummy data
into the server.

On Thu, Sep 15, 2011 at 4:54 PM, Pulkit Singhal wrote:
> Hello Everyone,
>
> I have a goal of populating Solr with a million unique products in
> order to create a test environment for a proof of concept. I started
> out by using DIH with Amazon RSS feeds but I've quickly realized that
> there's no way I can glean a million products from one RSS feed. And
> I'd go mad if I just sat at my computer all day looking for feeds and
> punching them into DIH config for Solr.
>
> Has anyone ever had to create large mock/dummy datasets for test
> environments or for POCs/Demos to convince folks that Solr was the
> wave of the future? Any tips would be greatly appreciated. I suppose
> it sounds a lot like crawling even though it started out as innocent
> DIH usage.
>
> - Pulkit
>
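A minimal sketch of that SolrJ-plus-parallel-feeders approach, assuming a
recent SolrJ release (HttpSolrClient; in the Solr 1.4/3.x era of this thread
the analogous classes were CommonsHttpSolrServer and StreamingUpdateSolrServer).
The URL, field names, thread count, and batch sizes are placeholders:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

/** Several threads batching dummy documents into the same Solr core. */
public class ParallelDummyFeeder {
    public static void main(String[] args) throws Exception {
        final int threads = 8;
        final int docsPerThread = 125000;    // 8 x 125k = 1M documents
        final int batchSize = 1000;

        try (HttpSolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/collection1").build()) {
            ExecutorService pool = Executors.newFixedThreadPool(threads);
            for (int t = 0; t < threads; t++) {
                final int offset = t * docsPerThread;   // keeps ids unique across threads
                pool.submit(() -> {
                    List<SolrInputDocument> batch = new ArrayList<>(batchSize);
                    for (int i = 0; i < docsPerThread; i++) {
                        SolrInputDocument doc = new SolrInputDocument();
                        doc.addField("id", offset + i);
                        doc.addField("name", "Dummy product " + (offset + i));
                        doc.addField("price", (offset + i) % 1000 + 0.99);
                        batch.add(doc);
                        if (batch.size() == batchSize) {
                            try { solr.add(batch); } catch (Exception e) { e.printStackTrace(); }
                            batch.clear();
                        }
                    }
                    if (!batch.isEmpty()) {
                        try { solr.add(batch); } catch (Exception e) { e.printStackTrace(); }
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.HOURS);
            solr.commit();                   // one commit at the end keeps indexing fast
        }
    }
}

Batching and committing once at the end keeps throughput high; if preferred,
ConcurrentUpdateSolrClient (the successor of StreamingUpdateSolrServer) can
replace the explicit thread pool, since it queues and sends updates on its own
worker threads.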