Re: Generating large datasets for Solr proof-of-concept

2011-09-17 Thread Pulkit Singhal
Thanks Hoss. I agree that the way you restated the question is better
for getting results. BTW I think you've tipped me off to exactly what
I needed with this URL: http://bbyopen.com/

Thanks!
- Pulkit

On Fri, Sep 16, 2011 at 4:35 PM, Chris Hostetter
hossman_luc...@fucit.org wrote:

 : Has anyone ever had to create large mock/dummy datasets for test
 : environments or for POCs/Demos to convince folks that Solr was the
 : wave of the future? Any tips would be greatly appreciated. I suppose
 : it sounds a lot like crawling even though it started out as innocent
 : DIH usage.

 The better question to ask is where you can find good sample data sets for
 building proof-of-concept implementations.

 If you want an example of product data, the Best Buy product catalog is
 available to developers via either an API or a bulk download of XML
 files...

        http://bbyopen.com/

 ...last time I looked (~1 year ago) there were about 1 million products in
 the data dump.


 -Hoss



Re: Generating large datasets for Solr proof-of-concept

2011-09-16 Thread Toke Eskildsen
On Thu, 2011-09-15 at 22:54 +0200, Pulkit Singhal wrote:
 Has anyone ever had to create large mock/dummy datasets for test
 environments or for POCs/Demos to convince folks that Solr was the
 wave of the future?

Yes, but I did it badly. The problem is that real data are not random, so
any simple random string generator is likely to produce data where the
distribution of words does not have much in common with real-world data.


Zipf's law seems like the way to go:
https://secure.wikimedia.org/wikipedia/en/wiki/Zipf%27s_law

A little searching reveals things like
https://wiki.apache.org/pig/DataGeneratorHadoop
http://diveintodata.org/2009/09/zipf-distribution-generator-in-java/


Unfortunately, most non-techies will be confused by seeing computer-generated
words, so a combination of Zipf to calculate the word distribution and a
dictionary to provide the words themselves might be best.
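
As a rough illustration of that combination, here is a minimal Java sketch that
reads a plain word list and hands back words with a Zipf-like rank-frequency
weighting (weight of the k-th word proportional to 1/k^s). The dictionary path
and the exponent s = 1.0 are placeholders; a frequency-sorted word list would
be more realistic than an alphabetical one.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.Random;

/**
 * Sketch: pick words from a plain-text dictionary (one word per line) with a
 * Zipf-like rank-frequency distribution.
 */
public class ZipfWordPicker {

    private final List<String> words;   // dictionary words, ideally most common first
    private final double[] cumulative;  // cumulative probability by rank
    private final Random rnd = new Random();

    public ZipfWordPicker(String dictionaryPath, double s) throws IOException {
        words = Files.readAllLines(Paths.get(dictionaryPath));
        double[] weights = new double[words.size()];
        double total = 0;
        for (int rank = 1; rank <= words.size(); rank++) {
            weights[rank - 1] = 1.0 / Math.pow(rank, s);   // Zipf weight for this rank
            total += weights[rank - 1];
        }
        cumulative = new double[words.size()];
        double running = 0;
        for (int i = 0; i < weights.length; i++) {
            running += weights[i] / total;
            cumulative[i] = running;
        }
    }

    /** Draw one word; low-ranked (common) words come up far more often. */
    public String next() {
        double r = rnd.nextDouble();
        for (int i = 0; i < cumulative.length; i++) {
            if (r <= cumulative[i]) {
                return words.get(i);
            }
        }
        return words.get(words.size() - 1);
    }

    public static void main(String[] args) throws IOException {
        ZipfWordPicker picker = new ZipfWordPicker("/usr/share/dict/words", 1.0);
        StringBuilder field = new StringBuilder();
        for (int i = 0; i < 200; i++) {
            field.append(picker.next()).append(' ');
        }
        System.out.println(field);   // a 200-word mock text field
    }
}

The linear scan in next() is fine for a sketch; a binary search over the
cumulative array is the obvious improvement when generating millions of fields.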

That still leaves confusing computer-generated sentences if one wants to
have larger text fields in the index, but opting for something that
generates text that looks like real sentences collides with keeping a
proper distribution of the words.



Re: Generating large datasets for Solr proof-of-concept

2011-09-16 Thread Chris Hostetter

: Has anyone ever had to create large mock/dummy datasets for test
: environments or for POCs/Demos to convince folks that Solr was the
: wave of the future? Any tips would be greatly appreciated. I suppose
: it sounds a lot like crawling even though it started out as innocent
: DIH usage.

The better question to ask is where you can find good sample data sets for
building proof-of-concept implementations.

If you want an example of product data, the Best Buy product catalog is
available to developers via either an API or a bulk download of XML
files...

http://bbyopen.com/

...last time I looked (~1 year ago) there were about 1 million products in
the data dump.


-Hoss


Generating large datasets for Solr proof-of-concept

2011-09-15 Thread Pulkit Singhal
Hello Everyone,

I have a goal of populating Solr with a million unique products in
order to create a test environment for a proof of concept. I started
out by using DIH with Amazon RSS feeds but I've quickly realized that
there's no way I can glean a million products from one RSS feed. And
I'd go mad if I just sat at my computer all day looking for feeds and
punching them into DIH config for Solr.

Has anyone ever had to create large mock/dummy datasets for test
environments or for POCs/Demos to convince folks that Solr was the
wave of the future? Any tips would be greatly appreciated. I suppose
it sounds a lot like crawling even though it started out as innocent
DIH usage.

- Pulkit


Re: Generating large datasets for Solr proof-of-concept

2011-09-15 Thread Daniel Skiles
I've done it using SolrJ and a *lot* of parallel processes feeding dummy
data into the server.
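
For reference, a minimal sketch of that approach using the SolrJ API of that
era: StreamingUpdateSolrServer queues added documents and flushes them from
several background threads (it was later renamed ConcurrentUpdateSolrServer /
ConcurrentUpdateSolrClient). The field names assume the stock Solr example
schema; the URL, queue size, and thread count are placeholders.

import java.util.Random;

import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;
import org.apache.solr.common.SolrInputDocument;

/** Sketch: pump a million dummy product documents into Solr via SolrJ. */
public class DummyProductFeeder {

    public static void main(String[] args) throws Exception {
        // buffer up to 1000 docs, flushed by 4 background sender threads
        StreamingUpdateSolrServer solr =
                new StreamingUpdateSolrServer("http://localhost:8983/solr", 1000, 4);

        Random rnd = new Random(42);
        for (int i = 0; i < 1000000; i++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "prod-" + i);          // uniqueKey in the example schema
            doc.addField("name", "Dummy product " + i);
            doc.addField("price", 1 + rnd.nextInt(500) + 0.99f);
            doc.addField("popularity", rnd.nextInt(10));
            solr.add(doc);                            // queued and sent asynchronously
        }
        solr.blockUntilFinished();   // wait for the background senders to drain the queue
        solr.commit();
    }
}

The queue size and thread count are what provide the parallelism, without
having to coordinate separate feeder processes by hand.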

On Thu, Sep 15, 2011 at 4:54 PM, Pulkit Singhal pulkitsing...@gmail.com wrote:

 Hello Everyone,

 I have a goal of populating Solr with a million unique products in
 order to create a test environment for a proof of concept. I started
 out by using DIH with Amazon RSS feeds but I've quickly realized that
 there's no way I can glean a million products from one RSS feed. And
 I'd go mad if I just sat at my computer all day looking for feeds and
 punching them into DIH config for Solr.

 Has anyone ever had to create large mock/dummy datasets for test
 environments or for POCs/Demos to convince folks that Solr was the
 wave of the future? Any tips would be greatly appreciated. I suppose
 it sounds a lot like crawling even though it started out as innocent
 DIH usage.

 - Pulkit



Re: Generating large datasets for Solr proof-of-concept

2011-09-15 Thread Markus Jelsma
If we want to test with huge amounts of data, we feed it portions of the internet.
The problem is that it takes a lot of bandwidth and lots of computing power to get
to a `reasonable` size. On the positive side, you deal with real text, so it's
easier to tune for relevance.

I think it's easier to create a simple XML generator with mock data: prices,
popularity ratings, etc. It's fast to generate millions of mock products, and once
you have a large quantity of XML files you can easily index, test, change the
config or schema, and reindex.
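
A minimal sketch of such a generator: it writes batches of mock products in
Solr's update-XML format (<add><doc><field .../></doc></add>). The field names
follow the stock example schema; file names, batch sizes, and value ranges are
arbitrary placeholders.

import java.io.IOException;
import java.io.PrintWriter;
import java.util.Random;

/** Sketch: write mock products as Solr update XML, one file per batch. */
public class MockCatalogWriter {

    public static void main(String[] args) throws IOException {
        Random rnd = new Random(1);
        int docsPerFile = 10000;
        for (int file = 0; file < 100; file++) {   // 100 files x 10k docs = 1M products
            try (PrintWriter out = new PrintWriter("products-" + file + ".xml", "UTF-8")) {
                out.println("<add>");
                for (int i = 0; i < docsPerFile; i++) {
                    int id = file * docsPerFile + i;
                    out.println("  <doc>");
                    out.println("    <field name=\"id\">prod-" + id + "</field>");
                    out.println("    <field name=\"name\">Mock product " + id + "</field>");
                    out.println("    <field name=\"price\">" + (1 + rnd.nextInt(500)) + ".99</field>");
                    out.println("    <field name=\"popularity\">" + rnd.nextInt(10) + "</field>");
                    out.println("  </doc>");
                }
                out.println("</add>");
            }
        }
    }
}

The resulting files can then be posted and re-posted (for example with the
post.jar tool in the Solr example's exampledocs directory) every time the
config or schema changes.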

On the other hand, the sample data that comes with the Solr example is a good
set as well, as it demonstrates the concepts nicely, especially with the stock
Velocity templates.

We know Solr will handle enormous sets but quantity is not always a part of a 
PoC.

 Hello Everyone,
 
 I have a goal of populating Solr with a million unique products in
 order to create a test environment for a proof of concept. I started
 out by using DIH with Amazon RSS feeds but I've quickly realized that
 there's no way I can glean a million products from one RSS feed. And
 I'd go mad if I just sat at my computer all day looking for feeds and
 punching them into DIH config for Solr.
 
 Has anyone ever had to create large mock/dummy datasets for test
 environments or for POCs/Demos to convince folks that Solr was the
 wave of the future? Any tips would be greatly appreciated. I suppose
 it sounds a lot like crawling even though it started out as innocent
 DIH usage.
 
 - Pulkit


Re: Generating large datasets for Solr proof-of-concept

2011-09-15 Thread Pulkit Singhal
Ah, a missing } in ${amazonFeeds.rawLine}. Doh!

BTW I still welcome any ideas on how to build an e-commerce test base.
It doesn't have to be Amazon, that was just my approach. Anyone?

- Pulkit

On Thu, Sep 15, 2011 at 8:52 PM, Pulkit Singhal pulkitsing...@gmail.com wrote:
 Thanks for all the feedback thus far. Now to get a little technical about it :)

 I was thinking of collecting into a file all the Amazon tags that yield
 roughly 5 results each, and then running my RSS DIH off of that. I came
 up with the following config, but something is amiss. Can someone please
 point out what is off about it?

    <document>
        <entity name="amazonFeeds"
                processor="LineEntityProcessor"
                url="file:///xxx/yyy/zzz/amazonfeeds.txt"
                rootEntity="false"
                dataSource="myURIreader1"
                transformer="RegexTransformer,DateFormatTransformer"
                >
            <entity name="feed"
                    pk="link"
                    url="${amazonFeeds.rawLine"
                    processor="XPathEntityProcessor"
                    forEach="/rss/channel | /rss/channel/item"
                    transformer="RegexTransformer,HTMLStripTransformer,DateFormatTransformer,script:skipRow"
 ...

 The rawLine should feed into the url attribute, but instead I get:

 Caused by: java.net.MalformedURLException: no protocol:
 null${amazonFeeds.rawLine
        at 
 org.apache.solr.handler.dataimport.URLDataSource.getData(URLDataSource.java:90)

 Sep 15, 2011 8:48:01 PM org.apache.solr.update.DirectUpdateHandler2 rollback
 INFO: start rollback

 Sep 15, 2011 8:48:01 PM org.apache.solr.handler.dataimport.SolrWriter rollback
 SEVERE: Exception while solr rollback.

 Thanks in advance!

 On Thu, Sep 15, 2011 at 4:12 PM, Markus Jelsma
 markus.jel...@openindex.io wrote:
 If we want to test with huge amounts of data we feed portions of the 
 internet.
 The problem is it takes a lot of bandwidth and lots of computing power to get
 to a `reasonable` size. On the positive side, you deal with real text so it's
 easier to tune for relevance.

 I think it's easier to create a simple XML generator with mock data, prices,
 popularity rates etc. It's fast to generate millions of mock products and 
 once
 you have a large quantity of XML files, you can easily index, test, change
 config or schema and reindex.

 On the other hand, the sample data that comes with the Solr example is a good
 set as well as it proves the concepts well, especially with the stock 
 Velocity
 templates.

 We know Solr will handle enormous sets but quantity is not always a part of a
 PoC.

 Hello Everyone,

 I have a goal of populating Solr with a million unique products in
 order to create a test environment for a proof of concept. I started
 out by using DIH with Amazon RSS feeds but I've quickly realized that
 there's no way I can glean a million products from one RSS feed. And
 I'd go mad if I just sat at my computer all day looking for feeds and
 punching them into DIH config for Solr.

 Has anyone ever had to create large mock/dummy datasets for test
 environments or for POCs/Demos to convince folks that Solr was the
 wave of the future? Any tips would be greatly appreciated. I suppose
 it sounds a lot like crawling even though it started out as innocent
 DIH usage.

 - Pulkit




Re: Generating large datasets for Solr proof-of-concept

2011-09-15 Thread Pulkit Singhal
Thanks for all the feedback thus far. Now to get a little technical about it :)

I was thinking of collecting into a file all the Amazon tags that yield
roughly 5 results each, and then running my RSS DIH off of that. I came
up with the following config, but something is amiss. Can someone please
point out what is off about it?

<document>
    <entity name="amazonFeeds"
            processor="LineEntityProcessor"
            url="file:///xxx/yyy/zzz/amazonfeeds.txt"
            rootEntity="false"
            dataSource="myURIreader1"
            transformer="RegexTransformer,DateFormatTransformer"
            >
        <entity name="feed"
                pk="link"
                url="${amazonFeeds.rawLine"
                processor="XPathEntityProcessor"
                forEach="/rss/channel | /rss/channel/item"
                transformer="RegexTransformer,HTMLStripTransformer,DateFormatTransformer,script:skipRow"
...

The rawLine should feed into the url attribute, but instead I get:

Caused by: java.net.MalformedURLException: no protocol:
null${amazonFeeds.rawLine
at 
org.apache.solr.handler.dataimport.URLDataSource.getData(URLDataSource.java:90)

Sep 15, 2011 8:48:01 PM org.apache.solr.update.DirectUpdateHandler2 rollback
INFO: start rollback

Sep 15, 2011 8:48:01 PM org.apache.solr.handler.dataimport.SolrWriter rollback
SEVERE: Exception while solr rollback.

Thanks in advance!

On Thu, Sep 15, 2011 at 4:12 PM, Markus Jelsma
markus.jel...@openindex.io wrote:
 If we want to test with huge amounts of data we feed portions of the internet.
 The problem is it takes a lot of bandwidth and lots of computing power to get
 to a `reasonable` size. On the positive side, you deal with real text so it's
 easier to tune for relevance.

 I think it's easier to create a simple XML generator with mock data, prices,
 popularity rates etc. It's fast to generate millions of mock products and once
 you have a large quantity of XML files, you can easily index, test, change
 config or schema and reindex.

 On the other hand, the sample data that comes with the Solr example is a good
 set as well as it proves the concepts well, especially with the stock Velocity
 templates.

 We know Solr will handle enormous sets but quantity is not always a part of a
 PoC.

 Hello Everyone,

 I have a goal of populating Solr with a million unique products in
 order to create a test environment for a proof of concept. I started
 out by using DIH with Amazon RSS feeds but I've quickly realized that
 there's no way I can glean a million products from one RSS feed. And
 I'd go mad if I just sat at my computer all day looking for feeds and
 punching them into DIH config for Solr.

 Has anyone ever had to create large mock/dummy datasets for test
 environments or for POCs/Demos to convince folks that Solr was the
 wave of the future? Any tips would be greatly appreciated. I suppose
 it sounds a lot like crawling even though it started out as innocent
 DIH usage.

 - Pulkit



Re: Generating large datasets for Solr proof-of-concept

2011-09-15 Thread Lance Norskog
http://aws.amazon.com/datasets

DBPedia might be the easiest to work with:
http://aws.amazon.com/datasets/2319

Amazon has a lot of these things.
Infochimps.com is a marketplace for free & pay versions.


Lance

On Thu, Sep 15, 2011 at 6:55 PM, Pulkit Singhal pulkitsing...@gmail.com wrote:

 Ah, a missing } in ${amazonFeeds.rawLine}. Doh!

 BTW I still welcome any ideas on how to build an e-commerce test base.
 It doesn't have to be Amazon, that was just my approach. Anyone?

 - Pulkit

 On Thu, Sep 15, 2011 at 8:52 PM, Pulkit Singhal pulkitsing...@gmail.com
 wrote:
  Thanks for all the feedback thus far. Now to get a little technical about
 it :)
 
  I was thinking of collecting into a file all the Amazon tags that yield
  roughly 5 results each, and then running my RSS DIH off of that. I came
  up with the following config, but something is amiss. Can someone please
  point out what is off about it?
 
  <document>
      <entity name="amazonFeeds"
              processor="LineEntityProcessor"
              url="file:///xxx/yyy/zzz/amazonfeeds.txt"
              rootEntity="false"
              dataSource="myURIreader1"
              transformer="RegexTransformer,DateFormatTransformer"
              >
          <entity name="feed"
                  pk="link"
                  url="${amazonFeeds.rawLine"
                  processor="XPathEntityProcessor"
                  forEach="/rss/channel | /rss/channel/item"
                  transformer="RegexTransformer,HTMLStripTransformer,DateFormatTransformer,script:skipRow"
   ...
 
  The rawLine should feed into the url attribute, but instead I get:
 
  Caused by: java.net.MalformedURLException: no protocol:
  null${amazonFeeds.rawLine
 at
 org.apache.solr.handler.dataimport.URLDataSource.getData(URLDataSource.java:90)
 
  Sep 15, 2011 8:48:01 PM org.apache.solr.update.DirectUpdateHandler2
 rollback
  INFO: start rollback
 
  Sep 15, 2011 8:48:01 PM org.apache.solr.handler.dataimport.SolrWriter
 rollback
  SEVERE: Exception while solr rollback.
 
  Thanks in advance!
 
  On Thu, Sep 15, 2011 at 4:12 PM, Markus Jelsma
  markus.jel...@openindex.io wrote:
  If we want to test with huge amounts of data we feed portions of the
 internet.
  The problem is it takes a lot of bandwidth and lots of computing power to
 get
  to a `reasonable` size. On the positive side, you deal with real text so
 it's
  easier to tune for relevance.
 
  I think it's easier to create a simple XML generator with mock data,
 prices,
  popularity rates etc. It's fast to generate millions of mock products
 and once
  you have a large quantity of XML files, you can easily index, test,
 change
  config or schema and reindex.
 
  On the other hand, the sample data that comes with the Solr example is a
 good
  set as well as it proves the concepts well, especially with the stock
 Velocity
  templates.
 
  We know Solr will handle enormous sets but quantity is not always a part
 of a
  PoC.
 
  Hello Everyone,
 
  I have a goal of populating Solr with a million unique products in
  order to create a test environment for a proof of concept. I started
  out by using DIH with Amazon RSS feeds but I've quickly realized that
  there's no way I can glean a million products from one RSS feed. And
  I'd go mad if I just sat at my computer all day looking for feeds and
  punching them into DIH config for Solr.
 
  Has anyone ever had to create large mock/dummy datasets for test
  environments or for POCs/Demos to convince folks that Solr was the
  wave of the future? Any tips would be greatly appreciated. I suppose
  it sounds a lot like crawling even though it started out as innocent
  DIH usage.
 
  - Pulkit
 
 




-- 
Lance Norskog
goks...@gmail.com