gerlowskija commented on PR #123:
URL: https://github.com/apache/solr-sandbox/pull/123#issuecomment-3258145316

   Devs who want to test this out can run:
   
   1. Download Raw Wiki Data
       - `(mkdir -p ~/Downloads/wiki && cd ~/Downloads/wiki && wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2 && bunzip2 enwiki-latest-pages-articles.xml.bz2)`
   2. Compile gatling-data-prep
       - `./gradlew :gatling-data-prep:jar`
   3. Convert Raw Wiki Data to Solr Docs
       - `mkdir -p .gatling/batches && java -cp gatling-data-prep/build/libs/gatling-data-prep.jar WikipediaXmlToSolr ~/Downloads/wiki/enwiki-latest-pages-articles.xml .gatling/batches json 5000 1000`
   4. Start a local Solr.  Any Solr can be used: local or remote, Docker or 
bare metal, release or SNAPSHOT, etc.  The benchmark assumes 
"http://localhost:8983/solr" unless told otherwise.
   5. Install wiki configset to Solr
       - `./scripts/gatling/setup_wikipedia_tests.sh`
   6. Run benchmark
       - `./gradlew gatlingRun --simulation index.IndexWikipediaBatchesSimulation`
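
The steps above could be chained in a small wrapper script. This is only a rough sketch, not something in the PR: the `DRY_RUN`, `WIKI_DIR` and `BATCH_DIR` variables and the function names are made up for illustration, and it assumes it is run from the repo root. By default it only prints the commands it would run; the data prep (steps 1-3) is skipped when the batch directory already has files in it.

```shell
#!/bin/sh
# Sketch of a wrapper around the steps above (hypothetical, not part of the PR).
set -eu

WIKI_DIR="${WIKI_DIR:-$HOME/Downloads/wiki}"   # download location from step 1
BATCH_DIR="${BATCH_DIR:-.gatling/batches}"     # output location from step 3
DRY_RUN="${DRY_RUN:-1}"                        # hypothetical switch: 1 = print only
DUMP_URL="https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2"

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

# Succeeds when the batch directory exists and is not empty.
batches_ready() {
  [ -d "$BATCH_DIR" ] && [ -n "$(ls -A "$BATCH_DIR")" ]
}

# Steps 1-3: download, build the converter, convert.
prepare_data() {
  run mkdir -p "$WIKI_DIR" "$BATCH_DIR"
  run wget -P "$WIKI_DIR" "$DUMP_URL"
  run bunzip2 "$WIKI_DIR/enwiki-latest-pages-articles.xml.bz2"
  run ./gradlew :gatling-data-prep:jar
  run java -cp gatling-data-prep/build/libs/gatling-data-prep.jar WikipediaXmlToSolr \
    "$WIKI_DIR/enwiki-latest-pages-articles.xml" "$BATCH_DIR" json 5000 1000
}

main() {
  if batches_ready; then
    echo "Batches found in $BATCH_DIR; skipping steps 1-3"
  else
    prepare_data
  fi
  # Step 4 (a running Solr) is not automated here.
  run ./scripts/gatling/setup_wikipedia_tests.sh                               # step 5
  run ./gradlew gatlingRun --simulation index.IndexWikipediaBatchesSimulation  # step 6
}

main
```

Running it with `DRY_RUN=0` would execute the commands for real instead of printing them.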
    
   
   Steps (1) - (3) prepare the Wikipedia data in a Solr-ready format, so they 
only need to be run once, on initial setup.  That's good, since these steps are 
pretty time-consuming.  In an ideal world we would zip up the converted data 
produced by (3) and have developers just download that.  The Lucene/McCandless 
benchmarks do something similar - they rely on Lucene-ready pre-converted files 
stored in (I think) S3.
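
As a rough idea of what "zip up the converted data" could look like, here is a sketch using `tar`. The function names and the `wiki-batches.tar.gz` archive name are hypothetical, and where the archive would be hosted is left open.

```shell
#!/bin/sh
# Sketch only: bundle and restore the converted batches from step 3.

# pack_batches [PARENT_DIR] [ARCHIVE]
# Archives PARENT_DIR/batches (default .gatling/batches) into ARCHIVE.
pack_batches() {
  tar -czf "${2:-wiki-batches.tar.gz}" -C "${1:-.gatling}" batches
}

# unpack_batches [ARCHIVE] [PARENT_DIR]
# Restores the batches directory under PARENT_DIR (default .gatling).
unpack_batches() {
  mkdir -p "${2:-.gatling}"
  tar -xzf "${1:-wiki-batches.tar.gz}" -C "${2:-.gatling}"
}
```

Someone who has already run step 3 would call `pack_batches` once; everyone else would download the archive and call `unpack_batches` in place of steps (1) - (3).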
   
   And step (5) can probably be folded into the simulation Java code itself - 
it's largely just installing a configset to use in indexing.
   
   So there's still room for simplification here.  But even in the current 
state, the setup is pretty manageable IMO.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

