If you're moving toward mocking up a production system, I'd move away from schemaless mode. It allows the number of fields to explode if you get bad or unexpected data, and it is prone to hard-to-fix errors where it misidentifies numbers and strings. In particular, if a string field happens to first appear with only numeric values, you can be in trouble. "Schemaless" is a misleading name; "schema guessing" is more accurate, since there is in fact a schema, and it will subsequently reject strings sent to a field it initially recognized as a number.
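As a rough illustration of declaring fields up front instead of letting Solr guess, here is a minimal Python sketch. It assumes a local Solr at `localhost:8983`, a hypothetical collection named `mydocs`, and made-up field names. The field types are ones I'd expect in the default configset, and the `add-field` and `set-user-property` commands are as I understand the Schema API and Config API, so check them against the reference guide for your version.

```python
import json
import urllib.request

# Hypothetical collection URL; adjust host, port, and collection name.
SOLR_URL = "http://localhost:8983/solr/mydocs"

# Hypothetical fields. "part_number" is declared as a string on purpose:
# its first values might look numeric, which is exactly where guessing fails.
FIELDS = [
    {"name": "title", "type": "text_general", "stored": True},
    {"name": "part_number", "type": "string", "stored": True},
    {"name": "page_count", "type": "pint", "stored": True},
]


def build_schema_commands(fields):
    """Build a Schema API request body that adds each field explicitly."""
    return {"add-field": list(fields)}


def post_json(url, payload):
    """POST a JSON payload and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def apply_schema():
    """Turn off automatic field creation, then add the explicit fields."""
    post_json(
        f"{SOLR_URL}/config",
        {"set-user-property": {"update.autoCreateFields": "false"}},
    )
    return post_json(f"{SOLR_URL}/schema", build_schema_commands(FIELDS))
```

Calling `apply_schema()` against a running Solr would apply both changes; nothing is sent until you call it.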
Also, it sounds like you might be using the extracting request handler, so be sure to read this section:
https://solr.apache.org/guide/solr/latest/indexing-guide/indexing-with-tika.html#solr-cell-performance-implications

Note especially the warning (which should probably be a larger warning box):

> For these reasons, Solr Cell is not recommended for use in a production
> system.
>
> It is a best practice to use Solr Cell as a proof-of-concept tool during
> development and then run Tika as an external process that sends the
> extracted documents to Solr (via SolrJ
> <https://solr.apache.org/guide/solr/latest/deployment-guide/solrj.html>)
> for indexing. This way, any extraction failures that occur are isolated
> from Solr itself and can be handled gracefully.

Starting with a subset of your data is, however, an excellent idea. It's normal early on to index things, see if you like the result, tweak the schema or Tika settings, and try again. Keeping the initial set smaller but representative facilitates iteration.

Best,
Gus

On Tue, Aug 23, 2022 at 7:45 AM Mikhail Khludnev <[email protected]> wrote:

> Hello, Crag.
> It's probably something like
> https://solr.apache.org/guide/solr/latest/getting-started/tutorial-diy.html
>
> On Tue, Aug 23, 2022 at 10:57 AM Markwood, Crag <[email protected]>
> wrote:
>
> > Hello,
> >
> > I've run a couple of the demos and am ready to try Solr on some of my own
> > documents. Is there a 'tips/tricks' document for this, or should I just use
> > schemaless mode and point Solr at a subset (%10?) of my repository (~100k
> > Microsoft/text/pdf files)?
> >
> > Thank you in advance!
> > Crag
> >
> > Crag Markwood
> > Senior Director, Research Informatics
> > Epizyme, an Ipsen Company
>
> --
> Sincerely yours
> Mikhail Khludnev

--
http://www.needhamsoftware.com (work)
http://www.the111shift.com (play)
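
To make the "run Tika as an external process" advice from the guide a bit more concrete, here is a rough Python sketch. The guide recommends SolrJ; this swaps in plain HTTP from Python for brevity. It assumes a Tika Server listening on `localhost:9998` and a hypothetical Solr collection `mydocs` with a hypothetical `content` field, so treat the URLs and field names as placeholders. The point is only that a file Tika chokes on gets recorded and skipped rather than causing trouble inside Solr.

```python
import json
import urllib.request
from pathlib import Path

# Assumed endpoints; adjust for your environment.
TIKA_URL = "http://localhost:9998/tika"
SOLR_UPDATE_URL = "http://localhost:8983/solr/mydocs/update"


def extract_text(path):
    """Send the file's bytes to Tika Server and return the extracted plain text."""
    request = urllib.request.Request(
        TIKA_URL,
        data=Path(path).read_bytes(),
        method="PUT",
        headers={"Accept": "text/plain"},
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return response.read().decode("utf-8", errors="replace")


def send_to_solr(docs):
    """POST a batch of documents to Solr as JSON and commit."""
    request = urllib.request.Request(
        SOLR_UPDATE_URL + "?commit=true",
        data=json.dumps(docs).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        response.read()


def index_files(paths, extract=extract_text, send=send_to_solr):
    """Extract each file outside Solr; collect failures instead of aborting."""
    docs, failures = [], []
    for path in paths:
        try:
            docs.append({"id": str(path), "content": extract(path)})
        except Exception as exc:  # extraction problems stay on this side
            failures.append((str(path), repr(exc)))
    if docs:
        send(docs)
    return len(docs), failures
```

Something like `index_files(Path("sample_docs").rglob("*.pdf"))` (with `sample_docs` being whatever directory holds your subset) would index what it can and hand back the list of files that failed, which you can then inspect or retry separately.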
