Re: getting started?

Gus Heck Tue, 23 Aug 2022 05:47:15 -0700

If you're moving towards mocking up a production system, I'd move away from
schemaless mode. It allows the number of fields to explode if you get bad
or unexpected data, and it is prone to difficult-to-fix errors where it
misidentifies numbers/strings. In particular, if a string field happens to
first appear with only numeric values, you can be in trouble.
"Schemaless" is a misleading name; "Schema Guessing" is more accurate,
since there is in fact a schema, and it will subsequently object to strings
sent to a field it initially recognized as a number.
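
As a rough illustration of declaring fields up front instead of letting
them be guessed, here is a minimal Python sketch that turns off automatic
field creation through the Config API and then adds fields through the
Schema API. The host, the collection name "docs", and the three field names
are assumptions made up for the example; the field types are the ones
shipped in the _default configset.

```python
import json
import urllib.request

SOLR = "http://localhost:8983/solr/docs"  # assumed host and collection name


def add_field_command(name, field_type, multi_valued=False):
    """Build one Schema API 'add-field' command."""
    return {"add-field": {"name": name, "type": field_type,
                          "stored": True, "indexed": True,
                          "multiValued": multi_valued}}


def disable_guessing_command():
    """Config API command that switches off automatic field creation."""
    return {"set-user-property": {"update.autoCreateFields": "false"}}


def post_json(url, payload):
    """POST a JSON payload and return the decoded JSON response."""
    req = urllib.request.Request(
        url, data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"}, method="POST")
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


# Hypothetical fields. part_number is declared as a string so that it stays
# a string even if the first values seen happen to look numeric.
FIELDS = [("title", "text_general"),
          ("part_number", "string"),
          ("page_count", "pint")]


def apply_schema(base_url=SOLR):
    """Disable guessing, then add each field (raises HTTPError on failure)."""
    post_json(base_url + "/config", disable_guessing_command())
    for name, field_type in FIELDS:
        post_json(base_url + "/schema", add_field_command(name, field_type))
```

Calling apply_schema() once against a running Solr would apply the
changes. After that, a document carrying an undeclared field should be
rejected instead of silently growing the schema.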


Also, it sounds like you might be using the extracting request handler
(Solr Cell), so be sure to read this section:
https://solr.apache.org/guide/solr/latest/indexing-guide/indexing-with-tika.html#solr-cell-performance-implications
Note especially the warning (which should probably be a larger warning box):
>
> For these reasons, Solr Cell is not recommended for use in a production
> system.
>
> It is a best practice to use Solr Cell as a proof-of-concept tool during
> development and then run Tika as an external process that sends the
> extracted documents to Solr (via SolrJ
> <https://solr.apache.org/guide/solr/latest/deployment-guide/solrj.html>)
> for indexing. This way, any extraction failures that occur are isolated
> from Solr itself and can be handled gracefully.
>
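
To make the "Tika as an external process" idea concrete, here is a minimal
Python sketch. Note that it swaps in plain HTTP for SolrJ: it assumes a
Tika Server running on localhost:9998 and posts JSON to Solr's update
handler. The URLs, the collection name "docs", and the field names "title"
and "content" are assumptions for illustration, and the fields would need
to exist in your schema.

```python
import json
import urllib.request
from pathlib import Path

TIKA_URL = "http://localhost:9998/tika"  # assumed Tika Server address
SOLR_UPDATE = "http://localhost:8983/solr/docs/update?commit=true"  # assumed


def extract_text(path, tika_url=TIKA_URL):
    """Send a file to Tika Server and return the extracted plain text."""
    req = urllib.request.Request(
        tika_url, data=Path(path).read_bytes(),
        headers={"Accept": "text/plain"}, method="PUT")
    with urllib.request.urlopen(req, timeout=60) as resp:
        return resp.read().decode("utf-8", errors="replace")


def build_doc(path, text):
    """Map a file and its text onto hypothetical Solr field names."""
    p = Path(path)
    return {"id": str(p), "title": p.stem, "content": text}


def send_to_solr(docs, solr_update=SOLR_UPDATE):
    """POST a list of documents to Solr as a JSON array."""
    req = urllib.request.Request(
        solr_update, data=json.dumps(docs).encode("utf-8"),
        headers={"Content-Type": "application/json"}, method="POST")
    with urllib.request.urlopen(req, timeout=60) as resp:
        return resp.status


def index_files(paths, extract=extract_text, send=send_to_solr):
    """Extract each file; failures are recorded here and never reach Solr."""
    docs, failures = [], []
    for path in paths:
        try:
            docs.append(build_doc(path, extract(path)))
        except Exception as exc:  # broad on purpose: one bad file must not stop the run
            failures.append((str(path), repr(exc)))
    if docs:
        send(docs)
    return len(docs), failures
```

The point of the split is that a corrupt or oversized file only ends up in
the failures list of this script, while Solr itself only ever sees clean
text. For a real run you would likely send documents in batches instead of
one big request.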

Starting with a subset of your data is, however, an excellent idea. It's
normal early on to index things, see if you like the result, tweak the
schema or Tika settings, etc., and try again. Keeping the initial set
smaller but representative facilitates iteration.
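
One possible way to keep a small sample representative is to sample per
file type rather than across the whole repository. Here is a minimal
sketch of that; the 10% fraction, the fixed seed, and grouping by file
extension are all assumptions for illustration, not a recommendation from
the Solr docs.

```python
import random
from collections import defaultdict
from pathlib import Path


def sample_by_extension(paths, fraction=0.10, seed=42):
    """Pick roughly `fraction` of the files from each extension group."""
    groups = defaultdict(list)
    for p in paths:
        groups[Path(p).suffix.lower()].append(p)
    rng = random.Random(seed)  # fixed seed so reruns index the same subset
    picked = []
    for ext in sorted(groups):
        files = sorted(groups[ext], key=str)
        # Keep at least one file per type so rare formats are still exercised.
        k = max(1, round(len(files) * fraction))
        picked.extend(rng.sample(files, k))
    return picked
```

You could feed it something like Path("/path/to/repo").rglob("*") filtered
to regular files. Because the seed is fixed, each iteration on the schema
re-indexes the same subset, which makes before/after comparisons easier.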

Best,
Gus

On Tue, Aug 23, 2022 at 7:45 AM Mikhail Khludnev <[email protected]> wrote:

> Hello, Crag.
> It's probably something like
> https://solr.apache.org/guide/solr/latest/getting-started/tutorial-diy.html
>
> On Tue, Aug 23, 2022 at 10:57 AM Markwood, Crag <[email protected]>
> wrote:
>
> > Hello,
> >
> > I've run a couple of the demos and am ready to try Solr on some of my own
> > documents. Is there a 'tips/tricks' document for this, or should I just
> use
> > schemaless mode and point Solr at a subset (%10?) of my repository (~100k
> > Microsoft/text/pdf files)?
> >
> > Thank you in advance!
> > Crag
> >
> > Crag Markwood
> > Senior Director, Research Informatics
> > Epizyme, an Ipsen Company
> >
> >
>
> --
> Sincerely yours
> Mikhail Khludnev
>


-- 
http://www.needhamsoftware.com (work)
http://www.the111shift.com (play)
