Even the most unstructured data has to have some natural breakpoint. I've
seen projects running solr that indexed whole books as one document per
chapter, plus a boosted synopsis document. The real question is how you
need to search and match those docs.
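
For what it's worth, a minimal solrj sketch of that pattern (SolrJ 4.x
style; the core URL and the book_id/type/text field names are illustrative
and assume matching entries in your schema): each chapter becomes its own
solr document sharing a book_id, the synopsis doc gets an index-time boost,
and the query groups on book_id so each book surfaces only once.

import java.util.Arrays;
import java.util.List;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrInputDocument;

public class BookIndexer {
    public static void main(String[] args) throws Exception {
        SolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");

        // One solr document per chapter, plus a boosted synopsis doc,
        // all sharing book_id as the grouping key.
        indexBook(solr, "hod", "Marlow recounts his journey up the river...",
                  Arrays.asList("Chapter one text...", "Chapter two text..."));

        // Query side: group on book_id so each book shows up once,
        // represented by its best-matching part.
        SolrQuery q = new SolrQuery("text:ivory");
        q.set("group", true);
        q.set("group.field", "book_id");
        q.set("group.limit", 1);  // best chapter (or synopsis) per book
        QueryResponse rsp = solr.query(q);
        System.out.println(rsp.getGroupResponse().getValues());
    }

    static void indexBook(SolrServer solr, String bookId, String synopsis,
                          List<String> chapters) throws Exception {
        SolrInputDocument syn = new SolrInputDocument();
        syn.addField("id", bookId + "_synopsis");
        syn.addField("book_id", bookId);
        syn.addField("type", "synopsis");
        syn.addField("text", synopsis);
        syn.setDocumentBoost(2.0f);  // so the synopsis outranks raw chapters
        solr.add(syn);

        for (int i = 0; i < chapters.size(); i++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", bookId + "_ch" + (i + 1));
            doc.addField("book_id", bookId);  // shared grouping key
            doc.addField("type", "chapter");
            doc.addField("text", chapters.get(i));
            solr.add(doc);
        }
        solr.commit();
    }
}

Note that term frequency stats are then computed per chapter rather than
per book, which usually works in your favor with very long texts, but it's
worth testing relevance against your own data.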


alexei martchenko


2014-03-18 23:52 GMT-03:00 Stephen Kottmann <stephen_kottm...@h3biomedicine.com>:

> Hi Solr Users,
>
> I'm looking for advice on best practices when indexing large documents
> (hundreds of MB, or even 1 to 2 GB text files). I've been hunting around on
> Google and the mailing list, and have found some suggestions of splitting
> the logical document up into multiple solr documents. However, I haven't
> been able to find anything that seems like conclusive advice.
>
> Some background...
>
> We've been using solr with great success for some time on a project that is
> mostly indexing very structured data - i.e., mainly ingested through DIH.
>
> I've now started a new project and we're trying to make use of solr again -
> however, in this project we are indexing mostly unstructured data - PDFs,
> PowerPoint, Word, etc. I've not done much configuration - my solr instance
> is very close to the example provided in the distribution, aside from some
> minor schema changes. Our index is relatively small at this point ( ~3k
> documents ), and for initial indexing I am pulling documents from an HTTP
> data source, running them through Tika, and then pushing to solr using
> solrj. For the most part this is working great... until I hit one of these
> huge text files and get an OOM on indexing.
>
> I've got a modest JVM - 4GB allocated. Obviously I can throw more memory at
> it, but it seems like maybe there's a more robust solution that would scale
> better.
>
> Is splitting the logical document into multiple solr documents best
> practice here? If so, what are the considerations or pitfalls of doing this
> that I should be paying attention to? I guess when querying I'd always need
> to group by a field to prevent multiple hits for the same document. Are
> there issues with term frequency, etc. that you need to work around?
>
> Really interested to hear how others are dealing with this.
>
> Thanks everyone!
> Stephen
>
