Manisha,
In addition to what Shawn has mentioned above, I would also like you to
reevaluate your use case. Do you *need to* index the whole document? For
example, if it's an email, the body of the email *might* be more important
than any attachments, in which case you could choose to index only the
email body and ignore (or only partially index) the text from attachments.
If you can afford to index the documents partially, you could consider
Solr's "Limit Token Count Filter"; see the link below.

https://lucene.apache.org/solr/guide/7_7/filter-descriptions.html#limit-token-count-filter

You'll need to configure it in the schema, on the "index" analyzer of the
field type used by the field with large text.
Indexing documents on the order of half a GB will definitely come back to
hurt your operations, if not now, then later (think OOM, extremely slow
atomic updates, long-running merges, etc.).
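
To make the schema change concrete, here is a rough sketch of what it could
look like. The field type name "text_limited", the field name "large_body",
the tokenizer/lowercase chain and the limit of 10000 tokens are all
placeholders I made up for illustration; only the filter class and its
maxTokenCount / consumeAllTokens attributes come from the Solr guide linked
above. Adjust everything else to your own schema.

```xml
<!-- Hypothetical field type: name and analysis chain are assumptions -->
<fieldType name="text_limited" class="solr.TextField" positionIncrementGap="100">
  <!-- The limit is applied at index time only -->
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Keep only the first 10000 tokens of the field (illustrative value) -->
    <filter class="solr.LimitTokenCountFilterFactory"
            maxTokenCount="10000"
            consumeAllTokens="false"/>
  </analyzer>
  <!-- Queries are short, so no limit is configured here -->
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Hypothetical field that uses the limited type -->
<field name="large_body" type="text_limited" indexed="true" stored="false"/>
```

Note that this only caps how many tokens get indexed. Solr still has to
receive the whole field value, so trimming the text on the client side
before sending it may help more with the request size itself.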

- Rahul



On Thu, Oct 1, 2020 at 7:06 PM Shawn Heisey <apa...@elyograg.org> wrote:

> On 10/1/2020 6:57 AM, Manisha Rahatadkar wrote:
> > We are using Apache Solr 7.7 on Windows platform. The data is synced to
> > Solr using Solr.Net commit. The data is being synced to SOLR in batches.
> > The document size is very huge (~0.5GB average) and solr indexing is taking
> > long time. Total document size is ~200GB. As the solr commit is done as a
> > part of API, the API calls are failing as document indexing is not
> > completed.
>
> A single document is five hundred megabytes?  What kind of documents do
> you have?  You can't even index something that big without tweaking
> configuration parameters that most people don't even know about.
> Assuming you can even get it working, there's no way that indexing a
> document like that is going to be fast.
>
> >    1.  What is your advice on syncing such a large volume of data to
> > Solr KB.
>
> What is "KB"?  I have never heard of this in relation to Solr.
>
> >    2.  Because of the search requirements, almost 8 fields are defined
> > as Text fields.
>
> I can't figure out what you are trying to say with this statement.
>
> >    3.  Currently SOLR_JAVA_MEM is set to 2gb. Is that enough for such a
> > large volume of data?
>
> If just one of the documents you're sending to Solr really is five
> hundred megabytes, then 2 gigabytes would probably be just barely enough
> to index one document into an empty index ... and it would probably be
> doing garbage collection so frequently that it would make things REALLY
> slow.  I have no way to predict how much heap you will need.  That will
> require experimentation.  I can tell you that 2GB is definitely not enough.
>
> >    4.  How to set up Solr in production on Windows? Currently it's set
> > up as a standalone engine and client is requested to take the backup of the
> > drive. Is there any other better way to do? How to set up for the disaster
> > recovery?
>
> I would suggest NOT doing it on Windows.  My reasons for that come down
> to costs -- a Windows Server license isn't cheap.
>
> That said, there's nothing wrong with running on Windows, but you're on
> your own as far as running it as a service.  We only have a service
> installer for UNIX-type systems.  Most of the testing for that is done
> on Linux.
>
> >    5.  How to benchmark the system requirements for such a huge data
>
> I do not know what all your needs are, so I have no way to answer this.
> You're going to know a lot more about it than any of us do.
>
> Thanks,
> Shawn
>
