RE: Solr 7.7 - Few Questions

2020-10-13 Thread Hanjan, Harinderdeep S.
1. What tool do you use to run Solr as a service on Windows?
We use NSSM. https://nssm.cc/
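
For anyone searching the archives, a minimal NSSM setup looks roughly like the following. This is a sketch only, run from an elevated Windows command prompt: the service name, install path, and port are assumptions, not a tested recipe.

```shell
:: Register Solr as a Windows service (service name "Solr7" and the
:: C:\solr-7.7.0 install path are assumptions -- adjust for your layout).
:: Solr must run in the foreground (-f) so NSSM can manage the process.
nssm install Solr7 "C:\solr-7.7.0\bin\solr.cmd" start -f -p 8983
nssm set Solr7 AppDirectory "C:\solr-7.7.0"
nssm set Solr7 AppStdout "C:\solr-7.7.0\logs\service-stdout.log"
nssm set Solr7 AppStderr "C:\solr-7.7.0\logs\service-stderr.log"
nssm start Solr7
```

NSSM then restarts the process if it dies, which is the main thing a plain scheduled task or startup script won't give you.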


Re: Solr 7.7 - Few Questions

2020-10-06 Thread Rahul Goswami
1. What tool do you use to run Solr as a service on Windows?
>> Look into procrun. After all, Solr runs inside Jetty, so you should have
a way to invoke Jetty’s Main class with the required parameters and bundle
that as a procrun service.

2. How do you set up disaster recovery?
>> You can back up your indexes at regular intervals. This can be done by
taking snapshots and backing them up, and then using the appropriate
snapshot names to restore a certain commit point. For more details, please
refer to this link:
https://lucene.apache.org/solr/guide/7_7/making-and-restoring-backups.html
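
As a concrete sketch of the above for standalone (non-SolrCloud) Solr using the replication handler: the core name "mycore", the snapshot name, and the backup location are assumptions.

```shell
# Take a named snapshot (written under the core's data dir, or "location" if given):
curl "http://localhost:8983/solr/mycore/replication?command=backup&name=nightly&location=/backups"

# Inspect the result/status of the last backup:
curl "http://localhost:8983/solr/mycore/replication?command=details"

# Restore that named snapshot later:
curl "http://localhost:8983/solr/mycore/replication?command=restore&name=nightly&location=/backups"

# Poll restore progress:
curl "http://localhost:8983/solr/mycore/replication?command=restorestatus"
```

Backup and restore are both asynchronous, hence the two status commands.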

3. How do you scale up the servers for better performance?
>> This is too open-ended a question and depends on a lot of factors
specific to your environment and use case :)

- Rahul



RE: Solr 7.7 - Few Questions

2020-10-06 Thread Manisha Rahatadkar
Hi All

First of all, thanks to Shawn, Rahul and Charlie for taking the time to reply
to my questions and for the valuable information.

I was very concerned about the size of each document, and on several
follow-ups learned that the 0.5GB documents are mp4 files, and these are not
synced to Solr.

@Shawn Heisey recommended NOT using Windows because of the Windows license
cost and because service installer testing is done on Linux.
I agree with him. We are using the NSSM tool to run Solr as a service.

Are there any members here using Solr on Windows? I look forward to hearing
from them on:

1. What tool do you use to run Solr as a service on Windows?
2. How do you set up disaster recovery?
3. How do you scale up the servers for better performance?

Thanks in advance; I look forward to hearing about your experiences with
scaling up Solr.

Regards,
Manisha Rahatadkar


Re: Solr 7.7 - Few Questions

2020-10-05 Thread Charlie Hull
Nested docs would be one approach; result grouping might be another. 
Regarding JOINs, the only way you're going to know is by some 
representative testing.
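
To make the two approaches above concrete, here is a rough sketch of what such queries could look like; the field names (threadId) and core name are assumptions, not a schema anyone in this thread described.

```shell
# Result grouping: group hits by a shared ID so an email and its
# attachments come back together as one group:
curl "http://localhost:8983/solr/mycore/select?q=contract+renewal&group=true&group.field=threadId&group.limit=11"

# Collapse/expand: keep one representative hit per thread, with the
# rest of each thread available in the "expanded" section.
# (fq is the URL-encoded form of {!collapse field=threadId})
curl "http://localhost:8983/solr/mycore/select?q=contract+renewal&fq=%7B!collapse%20field=threadId%7D&expand=true"
```

Collapse/expand generally scales better than grouping on large indexes, but as Charlie says, only representative testing on your own data will settle it.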


Charlie


Re: Solr 7.7 - Few Questions

2020-10-04 Thread Rahul Goswami
Charlie,
Thanks for providing an alternate approach to doing this. It would be
interesting to know how one could go about organizing the docs in this
case (nested documents?), and how join queries would perform on a large
index (200 million+ docs).

Thanks,
Rahul




Re: Solr 7.7 - Few Questions

2020-10-02 Thread Charlie Hull

Hi Rahul,

In addition to the wise advice below: remember in Solr, a 'document' is 
just the name for the thing that would appear as one of the results when 
you search (analogous to a database record). It's not the same 
conceptually as a 'Word document' or a 'PDF document'. If your source 
documents are so big, consider how they might be broken into parts, or 
whether you really need to index all of them for retrieval purposes, or 
what parts of them need to be extracted as text. Thus, the Solr 
documents don't necessarily need to be as large as your source documents.


Consider an email of size 20KB with ten PDF attachments, each 20MB. You 
probably shouldn't push all this data into a single Solr document, but 
you *could* index them as 11 separate Solr documents, but with metadata 
to indicate that one is an email and ten are PDFs, and a shared ID of 
some kind to indicate they're related. Then at query time there are 
various ways for you to group these together, so for example if the 
query hit one of the PDFs you could show the user the original email, 
plus the 9 other attachments, using the shared ID as a key.
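
A rough sketch of what indexing those 11 sibling documents could look like over Solr's JSON update API. The field names, core name, and IDs are assumptions, and only two of the ten attachments are shown:

```shell
# One email plus its attachments as flat sibling docs sharing a threadId.
# The shared threadId is what later lets grouping/collapsing reunite them.
curl -X POST -H 'Content-Type: application/json' \
  "http://localhost:8983/solr/mycore/update?commit=true" -d '[
  {"id": "msg-42",       "threadId": "42", "doctype": "email", "body": "Please see attached..."},
  {"id": "msg-42-att-1", "threadId": "42", "doctype": "pdf",   "body": "Extracted text of attachment 1..."},
  {"id": "msg-42-att-2", "threadId": "42", "doctype": "pdf",   "body": "Extracted text of attachment 2..."}
]'
```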


HTH,

Charlie


--
Charlie Hull
OpenSource Connections, previously Flax

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.o19s.com



Re: Solr 7.7 - Few Questions

2020-10-01 Thread Rahul Goswami
Manisha,
In addition to what Shawn has mentioned above, I would also like you to
reevaluate your use case. Do you *need to* index the whole document? E.g.,
If it's an email, the body of the email *might* be more important than any
attachments, in which case you could choose to only index the email body
and ignore (or only partially index) the text from attachments. If you
could afford to index the documents partially, you could consider Solr's
"Limit token count filter": See the link below.

https://lucene.apache.org/solr/guide/7_7/filter-descriptions.html#limit-token-count-filter

You'll need to configure it in the schema for the "index" analyzer for the
data type of the field with large text.
Indexing documents on the order of half a GB will definitely come back to
hurt your operations, if not now, then later (think OOM, extremely slow
atomic updates, long-running merges, etc.).
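
A sketch of the schema configuration Rahul describes: a field type whose index-time analyzer caps the token count. The type name and the maxTokenCount value are assumptions and would need tuning for your data.

```xml
<!-- Index analyzer truncates each field value at maxTokenCount tokens;
     the query analyzer is left uncapped so searches are unaffected. -->
<fieldType name="text_limited" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LimitTokenCountFilterFactory" maxTokenCount="10000"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```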

- Rahul





Re: Solr 7.7 - Few Questions

2020-10-01 Thread Shawn Heisey

On 10/1/2020 6:57 AM, Manisha Rahatadkar wrote:

> We are using Apache Solr 7.7 on Windows platform. The data is synced to Solr
> using Solr.Net commit. The data is being synced to SOLR in batches. The
> document size is very huge (~0.5GB average) and solr indexing is taking long
> time. Total document size is ~200GB. As the solr commit is done as a part of
> API, the API calls are failing as document indexing is not completed.

A single document is five hundred megabytes?  What kind of documents do 
you have?  You can't even index something that big without tweaking 
configuration parameters that most people don't even know about. 
Assuming you can even get it working, there's no way that indexing a 
document like that is going to be fast.

>   1.  What is your advice on syncing such a large volume of data to Solr KB.

What is "KB"?  I have never heard of this in relation to Solr.

>   2.  Because of the search requirements, almost 8 fields are defined as Text
> fields.

I can't figure out what you are trying to say with this statement.

>   3.  Currently Solr_JAVA_MEM is set to 2gb. Is that enough for such a large
> volume of data?

If just one of the documents you're sending to Solr really is five 
hundred megabytes, then 2 gigabytes would probably be just barely enough 
to index one document into an empty index ... and it would probably be 
doing garbage collection so frequently that it would make things REALLY 
slow.  I have no way to predict how much heap you will need.  That will 
require experimentation.  I can tell you that 2GB is definitely not enough.
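
For reference, the heap Shawn is discussing is set via SOLR_JAVA_MEM in Solr's startup include file. A sketch for Windows; the 8g figure is only a placeholder, since, as he says, the right value has to be found by experimentation:

```shell
:: In bin\solr.in.cmd on Windows (the equivalent in bin/solr.in.sh on Linux
:: is SOLR_JAVA_MEM="-Xms8g -Xmx8g"). Keep -Xms and -Xmx equal so the heap
:: doesn't resize under load.
set SOLR_JAVA_MEM=-Xms8g -Xmx8g
```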



>   4.  How to set up Solr in production on Windows? Currently it's set up as a
> standalone engine and client is requested to take the backup of the drive. Is
> there any other better way to do? How to set up for the disaster recovery?

I would suggest NOT doing it on Windows.  My reasons for that come down 
to costs -- a Windows Server license isn't cheap.

That said, there's nothing wrong with running on Windows, but you're on 
your own as far as running it as a service.  We only have a service 
installer for UNIX-type systems.  Most of the testing for that is done 
on Linux.

>   5.  How to benchmark the system requirements for such a huge data

I do not know what all your needs are, so I have no way to answer this. 
You're going to know a lot more about it than any of us are.


Thanks,
Shawn


RE: Solr 7.7 - Few Questions

2020-10-01 Thread Manisha Rahatadkar
I apologize for sending this email again; I don't mean to spam the mailbox, but 
I am looking for urgent help.

We are using Apache Solr 7.7 on Windows platform. The data is synced to Solr 
using Solr.Net commit. The data is being synced to SOLR in batches. The 
document size is very huge (~0.5GB average) and solr indexing is taking long 
time. Total document size is ~200GB. As the solr commit is done as a part of 
API, the API calls are failing as document indexing is not completed.


  1.  What is your advice on syncing such a large volume of data to Solr KB.
  2.  Because of the search requirements, almost 8 fields are defined as Text 
fields.
  3.  Currently Solr_JAVA_MEM is set to 2gb. Is that enough for such a large 
volume of data?
  4.  How to set up Solr in production on Windows? Currently it's set up as a 
standalone engine and client is requested to take the backup of the drive. Is 
there any other better way to do? How to set up for the disaster recovery?
  5.  How to benchmark the system requirements for such a huge data

Thanks in advance.

Regards
Manisha Rahatadkar


Confidentiality Notice

This email message, including any attachments, is for the sole use of the 
intended recipient and may contain confidential and privileged information. Any 
unauthorized view, use, disclosure or distribution is prohibited. If you are 
not the intended recipient, please contact the sender by reply email and 
destroy all copies of the original message. Anju Software, Inc. 4500 S. 
Lakeshore Drive, Suite 620, Tempe, AZ USA 85282.