Re: how to index 20 MB plain-text xml

2014-03-30 Thread primoz . skale
Hi!

I had the same issue with XML files. Even small XML files produced an OOM 
exception. I read that the way XML is parsed can sometimes inflate memory 
requirements so much that Java runs out of heap. My solution 
was:

1. Don't parse XML files
2. Parse only small XML files and hope for the best
3. Give Solr the largest possible amount of java heap size (and hope for 
the best)

But then again, one time I also got an OOM exception with Word documents - it 
turned out that some user had pasted 400 MB worth of photos into a Word 
file.

Regards,

Primoz




From:   Floyd Wu 
To: solr-user@lucene.apache.org
Date:   31.03.2014 08:18
Subject:Re: how to index 20 MB plain-text xml



Hi Alex,

Thanks for your responding. Personally I don't want to feed these big xml
to solr. But users wants.
I'll try your suggestions later.

Many thanks.

Floyd



2014-03-31 13:44 GMT+08:00 Alexandre Rafalovitch :

> Without digging too deep into why exactly this is happening, here are
> the general options:
>
> 0. Are you actually committing? Check the messages in the logs and see
> if the records show up when you expect them too.
> 1. Are you actually trying to feed 20Mb file to Solr? Maybe it's HTTP
> buffer that's blowing up? Try using stream.file instead (notice
> security warning though): http://wiki.apache.org/solr/ContentStream
> 2. Split file into smaller ones and and commit each separately
> 3. Set hard auto-commit in solrconfig.xml based on number of documents
> to flush in-memory structures to disk
> 4. Switch to using DataImportHandler to pull from XML instead of pushing
> 5. Increase amount of memory to Solr (-X command line flags)
>
> Regards,
>Alex.
>
> Personal website: http://www.outerthoughts.com/
> Current project: http://www.solr-start.com/ - Accelerating your Solr
> proficiency
>
> On Mon, Mar 31, 2014 at 12:00 PM, Floyd Wu  wrote:
> > I have many plain text xml that I transfer to form of solr xml format.
> > But every time I send them to solr, I hit OOM exception.
> > How to configure solr to "eat" these big xml?
> > Please guide me a way. Thanks
> >
> > floyd
>



Re: how to index 20 MB plain-text xml

2014-03-30 Thread Floyd Wu
Hi Alex,

Thanks for responding. Personally I don't want to feed these big XML files
to Solr, but the users want it.
I'll try your suggestions later.

Many thanks.

Floyd



2014-03-31 13:44 GMT+08:00 Alexandre Rafalovitch :

> Without digging too deep into why exactly this is happening, here are
> the general options:
>
> 0. Are you actually committing? Check the messages in the logs and see
> if the records show up when you expect them too.
> 1. Are you actually trying to feed 20Mb file to Solr? Maybe it's HTTP
> buffer that's blowing up? Try using stream.file instead (notice
> security warning though): http://wiki.apache.org/solr/ContentStream
> 2. Split file into smaller ones and and commit each separately
> 3. Set hard auto-commit in solrconfig.xml based on number of documents
> to flush in-memory structures to disk
> 4. Switch to using DataImportHandler to pull from XML instead of pushing
> 5. Increase amount of memory to Solr (-X command line flags)
>
> Regards,
>Alex.
>
> Personal website: http://www.outerthoughts.com/
> Current project: http://www.solr-start.com/ - Accelerating your Solr
> proficiency
>
> On Mon, Mar 31, 2014 at 12:00 PM, Floyd Wu  wrote:
> > I have many plain text xml that I transfer to form of solr xml format.
> > But every time I send them to solr, I hit OOM exception.
> > How to configure solr to "eat" these big xml?
> > Please guide me a way. Thanks
> >
> > floyd
>


Re: how to index 20 MB plain-text xml

2014-03-30 Thread Alexandre Rafalovitch
Without digging too deep into why exactly this is happening, here are
the general options:

0. Are you actually committing? Check the messages in the logs and see
if the records show up when you expect them to.
1. Are you actually trying to feed a 20 MB file to Solr? Maybe it's the HTTP
buffer that's blowing up? Try using stream.file instead (note the
security warning, though): http://wiki.apache.org/solr/ContentStream
2. Split the file into smaller ones and commit each separately
3. Set a hard auto-commit in solrconfig.xml based on the number of documents
to flush in-memory structures to disk
4. Switch to using DataImportHandler to pull from XML instead of pushing
5. Increase the amount of memory available to Solr (-X command line flags)
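
Option 2 could be sketched roughly as below. This is only an illustration in Python, not anything Solr ships: the function name and the batch size of 500 are made up, and it assumes the file is a flat <add> wrapper around <doc> elements with no nested child documents. It streams the file so the whole tree is never built in memory.

```python
import xml.etree.ElementTree as ET


def split_solr_add_file(source, docs_per_batch=500):
    """Yield serialized <add> batches from a large Solr <add> XML file.

    `source` is a file path or a binary file-like object.
    Hypothetical helper; assumes flat <doc> elements (no nested documents).
    """
    batch = []
    # iterparse streams the input, so the full document tree is never built.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "doc":
            continue
        batch.append(ET.tostring(elem, encoding="unicode"))
        elem.clear()  # drop the fields of the <doc> we just serialized
        if len(batch) == docs_per_batch:
            yield "<add>" + "".join(batch) + "</add>"
            batch = []
    if batch:
        # Emit the final, possibly smaller, batch.
        yield "<add>" + "".join(batch) + "</add>"
```

Each yielded string can then be posted to the update handler as its own request and committed separately.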

Regards,
   Alex.

Personal website: http://www.outerthoughts.com/
Current project: http://www.solr-start.com/ - Accelerating your Solr proficiency

On Mon, Mar 31, 2014 at 12:00 PM, Floyd Wu  wrote:
> I have many plain text xml that I transfer to form of solr xml format.
> But every time I send them to solr, I hit OOM exception.
> How to configure solr to "eat" these big xml?
> Please guide me a way. Thanks
>
> floyd


how to index 20 MB plain-text xml

2014-03-30 Thread Floyd Wu
I have many plain-text XML files that I convert to Solr's XML format.
But every time I send them to Solr, I hit an OOM exception.
How do I configure Solr to "eat" these big XML files?
Please point me in the right direction. Thanks

floyd


Re: eDismax parser and the mm parameter

2014-03-30 Thread S.L
Thanks Jack, my use cases are as follows.


   1. Search for "Ginseng": everything related to ginseng should show up.
   2. Search for "White Siberian Ginseng": results with the whole phrase
   show up first, followed by results with 2 words from the phrase, followed
   by results with a single word from the phrase.
   3. Fuzzy search for "Whte Sberia Ginsng" (please note the typos here):
   documents with White Siberian Ginseng should show up. This looks like the
   most complicated of all, as Solr does not support fuzzy phrase searches. (I
   have no solution for this yet.)
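
One partial workaround for use case 3 might be to make every term fuzzy on its own, using the Lucene fuzzy operator (term~N), rather than trying to make the phrase fuzzy. The sketch below is only an assumption about how the query string could be built on the client side; the function name is made up, it does not escape special characters, and it gives per-term fuzzy matching, not a fuzzy phrase match.

```python
def fuzzy_query(user_input, max_edits=2):
    """Append a Lucene fuzzy operator (~N) to every whitespace-separated term.

    Hypothetical helper. Lucene allows an edit distance of at most 2.
    """
    if max_edits not in (1, 2):
        raise ValueError("max_edits must be 1 or 2")
    return " ".join(f"{term}~{max_edits}" for term in user_input.split())


print(fuzzy_query("Whte Sberia Ginsng"))  # prints: Whte~2 Sberia~2 Ginsng~2
```

An edit distance of 2 is used here because "Sberia" is two edits away from "Siberian".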

Thanks again!


On Sun, Mar 30, 2014 at 11:21 PM, Jack Krupansky wrote:

> The mm parameter is really only relevant when the default operator is OR
> or explicit OR operators are used.
>
> Again: Please provide your use case examples and your expectations for
> each use case. It really doesn't make a lot of sense to prematurely focus
> on a solution when you haven't clearly defined your use cases.
>
> -- Jack Krupansky
>
> -Original Message- From: S.L
> Sent: Sunday, March 30, 2014 9:13 PM
> To: solr-user@lucene.apache.org
> Subject: Re: eDismax parser and the mm parameter
>
> Jack,
>
> I mis-stated the problem , I am not using the OR operator as default
> now(now that I think about it it does not make sense to use the default
> operator OR along with the mm parameter) , the reason I want to use pf and
> mm in conjunction is because of my understanding of the edismax parser and
> I have not looked into pf2 and pf3 parameters yet.
>
> I will state my understanding here below.
>
> Pf -  Is used to boost the result score if the complete phrase matches.
> mm <(less than) search term length would help limit the query results  to a
> certain number of better matches.
>
> With that being said would it make sense to have dynamic mm (set to the
> length of search term - 1)?
>
> I also have a question around using a fuzzy search along with eDismax
> parser , but I will ask that in a seperate post once I go thru that aspect
> of eDismax parser.
>
> Thanks again !
>
>
>
>
>
> On Sun, Mar 30, 2014 at 6:44 PM, Jack Krupansky 
> wrote:
>
>  If you use pf, pf2, and pf3 and boost appropriately, the effects of mm
>> will be dwarfed.
>>
>> The general goal is to assure that the top documents really are the best,
>> not to necessarily limit the total document count. Focusing on the latter
>> could be a real waste of time.
>>
>> It's still not clear why or how you need or want to use OR as the default
>> operator - you still haven't given us a use case for that.
>>
>> To repeat: Give us a full set of use cases before taking this XY Problem
>> approach of pursuing a solution before the problem is understood.
>>
>> -- Jack Krupansky
>>
>> -Original Message- From: S.L
>> Sent: Sunday, March 30, 2014 6:14 PM
>> To: solr-user@lucene.apache.org
>> Subject: Re: eDismax parser and the mm parameter
>>
>> Jacks Thanks Again,
>>
>> I am searching  Chinese medicine  documents , as the example I gave
>> earlier
>> a user can search for "Ginseng" or Siberian Ginseng or Red Siberian
>> Ginseng
>> , I certainly want to use pf parameter (which is not driven by mm
>> parameter) , however for giving higher score to documents that have more
>> of
>> the terms I want to use edismax now if I give a mm of 3 and the search
>> term
>> is of only length 1 (like "Ginseng") what does edisMax do ?
>>
>>
>> On Sun, Mar 30, 2014 at 1:21 PM, Jack Krupansky 
>> wrote:
>>
>>  It still depends on your objective - which you haven't told us yet. Show
>>
>>> us some use cases and detail what your expectations are for each use
>>> case.
>>>
>>> The edismax phrase boosting is probably a lot more useful than messing
>>> around with mm. Take a look at pf, pf2, and pf3.
>>>
>>> See:
>>> http://wiki.apache.org/solr/ExtendedDisMax
>>> https://cwiki.apache.org/confluence/display/solr/The+
>>> Extended+DisMax+Query+Parser
>>>
>>> The focus on mm may indeed be a classic "XY Problem" - a premature focus
>>> on a solution without detailing the problem.
>>>
>>> -- Jack Krupansky
>>>
>>> -Original Message- From: S.L
>>> Sent: Sunday, March 30, 2014 11:18 AM
>>> To: solr-user@lucene.apache.org
>>> Subject: Re: eDismax parser and the mm parameter
>>>
>>> Thanks Jack! I understand the intent of mm parameter, my question is that
>>> since the query terms being provided are not of fixed length I do not
>>> know
>>> what the mm should like for example "Ginseng","Siberian Ginseng" are my
>>> search terms. The first one can have an mm upto 1 and the second one can
>>> have an mm of upto 2 .
>>>
>>> Should I dynamically set the mm based on the number of search terms in my
>>> query ?
>>>
>>> Thanks again.
>>>
>>>
>>> On Sun, Mar 30, 2014 at 5:20 AM, Jack Krupansky >> >
>>> wrote:
>>>
>>>  1. Yes, the default for mm is 1.
>>>
>>>
>>>> 2. It depends on what you are really trying to do - you haven't told us.
>>>>
>>>> Generally, mm=1 is equivalent to q.op=OR, and mm=100% is equivalent to
>>>> q.op=AND.
>>>>
>>>> Generally, use q

Re: eDismax parser and the mm parameter

2014-03-30 Thread Jack Krupansky
The mm parameter is really only relevant when the default operator is OR or 
explicit OR operators are used.


Again: Please provide your use case examples and your expectations for each 
use case. It really doesn't make a lot of sense to prematurely focus on a 
solution when you haven't clearly defined your use cases.


-- Jack Krupansky

-Original Message- 
From: S.L

Sent: Sunday, March 30, 2014 9:13 PM
To: solr-user@lucene.apache.org
Subject: Re: eDismax parser and the mm parameter

Jack,

I mis-stated the problem , I am not using the OR operator as default
now(now that I think about it it does not make sense to use the default
operator OR along with the mm parameter) , the reason I want to use pf and
mm in conjunction is because of my understanding of the edismax parser and
I have not looked into pf2 and pf3 parameters yet.

I will state my understanding here below.

Pf -  Is used to boost the result score if the complete phrase matches.
mm <(less than) search term length would help limit the query results  to a
certain number of better matches.

With that being said would it make sense to have dynamic mm (set to the
length of search term - 1)?

I also have a question around using a fuzzy search along with eDismax
parser , but I will ask that in a seperate post once I go thru that aspect
of eDismax parser.

Thanks again !





On Sun, Mar 30, 2014 at 6:44 PM, Jack Krupansky 
wrote:



If you use pf, pf2, and pf3 and boost appropriately, the effects of mm
will be dwarfed.

The general goal is to assure that the top documents really are the best,
not to necessarily limit the total document count. Focusing on the latter
could be a real waste of time.

It's still not clear why or how you need or want to use OR as the default
operator - you still haven't given us a use case for that.

To repeat: Give us a full set of use cases before taking this XY Problem
approach of pursuing a solution before the problem is understood.

-- Jack Krupansky

-Original Message- From: S.L
Sent: Sunday, March 30, 2014 6:14 PM
To: solr-user@lucene.apache.org
Subject: Re: eDismax parser and the mm parameter

Jacks Thanks Again,

I am searching  Chinese medicine  documents , as the example I gave earlier
a user can search for "Ginseng" or Siberian Ginseng or Red Siberian Ginseng
, I certainly want to use pf parameter (which is not driven by mm
parameter) , however for giving higher score to documents that have more of
the terms I want to use edismax now if I give a mm of 3 and the search term
is of only length 1 (like "Ginseng") what does edisMax do ?


On Sun, Mar 30, 2014 at 1:21 PM, Jack Krupansky 
wrote:

It still depends on your objective - which you haven't told us yet. Show
us some use cases and detail what your expectations are for each use case.


The edismax phrase boosting is probably a lot more useful than messing
around with mm. Take a look at pf, pf2, and pf3.

See:
http://wiki.apache.org/solr/ExtendedDisMax
https://cwiki.apache.org/confluence/display/solr/The+
Extended+DisMax+Query+Parser

The focus on mm may indeed be a classic "XY Problem" - a premature focus
on a solution without detailing the problem.

-- Jack Krupansky

-Original Message- From: S.L
Sent: Sunday, March 30, 2014 11:18 AM
To: solr-user@lucene.apache.org
Subject: Re: eDismax parser and the mm parameter

Thanks Jack! I understand the intent of mm parameter, my question is that
since the query terms being provided are not of fixed length I do not know
what the mm should like for example "Ginseng","Siberian Ginseng" are my
search terms. The first one can have an mm upto 1 and the second one can
have an mm of upto 2 .

Should I dynamically set the mm based on the number of search terms in my
query ?

Thanks again.


On Sun, Mar 30, 2014 at 5:20 AM, Jack Krupansky 
wrote:

 1. Yes, the default for mm is 1.



2. It depends on what you are really trying to do - you haven't told us.

Generally, mm=1 is equivalent to q.op=OR, and mm=100% is equivalent to
q.op=AND.

Generally, use q.op unless you really know what you are doing.

Generally, the intent of mm is to set the minimum number of OR/SHOULD
clauses that must match on the top level of a query.

-- Jack Krupansky

-Original Message- From: S.L
Sent: Sunday, March 30, 2014 2:25 AM
To: solr-user@lucene.apache.org
Subject: eDismax parser and the mm parameter

Hi All,

I am planning to use the eDismax query parser in SOLR to give boost to
documents that have a phrase in their fields present. Now there is a mm
parameter in the edismax parser query , since the query typed by the user
could be of any length (i.e. >=1) I would like to set the mm value to 1 .
I have the following questions regarding this parameter.

  1. Is it set to 1 by default ?
  2. In my schema.xml the defaultOperator is set to "AND" should I set it
  to "OR" inorder for the edismax parser to be effective with a mm of 1?


Thanks in advance!











Re: eDismax parser and the mm parameter

2014-03-30 Thread S.L
Jack,

I misstated the problem. I am not using the OR operator as the default
now (now that I think about it, it does not make sense to use the default
operator OR along with the mm parameter). The reason I want to use pf and
mm in conjunction is because of my understanding of the edismax parser, and
I have not looked into the pf2 and pf3 parameters yet.

I will state my understanding below.

pf - is used to boost the result score if the complete phrase matches.
mm < (less than) the search term length - would help limit the query results
to a certain number of better matches.

With that being said, would it make sense to have a dynamic mm (set to the
length of the search term - 1)?

I also have a question about using a fuzzy search along with the eDismax
parser, but I will ask that in a separate post once I have gone through that
aspect of the eDismax parser.

Thanks again !





On Sun, Mar 30, 2014 at 6:44 PM, Jack Krupansky wrote:

> If you use pf, pf2, and pf3 and boost appropriately, the effects of mm
> will be dwarfed.
>
> The general goal is to assure that the top documents really are the best,
> not to necessarily limit the total document count. Focusing on the latter
> could be a real waste of time.
>
> It's still not clear why or how you need or want to use OR as the default
> operator - you still haven't given us a use case for that.
>
> To repeat: Give us a full set of use cases before taking this XY Problem
> approach of pursuing a solution before the problem is understood.
>
> -- Jack Krupansky
>
> -Original Message- From: S.L
> Sent: Sunday, March 30, 2014 6:14 PM
> To: solr-user@lucene.apache.org
> Subject: Re: eDismax parser and the mm parameter
>
> Jacks Thanks Again,
>
> I am searching  Chinese medicine  documents , as the example I gave earlier
> a user can search for "Ginseng" or Siberian Ginseng or Red Siberian Ginseng
> , I certainly want to use pf parameter (which is not driven by mm
> parameter) , however for giving higher score to documents that have more of
> the terms I want to use edismax now if I give a mm of 3 and the search term
> is of only length 1 (like "Ginseng") what does edisMax do ?
>
>
> On Sun, Mar 30, 2014 at 1:21 PM, Jack Krupansky 
> wrote:
>
>  It still depends on your objective - which you haven't told us yet. Show
>> us some use cases and detail what your expectations are for each use case.
>>
>> The edismax phrase boosting is probably a lot more useful than messing
>> around with mm. Take a look at pf, pf2, and pf3.
>>
>> See:
>> http://wiki.apache.org/solr/ExtendedDisMax
>> https://cwiki.apache.org/confluence/display/solr/The+
>> Extended+DisMax+Query+Parser
>>
>> The focus on mm may indeed be a classic "XY Problem" - a premature focus
>> on a solution without detailing the problem.
>>
>> -- Jack Krupansky
>>
>> -Original Message- From: S.L
>> Sent: Sunday, March 30, 2014 11:18 AM
>> To: solr-user@lucene.apache.org
>> Subject: Re: eDismax parser and the mm parameter
>>
>> Thanks Jack! I understand the intent of mm parameter, my question is that
>> since the query terms being provided are not of fixed length I do not know
>> what the mm should like for example "Ginseng","Siberian Ginseng" are my
>> search terms. The first one can have an mm upto 1 and the second one can
>> have an mm of upto 2 .
>>
>> Should I dynamically set the mm based on the number of search terms in my
>> query ?
>>
>> Thanks again.
>>
>>
>> On Sun, Mar 30, 2014 at 5:20 AM, Jack Krupansky 
>> wrote:
>>
>>  1. Yes, the default for mm is 1.
>>
>>>
>>> 2. It depends on what you are really trying to do - you haven't told us.
>>>
>>> Generally, mm=1 is equivalent to q.op=OR, and mm=100% is equivalent to
>>> q.op=AND.
>>>
>>> Generally, use q.op unless you really know what you are doing.
>>>
>>> Generally, the intent of mm is to set the minimum number of OR/SHOULD
>>> clauses that must match on the top level of a query.
>>>
>>> -- Jack Krupansky
>>>
>>> -Original Message- From: S.L
>>> Sent: Sunday, March 30, 2014 2:25 AM
>>> To: solr-user@lucene.apache.org
>>> Subject: eDismax parser and the mm parameter
>>>
>>> Hi All,
>>>
>>> I am planning to use the eDismax query parser in SOLR to give boost to
>>> documents that have a phrase in their fields present. Now there is a mm
>>> parameter in the edismax parser query , since the query typed by the user
>>> could be of any length (i.e. >=1) I would like to set the mm value to 1 .
>>> I
>>> have the following questions regarding this parameter.
>>>
>>>   1. Is it set to 1 by default ?
>>>   2. In my schema.xml the defaultOperator is set to "AND" should I set it
>>>   to "OR" inorder for the edismax parser to be effective with a mm of 1?
>>>
>>>
>>> Thanks in advance!
>>>
>>>
>>>
>>
>


Re: eDismax parser and the mm parameter

2014-03-30 Thread Jack Krupansky
If you use pf, pf2, and pf3 and boost appropriately, the effects of mm will 
be dwarfed.
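
As a rough illustration of what such a request could look like, here is a small Python sketch that builds the query string. The parameter names (defType, q, qf, pf, pf2, pf3) are the real edismax ones, but the field names "title" and "body" and all boost values are made up for the example and would need tuning against real data.

```python
from urllib.parse import urlencode


def edismax_params(user_query):
    """Build a URL-encoded edismax query string with phrase boosting.

    Field names and boosts are hypothetical placeholders.
    """
    params = {
        "defType": "edismax",
        "q": user_query,
        "qf": "title body",        # fields searched for the individual terms
        "pf": "title^20 body^10",  # boost when the whole query matches as a phrase
        "pf2": "title^10 body^5",  # boost for each adjacent word pair
        "pf3": "title^15 body^7",  # boost for each adjacent word triple
    }
    return urlencode(params)


print(edismax_params("white siberian ginseng"))
```

The resulting string would be appended to the select handler URL of the collection being queried.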


The general goal is to assure that the top documents really are the best, 
not to necessarily limit the total document count. Focusing on the latter 
could be a real waste of time.


It's still not clear why or how you need or want to use OR as the default 
operator - you still haven't given us a use case for that.


To repeat: Give us a full set of use cases before taking this XY Problem 
approach of pursuing a solution before the problem is understood.


-- Jack Krupansky

-Original Message- 
From: S.L

Sent: Sunday, March 30, 2014 6:14 PM
To: solr-user@lucene.apache.org
Subject: Re: eDismax parser and the mm parameter

Jacks Thanks Again,

I am searching  Chinese medicine  documents , as the example I gave earlier
a user can search for "Ginseng" or Siberian Ginseng or Red Siberian Ginseng
, I certainly want to use pf parameter (which is not driven by mm
parameter) , however for giving higher score to documents that have more of
the terms I want to use edismax now if I give a mm of 3 and the search term
is of only length 1 (like "Ginseng") what does edisMax do ?


On Sun, Mar 30, 2014 at 1:21 PM, Jack Krupansky 
wrote:



It still depends on your objective - which you haven't told us yet. Show
us some use cases and detail what your expectations are for each use case.

The edismax phrase boosting is probably a lot more useful than messing
around with mm. Take a look at pf, pf2, and pf3.

See:
http://wiki.apache.org/solr/ExtendedDisMax
https://cwiki.apache.org/confluence/display/solr/The+
Extended+DisMax+Query+Parser

The focus on mm may indeed be a classic "XY Problem" - a premature focus
on a solution without detailing the problem.

-- Jack Krupansky

-Original Message- From: S.L
Sent: Sunday, March 30, 2014 11:18 AM
To: solr-user@lucene.apache.org
Subject: Re: eDismax parser and the mm parameter

Thanks Jack! I understand the intent of mm parameter, my question is that
since the query terms being provided are not of fixed length I do not know
what the mm should like for example "Ginseng","Siberian Ginseng" are my
search terms. The first one can have an mm upto 1 and the second one can
have an mm of upto 2 .

Should I dynamically set the mm based on the number of search terms in my
query ?

Thanks again.


On Sun, Mar 30, 2014 at 5:20 AM, Jack Krupansky 
wrote:

 1. Yes, the default for mm is 1.


2. It depends on what you are really trying to do - you haven't told us.

Generally, mm=1 is equivalent to q.op=OR, and mm=100% is equivalent to
q.op=AND.

Generally, use q.op unless you really know what you are doing.

Generally, the intent of mm is to set the minimum number of OR/SHOULD
clauses that must match on the top level of a query.

-- Jack Krupansky

-Original Message- From: S.L
Sent: Sunday, March 30, 2014 2:25 AM
To: solr-user@lucene.apache.org
Subject: eDismax parser and the mm parameter

Hi All,

I am planning to use the eDismax query parser in SOLR to give boost to
documents that have a phrase in their fields present. Now there is a mm
parameter in the edismax parser query , since the query typed by the user
could be of any length (i.e. >=1) I would like to set the mm value to 1 .
I
have the following questions regarding this parameter.

  1. Is it set to 1 by default ?
  2. In my schema.xml the defaultOperator is set to "AND" should I set it
  to "OR" inorder for the edismax parser to be effective with a mm of 1?


Thanks in advance!








Re: eDismax parser and the mm parameter

2014-03-30 Thread S.L
Jack, thanks again,

I am searching Chinese medicine documents. As in the example I gave earlier,
a user can search for "Ginseng" or "Siberian Ginseng" or "Red Siberian
Ginseng". I certainly want to use the pf parameter (which is not driven by the
mm parameter). However, to give a higher score to documents that have more of
the terms, I want to use edismax. Now, if I give an mm of 3 and the search term
is only of length 1 (like "Ginseng"), what does edismax do?
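
For what it is worth, the reference guide describes the mm result as never being larger than the number of optional clauses and never smaller than 1, and it describes negative values as "all clauses minus N". The Python sketch below only models that documented description for the simple forms ("3", "-1", "75%", "-25%"); it is not Solr's code, the function name is made up, and conditional specs such as "2<-25%" are not handled.

```python
def effective_mm(spec, optional_clauses):
    """Resolve a simple mm spec for a given number of optional clauses.

    Hypothetical model of the documented behaviour, not Solr's implementation.
    """
    spec = spec.strip()
    if spec.endswith("%"):
        percent = int(spec[:-1])
        # int() truncates, so the percentage of clauses is rounded down.
        delta = int(optional_clauses * abs(percent) / 100)
        wanted = delta if percent >= 0 else optional_clauses - delta
    else:
        value = int(spec)
        # A negative value means "all clauses minus this many".
        wanted = value if value >= 0 else optional_clauses + value
    # Clamp to the range 1..optional_clauses, as the guide describes.
    return max(1, min(wanted, optional_clauses))


print(effective_mm("3", 1))   # prints: 1
print(effective_mm("-1", 3))  # prints: 2
```

Under that reading, mm=3 with the single term "Ginseng" simply requires the one term, and mm=-1 gives the "number of terms minus 1" behaviour without computing it on the client.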


On Sun, Mar 30, 2014 at 1:21 PM, Jack Krupansky wrote:

> It still depends on your objective - which you haven't told us yet. Show
> us some use cases and detail what your expectations are for each use case.
>
> The edismax phrase boosting is probably a lot more useful than messing
> around with mm. Take a look at pf, pf2, and pf3.
>
> See:
> http://wiki.apache.org/solr/ExtendedDisMax
> https://cwiki.apache.org/confluence/display/solr/The+
> Extended+DisMax+Query+Parser
>
> The focus on mm may indeed be a classic "XY Problem" - a premature focus
> on a solution without detailing the problem.
>
> -- Jack Krupansky
>
> -Original Message- From: S.L
> Sent: Sunday, March 30, 2014 11:18 AM
> To: solr-user@lucene.apache.org
> Subject: Re: eDismax parser and the mm parameter
>
> Thanks Jack! I understand the intent of mm parameter, my question is that
> since the query terms being provided are not of fixed length I do not know
> what the mm should like for example "Ginseng","Siberian Ginseng" are my
> search terms. The first one can have an mm upto 1 and the second one can
> have an mm of upto 2 .
>
> Should I dynamically set the mm based on the number of search terms in my
> query ?
>
> Thanks again.
>
>
> On Sun, Mar 30, 2014 at 5:20 AM, Jack Krupansky 
> wrote:
>
>  1. Yes, the default for mm is 1.
>>
>> 2. It depends on what you are really trying to do - you haven't told us.
>>
>> Generally, mm=1 is equivalent to q.op=OR, and mm=100% is equivalent to
>> q.op=AND.
>>
>> Generally, use q.op unless you really know what you are doing.
>>
>> Generally, the intent of mm is to set the minimum number of OR/SHOULD
>> clauses that must match on the top level of a query.
>>
>> -- Jack Krupansky
>>
>> -Original Message- From: S.L
>> Sent: Sunday, March 30, 2014 2:25 AM
>> To: solr-user@lucene.apache.org
>> Subject: eDismax parser and the mm parameter
>>
>> Hi All,
>>
>> I am planning to use the eDismax query parser in SOLR to give boost to
>> documents that have a phrase in their fields present. Now there is a mm
>> parameter in the edismax parser query , since the query typed by the user
>> could be of any length (i.e. >=1) I would like to set the mm value to 1 .
>> I
>> have the following questions regarding this parameter.
>>
>>   1. Is it set to 1 by default ?
>>   2. In my schema.xml the defaultOperator is set to "AND" should I set it
>>   to "OR" inorder for the edismax parser to be effective with a mm of 1?
>>
>>
>> Thanks in advance!
>>
>>
>


Re: SOLR Cloud 4.6 - PERFORMANCE WARNING: Overlapping onDeckSearchers=2

2014-03-30 Thread Shawn Heisey
On 3/30/2014 2:59 PM, Rishi Easwaran wrote:
> RAM shouldn't be a problem. 
> I have a box with 144GB RAM, running 12 instances with 4GB Java heap each.
> There are 9 instances wrting to 1TB of SSD disk space. 
>  Other 3 are writing to SATA drives, and have autosoftcommit disabled.

This brought up more questions than it answered.  I was assuming that
you only had a total of 4GB of index data, but after reading this, I
think my assumption may be incorrect.  If you add up all the Solr index
data on the SSD, how much disk space does it take?

You should not be running more than one instance of Solr per machine.
One instance of Solr can run multiple indexes.  Running more than one
results in quite a lot of overhead, and it seems unlikely that you would
need to dedicate 48GB of total RAM to the Java heap.

Thanks,
Shawn



Re: SOLR Cloud 4.6 - PERFORMANCE WARNING: Overlapping onDeckSearchers=2

2014-03-30 Thread Rishi Easwaran
RAM shouldn't be a problem. 
I have a box with 144GB RAM, running 12 instances with 4GB Java heap each.
There are 9 instances writing to 1TB of SSD disk space. 
The other 3 are writing to SATA drives, and have autoSoftCommit disabled.

 

 

-Original Message-
From: Shawn Heisey 
To: solr-user 
Sent: Fri, Mar 28, 2014 8:35 pm
Subject: Re: SOLR Cloud 4.6 - PERFORMANCE WARNING: Overlapping onDeckSearchers=2


On 3/28/2014 4:07 PM, Rishi Easwaran wrote:
> 
>  Shawn,
> 
> I changed the autoSoftCommit value to 15000 (15 sec). 
> My index size is pretty small ~4GB and its running on a SSD drive with ~100 GB
> space on it.
> Now I see the warn message every 15 seconds.
> 
> The caches I think are minimal
> 
> 
> 
> 
initialSize="512" autowarmCount="0"/>
>  
initialSize="512"   autowarmCount="0"/>
> 
> 200
> 
> I think still something is going on. I mean 15s on SSD drives is a long time
> to handle a 4GB index.

How much RAM do you have and what size is your max java heap?

https://wiki.apache.org/solr/SolrPerformanceProblems#RAM

Thanks,
Shawn


 


Re: zookeeper reconnect failure

2014-03-30 Thread Mark Miller
We don’t currently retry, but I don’t think it would hurt much if we did - at 
least briefly.

If you want to file a JIRA issue, that would be the best way to get it in a 
future release.
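
To make the "retry briefly" idea concrete, here is a minimal sketch in Python rather than in Solr's Java. It is not the Solr or ZooKeeper client code; the function name, the attempt count and the delay are all assumptions. It only shows a bounded retry around host name resolution, which is the step that failed in the stack trace.

```python
import socket
import time


def resolve_with_retry(host, attempts=3, delay_seconds=1.0,
                       resolver=socket.gethostbyname):
    """Resolve `host`, retrying a few times on resolution failures.

    Hypothetical helper; `resolver` is injectable so it can be faked in tests.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return resolver(host)
        except socket.gaierror as error:
            last_error = error
            if attempt < attempts - 1:
                time.sleep(delay_seconds)  # brief pause before the next try
    # Every attempt failed; surface the last resolution error.
    raise last_error
```

A short, bounded retry like this would ride out a brief DNS outage while still failing loudly if the name really does not exist.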

-- 
Mark Miller
about.me/markrmiller

On March 28, 2014 at 5:40:47 PM, Michael Della Bitta 
(michael.della.bi...@appinions.com) wrote:

Hi, Jessica,  

We've had a similar problem when DNS resolution of our Hadoop task nodes  
has failed. They tend to take a dirt nap until you fix the problem  
manually. Are you experiencing this in AWS as well?  

I'd say the two things to do are to poll the node state via HTTP using a  
monitoring tool so you get an immediate notification of the problem, and to  
install some sort of caching server like nscd if you expect to have DNS  
resolution failures regularly.  



Michael Della Bitta  

Applications Developer  

o: +1 646 532 3062  

appinions inc.  

"The Science of Influence Marketing"  

18 East 41st Street  

New York, NY 10017  

t: @appinions | g+: plus.google.com/appinions
w: appinions.com


On Fri, Mar 28, 2014 at 4:27 PM, Jessica Mallet wrote:  

> Hi,  
>  
> First off, I'd like to give a disclaimer that this probably is a very edge  
> case issue. However, since it happened to us, I would like to get some  
> advice on how to best handle this failure scenario.  
>  
> Basically, we had some network issue where we temporarily lost connection  
> and DNS. The zookeeper client properly triggered the watcher. However, when  
> trying to reconnect, this following Exception is thrown:  
>  
> 2014-03-27 17:24:46,882 ERROR [main-EventThread] SolrException.java (line 121) :java.net.UnknownHostException: : Name or service not known
> at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
> at java.net.InetAddress$1.lookupAllHostAddr(InetAddress.java:866)
> at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1258)
> at java.net.InetAddress.getAllByName0(InetAddress.java:1211)
> at java.net.InetAddress.getAllByName(InetAddress.java:1127)
> at java.net.InetAddress.getAllByName(InetAddress.java:1063)
> at org.apache.zookeeper.client.StaticHostProvider.<init>(StaticHostProvider.java:60)
> at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:445)
> at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:380)
> at org.apache.solr.common.cloud.SolrZooKeeper.<init>(SolrZooKeeper.java:41)
> at org.apache.solr.common.cloud.DefaultConnectionStrategy.reconnect(DefaultConnectionStrategy.java:53)
> at org.apache.solr.common.cloud.ConnectionManager.process(ConnectionManager.java:147)
> at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:519)
> at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:495)
>  
> I tried to look at the code and it seems that there'd be no further retries  
> to connect to Zookeeper, and the node is basically left in a bad state and  
> will not recover on its own. (Please correct me if I'm reading this wrong.)  
> Thinking about it, this is probably fair, since normally you wouldn't  
> expect retries to fix an "unknown host" issue--even though in our case it  
> would have--but I'm wondering what we should do to handle this situation if  
> it happens again in the future.  
>  
> Any advice is appreciated.  
>  
> Thanks,  
> Jessica  
>  
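
One generic way to ride out a short DNS outage like the one described above is to retry name resolution with a growing delay before giving up. The sketch below is Python, not Solr's Java, and it is not how DefaultConnectionStrategy behaves; resolve_with_retry, its parameters, and the host name in the usage note are all hypothetical.

```python
import socket
import time


def resolve_with_retry(host, port, attempts=5, base_delay=1.0,
                       resolver=socket.getaddrinfo, sleep=time.sleep):
    """Resolve host:port, retrying with exponential backoff on DNS failure.

    `resolver` and `sleep` are injectable so the behaviour can be tested
    without touching the network or actually waiting.
    """
    for attempt in range(attempts):
        try:
            return resolver(host, port)
        except socket.gaierror:
            if attempt == attempts - 1:
                raise  # out of retries: let the caller decide what to do
            sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
```

A caller would use something like resolve_with_retry("zk1.example.com", 2181) before attempting to rebuild its ZooKeeper client. Whether retrying is appropriate depends on how likely the "unknown host" condition is to be temporary in a given environment.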


Re: SolrCloud OR distributed Solr

2014-03-30 Thread Erick Erickson
Distributed solr is simply the ability for Solr to take the incoming
query and send it to multiple shards, then aggregate the response.
Here a "shard" is a physical partition of a single logical index. The
assumption is that you can't fit the entire index on a single machine
and still get the performance you need, so you use N smaller "parts".

So, there has to be some mechanism to send the request to each
sub-index and assemble the response and give it back to the client.
That's "distributed Solr".

Before 4.0, splitting the index up was entirely manual, _you_ decided
what document went to what shard. _you_ configured Solr to "know"
about where the other shards were. _you_ handled the situation where a
node went down and you had to "heal" the network. But it was still
using "distributed search".


As of 4.0, SolrCloud happens. The differences are
1> you can have Solr automatically distribute the docs to the right shard.
2> when a node goes down, Solr can automatically compensate (assuming
more than one replica/shard)
3> when the node comes back up, Solr will automatically re-synchronize
the node before (automatically) bringing it back into service

NOTE: you can still use old-style manual sharding if you choose, it's
available in 4.x
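
To illustrate what "automatically distribute the docs to the right shard" means in principle, here is a minimal Python sketch that picks a shard from a stable hash of the document id. It is only an illustration: SolrCloud's actual routing differs in its details, and shard_for, the CRC32 hash, and the sample ids are assumptions made for this example.

```python
import zlib


def shard_for(doc_id, num_shards):
    """Pick a shard index from a stable hash of the document id."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


# The same id always lands on the same shard, so updates and deletes
# for a document can be sent to the shard that already holds it.
docs = ["sku-1001", "sku-1002", "sku-1003", "sku-1004"]
placement = {doc_id: shard_for(doc_id, 3) for doc_id in docs}
```

With old-style manual sharding, a rule like this lives in your own indexing code; with SolrCloud the equivalent decision is made for you.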

But be careful here and draw a distinction between "distributed
search" and "federated search".
Distributed search - what we've been talking about, the underlying
assumption is that the sub-indexes are all substantially similar.

Federated search - the sub-indexes (or, indeed, complete
self-contained indexes) may have no relation to each other and you're
somehow expected to search them all and return the results. In this
case you'll probably be firing off N separate queries (one to each of
N indexes) and assembling them at the app layer.
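
The "send the request to each sub-index and assemble the response" step can be sketched as follows. This is a toy in-memory model in Python, not Solr code: the shards are plain dictionaries, the score is a simple term count, and distributed_search and search_shard are hypothetical names.

```python
import heapq


def search_shard(shard, term, rows):
    """Toy per-shard search: score = number of occurrences of the term."""
    hits = [(doc_id, text.lower().split().count(term))
            for doc_id, text in shard.items()]
    hits = [hit for hit in hits if hit[1] > 0]
    return heapq.nlargest(rows, hits, key=lambda hit: hit[1])


def distributed_search(shards, term, rows=3):
    """Send the same query to every shard and merge the hits by score."""
    partial = []
    for shard in shards:
        partial.extend(search_shard(shard, term, rows))
    return heapq.nlargest(rows, partial, key=lambda hit: hit[1])


shard_a = {"a1": "ginseng ginseng ginseng tea", "a2": "green tea"}
shard_b = {"b1": "ginseng ginseng extract", "b2": "siberian ginseng root"}
top = distributed_search([shard_a, shard_b], "ginseng", rows=2)
```

Merging by score only makes sense because both shards are parts of one logical index and score documents the same way. In the federated case the per-index scores are generally not comparable, which is why the assembly has to be decided at the app layer.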

Best,
Erick

On Sun, Mar 30, 2014 at 1:42 PM, Priti Solanki  wrote:
> Hello Member,
>
> Is there any difference between distributed solr & solrCloud ?
>
> Consider that I have products for three countries. I have indexed one
> country's data and its index size is 160 GB+.
>
> Now we have the other two countries, and I am confused!
>
> My client asks what the difference is if we procure another Solr server
> and index separately. I was thinking of SolrCloud. Can someone explain
> how we can describe these two approaches in simple words? If there are
> any reading links, please share.
>
> Thanks


Re: SolrCloud OR distributed Solr

2014-03-30 Thread Gora Mohanty
On 30 March 2014 23:12, Priti Solanki  wrote:
>
> Hello Member,
>
> Is there any difference between distributed solr & solrCloud ?

You might be confusing the older Solr distributed search with the new SolrCloud:
* Older distributed search: https://wiki.apache.org/solr/DistributedSearch
* SolrCloud: https://cwiki.apache.org/confluence/display/solr/SolrCloud

> Consider that I have products for three countries. I have indexed one
> country's data and its index size is 160 GB+.
>
> Now we have the other two countries, and I am confused!
>
> My client asks what the difference is if we procure another Solr server
> and index separately. I was thinking of SolrCloud. Can someone explain
> how we can describe these two approaches in simple words? If there are
> any reading links, please share.

With 4.0+ versions of Solr, you probably want to go for SolrCloud.

Regards,
Gora


SolrCloud OR distributed Solr

2014-03-30 Thread Priti Solanki
Hello Member,

Is there any difference between distributed solr & solrCloud ?

Consider that I have products for three countries. I have indexed one
country's data and its index size is 160 GB+.

Now we have the other two countries, and I am confused!

My client asks what the difference is if we procure another Solr server
and index separately. I was thinking of SolrCloud. Can someone explain
how we can describe these two approaches in simple words? If there are
any reading links, please share.

Thanks


Re: Context-aware suggesters in Solr

2014-03-30 Thread Alan Woodward
Thanks Areek.  So looking at the code in trunk, exposing it to Solr looks to be 
pretty straightforward - just extending DocumentDictionaryFactory to take a 
'contextField' parameter as well, and passing that on to the DocumentDictionary 
constructor.  I'll give it a go!

Thanks again.

Alan Woodward
www.flax.co.uk


On 29 Mar 2014, at 22:29, Areek Zillur wrote:

> The context field can only be set at configuration-time for the
> AnalyzingInfixSuggester (FYI: CONTEXTS_FIELD_NAME refers to the field in
> Lucene index that is internally maintained by the suggester and does not
> reflect any field in user's index). The context field can be specified and
> fed into the suggester using DocumentDictionary,
> DocumentValueSourceDictionary etc, (the support for contexts in
> FileDictionary is not there yet).
> 
> The context-aware functionality is not yet exposed to Solr.
> 
> There were attempts to make Analyzing/FuzzySuggester
> context-aware (LUCENE-5350; patch might be outdated), but it's still not in
> trunk (see jira discussion).
> 
> Hope that helps,
> 
> Areek
> 
> 
> On Fri, Mar 28, 2014 at 3:47 AM, Alan Woodward  wrote:
> 
>> Hi all,
>> 
>> I have a few questions about the context-aware AnalyzingInfixSuggester:
>> - is it possible to choose a specific field for the context at runtime
>> (say, I want to limit suggestions by a field that I've already faceted on),
>> or is it limited to the hardcoded CONTEXTS_FIELD_NAME?
>> - is the context-aware functionality exposed to Solr yet?
>> - how difficult would it be to add similar functionality to the other
>> suggesters, if say I only wanted to do prefix matching?
>> 
>> Thanks,
>> 
>> Alan Woodward
>> www.flax.co.uk
>> 
>> 
>> 



Re: eDismax parser and the mm parameter

2014-03-30 Thread Jack Krupansky
It still depends on your objective - which you haven't told us yet. Show us 
some use cases and detail what your expectations are for each use case.


The edismax phrase boosting is probably a lot more useful than messing 
around with mm. Take a look at pf, pf2, and pf3.
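
As a rough illustration of the phrase-boost parameters, the Python sketch below builds an edismax request that uses pf, pf2, and pf3. The parameter names are edismax's own; the URL, the field names (title, description), and the boost values are assumptions chosen for this example.

```python
from urllib.parse import urlencode

params = {
    "defType": "edismax",
    "q": "siberian ginseng",
    "qf": "title description",  # hypothetical fields to search
    "pf": "title^10",           # boost docs where the whole query appears as a phrase
    "pf2": "title^5",           # boost docs matching adjacent word pairs
    "pf3": "title^3",           # boost docs matching adjacent word triples
}
query_string = urlencode(params)

# Hypothetical local Solr instance; adjust host and handler path as needed.
url = "http://localhost:8983/solr/select?" + query_string
```

The phrase fields only affect ranking: documents still have to match the main query first, and those that also contain the terms close together are scored higher.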


See:
http://wiki.apache.org/solr/ExtendedDisMax
https://cwiki.apache.org/confluence/display/solr/The+Extended+DisMax+Query+Parser

The focus on mm may indeed be a classic "XY Problem" - a premature focus on 
a solution without detailing the problem.


-- Jack Krupansky

-Original Message- 
From: S.L

Sent: Sunday, March 30, 2014 11:18 AM
To: solr-user@lucene.apache.org
Subject: Re: eDismax parser and the mm parameter

Thanks Jack! I understand the intent of the mm parameter. My question is that,
since the query terms being provided are not of fixed length, I do not know
what the mm should look like. For example, "Ginseng" and "Siberian Ginseng" are
my search terms. The first one can have an mm of up to 1 and the second one can
have an mm of up to 2.

Should I dynamically set the mm based on the number of search terms in my
query ?

Thanks again.


On Sun, Mar 30, 2014 at 5:20 AM, Jack Krupansky 
wrote:



1. Yes, the default for mm is 1.

2. It depends on what you are really trying to do - you haven't told us.

Generally, mm=1 is equivalent to q.op=OR, and mm=100% is equivalent to
q.op=AND.

Generally, use q.op unless you really know what you are doing.

Generally, the intent of mm is to set the minimum number of OR/SHOULD
clauses that must match on the top level of a query.

-- Jack Krupansky

-Original Message- From: S.L
Sent: Sunday, March 30, 2014 2:25 AM
To: solr-user@lucene.apache.org
Subject: eDismax parser and the mm parameter

Hi All,

I am planning to use the eDismax query parser in Solr to boost
documents that contain a phrase in their fields. There is an mm
parameter in the edismax query parser; since the query typed by the user
could be of any length (i.e. >=1), I would like to set the mm value to 1. I
have the following questions regarding this parameter.

  1. Is it set to 1 by default?
  2. In my schema.xml the defaultOperator is set to "AND". Should I set it
  to "OR" in order for the edismax parser to be effective with an mm of 1?


Thanks in advance!





Re: eDismax parser and the mm parameter

2014-03-30 Thread S.L
Thanks Jack! I understand the intent of the mm parameter. My question is that,
since the query terms being provided are not of fixed length, I do not know
what the mm should look like. For example, "Ginseng" and "Siberian Ginseng" are
my search terms. The first one can have an mm of up to 1 and the second one can
have an mm of up to 2.

Should I dynamically set the mm based on the number of search terms in my
query ?

Thanks again.


On Sun, Mar 30, 2014 at 5:20 AM, Jack Krupansky wrote:

> 1. Yes, the default for mm is 1.
>
> 2. It depends on what you are really trying to do - you haven't told us.
>
> Generally, mm=1 is equivalent to q.op=OR, and mm=100% is equivalent to
> q.op=AND.
>
> Generally, use q.op unless you really know what you are doing.
>
> Generally, the intent of mm is to set the minimum number of OR/SHOULD
> clauses that must match on the top level of a query.
>
> -- Jack Krupansky
>
> -Original Message- From: S.L
> Sent: Sunday, March 30, 2014 2:25 AM
> To: solr-user@lucene.apache.org
> Subject: eDismax parser and the mm parameter
>
> Hi All,
>
> I am planning to use the eDismax query parser in Solr to boost
> documents that contain a phrase in their fields. There is an mm
> parameter in the edismax query parser; since the query typed by the user
> could be of any length (i.e. >=1), I would like to set the mm value to 1. I
> have the following questions regarding this parameter.
>
>   1. Is it set to 1 by default?
>   2. In my schema.xml the defaultOperator is set to "AND". Should I set it
>   to "OR" in order for the edismax parser to be effective with an mm of 1?
>
>
> Thanks in advance!
>


Re: eDismax parser and the mm parameter

2014-03-30 Thread simpleliving...@gmail.com
Thanks Ahmet.

So if it's a single-term query like 'Ginseng', what does mm=3 do to the query? I 
am guessing it would be reduced to 1 automatically in this case.

Sent from my HTC

- Reply message -
From: "Ahmet Arslan" 
To: "solr-user@lucene.apache.org" 
Subject: eDismax parser and the mm parameter
Date: Sun, Mar 30, 2014 7:52 AM

Hi,

Using mm=1 with (e)dismax is not a good idea. Your users will be unhappy, 
because there is no coord factor with this parser.
The coord factor is about: "Typically, a document that contains more of the query's terms 
will receive a higher score than another document with fewer query terms."

I suggest you use something more restrictive: "3<-1 6<80%"


I think there is a new autoRelax feature in some ticket. Even better, start with 
mm=100% and relax the mm value until you retrieve *enough* documents. 

It is OK to use a default operator of OR with the default query parser, because the coord 
factor kicks in.

http://lucene.apache.org/core/3_0_3/api/all/org/apache/lucene/search/Similarity.html#formula_coord

https://wiki.apache.org/solr/DisMaxQParserPlugin#mm_.28Minimum_.27Should.27_Match.29


Ahmet


On Sunday, March 30, 2014 12:21 PM, Jack Krupansky  
wrote:
1. Yes, the default for mm is 1.

2. It depends on what you are really trying to do - you haven't told us.

Generally, mm=1 is equivalent to q.op=OR, and mm=100% is equivalent to 
q.op=AND.

Generally, use q.op unless you really know what you are doing.

Generally, the intent of mm is to set the minimum number of OR/SHOULD 
clauses that must match on the top level of a query.

-- Jack Krupansky


-Original Message- 
From: S.L
Sent: Sunday, March 30, 2014 2:25 AM
To: solr-user@lucene.apache.org
Subject: eDismax parser and the mm parameter

Hi All,

I am planning to use the eDismax query parser in Solr to boost
documents that contain a phrase in their fields. There is an mm
parameter in the edismax query parser; since the query typed by the user
could be of any length (i.e. >=1), I would like to set the mm value to 1. I
have the following questions regarding this parameter.

1. Is it set to 1 by default?
2. In my schema.xml the defaultOperator is set to "AND". Should I set it
to "OR" in order for the edismax parser to be effective with an mm of 1?


Thanks in advance!

Re: eDismax parser and the mm parameter

2014-03-30 Thread Ahmet Arslan
Hi,

Using mm=1 with (e)dismax is not a good idea. Your users will be unhappy, 
because there is no coord factor with this parser.
The coord factor is about: "Typically, a document that contains more of the query's terms 
will receive a higher score than another document with fewer query terms."

I suggest you use something more restrictive: "3<-1 6<80%"
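
How a conditional mm spec of this kind plays out for different query lengths can be sketched in Python. This is a simplified model of the rules described on the DisMax wiki page linked further down, not Solr's own code; min_should_match is a hypothetical helper, and it assumes there are no spaces around the "<" signs.

```python
def min_should_match(optional_clauses, spec):
    """Return how many optional clauses must match for the given mm spec."""
    result = optional_clauses
    spec = spec.strip()
    if "<" in spec:
        # Conditional form: "N<value" applies only when there are
        # more than N optional clauses; otherwise all are required.
        for part in spec.split():
            bound, value = part.split("<", 1)
            if optional_clauses <= int(bound):
                return result
            result = min_should_match(optional_clauses, value)
        return result
    if spec.endswith("%"):
        percent = int(spec[:-1])
        amount = optional_clauses * abs(percent) // 100  # rounded down
        result = amount if percent >= 0 else optional_clauses - amount
    else:
        number = int(spec)
        result = number if number >= 0 else optional_clauses + number
    # Never require more clauses than exist, and never fewer than zero.
    return max(0, min(optional_clauses, result))


spec = "3<-1 6<80%"
required = {n: min_should_match(n, spec) for n in (1, 2, 3, 4, 6, 7, 10)}
```

Under this model, queries of one to three terms require every term, queries of four to six terms may miss one, and longer queries need 80% of their terms. A plain value larger than the number of terms is capped in the same way, so mm=3 on a single-term query requires just that one term.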


I think there is a new autoRelax feature in some ticket. Even better, start with 
mm=100% and relax the mm value until you retrieve *enough* documents. 
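
The "start strict and relax" idea can be sketched like this. It is a toy in-memory model in Python: against a real Solr instance each pass of the loop would be a separate request with a lower mm value, and search_with_relaxation, matches, the sample documents, and the "enough" threshold are all assumptions made for the example.

```python
def matches(doc_terms, query_terms, required):
    """True when at least `required` of the query terms occur in the document."""
    return sum(term in doc_terms for term in query_terms) >= required


def search_with_relaxation(docs, query_terms, enough=2):
    """Require every term first, then one fewer each pass until enough docs match."""
    required, hits = 0, []
    for required in range(len(query_terms), 0, -1):
        hits = [doc_id for doc_id, terms in docs.items()
                if matches(terms, query_terms, required)]
        if len(hits) >= enough:
            break
    return required, hits


docs = {
    "d1": {"siberian", "ginseng", "extract"},
    "d2": {"ginseng", "tea"},
    "d3": {"green", "tea"},
}
used_mm, found = search_with_relaxation(docs, ["siberian", "ginseng"], enough=2)
```

Here the strict pass finds only d1, so the requirement drops to one term and d2 is added. The cost of this approach is the extra round trips whenever the strict query comes back short.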

It is OK to use a default operator of OR with the default query parser, because the coord 
factor kicks in.

http://lucene.apache.org/core/3_0_3/api/all/org/apache/lucene/search/Similarity.html#formula_coord

https://wiki.apache.org/solr/DisMaxQParserPlugin#mm_.28Minimum_.27Should.27_Match.29


Ahmet


On Sunday, March 30, 2014 12:21 PM, Jack Krupansky  
wrote:
1. Yes, the default for mm is 1.

2. It depends on what you are really trying to do - you haven't told us.

Generally, mm=1 is equivalent to q.op=OR, and mm=100% is equivalent to 
q.op=AND.

Generally, use q.op unless you really know what you are doing.

Generally, the intent of mm is to set the minimum number of OR/SHOULD 
clauses that must match on the top level of a query.

-- Jack Krupansky


-Original Message- 
From: S.L
Sent: Sunday, March 30, 2014 2:25 AM
To: solr-user@lucene.apache.org
Subject: eDismax parser and the mm parameter

Hi All,

I am planning to use the eDismax query parser in Solr to boost
documents that contain a phrase in their fields. There is an mm
parameter in the edismax query parser; since the query typed by the user
could be of any length (i.e. >=1), I would like to set the mm value to 1. I
have the following questions regarding this parameter.

   1. Is it set to 1 by default?
   2. In my schema.xml the defaultOperator is set to "AND". Should I set it
   to "OR" in order for the edismax parser to be effective with an mm of 1?


Thanks in advance!


Re: eDismax parser and the mm parameter

2014-03-30 Thread Jack Krupansky

1. Yes, the default for mm is 1.

2. It depends on what you are really trying to do - you haven't told us.

Generally, mm=1 is equivalent to q.op=OR, and mm=100% is equivalent to 
q.op=AND.


Generally, use q.op unless you really know what you are doing.

Generally, the intent of mm is to set the minimum number of OR/SHOULD 
clauses that must match on the top level of a query.
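
The rough equivalence between mm and q.op can be seen on a toy example. The Python sketch below models documents as sets of terms and is not Solr code; the documents and the match_count helper are made up for illustration.

```python
docs = {
    "d1": {"siberian", "ginseng"},
    "d2": {"ginseng", "tea"},
    "d3": {"green", "tea"},
}
terms = ["siberian", "ginseng"]


def match_count(doc_terms):
    """Number of query terms present in the document."""
    return sum(term in doc_terms for term in terms)


# mm=1: at least one term must match, which behaves like OR.
any_term = [d for d, t in docs.items() if match_count(t) >= 1]

# mm=100%: every term must match, which behaves like AND.
all_terms = [d for d, t in docs.items() if match_count(t) >= len(terms)]
```

Values of mm between those two extremes are where it offers something that q.op alone cannot express.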


-- Jack Krupansky

-Original Message- 
From: S.L

Sent: Sunday, March 30, 2014 2:25 AM
To: solr-user@lucene.apache.org
Subject: eDismax parser and the mm parameter

Hi All,

I am planning to use the eDismax query parser in Solr to boost
documents that contain a phrase in their fields. There is an mm
parameter in the edismax query parser; since the query typed by the user
could be of any length (i.e. >=1), I would like to set the mm value to 1. I
have the following questions regarding this parameter.

  1. Is it set to 1 by default?
  2. In my schema.xml the defaultOperator is set to "AND". Should I set it
  to "OR" in order for the edismax parser to be effective with an mm of 1?


Thanks in advance!