Re: Parsing dating during indexing - Year Only

2015-06-19 Thread Chris Hostetter

I'm not sure I understand your question ...

If you know that you are only ever going to have the 'year', then why not 
just index the year as an int?

A TrieDateField isn't really of any use to you, because normal date-type 
usage (date math, date ranges) is useless when you don't have any real 
date values (i.e., it's ambiguous whether 2007 should match 
just_the_year:[2006-06-01T00:00:00Z TO 2007-06-01T00:00:00Z]).
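(For reference, that option is just a schema line; the field name is from the example above and the type name "tint" assumes the stock example schema's TrieIntField definition:)

  <!-- hypothetical sketch: treat the year as a plain integer field -->
  <field name="just_the_year" type="tint" indexed="true" stored="true"/>
  <!-- year ranges then stay unambiguous, e.g. fq=just_the_year:[2000 TO 2010] -->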


If you really need a true date field because *most* of your documents have 
real dates, but you only sometimes ingest documents with just the 
year, and when you ingest documents like this you want to assume some 
fixed month/day/hour/etc., then you can easily do this with update 
processors ... consider a chain of...

  RegexReplaceProcessorFactory: 
    just_the_year: ^(\d+)$ -> $1-01-01T00:00:00Z
  CloneFieldUpdateProcessorFactory: 
    just_the_year -> real_date_field
  FirstFieldValueUpdateProcessorFactory:
    real_date_field 

(if a doc already had a value in the real field, ignore the new year only value)

https://lucene.apache.org/solr/5_2_0/solr-core/org/apache/solr/update/processor/CloneFieldUpdateProcessorFactory.html
https://lucene.apache.org/solr/5_2_0/solr-core/org/apache/solr/update/processor/RegexReplaceProcessorFactory.html
https://lucene.apache.org/solr/5_2_0/solr-core/org/apache/solr/update/processor/FirstFieldValueUpdateProcessorFactory.html
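For concreteness, such a chain might look roughly like this in solrconfig.xml (an untested sketch; the chain name is made up, the field names come from the example above, and the exact parameter names, in particular whether backreferences need literalReplacement=false, should be checked against the javadocs linked above):

  <updateRequestProcessorChain name="year-to-date">
    <!-- 1) rewrite a bare year in just_the_year into a full ISO timestamp -->
    <processor class="solr.RegexReplaceProcessorFactory">
      <str name="fieldName">just_the_year</str>
      <str name="pattern">^(\d+)$</str>
      <str name="replacement">$1-01-01T00:00:00Z</str>
      <bool name="literalReplacement">false</bool>
    </processor>
    <!-- 2) copy the rewritten value into the real date field -->
    <processor class="solr.CloneFieldUpdateProcessorFactory">
      <str name="source">just_the_year</str>
      <str name="dest">real_date_field</str>
    </processor>
    <!-- 3) if the doc already had a real date, keep that first value and drop the clone -->
    <processor class="solr.FirstFieldValueUpdateProcessorFactory">
      <str name="fieldName">real_date_field</str>
    </processor>
    <processor class="solr.LogUpdateProcessorFactory"/>
    <processor class="solr.RunUpdateProcessorFactory"/>
  </updateRequestProcessorChain>

The chain would then be selected on the update request with update.chain=year-to-date, or set as the default chain on the update handler.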


: Date: Fri, 19 Jun 2015 13:57:04 -0700 (MST)
: From: levanDev levandev9...@gmail.com
: Reply-To: solr-user@lucene.apache.org
: To: solr-user@lucene.apache.org
: Subject: Parsing dating during indexing - Year Only
: 
: Hello,
: 
: Example csv doc has column 'just_the_year' and value '2010':  
: 
: With the Schema API I can tell the indexing process to treat 'just_the_year'
: as a date field. 
: 
: I know that I can update the solrconfig.xml to correctly parse formats such
: as MM/dd/yyyy (which is awesome) but has anyone tried to convert just the
: year value to a full date (2010-01-01T00:00:00Z) by updating the
: solrconfig.xml?
: 
: I know it's possible to import csv, do the date transformation, export again
: and have everything work nicely but it would be cool to reduce the number of
: steps involved and use the powerful date processor. 
: 
: Thank you, 
: Levan
: 
: 
: 
: --
: View this message in context: 
http://lucene.472066.n3.nabble.com/Parsing-dating-during-indexing-Year-Only-tp4213045.html
: Sent from the Solr - User mailing list archive at Nabble.com.
: 

-Hoss
http://www.lucidworks.com/


Parsing dating during indexing - Year Only

2015-06-19 Thread levanDev
Hello,

Example csv doc has column 'just_the_year' and value '2010':  

With the Schema API I can tell the indexing process to treat 'just_the_year'
as a date field. 

I know that I can update the solrconfig.xml to correctly parse formats such
as MM/dd/yyyy (which is awesome) but has anyone tried to convert just the
year value to a full date (2010-01-01T00:00:00Z) by updating the
solrconfig.xml?

I know it's possible to import csv, do the date transformation, export again
and have everything work nicely but it would be cool to reduce the number of
steps involved and use the powerful date processor. 

Thank you, 
Levan



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Parsing-dating-during-indexing-Year-Only-tp4213045.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: CollapseQParserPluging Incorrect Facet Counts

2015-06-19 Thread Joel Bernstein
The CollapsingQParserPlugin does not provide facet counts that are the
same as the group.facet feature in Grouping. It provides facet counts that
behave like group.truncate.

The CollapsingQParserPlugin only collapses the result set. The facet
counts are then generated for the collapsed result set by the
FacetComponent.

This has been a hot topic of late.

Joel Bernstein
http://joelsolr.blogspot.com/

On Fri, Jun 19, 2015 at 3:54 PM, Carlos Maroto charlie.mar...@gmail.com
wrote:

 Hi,

 We are comparing results between Field Collapsing (group* parameters) and
 CollapseQParserPlugin.  We noticed that some facets are returning incorrect
 counts.

 Here are the relevant parameters of one of our test queries:

 Field Collapsing:
 ---

 q=red%20dress&facet=true&facet.mincount=1&facet.limit=-1&facet.field=searchcolorfacet&group=true&group.field=groupid&group.facet=true&group.ngroups=true

 ngroups = 5964

 <lst name="searchcolorfacet">
 ...
 <int name="red">11</int>
 ...
 </lst>

 CollapseQParserPlugin:

 q=red%20dress&facet=true&facet.mincount=1&facet.limit=-1&facet.field=searchcolorfacet&fq=%7B!collapse%20field=groupid%7D

 numFound = 5964 (same)

 <lst name="searchcolorfacet">
 ...
 <int name="red">8</int>
 ...
 </lst>

 When we change the CollapseQParserPlugin query by adding
 fq=searchcolorfacet:red, the numFound value is 11, effectively showing
 all 11 hits with that color.  The facet count for red now shows the correct
 value of 11 as well.

 Has anyone seen something similar?

 Thanks,
 Carlos



Re: CollapseQParserPluging Incorrect Facet Counts

2015-06-19 Thread Joel Bernstein
If you see the last comment on:

https://issues.apache.org/jira/browse/SOLR-6143

You'll see there is a discussion starting about adding this feature.

Joel Bernstein
http://joelsolr.blogspot.com/

On Fri, Jun 19, 2015 at 4:14 PM, Joel Bernstein joels...@gmail.com wrote:

 The CollapsingQParserPlugin does not provide facet counts that are the
 same as the group.facet feature in Grouping. It provides facet counts that
 behave like group.truncate.

 The CollapsingQParserPlugin only collapses the result set. The facet
 counts are then generated for the collapsed result set by the
 FacetComponent.

 This has been a hot topic of late.

 Joel Bernstein
 http://joelsolr.blogspot.com/

 On Fri, Jun 19, 2015 at 3:54 PM, Carlos Maroto charlie.mar...@gmail.com
 wrote:

 Hi,

 We are comparing results between Field Collapsing (group* parameters) and
 CollapseQParserPlugin.  We noticed that some facets are returning
 incorrect
 counts.

 Here are the relevant parameters of one of our test queries:

 Field Collapsing:
 ---

 q=red%20dress&facet=true&facet.mincount=1&facet.limit=-1&facet.field=searchcolorfacet&group=true&group.field=groupid&group.facet=true&group.ngroups=true

 ngroups = 5964

 <lst name="searchcolorfacet">
 ...
 <int name="red">11</int>
 ...
 </lst>

 CollapseQParserPlugin:

 q=red%20dress&facet=true&facet.mincount=1&facet.limit=-1&facet.field=searchcolorfacet&fq=%7B!collapse%20field=groupid%7D

 numFound = 5964 (same)

 <lst name="searchcolorfacet">
 ...
 <int name="red">8</int>
 ...
 </lst>

 When we change the CollapseQParserPlugin query by adding
 fq=searchcolorfacet:red, the numFound value is 11, effectively showing
 all 11 hits with that color.  The facet count for red now shows the
 correct
 value of 11 as well.

 Has anyone seen something similar?

 Thanks,
 Carlos





RE: CollapseQParserPluging Incorrect Facet Counts

2015-06-19 Thread Carlos Maroto
Thanks Joel,

I don't know why I was unable to find the understanding collapsing email 
thread via the search I did on the site but I found it in my own email search 
now.

We'll look into our specific scenario and see if we can find a workaround.  
Thanks!

CARLOS MAROTO   
M +1 626 354 7750

-Original Message-
From: Joel Bernstein [mailto:joels...@gmail.com] 
Sent: Friday, June 19, 2015 1:18 PM
To: solr-user@lucene.apache.org
Subject: Re: CollapseQParserPluging Incorrect Facet Counts

If you see the last comment on:

https://issues.apache.org/jira/browse/SOLR-6143

You'll see there is a discussion starting about adding this feature.

Joel Bernstein
http://joelsolr.blogspot.com/

On Fri, Jun 19, 2015 at 4:14 PM, Joel Bernstein joels...@gmail.com wrote:

 The CollapsingQParserPlugin does not provide facet counts that are 
 the same as the group.facet feature in Grouping. It provides facet 
 counts that behave like group.truncate.

 The CollapsingQParserPlugin only collapses the result set. The facet 
 counts are then generated for the collapsed result set by the 
 FacetComponent.

 This has been a hot topic of late.

 Joel Bernstein
 http://joelsolr.blogspot.com/

 On Fri, Jun 19, 2015 at 3:54 PM, Carlos Maroto 
 charlie.mar...@gmail.com
 wrote:

 Hi,

 We are comparing results between Field Collapsing (group* 
 parameters) and CollapseQParserPlugin.  We noticed that some facets 
 are returning incorrect counts.

 Here are the relevant parameters of one of our test queries:

 Field Collapsing:
 ---

 q=red%20dress&facet=true&facet.mincount=1&facet.limit=-1&facet.field=searchcolorfacet&group=true&group.field=groupid&group.facet=true&group.ngroups=true

 ngroups = 5964

 <lst name="searchcolorfacet">
 ...
 <int name="red">11</int>
 ...
 </lst>

 CollapseQParserPlugin:

 q=red%20dress&facet=true&facet.mincount=1&facet.limit=-1&facet.field=searchcolorfacet&fq=%7B!collapse%20field=groupid%7D

 numFound = 5964 (same)

 <lst name="searchcolorfacet">
 ...
 <int name="red">8</int>
 ...
 </lst>

 When we change the CollapseQParserPlugin query by adding 
 fq=searchcolorfacet:red, the numFound value is 11, effectively 
 showing all 11 hits with that color.  The facet count for red now 
 shows the correct value of 11 as well.

 Has anyone seen something similar?

 Thanks,
 Carlos





Re: Parsing dating during indexing - Year Only

2015-06-19 Thread levanDev
Hi Chris, 

Thank you for taking the time to write the detailed response. Very helpful.
Dealing with interesting formats in the source data and trying to evaluate
various options for our business needs. The second scenario you described
(where some values in the date field are just the year) will either come up
pretty soon for me or will certainly help someone else dealing with that
issue currently. 

Thank you,  
Levan



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Parsing-date-during-indexing-Year-Only-tp4213045p4213065.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Parsing dating during indexing - Year Only

2015-06-19 Thread Erick Erickson
Hmm, I can see some things you couldn't do with just using
a tint field for the year. Or rather, some things that wouldn't
be as convenient

But this might help:
http://lucene.apache.org/solr/5_2_0/solr-core/org/apache/solr/update/processor/ParseDateFieldUpdateProcessorFactory.html

or you can also consider a http://wiki.apache.org/solr/ScriptUpdateProcessor
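A hedged sketch of what that processor's configuration might look like for this case (parameter names are from the stock schemaless example config; whether a bare "yyyy" pattern gives the desired January 1st default is an assumption worth verifying against the javadoc above):

  <processor class="solr.ParseDateFieldUpdateProcessorFactory">
    <str name="defaultTimeZone">UTC</str>
    <arr name="format">
      <str>yyyy-MM-dd'T'HH:mm:ss.SSSZ</str>
      <str>yyyy-MM-dd</str>
      <str>yyyy</str>   <!-- assumption: a bare-year pattern -->
    </arr>
  </processor>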

Best,
Erick

On Fri, Jun 19, 2015 at 1:57 PM, levanDev levandev9...@gmail.com wrote:
 Hello,

 Example csv doc has column 'just_the_year' and value '2010':

 With the Schema API I can tell the indexing process to treat 'just_the_year'
 as a date field.

 I know that I can update the solrconfig.xml to correctly parse formats such
 as MM/dd/yyyy (which is awesome) but has anyone tried to convert just the
 year value to a full date (2010-01-01T00:00:00Z) by updating the
 solrconfig.xml?

 I know it's possible to import csv, do the date transformation, export again
 and have everything work nicely but it would be cool to reduce the number of
 steps involved and use the powerful date processor.

 Thank you,
 Levan



 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Parsing-dating-during-indexing-Year-Only-tp4213045.html
 Sent from the Solr - User mailing list archive at Nabble.com.


RE: How to do a Data sharding for data in a database table

2015-06-19 Thread Carlos Maroto
As stated previously, using Field Collapsing (group parameters) tends to 
significantly slow down queries.  In my experience, search response gets even 
worse when:
- Requesting facets, which more often than not I do in my query formulation
- Asking for the facet counts to be on the groups via the group.facet=true 
parameter (way worse in some of my use cases that had a lot of distinct values 
for at least one of the facets)
- Queries are matching many hits, i.e. individual counts (hundreds of thousands 
or more in our case) and total group counts (in the few thousands)

Also, as stated by someone, switching to CollapseQParserPlugin will likely 
significantly reduce the response time given its different implementation.  Using 
CollapseQParserPlugin means that you:

1- Have to change how the query gets created
2- May need to change how you consume the Solr response (depending on what you 
are using today)
3- Will not have the total number of individual hits (the before-collapsing count) 
because the numFound returned by the CollapseQParserPlugin represents the total 
number of groups (like group.ngroups does)
4- May have an issue with facet value counts not being exact in the 
CollapseQParserPlugin response

With respect to sharding, there are multiple considerations.  The most relevant 
given your need for grouping is to implement custom routing of documents to 
shards so that all members of a group are indexed in the same shard, if you 
can.  Otherwise your grouping across shards will have some issues (particularly 
with counts, I believe.)
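As an illustration of that routing idea, here is a sketch using SolrCloud's default compositeId router, where the group id becomes the shard-key prefix of the document id. It uses the SolrJ 4.x CloudSolrServer (newer SolrJ uses CloudSolrClient); the ZooKeeper address, collection name, and all id values are placeholders:

  import org.apache.solr.client.solrj.impl.CloudSolrServer;
  import org.apache.solr.common.SolrInputDocument;

  public class GroupRouting {
    public static void main(String[] args) throws Exception {
      CloudSolrServer solr = new CloudSolrServer("localhost:2181"); // placeholder ZK host
      solr.setDefaultCollection("mycollection");                    // placeholder collection
      SolrInputDocument doc = new SolrInputDocument();
      // "shardKey!docId": the group-id prefix makes every member of the group
      // hash to the same shard under the compositeId router
      doc.addField("id", "GROUP42!PRODUCT123");
      doc.addField("groupid", "GROUP42");
      solr.add(doc);
      solr.commit();
      solr.shutdown();
    }
  }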

CARLOS MAROTO   
http://www.searchtechnologies.com/
M +1 626 354 7750

-Original Message-
From: Reitzel, Charles [mailto:charles.reit...@tiaa-cref.org] 
Sent: Friday, June 19, 2015 12:08 PM
To: solr-user@lucene.apache.org
Subject: RE: How to do a Data sharding for data in a database table

Also, since you are tuning for relative times, you can tune on the smaller 
index.   Surely, you will want to test at scale.   But tuning query, analyzer 
or schema options is usually easier to do on a smaller index.   If you get a 3x 
improvement at small scale, it may only be 2.5x at full scale.

E.g. storing the group field as doc values is one option that can help grouping 
performance in some cases (at least according to this list, I haven't tried it 
yet).
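For reference, enabling docValues on a grouping field is a single schema attribute (field and type names here are placeholders, and as noted below it does require reindexing):

  <field name="groupid" type="string" indexed="true" stored="false" docValues="true"/>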

The number of distinct values of the grouping field is important as well.  If 
there are very many, you may want to try CollapsingQParserPlugin. 

The point being, some of these options may require reindexing!   So, again, it 
is a much easier and faster process to tune on a smaller index.

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Friday, June 19, 2015 2:33 PM
To: solr-user@lucene.apache.org
Subject: Re: How to do a Data sharding for data in a database table

Do be aware that turning on debug=query adds a load. I've seen the debug 
component take 90% of the query time. (to be fair it usually takes a much 
smaller percentage).

But you'll see a section at the end of the response if you set debug=all with 
the time each component took so you'll have a sense of the relative time used 
by each component.
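A hedged example of pulling just that timing section over HTTP (core name and query are placeholders; debug=all additionally includes the parsed query and explain info):

  curl "http://localhost:8983/solr/mycore/select?q=*:*&rows=0&wt=json&debug=timing"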

Best,
Erick

On Fri, Jun 19, 2015 at 11:06 AM, Wenbin Wang wwang...@gmail.com wrote:
 As for now, the index size is 6.5 M records, and the performance is 
 good enough. I will re-build the index for all the records (14 M) and 
 test it again with debug turned on.

 Thanks


 On Fri, Jun 19, 2015 at 12:10 PM, Erick Erickson 
 erickerick...@gmail.com
 wrote:

 First and most obvious thing to try:

 bq: the Solr was started with maximal 4G for JVM, and index size is < 2G

 Bump your JVM to 8G, perhaps 12G. The size of the index on disk is 
 very loosely coupled to JVM requirements. It's quite possible that 
 you're spending all your time in GC cycles. Consider gathering GC 
 characteristics, see:
 http://lucidworks.com/blog/garbage-collection-bootcamp-1-0/

 As Charles says, on the face of it the system you describe should 
 handle quite a load, so it feels like things can be tuned and you 
 won't have to resort to sharding.
 Sharding inevitably imposes some overhead so it's best to go there last.

 From my perspective, this is, indeed, an XY problem. You're assuming 
 that sharding is your solution. But you really haven't identified the 
 _problem_ other than queries are too slow. Let's nail down the 
 reason queries are taking a second before jumping into sharding. I've 
 just spent too much of my life fixing the wrong thing ;)

 It would be useful to see a couple of sample queries so we can get a 
 feel for how complex they are. Especially if you append, as Charles 
 mentions, debug=true

 Best,
 Erick

 On Fri, Jun 19, 2015 at 7:02 AM, Reitzel, Charles 
 charles.reit...@tiaa-cref.org wrote:
  Grouping does tend to be expensive.   Our regular queries typically
 return in 10-15ms while the grouping queries take 60-80ms in a test 
 environment ( 1M docs).
 
  This is 

Re: understanding collapsingQParser with facet vs group.facet

2015-06-19 Thread Derek Poh

Hi Upayavira

Thank you for your explanation on the difference between traditional 
grouping and collapsingQParser. I understand more now.


On 6/19/2015 7:11 PM, Upayavira wrote:

On Fri, Jun 19, 2015, at 06:20 AM, Derek Poh wrote:

Hi

I read that collapsingQParser returns the facet counts the same as
group.truncate=true and has this issue where the facet count and the
after-filter facet count are not the same.
Using group.facet does not have this issue but its performance is very
bad compared to collapsingQParser.

I am trying to understand why collapsingQParser behaves this way and will
need to explain it to management.

Can someone explain how collapsingQParser calculates the facet
counts compared to group.facet?

I'm not familiar with group.facet. But to compare traditional grouping
to the collapsingQParser - in traditional grouping, all matching
documents remain in the result set, but they are grouped for output
purposes. However, the collapsingQParser is actually a query filter. It
will reduce the number of matching results. Any faceting that happens
will happen on the filtered results.

I wonder if you can use this syntax to achieve faceting alongside
collapsing:

q=whatever
fq={!collapse tag=collapse}blah
facet.field={!ex=collapse}my_facet_field

This way, you get the benefits of the CollapsingQParserPlugin, with full
faceting on the uncollapsed resultset.

I've no idea how this would perform, but I'd expect it to be better than
the grouping option.

Upayavira






Re: Auto-suggest in Solr

2015-06-19 Thread Zheng Lin Edwin Yeo
Ok sure.

  ngrams: The max number of tokens out of which singles will be make the
 dictionary. The default value is 2. Increasing this would mean you want
 more than the previous 2 tokens to be taken into consideration when making
 the suggestions. 

I got confused by this, as I could not get that behavior when I use the
suggester. Since the default value is 2, it means the search for "mp3 p"
should include only suggestions that contain "mp3 ..." and not just
suggestions from the letter "p". But I have only been getting suggestions
that start with "p".
Even when I try with a bigger ngrams value for a longer search, I'm getting
the same results as well: the suggester only considers the last token
when giving the suggestions.

I still could not achieve anything that considers 2 or more tokens when
returning the suggestions.

So am I actually following the right direction with this?

Regards,
Edwin
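(In case it helps with debugging, a hedged example of exercising the suggester directly over HTTP. It assumes a /suggest request handler is wired to the suggest component below and that the suggester is named mySuggester; both names are assumptions and must match your solrconfig.xml.)

  curl "http://localhost:8983/solr/collection1/suggest?suggest=true&suggest.dictionary=mySuggester&suggest.build=true"
  curl "http://localhost:8983/solr/collection1/suggest?suggest=true&suggest.dictionary=mySuggester&suggest.q=mp3%20p"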



On 19 June 2015 at 18:53, Alessandro Benedetti benedetti.ale...@gmail.com
wrote:

 Actually the documentation is not clear enough.
 Let's try to understand this suggester.

 *Building*
 This suggester builds an FST that it will use to provide the autocomplete
 feature, running prefix searches on it.
 The terms it uses to generate the FST are the tokens produced by the
  suggestFreeTextAnalyzerFieldType.

 And this should be correct.
 So if we have a shingle token filter[1-3] ( we produce unigrams as well) in
 our analysis to keep it simple , from these original field values :
 mp3 ipod
 mp3 player
 mp3 player ipod
 player of Real

 - we produce this list of possible suggestions in our FST:

 mp3
 player
 ipod
 real
 of

 mp3 ipod
 mp3 player
 player ipod
 player of
 of real

 mp3 player ipod
 player of real

 From the documentation I read :

   ngrams: The max number of tokens out of which singles will be make the
  dictionary. The default value is 2. Increasing this would mean you want
  more than the previous 2 tokens to be taken into consideration when
 making
  the suggestions. 


 This makes me confused, as I was not expecting this param to affect the
 suggestion dictionary.
 So I would like a clarification here from our masters :)
 At this point let's see what happens at query time .

 *Query Time*
 As I understand it, the ngrams param will consider the last N-1 tokens
 the user typed, separated by the space separator.

 Builds an ngram model from the text sent to {@link
  * #build} and predicts based on the last grams-1 tokens in
  * the request sent to {@link #lookup}. This tries to
  * handle the long tail of suggestions for when the
  * incoming query is a never before seen query string.


 Example: grams=3 should consider only the last 2 tokens

 special mp3 p -> mp3 p

 Then this query is analysed using the suggestFreeTextAnalyzerFieldType .
 We produce 3 tokens :
 mp3
 p
 mp3 p

 And we run the prefix matching on the FST .

 *Conclusion*
 My understanding is wrong for sure at some point, as the behaviour I get is
 different.
 Can we discuss this , clarify this and eventually put it in the official
 documentation ?

 Cheers

 2015-06-19 6:40 GMT+01:00 Zheng Lin Edwin Yeo edwinye...@gmail.com:

  I'm implementing an auto-suggest feature in Solr, and I'll like to
 achieve
  the follwing:
 
  For example, if the user enters mp3, Solr might suggest mp3 player,
  mp3 nano and mp3 music.
  When the user enters mp3 p, the suggestion should narrow down to mp3
  player.
 
  Currently, when I type mp3 p, the suggester is returning words that
  starts with the letter p only, and I'm getting results like plan,
  production, etc, and it does not take the mp3 token into
 consideration.
 
  I'm using Solr 5.1 and below is my configuration:
 
  In solrconfig.xml:
 
  <searchComponent name="suggest" class="solr.SuggestComponent">
    <lst name="suggester">
      <str name="lookupImpl">FreeTextLookupFactory</str>
      <str name="indexPath">suggester_freetext_dir</str>
      <str name="dictionaryImpl">DocumentDictionaryFactory</str>
      <str name="field">Suggestion</str>
      <str name="weightField">Project</str>
      <str name="suggestFreeTextAnalyzerFieldType">suggestType</str>
      <int name="ngrams">5</int>
      <str name="buildOnStartup">false</str>
      <str name="buildOnCommit">false</str>
    </lst>
  </searchComponent>
 
 
  In schema.xml
 
  <fieldType name="suggestType" class="solr.TextField"
    positionIncrementGap="100">
    <analyzer type="index">
      <charFilter class="solr.PatternReplaceCharFilterFactory"
        pattern="[^a-zA-Z0-9]" replacement=" "/>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.ShingleFilterFactory" minShingleSize="2"
        maxShingleSize="6" outputUnigrams="false"/>
    </analyzer>
    <analyzer type="query">
      <charFilter class="solr.PatternReplaceCharFilterFactory"
        pattern="[^a-zA-Z0-9]" replacement=" "/>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.ShingleFilterFactory" minShingleSize="2"
        maxShingleSize="6" outputUnigrams="true"/>
    </analyzer>
  </fieldType>
 
 
  Is there anything that I configured wrongly?
 
 
  Regards,
  Edwin
 



 --
 

Same query, inconsistent result in SolrCloud

2015-06-19 Thread Jerome Yang
Hi!

I'm facing a problem.
I'm using SolrCloud 4.10.3, with 2 shards, and each shard has 2 replicas.

After index data to the collection, and run the same query,

http://localhost:8983/solr/catalog/select?q=a&wt=json&indent=true

Sometimes, it returns the right result,

{
  "responseHeader":{
    "status":0,
    "QTime":19,
    "params":{
      "indent":"true",
      "q":"a",
      "wt":"json"}},
  "response":{"numFound":5,"start":0,"maxScore":0.43969032,"docs":[
      {},{},...

    ]

  }

}

But when I re-run the same query, it returns:

{
  "responseHeader":{
    "status":0,
    "QTime":14,
    "params":{
      "indent":"true",
      "q":"a",
      "wt":"json"}},
  "response":{"numFound":0,"start":0,"maxScore":0.0,"docs":[]
  },
  "highlighting":{}}


Only some short words show this kind of problem.

Does anyone know what's going on?

Thanks

Regards,

Jerome


Re: understanding collapsingQParser with facet vs group.facet

2015-06-19 Thread Derek Poh

Hi Joel

By group heads, is it referring to the document that is used to represent 
each group in the main result section?


E.g. using the below 3 documents, and we collapse on field supplier_id:

supplier_id:S1
product_id:P1

supplier_id:S2
product_id:P2

supplier_id:S2
product_id:P3

With collapse on supplier_id, the result in the main section is as follows,

supplier_id:S1
product_id:P1

supplier_id:S2
product_id:P3

The group head of supplier_id:S1 is P1 and of supplier_id:S2 will be P3?

Facets (and even sort) are calculated on P1 and P3?

-Derek

On 6/19/2015 7:05 PM, Joel Bernstein wrote:

The CollapsingQParserPlugin currently doesn't calculate facets at all. It
simply collapses the document set. The facets are then calculated only on
the group heads.

Grouping has special faceting code built into it that supports the
group.facet functionality.

Joel Bernstein
http://joelsolr.blogspot.com/

On Fri, Jun 19, 2015 at 6:20 AM, Derek Poh d...@globalsources.com wrote:


Hi

I read that collapsingQParser returns the facet counts the same as
group.truncate=true and has this issue where the facet count and the after-
filter facet count are not the same.
Using group.facet does not have this issue but its performance is very
bad compared to collapsingQParser.

I am trying to understand why collapsingQParser behaves this way and will need
to explain it to management.

Can someone explain how collapsingQParser calculates the facet counts compared
to group.facet?

Thank you,
Derek







Re: Help: Problem in customized token filter

2015-06-19 Thread Aman Tandon
Steve,

Thank you thank you so much. You guys are awesome.

Steve, how can I learn more about the Lucene indexing process in more
detail? E.g. after we send documents for indexing, which functions are called
until the doc is actually stored in the index files?

I will be thankful to you if you guide me here.

With Regards
Aman Tandon

On Fri, Jun 19, 2015 at 10:48 AM, Steve Rowe sar...@gmail.com wrote:

 Aman,

 Solr uses the same Token filter instances over and over, calling reset()
 before sending each document through.  Your code sets "exhausted" to true
 and then never sets it back to false, so the next time the token filter
 instance is used, its "exhausted" value is still true, so no input stream
 tokens are concatenated ever again.

 Does that make sense?

 Steve
 www.lucidworks.com
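(For anyone hitting the same thing, here is a minimal, untested sketch of the pattern Steve describes. This is not Aman's actual ConcatenateWordsFilter and the package/class names are made up; the point is only that reset() must clear the per-document state.)

  package com.example.analysis;

  import java.io.IOException;
  import org.apache.lucene.analysis.TokenFilter;
  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

  // Emits a single token that is the concatenation of every token in the stream.
  public final class ConcatAllFilter extends TokenFilter {
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final StringBuilder buffer = new StringBuilder();
    private boolean exhausted = false;

    public ConcatAllFilter(TokenStream input) {
      super(input);
    }

    @Override
    public boolean incrementToken() throws IOException {
      if (exhausted) {
        return false;                      // the single concatenated token was already emitted
      }
      while (input.incrementToken()) {     // drain the upstream tokens
        buffer.append(termAtt.buffer(), 0, termAtt.length());
      }
      exhausted = true;
      if (buffer.length() == 0) {
        return false;
      }
      clearAttributes();
      termAtt.setEmpty().append(buffer);   // emit the concatenated token
      return true;
    }

    @Override
    public void reset() throws IOException {
      super.reset();
      exhausted = false;                   // the crucial part: clear per-document state on reuse
      buffer.setLength(0);
    }
  }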

  On Jun 19, 2015, at 1:10 AM, Aman Tandon amantandon...@gmail.com
 wrote:
 
  Hi Steve,
 
 
  you never set exhausted to false, and when the filter got reused, *it
  incorrectly carried state from the previous document.*
 
 
  Thanks for replying, but I am not able to understand this.
 
  With Regards
  Aman Tandon
 
  On Fri, Jun 19, 2015 at 10:25 AM, Steve Rowe sar...@gmail.com wrote:
 
  Hi Aman,
 
  The admin UI screenshot you linked to is from an older version of Solr -
  what version are you using?
 
  Lots of extraneous angle brackets and asterisks got into your email and
  made for a bunch of cleanup work before I could read or edit it.  In the
  future, please put your code somewhere people can easily read it and
  copy/paste it into an editor: into a github gist or on a paste service,
 etc.
 
  Looks to me like your use of “exhausted” is unnecessary, and is likely
 the
  cause of the problem you saw (only one document getting processed): you
  never set exhausted to false, and when the filter got reused, it
  incorrectly carried state from the previous document.
 
  Here’s a simpler version that’s hopefully more correct and more
 efficient
  (2 fewer copies from the StringBuilder to the final token).  Note: I
 didn’t
  test it:
 
 https://gist.github.com/sarowe/9b9a52b683869ced3a17
 
  Steve
  www.lucidworks.com
 
  On Jun 18, 2015, at 11:33 AM, Aman Tandon amantandon...@gmail.com
  wrote:
 
  Please help, what wrong I am doing here. please guide me.
 
  With Regards
  Aman Tandon
 
  On Thu, Jun 18, 2015 at 4:51 PM, Aman Tandon amantandon...@gmail.com
  wrote:
 
  Hi,
 
  I created a *token concat filter* to concatenate all the tokens from the
  token stream. It creates the concatenated token as expected.

  But when I am posting the XML containing more than 30,000 documents, then
  only the first document has the data for that field.
 
   Schema:

   <field name="titlex" type="text" indexed="true" stored="false"
     required="false" omitNorms="false" multiValued="false" />

   <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
     <analyzer type="index">
       <charFilter class="solr.HTMLStripCharFilterFactory"/>
       <tokenizer class="solr.StandardTokenizerFactory"/>
       <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
         generateNumberParts="1" catenateWords="0" catenateNumbers="1"
         catenateAll="0" splitOnCaseChange="1"/>
       <filter class="solr.LowerCaseFilterFactory"/>
       <filter class="solr.ShingleFilterFactory" maxShingleSize="3"
         outputUnigrams="true" tokenSeparator=""/>
       <filter class="solr.SnowballPorterFilterFactory" language="English"
         protected="protwords.txt"/>
       <filter class="com.xyz.analysis.concat.ConcatenateWordsFilterFactory"/>
       <filter class="solr.SynonymFilterFactory"
         synonyms="stemmed_synonyms_text_prime_ex_index.txt" ignoreCase="true"
         expand="true"/>
     </analyzer>
     <analyzer type="query">
       <tokenizer class="solr.StandardTokenizerFactory"/>
       <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
         ignoreCase="true" expand="true"/>
       <filter class="solr.StopFilterFactory" ignoreCase="true"
         words="stopwords_text_prime_search.txt" enablePositionIncrements="true"/>
       <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
         generateNumberParts="1" catenateWords="0" catenateNumbers="0"
         catenateAll="0" splitOnCaseChange="1"/>
       <filter class="solr.LowerCaseFilterFactory"/>
       <filter class="solr.SnowballPorterFilterFactory" language="English"
         protected="protwords.txt"/>
       <filter class="com.xyz.analysis.concat.ConcatenateWordsFilterFactory"/>
     </analyzer>
   </fieldType>
 
 
  Please help me, The code for the filter is as follows, please take a
  look.
 
  Here is the picture of what filter is doing
  http://i.imgur.com/THCsYtG.png?1
 
  The code of concat filter is :
 
  *package com.xyz.analysis.concat;*
 
  *import java.io.IOException;*
 
 
  *import org.apache.lucene.analysis.TokenFilter;*
 
  *import org.apache.lucene.analysis.TokenStream;*
 
  *import
 org.apache.lucene.analysis.tokenattributes.CharTermAttribute;*
 
  *import 

understanding collapsingQParser with facet vs group.facet

2015-06-19 Thread Derek Poh

Hi

I read that collapsingQParser returns the facet counts the same as 
group.truncate=true and has this issue where the facet count and the 
after-filter facet count are not the same.
Using group.facet does not have this issue but its performance is very 
bad compared to collapsingQParser.


I am trying to understand why collapsingQParser behaves this way and will 
need to explain it to management.


Can someone explain how collapsingQParser calculates the facet 
counts compared to group.facet?


Thank you,
Derek




Limit indexed documents.

2015-06-19 Thread tomas.kalas
Hello, I have a few questions about indexing data.
Are there any hardware or software limits for indexing data?
And is there a maximum number of indexed documents?
Thanks for your answers.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Limit-indexed-documents-tp4212913.html
Sent from the Solr - User mailing list archive at Nabble.com.


SolrJ: getBeans with multiple document types in response

2015-06-19 Thread Catala, Francois
Hello,

I'm trying to parse Solr responses with SolrJ, but the responses contain mixed 
types: for example 'song' documents and 'movie' documents with different 
fields.
The getBeans method takes one class type as an input parameter, which does not 
allow for mixed document type responses.
What would be the best approach to parse the response and get a list of 
'entity' (the super class)?

I'm about to write another implementation of the DocumentObjectBinder class but 
I'd like to avoid that.
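One possible workaround, sketched under two assumptions: that each document carries a discriminator field (called "type" here), and that DocumentObjectBinder#getBean(Class, SolrDocument) is available in your SolrJ version, which is worth double-checking. Song, Movie and Entity stand in for the poster's own classes, assumed to be annotated with @Field as usual.

  import java.util.ArrayList;
  import java.util.List;
  import org.apache.solr.client.solrj.beans.DocumentObjectBinder;
  import org.apache.solr.client.solrj.response.QueryResponse;
  import org.apache.solr.common.SolrDocument;

  public class MixedTypeBinding {
    public static List<Entity> bind(QueryResponse rsp) {
      DocumentObjectBinder binder = new DocumentObjectBinder();
      List<Entity> entities = new ArrayList<>();
      for (SolrDocument doc : rsp.getResults()) {
        Object type = doc.getFieldValue("type");          // assumed discriminator field
        if ("song".equals(type)) {
          entities.add(binder.getBean(Song.class, doc));  // per-document binding
        } else if ("movie".equals(type)) {
          entities.add(binder.getBean(Movie.class, doc));
        }
      }
      return entities;
    }
  }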

Thanks!!

François Catala
Software Developer
NUANCE COMMUNICATIONS, INC.
1500 University, Suite 557
Montréal QC  H3A 3S7
514 904 7800   Officejust say my name or ext. 2345



Re: Error when submitting PDF to Solr w/text fields using SolrJ

2015-06-19 Thread Paden
Yeah I'm just gonna say hands down this was a totally bad question. My fault,
mea culpa. I'm pretty new to working in an IDE environment and using a stack
trace (I just finished my first year of CS at University and now I'm
interning). I'm actually kind of embarrassed by how long it took me to
realize I wasn't looking at the entire stack trace. Idiot moment of the week
for sure. Thanks for the patience guys but when I looked at the entire stack
trace it gave me this. 

Caused by: java.lang.IllegalArgumentException: Document contains at least
one immense term in field=text (whose UTF8 encoding is longer than the max
length 32766), all of which were skipped.  Please correct the analyzer to
not produce such terms.  The prefix of the first immense term is: '[84, 104,
101, 32, 73, 78, 76, 32, 105, 115, 32, 97, 32, 85, 46, 83, 46, 32, 68, 101,
112, 97, 114, 116, 109, 101, 110, 116, 32, 111]...', original message: bytes
can be at most 32766 in length; got 44360
at
org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:667)
at
org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:344)
at
org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:300)
at
org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:232)
at
org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:458)
at
org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1350)
at
org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:239)
at
org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:163)
... 40 more
Caused by:
org.apache.lucene.util.BytesRefHash$MaxBytesLengthExceededException: bytes
can be at most 32766 in length; got 44360
at org.apache.lucene.util.BytesRefHash.add(BytesRefHash.java:284)
at
org.apache.lucene.index.TermsHashPerField.add(TermsHashPerField.java:154)
at
org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:657)
... 47 more


And it took me all of two seconds to realize what had gone wrong. Now I'm
just trying to figure out how to index the text content without truncating
all the info or filtering it out entirely, thereby messing up my searching
capabilities. 
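One common way out, sketched here under the assumption that the field should be a tokenized text type rather than a single string: tokenize the extracted text and, as a safety net, drop any pathological token instead of failing the whole document. The field type name and the 1024 cap are placeholders.

  <fieldType name="text_extracted" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <!-- silently discard tokens longer than the cap (well under Lucene's 32766-byte term limit) -->
      <filter class="solr.LengthFilterFactory" min="1" max="1024"/>
    </analyzer>
  </fieldType>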



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Error-when-submitting-PDF-to-Solr-w-text-fields-using-SolrJ-tp4212704p4212919.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Limit indexed documents.

2015-06-19 Thread Toke Eskildsen
tomas.kalas kala...@email.cz wrote:
 Are there any hardware or software limits for indexing data?

The only really hard Solr limit is 2 billion X per shard, where X is document 
count, unique values in a DocValues String field and other things like that. 
There are some softer limits, after which performance degrades markedly: Number 
of fields (hundreds are fine, millions are unrealistic), number of shards 
(avoid going into the thousands). Having a Java heap of hundreds of gigabytes 
is possible, but requires tweaking to avoid very long garbage collection 
pauses. I do not know of a byte size limit for shards: Shards of 1-2 TB work 
without problems on fitting hardware.

 And is there a maximum number of indexed documents?

While the limit is 2 billion per single shard, SolrCloud does not have this 
limitation. A soft limit before doing some custom multi-level setup would thus 
be around 2000 billion documents, divided across 1000 shards.

- Toke Eskildsen


RE: How to do a Data sharding for data in a database table

2015-06-19 Thread Reitzel, Charles
Hi Wenbin,

To me, your instance appears well provisioned.  Likewise, your analysis of test 
vs. production performance makes a lot of sense.  Perhaps your time would be 
well spent tuning the query performance for your app before resorting to 
sharding?   

To that end, what do you see when you set debugQuery=true?   Where does Solr 
spend the time?   My guess would be in the grouping and sorting steps, but 
which?   Sometimes the schema details matter for performance.   Folks on this 
list can help with that.

-Charlie

-Original Message-
From: Wenbin Wang [mailto:wwang...@gmail.com] 
Sent: Friday, June 19, 2015 7:55 AM
To: solr-user@lucene.apache.org
Subject: Re: How to do a Data sharding for data in a database table

I have enough RAM (30G) and hard disk (1000G). It is not I/O bound or disk 
bound. In addition, Solr was started with a maximum of 4G for the JVM, and the 
index size is < 2G. In a typical test, I made sure enough free RAM of 10G was 
available. I have not tuned any parameter in the configuration; it is the 
default configuration.

The number of fields for each record is around 10, and the number of results to 
be returned per page is 30. So the response time should not be affected by 
network traffic, and it is tested on the same machine. The query has a list of 
4 search parameters, and each parameter takes a list of values or a date range. 
The results will also be grouped and sorted. The response time of a typical 
single request is around 1 second. It can be > 1 second with more demanding 
requests.

In our production environment, we have 64 cores, and we need to support 
300 concurrent users, that is, about 300 concurrent requests per second. Each 
core will have to process about 5 requests per second. The response time under 
this load will not be 1 second any more. My estimate is that an average 200 
ms response time for a single request would be able to handle 
300 concurrent users in production. There is no plan to increase the total 
number of cores 5 times.

In a previous test, a search index of around 6M records was able to handle 
5 requests per second on each core of my 8-core machine.

By sharding the data from one single index of 13M records into 2 indexes of 6 or 7 
M each, I am expecting a much faster response time that can meet the demands of 
the production environment. That is the motivation for doing data sharding. 
However, I am also open to a solution that can improve the performance of the 
13M to 14M index so that I do not need to do data sharding.





On Fri, Jun 19, 2015 at 12:39 AM, Erick Erickson erickerick...@gmail.com
wrote:

 You've repeated your original statement. Shawn's observation is that 
 10M docs is a very small corpus by Solr standards. You either have 
 very demanding document/search combinations or you have a poorly tuned 
 Solr installation.

 On reasonable hardware I expect 25-50M documents to have sub-second 
 response time.

 So what we're trying to do is be sure this isn't an XY problem, from 
 Hossman's apache page:

 Your question appears to be an XY Problem ... that is: you are 
 dealing with X, you are assuming Y will help you, and you are asking 
 about Y
 without giving more details about the X so that we can understand 
 the full issue.  Perhaps the best solution doesn't involve Y at all?
 See Also: http://www.perlmonks.org/index.pl?node_id=542341

 So again, how would you characterize your documents? How many fields? 
 What do queries look like? How much physical memory on the machine? 
 How much memory have you allocated to the JVM?

 You might review:
 http://wiki.apache.org/solr/UsingMailingLists


 Best,
 Erick

 On Thu, Jun 18, 2015 at 3:23 PM, wwang525 wwang...@gmail.com wrote:
  The query without load is still under 1 second. But under load, 
  response
 time
  can be much longer due to the queued up query.
 
  We would like to shard the data to something like 6 M / shard, which 
  will still give a under 1 second response time under load.
 
  What are some best practice to shard the data? for example, we could
 shard
  the data by date range, but that is pretty dynamic, and we could 
  shard
 data
  by some other properties, but if the data is not evenly distributed, 
  you
 may
  not be able shard it anymore.
 
 
 
  --
  View this message in context:
 http://lucene.472066.n3.nabble.com/How-to-do-a-Data-sharding-for-data-
 in-a-database-table-tp4212765p4212803.html
  Sent from the Solr - User mailing list archive at Nabble.com.


*
This e-mail may contain confidential or privileged information.
If you are not the intended recipient, please notify the sender immediately and 
then delete it.

TIAA-CREF
*


Re: Error when submitting PDF to Solr w/text fields using SolrJ

2015-06-19 Thread Alessandro Benedetti
Silly thing … Maybe the immense token was generated because you tried to set
string as the field type for your text?
Could that be?
Can you wipe out the index, set a proper type for your text, and index
again?
No worries about the incomplete stack trace,
we learn and do wrong things every day :)
Errare humanum est

Cheers

2015-06-19 14:31 GMT+01:00 Paden rumsey...@gmail.com:

 Yeah I'm just gonna say hands down this was a totally bad question. My
 fault,
 mea culpa. I'm pretty new to working in an IDE environment and using a
 stack
 trace (I just finished my first year of CS at University and now I'm
 interning). I'm actually kind of embarrassed by how long it took me to
 realize I wasn't looking at the entire stack trace. Idiot moment of the
 week
 for sure. Thanks for the patience guys but when I looked at the entire
 stack
 trace it gave me this.

 Caused by: java.lang.IllegalArgumentException: Document contains at least
 one immense term in field=text (whose UTF8 encoding is longer than the
 max
 length 32766), all of which were skipped.  Please correct the analyzer to
 not produce such terms.  The prefix of the first immense term is: '[84,
 104,
 101, 32, 73, 78, 76, 32, 105, 115, 32, 97, 32, 85, 46, 83, 46, 32, 68, 101,
 112, 97, 114, 116, 109, 101, 110, 116, 32, 111]...', original message:
 bytes
 can be at most 32766 in length; got 44360
 at

 org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:667)
 at

 org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:344)
 at

 org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:300)
 at

 org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:232)
 at

 org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:458)
 at
 org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1350)
 at

 org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:239)
 at

 org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:163)
 ... 40 more
 Caused by:
 org.apache.lucene.util.BytesRefHash$MaxBytesLengthExceededException: bytes
 can be at most 32766 in length; got 44360
 at org.apache.lucene.util.BytesRefHash.add(BytesRefHash.java:284)
 at
 org.apache.lucene.index.TermsHashPerField.add(TermsHashPerField.java:154)
 at

 org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:657)
 ... 47 more


 And it took me all of two seconds to realize what had gone wrong. Now I'm
 just trying to figure out how to index the text content without truncating
 all the info or filtering it out entirely, thereby messing up my searching
 capabilities.



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Error-when-submitting-PDF-to-Solr-w-text-fields-using-SolrJ-tp4212704p4212919.html
 Sent from the Solr - User mailing list archive at Nabble.com.




-- 
--

Benedetti Alessandro
Visiting card : http://about.me/alessandro_benedetti

Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?

William Blake - Songs of Experience -1794 England


RE: How to do a Data sharding for data in a database table

2015-06-19 Thread Reitzel, Charles
Grouping does tend to be expensive.   Our regular queries typically return in 
10-15ms while the grouping queries take 60-80ms in a test environment ( 1M 
docs).

This is ok for us, since we wrote our app to take the grouping queries out of 
the critical path (async query in parallel with two primary queries and some 
work in middle tier).   But this approach is unlikely to work for most cases.

-Original Message-
From: Reitzel, Charles [mailto:charles.reit...@tiaa-cref.org] 
Sent: Friday, June 19, 2015 9:52 AM
To: solr-user@lucene.apache.org
Subject: RE: How to do a Data sharding for data in a database table

Hi Wenbin,

To me, your instance appears well provisioned.  Likewise, your analysis of test 
vs. production performance makes a lot of sense.  Perhaps your time would be 
well spent tuning the query performance for your app before resorting to 
sharding?   

To that end, what do you see when you set debugQuery=true?   Where does Solr 
spend the time?   My guess would be in the grouping and sorting steps, but 
which?   Sometimes the schema details matter for performance.   Folks on this 
list can help with that.

-Charlie

-Original Message-
From: Wenbin Wang [mailto:wwang...@gmail.com]
Sent: Friday, June 19, 2015 7:55 AM
To: solr-user@lucene.apache.org
Subject: Re: How to do a Data sharding for data in a database table

I have enough RAM (30G) and hard disk (1000G). It is not I/O bound or disk 
bound. In addition, Solr was started with a maximum of 4G for the JVM, and the 
index size is < 2G. In a typical test, I made sure enough free RAM of 10G was 
available. I have not tuned any parameter in the configuration; it is the 
default configuration.

The number of fields for each record is around 10, and the number of results to 
be returned per page is 30. So the response time should not be affected by 
network traffic, and it is tested on the same machine. The query has a list of 
4 search parameters, and each parameter takes a list of values or a date range. 
The results will also be grouped and sorted. The response time of a typical 
single request is around 1 second. It can be > 1 second with more demanding 
requests.

In our production environment, we have 64 cores, and we need to support 
300 concurrent users, that is, about 300 concurrent requests per second. Each 
core will have to process about 5 requests per second. The response time under 
this load will not be 1 second any more. My estimate is that an average 200 
ms response time for a single request would be able to handle 
300 concurrent users in production. There is no plan to increase the total 
number of cores 5 times.

In a previous test, a search index of around 6M records was able to handle 
5 requests per second on each core of my 8-core machine.

By sharding the data from one single index of 13M records into 2 indexes of 6 or 7 
M each, I am expecting a much faster response time that can meet the demands of 
the production environment. That is the motivation for doing data sharding. 
However, I am also open to a solution that can improve the performance of the 
13M to 14M index so that I do not need to do data sharding.





On Fri, Jun 19, 2015 at 12:39 AM, Erick Erickson erickerick...@gmail.com
wrote:

 You've repeated your original statement. Shawn's observation is that 
 10M docs is a very small corpus by Solr standards. You either have 
 very demanding document/search combinations or you have a poorly tuned 
 Solr installation.

 On reasonable hardware I expect 25-50M documents to have sub-second 
 response time.

 So what we're trying to do is be sure this isn't an XY problem, from 
 Hossman's apache page:

 Your question appears to be an XY Problem ... that is: you are 
 dealing with X, you are assuming Y will help you, and you are asking 
 about Y
 without giving more details about the X so that we can understand 
 the full issue.  Perhaps the best solution doesn't involve Y at all?
 See Also: http://www.perlmonks.org/index.pl?node_id=542341

 So again, how would you characterize your documents? How many fields? 
 What do queries look like? How much physical memory on the machine? 
 How much memory have you allocated to the JVM?

 You might review:
 http://wiki.apache.org/solr/UsingMailingLists


 Best,
 Erick

 On Thu, Jun 18, 2015 at 3:23 PM, wwang525 wwang...@gmail.com wrote:
  The query without load is still under 1 second. But under load, 
  response
 time
  can be much longer due to the queued up query.
 
  We would like to shard the data to something like 6 M / shard, which 
  will still give a under 1 second response time under load.
 
  What are some best practice to shard the data? for example, we could
 shard
  the data by date range, but that is pretty dynamic, and we could 
  shard
 data
  by some other properties, but if the data is not evenly distributed, 
  you
 may
  not be able shard it anymore.
 
 
 
  --
  View this message in context:
 http://lucene.472066.n3.nabble.com/How-to-do-a-Data-sharding-for-data-

Re: understanding collapsingQParser with facet vs group.facet

2015-06-19 Thread Upayavira
On Fri, Jun 19, 2015, at 06:20 AM, Derek Poh wrote:
 Hi
 
 I read that collapsingQParser returns the facet counts the same as 
 group.truncate=true and has this issue where the facet count and the 
 after-filter facet count are not the same.
 Using group.facet does not have this issue but its performance is very 
 bad compared to collapsingQParser.
 
 I am trying to understand why collapsingQParser behaves this way and will 
 need to explain it to management.
 
 Can someone explain how collapsingQParser calculates the facet 
 counts compared to group.facet?

I'm not familiar with group.facet. But to compare traditional grouping
to the collapsingQParser - in traditional grouping, all matching
documents remain in the result set, but they are grouped for output
purposes. However, the collapsingQParser is actually a query filter. It
will reduce the number of matching results. Any faceting that happens
will happen on the filtered results.

I wonder if you can use this syntax to achieve faceting alongside
collapsing:

q=whatever
fq={!collapse tag=collapse}blah
facet.field={!ex=collapse}my_facet_field

This way, you get the benefits of the CollapsingQParserPlugin, with full
faceting on the uncollapsed resultset.

I've no idea how this would perform, but I'd expect it to be better than
the grouping option.

Upayavira
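For reference, a fuller sketch of the parameters that approach implies, with the collapse field inside the local params (field names here are placeholders). Note that, as discussed elsewhere in this thread, excluding the collapse filter means the facet counts come from the uncollapsed document set, which is not the same thing as group.facet:

  q=whatever
  fq={!collapse field=group_field tag=collapse}
  facet=true
  facet.field={!ex=collapse}my_facet_field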


Re: Auto-suggest in Solr

2015-06-19 Thread Alessandro Benedetti
Actually the documentation is not clear enough.
Let's try to understand this suggester.

*Building*
This suggester builds an FST that it will use to provide the autocomplete
feature, running prefix searches on it.
The terms it uses to generate the FST are the tokens produced by the
 suggestFreeTextAnalyzerFieldType.

And this should be correct.
So if we have a shingle token filter[1-3] ( we produce unigrams as well) in
our analysis to keep it simple , from these original field values :
mp3 ipod
mp3 player
mp3 player ipod
player of Real

- we produce this list of possible suggestions in our FST:

mp3
player
ipod
real
of

mp3 ipod
mp3 player
player ipod
player of
of real

mp3 player ipod
player of real

From the documentation I read :

  ngrams: The max number of tokens out of which singles will be make the
 dictionary. The default value is 2. Increasing this would mean you want
 more than the previous 2 tokens to be taken into consideration when making
 the suggestions. 


This makes me confused, as I was not expecting this param to affect the
suggestion dictionary.
So I would like a clarification here from our masters :)
At this point let's see what happens at query time .

*Query Time*
As I understand it, the ngrams param will consider the last N-1 tokens
the user typed, separated by the space separator.

Builds an ngram model from the text sent to {@link
 * #build} and predicts based on the last grams-1 tokens in
 * the request sent to {@link #lookup}. This tries to
 * handle the long tail of suggestions for when the
 * incoming query is a never before seen query string.


Example: grams=3 should consider only the last 2 tokens

special mp3 p -> mp3 p

Then this query is analysed using the suggestFreeTextAnalyzerFieldType .
We produce 3 tokens :
mp3
p
mp3 p

And we run the prefix matching on the FST .

*Conclusion*
My understanding is wrong for sure at some point, as the behaviour I get is
different.
Can we discuss this , clarify this and eventually put it in the official
documentation ?

Cheers

2015-06-19 6:40 GMT+01:00 Zheng Lin Edwin Yeo edwinye...@gmail.com:

 I'm implementing an auto-suggest feature in Solr, and I'll like to achieve
 the follwing:

 For example, if the user enters mp3, Solr might suggest mp3 player,
 mp3 nano and mp3 music.
 When the user enters mp3 p, the suggestion should narrow down to mp3
 player.

 Currently, when I type mp3 p, the suggester is returning words that
 starts with the letter p only, and I'm getting results like plan,
 production, etc, and it does not take the mp3 token into consideration.

 I'm using Solr 5.1 and below is my configuration:

 In solrconfig.xml:

 <searchComponent name="suggest" class="solr.SuggestComponent">
   <lst name="suggester">
     <str name="lookupImpl">FreeTextLookupFactory</str>
     <str name="indexPath">suggester_freetext_dir</str>
     <str name="dictionaryImpl">DocumentDictionaryFactory</str>
     <str name="field">Suggestion</str>
     <str name="weightField">Project</str>
     <str name="suggestFreeTextAnalyzerFieldType">suggestType</str>
     <int name="ngrams">5</int>
     <str name="buildOnStartup">false</str>
     <str name="buildOnCommit">false</str>
   </lst>
 </searchComponent>


 In schema.xml

 <fieldType name="suggestType" class="solr.TextField"
   positionIncrementGap="100">
   <analyzer type="index">
     <charFilter class="solr.PatternReplaceCharFilterFactory"
       pattern="[^a-zA-Z0-9]" replacement=" "/>
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.ShingleFilterFactory" minShingleSize="2"
       maxShingleSize="6" outputUnigrams="false"/>
   </analyzer>
   <analyzer type="query">
     <charFilter class="solr.PatternReplaceCharFilterFactory"
       pattern="[^a-zA-Z0-9]" replacement=" "/>
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.ShingleFilterFactory" minShingleSize="2"
       maxShingleSize="6" outputUnigrams="true"/>
   </analyzer>
 </fieldType>


 Is there anything that I configured wrongly?


 Regards,
 Edwin




-- 
--

Benedetti Alessandro
Visiting card : http://about.me/alessandro_benedetti

Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?

William Blake - Songs of Experience -1794 England


Re: understanding collapsingQParser with facet vs group.facet

2015-06-19 Thread Joel Bernstein
The CollapsingQParserPlugin currently doesn't calculate facets at all. It
simply collapses the document set. The facets are then calculated only on
the group heads.

Grouping has special faceting code built into it that supports the
group.facet functionality.

Joel Bernstein
http://joelsolr.blogspot.com/

On Fri, Jun 19, 2015 at 6:20 AM, Derek Poh d...@globalsources.com wrote:

 Hi

 I read that collapsingQParser returns the facet counts the same as
 group.truncate=true and has this issue where the facet count and the after-
 filter facet count are not the same.
 Using group.facet does not have this issue but its performance is very
 bad compared to collapsingQParser.

 I am trying to understand why collapsingQParser behaves this way and will need
 to explain it to management.

 Can someone explain how collapsingQParser calculates the facet counts compared
 to group.facet?

 Thank you,
 Derek





Re: understanding collapsingQParser with facet vs group.facet

2015-06-19 Thread Joel Bernstein
Unfortunately this won't give you group.facet results:

q=whatever
fq={!collapse tag=collapse}blah
facet.field={!ex=collapse}my_facet_field

This will give you the expanded facet counts as it removes the collapse
filter.

A good explanation of group.facets is here:

http://blog.trifork.com/2012/04/10/faceting-result-grouping/










Joel Bernstein
http://joelsolr.blogspot.com/

On Fri, Jun 19, 2015 at 7:11 AM, Upayavira u...@odoko.co.uk wrote:

 On Fri, Jun 19, 2015, at 06:20 AM, Derek Poh wrote:
  Hi
 
  I read that collapsingQParser returns the facet counts the same as
  group.truncate=true and has this issue where the facet count and the
  after-filter facet count are not the same.
  Using group.facet does not have this issue but its performance is very
  bad compared to collapsingQParser.
 
  I am trying to understand why collapsingQParser behaves this way and will
  need to explain it to management.
 
  Can someone explain how collapsingQParser calculates the facet
  counts compared to group.facet?

 I'm not familiar with group.facet. But to compare traditional grouping
 to the collapsingQParser - in traditional grouping, all matching
 documents remain in the result set, but they are grouped for output
 purposes. However, the collapsingQParser is actually a query filter. It
 will reduce the number of matching results. Any faceting that happens
 will happen on the filtered results.

 I wonder if you can use this syntax to achieve faceting alongside
 collapsing:

 q=whatever
 fq={!collapse tag=collapse}blah
 facet.field={!ex=collapse}my_facet_field

 This way, you get the benefits of the CollapsingQParserPlugin, with full
 faceting on the uncollapsed resultset.

 I've no idea how this would perform, but I'd expect it to be better than
 the grouping option.

 Upayavira



Error: Could not create instance of 'SolrInputDocument'

2015-06-19 Thread Paul Revere
We are running PaperThin's CommonSpot CMS in a Cold Fusion 10 and MS SQL Server 
2008 R2 environment. We're using Apache Solr 4.10.4 vice Cold Fusion's Solr. We 
can create (and delete) collections through the CS CMS; they appear in (and 
disappear from) both the physical file structure as well as the Apache Solr 
dashboard. When we try indexing a collection through our CS CMS, it appears 
that each member is being indexed, however, each member errors out 
[Error.see logs] and indexing continues to the next member, only to error 
out again, etc., etc., etc. Eventually the entire collection is indexed in 
this fashion, and we received a message that the collection has been indexed 
and optimized. Our keyword search fails, returning 0 results.

Our log files show entries for each member indexed:

Error: Could not create instance of 'SolrInputDocument'.
~~
Exception: org.apache.solr.common.SolrInputDocument

We're obviously missing something, but this is our first time using Apache Solr 
and we aren't sure where things may be broken.
Many thanks for any/all recommendations/guidance.

Thanks!

Paul R.



Re: Error when submitting PDF to Solr w/text fields using SolrJ

2015-06-19 Thread Alessandro Benedetti
I definitely agree with Erick: again, the stack trace you posted is not
complete.
Below is an example of the same problem you got, but with a complete, meaningful
stack trace:

Stacktrace you provided :

org.apache.solr.common.SolrException: Exception writing document id 12345
 to the index; possible analysis error.
 at
 org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:168)
 at
 org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:69)
 at
 org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:51)
 at
 org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProcessor.java:870)
 at
 org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:1024)
 at
 org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:693)
 …
 -- Important stack trace follows !!
 Caused by: java.lang.IllegalArgumentException: input AttributeSource must
 not be null
 at org.apache.lucene.util.AttributeSource.<init>(AttributeSource.java:94)
 at org.apache.lucene.analysis.TokenStream.<init>(TokenStream.java:106)
 at org.apache.lucene.analysis.TokenFilter.<init>(TokenFilter.java:33)
 at
 org.apache.lucene.analysis.util.FilteringTokenFilter.<init>(FilteringTokenFilter.java:70)
 at org.apache.lucene.analysis.core.StopFilter.<init>(StopFilter.java:60)
 at
 org.apache.lucene.analysis.core.StopFilterFactory.create(StopFilterFactory.java:127)
 at
 org.apache.solr.analysis.TokenizerChain.createComponents(TokenizerChain.java:67)
 at
 org.apache.lucene.analysis.AnalyzerWrapper.createComponents(AnalyzerWrapper.java:102)
 at org.apache.lucene.analysis.Analyzer.tokenStream(Analyzer.java:180)
 at org.apache.lucene.document.Field.tokenStream(Field.java:554)
 at
 org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:597)
 at
 org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:342)
 at
 org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:301)
 at
 org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:222)
 at
 org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:450)
 at
 org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1507)
 at
 org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:240)
 at
 org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:164)
 ... 35 more
 ,


If you give us the full stack trace, I am pretty sure we can help.


Cheers


2015-06-19 5:31 GMT+01:00 Erick Erickson erickerick...@gmail.com:

 The stack trace is what gets returned to the client, right? It's often
 much more informative to see the Solr log output, the error message
 is often much more helpful there. By the time the exception bubbles
 up through the various layers vital information is sometimes not returned
 to the client in the error message.

 One precaution I would take since you've changed the schema is to
 _completely_ remove the index.
 1 shut down Solr
 2 rm -rf coreX/data
 3 restart Solr.
 4 try it again.

 Lucene doesn't really care at all whether a field gets indexed one way in
 one document and another way in the next document and occasionally
 having fields indexed different ways (string and text) in different
 documents
 at the same time confuses things.

 Best,
 Erick

 On Thu, Jun 18, 2015 at 10:31 AM, Paden rumsey...@gmail.com wrote:
  Just rolling out a little bit more information as it is coming. I
 changed the
  field type in the schema to text_general and that didn't change a thing.
 
  Another thing is that it's consistently submitting/not submitting the
 same
  documents. I will run over it one time and it won't index a set of
  documents. When I clear the index and run the program again it
  submits/doesn't submit the same documents.
 
  And it will index certain PDF's it just won't index others. Which is
 weird
  because I printed the strings that are submitted to Solr and the ones
 that
  get submitted are really similar to the ones that aren't submitted.
 
  I can't post the actual strings for sensitivity reasons.
 
 
 
  --
  View this message in context:
 http://lucene.472066.n3.nabble.com/Error-when-submitting-PDF-to-Solr-w-text-fields-using-SolrJ-tp4212704p4212757.html
  Sent from the Solr - User mailing list archive at Nabble.com.




-- 
--

Benedetti Alessandro
Visiting card : http://about.me/alessandro_benedetti

Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?

William Blake - Songs of Experience -1794 England


Re: understanding collapsingQParser with facet vs group.facet

2015-06-19 Thread Joel Bernstein
The AnalyticsQuery can be used to implement custom faceting modules. This
would allow you to calculate facets counts in an algorithm similar to
group.facets before the result set is collapsed. If you are in distributed
mode you will also need to implement a merge strategy:

http://heliosearch.org/solrs-new-analyticsquery-api/
http://heliosearch.org/solrs-mergestrategy/
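
To make the shape of that concrete, here is a very rough, untested sketch of the
pattern those two posts describe (the class name, the "myCustomStat" response key
and the simple counting logic are placeholders only; a real group.facet-style
module would tally facet values per collapse group instead):

import java.io.IOException;

import org.apache.lucene.search.IndexSearcher;
import org.apache.solr.handler.component.ResponseBuilder;
import org.apache.solr.search.AnalyticsQuery;
import org.apache.solr.search.DelegatingCollector;

public class GroupFacetAnalyticsQuery extends AnalyticsQuery {

  @Override
  public DelegatingCollector getAnalyticsCollector(final ResponseBuilder rb,
                                                   IndexSearcher searcher) {
    return new DelegatingCollector() {
      private int count;

      @Override
      public void collect(int doc) throws IOException {
        // If this query is ordered ahead of the collapse post-filter (lower cost),
        // the full, uncollapsed result set streams through here.
        count++;
        super.collect(doc);   // hand the doc down the collector chain
      }

      @Override
      public void finish() throws IOException {
        rb.rsp.add("myCustomStat", count);   // placeholder output
        super.finish();
      }
    };
  }
}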

Joel Bernstein
http://joelsolr.blogspot.com/

On Fri, Jun 19, 2015 at 7:28 AM, Joel Bernstein joels...@gmail.com wrote:

 Unfortunately this won't give you group.facet results:

 q=whatever
 fq={!collapse tag=collapse}blah
 facet.field={!ex=collapse}my_facet_field

 This will give you the expanded facet counts as it removes the collapse
 filter.

 A good explanation of group.facets is here:

 http://blog.trifork.com/2012/04/10/faceting-result-grouping/










 Joel Bernstein
 http://joelsolr.blogspot.com/

 On Fri, Jun 19, 2015 at 7:11 AM, Upayavira u...@odoko.co.uk wrote:

 On Fri, Jun 19, 2015, at 06:20 AM, Derek Poh wrote:
  Hi
 
  I read that collapsingQParser returns the facet count the same as
  group.truncate=true and has this issue where the facet count and the
  after-filter facet count are not the same.
  Using group.facet does not have this issue but its performance is very
  bad compared to collapsingQParser.
 
  I am trying to understand why collapsingQParser behaves this way and will
  need to explain it to management.
 
  Can someone explain how collapsingQParser calculates the facet
  counts compared to group.facet?

 I'm not familiar with group.facet. But to compare traditional grouping
 to the collapsingQParser - in traditional grouping, all matching
 documents remain in the result set, but they are grouped for output
 purposes. However, the collapsingQParser is actually a query filter. It
 will reduce the number of matching results. Any faceting that happens
 will happen on the filtered results.

 I wonder if you can use this syntax to achieve faceting alongside
 collapsing:

 q=whatever
 fq={!collapse tag=collapse}blah
 facet.field={!ex=collapse}my_facet_field

 This way, you get the benefits of the CollapsingQParserPlugin, with full
 faceting on the uncollapsed resultset.

 I've no idea how this would perform, but I'd expect it to be better than
 the grouping option.

 Upayavira





Re: How to do a Data sharding for data in a database table

2015-06-19 Thread Wenbin Wang
I have enough RAM (30G) and hard disk (1000G). It is not I/O bound or
disk bound. In addition, Solr was started with a maximum of 4G for the
JVM, and the index size is < 2G. In a typical test, I made sure enough free RAM
of 10G was available. I have not tuned any parameter in the configuration;
it is the default configuration.

The number of fields for each record is around 10, and the number of
results to be returned per page is 30. So the response time should not be
affected by network traffic, and it is tested on the same machine. The
query has a list of 4 search parameters, and each parameter takes a list of
values or a date range. The results will also be grouped and sorted. The
response time of a typical single request is around 1 second. It can be > 1
second with more demanding requests.

In our production environment, we have 64 cores, and we need to support >
300 concurrent users, that is, about 300 concurrent requests per second. Each
core will have to process about 5 requests per second. The response time
under this load will not be 1 second any more. My estimate is that an
average response time of 200 ms for a single request would be able to handle
> 300 concurrent users in production. There is no plan to increase the total
number of cores 5 times.

In a previous test, a search index of around 6M records was able to handle >
5 requests per second on each core of my 8-core machine.

By doing data sharding from one single index of 13M to 2 indexes of 6 or 7
M each, I am expecting a much faster response time that can meet the demand
of the production environment. That is the motivation for doing data sharding.
However, I am also open to solutions that can improve the performance of the
index of 13M to 14M records so that I do not need to do data sharding.





On Fri, Jun 19, 2015 at 12:39 AM, Erick Erickson erickerick...@gmail.com
wrote:

 You've repeated your original statement. Shawn's
 observation is that 10M docs is a very small corpus
 by Solr standards. You either have very demanding
 document/search combinations or you have a poorly
 tuned Solr installation.

 On reasonable hardware I expect 25-50M documents to have
 sub-second response time.

 So what we're trying to do is be sure this isn't
 an XY problem, from Hossman's apache page:

 Your question appears to be an XY Problem ... that is: you are dealing
 with X, you are assuming Y will help you, and you are asking about Y
 without giving more details about the X so that we can understand the
 full issue.  Perhaps the best solution doesn't involve Y at all?
 See Also: http://www.perlmonks.org/index.pl?node_id=542341

 So again, how would you characterize your documents? How many
 fields? What do queries look like? How much physical memory on the
 machine? How much memory have you allocated to the JVM?

 You might review:
 http://wiki.apache.org/solr/UsingMailingLists


 Best,
 Erick

 On Thu, Jun 18, 2015 at 3:23 PM, wwang525 wwang...@gmail.com wrote:
  The query without load is still under 1 second. But under load, response time
  can be much longer due to the queued-up queries.
 
  We would like to shard the data to something like 6 M / shard, which will
  still give an under-1-second response time under load.
 
  What are some best practices to shard the data? For example, we could shard
  the data by date range, but that is pretty dynamic, and we could shard data
  by some other properties, but if the data is not evenly distributed, you may
  not be able to shard it any further.
 
 
 
  --
  View this message in context:
 http://lucene.472066.n3.nabble.com/How-to-do-a-Data-sharding-for-data-in-a-database-table-tp4212765p4212803.html
  Sent from the Solr - User mailing list archive at Nabble.com.



Migration from Solr 4.7.1 to SolrCloud 5.1

2015-06-19 Thread shacky
Hi.
I have an old index running on a standalone Solr 4.7.1 and I have to
migrate its index to my new SolrCloud 5.1 installation.
I'm looking for some way to do this but I'm a little confused.
Could you help me please?
Thank you very much!
Bye


Re: ZooKeeper connection refused

2015-06-19 Thread shacky
2015-06-17 16:11 GMT+02:00 Shalin Shekhar Mangar shalinman...@gmail.com:
 Is ZK healthy? Can you try the following from the server on which Solr
 is running:

 echo ruok | nc zk1 2181

Thank you very much Shalin for your answer!
My ZK cluster was not ready because two nodes was dead and only one
node was running.
I fixed the two nodes and now all works good.
Thank you very much!


RE: Solr Logging

2015-06-19 Thread Garth Grimm
Framework way?

Maybe try delving into the log4j framework and modify the log4j.properties 
file.  You can generate different log files based upon what class generated the 
message.  Here's an example that I experimented with previously, it generates 
an update log, and 2 different query logs with slightly different information 
about each query.

Adding a component to each requestHandler dedicated to logging might be the 
best way, but that might not qualify as a framework way, and I've never tried 
anything like that, so don't know how easy it might be.

Just sending the relevant lines from log4j.properties, excluding the lines  
that are there by default.

# Logger for updates
log4j.logger.org.apache.solr.update.processor.LogUpdateProcessor=INFO, Updates

#- size rotation with log cleanup.
log4j.appender.Updates=org.apache.log4j.RollingFileAppender
log4j.appender.Updates.MaxFileSize=4MB
log4j.appender.Updates.MaxBackupIndex=9

#- File to log to and log format
log4j.appender.Updates.File=${solr.log}/solr_Updates.log
log4j.appender.Updates.layout=org.apache.log4j.PatternLayout
log4j.appender.Updates.layout.ConversionPattern=%-5p - %d{yyyy-MM-dd 
HH:mm:ss.SSS}; %C; %m\n

# Logger for queries, using SolrDispatchFilter
log4j.logger.org.apache.solr.servlet.SolrDispatchFilter=DEBUG, queryLog1

#- size rotation with log cleanup.
log4j.appender.queryLog1=org.apache.log4j.RollingFileAppender
log4j.appender.queryLog1.MaxFileSize=4MB
log4j.appender.queryLog1.MaxBackupIndex=9

#- File to log to and log format
log4j.appender.queryLog1.File=${solr.log}/solr_queryLog1.log
log4j.appender.queryLog1.layout=org.apache.log4j.PatternLayout
log4j.appender.queryLog1.layout.ConversionPattern=%-5p - %d{yyyy-MM-dd 
HH:mm:ss.SSS}; %C; %m\n

# Logger for queries, using SolrCore
log4j.logger.org.apache.solr.core.SolrCore=INFO, queryLog2

#- size rotation with log cleanup.
log4j.appender.queryLog2=org.apache.log4j.RollingFileAppender
log4j.appender.queryLog2.MaxFileSize=4MB
log4j.appender.queryLog2.MaxBackupIndex=9

#- File to log to and log format
log4j.appender.queryLog2.File=${solr.log}/solr_queryLog2.log
log4j.appender.queryLog2.layout=org.apache.log4j.PatternLayout
log4j.appender.queryLog2.layout.ConversionPattern=%-5p - %d{yyyy-MM-dd 
HH:mm:ss.SSS}; %C; %m\n


-Original Message-
From: rbkumar88 [mailto:rbkuma...@gmail.com] 
Sent: Thursday, June 18, 2015 10:41 AM
To: solr-user@lucene.apache.org
Subject: Solr Logging

Hi,

I want to log Solr search queries/response time and Solr indexing log 
separately in different set of log files.
Is there any convenient framework/way to do it.

Thanks
Bharath



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-Logging-tp4212730.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: How to append new data to index i solr?

2015-06-19 Thread Mikhail Khludnev
It does. Absolutely. But it depends on what you mean by it. Start from
http://wiki.apache.org/solr/UpdateXmlMessages#add.2Freplace_documents
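
For example, a minimal add message (the field names below are just placeholders)
POSTed to /solr/<corename>/update and followed by a commit will append the new
document; if a document with the same uniqueKey already exists, it is replaced
instead:

<add>
  <doc>
    <field name="id">new-doc-1</field>
    <field name="title">a newly appended document</field>
  </doc>
</add>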

On Fri, Jun 19, 2015 at 7:54 AM, 步青云 mailliup...@qq.com wrote:

 Hello,
  I'm a solr user with some question. I want to append new data to the
 existing index. Does Solr support to append new data to index?
  Thanks for any reply.
 Best wishes.
 Jason




-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

http://www.griddynamics.com
mkhlud...@griddynamics.com


Re: Solr 5.2.1 on Solaris

2015-06-19 Thread Ramkumar R. Aiyengar
Please open a JIRA with details of what the issues are, we should try to
support this..
On 18 Jun 2015 15:07, Bence Vass bence.v...@inso.tuwien.ac.at wrote:

 Hello,

 Is there any documentation on how to start Solr 5.2.1 on Solaris (Solaris
 10)? The script (solr start) doesn't work out of the box. Is anyone running
 Solr 5.x on Solaris?

 - Thanks



Distributed Search component question

2015-06-19 Thread Mihran Shahinian
Hi all,
I have the following search components, and at the moment I don't have a
solution to get them working in distributed mode on Solr 4.10.4.

[standard query component]
[search component-1] (StageID - 2500):
 handleResponses: get a few values from docs and populate parameters for
stats component and set some metadata in the ResponseBuilder
  rb.rsp.add(metadata, NamedList...)

distributedProcess:
   rb.doFacets=false;
   if (rb.stage < StageID)
  if( null == rb.rsp[metadata] ) {
   return StageID;
   }
return component-2.StageID

[search component-2] (StageID - 2800):
distributedProcess:
   rb.doFacets=true;
   formatAndSet some facetParams based on rb.rsp[metadata]
   return ResponseBuilder.STAGE_GET_FIELDS

[standard facet component]:


Things seem to work fine between component-1 and component-2, I just can't
prevent facets from running
until component-2 sets the proper facet params. And then the facet component sets
the rb._facetInfo to null. Should I move my logic in component-2 from
distributedProcess to handleResponses, and modify the ShardRequest and set
rb.addRequest?

Any hints are much appreciated.
Mihran


Re: Error when submitting PDF to Solr w/text fields using SolrJ

2015-06-19 Thread Alessandro Benedetti
So, the first thing I can say is: if it is true that it "almost killed Solr with
280 files", you are doing something wrong for sure.
At least if you are not trying to index 4k full movies xD

Joking apart:
1) You should carefully design your analyser.
2) You should store your fields initially to verify you index what you were
supposed to (in number and in content).
Assuming you are a beginner, storing the fields will make it easier for you to
check, as they will pop out of the results.

Is at least the number of docs indexed correct?


2015-06-19 15:34 GMT+01:00 Paden rumsey...@gmail.com:

 Yeah, actually changing the field to text_en or text_en_splitting
 actually made it so my indexer indexed all my files. The only problem is, I
 don't think it's doing it well.

 I have two Cores that I'm working with. Both of them have indexed the same
 set of files. The first core, which I will refer to as Testcore, I used a
 DIH configuration that indexed the files with their metadata. (It indexed
 everything fine but it almost killed Solr with 280 files I would hate to
 see
 what would happen with say, 10,000 files.). When I query Testcore on some
 random common word like a it returns like 279 files. A good margin I can
 accept that.

 The second core, which I will refer to as Testcore2, I used my own indexer
 that I created and use SolrJ as the client. It indexes everything. However,
 when I query on the same word a it only returns 208 of the 281 files.
 Which is weird cause I'm using the exact same Querying handler for both. So
 I don't think a comprehensive indexed text is being sent to Solr.





 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Error-when-submitting-PDF-to-Solr-w-text-fields-using-SolrJ-tp4212704p4212933.html
 Sent from the Solr - User mailing list archive at Nabble.com.




-- 
--

Benedetti Alessandro
Visiting card : http://about.me/alessandro_benedetti

Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?

William Blake - Songs of Experience -1794 England


Re: Error when submitting PDF to Solr w/text fields using SolrJ

2015-06-19 Thread Paden
Yeah, actually changing the field to text_en or text_en_splitting
actually made it so my indexer indexed all my files. The only problem is, I
don't think it's doing it well. 

I have two Cores that I'm working with. Both of them have indexed the same
set of files. The first core, which I will refer to as Testcore, I used a
DIH configuration that indexed the files with their metadata. (It indexed
everything fine but it almost killed Solr with 280 files I would hate to see
what would happen with say, 10,000 files.). When I query Testcore on some
random common word like a it returns like 279 files. A good margin I can
accept that. 

The second core, which I will refer to as Testcore2, I used my own indexer
that I created and use SolrJ as the client. It indexes everything. However,
when I query on the same word a it only returns 208 of the 281 files.
Which is weird cause I'm using the exact same Querying handler for both. So
I don't think a comprehensive indexed text is being sent to Solr. 





--
View this message in context: 
http://lucene.472066.n3.nabble.com/Error-when-submitting-PDF-to-Solr-w-text-fields-using-SolrJ-tp4212704p4212933.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Error: Could not create instance of 'SolrInputDocument'

2015-06-19 Thread Shawn Heisey
On 6/19/2015 5:40 AM, Paul Revere wrote:
 Our log files show entries for each member indexed:
 
 Error: Could not create instance of 'SolrInputDocument'.
 ~~
 Exception: org.apache.solr.common.SolrInputDocument

There will be a *lot* more detail available on this exception.  We will
need all of it, including all caused by information.  It can be dozens
of lines and include multiple caused by clauses, each of which will
have a stacktrace.  Your message indicates that it is Solr 4.10.4 ...
hopefully it is unmodified.  That information is critical in comparing
the exception stacktrace to the source code.

There might also be additional information in the logs that is
immediately before or after this message.  You might need to go to the
Solr logfile instead of your application's logfile for more information.

http://wiki.apache.org/solr/UsingMailingLists

Thanks,
Shawn



Re: Migration from Solr 4.7.1 to SolrCloud 5.1

2015-06-19 Thread Erick Erickson
You really have to ask more specific questions here. What
are you confused _about_? Have
you gone through the tutorial? Read the Solr In Action book?
Tried _anything_?


Best,
Erick

On Fri, Jun 19, 2015 at 5:02 AM, shacky shack...@gmail.com wrote:
 Hi.
 I have an old index running on a standalone Solr 4.7.1 and I have to
 migrate its index to my new SolrCloud 5.1 installation.
 I'm looking for some way to do this but I'm a little confused.
 Could you help me please?
 Thank you very much!
 Bye


Re: How to do a Data sharding for data in a database table

2015-06-19 Thread Erick Erickson
First and most obvious thing to try:

bq: the Solr was started with maximal 4G for JVM, and index size is  2G

Bump your JVM to 8G, perhaps 12G. The size of the index on disk is very
loosely coupled to JVM requirements. It's quite possible that you're spending
all your time in GC cycles. Consider gathering GC characteristics, see:
http://lucidworks.com/blog/garbage-collection-bootcamp-1-0/
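
For example, something along these lines in the Solr start parameters (heap size
and log path are only illustrative, and the exact flags depend on the JVM
version) will produce a GC log you can inspect:

-Xms8g -Xmx8g
-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps
-Xloggc:/var/solr/logs/solr_gc.log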

As Charles says, on the face of it the system you describe should handle quite
a load, so it feels like things can be tuned and you won't have to
resort to sharding.
Sharding inevitably imposes some overhead so it's best to go there last.

From my perspective, this is, indeed, an XY problem. You're assuming
that sharding
is your solution. But you really haven't identified the _problem_ other than
queries are too slow. Let's nail down the reason queries are taking
a second before
jumping into sharding. I've just spent too much of my life fixing the
wrong thing ;)

It would be useful to see a couple of sample queries so we can get a
feel for how complex they
are. Especially if you append, as Charles mentions, debug=true

Best,
Erick

On Fri, Jun 19, 2015 at 7:02 AM, Reitzel, Charles
charles.reit...@tiaa-cref.org wrote:
 Grouping does tend to be expensive.   Our regular queries typically return in 
 10-15ms while the grouping queries take 60-80ms in a test environment ( 1M 
 docs).

 This is ok for us, since we wrote our app to take the grouping queries out of 
 the critical path (async query in parallel with two primary queries and some 
 work in middle tier).   But this approach is unlikely to work for most cases.

 -Original Message-
 From: Reitzel, Charles [mailto:charles.reit...@tiaa-cref.org]
 Sent: Friday, June 19, 2015 9:52 AM
 To: solr-user@lucene.apache.org
 Subject: RE: How to do a Data sharding for data in a database table

 Hi Wenbin,

 To me, your instance appears well provisioned.  Likewise, your analysis of 
 test vs. production performance makes a lot of sense.  Perhaps your time 
 would be well spent tuning the query performance for your app before 
 resorting to sharding?

 To that end, what do you see when you set debugQuery=true?   Where does solr 
 spend the time?   My guess would be in the grouping and sorting steps, but 
 which?   Sometime the schema details matter for performance.   Folks on this 
 list can help with that.

 -Charlie

 -Original Message-
 From: Wenbin Wang [mailto:wwang...@gmail.com]
 Sent: Friday, June 19, 2015 7:55 AM
 To: solr-user@lucene.apache.org
 Subject: Re: How to do a Data sharding for data in a database table

 I have enough RAM (30G) and Hard disk (1000G). It is not I/O bound or 
 computer disk bound. In addition, the Solr was started with maximal 4G for 
 JVM, and index size is  2G. In a typical test, I made sure enough free RAM 
 of 10G was available. I have not tuned any parameter in the configuration, it 
 is default configuration.

 The number of fields for each record is around 10, and the number of results 
 to be returned per page is 30. So the response time should not be affected by 
 network traffic, and it is tested in the same machine. The query has a list 
 of 4 search parameters, and each parameter takes a list of values or date 
 range. The results will also be grouped and sorted. The response time of a 
 typical single request is around 1 second. It can be  1 second with more 
 demanding requests.

 In our production environment, we have 64 cores, and we need to support 
 300 concurrent users, that is about 300 concurrent request per second. Each 
 core will have to process about 5 request per second. The response time under 
 this load will not be 1 second any more. My estimate is that an average of 
 200 ms response time of a single request would be able to handle
 300 concurrent users in production. There is no plan to increase the total 
 number of cores 5 times.

 In a previous test, a search index around 6M data size was able to handle 
 5 request per second in each core of my 8-core machine.

 By doing data sharding from one single index of 13M to 2 indexes of 6 or 7 
 M/each, I am expecting much faster response time that can meet the demand of 
 production environment. That is the motivation of doing data sharding.
 However, I am also open to solution that can improve the performance of the  
 index of 13M to 14M size so that I do not need to do a data sharding.





 On Fri, Jun 19, 2015 at 12:39 AM, Erick Erickson erickerick...@gmail.com
 wrote:

 You've repeated your original statement. Shawn's observation is that
 10M docs is a very small corpus by Solr standards. You either have
 very demanding document/search combinations or you have a poorly tuned
 Solr installation.

 On reasonable hardware I expect 25-50M documents to have sub-second
 response time.

 So what we're trying to do is be sure this isn't an XY problem, from
 Hossman's apache page:

 Your question appears to be an XY Problem ... that is: you are
 dealing with 

Re: Error when submitting PDF to Solr w/text fields using SolrJ

2015-06-19 Thread Erick Erickson
You really, really, really want to get friendly with the
admin/analysis page for questions like:

bq: You're probably right though. I probably have to create a better analyzer

really ;).

It shows you exactly what each link in your analysis chain does to the
input. Perhaps 75% or
the questions about why am I getting the results I'm seeing are
answered there IMO.

Best,
Erick

On Fri, Jun 19, 2015 at 9:38 AM, Paden rumsey...@gmail.com wrote:
 Yes the number of indexed documents is correct. But the queries I perform
 fall short of what they should be. You're probably right though. I probably
 have to create a better analyzer.

 And I'm not really worried about the other fields. I've already checked to see
 if it's storing them correctly and it is. I'm mostly worried about the text
 fields and how they're being indexed by Solr when submitted.

 BTW: Because of your comment, I went back and checked my core that used the
 DIH configuration. I increased the RAM on the Linux virtual machine I'm
 using and it worked like a dream. Thanks! You might have just helped me
 finish this project.



 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Error-when-submitting-PDF-to-Solr-w-text-fields-using-SolrJ-tp4212704p4212967.html
 Sent from the Solr - User mailing list archive at Nabble.com.


Re: Migration from Solr 4.7.1 to SolrCloud 5.1

2015-06-19 Thread shacky
2015-06-19 18:00 GMT+02:00 Erick Erickson erickerick...@gmail.com:
 You really have to ask more specific questions here. What
 are you confused _about_? Have

I read that I could migrate using the backup script, so I looked for
the backup script in the Solr 4.7.1 source code but I haven't found
anything...


Re: Error when submitting PDF to Solr w/text fields using SolrJ

2015-06-19 Thread Paden
Yes the number of indexed documents is correct. But the queries I perform
fall short of what they should be. You're probably right though. I probably
have to create a better analyzer. 

And I'm not really worried about the other fields. I've already checked to see
if it's storing them correctly and it is. I'm mostly worried about the text
fields and how they're being indexed by Solr when submitted. 

BTW: Because of your comment, I went back and checked my core that used the
DIH configuration. I increased the RAM on the Linux virtual machine I'm
using and it worked like a dream. Thanks! You might have just helped me
finish this project.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Error-when-submitting-PDF-to-Solr-w-text-fields-using-SolrJ-tp4212704p4212967.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Error when submitting PDF to Solr w/text fields using SolrJ

2015-06-19 Thread Erick Erickson
This may be another forehead-slapper (man, you don't know how often
I've injured myself that way).

Did you commit at the end of the SolrJ indexing to Testcore2? DIH automatically
commits at the end of the run, and depending on how your SolrJ program
is written
it may not have. Or just set autoCommit (with openSearcher=true) in
your solrconfig
file. Or set autoSoftCommit there. In either case, wait until the
interval has expired
after your indexing has run.

Or, for that matter, you can insure you've committed by using curl or
just entering
something like
/Testcore2/update?commit=true
in a url.
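e.g. something like (assuming the default host/port):

curl 'http://localhost:8983/solr/Testcore2/update?commit=true'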

And another one that'll make you cringe is if your SolrJ program looks like:

while (more docs) {
   create a solr doc and add it to my list
   if (list > 100) {
  send list to Solr
  clear list
  }
}
end of program.

As the program exits, there'll still be docs in the list that haven't
been sent to Solr.
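
For what it's worth, a bare-bones SolrJ 5.x sketch of the safe pattern (the core
name comes from this thread, but the field names and the batch size of 100 are
made up) looks something like:

import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class SimpleIndexer {
  public static void main(String[] args) throws Exception {
    HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/Testcore2");
    List<SolrInputDocument> batch = new ArrayList<>();
    for (int i = 0; i < 281; i++) {              // stand-in for "while (more docs)"
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", "doc-" + i);
      doc.addField("text", "extracted text for document " + i);
      batch.add(doc);
      if (batch.size() >= 100) {                 // send full batches as you go
        client.add(batch);
        batch.clear();
      }
    }
    if (!batch.isEmpty()) {                      // flush the leftover partial batch
      client.add(batch);
    }
    client.commit();                             // make the docs searchable
    client.close();
  }
}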

Alessandro's question hints at things like this. The first question is
whether all the docs got sent to Solr or not. The second question is whether
they're analyzed
differently in the two cores. Third question...

Best,
Erick



On Fri, Jun 19, 2015 at 8:32 AM, Alessandro Benedetti
benedetti.ale...@gmail.com wrote:
 So, the first I can say is if that is true : it almost killed Solr with
 280 files you are doing something wrong for sure.
 At least if you are not trying to index 4k full movies xD

 Joking apart :
 1) You should carefully design your analyser.
 2) You should store your fields initially to verify you index what you were
 supposed to ( in number and in content)
 Assuming you are a beginner storing the fields will make easier for you to
 check, as they will pop out of the results.

 is at least the number of docs indexed correct ?


 2015-06-19 15:34 GMT+01:00 Paden rumsey...@gmail.com:

 Yeah, actually changing the field to text_en or text_en_splitting
 actually made it so my indexer indexed all my files. The only problem is, I
 don't think it's doing it well.

 I have two Cores that I'm working with. Both of them have indexed the same
 set of files. The first core, which I will refer to as Testcore, I used a
 DIH configuration that indexed the files with their metadata. (It indexed
 everything fine but it almost killed Solr with 280 files I would hate to
 see
 what would happen with say, 10,000 files.). When I query Testcore on some
 random common word like a it returns like 279 files. A good margin I can
 accept that.

 The second core, which I will refer to as Testcore2, I used my own indexer
 that I created and use SolrJ as the client. It indexes everything. However,
 when I query on the same word a it only returns 208 of the 281 files.
 Which is weird cause I'm using the exact same Querying handler for both. So
 I don't think a comprehensive indexed text is being sent to Solr.





 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Error-when-submitting-PDF-to-Solr-w-text-fields-using-SolrJ-tp4212704p4212933.html
 Sent from the Solr - User mailing list archive at Nabble.com.




 --
 --

 Benedetti Alessandro
 Visiting card : http://about.me/alessandro_benedetti

 Tyger, tyger burning bright
 In the forests of the night,
 What immortal hand or eye
 Could frame thy fearful symmetry?

 William Blake - Songs of Experience -1794 England


Re: CREATE collection bug or feature?

2015-06-19 Thread Shawn Heisey
On 6/19/2015 11:15 AM, Jim.Musil wrote:
 I noticed that when I issue the CREATE collection command to the api, it does 
 not automatically put a replica on every live node connected to zookeeper.

 So, for example, if I have 3 solr nodes connected to a zookeeper ensemble and 
 create a collection like this:

 /admin/collections?action=CREATE&name=my_collection&numShards=1&replicationFactor=1&maxShardsPerNode=1&collection.configName=my_config

 It will only create a core on one of the three nodes. I can make it work if I 
 change replicationFactor to 3. When standing up an entire stack using chef, 
 this all gets a bit clunky. I don't see any option such as ALL that would 
 just create a replica on all nodes regardless of size.

 I'm guessing this is intentional, but curious about the reasoning.

If you tell it replicationFactor=1, then you get exactly that -- one
copy of your index.  I personally think that it would be a violation of
something known as the principle of least surprise for Solr to
automatically create replicas without being asked to.

I would assume that if you are writing automated tools to build indexes
and the servers hosting those indexes that your automation will be able
to calculate a reasonable replicationFactor, or calculate the number of
hosts to create based on a provided replicationFactor.

A feature to have Solr itself automatically calculate a
replicationFactor based on the number of available hosts and the
numShards value provided is not a bad idea.  Please create a feature
request issue in Jira.  One way that this might be done is by setting
replicationFactor to auto or maybe a special number, perhaps 0 or -1.

https://issues.apache.org/jira/browse/SOLR

Thanks,
Shawn



Re: CREATE collection bug or feature?

2015-06-19 Thread Erick Erickson
Jim:

This is by design. There's no way to tell Solr to find all the cores
available and put one replica on each. In fact, you're explicitly
telling it to create one and only one replica, one and only one shard.
That is, your collection will have exactly one low-level core. But you
realized that...

As to the reasoning. Consider heterogeneous collections all hosted on
the same Solr cluster. I have big collections, little collections,
some with high QPS rates, some not. etc. Having Solr do things like
this automatically would make managing this difficult.

Probably the real reason is nobody thought it would be useful in
the general case. And I probably concur. Adding a new node to an
existing cluster would result in unbalanced clusters etc.

I suppose a stop-gap would be to query the live_nodes in the cluster
and add that to the URL, don't know how much of a pain that would be
though.
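
For what it's worth, the Collections API already lets you do that by hand: read
the live nodes back (e.g. via CLUSTERSTATUS) and pass them to CREATE with
createNodeSet. Node names below are placeholders:

/admin/collections?action=CLUSTERSTATUS

/admin/collections?action=CREATE&name=my_collection&numShards=1&replicationFactor=3&maxShardsPerNode=1&collection.configName=my_config&createNodeSet=host1:8983_solr,host2:8983_solr,host3:8983_solr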

Best,
Erick

On Fri, Jun 19, 2015 at 10:15 AM, Jim.Musil jim.mu...@target.com wrote:
 I noticed that when I issue the CREATE collection command to the api, it does 
 not automatically put a replica on every live node connected to zookeeper.

 So, for example, if I have 3 solr nodes connected to a zookeeper ensemble and 
 create a collection like this:

 /admin/collections?action=CREATE&name=my_collection&numShards=1&replicationFactor=1&maxShardsPerNode=1&collection.configName=my_config

 It will only create a core on one of the three nodes. I can make it work if I 
 change replicationFactor to 3. When standing up an entire stack using chef, 
 this all gets a bit clunky. I don't see any option such as ALL that would 
 just create a replica on all nodes regardless of size.

 I'm guessing this is intentional, but curious about the reasoning.

 Thanks!
 Jim


Re: How to do a Data sharding for data in a database table

2015-06-19 Thread Wenbin Wang
As for now, the index size is 6.5 M records, and the performance is good
enough. I will re-build the index for all the records (14 M) and test it
again with debug turned on.

Thanks


On Fri, Jun 19, 2015 at 12:10 PM, Erick Erickson erickerick...@gmail.com
wrote:

 First and most obvious thing to try:

 bq: the Solr was started with maximal 4G for JVM, and index size is  2G

 Bump your JVM to 8G, perhaps 12G. The size of the index on disk is very
 loosely coupled to JVM requirements. It's quite possible that you're
 spending
 all your time in GC cycles. Consider gathering GC characteristics, see:
 http://lucidworks.com/blog/garbage-collection-bootcamp-1-0/

 As Charles says, on the face of it the system you describe should handle
 quite
 a load, so it feels like things can be tuned and you won't have to
 resort to sharding.
 Sharding inevitably imposes some overhead so it's best to go there last.

 From my perspective, this is, indeed, an XY problem. You're assuming
 that sharding
 is your solution. But you really haven't identified the _problem_ other
 than
 queries are too slow. Let's nail down the reason queries are taking
 a second before
 jumping into sharding. I've just spent too much of my life fixing the
 wrong thing ;)

 It would be useful to see a couple of sample queries so we can get a
 feel for how complex they
 are. Especially if you append, as Charles mentions, debug=true

 Best,
 Erick

 On Fri, Jun 19, 2015 at 7:02 AM, Reitzel, Charles
 charles.reit...@tiaa-cref.org wrote:
  Grouping does tend to be expensive.   Our regular queries typically
 return in 10-15ms while the grouping queries take 60-80ms in a test
 environment ( 1M docs).
 
  This is ok for us, since we wrote our app to take the grouping queries
 out of the critical path (async query in parallel with two primary queries
 and some work in middle tier).   But this approach is unlikely to work for
 most cases.
 
  -Original Message-
  From: Reitzel, Charles [mailto:charles.reit...@tiaa-cref.org]
  Sent: Friday, June 19, 2015 9:52 AM
  To: solr-user@lucene.apache.org
  Subject: RE: How to do a Data sharding for data in a database table
 
  Hi Wenbin,
 
  To me, your instance appears well provisioned.  Likewise, your analysis
 of test vs. production performance makes a lot of sense.  Perhaps your time
 would be well spent tuning the query performance for your app before
 resorting to sharding?
 
  To that end, what do you see when you set debugQuery=true?   Where does
 solr spend the time?   My guess would be in the grouping and sorting steps,
 but which?   Sometime the schema details matter for performance.   Folks on
 this list can help with that.
 
  -Charlie
 
  -Original Message-
  From: Wenbin Wang [mailto:wwang...@gmail.com]
  Sent: Friday, June 19, 2015 7:55 AM
  To: solr-user@lucene.apache.org
  Subject: Re: How to do a Data sharding for data in a database table
 
  I have enough RAM (30G) and Hard disk (1000G). It is not I/O bound or
 computer disk bound. In addition, the Solr was started with maximal 4G for
 JVM, and index size is  2G. In a typical test, I made sure enough free RAM
 of 10G was available. I have not tuned any parameter in the configuration,
 it is default configuration.
 
  The number of fields for each record is around 10, and the number of
 results to be returned per page is 30. So the response time should not be
 affected by network traffic, and it is tested in the same machine. The
 query has a list of 4 search parameters, and each parameter takes a list of
 values or date range. The results will also be grouped and sorted. The
 response time of a typical single request is around 1 second. It can be  1
 second with more demanding requests.
 
  In our production environment, we have 64 cores, and we need to support 
  300 concurrent users, that is about 300 concurrent request per second.
 Each core will have to process about 5 request per second. The response
 time under this load will not be 1 second any more. My estimate is that an
 average of 200 ms response time of a single request would be able to handle
  300 concurrent users in production. There is no plan to increase the
 total number of cores 5 times.
 
  In a previous test, a search index around 6M data size was able to
 handle 
  5 request per second in each core of my 8-core machine.
 
  By doing data sharding from one single index of 13M to 2 indexes of 6 or
 7 M/each, I am expecting much faster response time that can meet the demand
 of production environment. That is the motivation of doing data sharding.
  However, I am also open to solution that can improve the performance of
 the  index of 13M to 14M size so that I do not need to do a data sharding.
 
 
 
 
 
  On Fri, Jun 19, 2015 at 12:39 AM, Erick Erickson 
 erickerick...@gmail.com
  wrote:
 
  You've repeated your original statement. Shawn's observation is that
  10M docs is a very small corpus by Solr standards. You either have
  very demanding document/search 

CREATE collection bug or feature?

2015-06-19 Thread Jim . Musil
I noticed that when I issue the CREATE collection command to the api, it does 
not automatically put a replica on every live node connected to zookeeper.

So, for example, if I have 3 solr nodes connected to a zookeeper ensemble and 
create a collection like this:

/admin/collections?action=CREATE&name=my_collection&numShards=1&replicationFactor=1&maxShardsPerNode=1&collection.configName=my_config

It will only create a core on one of the three nodes. I can make it work if I 
change replicationFactor to 3. When standing up an entire stack using chef, 
this all gets a bit clunky. I don't see any option such as ALL that would 
just create a replica on all nodes regardless of size.

I'm guessing this is intentional, but curious about the reasoning.

Thanks!
Jim


RE: Extended Dismax Query Parser with AND as default operator

2015-06-19 Thread Cario, Elaine
Dirk,

There are 3 open JIRAs related to this behavior:

https://issues.apache.org/jira/browse/SOLR-3739
https://issues.apache.org/jira/browse/SOLR-3740 
https://issues.apache.org/jira/browse/SOLR-3741

We worked around it by adding the explicit + signs if the query matched the 
problematic patterns.  A pain, I know.

-Original Message-
From: Dirk Buchhorn [mailto:dirk.buchh...@finkundpartner.de] 
Sent: Thursday, June 18, 2015 3:31 AM
To: solr-user@lucene.apache.org
Subject: Extended Dismax Query Parser with AND as default operator

Hello,

I have a question about the extended dismax query parser. If the default operator 
is changed to AND (q.op=AND) then the search results seem to be incorrect. I 
will explain it with some examples. For this test I use Solr v5.1 and the tika 
core from the example directory.
== Preparation ==
Add the following lines to the schema.xml file
  <field name="id" type="string" indexed="true" stored="true" required="true"/>
  <uniqueKey>id</uniqueKey>
Change the field "text" to stored="true"
Remove the multiValued attribute from the title and text field (we don't need 
multiValued fields in our test)

Add test data (use curl or fiddler)
Url:http://localhost:8983/solr/tika/update/json?commit=true
Header: Content-type: application/json
[
  {"id":"1", "title":"green", "author":"Jon", "text":"blue"},
  {"id":"2", "title":"green", "author":"Jon Jessie", "text":"red"},
  {"id":"3", "title":"yellow", "author":"Jessie", "text":"blue"},
  {"id":"4", "title":"green", "author":"Jessie", "text":"blue"},
  {"id":"5", "title":"blue", "author":"Jon", "text":"yellow"},
  {"id":"6", "title":"red", "author":"Jon", "text":"green"} ]

== Test ==
The following parameters are always set.
default operator is AND: q.op=AND
use the extended dismax query parser: defType=edismax
set the default query fields to title and text: qf=title text
sort: id asc

=== #1 test ===
q=red green
response:
{ numFound:2,start:0,
  docs:[
{id:2,title:green,author:Jon Jessie,text:red},
{id:6,title:red,author:Jon,text:green}]
}
parsedquery_toString: +(((text:green | title:green) (text:red | title:red))~2)

This test works as expected.

=== #2 test ===
We use a group
q=(red green)
Same response as test one.
parsedquery_toString: +(((text:green | title:green) (text:red | title:red))~2)

This test works as expected.

=== #3 test ===
q=green red author:Jessie
response:
{ numFound:1,start:0,
  docs:[{id:2,title:green,author:Jon Jessie,text:red}] }
parsedquery_toString: +(((text:green | title:green) (text:red | title:red) 
author:jessie)~3)

This test works as expected.

=== #4 test ===
q=(green red) author:Jessie
response:
{ numFound:2,start:0,
  docs:[
{id:2,title:green,author:Jon Jessie,text:red},
{id:4,title:green,author:Jessie,text:blue}]
}
parsedquery_toString: +((((text:green | title:green) (text:red | title:red)) 
author:jessie)~2)

The same result as the 3rd test was expected. Why is no AND used for the query 
group?

=== #5 test ===
q=(+green +red) author:Jessie
response:
{ numFound:4,start:0,
  docs:[
{id:2,title:green,author:Jon Jessie,text:red},
{id:3,title:yellow,author:Jessie,text:blue},
{id:4,title:green,author:Jessie,text:blue},
{id:6,title:red,author:Jon,text:green}]
}
parsedquery_toString: +((+(text:green | title:green) +(text:red | title:red)) 
author:jessie)

Now AND is used for the group but the author is concatenated with OR. Why?

=== #6 test ===
q=(+green +red) +author:Jessie
response:
{ numFound:3,start:0,
  docs:[
{id:2,title:green,author:Jon Jessie,text:red},
{id:3,title:yellow,author:Jessie,text:blue},
{id:4,title:green,author:Jessie,text:blue}]
}
parsedquery_toString: +((+(text:green | title:green) +(text:red | title:red)) 
+author:jessie)

Still not the expected result.

=== #7 test ===
q=+(+green +red) +author:Jessie
response:
{ numFound:1,start:0,
  docs:[{id:2,title:green,author:Jon Jessie,text:red}] }
parsedquery_toString: +(+(+(text:green | title:green) +(text:red | title:red)) 
+author:jessie)

Now the result is ok. But if all operators must be given then q.op=AND is 
useless.

=== #8 test ===
q=green author:(Jon Jessie)
Found four results, one was expected. The query must be changed to '+green 
+author:(+Jon +Jessie)' to get the expected result.

Is this a bug in the extended dismax parser, or what is the reason for not 
consistently applying q.op=AND to the query expression?

Kind regards

Dirk Buchhorn


Re: CREATE collection bug or feature?

2015-06-19 Thread Jim . Musil
Thanks as always for the great answers!

Jim


On 6/19/15, 11:57 AM, Erick Erickson erickerick...@gmail.com wrote:

Jim:

This is by design. There's no way to tell Solr to find all the cores
available and put one replica on each. In fact, you're explicitly
telling it to create one and only one replica, one and only one shard.
That is, your collection will have exactly one low-level core. But you
realized that...

As to the reasoning. Consider heterogeneous collections all hosted on
the same Solr cluster. I have big collections, little collections,
some with high QPS rates, some not. etc. Having Solr do things like
this automatically would make managing this difficult.

Probably the real reason is nobody thought it would be useful in
the general case. And I probably concur. Adding a new node to an
existing cluster would result in unbalanced clusters etc.

I suppose a stop-gap would be to query the live_nodes in the cluster
and add that to the URL, don't know how much of a pain that would be
though.

Best,
Erick

On Fri, Jun 19, 2015 at 10:15 AM, Jim.Musil jim.mu...@target.com wrote:
 I noticed that when I issue the CREATE collection command to the api,
it does not automatically put a replica on every live node connected to
zookeeper.

 So, for example, if I have 3 solr nodes connected to a zookeeper
ensemble and create a collection like this:

 
/admin/collections?action=CREATE&name=my_collection&numShards=1&replicationFactor=1&maxShardsPerNode=1&collection.configName=my_config

 It will only create a core on one of the three nodes. I can make it
work if I change replicationFactor to 3. When standing up an entire
stack using chef, this all gets a bit clunky. I don't see any option
such as ALL that would just create a replica on all nodes regardless
of size.

 I'm guessing this is intentional, but curious about the reasoning.

 Thanks!
 Jim



CollapseQParserPluging Incorrect Facet Counts

2015-06-19 Thread Carlos Maroto
Hi,

We are comparing results between Field Collapsing (group* parameters) and
CollapseQParserPlugin.  We noticed that some facets are returning incorrect
counts.

Here are the relevant parameters of one of our test queries:

Field Collapsing:
---
q=red%20dress&facet=true&facet.mincount=1&facet.limit=-1&facet.field=searchcolorfacet&group=true&group.field=groupid&group.facet=true&group.ngroups=true

ngroups = 5964

<lst name="searchcolorfacet">
...
<int name="red">11</int>
...
</lst>

CollapseQParserPlugin:
----------------------
q=red%20dress&facet=true&facet.mincount=1&facet.limit=-1&facet.field=searchcolorfacet&fq=%7B!collapse%20field=groupid%7D

numFound = 5964 (same)

<lst name="searchcolorfacet">
...
<int name="red">8</int>
...
</lst>

When we change the CollapseQParserPlugin query by adding
fq=searchcolorfacet:red, the numFound value is 11, effectively showing
all 11 hits with that color.  The facet count for red now shows the correct
value of 11 as well.

Has anyone seen something similar?

Thanks,
Carlos


RE: How to do a Data sharding for data in a database table

2015-06-19 Thread Reitzel, Charles
Also, since you are tuning for relative times, you can tune on the smaller 
index.   Surely, you will want to test at scale.   But tuning query, analyzer 
or schema options is usually easier to do on a smaller index.   If you get a 3x 
improvement at small scale, it may only be 2.5x at full scale.

E.g. storing the group field as doc values is one option that can help grouping 
performance in some cases (at least according to this list, I haven't tried it 
yet).
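
Concretely, that is just a schema.xml change on the grouping field (the field name here is a placeholder) plus a full reindex, e.g.:

<field name="groupField" type="string" indexed="true" stored="false" docValues="true"/>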

The number of distinct values of the grouping field is important as well.  If 
there are very many, you may want to try CollapsingQParserPlugin. 

The point being, some of these options may require reindexing!   So, again, it 
is a much easier and faster process to tune on a smaller index.

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Friday, June 19, 2015 2:33 PM
To: solr-user@lucene.apache.org
Subject: Re: How to do a Data sharding for data in a database table

Do be aware that turning on debug=query adds a load. I've seen the debug 
component take 90% of the query time. (to be fair it usually takes a much 
smaller percentage).

But you'll see a section at the end of the response if you set debug=all with 
the time each component took so you'll have a sense of the relative time used 
by each component.

Best,
Erick

On Fri, Jun 19, 2015 at 11:06 AM, Wenbin Wang wwang...@gmail.com wrote:
 As for now, the index size is 6.5 M records, and the performance is 
 good enough. I will re-build the index for all the records (14 M) and 
 test it again with debug turned on.

 Thanks


 On Fri, Jun 19, 2015 at 12:10 PM, Erick Erickson 
 erickerick...@gmail.com
 wrote:

 First and most obvious thing to try:

 bq: the Solr was started with maximal 4G for JVM, and index size is  
 2G

 Bump your JVM to 8G, perhaps 12G. The size of the index on disk is 
 very loosely coupled to JVM requirements. It's quite possible that 
 you're spending all your time in GC cycles. Consider gathering GC 
 characteristics, see:
 http://lucidworks.com/blog/garbage-collection-bootcamp-1-0/

 As Charles says, on the face of it the system you describe should 
 handle quite a load, so it feels like things can be tuned and you 
 won't have to resort to sharding.
 Sharding inevitably imposes some overhead so it's best to go there last.

 From my perspective, this is, indeed, an XY problem. You're assuming 
 that sharding is your solution. But you really haven't identified the 
 _problem_ other than queries are too slow. Let's nail down the 
 reason queries are taking a second before jumping into sharding. I've 
 just spent too much of my life fixing the wrong thing ;)

 It would be useful to see a couple of sample queries so we can get a 
 feel for how complex they are. Especially if you append, as Charles 
 mentions, debug=true

 Best,
 Erick

 On Fri, Jun 19, 2015 at 7:02 AM, Reitzel, Charles 
 charles.reit...@tiaa-cref.org wrote:
  Grouping does tend to be expensive.   Our regular queries typically
 return in 10-15ms while the grouping queries take 60-80ms in a test 
 environment ( 1M docs).
 
  This is ok for us, since we wrote our app to take the grouping 
  queries
 out of the critical path (async query in parallel with two primary queries
 and some work in middle tier).   But this approach is unlikely to work for
 most cases.
 
  -Original Message-
  From: Reitzel, Charles [mailto:charles.reit...@tiaa-cref.org]
  Sent: Friday, June 19, 2015 9:52 AM
  To: solr-user@lucene.apache.org
  Subject: RE: How to do a Data sharding for data in a database table
 
  Hi Wenbin,
 
  To me, your instance appears well provisioned.  Likewise, your 
  analysis
 of test vs. production performance makes a lot of sense.  Perhaps 
 your time would be well spent tuning the query performance for your 
 app before resorting to sharding?
 
  To that end, what do you see when you set debugQuery=true?   Where does
 solr spend the time?   My guess would be in the grouping and sorting steps,
 but which?   Sometime the schema details matter for performance.   Folks on
 this list can help with that.
 
  -Charlie
 
  -Original Message-
  From: Wenbin Wang [mailto:wwang...@gmail.com]
  Sent: Friday, June 19, 2015 7:55 AM
  To: solr-user@lucene.apache.org
  Subject: Re: How to do a Data sharding for data in a database table
 
  I have enough RAM (30G) and Hard disk (1000G). It is not I/O bound 
  or
 computer disk bound. In addition, the Solr was started with maximal 
 4G for JVM, and index size is  2G. In a typical test, I made sure 
 enough free RAM of 10G was available. I have not tuned any parameter 
 in the configuration, it is default configuration.
 
  The number of fields for each record is around 10, and the number 
  of
 results to be returned per page is 30. So the response time should 
 not be affected by network traffic, and it is tested in the same 
 machine. The query has a list of 4 search parameters, and each 
 parameter takes a list 

Re: How to do a Data sharding for data in a database table

2015-06-19 Thread Erick Erickson
Do be aware that turning on debug=query adds load of its own. I've seen the
debug component take 90% of the query time (to be fair, it usually takes a
much smaller percentage).

But if you set debug=all you'll see a section at the end of the response
with the time each component took, so you'll have a sense of the relative
time used by each component.
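For example (core name and query are placeholders, a minimal sketch):

    curl "http://localhost:8983/solr/mycore/select?q=*:*&rows=0&debug=all&wt=json&indent=true"

The "timing" block inside the debug section at the end of the response lists
prepare/process time per search component (query, facet, highlight, debug, ...).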

Best,
Erick

On Fri, Jun 19, 2015 at 11:06 AM, Wenbin Wang wwang...@gmail.com wrote:
 As of now, the index size is 6.5M records, and the performance is good
 enough. I will rebuild the index for all the records (14M) and test it
 again with debug turned on.

 Thanks


 On Fri, Jun 19, 2015 at 12:10 PM, Erick Erickson erickerick...@gmail.com
 wrote:

 First and most obvious thing to try:

 bq: Solr was started with a maximum of 4G for the JVM, and the index size is < 2G

 Bump your JVM to 8G, perhaps 12G. The size of the index on disk is very
 loosely coupled to JVM requirements. It's quite possible that you're
 spending
 all your time in GC cycles. Consider gathering GC characteristics, see:
 http://lucidworks.com/blog/garbage-collection-bootcamp-1-0/

 As Charles says, on the face of it the system you describe should handle
 quite
 a load, so it feels like things can be tuned and you won't have to
 resort to sharding.
 Sharding inevitably imposes some overhead so it's best to go there last.

 From my perspective, this is, indeed, an XY problem. You're assuming
 that sharding
 is your solution. But you really haven't identified the _problem_ other
 than
 queries are too slow. Let's nail down the reason queries are taking
 a second before
 jumping into sharding. I've just spent too much of my life fixing the
 wrong thing ;)

 It would be useful to see a couple of sample queries so we can get a
 feel for how complex they
 are. Especially if you append, as Charles mentions, debug=true

 Best,
 Erick

 On Fri, Jun 19, 2015 at 7:02 AM, Reitzel, Charles
 charles.reit...@tiaa-cref.org wrote:
  Grouping does tend to be expensive.  Our regular queries typically
 return in 10-15ms while the grouping queries take 60-80ms in a test
 environment (< 1M docs).
 
  This is OK for us, since we wrote our app to take the grouping queries
 out of the critical path (async query in parallel with two primary queries
 and some work in the middle tier).  But this approach is unlikely to work for
 most cases.
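 For reference, the two query shapes being compared here look roughly like
 this (core and field names are placeholders, not from the actual schema):

    # regular query
    curl "http://localhost:8983/solr/mycore/select?q=*:*&fq=status:active&rows=30"

    # same query with result grouping turned on -- the slower form
    curl "http://localhost:8983/solr/mycore/select?q=*:*&fq=status:active&rows=30&group=true&group.field=account_id"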
 
  -Original Message-
  From: Reitzel, Charles [mailto:charles.reit...@tiaa-cref.org]
  Sent: Friday, June 19, 2015 9:52 AM
  To: solr-user@lucene.apache.org
  Subject: RE: How to do a Data sharding for data in a database table
 
  Hi Wenbin,
 
  To me, your instance appears well provisioned.  Likewise, your analysis
 of test vs. production performance makes a lot of sense.  Perhaps your time
 would be well spent tuning the query performance for your app before
 resorting to sharding?
 
   To that end, what do you see when you set debugQuery=true?  Where does
  Solr spend the time?  My guess would be in the grouping and sorting steps,
  but which?  Sometimes the schema details matter for performance.  Folks on
  this list can help with that.
 
  -Charlie
 
  -Original Message-
  From: Wenbin Wang [mailto:wwang...@gmail.com]
  Sent: Friday, June 19, 2015 7:55 AM
  To: solr-user@lucene.apache.org
  Subject: Re: How to do a Data sharding for data in a database table
 
   I have enough RAM (30G) and hard disk (1000G). It is not I/O bound or
  disk bound. In addition, Solr was started with a maximum of 4G for the JVM,
  and the index size is < 2G. In a typical test, I made sure at least 10G of
  free RAM was available. I have not tuned any parameters; it is the default
  configuration.
 
   The number of fields for each record is around 10, and the number of
  results to be returned per page is 30. So the response time should not be
  affected by network traffic, and it is tested on the same machine. The
  query has a list of 4 search parameters, and each parameter takes a list of
  values or a date range. The results will also be grouped and sorted. The
  response time of a typical single request is around 1 second. It can be
  more than 1 second with more demanding requests.
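  A sketch of what such a request might look like; every field name and value
  below is illustrative, not taken from the actual schema:

    curl "http://localhost:8983/solr/mycore/select" \
      --data-urlencode "q=*:*" \
      --data-urlencode "fq=region:(US OR EU OR APAC)" \
      --data-urlencode "fq=category:(books OR music)" \
      --data-urlencode "fq=status:(open OR pending)" \
      --data-urlencode "fq=created_date:[2015-01-01T00:00:00Z TO 2015-06-19T00:00:00Z]" \
      --data-urlencode "rows=30" \
      --data-urlencode "sort=created_date desc" \
      --data-urlencode "group=true" \
      --data-urlencode "group.field=account_id" \
      --data-urlencode "debug=all"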
 
  In our production environment, we have 64 cores, and we need to support
 more than 300 concurrent users, that is, about 300 concurrent requests per
 second. Each core will have to process about 5 requests per second. The
 response time under this load will not be 1 second any more. My estimate is
 that an average response time of 200 ms for a single request would be able
 to handle more than 300 concurrent users in production. There is no plan to
 increase the total number of cores by 5 times.
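  The arithmetic behind those numbers, as a quick sanity check:

    # ~300 requests/sec spread across 64 cores
    echo "scale=2; 300 / 64" | bc   # 4.68, i.e. roughly 5 requests/sec per core
    # at 5 requests/sec per core, each request gets a budget of
    echo "1000 / 5" | bc            # 200 ms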
 
  In a previous test, a search index of around 6M records was able to
 handle more than 5 requests per second on each core of my 8-core machine.
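  One minimal way to reproduce that kind of throughput measurement, assuming
  Apache Bench (ab) is available; the URL is a placeholder:

    # 1000 requests at concurrency 5; the report prints "Requests per second"
    ab -n 1000 -c 5 "http://localhost:8983/solr/mycore/select?q=*:*&rows=30"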
 
  By doing data sharding from one single index of 13M records to 2 indexes of
 6 or 7M each, I am expecting a much faster response time that can meet the
 demands of the production environment. That is the motivation of doing