Re: inconsistent result count when doing paging

2017-02-09 Thread cmti95035
Thanks Shawn! I will double check to make sure the uniqueKey values are really
unique across all shards.





Problem with cyrillics letters through Tika OCR indexing

2017-02-09 Thread Абрашин , Игорь Олегович
Hello everyone, I've encountered the problem mentioned in the title.
The original image is attached, and the recognized text is below:
3ApaBCTyI7ITe 9| )KVIBy xopomo

Has anyone faced anything similar?
I should mention that Tesseract recognizes it more correctly with the -l rus option.

Thanks in advance!


Best regards,
Igor Abrashin
OOO «NOVATEK NTC»
office phone: +7 (3452) 680-386
internal corporate ext.: 22-586
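For reference, Tika's Tesseract integration can be pointed at the Russian
language pack programmatically. A minimal sketch, assuming Tika is invoked
directly (outside Solr) and the rus traineddata is installed; the class and
file names are placeholders:

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.ocr.TesseractOCRConfig;
import org.apache.tika.sax.BodyContentHandler;
import java.io.FileInputStream;
import java.io.InputStream;

public class RussianOcrSketch {
    public static void main(String[] args) throws Exception {
        // Equivalent of running "tesseract -l rus" by hand
        TesseractOCRConfig ocrConfig = new TesseractOCRConfig();
        ocrConfig.setLanguage("rus+eng");

        ParseContext context = new ParseContext();
        context.set(TesseractOCRConfig.class, ocrConfig);

        AutoDetectParser parser = new AutoDetectParser();
        BodyContentHandler handler = new BodyContentHandler();
        try (InputStream in = new FileInputStream("image.png")) {
            parser.parse(in, handler, new Metadata(), context);
        }
        System.out.println(handler.toString());
    }
}

When indexing through Solr's extracting handler, the same TesseractOCRConfig
settings would have to be supplied through Tika configuration on the Solr
side; the exact wiring depends on the Solr/Tika versions in use.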



how to get modified field data if it doesn't exist in meta

2017-02-09 Thread Gytis Mikuciunas
Hi,

We have started to use Solr to index our documents (vsd, vsdx,
xls, xlsx, doc, docx, pdf, txt).

A modified-date value is needed for each file. MS Office files and PDFs have
this value, but txt files don't carry it in their metadata.

Is there any way to get it from the OS level and force-add it to Solr
during indexing?

p.s.

Windows 2012 server, single instance

typical command we use: java -Dauto -Dc=index_sandbox -Dport=80
-Dfiletypes=vsd,vsdx,xls,xlsx,doc,docx,pdf,txt -Dbasicauth=admin: -jar
example/exampledocs/post.jar "M:\DNS_dump"


Regards,

Gytis
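One way to do it (a sketch, not tested against this exact setup): walk the
directory with SolrJ instead of post.jar and pass the filesystem mtime as a
literal. The field name last_modified and the core URL below are assumptions,
and basic auth is omitted:

import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;
import java.nio.file.Files;
import java.nio.file.Paths;

public class PostWithMtime {
    public static void main(String[] args) throws Exception {
        try (HttpSolrClient solr =
                 new HttpSolrClient.Builder("http://localhost:80/solr/index_sandbox").build()) {
            Files.walk(Paths.get("M:\\DNS_dump"))
                 .filter(Files::isRegularFile)
                 .forEach(p -> {
                     try {
                         ContentStreamUpdateRequest req =
                             new ContentStreamUpdateRequest("/update/extract");
                         req.addFile(p.toFile(), "application/octet-stream");
                         req.setParam("literal.id", p.toString());
                         // Take the modified date from the OS, since txt files
                         // have nothing usable in their Tika metadata
                         req.setParam("literal.last_modified",
                             Files.getLastModifiedTime(p).toInstant().toString());
                         solr.request(req);
                     } catch (Exception e) {
                         throw new RuntimeException(e);
                     }
                 });
            solr.commit();
        }
    }
}

FileTime.toInstant().toString() yields an ISO-8601 UTC timestamp, which Solr's
date fields accept.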


Problem with collection operations in 6.4.1?

2017-02-09 Thread Walter Underwood
After three hours, I’m still getting this from an async collection delete 
request.

{
  responseHeader: {
    status: 0,
    QTime: 12
  },
  status: {
    state: "submitted",
    msg: "found [wunder0] in submitted tasks"
  }
}


16 node cluster, 4 shards, 4 replicas, 14.7 million documents. Also, shutting 
down a node times out after three minutes and needs a kill. And collection 
reload times out after three minutes.

Did not have this problem with 6.2.1.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)




Migrate Documents to Another Collection

2017-02-09 Thread alias
Hello,
   please help me take a look at this question. In Solr 6.3 I have an issue with
index migration (/admin/collections?action=MIGRATE). I really do not know how to
solve it and hope that someone will help answer; very grateful.


http://lucene.472066.n3.nabble.com/Migrate-Documents-to-Another-Collection-td4318900.html#a4318907
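For anyone else hitting this: the MIGRATE action moves documents for a given
routing key to a target collection, and it only works for collections using the
compositeId router. A call looks roughly like this (collection names and the
key are placeholders):

/admin/collections?action=MIGRATE&collection=source_collection&target.collection=target_collection&split.key=a!

The target collection must exist beforehand, and split.key must include the
router separator (the trailing '!').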

Re: Removing duplicate terms from query

2017-02-09 Thread Erick Erickson
This is a common misunderstanding of RemoveDuplicatesTokenFilter. It
removes tokens _introduced_ by certain other filters, not duplicates
that were part of the original. This is the relevant part of the docs:
"if they have the same text and position values". An input of "hey hey hey"
has a different position for each "hey"...

Best,
Erick
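For what Ere describes further down (dropping adjacent duplicate terms
regardless of position), a custom filter is straightforward. A minimal sketch;
this is not an existing Lucene/Solr filter, and the class name is made up:

import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public final class AdjacentDuplicateFilter extends TokenFilter {
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private String previous;

    public AdjacentDuplicateFilter(TokenStream input) {
        super(input);
    }

    @Override
    public boolean incrementToken() throws IOException {
        while (input.incrementToken()) {
            String current = termAtt.toString();
            if (!current.equals(previous)) {  // keep the first of each run
                previous = current;
                return true;
            }
            // otherwise skip the adjacent duplicate and keep pulling tokens
        }
        return false;
    }

    @Override
    public void reset() throws IOException {
        super.reset();
        previous = null;
    }
}

A matching TokenFilterFactory would be needed to use it from a Solr schema.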

On Thu, Feb 9, 2017 at 10:52 AM, Markus Jelsma
 wrote:
> Yeah, what does that do anyway, omit both but not one in particular? And
> where was omitTermFreq all this time; does it make sense?
>
> Not to me at least, so I never tried it and just overrode the similarity
> instead.
>
> M.
>
> -Original message-
>> From:Alexandre Rafalovitch 
>> Sent: Thursday 9th February 2017 18:00
>> To: solr-user 
>> Subject: Re: Removing duplicate terms from query
>>
>> Would omitTermFreqAndPositions help here? Though that's probably
>> overkill, as it disables phrase searches too. I am not sure if it is
>> possible to do omitTermFreqAndPositions=true omitPositions=false to
>> just skip frequencies.
>>
>> Regards,
>>Alex.
>> 
>> http://www.solr-start.com/ - Resources for Solr users, new and experienced
>>
>>
>> On 9 February 2017 at 11:37, Walter Underwood  wrote:
>> > 1. I don’t think this is a good idea. It means that a search for “hey hey 
>> > hey” won’t score that document higher.
>> >
>> > 2. Maybe you want to change how tf is calculated. Ignore multiple 
>> > occurrences of a word.
>> >
>> > I ran into this with the movie title “New York, New York” at Netflix. It 
>> > isn’t twice as much about New York, but it needs to be the best match for 
>> > the query “new york new york”.
>> >
>> > wunder
>> > Walter Underwood
>> > wun...@wunderwood.org
>> > http://observer.wunderwood.org/  (my blog)
>> >
>> >
>> >> On Feb 9, 2017, at 5:18 AM, Ere Maijala  wrote:
>> >>
>> >> Thanks Emir.
>> >>
>> >> I was thinking of something very simple like doing what 
>> >> RemoveDuplicatesTokenFilter does but ignoring positions. It would of 
>> >> course still be possible to have the same term multiple times, but at 
>> >> least the adjacent ones could be deduplicated. The reason I'm not too 
>> >> eager to do it in a query preprocessor is that I'd have to essentially 
>> >> duplicate functionality of the query analysis chain that contains 
>> >> ICUTokenizerFactory, WordDelimiterFilterFactory and whatnot.
>> >>
>> >> Regards,
>> >> Ere
>> >>
>> >> 9.2.2017, 14.52, Emir Arnautovic kirjoitti:
>> >>> Hi Ere,
>> >>>
>> >>> I don't think that there is such filter. Implementing such filter would
>> >>> require looking backward which violates streaming approach of token
>> >>> filters and unpredictable memory usage.
>> >>>
>> >>> I would do it as part of query preprocessor and not necessarily as part
>> >>> of Solr.
>> >>>
>> >>> HTH,
>> >>> Emir
>> >>>
>> >>>
>> >>> On 09.02.2017 12:24, Ere Maijala wrote:
>>  Hi,
>> 
>>  I just noticed that while we use RemoveDuplicatesTokenFilter during
>>  query time, it will consider term positions and not really do anything
>>  e.g. if query is 'term term term'. As far as I can see the term
>>  positions make no difference in a simple non-phrase search. Is there a
>>  built-in way to deal with this? I know I can write a filter to do
>>  this, but I feel like this would be something quite basic to do for
>>  the query. And I don't think it's even anything too weird for normal
>>  users to do. Just consider e.g. searching for music by title:
>> 
>>  Hey, hey, hey ; Shivers of pleasure
>> 
>>  I also verified that at least according to debugQuery=true and
>>  anecdotal evidence the search really slows down if you repeat the same
>>  term enough.
>> 
>>  --Ere
>> >>>
>> >>
>> >> --
>> >> Ere Maijala
>> >> Kansalliskirjasto / The National Library of Finland
>> >
>>


Re: Huh? What does this even mean? Not enough time left to update replicas. However, the schema is updated already.

2017-02-09 Thread Erick Erickson
Well, managed schema in SolrCloud is a bit heavy-weight.
When you change the schema, two things need to happen:
1> the change has to be pushed to ZooKeeper
2> the replicas in the collection need to be reloaded to make
the changes available to all replicas for the _next_ doc that
comes in.

So what I think that message means is that the changes have
been propagated to ZooKeeper but the replicas haven't all
reloaded.

Best,
Erick
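For completeness, a schema change through the Schema API with SolrJ looks
roughly like this (field name/type and the ZooKeeper address are placeholders).
If I read the code right, the request also accepts an updateTimeoutSecs
parameter that bounds step 2>:

import java.util.LinkedHashMap;
import java.util.Map;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.request.schema.SchemaRequest;

public class AddFieldSketch {
    public static void main(String[] args) throws Exception {
        try (CloudSolrClient client =
                 new CloudSolrClient.Builder().withZkHost("zk1:2181").build()) {
            Map<String, Object> fieldDef = new LinkedHashMap<>();
            fieldDef.put("name", "example_s");
            fieldDef.put("type", "string");
            fieldDef.put("stored", true);
            // Pushes the change to ZooKeeper (1>) and waits while the
            // collection's replicas pick it up (2>)
            new SchemaRequest.AddField(fieldDef).process(client, "mycollection");
        }
    }
}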

On Thu, Feb 9, 2017 at 11:29 AM, Shawn Heisey  wrote:
> On 2/9/2017 10:29 AM, Michael Joyner wrote:
>>
>> Huh? What does this even mean? If the schema is updated already how
>> can we be out of time to update it?
>>
>> Not enough time left to update replicas. However, the schema is
>> updated already.
>
> The code where the waitForOtherReplicasToUpdate method (that throws the
> exception) is called has this comment:
>
>   // Don't block further schema updates while waiting for a pending
> update to propagate to other replicas.
>   // This reduces the likelihood of a (time-limited) distributed
> deadlock during concurrent schema updates.
>
> Looks like it's part of the code that lets you manage your schema with
> HTTP calls.
>
> I don't really know what kinds of potential problems this code is
> designed to deal with, but it sounds like your Solr servers are having
> performance issues causing operations that are normally very fast to
> take a very long time instead.  One common cause for this is setting the
> heap too small, so that Java is forced to engage in nearly constant full
> garbage collections.
>
> The default timeout in this situation is ten minutes, unless you
> explicitly configure it to be different, which is probably done in the
> config for the managed schema.  In order for a ten minute timeout to be
> exceeded, the performance problem would be VERY severe ... to the point
> where I would be surprised if you can even get query results from Solr
> at all.
>
> Thanks,
> Shawn
>


Re: Huh? What does this even mean? Not enough time left to update replicas. However, the schema is updated already.

2017-02-09 Thread Shawn Heisey
On 2/9/2017 10:29 AM, Michael Joyner wrote:
>
> Huh? What does this even mean? If the schema is updated already how
> can we be out of time to update it?
>
> Not enough time left to update replicas. However, the schema is
> updated already.

The code where the waitForOtherReplicasToUpdate method (that throws the
exception) is called has this comment:

  // Don't block further schema updates while waiting for a pending
update to propagate to other replicas.
  // This reduces the likelihood of a (time-limited) distributed
deadlock during concurrent schema updates.

Looks like it's part of the code that lets you manage your schema with
HTTP calls.

I don't really know what kinds of potential problems this code is
designed to deal with, but it sounds like your Solr servers are having
performance issues causing operations that are normally very fast to
take a very long time instead.  One common cause for this is setting the
heap too small, so that Java is forced to engage in nearly constant full
garbage collections.

The default timeout in this situation is ten minutes, unless you
explicitly configure it to be different, which is probably done in the
config for the managed schema.  In order for a ten minute timeout to be
exceeded, the performance problem would be VERY severe ... to the point
where I would be surprised if you can even get query results from Solr
at all.

Thanks,
Shawn



RE: Removing duplicate terms from query

2017-02-09 Thread Markus Jelsma
Yeah, what does that do anyway, omit both but not one in particular? And where
was omitTermFreq all this time; does it make sense?

Not to me at least, so I never tried it and just overrode the similarity
instead.

M. 
 
-Original message-
> From:Alexandre Rafalovitch 
> Sent: Thursday 9th February 2017 18:00
> To: solr-user 
> Subject: Re: Removing duplicate terms from query
> 
> Would omitTermFreqAndPositions help here? Though that's probably
> overkill, as it disables phrase searches too. I am not sure if it is
> possible to do omitTermFreqAndPositions=true omitPositions=false to
> just skip frequencies.
> 
> Regards,
>    Alex.
> 
> http://www.solr-start.com/ - Resources for Solr users, new and experienced
> 
> 
> On 9 February 2017 at 11:37, Walter Underwood  wrote:
> > 1. I don’t think this is a good idea. It means that a search for “hey hey 
> > hey” won’t score that document higher.
> >
> > 2. Maybe you want to change how tf is calculated. Ignore multiple 
> > occurrences of a word.
> >
> > I ran into this with the movie title “New York, New York” at Netflix. It 
> > isn’t twice as much about New York, but it needs to be the best match for 
> > the query “new york new york”.
> >
> > wunder
> > Walter Underwood
> > wun...@wunderwood.org
> > http://observer.wunderwood.org/  (my blog)
> >
> >
> >> On Feb 9, 2017, at 5:18 AM, Ere Maijala  wrote:
> >>
> >> Thanks Emir.
> >>
> >> I was thinking of something very simple like doing what 
> >> RemoveDuplicatesTokenFilter does but ignoring positions. It would of 
> >> course still be possible to have the same term multiple times, but at 
> >> least the adjacent ones could be deduplicated. The reason I'm not too 
> >> eager to do it in a query preprocessor is that I'd have to essentially 
> >> duplicate functionality of the query analysis chain that contains 
> >> ICUTokenizerFactory, WordDelimiterFilterFactory and whatnot.
> >>
> >> Regards,
> >> Ere
> >>
> >> 9.2.2017, 14.52, Emir Arnautovic kirjoitti:
> >>> Hi Ere,
> >>>
> >>> I don't think that there is such filter. Implementing such filter would
> >>> require looking backward which violates streaming approach of token
> >>> filters and unpredictable memory usage.
> >>>
> >>> I would do it as part of query preprocessor and not necessarily as part
> >>> of Solr.
> >>>
> >>> HTH,
> >>> Emir
> >>>
> >>>
> >>> On 09.02.2017 12:24, Ere Maijala wrote:
>  Hi,
> 
>  I just noticed that while we use RemoveDuplicatesTokenFilter during
>  query time, it will consider term positions and not really do anything
>  e.g. if query is 'term term term'. As far as I can see the term
>  positions make no difference in a simple non-phrase search. Is there a
>  built-in way to deal with this? I know I can write a filter to do
>  this, but I feel like this would be something quite basic to do for
>  the query. And I don't think it's even anything too weird for normal
>  users to do. Just consider e.g. searching for music by title:
> 
>  Hey, hey, hey ; Shivers of pleasure
> 
>  I also verified that at least according to debugQuery=true and
>  anecdotal evidence the search really slows down if you repeat the same
>  term enough.
> 
>  --Ere
> >>>
> >>
> >> --
> >> Ere Maijala
> >> Kansalliskirjasto / The National Library of Finland
> >
> 


Re: procedure to restart solrcloud, and config/collection consistency

2017-02-09 Thread xavier jmlucjav
hi Shawn,

as I replied to Markus, of course I know (and use) the Collections API to
reload the config. I am asking what would happen in this scenario:
 - config updated (but collection not reloaded)
 - I restart one node
Now one node has the new config and the rest the old one?

To which he already replied:
>The restared/reloaded node has the new config, the others have the old
config until reloaded/restarted.

I was not asking about making Solr restart itself; my English must be worse
than I thought. By the way, stuff like that can be achieved with
http://yajsw.sourceforge.net/, a very powerful Java wrapper. I used to use
it when Solr did not have a built-in daemon setup. It was built by someone
who was using JSW and got pissed when that one went commercial. It is very
configurable, but of course more complex. I wrote something about it some
time ago
https://medium.com/@jmlucjav/how-to-install-solr-as-a-service-in-any-platform-including-solr-5-8e4a93cc3909

thanks

On Thu, Feb 9, 2017 at 4:53 PM, Shawn Heisey  wrote:

> On 2/9/2017 5:24 AM, xavier jmlucjav wrote:
> > I always wondered, if this was not really needed, and I could just call
> > 'restart' in every node, in a quick loop, and forget about it. Does
> anyone
> > know if this is the case?
> >
> > My doubt is in regards to changing some config, and then doing the above
> > (just restart nodes in a loop). For example, what if I change a config G
> > used in collection C, and I restart just one of the nodes (N1), leaving
> the rest alone. If all the nodes contain a shard for C, what happens, N1 is
> using the new config and the rest are not? how is this handled?
>
> If you want to change the config or schema for a collection and make it
> active across all nodes, just use the collections API to RELOAD the
> collection.  The change will be picked up everywhere.
>
> https://cwiki.apache.org/confluence/display/solr/Collections+API
>
> To answer your question: No.  Solr does not have the ability to restart
> itself.  It would require significant development effort and a
> fundamental change in how Solr is started to make it possible.  It is
> something that has been discussed, but at this time it is not possible.
>
> One idea that would make this possible is mentioned on the following
> wiki page.  It talks about turning Solr into two applications instead of
> one:
>
> https://wiki.apache.org/solr/WhyNoWar#Information_that.27s_
> not_version_specific
>
> Again -- it would not be easy, which is why it hasn't been done yet.
>
> Thanks,
> Shawn
>
>


Re: values for fairnessPolicy?

2017-02-09 Thread Walter Underwood
The code needs a boolean. In HttpShardHandlerFactory.java:


BlockingQueue<Runnable> blockingQueue = (this.queueSize == -1) ?
    new SynchronousQueue<Runnable>(this.accessPolicy) :
    new ArrayBlockingQueue<Runnable>(this.queueSize, this.accessPolicy);

Also, what is a “reasonable size of queue” for sizeOfQueue?

https://cwiki.apache.org/confluence/display/solr/Distributed+Requests 


wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Feb 9, 2017, at 10:13 AM, Walter Underwood  wrote:
> 
> The default is “false”. I tried “true” and it fails because it can’t parse 
> that as an int.
> 
> The docs need to describe legal values for this.
> 
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
> 
> 



values for fairnessPolicy?

2017-02-09 Thread Walter Underwood
The default is “false”. I tried “true” and it fails because it can’t parse that 
as an int.

The docs need to describe legal values for this.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)




Re: creating collection using collection API with SSL enabled SolrCloud

2017-02-09 Thread Bryan Bende
You should be able to start your Solr instances with "-h <hostname>".

On Thu, Feb 9, 2017 at 12:09 PM, Xie, Sean  wrote:
> Thank you Hrishikesh,
>
> The cluster property solved the issue.
>
> Now we need to figure out a way to give each instance a host name, to resolve
> the SSL error about the IP not matching the SSL certificate name.
>
> Sean
>
>
>
> On 2/9/17, 11:35 AM, "Hrishikesh Gadre"  wrote:
>
> Hi Sean,
>
> Have you configured the "urlScheme" cluster property (i.e. 
> urlScheme=https)
> ?
> 
> https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-CLUSTERPROP:ClusterProperties
>
> Thanks
> Hrishikesh
>
>
>
> On Thu, Feb 9, 2017 at 8:23 AM, Xie, Sean  wrote:
>
> > Hi All,
> >
> > When trying to create a collection using the API when there are a few
> > replicas, I’m getting an error because the call seems to be trying to use HTTP
> > for the replicas.
> >
> > https://IP_1:8983/solr/admin/collections?action=CREATE&;
> > name=My_COLLECTION&numShards=1&replicationFactor=1&
> > collection.configName=my_collection_conf
> >
> > Here is the error:
> >
> > org.apache.solr.client.solrj.SolrServerException:IOException occured 
> when
> > talking to server at: http://IP_2:8983/solr
> >
> >
> > Is there something that needs to be configured for that?
> >
> > Thanks
> > Sean
> >
>
>


Huh? What does this even mean? Not enough time left to update replicas. However, the schema is updated already.

2017-02-09 Thread Michael Joyner


Huh? What does this even mean? If the schema is updated already how can 
we be out of time to update it?


Not enough time left to update replicas. However, the schema is updated 
already.




Re: creating collection using collection API with SSL enabled SolrCloud

2017-02-09 Thread Xie, Sean
Thank you Hrishikesh,

The cluster property solved the issue.

Now we need to figure out a way to give each instance a host name, to resolve
the SSL error about the IP not matching the SSL certificate name.

Sean



On 2/9/17, 11:35 AM, "Hrishikesh Gadre"  wrote:

Hi Sean,

Have you configured the "urlScheme" cluster property (i.e. urlScheme=https)
?

https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-CLUSTERPROP:ClusterProperties

Thanks
Hrishikesh



On Thu, Feb 9, 2017 at 8:23 AM, Xie, Sean  wrote:

> Hi All,
>
> When trying to create a collection using the API when there are a few
> replicas, I’m getting an error because the call seems to be trying to use HTTP
> for the replicas.
>
> https://IP_1:8983/solr/admin/collections?action=CREATE&;
> name=My_COLLECTION&numShards=1&replicationFactor=1&
> collection.configName=my_collection_conf
>
> Here is the error:
>
> org.apache.solr.client.solrj.SolrServerException:IOException occured when
> talking to server at: http://IP_2:8983/solr
>
>
> Is there something that needs to be configured for that?
>
> Thanks
> Sean
>
>




Re: Removing duplicate terms from query

2017-02-09 Thread Alexandre Rafalovitch
Would omitTermFreqAndPositions help here? Though that's probably
overkill, as it disables phrase searches too. I am not sure if it is
possible to do omitTermFreqAndPositions=true omitPositions=false to
just skip frequencies.

Regards,
   Alex.

http://www.solr-start.com/ - Resources for Solr users, new and experienced


On 9 February 2017 at 11:37, Walter Underwood  wrote:
> 1. I don’t think this is a good idea. It means that a search for “hey hey 
> hey” won’t score that document higher.
>
> 2. Maybe you want to change how tf is calculated. Ignore multiple occurrences 
> of a word.
>
> I ran into this with the movie title “New York, New York” at Netflix. It 
> isn’t twice as much about New York, but it needs to be the best match for the 
> query “new york new york”.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
>
>> On Feb 9, 2017, at 5:18 AM, Ere Maijala  wrote:
>>
>> Thanks Emir.
>>
>> I was thinking of something very simple like doing what 
>> RemoveDuplicatesTokenFilter does but ignoring positions. It would of course 
>> still be possible to have the same term multiple times, but at least the 
>> adjacent ones could be deduplicated. The reason I'm not too eager to do it 
>> in a query preprocessor is that I'd have to essentially duplicate 
>> functionality of the query analysis chain that contains ICUTokenizerFactory, 
>> WordDelimiterFilterFactory and whatnot.
>>
>> Regards,
>> Ere
>>
>> 9.2.2017, 14.52, Emir Arnautovic kirjoitti:
>>> Hi Ere,
>>>
>>> I don't think that there is such filter. Implementing such filter would
>>> require looking backward which violates streaming approach of token
>>> filters and unpredictable memory usage.
>>>
>>> I would do it as part of query preprocessor and not necessarily as part
>>> of Solr.
>>>
>>> HTH,
>>> Emir
>>>
>>>
>>> On 09.02.2017 12:24, Ere Maijala wrote:
 Hi,

 I just noticed that while we use RemoveDuplicatesTokenFilter during
 query time, it will consider term positions and not really do anything
 e.g. if query is 'term term term'. As far as I can see the term
 positions make no difference in a simple non-phrase search. Is there a
 built-in way to deal with this? I know I can write a filter to do
 this, but I feel like this would be something quite basic to do for
 the query. And I don't think it's even anything too weird for normal
 users to do. Just consider e.g. searching for music by title:

 Hey, hey, hey ; Shivers of pleasure

 I also verified that at least according to debugQuery=true and
 anecdotal evidence the search really slows down if you repeat the same
 term enough.

 --Ere
>>>
>>
>> --
>> Ere Maijala
>> Kansalliskirjasto / The National Library of Finland
>


Re: difference in json update handler update/json and update/json/docs

2017-02-09 Thread Florian Meier
this was the right lead, thanks Alex

> Am 08.02.2017 um 22:20 schrieb Alexandre Rafalovitch :
> 
> /update/json expects Solr JSON update format.
> /update is an auto-route that should be equivalent to /update/json
> with the right content type/extension.
> 
> /update/json/docs expects random JSON and tries to extract fields for
> indexing from it.
> https://cwiki.apache.org/confluence/display/solr/Transforming+and+Indexing+Custom+JSON
> 
> Regards,
>   Alex.
> 
> 
> http://www.solr-start.com/ - Resources for Solr users, new and experienced
> 
> 
> On 8 February 2017 at 15:54, Florian Meier
>  wrote:
>> dear solr users,
>> can somebody explain the exact difference between the two update handlers?
>> I’m asking because with some curl commands Solr fails to identify the fields
>> of the JSON doc and indexes everything into _str_:
>> 
>> Those work perfectly:
>> curl 'http://localhost:8983/solr/testcore2/update/json?commit=true' 
>> --data-binary @example/exampledocs/cacmDocs.json
>> 
>> 
>> curl 'http://localhost:8983/solr/testcore2/update?commit=true' --data-binary 
>> @example/exampledocs/cacmDocs.json -H 'Content-type:application/json'
>> 
>> But those two (both with update/json/docs) don't
>> 
>> curl 'http://localhost:8983/solr/testcore2/update/json/docs?commit=true' 
>> --data-binary @example/exampledocs/cacmDocs.json -H 
>> 'Content-type:application/json'
>> 
>> curl 'http://localhost:8983/solr/testcore2/update/json/docs?commit=true' 
>> --data-binary @example/exampledocs/cacmDocs.json
>> 
>> Cheers,
>> Florian
>> 
>> 
>> 
>> 
>> 



Re: Removing duplicate terms from query

2017-02-09 Thread Walter Underwood
1. I don’t think this is a good idea. It means that a search for “hey hey hey” 
won’t score that document higher.

2. Maybe you want to change how tf is calculated. Ignore multiple occurrences 
of a word.

I ran into this with the movie title “New York, New York” at Netflix. It isn’t 
twice as much about New York, but it needs to be the best match for the query 
“new york new york”.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)
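A sketch of option 2, overriding the classic TF/IDF similarity so repeated
terms stop adding score (class name made up; it would be wired in with a
<similarity> element in the schema):

import org.apache.lucene.search.similarities.ClassicSimilarity;

public class CappedTfSimilarity extends ClassicSimilarity {
    @Override
    public float tf(float freq) {
        // "new york new york" then scores like a single "new york"
        return freq > 0 ? 1.0f : 0.0f;
    }
}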


> On Feb 9, 2017, at 5:18 AM, Ere Maijala  wrote:
> 
> Thanks Emir.
> 
> I was thinking of something very simple like doing what 
> RemoveDuplicatesTokenFilter does but ignoring positions. It would of course 
> still be possible to have the same term multiple times, but at least the 
> adjacent ones could be deduplicated. The reason I'm not too eager to do it in 
> a query preprocessor is that I'd have to essentially duplicate functionality 
> of the query analysis chain that contains ICUTokenizerFactory, 
> WordDelimiterFilterFactory and whatnot.
> 
> Regards,
> Ere
> 
> 9.2.2017, 14.52, Emir Arnautovic kirjoitti:
>> Hi Ere,
>> 
>> I don't think that there is such filter. Implementing such filter would
>> require looking backward which violates streaming approach of token
>> filters and unpredictable memory usage.
>> 
>> I would do it as part of query preprocessor and not necessarily as part
>> of Solr.
>> 
>> HTH,
>> Emir
>> 
>> 
>> On 09.02.2017 12:24, Ere Maijala wrote:
>>> Hi,
>>> 
>>> I just noticed that while we use RemoveDuplicatesTokenFilter during
>>> query time, it will consider term positions and not really do anything
>>> e.g. if query is 'term term term'. As far as I can see the term
>>> positions make no difference in a simple non-phrase search. Is there a
>>> built-in way to deal with this? I know I can write a filter to do
>>> this, but I feel like this would be something quite basic to do for
>>> the query. And I don't think it's even anything too weird for normal
>>> users to do. Just consider e.g. searching for music by title:
>>> 
>>> Hey, hey, hey ; Shivers of pleasure
>>> 
>>> I also verified that at least according to debugQuery=true and
>>> anecdotal evidence the search really slows down if you repeat the same
>>> term enough.
>>> 
>>> --Ere
>> 
> 
> -- 
> Ere Maijala
> Kansalliskirjasto / The National Library of Finland



Re: creating collection using collection API with SSL enabled SolrCloud

2017-02-09 Thread Hrishikesh Gadre
Hi Sean,

Have you configured the "urlScheme" cluster property (i.e. urlScheme=https)
?
https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-CLUSTERPROP:ClusterProperties

Thanks
Hrishikesh
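For the archives, setting that property from SolrJ looks roughly like this
(ZooKeeper address is a placeholder; assumes a SolrJ version that has the
CollectionAdminRequest helper):

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.request.CollectionAdminRequest;

public class UrlSchemeSketch {
    public static void main(String[] args) throws Exception {
        try (CloudSolrClient client =
                 new CloudSolrClient.Builder().withZkHost("zk1:2181").build()) {
            // Same as /admin/collections?action=CLUSTERPROP&name=urlScheme&val=https
            CollectionAdminRequest.setClusterProperty("urlScheme", "https")
                                  .process(client);
        }
    }
}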



On Thu, Feb 9, 2017 at 8:23 AM, Xie, Sean  wrote:

> Hi All,
>
> When trying to create a collection using the API when there are a few
> replicas, I’m getting an error because the call seems to be trying to use HTTP
> for the replicas.
>
> https://IP_1:8983/solr/admin/collections?action=CREATE&;
> name=My_COLLECTION&numShards=1&replicationFactor=1&
> collection.configName=my_collection_conf
>
> Here is the error:
>
> org.apache.solr.client.solrj.SolrServerException:IOException occured when
> talking to server at: http://IP_2:8983/solr
>
>
> Is there something that needs to be configured for that?
>
> Thanks
> Sean
>
>


creating collection using collection API with SSL enabled SolrCloud

2017-02-09 Thread Xie, Sean
Hi All,

When trying to create a collection using the API when there are a few
replicas, I’m getting an error because the call seems to be trying to use HTTP
for the replicas.

https://IP_1:8983/solr/admin/collections?action=CREATE&name=My_COLLECTION&numShards=1&replicationFactor=1&collection.configName=my_collection_conf

Here is the error:

org.apache.solr.client.solrj.SolrServerException:IOException occured when 
talking to server at: http://IP_2:8983/solr


Is there something that needs to be configured for that?

Thanks
Sean



Re: alerting system with Solr's Streaming Expressions

2017-02-09 Thread Susheel Kumar
got it, Thanks, Joel.

On Thu, Feb 9, 2017 at 11:17 AM, Susheel Kumar 
wrote:

> I increased from 250 to 2500 and from 100 to 1000 when I didn't get the
> expected result.  Let me add more examples.
>
> Thanks,
> Susheel
>
> On Thu, Feb 9, 2017 at 11:03 AM, Joel Bernstein 
> wrote:
>
>> A few things that I see right off:
>>
>> 1) 2500 terms is too many. I was testing with 100-250 terms
>> 2) 1000 iterations is too high. If the model hasn't converged by 100
>> iterations it's likely not going to converge.
>> 3) You're going to need more examples. You may want to run features first
>> and see what it selects. Then you need multiple examples for each feature.
>> I was testing with the enron ham/spam data set. It would be good to
>> download that dataset and see what that looks like.
>>
>> Joel Bernstein
>> http://joelsolr.blogspot.com/
>>
>> On Thu, Feb 9, 2017 at 10:15 AM, Susheel Kumar 
>> wrote:
>>
>> > Hello Joel,
>> >
>> > Here is the final iteration in json format.
>> >
>> >  https://www.dropbox.com/s/g3a3606ms6cu8q4/final_iteration.json?dl=0
>> >
>> > Below is the expression used
>> >
>> > update(models,
>> >  batchSize="50",
>> >  train(trainingSet,
>> >   features(trainingSet,
>> >  q="*:*",
>> >  featureSet="threatFeatures",
>> >  field="body_txt",
>> >  outcome="out_i",
>> >  numTerms=2500),
>> >   q="*:*",
>> >   name="threatModel",
>> >   field="body_txt",
>> >   outcome="out_i",
>> >   maxIterations="1000"))
>> >
>> > I just have 16 documents with 8+ve and 8-ves. The field which contains
>> the
>> > feedback is body_txt (text_general type)
>> >
>> > Thanks for looking.
>> >
>> >
>> >
>> > On Wed, Feb 8, 2017 at 7:52 AM, Joel Bernstein 
>> wrote:
>> >
>> > > Can you post the final iteration of the model?
>> > >
>> > > Also the expression you used to train the model?
>> > >
>> > > How much training data do you have? How many positive examples and
>> > negatives
>> > > examples?
>> > >
>> > > Joel Bernstein
>> > > http://joelsolr.blogspot.com/
>> > >
>> > > On Tue, Feb 7, 2017 at 2:14 PM, Susheel Kumar 
>> > > wrote:
>> > >
>> > > > Hello,
>> > > >
>> > > > I tried to follow http://joelsolr.blogspot.com/ to see if we can
>> > > > classify positive & negative feedbacks using streaming expressions.
>> > All
>> > > > works but end result where probability_d result of classify
>> expression
>> > > > gives similar results for positive / negative feedback. See below
>> > > >
>> > > > What might I be missing here?  Do I need to put more data in training
>> set
>> > > or
>> > > > something else?
>> > > >
>> > > >
>> > > > { "result-set": { "docs": [ { "body_txt": [ "love the company" ],
>> > > > "score_d": 2.1892474120319667, "id": "6", "probability_d":
>> > > > 0.977944433135261 }, { "body_txt": [ "bad experience " ], "score_d":
>> > > > 3.1689453250842914, "id": "5", "probability_d": 0.9888109278133054
>> }, {
>> > > > "body_txt": [ "This company rewards its employees, but you should
>> only
>> > > work
>> > > > here if you truly love sales. The stress of the job can get to you
>> and
>> > > they
>> > > > definitely push you." ], "score_d": 4.621702323888672, "id": "4",
>> > > > "probability_d": 0.99898557 }, { "body_txt": [ "no chance
>> for
>> > > > advancement with that company every year I was there it got worse I
>> > don't
>> > > > know if all branches of adp but Florence organization was turn over
>> > rate
>> > > > would be higher if it was for temp workers" ], "score_d":
>> > > > 5.288898825826228, "id": "3", "probability_d": 0.9956
>> }, {
>> > > > "body_txt": [ "It was a pleasure to work at the Milpitas campus. The
>> > team
>> > > > that works there are professional and dedicated individuals. The
>> level
>> > of
>> > > > loyalty and dedication is impressive" ], "score_d":
>> 2.5303947056922937,
>> > > > "id": "2", "probability_d": 0.990430778418 },
>> > > >
>> > >
>> >
>>
>
>


Re: alerting system with Solr's Streaming Expressions

2017-02-09 Thread Susheel Kumar
I increased from 250 to 2500 and from 100 to 1000 when I didn't get the
expected result.  Let me add more examples.

Thanks,
Susheel

On Thu, Feb 9, 2017 at 11:03 AM, Joel Bernstein  wrote:

> A few things that I see right off:
>
> 1) 2500 terms is too many. I was testing with 100-250 terms
> 2) 1000 iterations is too high. If the model hasn't converged by 100
> iterations it's likely not going to converge.
> 3) You're going to need more examples. You may want to run features first
> and see what it selects. Then you need multiple examples for each feature.
> I was testing with the enron ham/spam data set. It would be good to
> download that dataset and see what that looks like.
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Thu, Feb 9, 2017 at 10:15 AM, Susheel Kumar 
> wrote:
>
> > Hello Joel,
> >
> > Here is the final iteration in json format.
> >
> >  https://www.dropbox.com/s/g3a3606ms6cu8q4/final_iteration.json?dl=0
> >
> > Below is the expression used
> >
> > update(models,
> >  batchSize="50",
> >  train(trainingSet,
> >   features(trainingSet,
> >  q="*:*",
> >  featureSet="threatFeatures",
> >  field="body_txt",
> >  outcome="out_i",
> >  numTerms=2500),
> >   q="*:*",
> >   name="threatModel",
> >   field="body_txt",
> >   outcome="out_i",
> >   maxIterations="1000"))
> >
> > I just have 16 documents with 8+ve and 8-ves. The field which contains
> the
> > feedback is body_txt (text_general type)
> >
> > Thanks for looking.
> >
> >
> >
> > On Wed, Feb 8, 2017 at 7:52 AM, Joel Bernstein 
> wrote:
> >
> > > Can you post the final iteration of the model?
> > >
> > > Also the expression you used to train the model?
> > >
> > > How much training data do you have? How many positive examples and
> > negatives
> > > examples?
> > >
> > > Joel Bernstein
> > > http://joelsolr.blogspot.com/
> > >
> > > On Tue, Feb 7, 2017 at 2:14 PM, Susheel Kumar 
> > > wrote:
> > >
> > > > Hello,
> > > >
> > > > I tried to follow http://joelsolr.blogspot.com/ to see if we can
> > > > classify positive & negative feedbacks using streaming expressions.
> > All
> > > > works but end result where probability_d result of classify
> expression
> > > > gives similar results for positive / negative feedback. See below
> > > >
> > > > What might I be missing here?  Do I need to put more data in training
> set
> > > or
> > > > something else?
> > > >
> > > >
> > > > { "result-set": { "docs": [ { "body_txt": [ "love the company" ],
> > > > "score_d": 2.1892474120319667, "id": "6", "probability_d":
> > > > 0.977944433135261 }, { "body_txt": [ "bad experience " ], "score_d":
> > > > 3.1689453250842914, "id": "5", "probability_d": 0.9888109278133054
> }, {
> > > > "body_txt": [ "This company rewards its employees, but you should
> only
> > > work
> > > > here if you truly love sales. The stress of the job can get to you
> and
> > > they
> > > > definitely push you." ], "score_d": 4.621702323888672, "id": "4",
> > > > "probability_d": 0.99898557 }, { "body_txt": [ "no chance for
> > > > advancement with that company every year I was there it got worse I
> > don't
> > > > know if all branches of adp but Florence organization was turn over
> > rate
> > > > would be higher if it was for temp workers" ], "score_d":
> > > > 5.288898825826228, "id": "3", "probability_d": 0.9956 },
> {
> > > > "body_txt": [ "It was a pleasure to work at the Milpitas campus. The
> > team
> > > > that works there are professional and dedicated individuals. The
> level
> > of
> > > > loyalty and dedication is impressive" ], "score_d":
> 2.5303947056922937,
> > > > "id": "2", "probability_d": 0.990430778418 },
> > > >
> > >
> >
>


Re: alerting system with Solr's Streaming Expressions

2017-02-09 Thread Joel Bernstein
Also you can see in the final iteration of the model that there are 8 true
positives and 8 false positives. So this model classifies everything as
positive. At that point you know that it's not a good model.

Joel Bernstein
http://joelsolr.blogspot.com/

On Thu, Feb 9, 2017 at 11:03 AM, Joel Bernstein  wrote:

> A few things that I see right off:
>
> 1) 2500 terms is too many. I was testing with 100-250 terms
> 2) 1000 iterations is too high. If the model hasn't converged by 100
> iterations it's likely not going to converge.
> 3) You're going to need more examples. You may want to run features first
> and see what it selects. Then you need multiple examples for each feature.
> I was testing with the enron ham/spam data set. It would be good to
> download that dataset and see what that looks like.
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Thu, Feb 9, 2017 at 10:15 AM, Susheel Kumar 
> wrote:
>
>> Hello Joel,
>>
>> Here is the final iteration in json format.
>>
>>  https://www.dropbox.com/s/g3a3606ms6cu8q4/final_iteration.json?dl=0
>>
>> Below is the expression used
>>
>> update(models,
>>  batchSize="50",
>>  train(trainingSet,
>>   features(trainingSet,
>>  q="*:*",
>>  featureSet="threatFeatures",
>>  field="body_txt",
>>  outcome="out_i",
>>  numTerms=2500),
>>   q="*:*",
>>   name="threatModel",
>>   field="body_txt",
>>   outcome="out_i",
>>   maxIterations="1000"))
>>
>> I just have 16 documents with 8+ve and 8-ves. The field which contains the
>> feedback is body_txt (text_general type)
>>
>> Thanks for looking.
>>
>>
>>
>> On Wed, Feb 8, 2017 at 7:52 AM, Joel Bernstein 
>> wrote:
>>
>> > Can you post the final iteration of the model?
>> >
>> > Also the expression you used to train the model?
>> >
>> > How much training data do you have? How many positive examples and
>> negatives
>> > examples?
>> >
>> > Joel Bernstein
>> > http://joelsolr.blogspot.com/
>> >
>> > On Tue, Feb 7, 2017 at 2:14 PM, Susheel Kumar 
>> > wrote:
>> >
>> > > Hello,
>> > >
>> > > I tried to follow http://joelsolr.blogspot.com/ to see if we can
>> > > classify positive & negative feedbacks using streaming expressions.
>> All
>> > > works but end result where probability_d result of classify expression
>> > > gives similar results for positive / negative feedback. See below
>> > >
>> > > What might I be missing here?  Do I need to put more data in training
>> set
>> > or
>> > > something else?
>> > >
>> > >
>> > > { "result-set": { "docs": [ { "body_txt": [ "love the company" ],
>> > > "score_d": 2.1892474120319667, "id": "6", "probability_d":
>> > > 0.977944433135261 }, { "body_txt": [ "bad experience " ], "score_d":
>> > > 3.1689453250842914, "id": "5", "probability_d": 0.9888109278133054 },
>> {
>> > > "body_txt": [ "This company rewards its employees, but you should only
>> > work
>> > > here if you truly love sales. The stress of the job can get to you and
>> > they
>> > > definitely push you." ], "score_d": 4.621702323888672, "id": "4",
>> > > "probability_d": 0.99898557 }, { "body_txt": [ "no chance for
>> > > advancement with that company every year I was there it got worse I
>> don't
>> > > know if all branches of adp but Florence organization was turn over
>> rate
>> > > would be higher if it was for temp workers" ], "score_d":
>> > > 5.288898825826228, "id": "3", "probability_d": 0.9956 }, {
>> > > "body_txt": [ "It was a pleasure to work at the Milpitas campus. The
>> team
>> > > that works there are professional and dedicated individuals. The
>> level of
>> > > loyalty and dedication is impressive" ], "score_d":
>> 2.5303947056922937,
>> > > "id": "2", "probability_d": 0.990430778418 },
>> > >
>> >
>>
>
>


Re: alerting system with Solr's Streaming Expressions

2017-02-09 Thread Joel Bernstein
A few things that I see right off:

1) 2500 terms is too many. I was testing with 100-250 terms
2) 1000 iterations is too high. If the model hasn't converged by 100
iterations it's likely not going to converge.
3) You're going to need more examples. You may want to run features first
and see what it selects. Then you need multiple examples for each feature.
I was testing with the enron ham/spam data set. It would be good to
download that dataset and see what that looks like.

Joel Bernstein
http://joelsolr.blogspot.com/

On Thu, Feb 9, 2017 at 10:15 AM, Susheel Kumar 
wrote:

> Hello Joel,
>
> Here is the final iteration in json format.
>
>  https://www.dropbox.com/s/g3a3606ms6cu8q4/final_iteration.json?dl=0
>
> Below is the expression used
>
> update(models,
>  batchSize="50",
>  train(trainingSet,
>   features(trainingSet,
>  q="*:*",
>  featureSet="threatFeatures",
>  field="body_txt",
>  outcome="out_i",
>  numTerms=2500),
>   q="*:*",
>   name="threatModel",
>   field="body_txt",
>   outcome="out_i",
>   maxIterations="1000"))
>
> I just have 16 documents with 8+ve and 8-ves. The field which contains the
> feedback is body_txt (text_general type)
>
> Thanks for looking.
>
>
>
> On Wed, Feb 8, 2017 at 7:52 AM, Joel Bernstein  wrote:
>
> > Can you post the final iteration of the model?
> >
> > Also the expression you used to train the model?
> >
> > How much training data do you have? How many positive examples and
> negatives
> > examples?
> >
> > Joel Bernstein
> > http://joelsolr.blogspot.com/
> >
> > On Tue, Feb 7, 2017 at 2:14 PM, Susheel Kumar 
> > wrote:
> >
> > > Hello,
> > >
> > > I tried to follow http://joelsolr.blogspot.com/ to see if we can
> > > classify positive & negative feedbacks using streaming expressions.
> All
> > > works but end result where probability_d result of classify expression
> > > gives similar results for positive / negative feedback. See below
> > >
> > > What might I be missing here?  Do I need to put more data in training set
> > or
> > > something else?
> > >
> > >
> > > { "result-set": { "docs": [ { "body_txt": [ "love the company" ],
> > > "score_d": 2.1892474120319667, "id": "6", "probability_d":
> > > 0.977944433135261 }, { "body_txt": [ "bad experience " ], "score_d":
> > > 3.1689453250842914, "id": "5", "probability_d": 0.9888109278133054 }, {
> > > "body_txt": [ "This company rewards its employees, but you should only
> > work
> > > here if you truly love sales. The stress of the job can get to you and
> > they
> > > definitely push you." ], "score_d": 4.621702323888672, "id": "4",
> > > "probability_d": 0.99898557 }, { "body_txt": [ "no chance for
> > > advancement with that company every year I was there it got worse I
> don't
> > > know if all branches of adp but Florence organization was turn over
> rate
> > > would be higher if it was for temp workers" ], "score_d":
> > > 5.288898825826228, "id": "3", "probability_d": 0.9956 }, {
> > > "body_txt": [ "It was a pleasure to work at the Milpitas campus. The
> team
> > > that works there are professional and dedicated individuals. The level
> of
> > > loyalty and dedication is impressive" ], "score_d": 2.5303947056922937,
> > > "id": "2", "probability_d": 0.990430778418 },
> > >
> >
>


Re: Solr Heap Dump: Any suggestions on what to look for?

2017-02-09 Thread Shawn Heisey
On 2/9/2017 6:19 AM, Kelly, Frank wrote:
> Got a heap dump on an Out of Memory error.
> Analyzing the dump now in Visual VM
>
> Seeing a lot of byte[] arrays (77% of our 8GB Heap) in
>
>   * TreeMap$Entry
>   * FieldCacheImpl$SortedDocValues
>
> We’re considering switching over to DocValues but would rather be
> definitive about the root cause before we experiment with DocValues
> and require a reindex of our 200M document index 
> In each of our 4 data centers.
>
> Any suggestions on what I should look for in this heap dump to get a
> definitive root cause?
>

When the large allocations are byte[] arrays, the cause is likely a
low-level class, probably in Lucene.  Solr will likely have almost no
influence on these
memory allocations, except by changing the schema to enable docValues,
which changes the particular Lucene code that is called.  Note that
wiping the index and rebuilding it from scratch is necessary when you
enable docValues.

Another possible source of problems like this is the filterCache.  A 200
million document index (assuming it's all on the same machine) results
in filterCache entries that are 25 million bytes each.  In Solr
examples, the filterCache defaults to a size of 512.  If a cache that
size on a 200 million document index fills up, it will require nearly 13
gigabytes of heap memory.

Thanks,
Shawn



RE: DistributedUpdateProcessorFactory was explicitly disabled from this updateRequestProcessorChain

2017-02-09 Thread Pratik Thaker
Hi Friends,

Can you please give me some details about the issue below?

Regards,
Pratik Thaker

From: Pratik Thaker
Sent: 07 February 2017 17:12
To: 'solr-user@lucene.apache.org'
Subject: DistributedUpdateProcessorFactory was explicitly disabled from this 
updateRequestProcessorChain

Hi All,

I am using SOLR Cloud 6.0

I am receiving below exception very frequently in solr logs,

o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: 
RunUpdateProcessor has received an AddUpdateCommand containing a document that 
appears to still contain Atomic document update operations, most likely because 
DistributedUpdateProcessorFactory was explicitly disabled from this 
updateRequestProcessorChain
at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:63)
at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:48)
at org.apache.solr.update.processor.AddSchemaFieldsUpdateProcessorFactory$AddSchemaFieldsUpdateProcessor.processAdd(AddSchemaFieldsUpdateProcessorFactory.java:335)
at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:48)
at org.apache.solr.update.processor.FieldMutatingUpdateProcessor.processAdd(FieldMutatingUpdateProcessor.java:117)
at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:48)
at org.apache.solr.update.processor.FieldMutatingUpdateProcessor.processAdd(FieldMutatingUpdateProcessor.java:117)
at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:48)
at org.apache.solr.update.processor.FieldMutatingUpdateProcessor.processAdd(FieldMutatingUpdateProcessor.java:117)
at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:48)
at org.apache.solr.update.processor.FieldMutatingUpdateProcessor.processAdd(FieldMutatingUpdateProcessor.java:117)
at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:48)
at org.apache.solr.update.processor.FieldNameMutatingUpdateProcessorFactory$1.processAdd(FieldNameMutatingUpdateProcessorFactory.java:74)
at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:48)
at org.apache.solr.update.processor.FieldMutatingUpdateProcessor.processAdd(FieldMutatingUpdateProcessor.java:117)
at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:48)
at org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProcessor.java:936)
at org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:1091)
at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:714)
at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:48)
at org.apache.solr.update.processor.AbstractDefaultValueUpdateProcessorFactory$DefaultValueUpdateProcessor.processAdd(AbstractDefaultValueUpdateProcessorFactory.java:93)
at org.apache.solr.handler.loader.JavabinLoader$1.update(JavabinLoader.java:97)

Can you please help me find the root cause? Below is a snapshot of the
relevant solrconfig section:






   


<updateRequestProcessorChain name="add-unknown-fields-to-the-schema">
  <!-- ... -->
  <processor class="solr.FieldNameMutatingUpdateProcessorFactory">
    <str name="pattern">[^\w-\.]</str>
    <str name="replacement">_</str>
  </processor>
  <processor class="solr.ParseBooleanFieldUpdateProcessorFactory"/>
  <processor class="solr.ParseLongFieldUpdateProcessorFactory"/>
  <processor class="solr.ParseDoubleFieldUpdateProcessorFactory"/>
  <processor class="solr.ParseDateFieldUpdateProcessorFactory">
    <arr name="format">
      <str>yyyy-MM-dd'T'HH:mm:ss.SSSZ</str>
      <str>yyyy-MM-dd'T'HH:mm:ss,SSSZ</str>
      <str>yyyy-MM-dd'T'HH:mm:ss.SSS</str>
      <str>yyyy-MM-dd'T'HH:mm:ss,SSS</str>
      <str>yyyy-MM-dd'T'HH:mm:ssZ</str>
      <str>yyyy-MM-dd'T'HH:mm:ss</str>
      <str>yyyy-MM-dd'T'HH:mmZ</str>
      <str>yyyy-MM-dd'T'HH:mm</str>
      <str>yyyy-MM-dd HH:mm:ss.SSSZ</str>
      <str>yyyy-MM-dd HH:mm:ss,SSSZ</str>
      <str>yyyy-MM-dd HH:mm:ss.SSS</str>
      <str>yyyy-MM-dd HH:mm:ss,SSS</str>
      <str>yyyy-MM-dd HH:mm:ssZ</str>
      <str>yyyy-MM-dd HH:mm:ss</str>
      <str>yyyy-MM-dd HH:mmZ</str>
      <str>yyyy-MM-dd HH:mm</str>
      <str>yyyy-MM-dd</str>
    </arr>
  </processor>
  <processor class="solr.AddSchemaFieldsUpdateProcessorFactory">
    <str name="defaultFieldType">strings</str>
    <lst name="typeMapping">
      <str name="valueClass">java.lang.Boolean</str>
      <str name="fieldType">booleans</str>
    </lst>
    <lst name="typeMapping">
      <str name="valueClass">java.util.Date</str>
      <str name="fieldType">tdates</str>
    </lst>
    <lst name="typeMapping">
      <str name="valueClass">java.lang.Long</str>
      <str name="valueClass">java.lang.Integer</str>
      <str name="fieldType">tlongs</str>
    </lst>
    <lst name="typeMapping">
      <str name="valueClass">java.lang.Number</str>
      <str name="fieldType">tdoubles</str>
    </lst>
  </processor>
  <!-- ... -->
</updateRequestProcessorChain>

Regards,
Pratik Thaker




Re: procedure to restart solrcloud, and config/collection consistency

2017-02-09 Thread Shawn Heisey
On 2/9/2017 5:24 AM, xavier jmlucjav wrote:
> I always wondered, if this was not really needed, and I could just call
> 'restart' in every node, in a quick loop, and forget about it. Does anyone
> know if this is the case?
>
> My doubt is in regards to changing some config, and then doing the above
> (just restart nodes in a loop). For example, what if I change a config G
> used in collection C, and I restart just one of the nodes (N1), leaving the 
> rest alone. If all the nodes contain a shard for C, what happens, N1 is using 
> the new config and the rest are not? how is this handled?

If you want to change the config or schema for a collection and make it
active across all nodes, just use the collections API to RELOAD the
collection.  The change will be picked up everywhere.

https://cwiki.apache.org/confluence/display/solr/Collections+API

To answer your question: No.  Solr does not have the ability to restart
itself.  It would require significant development effort and a
fundamental change in how Solr is started to make it possible.  It is
something that has been discussed, but at this time it is not possible.

One idea that would make this possible is mentioned on the following
wiki page.  It talks about turning Solr into two applications instead of
one:

https://wiki.apache.org/solr/WhyNoWar#Information_that.27s_not_version_specific

Again -- it would not be easy, which is why it hasn't been done yet.

Thanks,
Shawn
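A reload from SolrJ, for reference (collection name and ZooKeeper address are
placeholders; assumes a reasonably recent SolrJ, and the raw HTTP form is in
the comment):

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.request.CollectionAdminRequest;

public class ReloadSketch {
    public static void main(String[] args) throws Exception {
        try (CloudSolrClient client =
                 new CloudSolrClient.Builder().withZkHost("zk1:2181").build()) {
            // Same as /admin/collections?action=RELOAD&name=collection1
            CollectionAdminRequest.reloadCollection("collection1").process(client);
        }
    }
}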



Re: Could not find configName for collection

2017-02-09 Thread Shawn Heisey
On 2/9/2017 4:03 AM, Sedat Kestepe wrote:
> When I try to create a collection through Solr or create an index through
> Hue using a csv file, I get the below error:
>
> { "message": 
> "{\"responseHeader\":{\"status\":400,\"QTime\":16025},\"error\":{\"metadata\":[\"error-class\",\"org.apache.solr.common.SolrException\",\"root-error-class\",\"org.apache.solr.common.cloud.ZooKeeperException\"],\"msg\":\"Error
> CREATEing SolrCore 'deneme': Unable to create core [deneme] Caused by:
> Could not find configName for collection deneme found:[twitter_demo,
> testCollMoney, collection1]\",\"code\":400}}\n (error 400)", "traceback": [
> [ "/usr/local/hue/desktop/libs/libsolr/src/libsolr/api.py", 541,
> "create_core", "response = self._root.post('admin/cores', params=params,
> contenttype='application/json')" ], [
> "/usr/local/hue/desktop/core/src/desktop/lib/rest/resource.py", 132,
> "post", "allow_redirects=allow_redirects,
> clear_cookies=clear_cookies)" ], [
> "/usr/local/hue/desktop/core/src/desktop/lib/rest/resource.py", 81, "invoke"
> , "clear_cookies=clear_cookies)" ], [
> "/usr/local/hue/desktop/core/src/desktop/lib/rest/http_client.py", 173,
> "execute", "raise self._exc_class(ex)" ] ], "detail": null, "title": "Error
> while accessing Solr" }

It can't find the config for this collection when trying to create cores
for the collection.  Either you did not tell it which config to use, or
it cannot find the config.
> If I try to install examples on Hue, I get the below error:
>
> "responseHeader":{"status":400,"QTime":7},"error":{"metadata":["error-class"
> ,"org.apache.solr.common.SolrException","root-error-class",
> "java.lang.ClassNotFoundException"],"msg":"Error CREATEing SolrCore
> 'twitter_demo': Unable to create core [twitter_demo] Caused by:
> solr.ThaiWordFilterFactory","code":400}}

The ThaiWordFilterFactory was deprecated sometime in the 4.x series. 
This means that it was removed from 5.0 and later.  The javadocs for
this class on version 4.8 say that you should use ThaiTokenizerFactory
instead.  Your schema needs changes.

> The only way I can create a collection is by uploading the ZooKeeper config
> first and then using the ./solr create -c command (both manually on the
> command line). I want to be able to create them over the web UI.

The "bin/solr create" command does its work in two steps.  First it
takes the configset (which defaults to basic_configs if you don't
specify it) and copies that configset to zookeeper with the same name as
the collection, then it calls the Collections API to create that
collection, using the config name that it just uploaded.

In order to be able to use the HTTP API to create a collection, the
config that you want to use must already be present in zookeeper.  If
the "collection.configName" parameter is sent with the CREATE request,
then it will use the config with that name, otherwise it will look for a
config with the same name as the collection.  If it can't find a config
to use, it will produce the error you have described.

https://cwiki.apache.org/confluence/display/solr/Using+ZooKeeper+to+Manage+Configuration+Files
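
For example (a rough sketch -- substitute your own ZooKeeper address,
config name and paths):

  # upload a configset to ZooKeeper first
  bin/solr zk upconfig -z localhost:2181 -n myconf -d /path/to/conf

  # then create the collection against that named config
  curl "http://localhost:8983/solr/admin/collections?action=CREATE&name=deneme&numShards=1&replicationFactor=1&collection.configName=myconf"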

> Story: First I installed Solr on Ambari while existing Ambari Infra Solr
> was working. When doing this, Ambari was not starting Solr with /infra-solr
> Zookeeper path.
>
> Today I removed Solr (keeping Infra-Solr but stopping it) on another host.
> Result was the same. One tip: even if this installation is a clean one,
> Solr installation can still see the configs left from my very first manual
> Solr installation in the tree view.
>
> Environment:
>
> Ambari: 2.4.2
>
> HDP: 2.5.3
>
> Solr: 6.4.0 from https://github.com/abajwa-hw/solr-stack (I changed the
> repo URL to 6.4.0)
>
> Hue: 3.11 on Docker (Centos 6.8)

If you did not get Solr from an official Apache mirror, then we cannot
make any guarantees about *what* you are installing -- they may have
modified it.  I am unfamiliar with all of the other software pieces you
have mentioned.  We will not be able to help with those.

Thanks,
Shawn



Re: inconsistent result count when doing paging

2017-02-09 Thread Shawn Heisey
On 2/8/2017 9:35 PM, cmti95035 wrote:
> I noticed in our production environment that the returned result count is
> inconsistent when doing paging.
>
> For example, for a certain query, for the first page (start = 0, rows = 30),
> the corresponding "numFound" is 3402; and then it returned 3378, 3361 for
> the 2nd and 3rd page, respectively (start = 30, 60 respectively). A sample
> query looks like the following:
> q:TMCN:(美丽 OR ?美丽 OR 美丽? OR 丽美)
> raw query parameters:
> fl=*&start=60&rows=30&shards=172.10.10.3:9080/solr/tm01,172.10.10.3:9080

> /solr/tm44,172.10.10.3:9080/solr/tm45&facet=true&facet.missing=false&facet.field=intCls&facet.field=appDate&facet.field=TMStatus
>
> The query was against multiple shards at a time. With limited tries I
> noticed that the returned count is consistent if the number of shards is less
> than 5. 

When a distributed search returns different numFound values on different
requests for the same query, it almost always means that your uniqueKey
field is not unique between the different shards -- you have documents
using the same uniqueKey value in more than one shard.

The reason you see different counts has to do with which shards get
their results back to the coordinating node first, so on one query there
may be a different number of duplicate documents than on a subsequent
query, and the fact that Solr will remove duplicates from the combined
results before calculating the total.  Probably when you reduce the
number of shards, you are removing shards from the list that contain the
duplicate documents, so the problem doesn't happen.

It is *critical* that the uniqueKey field remains unique across the
entire distributed index.  Using SolrCloud with *fully* automatic
document routing will typically ensure that everything is unique across
the entire collection, but in other situations, making sure this happens
will be up to you.
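
One quick way to check for duplicates (a rough diagnostic sketch, assuming
your uniqueKey field is named "id" -- note that faceting on a high-cardinality
key field is expensive, so run this off-peak):

  http://172.10.10.3:9080/solr/tm01/select?q=*:*&rows=0&facet=true&facet.field=id&facet.mincount=2&facet.limit=100&shards=...

Any value that comes back with a count of 2 or more exists in more than one
shard (or more than once within a shard).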

Thanks,
Shawn



Re: alerting system with Solr's Streaming Expressions

2017-02-09 Thread Susheel Kumar
Hello Joel,

Here is the final iteration in json format.

 https://www.dropbox.com/s/g3a3606ms6cu8q4/final_iteration.json?dl=0

Below is the expression used

update(models,
 batchSize="50",
 train(trainingSet,
  features(trainingSet,
 q="*:*",
 featureSet="threatFeatures",
 field="body_txt",
 outcome="out_i",
 numTerms=2500),
  q="*:*",
  name="threatModel",
  field="body_txt",
  outcome="out_i",
  maxIterations="1000"))

I have just 16 documents: 8 positive and 8 negative. The field which contains
the feedback is body_txt (text_general type).

Thanks for looking.



On Wed, Feb 8, 2017 at 7:52 AM, Joel Bernstein  wrote:

> Can you post the final iteration of the model?
>
> Also the expression you used to train the model?
>
> How much training data do you have? How many positive examples and negative
> examples?
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Tue, Feb 7, 2017 at 2:14 PM, Susheel Kumar 
> wrote:
>
> > Hello,
> >
> > I tried to follow http://joelsolr.blogspot.com/ to see if we can
> > classify positive & negative feedback using streaming expressions.  It all
> > works, but the end result -- the probability_d value of the classify
> > expression -- is similar for positive and negative feedback. See below.
> >
> > What may I be missing here?  Do I need to put more data in the training
> > set, or something else?
> >
> >
> > { "result-set": { "docs": [ { "body_txt": [ "love the company" ],
> > "score_d": 2.1892474120319667, "id": "6", "probability_d":
> > 0.977944433135261 }, { "body_txt": [ "bad experience " ], "score_d":
> > 3.1689453250842914, "id": "5", "probability_d": 0.9888109278133054 }, {
> > "body_txt": [ "This company rewards its employees, but you should only
> work
> > here if you truly love sales. The stress of the job can get to you and
> they
> > definitely push you." ], "score_d": 4.621702323888672, "id": "4",
> > "probability_d": 0.99898557 }, { "body_txt": [ "no chance for
> > advancement with that company every year I was there it got worse I don't
> > know if all branches of adp but Florence organization was turn over rate
> > would be higher if it was for temp workers" ], "score_d":
> > 5.288898825826228, "id": "3", "probability_d": 0.9956 }, {
> > "body_txt": [ "It was a pleasure to work at the Milpitas campus. The team
> > that works there are professional and dedicated individuals. The level of
> > loyalty and dedication is impressive" ], "score_d": 2.5303947056922937,
> > "id": "2", "probability_d": 0.990430778418 },
> >
>


Re: Solr partial update

2017-02-09 Thread Mike Thomsen
Set the fl parameter equal to the fields you want and then query for
id:(SOME_ID OR SOME_ID OR SOME_ID)
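
For example (host, collection and field names here are made up for
illustration):

  http://localhost:8983/solr/yourcollection/select?q=id:(1+OR+2+OR+3)&fl=id,title

Separately, if the goal is to apply an atomic update only when the document
already exists, Solr's optimistic concurrency can enforce that (a sketch --
sending _version_=1 makes the update fail for an id that does not exist):

  curl -H 'Content-Type: application/json' \
    'http://localhost:8983/solr/yourcollection/update?commit=true' -d '
  [{"id":"1", "_version_":1, "title":{"set":"new title"}}]'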

On Thu, Feb 9, 2017 at 5:37 AM, Midas A  wrote:

> Hi,
>
> I want to update a Solr doc partially if the unique id exists; otherwise we
> do not want to do anything.
>
> How can I achieve this?
>
> Regards,
> Midas
>


RE: Solr Heap Dump: Any suggestions on what to look for?

2017-02-09 Thread Markus Jelsma
 
-Original message-
> From:Kelly, Frank 
> Sent: Thursday 9th February 2017 15:42
> To: solr-user@lucene.apache.org
> Subject: Re: Solr Heap Dump: Any suggestions on what to look for?
> 
> Thanks for the fast reply.
> 
> I think we're going to focus on using doc values.
> 
> You also said "facet on fewer fields" - how does one do that?

Well, actually, you don't :) Or you'd no longer meet your functional requirements.

> 
> Thanks!
> 
> -Frank
> 
>  
> Frank Kelly
> Principal Software Engineer
>  
> HERE 
> 5 Wayside Rd, Burlington, MA 01803, USA
> 42° 29' 7" N 71° 11' 32" W
>  
>   
> 
> 
> 
> 
> 
> 
> On 2/9/17, 8:25 AM, "Markus Jelsma"  wrote:
> 
> >Hello - FieldCache is your problem. This can be solved in many ways but
>only one is really beneficial: decrease the number of documents, increase heap,
> >facet on fewer fields, don't do function query on many fields. Or, of
> >course, reindex with doc values. And you get a bonus, you can also
> >drastically reduce your heap size.
> >
> >Original message-
> >> From:Kelly, Frank 
> >> Sent: Thursday 9th February 2017 14:20
> >> To: solr-user@lucene.apache.org
> >> Subject: Solr Heap Dump: Any suggestions on what to look for?
> >> 
> >> Got a heap dump on an Out of Memory error.
> >> Analyzing the dump now in Visual VM
> >> 
> > 
> >> Seeing a lot of byte[] arrays (77% of our 8GB Heap) in
> >> TreeMap$Entry FieldCacheImpl$SortedDocValues
> >> We're considering switching over to DocValues but would rather be
> >>definitive about the root cause before we experiment with DocValues and
> >>require a reindex of our 200M document index
> >> In each of our 4 data centers.
> >> 
> > 
> >> Any suggestions on what I should look for in this heap dump to get a
> >>definitive root cause?
> >> 
> > 
> >> Cheers! 
> >> 
> > 
> >> -Frank 
> >> 
> > 
> >> 
> > 
> >> Frank Kelly 
> >> Principal Software Engineer
> >> HERE  
> >> 5 Wayside Rd, Burlington, MA 01803, USA
> >> 42° 29' 7" N 71° 11' 32" W
> >>  
> 
> 


Re: Solr Heap Dump: Any suggestions on what to look for?

2017-02-09 Thread Kelly, Frank
Thanks for the fast reply.

I think we're going to focus on using doc values.

You also said "facet on fewer fields" - how does one do that?

Thanks!

-Frank

 
Frank Kelly
Principal Software Engineer
 
HERE 
5 Wayside Rd, Burlington, MA 01803, USA
42° 29' 7" N 71° 11' 32" W
 
  






On 2/9/17, 8:25 AM, "Markus Jelsma"  wrote:

>Hello - FieldCache is your problem. This can be solved in many ways but
>only one is really beneficial: decrease the number of documents, increase heap,
>facet on fewer fields, don't do function query on many fields. Or, of
>course, reindex with doc values. And you get a bonus, you can also
>drastically reduce your heap size.
>
>Original message-
>> From:Kelly, Frank 
>> Sent: Thursday 9th February 2017 14:20
>> To: solr-user@lucene.apache.org
>> Subject: Solr Heap Dump: Any suggestions on what to look for?
>> 
>> Got a heap dump on an Out of Memory error.
>> Analyzing the dump now in Visual VM
>> 
> 
>> Seeing a lot of byte[] arrays (77% of our 8GB Heap) in
>> TreeMap$Entry FieldCacheImpl$SortedDocValues
>> We're considering switching over to DocValues but would rather be
>>definitive about the root cause before we experiment with DocValues and
>>require a reindex of our 200M document index
>> In each of our 4 data centers.
>> 
> 
>> Any suggestions on what I should look for in this heap dump to get a
>>definitive root cause?
>> 
> 
>> Cheers! 
>> 
> 
>> -Frank 
>> 
> 
>> 
> 
>> Frank Kelly 
>> Principal Software Engineer
>> HERE  
>> 5 Wayside Rd, Burlington, MA 01803, USA
>> 42° 29' 7" N 71° 11' 32" W
>>  



RE: DataImportHandler - Unable to load Tika Config Processing Document # 1

2017-02-09 Thread Anatharaman, Srinatha (Contractor)
Shawn,

Thanks again for your input

As I said in my last email, I successfully completed this in standalone Solr.
My requirement is to index emails which have already been converted to text
files (there are no attachments). Once these text files are indexed, a Solr
search should bring me back the entire text file as-is; I am able to achieve
this in standalone Solr.
For testing my code in SolrCloud I just kept a small file with 3 characters in
it. Solr does not throw any error, but it also does not index the file.

I tried the approaches below:
1. Issue with DataImportHandler -- ZooKeeper is not able to read the
tikaConfig.conf file at run time.
2. Issue with Flume SolrSink -- no error is shown and it does not index, but
once in a while it does index even though I did not make any code changes.

As you mentioned, I never saw Solr crashing or eating up CPU or RAM. The file
which I am indexing is very small { it has ABC \n DEF }.
My worry is that Solr is not throwing any error; I set the log level to TRACE.

Thanks & Regards,
~Sri



-Original Message-
From: Markus Jelsma [mailto:markus.jel...@openindex.io] 
Sent: Wednesday, February 08, 2017 4:15 PM
To: solr-user@lucene.apache.org
Cc: Shawn Heisey 
Subject: RE: DataImportHandler - Unable to load Tika Config Processing Document 
# 1

> Thank you I will follow Erick's steps
> BTW I am also trying to ingest using Flume; Flume uses Morphlines
> along with Tika. Will even Flume SolrSink have the same issue?

Yes, when using Tika you run the risk of it choking on a document, eating CPU 
and/or RAM until everything dies. This is also true when you run it standalone. 
The problem is usually caused by PDF and Office documents that are unusual, 
corrupt or incomplete (e.g. truncated in size) or extremely large. But even 
ordinary HTML can get you into trouble due to extreme sizes or very deep nested 
elements.

But, in general, it is not a problem you will experience frequently. We operate 
broad and large scale web crawlers, ingesting all kinds of bad stuff all the 
time. The trick to avoid problems is running each Tika parse in a separate 
thread, have a timer and kill the thread if it reaches a limit. It can still go 
wrong, but trouble is very rare.
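
A rough sketch of that pattern (assuming the plain Tika API; note that a truly
stuck parser thread may still need the whole JVM recycled):

  import java.io.InputStream;
  import java.util.concurrent.*;
  import org.apache.tika.metadata.Metadata;
  import org.apache.tika.parser.AutoDetectParser;
  import org.apache.tika.sax.BodyContentHandler;

  public class GuardedParse {
    private static final ExecutorService pool = Executors.newCachedThreadPool();

    // Run the parse in its own thread and abandon it when the limit is hit.
    public static String parseWithTimeout(InputStream in, long seconds) throws Exception {
      Future<String> task = pool.submit(() -> {
        BodyContentHandler handler = new BodyContentHandler(-1); // no write limit
        new AutoDetectParser().parse(in, handler, new Metadata());
        return handler.toString();
      });
      try {
        return task.get(seconds, TimeUnit.SECONDS);
      } catch (TimeoutException e) {
        task.cancel(true); // interrupts the parse thread; not always sufficient
        throw e;
      }
    }
  }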

Running it standalone and talking to it over network is safest, but not very 
portable/easy distributable on Hadoop or other platforms.



RE: Solr Heap Dump: Any suggestions on what to look for?

2017-02-09 Thread Markus Jelsma
Hello - FieldCache is your problem. This can be solved in many ways but only 
one is really beneficial: decrease the number of documents, increase heap, facet on 
fewer fields, don't do function query on many fields. Or, of course, reindex 
with doc values. And you get a bonus, you can also drastically reduce your heap 
size.
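
The schema change itself is small; a hedged example (the field name is
illustrative):

  <field name="category" type="string" indexed="true" stored="true" docValues="true"/>

Keep in mind docValues only exist for documents indexed after the change,
hence the full reindex.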

Original message-
> From:Kelly, Frank 
> Sent: Thursday 9th February 2017 14:20
> To: solr-user@lucene.apache.org
> Subject: Solr Heap Dump: Any suggestions on what to look for?
> 
> Got a heap dump on an Out of Memory error. 
> Analyzing the dump now in Visual VM 
> 
 
> Seeing a lot of byte[] arrays (77% of our 8GB Heap) in 
> TreeMap$Entry FieldCacheImpl$SortedDocValues 
> We’re considering switch over to DocValues but would rather be definitive 
> about the root cause before we experiment with DocValues and require a 
> reindex of our 200M document index  
> In each of our 4 data centers. 
> 
 
> Any suggestions on what I should look for in this heap dump to get a 
> definitive root cause? 
> 
 
> Cheers! 
> 
 
> -Frank 
> 
 
> 
 
> Frank Kelly 
> Principal Software Engineer 
> HERE  
> 5 Wayside Rd, Burlington, MA 01803, USA 
> 42° 29' 7" N 71° 11' 32" W 
>      
>  


Solr Heap Dump: Any suggestions on what to look for?

2017-02-09 Thread Kelly, Frank
Got a heap dump on an Out of Memory error.
Analyzing the dump now in Visual VM

Seeing a lot of byte[] arrays (77% of our 8GB Heap) in

  *   TreeMap$Entry
  *   FieldCacheImpl$SortedDocValues

We’re considering switching over to DocValues but would rather be definitive about 
the root cause before we experiment with DocValues and require a reindex of our 
200M document index
In each of our 4 data centers.

Any suggestions on what I should look for in this heap dump to get a definitive 
root cause?

Cheers!

-Frank





Frank Kelly

Principal Software Engineer



HERE

5 Wayside Rd, Burlington, MA 01803, USA

42° 29' 7" N 71° 11' 32" W



Re: Removing duplicate terms from query

2017-02-09 Thread Ere Maijala

Thanks Emir.

I was thinking of something very simple like doing what 
RemoveDuplicatesTokenFilter does but ignoring positions. It would of 
course still be possible to have the same term multiple times, but at 
least the adjacent ones could be deduplicated. The reason I'm not too 
eager to do it in a query preprocessor is that I'd have to essentially 
duplicate functionality of the query analysis chain that contains 
ICUTokenizerFactory, WordDelimiterFilterFactory and whatnot.


Regards,
Ere

9.2.2017, 14.52, Emir Arnautovic kirjoitti:

Hi Ere,

I don't think that there is such a filter. Implementing such a filter would
require looking backward, which violates the streaming approach of token
filters and implies unpredictable memory usage.

I would do it as part of query preprocessor and not necessarily as part
of Solr.

HTH,
Emir


On 09.02.2017 12:24, Ere Maijala wrote:

Hi,

I just noticed that while we use RemoveDuplicatesTokenFilter during
query time, it will consider term positions and not really do anything
e.g. if query is 'term term term'. As far as I can see the term
positions make no difference in a simple non-phrase search. Is there a
built-in way to deal with this? I know I can write a filter to do
this, but I feel like this would be something quite basic to do for
the query. And I don't think it's even anything too weird for normal
users to do. Just consider e.g. searching for music by title:

Hey, hey, hey ; Shivers of pleasure

I also verified that at least according to debugQuery=true and
anecdotal evidence the search really slows down if you repeat the same
term enough.

--Ere




--
Ere Maijala
Kansalliskirjasto / The National Library of Finland


RE: Removing duplicate terms from query

2017-02-09 Thread Markus Jelsma
How about a pattern replace char filter that checks for repeating groups? It's
probably not the fastest option but it should work right away.
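
Something along these lines might do it (an untested sketch -- the
backreference collapses a term repeated back-to-back, and since it is a char
filter it runs before tokenization):

  <charFilter class="solr.PatternReplaceCharFilterFactory"
              pattern="\b(\w+)(\s+\1\b)+"
              replacement="$1"/>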
 
-Original message-
> From:Emir Arnautovic 
> Sent: Thursday 9th February 2017 13:52
> To: solr-user@lucene.apache.org
> Subject: Re: Removing duplicate terms from query
> 
> Hi Ere,
> 
> I don't think that there is such a filter. Implementing such a filter would
> require looking backward, which violates the streaming approach of token
> filters and implies unpredictable memory usage.
> 
> I would do it as part of query preprocessor and not necessarily as part 
> of Solr.
> 
> HTH,
> Emir
> 
> 
> On 09.02.2017 12:24, Ere Maijala wrote:
> > Hi,
> >
> > I just noticed that while we use RemoveDuplicatesTokenFilter during 
> > query time, it will consider term positions and not really do anything 
> > e.g. if query is 'term term term'. As far as I can see the term 
> > positions make no difference in a simple non-phrase search. Is there a 
> > built-in way to deal with this? I know I can write a filter to do 
> > this, but I feel like this would be something quite basic to do for 
> > the query. And I don't think it's even anything too weird for normal 
> > users to do. Just consider e.g. searching for music by title:
> >
> > Hey, hey, hey ; Shivers of pleasure
> >
> > I also verified that at least according to debugQuery=true and 
> > anecdotal evidence the search really slows down if you repeat the same 
> > term enough.
> >
> > --Ere
> 
> -- 
> Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> Solr & Elasticsearch Support * http://sematext.com/
> 
> 


RE: procedure to restart solrcloud, and config/collection consistency

2017-02-09 Thread Markus Jelsma
Hello - see inline.
 
-Original message-
> From:xavier jmlucjav 
> Sent: Thursday 9th February 2017 13:46
> To: solr-user 
> Subject: Re: procedure to restart solrcloud, and config/collection consistency
> 
> Hi Markus,
> 
> yes, of course I know (and use) the collections api to reload the config. I
> am asking what would happen in that scenario:
> - config updated (but collection not reloaded)
> - i restart one node
> 
> now one node has the new config and the rest the old one??

The restarted/reloaded node has the new config, the others have the old config 
until reloaded/restarted.
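
A reload of the collection brings them all back in line, e.g. (host is
illustrative):

  curl "http://localhost:8983/solr/admin/collections?action=RELOAD&name=C"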

> 
> Regarding restarting many hosts, my question is if we can just 'restart'
> each solr and that is enough, or it is better to first stop all, and then
> start all.

We prefer a rolling restart, restarting all nodes in sequence with some wait 
time in between to allow the node to come back up properly. I see no reason to 
do a stop all/start all unless you have cleared/will clear the index and want 
to reindex.
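
Scripted, such a rolling restart can be as simple as (a sketch -- host names,
wait time and the cloud/ZooKeeper flags are assumptions for your setup):

  for h in nb1 nb2 nb3; do
    ssh "$h" '/opt/solr/bin/solr restart -c -z zk1:2181' && sleep 60
  done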

> 
> thanks
> 
> 
> On Thu, Feb 9, 2017 at 1:28 PM, Markus Jelsma 
> wrote:
> 
> > Hello - if you just want to use updated configuration, you can use Solr's
> > collection reload API call. For restarting we rely on remote provisioning
> > tools such as Salt, other managing tools can probably execute commands
> > remotely as well.
> >
> > If you operate more than just a very few machines, i'd really recommend
> > using these tools.
> >
> > Markus
> >
> >
> >
> > -Original message-
> > > From:xavier jmlucjav 
> > > Sent: Thursday 9th February 2017 13:24
> > > To: solr-user 
> > > Subject: procedure to restart solrcloud, and config/collection
> > consistency
> > >
> > > Hi,
> > >
> > > When I need to restart a Solrcloud cluster, I always do this:
> > > - log in into host nb1, stop solr
> > > - log in into host nb2, stop solr
> > > -...
> > > - log in into host nbX, stop solr
> > > - verify all hosts did stop
> > > - in host nb1, start solr
> > > - in host nb2, start solr
> > > -...
> > >
> > > I always wondered, if this was not really needed, and I could just call
> > > 'restart' in every node, in a quick loop, and forget about it. Does
> > anyone
> > > know if this is the case?
> > >
> > > My doubt is in regards to changing some config, and then doing the above
> > > (just restart nodes in a loop). For example, what if I change a config G
> > > used in collection C, and I restart just one of the nodes (N1), leaving
> > the
> > > rest alone. If all the nodes contain a shard for C, what happens, N1 is
> > > using the new config and the rest are not? how is this handled?
> > >
> > > thanks
> > > xavier
> > >
> >
> 


Re: Removing duplicate terms from query

2017-02-09 Thread Emir Arnautovic

Hi Ere,

I don't think that there is such a filter. Implementing such a filter would
require looking backward, which violates the streaming approach of token
filters and implies unpredictable memory usage.


I would do it as part of query preprocessor and not necessarily as part 
of Solr.
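
In its simplest form such a preprocessor could be one regex over the raw query
string before it is sent to Solr (a sketch; it only collapses back-to-back
repeats):

  String deduped = rawQuery.replaceAll("(?i)\\b(\\w+)(\\s+\\1\\b)+", "$1");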


HTH,
Emir


On 09.02.2017 12:24, Ere Maijala wrote:

Hi,

I just noticed that while we use RemoveDuplicatesTokenFilter during 
query time, it will consider term positions and not really do anything 
e.g. if query is 'term term term'. As far as I can see the term 
positions make no difference in a simple non-phrase search. Is there a 
built-in way to deal with this? I know I can write a filter to do 
this, but I feel like this would be something quite basic to do for 
the query. And I don't think it's even anything too weird for normal 
users to do. Just consider e.g. searching for music by title:


Hey, hey, hey ; Shivers of pleasure

I also verified that at least according to debugQuery=true and 
anecdotal evidence the search really slows down if you repeat the same 
term enough.


--Ere


--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/



Re: procedure to restart solrcloud, and config/collection consistency

2017-02-09 Thread xavier jmlucjav
Hi Markus,

yes, of course I know (and use) the collections api to reload the config. I
am asking what would happen in that scenario:
- config updated (but collection not reloaded)
- i restart one node

now one node has the new config and the rest the old one??

Regarding restarting many hosts, my question is if we can just 'restart'
each solr and that is enough, or it is better to first stop all, and then
start all.

thanks


On Thu, Feb 9, 2017 at 1:28 PM, Markus Jelsma 
wrote:

> Hello - if you just want to use updated configuration, you can use Solr's
> collection reload API call. For restarting we rely on remote provisioning
> tools such as Salt, other managing tools can probably execute commands
> remotely as well.
>
> If you operate more than just a very few machines, i'd really recommend
> using these tools.
>
> Markus
>
>
>
> -Original message-
> > From:xavier jmlucjav 
> > Sent: Thursday 9th February 2017 13:24
> > To: solr-user 
> > Subject: procedure to restart solrcloud, and config/collection
> consistency
> >
> > Hi,
> >
> > When I need to restart a Solrcloud cluster, I always do this:
> > - log in into host nb1, stop solr
> > - log in into host nb2, stop solr
> > -...
> > - log in into host nbX, stop solr
> > - verify all hosts did stop
> > - in host nb1, start solr
> > - in host nb2, start solr
> > -...
> >
> > I always wondered, if this was not really needed, and I could just call
> > 'restart' in every node, in a quick loop, and forget about it. Does
> anyone
> > know if this is the case?
> >
> > My doubt is in regards to changing some config, and then doing the above
> > (just restart nodes in a loop). For example, what if I change a config G
> > used in collection C, and I restart just one of the nodes (N1), leaving
> the
> > rest alone. If all the nodes contain a shard for C, what happens, N1 is
> > using the new config and the rest are not? how is this handled?
> >
> > thanks
> > xavier
> >
>


RE: procedure to restart solrcloud, and config/collection consistency

2017-02-09 Thread Markus Jelsma
Hello - if you just want to use updated configuration, you can use Solr's 
collection reload API call. For restarting we rely on remote provisioning tools 
such as Salt, other managing tools can probably execute commands remotely as 
well.

If you operate more than just a very few machines, i'd really recommend using 
these tools.

Markus

 
 
-Original message-
> From:xavier jmlucjav 
> Sent: Thursday 9th February 2017 13:24
> To: solr-user 
> Subject: procedure to restart solrcloud, and config/collection consistency
> 
> Hi,
> 
> When I need to restart a Solrcloud cluster, I always do this:
> - log in into host nb1, stop solr
> - log in into host nb2, stop solr
> -...
> - log in into host nbX, stop solr
> - verify all hosts did stop
> - in host nb1, start solr
> - in host nb12, start solr
> -...
> 
> I always wondered, if this was not really needed, and I could just call
> 'restart' in every node, in a quick loop, and forget about it. Does anyone
> know if this is the case?
> 
> My doubt is in regards to changing some config, and then doing the above
> (just restart nodes in a loop). For example, what if I change a config G
> used in collection C, and I restart just one of the nodes (N1), leaving the
> rest alone. If all the nodes contain a shard for C, what happens, N1 is
> using the new config and the rest are not? how is this handled?
> 
> thanks
> xavier
> 


procedure to restart solrcloud, and config/collection consistency

2017-02-09 Thread xavier jmlucjav
Hi,

When I need to restart a Solrcloud cluster, I always do this:
- log in into host nb1, stop solr
- log in into host nb2, stop solr
-...
- log in into host nbX, stop solr
- verify all hosts did stop
- in host nb1, start solr
- in host nb2, start solr
-...

I always wondered, if this was not really needed, and I could just call
'restart' in every node, in a quick loop, and forget about it. Does anyone
know if this is the case?

My doubt is in regards to changing some config, and then doing the above
(just restart nodes in a loop). For example, what if I change a config G
used in collection C, and I restart just one of the nodes (N1), leaving the
rest alone. If all the nodes contain a shard for C, what happens, N1 is
using the new config and the rest are not? how is this handled?

thanks
xavier


Simulating group.facet for JSON facets, high mem usage w/ sorting on aggregation...

2017-02-09 Thread Bryant, Michael
Hi all,

I'm converting my legacy facets to JSON facets and am seeing much better 
performance, especially with high cardinality facet fields. However, the one 
issue I can't seem to resolve is excessive memory usage (and OOM errors) when 
trying to simulate the effect of "group.facet" to sort facets according to a 
grouping field.

My situation, slightly simplified is:

Solr 4.6.1

  *   Doc set: ~200,000 docs
  *   Grouping by item_id, an indexed, stored, single value string field with 
~50,000 unique values, ~4 docs per item
  *   Faceting by person_id, an indexed, stored, multi-value string field with 
~50,000 values (w/ a very skewed distribution)
  *   No docValues fields

Each document here is a description of an item, and there are several 
descriptions per item in multiple languages.

With legacy facets I use group.field=item_id and group.facet=true, which gives 
me facet counts with the number of items rather than descriptions, and 
correctly sorted by descending item count.

With JSON facets I'm doing the equivalent like so:

&json.facet={
"people": {
"type": "terms",
"field": "person_id",
"facet": {
"grouped_count": "unique(item_id)"
},
"sort": "grouped_count desc"
}
}

This works, and is somewhat faster than legacy faceting, but it also produces a 
massive spike in memory usage when (and only when) the sort parameter is set to 
the aggregate field. A server that runs happily with a 512MB heap OOMs unless I 
give it a 4GB heap. With sort set to (the default) "count desc" there is no 
memory usage spike.

I would be curious if anyone has experienced this kind of memory usage when 
sorting JSON facets by stats and if there’s anything I can do to mitigate it. 
I’ve tried reindexing with docValues enabled on the relevant fields and it 
seems to make no difference in this respect.

Many thanks,
~Mike


Removing duplicate terms from query

2017-02-09 Thread Ere Maijala

Hi,

I just noticed that while we use RemoveDuplicatesTokenFilter during 
query time, it will consider term positions and not really do anything 
e.g. if query is 'term term term'. As far as I can see the term 
positions make no difference in a simple non-phrase search. Is there a 
built-in way to deal with this? I know I can write a filter to do this, 
but I feel like this would be something quite basic to do for the query. 
And I don't think it's even anything too weird for normal users to do. 
Just consider e.g. searching for music by title:


Hey, hey, hey ; Shivers of pleasure

I also verified that at least according to debugQuery=true and anecdotal 
evidence the search really slows down if you repeat the same term enough.


--Ere


Could not find configName for collection

2017-02-09 Thread Sedat Kestepe
Hi,

I am having a problem with my Solr on Ambari + HDP Stack.

When I try to create a collection through Solr or create an index through
Hue using a csv file, I get the below error:

{ "message": 
"{\"responseHeader\":{\"status\":400,\"QTime\":16025},\"error\":{\"metadata\":[\"error-class\",\"org.apache.solr.common.SolrException\",\"root-error-class\",\"org.apache.solr.common.cloud.ZooKeeperException\"],\"msg\":\"Error
CREATEing SolrCore 'deneme': Unable to create core [deneme] Caused by:
Could not find configName for collection deneme found:[twitter_demo,
testCollMoney, collection1]\",\"code\":400}}\n (error 400)", "traceback": [
[ "/usr/local/hue/desktop/libs/libsolr/src/libsolr/api.py", 541,
"create_core", "response = self._root.post('admin/cores', params=params,
contenttype='application/json')" ], [
"/usr/local/hue/desktop/core/src/desktop/lib/rest/resource.py", 132,
"post", "allow_redirects=allow_redirects,
clear_cookies=clear_cookies)" ], [
"/usr/local/hue/desktop/core/src/desktop/lib/rest/resource.py", 81, "invoke"
, "clear_cookies=clear_cookies)" ], [
"/usr/local/hue/desktop/core/src/desktop/lib/rest/http_client.py", 173,
"execute", "raise self._exc_class(ex)" ] ], "detail": null, "title": "Error
while accessing Solr" }

If I try to install examples on Hue, I get the below error:

"responseHeader":{"status":400,"QTime":7},"error":{"metadata":["error-class"
,"org.apache.solr.common.SolrException","root-error-class",
"java.lang.ClassNotFoundException"],"msg":"Error CREATEing SolrCore
'twitter_demo': Unable to create core [twitter_demo] Caused by:
solr.ThaiWordFilterFactory","code":400}}

The only way I can create a collection is uploading zookeeper config first
then using ./solr create -c command (both manually on command line). I want
to be able to create them over web ui.

Story: First I installed Solr on Ambari while existing Ambari Infra Solr
was working. When doing this, Ambari was not starting Solr with /infra-solr
Zookeeper path.

Today I removed Solr (keeping Infra-Solr but stopping it) on another host.
Result was the same. One tip: even if this installation is a clean one,
Solr installation can still see the configs left from my very first manual
Solr installation in the tree view.

Environment:

Ambari: 2.4.2

HDP: 2.5.3

Solr: 6.4.0 from https://github.com/abajwa-hw/solr-stack (I changed the
repo URL to 6.4.0)

Hue: 3.11 on Docker (Centos 6.8)

I will be glad if you can help me to find a solution to this.


Regards,
Sedat Kestepe
LinkedIn 


Solr partial update

2017-02-09 Thread Midas A
Hi,

I want to update a Solr doc partially if the unique id exists; otherwise we
do not want to do anything.

How can I achieve this?

Regards,
Midas


inconsistent result count when doing paging

2017-02-09 Thread cmti95035
Hi,

I noticed in our production environment that the returned result count is
inconsistent when doing paging.

For example, for a certain query, for the first page (start = 0, rows = 30),
the corresponding "numFound" is 3402; and then it returned 3378, 3361 for
the 2nd and 3rd page, respectively (start = 30, 60 respectively). A sample
query looks like the following:
q:TMCN:(美丽 OR ?美丽 OR 美丽? OR 丽美)
raw query parameters:
fl=*&start=60&rows=30&shards=172.10.10.3:9080/solr/tm01,172.10.10.3:9080/solr/tm02,172.10.10.3:9080/solr/tm03,172.10.10.3:9080/solr/tm04,172.10.10.3:9080/solr/tm05,172.10.10.3:9080/solr/tm06,172.10.10.3:9080/solr/tm07,172.10.10.3:9080/solr/tm08,172.10.10.3:9080/solr/tm09,172.10.10.3:9080/solr/tm10,172.10.10.3:9080/solr/tm11,172.10.10.3:9080/solr/tm12,172.10.10.3:9080/solr/tm13,172.10.10.3:9080/solr/tm14,172.10.10.3:9080/solr/tm15,172.10.10.3:9080/solr/tm16,172.10.10.3:9080/solr/tm17,172.10.10.3:9080/solr/tm18,172.10.10.3:9080/solr/tm19,172.10.10.3:9080/solr/tm20,172.10.10.3:9080/solr/tm21,172.10.10.3:9080/solr/tm22,172.10.10.3:9080/solr/tm23,172.10.10.3:9080/solr/tm24,172.10.10.3:9080/solr/tm25,172.10.10.3:9080/solr/tm26,172.10.10.3:9080/solr/tm27,172.10.10.3:9080/solr/tm28,172.10.10.3:9080/solr/tm29,172.10.10.3:9080/solr/tm30,172.10.10.3:9080/solr/tm31,172.10.10.3:9080/solr/tm32,172.10.10.3:9080/solr/tm33,172.10.10.3:9080/solr/tm34,172.10.10.3:9080/solr/tm35,172.10.10.3:9080/solr/tm36,172.10.10.3:9080/solr/tm37,172.10.10.3:9080/solr/tm38,172.10.10.3:9080/solr/tm39,172.10.10.3:9080/solr/tm40,172.10.10.3:9080/solr/tm41,172.10.10.3:9080/solr/tm42,172.10.10.3:9080/solr/tm43,172.10.10.3:9080/solr/tm44,172.10.10.3:9080/solr/tm45&facet=true&facet.missing=false&facet.field=intCls&facet.field=appDate&facet.field=TMStatus

The query was against multiple shards at a time. With limited tries I
noticed that the returned count is consistent if the number of shards is less
than 5. 

Please help!

Thanks,

James



--
View this message in context: 
http://lucene.472066.n3.nabble.com/inconsistent-result-count-when-doing-paging-tp4319427.html
Sent from the Solr - User mailing list archive at Nabble.com.