[CVE-2020-13957] The checks added to unauthenticated configset uploads in Apache Solr can be circumvented

2020-10-12 Thread Tomas Fernandez Lobbe
Severity: High

Vendor: The Apache Software Foundation

Versions Affected:
6.6.0 to 6.6.5
7.0.0 to 7.7.3
8.0.0 to 8.6.2

Description:
Solr prevents some features considered dangerous (which could be used for
remote code execution) from being configured in a ConfigSet that's uploaded
via API without authentication/authorization. The checks in place to prevent
such features can be circumvented by using a combination of UPLOAD/CREATE
actions.

Mitigation:
Any of the following are enough to prevent this vulnerability:
* Disable the UPLOAD command in the ConfigSets API if it is not used, by
setting the system property "configset.upload.enabled" to "false" [1] (see
the example below)
* Use Authentication/Authorization and make sure unknown requests aren't
allowed [2]
* Upgrade to Solr 8.6.3 or greater.
* If upgrading is not an option, consider applying the patch in SOLR-14663
([3])
* No Solr API, including the Admin UI, is designed to be exposed to
non-trusted parties. Tune your firewall so that only trusted computers and
people are allowed access
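
For example, to set that system property at startup (a minimal sketch; how
you pass system properties depends on your installation):

  bin/solr start -c -Dconfigset.upload.enabled=false

  # or, persistently, in solr.in.sh:
  SOLR_OPTS="$SOLR_OPTS -Dconfigset.upload.enabled=false"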

Credit:
Tomás Fernández Löbbe, András Salamon

References:
[1] https://lucene.apache.org/solr/guide/8_6/configsets-api.html
[2]
https://lucene.apache.org/solr/guide/8_6/authentication-and-authorization-plugins.html
[3] https://issues.apache.org/jira/browse/SOLR-14663
[4] https://issues.apache.org/jira/browse/SOLR-14925
[5] https://wiki.apache.org/solr/SolrSecurity


[SECURITY] CVE-2019-12401: XML Bomb in Apache Solr versions prior to 5.0

2019-09-09 Thread Tomas Fernandez Lobbe
Severity: Medium

Vendor: The Apache Software Foundation

Versions Affected:
1.3.0 to 1.4.1
3.1.0 to 3.6.2
4.0.0 to 4.10.4

Description: Solr versions prior to 5.0.0 are vulnerable to an XML resource
consumption attack (a.k.a. Lol Bomb) via its update handler. By leveraging
XML DOCTYPE and ENTITY type elements, an attacker can create a pattern that
expands when the server parses the XML, causing OutOfMemoryErrors.
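
For illustration only, the general shape of such a payload (a trimmed sketch
with two entity levels, not taken from the original report; real attacks nest
many more levels):

  <?xml version="1.0"?>
  <!DOCTYPE add [
    <!ENTITY a "lollollollollollollollollollol">
    <!ENTITY b "&a;&a;&a;&a;&a;&a;&a;&a;&a;&a;">
  ]>
  <add><doc><field name="id">&b;&b;&b;&b;&b;&b;&b;&b;&b;&b;</field></doc></add>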

Mitigation:
* Upgrade to Apache Solr 5.0 or later.
* Ensure your network settings are configured so that only trusted traffic
is allowed to post documents to the running Solr instances.

Credit: Matei "Mal" Badanoiu

References:
[1] https://issues.apache.org/jira/browse/SOLR-13750
[2] https://wiki.apache.org/solr/SolrSecurity


CVE-2019-0192 Deserialization of untrusted data via jmx.serviceUrl in Apache Solr

2019-03-06 Thread Tomas Fernandez Lobbe
Severity: High

Vendor: The Apache Software Foundation

Versions Affected:
5.0.0 to 5.5.5
6.0.0 to 6.6.5

Description:
The Config API allows configuring Solr's JMX server via an HTTP POST request.
By pointing it to a malicious RMI server, an attacker could take advantage
of Solr's unsafe deserialization to trigger remote code execution on the
Solr side.

Mitigation:
Any of the following are enough to prevent this vulnerability:
* Upgrade to Apache Solr 7.0 or later.
* Disable the Config API if not in use, by running Solr with the system
property “disable.configEdit=true” (see the example below)
* If neither upgrading nor disabling the Config API is a viable option, apply
the patch in [1] and re-compile Solr.
* Ensure your network settings are configured so that only trusted traffic
is allowed to ingress/egress your hosts running Solr.
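
For example, to set that system property at startup (a minimal sketch; how
you pass system properties depends on your installation):

  bin/solr start -Ddisable.configEdit=true

  # or in solr.in.sh:
  SOLR_OPTS="$SOLR_OPTS -Ddisable.configEdit=true"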

Credit:
Michael Stepankin

References:
[1] https://issues.apache.org/jira/browse/SOLR-13301
[2] https://wiki.apache.org/solr/SolrSecurity


[SECURITY] CVE-2017-3164 SSRF issue in Apache Solr

2019-02-12 Thread Tomas Fernandez Lobbe
CVE-2017-3164 SSRF issue in Apache Solr

Severity: High

Vendor: The Apache Software Foundation

Versions Affected:
Apache Solr versions from 1.3 to 7.6.0

Description:
The "shards" parameter does not have a corresponding whitelist mechanism,
so it can request any URL.
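
For illustration (hypothetical hosts; any URL can be supplied, not just other
Solr nodes):

  curl 'http://solr.example.com:8983/solr/col1/select?q=*:*&shards=internal-service.example.com:8080/private'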

Mitigation:
Upgrade to Apache Solr 7.7.0 or later.
Ensure your network settings are configured so that only trusted traffic is
allowed to ingress/egress your hosts running Solr.

Credit:
dk from Chaitin Tech

References:
https://issues.apache.org/jira/browse/SOLR-12770
https://wiki.apache.org/solr/SolrSecurity


Re: Exception writing document xxxxxx to the index; possible analysis error.

2018-07-11 Thread Tomas Fernandez Lobbe
Hi Daphne, 
the “possible analysis error” part is a misleading error message (to be addressed in 
SOLR-12477). The important piece is the 
“java.lang.ArrayIndexOutOfBoundsException”; it looks like your index may be 
corrupted in some way.
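
If you want to verify, Lucene's CheckIndex tool can diagnose index corruption.
A sketch, with illustrative jar and index paths for a 6.3.0 install; run it
against a copy of the index, not the live one:

  java -cp server/solr-webapp/webapp/WEB-INF/lib/lucene-core-6.3.0.jar \
    org.apache.lucene.index.CheckIndex /var/solr/data/mycore/data/index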

Tomás

> On Jul 11, 2018, at 3:01 PM, Liu, Daphne  wrote:
> 
> Hello Solr Expert,
>   We are using Solr 6.3.0 and lately we are unable to write documents into 
> our index. Please see below error messages. Can anyone help us?
>   Thank you.
> 
> 
> ===
> org.apache.solr.common.SolrException: Exception writing document id 
> 3b8514819e204cc7a110aa5752e29b8e to the index; possible analysis error.
>at 
> org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:178)
>at 
> org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:67)
>at 
> org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:48)
>at 
> org.apache.solr.update.processor.AddSchemaFieldsUpdateProcessorFactory$AddSchemaFieldsUpdateProcessor.processAdd(AddSchemaFieldsUpdateProcessorFactory.java:335)
>at 
> org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:48)
>at 
> org.apache.solr.update.processor.FieldMutatingUpdateProcessor.processAdd(FieldMutatingUpdateProcessor.java:118)
>at 
> org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:48)
>at 
> org.apache.solr.update.processor.FieldMutatingUpdateProcessor.processAdd(FieldMutatingUpdateProcessor.java:118)
>at 
> org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:48)
>at 
> org.apache.solr.update.processor.FieldMutatingUpdateProcessor.processAdd(FieldMutatingUpdateProcessor.java:118)
>at 
> org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:48)
>at 
> org.apache.solr.update.processor.FieldMutatingUpdateProcessor.processAdd(FieldMutatingUpdateProcessor.java:118)
>at 
> org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:48)
>at 
> org.apache.solr.update.processor.FieldNameMutatingUpdateProcessorFactory$1.processAdd(FieldNameMutatingUpdateProcessorFactory.java:74)
>at 
> org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:48)
>at 
> org.apache.solr.update.processor.FieldMutatingUpdateProcessor.processAdd(FieldMutatingUpdateProcessor.java:118)
>at 
> org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:48)
>at 
> org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProcessor.java:957)
>at 
> org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:1112)
>at 
> org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:738)
>at 
> org.apache.solr.update.processor.LogUpdateProcessorFactory$LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:103)
>at 
> org.apache.solr.handler.loader.JavabinLoader$1.update(JavabinLoader.java:97)
>at 
> org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$1.readOuterMostDocIterator(JavaBinUpdateRequestCodec.java:179)
>at 
> org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$1.readIterator(JavaBinUpdateRequestCodec.java:135)
>at 
> org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:275)
>at 
> org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$1.readNamedList(JavaBinUpdateRequestCodec.java:121)
>at 
> org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:240)
>at 
> org.apache.solr.common.util.JavaBinCodec.unmarshal(JavaBinCodec.java:158)
>at 
> org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec.unmarshal(JavaBinUpdateRequestCodec.java:186)
>at 
> org.apache.solr.handler.loader.JavabinLoader.parseAndLoadDocs(JavabinLoader.java:107)
>at 
> org.apache.solr.handler.loader.JavabinLoader.load(JavabinLoader.java:54)
>at 
> org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:97)
>at 
> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:68)
>at 
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:153)
>at org.apache.solr.core.SolrCore.execute(SolrCore.java:2213)
>at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:654)
>at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:460)
>at 
> 

Re: User queries end up in filterCache if facetting is enabled

2018-05-09 Thread Tomas Fernandez Lobbe
I'd never noticed this before, but I believe it happens because, once you say 
`facet=true`, Solr needs the full docset (the set of all matching docs, not 
just the top matches) and obtains it via the filter cache (see the cache 
definition sketched below).
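
For reference, the showItems attribute Markus mentions goes on the filterCache
definition in solrconfig.xml; a sketch (class and sizes are illustrative, not
recommendations):

  <filterCache class="solr.FastLRUCache"
               size="512"
               initialSize="512"
               autowarmCount="0"
               showItems="1024"/>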

> On May 3, 2018, at 7:10 AM, Markus Jelsma  wrote:
> 
> By the way, the queries end up in the filterCache regardless of the value set 
> in useFilterForSortedQuery.
> 
> Thanks,
> Markus
> 
> -Original message-
>> From:Markus Jelsma 
>> Sent: Thursday 3rd May 2018 12:05
>> To: solr-user@lucene.apache.org; solr-user 
>> Subject: RE: User queries end up in filterCache if facetting is enabled
>> 
>> Thanks Mikhail,
>> 
>> But i thought about that setting too, but i do sort by score, as does Solr 
>> /select handler by default. The enum method accounts for all the values for 
>> a facet field, but not the user queries i see ending up in the cache.
>> 
>> Any other suggestions to shed light on this oddity?
>> 
>> Thanks!
>> Markus
>> 
>> 
>> 
>> -Original message-
>>> From:Mikhail Khludnev 
>>> Sent: Thursday 3rd May 2018 9:43
>>> To: solr-user 
>>> Subject: Re: User queries end up in filterCache if facetting is enabled
>>> 
>>> I mean
>>> https://lucene.apache.org/solr/guide/6_6/query-settings-in-solrconfig.html#QuerySettingsinSolrConfig-useFilterForSortedQuery
>>> 
>>> 
>>> On Thu, May 3, 2018 at 10:42 AM, Mikhail Khludnev  wrote:
>>> 
 Enum facets, facet refinements and https://lucene.apache.org/
 solr/guide/6_6/query-settings-in-solrconfig.html comes to my mind.
 
 On Wed, May 2, 2018 at 11:58 PM, Markus Jelsma  wrote:
 
> Hello,
> 
> Anyone here to reproduce this oddity? It shows up in all our collections
> once we enable the stats page to show filterCache entries.
> 
> Is this normal? Am i completely missing something?
> 
> Thanks,
> Markus
> 
> 
> 
> -Original message-
>> From:Markus Jelsma 
>> Sent: Tuesday 1st May 2018 17:32
>> To: Solr-user 
>> Subject: User queries end up in filterCache if facetting is enabled
>> 
>> Hello,
>> 
>> We noticed the number of entries of the filterCache to be higher than
> we expected, using showItems="1024" something unexpected was listed as
> entries of the filterCache, the complete Query.toString() of our user
> queries, massive entries, a lot of them.
>> 
>> We also spotted all entries of fields we facet on, even though we don't
> use them as filtes, but that is caused by facet.field=enum, and should be
> expected, right?
>> 
>> Now, the user query entries are not expected. In the simplest set up,
> searching for something and only enabling the facet engine with facet=true
> causes it to appears in the cache as an entry. The following queries:
>> 
>> http://localhost:8983/solr/search/select?q=content_nl:nog=true
>> http://localhost:8983/solr/search/select?q=*:*=true
>> 
>> become listed as:
>> 
>> CACHE.searcher.filterCache.item_*:*:
>>org.apache.solr.search.BitDocSet@70051ee0
>> 
>> CACHE.searcher.filterCache.item_content_nl:nog:
>>org.apache.solr.search.BitDocSet@13150cf6
>> 
>> This is on 7.3, but 7.2.1 does this as well.
>> 
>> So, should i expect this? Can i disable this? Bug?
>> 
>> 
>> Thanks,
>> Markus
>> 
>> 
>> 
>> 
> 
 
 
 
 --
 Sincerely yours
 Mikhail Khludnev
 
>>> 
>>> 
>>> 
>>> -- 
>>> Sincerely yours
>>> Mikhail Khludnev
>>> 
>> 



Re: Solr 7.2.1 DELETEREPLICA automatically NRT replica appears

2018-03-07 Thread Tomas Fernandez Lobbe
This shouldn’t be happening. Did you see anything related in the logs? Does the 
new NRT replica ever become active? Is there a new core created, or do you just 
see the replica in the clusterstate?
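
For reference, the call being discussed has this shape (collection, shard and
replica names are illustrative):

  curl 'http://localhost:8983/solr/admin/collections?action=DELETEREPLICA&collection=mycoll&shard=shard1&replica=core_node5'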

Tomas 

Sent from my iPhone

> On Mar 7, 2018, at 8:18 PM, Greg Roodt <gro...@gmail.com> wrote:
> 
> Hi
> 
> I am running a cluster of TLOG and PULL replicas. When I call the
> DELETEREPLICA api to remove a replica, the replica is removed, however, a
> new NRT replica pops up in a down state in the cluster.
> 
> Any ideas why?
> 
> Greg


Re: solr cloud unique key query request is sent to all shards!

2018-02-18 Thread Tomas Fernandez Lobbe
In real-time get, the parameter name is “id”, regardless of the name of the 
unique key. 

The request should be in your case: 
http://:8080/api/collections/col1/get?id=69749398

See: https://lucene.apache.org/solr/guide/7_2/realtime-get.html
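
In curl form, with the host elided as elsewhere in the thread:

  curl 'http://<host>:8080/api/collections/col1/get?id=69749398'   # V2 API
  curl 'http://<host>:8080/solr/col1/get?id=69749398'              # V1 API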

Sent from my iPhone

> On Feb 18, 2018, at 9:28 PM, Ganesh Sethuraman <ganeshmail...@gmail.com> 
> wrote:
> 
> I tried this real time get on my collection using the both V1 and V2 URL
> for real time get, but did not work!!!
> 
> http://:8080/api/collections/col1/get?myid:69749398
> 
> it returned...
> 
> {
>  "doc":null}
> 
> same issue with V1 URL as well, http://
> :8080/solr/col1/get?myid:69749398
> 
> however if i do q=myid:69749398 with the "select" request handler it seems to
> work fine. I checked my schema again and it is configured correctly, like below:
> 
> <uniqueKey>myid</uniqueKey>
> 
> Also i see that this implicit request handler is configured correctly Any
> thoughts, what I might be missing?
> 
> 
> 
> On Sun, Feb 18, 2018 at 11:18 PM, Tomas Fernandez Lobbe <tflo...@apple.com>
> wrote:
> 
>> I think real-time get should be directed to the correct shard. Try:
>> [COLLECTION]/get?id=[YOUR_ID]
>> 
>> Sent from my iPhone
>> 
>>> On Feb 18, 2018, at 3:17 PM, Ganesh Sethuraman <ganeshmail...@gmail.com>
>> wrote:
>>> 
>>> Hi
>>> 
>>> I am using Solr 7.2.1. I have 8 shards in two nodes (two different m/c)
>>> using Solr Cloud. The data was indexed with a unique key (default
>> composite
>>> id) using the CSV update handler (batch indexing). Note that I do NOT
>> have
>>>  while indexing.   Then when I try to  query the
>>> collection col1 based on my primary key (as below), I see that in the
>>> 'debug' response that the query was sent to all the shards and when it
>>> finds the document in one the shards it sends a GET FIELD to that shard
>> to
>>> get the data.  The problem is potentially high response time, and more
>>> importantly scalability issue as unnecessarily all shards are being
>> queried
>>> to get one document (by unique key).
>>> 
>>> http://:8080/solr/col1/select?debug=true=id:69749278
>>> 
>>> Is there a way to query to reach the right shard based on the hash of the
>>> unique key?
>>> 
>>> Regards
>>> Ganesh
>> 


Re: solr cloud unique key query request is sent to all shards!

2018-02-18 Thread Tomas Fernandez Lobbe
I think real-time get should be directed to the correct shard. Try:  
[COLLECTION]/get?id=[YOUR_ID]

Sent from my iPhone

> On Feb 18, 2018, at 3:17 PM, Ganesh Sethuraman  
> wrote:
> 
> Hi
> 
> I am using Solr 7.2.1. I have 8 shards in two nodes (two different m/c)
> using Solr Cloud. The data was indexed with a unique key (default composite
> id) using the CSV update handler (batch indexing). Note that I do NOT have
>  while indexing.   Then when I try to  query the
> collection col1 based on my primary key (as below), I see that in the
> 'debug' response that the query was sent to all the shards and when it
> finds the document in one the shards it sends a GET FIELD to that shard to
> get the data.  The problem is potentially high response time, and more
> importantly scalability issue as unnecessarily all shards are being queried
> to get one document (by unique key).
> 
> http://:8080/solr/col1/select?debug=true=id:69749278
> 
> Is there a way to query to reach the right shard based on the hash of the
> unique key?
> 
> Regards
> Ganesh


Re: Request routing / load-balancing TLOG & PULL replica types

2018-02-12 Thread Tomas Fernandez Lobbe


> On Feb 12, 2018, at 12:06 PM, Greg Roodt <gro...@gmail.com> wrote:
> 
> Thanks Ere. I've taken a look at the discussion here:
> http://lucene.472066.n3.nabble.com/Limit-search-queries-only-to-pull-replicas-td4367323.html
> This is how I was imagining TLOG & PULL replicas would work, so if this
> functionality does get developed, it would be useful to me.
> 
> I still have 2 questions at the moment:
> 1. I am running the single shard scenario. I'm thinking of using a
> dedicated HTTP load-balancer in front of the PULL replicas only with
> read-only queries directed directly at the load-balancer. In this
> situation, the healthy PULL replicas *should* handle the queries on the
> node itself without a proxy hop (assuming state=active). New PULL replicas
> added to the load-balancer will internally proxy queries to the other PULL
> or TLOG replicas while in state=recovering until the switch to
> state=active. Is my understanding correct?

Yes

> 
> 2. Is it all worth it? Is there any advantage to running a cluster of 3
> TLOGs + 10 PULL replicas vs running 13 TLOG replicas?
> 

I don’t have a definitive answer; this will depend on your specific use case. 
As Erick said, there is very little work that non-leader TLOG replicas do for 
each update, and having all TLOG replicas means that even with a single active 
replica you could, in theory, still handle updates. It’s sometimes nice to separate 
query traffic from update traffic, but this can still be done if you have all 
TLOG replicas and you just make sure you don’t query the leader…
One nice characteristic that PULL replicas have is that they can’t go into 
Leader Initiated Recovery (LIR) state, even if there is some sort of network 
partition, they’ll remain in active state even if they can’t talk with the 
leader as long as they can reach ZooKeeper (note that this means they may be 
responding with outdated data for an undetermined amount of time, until 
replicas can replicate from the leader again). Also, since updates are not sent 
to all the replicas (only the TLOG replicas), updates should be faster with 3 
TLOG vs 13 TLOG replicas.


Tomás

> 
> 
> 
> On 12 February 2018 at 19:25, Ere Maijala <ere.maij...@helsinki.fi> wrote:
> 
>> Your question about directing queries to PULL replicas only has been
>> discussed on the list. Look for topic "Limit search queries only to pull
>> replicas". What I'd like to see is something similar to the
>> preferLocalShards parameter. It could be something like
>> "preferReplicaTypes=TLOG,PULL". Tomás mentioned previously that
>> SOLR-10880 could be used as a base for such funtionality, and I'm
>> considering taking a stab at implementing it.
>> 
>> --Ere
>> 
>> 
>> Greg Roodt kirjoitti 12.2.2018 klo 6.55:
>> 
>>> Thank you both for your very detailed answers.
>>> 
>>> This is great to know. I knew that SolrJ had the cluster aware knowledge
>>> (via zookeeper), but I was wondering what something like curl would do.
>>> Great to know that internally the cluster will proxy queries to the
>>> appropriate place regardless.
>>> 
>>> I am running the single shard scenario. I'm thinking of using a dedicated
>>> HTTP load-balancer in front of the PULL replicas only with read-only
>>> queries directed directly at the load-balancer. In this situation, the
>>> healthy PULL replicas *should* handle the queries on the node itself
>>> without a proxy hop (assuming state=active). New PULL replicas added to
>>> the
>>> load-balancer will internally proxy queries to the other PULL or TLOG
>>> replicas while in state=recovering until the switch to state=active.
>>> 
>>> Is my understanding correct?
>>> 
>>> Is this sensible to do, or is it not worth it due to the smart proxying
>>> that SolrCloud can do anyway?
>>> 
>>> If the TLOG and PULL replicas are so similar, is there any real advantage
>>> to having a mixed cluster? I assume a bit less work is required across the
>>> cluster to propagate writes if you only have 3 TLOG nodes vs 10+ PULL
>>> nodes? Or would it be better to just have 13 TLOG nodes?
>>> 
>>> 
>>> 
>>> 
>>> 
>>> On 12 February 2018 at 15:24, Tomas Fernandez Lobbe <tflo...@apple.com>
>>> wrote:
>>> 
>>> On the last question:
>>>> For Writes: Yes. Writes are going to be sent to the shard leader, and
>>>> since PULL replicas can’t  be leaders, it’s going to be a TLOG replica.
>>>> If
>>>> you are using CloudSolrClient, then this routing will be done directly
>>>> from
>>>

Re: Request routing / load-balancing TLOG & PULL replica types

2018-02-11 Thread Tomas Fernandez Lobbe
On the last question:
For Writes: Yes. Writes are going to be sent to the shard leader, and since 
PULL replicas can’t  be leaders, it’s going to be a TLOG replica. If you are 
using CloudSolrClient, then this routing will be done directly from the client 
(since it will send the update to the leader), and if you are using some other 
HTTP client, then yes, the PULL replica will forward the update, the same way 
any non-leader node would.

For reads: this won’t happen today, and any replica can respond to queries. I 
do believe there is value in this kind of routing logic; sometimes you simply 
don’t want the leader to handle any queries, especially when queries can be 
expensive. You could do this today if you want, by putting some load balancer 
in front and just directing your queries to the nodes you know are PULL, but keep 
in mind that this would only work in the single shard scenario, and only if you 
hit an active replica (otherwise, as you said, the query will be routed to any 
other node of the shard, regardless of the type). If you have multiple shards, 
then you need to use the “shards” parameter and tell Solr exactly which nodes 
you want to hit for each shard (the “shards” approach can also be done in the 
single shard case, although you would be adding an extra hop, I believe); see 
the example below.
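
A sketch of that multi-shard form (hosts and core names are illustrative; the
_p suffix is the usual naming for PULL replica cores):

  curl 'http://host1:8983/solr/coll/select?q=*:*&shards=host1:8983/solr/coll_shard1_replica_p1,host3:8983/solr/coll_shard2_replica_p5'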

Tomás 
Sent from my iPhone

> On Feb 11, 2018, at 6:35 PM, Greg Roodt  wrote:
> 
> Hi
> 
> I have a question around how queries are routed and load-balanced in a
> cluster of mixed TLOG and PULL replicas.
> 
> I thought that I might have to put a load-balancer in front of the PULL
> replicas and direct queries at them manually as nodes are added and removed
> as PULL replicas. However, it seems that SolrCloud handles this
> automatically?
> 
> If I add a new PULL replica node, it goes into state="recovering" while it
> pulls the core. As expected. What happens if queries are directed at this
> node while in this state? From what I am observing, the query gets directed
> to another node?
> 
> If SolrCloud is handling the routing of requests to active nodes, will it
> automatically favour PULL replicas for read queries and TLOG replicas for
> writes?
> 
> Thanks
> Greg


Re: 7.2.1 cluster dies within minutes after restart

2018-02-02 Thread Tomas Fernandez Lobbe
Hi Markus, 
If the same code that runs OK in 7.1 breaks in 7.2.1, it is clear to me that there 
is some bug in Solr introduced between those releases (maybe an increase in 
memory utilization? or maybe some decrease in query throughput making threads 
pile up?). I’d hate to have this issue lost in the users list; could you 
create a Jira? Maybe next time you have this issue you can post thread/heap 
dumps, that would be useful.

Tomás

> On Feb 2, 2018, at 9:38 AM, Walter Underwood  wrote:
> 
> Zookeeper 3.4.6 is not good? That was the version recommended by Solr docs 
> when I installed 6.2.0.
> 
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
> 
>> On Feb 2, 2018, at 9:30 AM, Markus Jelsma  wrote:
>> 
>> Hello S.G.
>> 
>> We have relied on Trie* fields ever since they became available, i don't 
>> think reverting to the old fieldType's will do us any good, we have a very 
>> recent problem.
>> 
>> Regarding our heap, the cluster ran fine for years with just 1.5 GB, we only 
>> recently increased it because or data keeps on growing. Heap rarely goes 
>> higher than 50 %, except when this specific problem occurs. The nodes have 
>> no problem processing a few hundred QPS continuously and can go on for days, 
>> sometimes even a few weeks.
>> 
>> I will keep my eye open for other clues when the problem strikes again!
>> 
>> Thanks,
>> Markus
>> 
>> -Original message-
>>> From:S G 
>>> Sent: Friday 2nd February 2018 18:20
>>> To: solr-user@lucene.apache.org
>>> Subject: Re: 7.2.1 cluster dies within minutes after restart
>>> 
>>> Yeah, definitely check the zookeeper version.
>>> 3.4.6 is not a good one I know and you can say the same for all the
>>> versions below it too.
>>> We have used 3.4.9 with no issues.
>>> While Solr 7.x uses 3.4.10
>>> 
>>> Another dimension could be the use or (dis-use) of p-fields like pint,
>>> plong etc.
>>> If you are using them, try to revert back to tint, tlong etc
>>> And if you are not using them, try to use them (Although doing this means a
>>> change from your older config and less likely to help).
>>> 
>>> Lastly, did I read 2 GB for JVM heap?
>>> That seems really too less to me for any version of Solr
>>> We run with 10-16 gb of heap with G1GC collector and new-gen capped at 3-4gb
>>> 
>>> 
>>> On Fri, Feb 2, 2018 at 4:27 AM, Markus Jelsma 
>>> wrote:
>>> 
 Hello Ere,
 
 It appears that my initial e-mail [1] got lost in the thread. We don't
 have GC issues, the cluster that dies occasionally runs, in general, smooth
 and quick with just 2 GB allocated.
 
 Thanks,
 Markus
 
 [1]: http://lucene.472066.n3.nabble.com/7-2-1-cluster-dies-
 within-minutes-after-restart-td4372615.html
 
 -Original message-
> From:Ere Maijala 
> Sent: Friday 2nd February 2018 8:49
> To: solr-user@lucene.apache.org
> Subject: Re: 7.2.1 cluster dies within minutes after restart
> 
> Markus,
> 
> I may be stating the obvious here, but I didn't notice garbage
> collection mentioned in any of the previous messages, so here goes. In
> our experience almost all of the Zookeeper timeouts etc. have been
> caused by too long garbage collection pauses. I've summed up my
> observations here:
>  
> 
> So, in my experience it's relatively easy to cause heavy memory usage
> with SolrCloud with seemingly innocent queries, and GC can become a
> problem really quickly even if everything seems to be running smoothly
> otherwise.
> 
> Regards,
> Ere
> 
> Markus Jelsma kirjoitti 31.1.2018 klo 23.56:
>> Hello S.G.
>> 
>> We do not complain about speed improvements at all, it is clear 7.x is
 faster than its predecessor. The problem is stability and not recovering
 from weird circumstances. In general, it is our high load cluster
 containing user interaction logs that suffers the most. Our main text
 search cluster - receiving much fewer queries - seems mostly unaffected,
 except last Sunday. After very short but high burst of queries it entered
 the same catatonic state the logs cluster usually dies from.
>> 
>> The query burst immediately caused ZK timeouts and high heap
 consumption (not sure which came first of the latter two). The query burst
 lasted for 30 minutes, the excessive heap consumption continued for more
 than 8 hours, before Solr finally realized it could relax. Most remarkable
 was that Solr recovered on its own, ZK timeouts stopped, heap went back to
 normal.
>> 
>> There seems to be a causality between high load and this state.
>> 
>> We really want to get this fixed for ourselves and everyone else that

Re: Master Slave Replication Issue

2018-02-01 Thread Tomas Fernandez Lobbe
This seems pretty serious. Please create a Jira issue
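
For reference, the command from the report plus its counterpart (host and core
names as in the thread; enablereplication re-enables replication):

  curl 'http://<master-host>/solr/solr_core/replication?command=disablereplication'
  curl 'http://<master-host>/solr/solr_core/replication?command=enablereplication'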

Sent from my iPhone

> On Feb 1, 2018, at 12:15 AM, dennis nalog  
> wrote:
> 
> Hi,
> We are using Solr 7.1 and our Solr setup is master-slave replication.
> We encounter this issue that when we disable the replication in master via UI 
> or URL (server/solr/solr_core/replication?command=disablereplication), the 
> data in our slave servers suddenly becomes 0.
> Just wanna know if this is a known issue or this is the expected behavior. 
> Thanks in advance.
> Best regards, Dennis


Re: Mixing simple and nested docs in same update?

2018-01-30 Thread Tomas Fernandez Lobbe
I believe the problem is that:
* BlockJoin queries do not know about your “types”; in the BlockJoin query 
world, everything that’s not a parent (i.e., doesn’t match the parentFilter) is a child.
* All docs indexed before a parent are considered children of that doc.
That’s why in your first case it considers “friend” (not a parent, hence a 
child) to be a child of the first parent it can find in the segment (mother). 
In the second case, the “friend” doc would have no parent: no parent document 
matches the filter after it, so it’s not considered a match. 
Maybe if you try your query with parentFilter=-type:child, this particular 
example works (I haven’t tried it)?
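
That suggestion spelled out against the gist's collection (untested, as noted
above; the space in the fl parameter is URL-encoded):

  curl 'localhost:8983/solr/nested/query?q=id:mother&fl=*,[child%20parentFilter=-type:child]'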

Note that when you send docs with children to Solr, Solr will make sure the 
children are indexed before the parent. Also note that there are some other open 
bugs related to child docs, and in particular, with mixing child docs with 
non-child docs; depending on which features you need, this may be a problem.

Tomás

> On Jan 30, 2018, at 5:48 AM, Jan Høydahl  wrote:
> 
> Pasting the GIST link :-) 
> https://gist.github.com/45640fe3bad696d53ef8a0930a35d163 
> 
> Anyone knows if this is expected behavior?
> 
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
> 
>> 15. jan. 2018 kl. 14:08 skrev Jan Høydahl :
>> 
>> Radio silence…
>> 
>> Here is a GIST for easy reproduction. Is this by design?
>> 
>> --
>> Jan Høydahl, search solution architect
>> Cominvent AS - www.cominvent.com
>> 
>>> 11. jan. 2018 kl. 00:42 skrev Jan Høydahl :
>>> 
>>> Hi,
>>> 
>>> We index several large nested documents. We found that querying the data 
>>> behaves differently depending on how the documents are indexed.
>>> 
>>> To reproduce:
>>> 
>>> solr start
>>> solr create -c nested
>>> # Index one plain document, “friend" and a nested one, “mother” and 
>>> “daughter”, in same request:
>>> curl localhost:8983/solr/nested/update -d '
>>> <add>
>>>   <doc>
>>>     <field name="id">friend</field>
>>>     <field name="type">other</field>
>>>   </doc>
>>>   <doc>
>>>     <field name="id">mother</field>
>>>     <field name="type">parent</field>
>>>     <doc>
>>>       <field name="id">daughter</field>
>>>       <field name="type">child</field>
>>>     </doc>
>>>   </doc>
>>> </add>
>>> '
>>> 
>>> # Query for mother’s children using either child transformer or child query 
>>> parser
>>> curl 
>>> "localhost:8983/solr/a/query?q=id:mother=%2A%2C%5Bchild%20parentFilter%3Dtype%3Aparent%5D”
>>> {
>>> "responseHeader":{
>>>  "zkConnected":true,
>>>  "status":0,
>>>  "QTime":4,
>>>  "params":{
>>>"q":"id:mother",
>>>"fl":"*,[child parentFilter=type:parent]"}},
>>> "response":{"numFound":1,"start":0,"docs":[
>>>{
>>>  "id":"mother",
>>>  "type":["parent"],
>>>  "_version_":1589249812802306048,
>>>  "type_str":["parent"],
>>>  "_childDocuments_":[
>>>  {
>>>"id":"friend",
>>>"type":["other"],
>>>"_version_":1589249812729954304,
>>>"type_str":["other"]},
>>>  {
>>>"id":"daughter",
>>>"type":["child"],
>>>"_version_":1589249812802306048,
>>>"type_str":["child"]}]}]
>>> }}
>>> 
>>> As you can see, the “friend” got included as a child of “mother”.
>>> If you index the exact same request, putting “friend” after “mother” in the 
>>> xml,
>>> the query works as expected.
>>> 
>>> Inspecting the index, everything looks correct, and only “daughter” and 
>>> “mother” have _root_=mother.
>>> Is there a rule that you should start a new update request for each type of 
>>> parent/child relationship
>>> that you need to index, and not mix them in the same request?
>>> 
>>> --
>>> Jan Høydahl, search solution architect
>>> Cominvent AS - www.cominvent.com
>>> 
>> 
> 



Re: Limit search queries only to pull replicas

2018-01-08 Thread Tomas Fernandez Lobbe
This feature is not currently supported. I was thinking of implementing it by 
extending the work done in SOLR-10880, but I haven’t had time to work on it 
yet. There is a patch for SOLR-10880 that doesn’t implement support for 
replica types, but it could be used as a base. 

Tomás

> On Jan 8, 2018, at 12:04 AM, Ere Maijala  wrote:
> 
> Server load alone doesn't always indicate the server's ability to serve 
> queries. Memory and cache state are important too, and they're not as easy to 
> monitor. Additionally, server load at any single point in time or a short 
> term average is not indicative of the server's ability to handle search 
> requests if indexing happens in short but intense bursts.
> 
> It can also complicate things if there are more than one Solr instance 
> running on a single server.
> 
> I'm definitely not against intelligent routing. In many cases it makes 
> perfect sense, and I'd still like to use it, just limited to the pull 
> replicas.
> 
> --Ere
> 
> Erick Erickson kirjoitti 5.1.2018 klo 19.03:
>> Actually, I think a much better option is to route queries based on server load.
>> The theory of preferring pull replicas to leaders would be that the leader
>> will be doing the indexing work and the pull replicas would be doing less
>> work therefore serving queries faster. But that's a fragile assumption.
>> Let's say indexing stops totally. Now your leader is sitting there idle
>> when it could be serving queries.
>> The autoscaling work will allow for more intelligent routing, you can
>> monitor the CPU load on your servers and if the leader has some spare
>> cycles use them .vs. crudely routing all queries to pull replicas (or tlog
>> replicas for that matter). NOTE: I don't know whether this is being
>> actively worked on or not, but seems a logical extension of the increased
>> monitoring capabilities being put in place for autoscaling, but I'd rather
>> see effort put in there than support routing based solely on a node's type.
>> Best,
>> Erick
>> On Fri, Jan 5, 2018 at 7:51 AM, Emir Arnautović <
>> emir.arnauto...@sematext.com> wrote:
>>> It is interesting that ES had similar feature to prefer primary/replica
>>> but it deprecating that and will remove it - could not find explanation why.
>>> 
>>> Emir
>>> --
>>> Monitoring - Log Management - Alerting - Anomaly Detection
>>> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>>> 
>>> 
>>> 
 On 5 Jan 2018, at 15:22, Ere Maijala  wrote:
 
 Hi,
 
 It would be really nice to have a server-side option, though. Not
>>> everyone uses Solrj, and a typical fairly dummy client just queries the
>>> server without any understanding about shards etc. Solr could be clever
>>> enough to not forward the query to NRT shards when configured to prefer
>>> PULL shards and they're available. Maybe it could be something similar to
>>> the preferLocalShards parameter, like "preferShardTypes=TLOG,PULL".
 
 --Ere
 
 Emir Arnautović kirjoitti 14.12.2017 klo 11.41:
> Hi Stanislav,
> I don’t think that there is a built in feature to do this, but that
>>> sounds like nice feature of Solrj - maybe you should check if available.
>>> You can implement it outside of Solrj - check cluster state to see which
>>> shards are available and send queries only to pull replicas.
> HTH,
> Emir
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>> On 14 Dec 2017, at 09:58, Stanislav Sandalnikov <
>>> s.sandalni...@gmail.com> wrote:
>> 
>> Hi,
>> 
>> We have a Solr 7.1 setup with SolrCloud where we have multiple shards
>>> on one server (for indexing) each shard has a pull replica on other servers.
>> 
>> What are the possible ways to limit search request only to pull type
>>> replicase?
>> At the moment the only solution I found is to append shards parameter
>>> to each query, but if new shards added later it requires to change
>>> solrconfig. Is it the only way to do this?
>> 
>> Thank you
>> 
>> Regards
>> Stanislav
>> 
 
 --
 Ere Maijala
 Kansalliskirjasto / The National Library of Finland
>>> 
>>> 
> 
> -- 
> Ere Maijala
> Kansalliskirjasto / The National Library of Finland



Re: Solr cloud optimizer

2017-09-07 Thread Tomas Fernandez Lobbe
By default Solr uses the “TieredMergePolicy”[1], but it can be configured in 
solrconfig, see [2].  Merges can be triggered for different reasons, but most 
commonly by segment flushes (commits) or other merges finishing.

Here is a nice visual demo of segment merging (a bit old but still mostly 
applies AFAIK): [3]

[1] 
https://lucene.apache.org/core/6_6_0/core/org/apache/lucene/index/TieredMergePolicy.html
[2] https://lucene.apache.org/solr/guide/6_6/indexconfig-in-solrconfig.html
[3] 
http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html
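
If you do want to tune merging, the relevant solrconfig.xml section looks
roughly like this (a sketch; the values shown are illustrative defaults, not
recommendations):

  <indexConfig>
    <mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
      <int name="maxMergeAtOnce">10</int>
      <int name="segmentsPerTier">10</int>
    </mergePolicyFactory>
  </indexConfig>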

Tomas

> On Sep 7, 2017, at 10:00 AM, calamita.agost...@libero.it wrote:
> 
> 
> Hi all,
> I use SolrCloud with some collections with 3 shards each. 
> Every day I insert and remove documents from collections. I know that Solr 
> starts an optimizer in the background to optimize indexes. 
> Which policy does Solr apply in order to start the optimizer 
> automatically? Number of deleted documents? Number of segments? 
> Thanks.



Re: Request to be added to the ContributorsGroup

2017-08-23 Thread Tomas Fernandez Lobbe
I just added you to the wiki. 
Note that the official documentation is now in the "solr-ref-guide" directory 
of the code base, and you can create patches/PRs to it.

Tomás

> On Aug 23, 2017, at 10:58 AM, Kevin Grimes  wrote:
> 
> Hi there,
> 
> I would like to contribute to the Solr wiki. My username is KevinGrimes, and 
> my e-mail is kevingrim...@me.com.
> 
> Thanks,
> Kevin
> 



Re: Query not working with DatePointField

2017-06-15 Thread Tomas Fernandez Lobbe
The query field:* doesn't work with point fields (numerics or dates); only 
exact or range queries are supported, so an equivalent query would be field:[* 
TO *].
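
Applied to the query from the question below, that would be:

  field1:value1 AND ((*:* NOT field2:[* TO *]) AND field3:value3)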


Sent from my iPhone

> On Jun 15, 2017, at 5:24 PM, Saurabh Sethi  wrote:
> 
> Hi,
> 
> We have a fieldType specified for date. Earlier it was using TrieDateField
> and we changed it to DatePointField.
> 
>  sortMissingLast="true" precisionStep="6"/>
> 
> 
> 
> Here are the fields used in the query and one of them uses the dateType:
> 
>  stored="false" required="true" multiValued="false"/>
>  stored="false" docValues="false" />
>  stored="false" multiValued="true" />
> 
> The following query was returning correct results when the field type was
> Trie but not with Point:
> 
> field1:value1 AND ((*:* NOT field2:*) AND field3:value3)
> 
> Any idea why field2:* does not return results anymore?
> 
> Thanks,
> Saurabh


Re: Solr 6: how to get SortedSetDocValues from index by field name

2017-06-14 Thread Tomas Fernandez Lobbe
Hi,
To respond your first question: “How do I get SortedSetDocValues from index by 
field name?”, DocValues.getSortedSet(LeafReader reader, String field) (which is 
what you want to use to assert the existence and type of the DV) will give you 
the dv instance for a single leaf reader. In general, a leaf reader is for a 
specific segment, so depending on what you want to do you may need to iterate 
through all the leaves (segments) if you want all values in the index (kind of 
what you’ll see in NumericFacets or IntervalFacets classes). 
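
A minimal per-segment sketch of that loop (Lucene/Solr 6.x API; the field name
is illustrative):

  import org.apache.lucene.index.DocValues;
  import org.apache.lucene.index.LeafReaderContext;
  import org.apache.lucene.index.SortedSetDocValues;

  for (LeafReaderContext leafCtx : searcher.getIndexReader().leaves()) {
    // Asserts the field's DV type; returns an empty instance if this segment has no values
    SortedSetDocValues dv = DocValues.getSortedSet(leafCtx.reader(), "myField");
    // The instance and its ordinals are only valid within this segment
  }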

SolrIndexSearcher.getSlowAtomicReader() will give you a view of all the 
segments as a single reader, that’s why in that case the code assumes there is 
only one reader that contains all the values. 

Whatever you do, make sure you test your code in cases with multiple segments 
(and with deletes), which is where bugs using this code are most likely to 
occur.

You won’t need the UninvertingReader if you plan to index docValues, that class 
is used to create a docValues-like view of a field that’s indexed=true & 
docValues=false.

Related note, the DocValues API changed from 6.x to 7 (master). See LUCENE-7407.

I hope that helps, 

Tomás

> On Jun 13, 2017, at 10:49 AM, SOLR4189  wrote:
> 
> How do I get SortedSetDocValues from index by field name?
> 
> I try it and it works for me but I didn't understand why to use
> leaves.get(0)? What does it mean? (I saw such using in
> TestUninvertedReader.java of SOLR-6.5.1):
> 
> Map<String, UninvertingReader.Type> mapping = new HashMap<>();
> mapping.put(fieldName, UninvertingReader.Type.SORTED);
> 
> SolrIndexSearcher searcher = req.getSearcher();
> 
> DirectoryReader dReader = searcher.getIndexReader();
> LeafReader reader = null;
> 
> if (!dReader.leaves.isEmpty()) {
>  reader = dReader.leaves().get(0).reader;
>  return null;
> }
> 
> SortedSetDocValues sourceIndex = reader.getSortedSetDocValues(fieldName);
> 
> Maybe do I need to use SlowAtomicReader, like it:
> 
> *
> UninvertingReader reader = new
> UninvertingReader(searcher.getSlowAtomicReader(), mapping)*;
> 
> What is right way to get SortedSetDocValues and why?
> 
> 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Solr-6-how-to-get-SortedSetDocValues-from-index-by-field-name-tp4340388.html
> Sent from the Solr - User mailing list archive at Nabble.com.



Re: Got a 404 trying to update a solr. 6.5.1 server. /solr/update not found.

2017-06-05 Thread Tomas Fernandez Lobbe
I think you are missing the collection name in the path.
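
That is, include the collection name in the client's base URL (collection name
is illustrative):

  ConcurrentUpdateSolrClient solr = new 
      ConcurrentUpdateSolrClient("http://myhost:8983/solr/mycollection",10,2);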

Tomás

Sent from my iPhone

> On Jun 5, 2017, at 9:08 PM, Phil Scadden  wrote:
> 
> Simple piece of code. Had been working earlier (though against a 6.4.2 
> instance).
> 
>  ConcurrentUpdateSolrClient solr = new 
> ConcurrentUpdateSolrClient("http://myhost:8983/solr",10,2);
>   try {
>solr.deleteByQuery("*:*");
>solr.commit();
>   } catch (SolrServerException | IOException ex) {
>// logger handler stuff omitted.
>   }
> 
> Comes back with:
> 15:53:36,693 DEBUG wire:72 -  << "[\n]"
> 15:53:36,694 DEBUG wire:72 -  << " content="text/html;charset=utf-8"/>[\n]"
> 15:53:36,694 DEBUG wire:72 -  << "Error 404 Not Found[\n]"
> 15:53:36,695 DEBUG wire:72 -  << "[\n]"
> 15:53:36,695 DEBUG wire:72 -  << "HTTP ERROR 404[\n]"
> 15:53:36,696 DEBUG wire:72 -  << "Problem accessing /solr/update. 
> Reason:[\n]"
> 15:53:36,696 DEBUG wire:72 -  << "Not Found[\n]"
> 15:53:36,696 DEBUG wire:72 -  << "[\n]"
> 15:53:36,697 DEBUG wire:72 -  << "[\n]"
> 
> If I access http://myhost:8983/solr/update then I get that html too, but 
> http://myhost:8983/solr comes up with admin page as normal so Solr appears to 
> be running okay.
> Notice: This email and any attachments are confidential and may not be used, 
> published or redistributed without the prior written consent of the Institute 
> of Geological and Nuclear Sciences Limited (GNS Science). If received in 
> error please destroy and immediately notify GNS Science. Do not copy or 
> disclose the contents.


Re: A working example to play with Naive Bayes classifier

2016-07-15 Thread Tomas Ramanauskas
Hi Alessandro,

sorry for the delay. What do you mean?


As I mentioned earlier, I followed a super simple set of steps:

1. Download Solr
2. Configure classification 
3. Create some documents using curl over HTTP.


Is it difficult to reproduce the steps / problem?


Tomas



> On 23 Jun 2016, at 16:42, Alessandro Benedetti <benedetti.ale...@gmail.com> 
> wrote:
> 
> Can you give an example of your schema, and can you run a simple query for
> you index, curious to see how the input fields are analyzed.
> 
> Cheers
> 
> On Wed, Jun 22, 2016 at 6:05 PM, Alessandro Benedetti <
> benedetti.ale...@gmail.com> wrote:
> 
>> This is better!  At least the classifier is invoked!
>> How many docs in the index have the class assigned?
>> Take a look to the stacktrace and you should find the cause!
>> I am now on mobile, I will check the code tomorrow!
>> Cheers
>> On 22 Jun 2016 5:26 pm, "Tomas Ramanauskas" <
>> tomas.ramanaus...@springer.com> wrote:
>> 
>>> 
>>> I also tried with this config (adding ** to the initParams path):
>>> 
>>> 
>>> <initParams path="/update/**">
>>>   <lst name="defaults">
>>>     <str name="update.chain">classification</str>
>>>   </lst>
>>> </initParams>
>>> 
>>> 
>>> 
>>> 
>>> 
>>> And I get the error:
>>> 
>>> 
>>> 
>>> $ curl http://localhost:8983/solr/demo/update -d '
>>> [
>>> {"id" : "book15",
>>> "title_t":["The Way of Kings"],
>>> "author_s":"Brandon Sanderson",
>>> "cat_s": null,
>>> "pubyear_i":2010,
>>> "ISBN_s":"978-0-7653-2635-5"
>>> }
>>> ]'
>>> {"responseHeader":{"status":500,"QTime":29},"error":{"trace":"java.lang.NullPointerException\n\tat
>>> org.apache.lucene.classification.document.SimpleNaiveBayesDocumentClassifier.getTokenArray(SimpleNaiveBayesDocumentClassifier.java:202)\n\tat
>>> org.apache.lucene.classification.document.SimpleNaiveBayesDocumentClassifier.analyzeSeedDocument(SimpleNaiveBayesDocumentClassifier.java:162)\n\tat
>>> org.apache.lucene.classification.document.SimpleNaiveBayesDocumentClassifier.assignNormClasses(SimpleNaiveBayesDocumentClassifier.java:121)\n\tat
>>> org.apache.lucene.classification.document.SimpleNaiveBayesDocumentClassifier.assignClass(SimpleNaiveBayesDocumentClassifier.java:81)\n\tat
>>> org.apache.solr.update.processor.ClassificationUpdateProcessor.processAdd(ClassificationUpdateProcessor.java:94)\n\tat
>>> org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.handleAdds(JsonLoader.java:474)\n\tat
>>> org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.processUpdate(JsonLoader.java:138)\n\tat
>>> org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.load(JsonLoader.java:114)\n\tat
>>> org.apache.solr.handler.loader.JsonLoader.load(JsonLoader.java:77)\n\tat
>>> org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:97)\n\tat
>>> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:69)\n\tat
>>> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:155)\n\tat
>>> org.apache.solr.core.SolrCore.execute(SolrCore.java:2036)\n\tat
>>> org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:657)\n\tat
>>> org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:464)\n\tat
>>> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:257)\n\tat
>>> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:208)\n\tat
>>> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1668)\n\tat
>>> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:581)\n\tat
>>> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)\n\tat
>>> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)\n\tat
>>> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)\n\tat
>>> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1160)\n\tat
>>> org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:511)\n\tat
>>> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)\n\tat
>>> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1092)\n\tat
>>> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)\n\tat
>>> org.eclip

Re: A working example to play with Naive Bayes classifier

2016-06-22 Thread Tomas Ramanauskas

I also tried with this config (adding ** to the initParams path):


<initParams path="/update/**">
  <lst name="defaults">
    <str name="update.chain">classification</str>
  </lst>
</initParams>





And I get the error:



$ curl http://localhost:8983/solr/demo/update -d '
[
{"id" : "book15",
"title_t":["The Way of Kings"],
"author_s":"Brandon Sanderson",
"cat_s": null,
"pubyear_i":2010,
"ISBN_s":"978-0-7653-2635-5"
}
]'
{"responseHeader":{"status":500,"QTime":29},"error":{"trace":"java.lang.NullPointerException\n\tat
 
org.apache.lucene.classification.document.SimpleNaiveBayesDocumentClassifier.getTokenArray(SimpleNaiveBayesDocumentClassifier.java:202)\n\tat
 
org.apache.lucene.classification.document.SimpleNaiveBayesDocumentClassifier.analyzeSeedDocument(SimpleNaiveBayesDocumentClassifier.java:162)\n\tat
 
org.apache.lucene.classification.document.SimpleNaiveBayesDocumentClassifier.assignNormClasses(SimpleNaiveBayesDocumentClassifier.java:121)\n\tat
 
org.apache.lucene.classification.document.SimpleNaiveBayesDocumentClassifier.assignClass(SimpleNaiveBayesDocumentClassifier.java:81)\n\tat
 
org.apache.solr.update.processor.ClassificationUpdateProcessor.processAdd(ClassificationUpdateProcessor.java:94)\n\tat
 
org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.handleAdds(JsonLoader.java:474)\n\tat
 
org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.processUpdate(JsonLoader.java:138)\n\tat
 
org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.load(JsonLoader.java:114)\n\tat
 org.apache.solr.handler.loader.JsonLoader.load(JsonLoader.java:77)\n\tat 
org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:97)\n\tat
 
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:69)\n\tat
 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:155)\n\tat
 org.apache.solr.core.SolrCore.execute(SolrCore.java:2036)\n\tat 
org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:657)\n\tat 
org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:464)\n\tat 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:257)\n\tat
 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:208)\n\tat
 
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1668)\n\tat
 
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:581)\n\tat
 
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)\n\tat
 
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)\n\tat
 
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)\n\tat
 
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1160)\n\tat
 
org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:511)\n\tat 
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)\n\tat
 
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1092)\n\tat
 
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)\n\tat
 
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)\n\tat
 
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119)\n\tat
 
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)\n\tat
 org.eclipse.jetty.server.Server.handle(Server.java:518)\n\tat 
org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:308)\n\tat 
org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:244)\n\tat
 
org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273)\n\tat
 org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:95)\n\tat 
org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)\n\tat
 
org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceAndRun(ExecuteProduceConsume.java:246)\n\tat
 
org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:156)\n\tat
 
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:654)\n\tat
 
org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:572)\n\tat
 java.lang.Thread.run(Thread.java:745)\n","code":500}}


Tomas


On 22 Jun 2016, at 17:22, Tomas Ramanauskas 
<tomas.ramanaus...@springer.com> wrote:

Thanks for the response, Alessandro.

I tried this and it didn’t work either:



$ curl http://localhost:8983/solr/demo/update -d '
[
{"id" : "book14",
"title_t":["The Way of Kings"],
"author_s":"Brandon Sanderson",
"cat_s": null,
"pubyear_i":2010,
"ISBN_s":"978-0-7653-2635-5"
}
]’

{"resp

Re: A working example to play with Naive Bayes classifier

2016-06-22 Thread Tomas Ramanauskas
Thanks for the response, Alessandro.

I tried this and it didn’t work either:



$ curl http://localhost:8983/solr/demo/update -d '
[
{"id" : "book14",
"title_t":["The Way of Kings"],
"author_s":"Brandon Sanderson",
"cat_s": null,
"pubyear_i":2010,
"ISBN_s":"978-0-7653-2635-5"
}
]’

{"responseHeader":{"status":0,"QTime":2}}

$ curl http://localhost:8983/solr/demo/get?id=book14
{
  "doc":
  {
"id":"book14",
"title_t":["The Way of Kings"],
    "author_s":"Brandon Sanderson",
"pubyear_i":2010,
"ISBN_s":"978-0-7653-2635-5",
"_version_":1537854598189940736}}


I don’t see “cat_s” field in the results at all.


Tomas


On 22 Jun 2016, at 16:39, Alessandro Benedetti 
<abenede...@apache.org> wrote:

Hi Tomas,
first consideration :
an empty string is different from a NULL string.
This is controversial, I would suggest you to never use the empty String as
this can cause some others side effect.
Apart from that, the plugin will add the class only if the class field is
without any value

Object documentClass = doc.getFieldValue(classFieldName);
if (documentClass == null) {

Saying that, I would suggest you to build a sample index with some
document and then try to classify.
If this doesn't solve your issue, I can help you further.

Cheers

On Wed, Jun 22, 2016 at 3:45 PM, Tomas Ramanauskas <
tomas.ramanaus...@springer.com> wrote:

I also tried this configuration, but couldn’t get the feature to work:


<initParams path="/update">
  <lst name="defaults">
    <str name="update.chain">classification</str>
  </lst>
</initParams>


<updateRequestProcessorChain name="classification">
  <processor class="solr.ClassificationUpdateProcessorFactory">
    <str name="inputFields">title_t,author_s</str>
    <str name="classField">cat_s</str>
    <str name="algorithm">bayes</str>
  </processor>
</updateRequestProcessorChain>


Tomas

On 22 Jun 2016, at 13:46, Tomas Ramanauskas <
tomas.ramanaus...@springer.com>
wrote:

P.S. The version I use:

6.1.0-68

Also, earlier I said “If I modify an existing record, I think the
functionality works:”, but I think it doesn’t work for me at all.

$ curl http://localhost:8983/solr/demo/get?id=book1
{
 "doc":
 {
   "id":"book1",
   "title_t":["The Way of Kings"],
   "author_s":"Brandon Sanderson",
   "cat_s":"fantasy",
   "pubyear_i":2010,
   "ISBN_s":"978-0-7653-2635-5",
   "_version_":1535488016326328320}}

$ curl http://localhost:8983/solr/demo/update -d '
[
{"id" : "book1",
"title_t":["The Way of Kings"],
"author_s":"Brandon Sanderson",
"cat_s":"aaa",
"pubyear_i":2010,
"ISBN_s":"978-0-7653-2635-5"
}
]'
{"responseHeader":{"status":0,"QTime":0}}

$ curl http://localhost:8983/solr/demo/get?id=book1
{
 "doc":
 {
   "id":"book1",
   "title_t":["The Way of Kings"],
   "author_s":"Brandon Sanderson",
   "cat_s":"fantasy",
   "pubyear_i":2010,
   "ISBN_s":"978-0-7653-2635-5",
   "_version_":1535488016326328320}}


Tomas


On 22 Jun 2016, at 12:47, Tomas Ramanauskas <
tomas.ramanaus...@springer.com>
wrote:

Hi, everyone,


would someone be able to share a working example (step by step) that
demonstrates the use of Naive Bayes classifier in Solr?


I followed this Blog post:

https://alexbenedetti.blogspot.co.uk/2015/07/solr-document-classification-part-1.html?showComment=1464358093048#c2489902302085000947

And this tutorial:
http://yonik.com/solr-tutorial/

And this JIRA ticket:
https://issues.apache.org/jira/browse/SOLR-7739



So this is my configuration file (only what I added or modified):

<initParams path="/update">
  <lst name="defaults">
    <str name="update.chain">classification</str>
  </lst>
</initParams>


<updateRequestProcessorChain name="classification">
  <processor class="solr.ClassificationUpdateProcessorFactory">
    <str name="inputFields">title_t,author_s</str>
    <str name="classField">cat_s</str>
    <str name="algorithm">bayes</str>
  </processor>
</updateRequestProcessorChain>



If I modify an existing record, I think the functionality works:


$ curl http://localhost:8983/solr/demo/update -d '
[
{"id" : "book1",
"title_t":["The Way of Kings"],
"author_s":"Brandon Sanderson",
"cat_s":"",
"pubyear_i":2010,
"ISBN_s":"978-0-7653-2635-5"
}
]'
{"responseHeader":{"status":0,"QTime":8}}
$ curl http://localhost:8983/solr/demo/get?id=book1
{
 "doc":
 {
   "id":"book1",
   "title_t":["The Way of Kings"],
   "author_s":"Brandon Sanderson",
   "cat_s":"fantasy",
   "pubyear_i":2010,
   "ISBN_s":"978-0-7653-2635-5",
   "_version_":1535488016326328320}}




If I add a new document, something isn’t quite working:

$ curl http://localhost:8983/solr/demo/update -d '
[
{"id" : "book7",
"title_t":["The Way of Kings"],
"author_s":"Brandon Sanderson",
"cat_s":"",
"pubyear_i":2010,
"ISBN_s":"978-0-7653-2635-5"
}
]'
{"responseHeader":{"status":0,"QTime":0}}
$ curl http://localhost:8983/solr/demo/get?id=book7
{
 "doc":null}









--
--

Benedetti Alessandro
Visiting card : http://about.me/alessandro_benedetti

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England



Re: A working example to play with Naive Bayes classifier

2016-06-22 Thread Tomas Ramanauskas
I also tried this configuration, but couldn’t get the feature to work:


<initParams path="/update">
  <lst name="defaults">
    <str name="update.chain">classification</str>
  </lst>
</initParams>


<updateRequestProcessorChain name="classification">
  <processor class="solr.ClassificationUpdateProcessorFactory">
    <str name="inputFields">title_t,author_s</str>
    <str name="classField">cat_s</str>
    <str name="algorithm">bayes</str>
  </processor>
</updateRequestProcessorChain>


Tomas

On 22 Jun 2016, at 13:46, Tomas Ramanauskas 
<tomas.ramanaus...@springer.com> wrote:

P.S. The version I use:

6.1.0-68

Also, earlier I said “If I modify an existing record, I think the functionality 
works:”, but I think it doesn’t work for me at all.

$ curl http://localhost:8983/solr/demo/get?id=book1
{
  "doc":
  {
"id":"book1",
"title_t":["The Way of Kings"],
"author_s":"Brandon Sanderson",
"cat_s":"fantasy",
"pubyear_i":2010,
"ISBN_s":"978-0-7653-2635-5",
"_version_":1535488016326328320}}

$ curl http://localhost:8983/solr/demo/update -d '
[
{"id" : "book1",
"title_t":["The Way of Kings"],
"author_s":"Brandon Sanderson",
"cat_s":"aaa",
"pubyear_i":2010,
"ISBN_s":"978-0-7653-2635-5"
}
]'
{"responseHeader":{"status":0,"QTime":0}}

$ curl http://localhost:8983/solr/demo/get?id=book1
{
  "doc":
  {
"id":"book1",
"title_t":["The Way of Kings"],
"author_s":"Brandon Sanderson",
"cat_s":"fantasy",
"pubyear_i":2010,
"ISBN_s":"978-0-7653-2635-5",
"_version_":1535488016326328320}}


Tomas


On 22 Jun 2016, at 12:47, Tomas Ramanauskas 
<tomas.ramanaus...@springer.com<mailto:tomas.ramanaus...@springer.com>> wrote:

Hi, everyone,


would someone be able to share a working example (step by step) that 
demonstrates the use of Naive Bayes classifier in Solr?


I followed this Blog post:
https://alexbenedetti.blogspot.co.uk/2015/07/solr-document-classification-part-1.html?showComment=1464358093048#c2489902302085000947

And this tutorial:
http://yonik.com/solr-tutorial/

And this JIRA ticket:
https://issues.apache.org/jira/browse/SOLR-7739



So this is my configuration file (only what I added or modified):

  

  classification

  


  

  title_t,author_s
  cat_s
  bayes

  



If I modify an existing record, I think the functionality works:


$ curl http://localhost:8983/solr/demo/update -d '
[
{"id" : "book1",
"title_t":["The Way of Kings"],
"author_s":"Brandon Sanderson",
"cat_s":"",
"pubyear_i":2010,
"ISBN_s":"978-0-7653-2635-5"
}
]'
{"responseHeader":{"status":0,"QTime":8}}
$ curl http://localhost:8983/solr/demo/get?id=book1
{
  "doc":
  {
"id":"book1",
"title_t":["The Way of Kings"],
"author_s":"Brandon Sanderson",
"cat_s":"fantasy",
"pubyear_i":2010,
"ISBN_s":"978-0-7653-2635-5",
"_version_":1535488016326328320}}




If I add a new document, something isn’t quite working:

$ curl http://localhost:8983/solr/demo/update -d '
[
{"id" : "book7",
"title_t":["The Way of Kings"],
"author_s":"Brandon Sanderson",
"cat_s":"",
"pubyear_i":2010,
"ISBN_s":"978-0-7653-2635-5"
}
]'
{"responseHeader":{"status":0,"QTime":0}}
$ curl http://localhost:8983/solr/demo/get?id=book7
{
  "doc":null}








Re: A working example to play with Naive Bayes classifier

2016-06-22 Thread Tomas Ramanauskas
P.S. The version I use:

6.1.0-68

Also, earlier I said “If I modify an existing record, I think the functionality 
works:”, but I think it doesn’t work for me at all.

$ curl http://localhost:8983/solr/demo/get?id=book1
{
  "doc":
  {
"id":"book1",
"title_t":["The Way of Kings"],
"author_s":"Brandon Sanderson",
"cat_s":"fantasy",
"pubyear_i":2010,
"ISBN_s":"978-0-7653-2635-5",
"_version_":1535488016326328320}}

$ curl http://localhost:8983/solr/demo/update -d '
[
{"id" : "book1",
"title_t":["The Way of Kings"],
"author_s":"Brandon Sanderson",
"cat_s":"aaa",
"pubyear_i":2010,
"ISBN_s":"978-0-7653-2635-5"
}
]'
{"responseHeader":{"status":0,"QTime":0}}

$ curl http://localhost:8983/solr/demo/get?id=book1
{
  "doc":
  {
"id":"book1",
"title_t":["The Way of Kings"],
"author_s":"Brandon Sanderson",
"cat_s":"fantasy",
"pubyear_i":2010,
"ISBN_s":"978-0-7653-2635-5",
"_version_":1535488016326328320}}


Tomas









A working example to play with Naive Bayes classifier

2016-06-22 Thread Tomas Ramanauskas
Hi, everyone,


would someone be able to share a working example (step by step) that
demonstrates the use of the Naive Bayes classifier in Solr?


I followed this blog post:
https://alexbenedetti.blogspot.co.uk/2015/07/solr-document-classification-part-1.html?showComment=1464358093048#c2489902302085000947

And this tutorial:
http://yonik.com/solr-tutorial/

And this JIRA ticket:
https://issues.apache.org/jira/browse/SOLR-7739



So this is my configuration file (only what I added or modified):

  <initParams path="/update/**">
    <lst name="defaults">
      <str name="update.chain">classification</str>
    </lst>
  </initParams>

  <updateRequestProcessorChain name="classification">
    <processor class="solr.ClassificationUpdateProcessorFactory">
      <str name="inputFields">title_t,author_s</str>
      <str name="classField">cat_s</str>
      <str name="algorithm">bayes</str>
    </processor>
  </updateRequestProcessorChain>



If I modify an existing record, I think the functionality works:


$ curl http://localhost:8983/solr/demo/update -d '
[
{"id" : "book1",
"title_t":["The Way of Kings"],
"author_s":"Brandon Sanderson",
"cat_s":"",
"pubyear_i":2010,
"ISBN_s":"978-0-7653-2635-5"
}
]'
{"responseHeader":{"status":0,"QTime":8}}
$ curl http://localhost:8983/solr/demo/get?id=book1
{
  "doc":
  {
"id":"book1",
"title_t":["The Way of Kings"],
"author_s":"Brandon Sanderson",
"cat_s":"fantasy",
"pubyear_i":2010,
"ISBN_s":"978-0-7653-2635-5",
"_version_":1535488016326328320}}




If I add a new document, something isn’t quite working:

$ curl http://localhost:8983/solr/demo/update -d '
[
{"id" : "book7",
"title_t":["The Way of Kings"],
"author_s":"Brandon Sanderson",
"cat_s":"",
"pubyear_i":2010,
"ISBN_s":"978-0-7653-2635-5"
}
]'
{"responseHeader":{"status":0,"QTime":0}}
$ curl http://localhost:8983/solr/demo/get?id=book7
{
  "doc":null}






SOLR 4.0 + ReversedWildcardFilterFactory + DefaultSolrHighlighter + multibyte chars = crash?

2012-10-29 Thread Tomas Zerolo
Hi, SOLR gurus

we're experiencing a crash with SOLR 4.0 whenever the results contain
multibyte characters (more precisely: German umlauts, utf-8 encoded).

The crashes only occur when using ReversedWildcardFilterFactory (which
is necessary in 4.0 to be able to have wildcards at the beginning of
the search pattern, as far as I understand), *and* the highlighter is
on. The stack trace (heavily snipped) looks like this:

 | 12.09.2012 13:08:12 org.apache.solr.common.SolrException log
 | SCHWERWIEGEND: org.apache.solr.common.SolrException: 
org.apache.lucene.search.highlight.InvalidTokenOffsetsException: Token 
substantial exceeds length of provided text sized 5107
 | at 
org.apache.solr.highlight.DefaultSolrHighlighter.doHighlightingByHighlighter(DefaultSolrHighlighter.java:517)
 | at 
org.apache.solr.highlight.DefaultSolrHighlighter.doHighlighting(DefaultSolrHighlighter.java:401)
 | at 
org.apache.solr.handler.component.HighlightComponent.process(HighlightComponent.java:136)
 | at 
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:206)
 | [...]
 | at 
org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:254)
 | at 
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:599)
 | at 
org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:534)
 | at java.lang.Thread.run(Thread.java:662)
 | Caused by: org.apache.lucene.search.highlight.InvalidTokenOffsetsException: 
Token substantial exceeds length of provided text sized 5107
 | at 
org.apache.lucene.search.highlight.Highlighter.getBestTextFragments(Highlighter.java:233)
 | at 
org.apache.solr.highlight.DefaultSolrHighlighter.doHighlightingByHighlighter(DefaultSolrHighlighter.java:510)
 | ... 32 more

(excuse the German locale.) 

Poking around in the sources seems to point (to my untrained eye, that
is) to:

  https://issues.apache.org/jira/browse/LUCENE-3080

Is this the issue biting us? Any known workarounds? Anything
we might try to pin-point the problem or to fix the bug?

Thanks for any insights, regards
-- 
Tomás Zerolo
Axel Springer AG
Axel Springer media Systems
BILD Produktionssysteme
Axel-Springer-Straße 65
10888 Berlin
Tel.: +49 (30) 2591-72875
tomas.zer...@axelspringer.de
www.axelspringer.de

Axel Springer AG, Sitz Berlin, Amtsgericht Charlottenburg, HRB 4998
Vorsitzender des Aufsichtsrats: Dr. Giuseppe Vita
Vorstand: Dr. Mathias Döpfner (Vorsitzender)
Jan Bayer, Ralph Büchi, Lothar Lanz, Dr. Andreas Wiele


Re: SOLR 4.0 + ReversedWildcardFilterFactory + DefaultSolrHighlighter + multibyte chars = crash?

2012-10-29 Thread Tomas Zerolo
On Mon, Oct 29, 2012 at 08:55:27AM -0700, Ahmet Arslan wrote:
 Hi Tomas,
 
 I think this is the same case Marian reported before.
 
 https://issues.apache.org/jira/browse/SOLR-3193
 https://issues.apache.org/jira/browse/SOLR-3901

Thanks, Ahmet. Yes, by the descriptions they look very similar. I'll
try to follow up on the bug reports.

Regards
-- 
Tomás Zerolo
Axel Springer AG
Axel Springer media Systems
BILD Produktionssysteme
Axel-Springer-Straße 65
10888 Berlin
Tel.: +49 (30) 2591-72875
tomas.zer...@axelspringer.de
www.axelspringer.de

Axel Springer AG, Sitz Berlin, Amtsgericht Charlottenburg, HRB 4998
Vorsitzender des Aufsichtsrats: Dr. Giuseppe Vita
Vorstand: Dr. Mathias Döpfner (Vorsitzender)
Jan Bayer, Ralph Büchi, Lothar Lanz, Dr. Andreas Wiele


Re: SOLR 4.0 / Jetty Security Set Up

2012-09-07 Thread Tomas Zerolo
On Fri, Sep 07, 2012 at 08:50:58AM +0200, Paul Libbrecht wrote:
 Erick,
 
 I think that should be described differently...
 You need to set-up protected access for some paths.
 /update is one of them.
 And you could make this protected at the jetty level or using Apache proxies 
 and rewrites.

So you'd advise always putting an Apache in front of Jetty?

 Probably /select should be kept open

As far as I understand [1], it's better to close /select (because you can
easily make an admin or update out of it, by e.g. doing a /select?qt=/admin
or /select?qt=/update)

  but you need to evaluate if that can get you
 in DoS attacks if there are too big selects. If that is the case, you're left
 to programme an interface all by yourself which limits and fetches from solr,
 or which lives inside solr (a query component) and throws if things are too big.

[1] http://wiki.apache.org/solr/SolrSecurity#Path_Based_Authentication

Regards
-- 
Tomás Zerolo
Axel Springer AG
Axel Springer media Systems
BILD Produktionssysteme
Axel-Springer-Straße 65
10888 Berlin
Tel.: +49 (30) 2591-72875
tomas.zer...@axelspringer.de
www.axelspringer.de

Axel Springer AG, Sitz Berlin, Amtsgericht Charlottenburg, HRB 4998
Vorsitzender des Aufsichtsrats: Dr. Giuseppe Vita
Vorstand: Dr. Mathias Döpfner (Vorsitzender)
Jan Bayer, Ralph Büchi, Lothar Lanz, Dr. Andreas Wiele
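
For reference, protecting /update at the Jetty level comes down to a standard
servlet security constraint in Solr's web.xml; a minimal sketch (role and realm
names here are placeholders, and the realm itself still has to be defined in
Jetty's configuration):

  <security-constraint>
    <web-resource-collection>
      <web-resource-name>Solr update</web-resource-name>
      <url-pattern>/update/*</url-pattern>
    </web-resource-collection>
    <auth-constraint>
      <role-name>solr-updater</role-name>
    </auth-constraint>
  </security-constraint>
  <login-config>
    <auth-method>BASIC</auth-method>
    <realm-name>solr-realm</realm-name>
  </login-config>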


Re: AW: Indexing wildcard patterns

2012-08-13 Thread Tomas Zerolo
On Fri, Aug 10, 2012 at 12:38:46PM -0400, Jack Krupansky wrote:
 Doc1 has the pattern AB%CD% associated with it (somehow?!).
 
 You need to clarify what you mean by that.

I'm not the OP, but I think (s)he means the patterns are in the
database and the string to match is given in the query. Perhaps
this inversion is a bit unusual, and most optimizers aren't
prepared for that, but still reasonable, IMHO.

 To be clear, Solr support for wildcards is a superset of the SQL
 LIKE operator, and the patterns used in the LIKE operator are NOT
 stored in the table data, but used at query time

I don't know about others, but PostgreSQL copes just fine:

 | tomas@rasputin:~$ psql template1
 | psql (9.1.2)
 | Type "help" for help.
 | 
 | template1=# create database test;
 | CREATE DATABASE
 | template1=# create table foo (
 | template1(#   pattern VARCHAR
 | template1(# );
 | CREATE TABLE
 | template1=# insert into foo values('%blah');
 | INSERT 0 1
 | template1=# insert into foo values('blah%');
 | INSERT 0 1
 | template1=# insert into foo values('%bloh%');
 | INSERT 0 1
 | template1=# select * from foo where 'blahblah' like pattern;
 |  pattern 
 | -
 |  %blah
 |  blah%
 | (2 rows)

Now don't ask whether the optimizer has a fair chance at this. Dunno
what happens when we have, say, 10^7 patterns... but the OP's pattern
set seems to be reasonably small.

  - same with Solr.
 In SQL you do not associate patterns with table data, but rather
 you query data using a pattern.

I'd guess that the above trick might be doable in SOLR as well, as
other posts in this thread seem to suggest. But I'm not that proficient
in SOLR, that's why I'm lurking here ;-)

tomás
-- 
Tomás Zerolo
Axel Springer AG
Axel Springer media Systems
BILD Produktionssysteme
Axel-Springer-Straße 65
10888 Berlin
Tel.: +49 (30) 2591-72875
tomas.zer...@axelspringer.de
www.axelspringer.de

Axel Springer AG, Sitz Berlin, Amtsgericht Charlottenburg, HRB 4998
Vorsitzender des Aufsichtsrats: Dr. Giuseppe Vita
Vorstand: Dr. Mathias Döpfner (Vorsitzender)
Jan Bayer, Ralph Büchi, Lothar Lanz, Dr. Andreas Wiele


Re: Lexical analysis tools for German language data

2012-04-13 Thread Tomas Zerolo
On Thu, Apr 12, 2012 at 03:46:56PM +, Michael Ludwig wrote:
  Von: Walter Underwood
 
  German noun decompounding is a little more complicated than it might
  seem.
  
  There can be transformations or inflections, like the s in
  Weihnachtsbaum (Weihnachten/Baum).
 
 I remember from my linguistics studies that the terminus technicus for
 these is Fugenmorphem (interstitial or joint morpheme) [...]

IANAL (I am not a linguist -- pun intended ;) but I've always read that
as a genitive. Any pointers?

Regards
-- 
Tomás Zerolo
Axel Springer AG
Axel Springer media Systems
BILD Produktionssysteme
Axel-Springer-Straße 65
10888 Berlin
Tel.: +49 (30) 2591-72875
tomas.zer...@axelspringer.de
www.axelspringer.de

Axel Springer AG, Sitz Berlin, Amtsgericht Charlottenburg, HRB 4998
Vorsitzender des Aufsichtsrats: Dr. Giuseppe Vita
Vorstand: Dr. Mathias Döpfner (Vorsitzender)
Jan Bayer, Ralph Büchi, Lothar Lanz, Dr. Andreas Wiele


Re: Solr as an part of api to unburden databases

2012-02-15 Thread Tomas Zerolo
On Wed, Feb 15, 2012 at 11:48:14AM +0100, Ramo Karahasan wrote:
 Hi,
 
  
 
 does anyone of the maillinglist users use solr as an API to avoid database
 queries? [...]

Like in a... cache?

Why not use a cache then? (memcached, for example, but there are more).

Regards
-- tomás


Re: how to avoid OOM while merge index

2012-01-09 Thread Tomas Zerolo
On Mon, Jan 09, 2012 at 01:29:39PM +0800, James wrote:
 I am build the solr index on the hadoop, and at reduce step I run the task 
 that merge the indexes, each part of index is about 1G, I have 10 indexes to 
 merge them together, I always get the java heap memory exhausted, the heap 
 size is about 2G  also. I wonder which part use these so many memory. And how 
 to avoid the OOM during the merge process.

There are three issues in there. You should first try to find out which
one it is (it's not clear to me based on your question):

  - Java heap memory: you can set that as a start option of the JVM.
You set the maximum with the -Xmxn start option. You get an
OutOfMemory exception if you reach that (no idea whether the
SOLR code bubbles this up, but there are experts on that here).
  - Operating system limit: you can set the limit for a process's
use of resources (memory, among others). Typically, Linux based
systems are shipped with unlimited memory setting; Ralf already
posted how to check/set that.
The situation here is a bit complicated, because there are
different limits (memory size vs. virtual memory size, mainly)
and they are exercised differently depending on the allocation
pattern. Anyway, I'd expect malloc() returning NULL in this
case and the Java runtime translating it (again) into an OutOfMemory
exception.
  - Now the OOM killer is quite another kettle of fish. AFAIK, it's
Linux-specific. Once the global system memory is more-or-less
exhausted, the kernel kills some applications to try to improve
the situation. There's some heuristic in deciding which application
to kill, and there are some knobs to help the kernel in this
decision. I'd recommend [1]; after reading *that* you know all :-)
You know you've run into that by looking at the system log.


[1] https://lwn.net/Articles/317814/
-- 
Tomás Zerolo
Axel Springer AG
Axel Springer media Systems
BILD Produktionssysteme
Axel-Springer-Straße 65
10888 Berlin
Tel.: +49 (30) 2591-72875
tomas.zer...@axelspringer.de
www.axelspringer.de

Axel Springer AG, Sitz Berlin, Amtsgericht Charlottenburg, HRB 4998
Vorsitzender des Aufsichtsrats: Dr. Giuseppe Vita
Vorstand: Dr. Mathias Döpfner (Vorsitzender)
Jan Bayer, Ralph Büchi, Lothar Lanz, Dr. Andreas Wiele


Re: Poor performance on distributed search

2011-12-20 Thread Tomas Zerolo
On Mon, Dec 19, 2011 at 01:32:22PM -0800, ku3ia wrote:
 Uhm, either I misunderstand your question or you're doing 
 a lot of extra work for nothing 
 
 The whole point of sharding is exactly to collect the top N docs 
 from each shard and merge them into a single result [...]

 P.S. Is any mechanism, for example, to get top 100 rows from each shard,
 only merge it, sort by defined at query filed or score and pull result to
 the user?
 Uhm, either I misunderstand your question
 For example I have 4 shards. Finally, I need 2000 docs. Now, when I'm using
 shards=127.0.0.1:8080/solr/shard1,127.0.0.1:8080/solr/shard2,127.0.0.1:8080/solr/shard3,127.0.0.1:8080/solr/shard4
 Solr gets 2000 docs from each shard (shard1,2,3,4, summary we have 8000
 docs) merge and sort it, for example, by default field (score), and returns
 me only 2000 rows (not all 8000), which I specified at request.
 So, my question was about, is any mechanism in Solr, which gets not 2000
 rows from each shard, and say, If I specified 2000 docs at request, Solr
 calculates how much shards I have (four shards), divides total rows onto
 shards (2000/4=500) and sends to each shard queries with rows=500, but not
 rows=2000, so finally, summary after merging and sorting I'll have 2000 rows
 (maybe less), but not 8000... That was my question.

But then the results would be wrong? Suppose the documents are not evenly
distributed (wrt the sort criterium) across all the shards. In an extreme
case, just imagine all 2000 top-most documents are on shard 3. You would get
the 500 top-most (from shard 3) and some other you don't want (from the
other shards). You wouldn't even know.

What SOLR is doing here is planning for the worst case.

Now if it could just do some piece-wise merge sort of sorts, that would be
better.

-- 
Tomás Zerolo
Axel Springer AG
Axel Springer media Systems
BILD Produktionssysteme
Axel-Springer-Straße 65
10888 Berlin
Tel.: +49 (30) 2591-72875
tomas.zer...@axelspringer.de
www.axelspringer.de


Re: Don't snowball depending on terms

2011-11-29 Thread Tomas Zerolo
On Tue, Nov 29, 2011 at 01:53:44PM -0500, François Schiettecatte wrote:
 It won't and depending on how your analyzer is set up the terms are most 
 likely stemmed at index time.
 
 You could create a separate field for unstemmed terms though, or use a less 
 aggressive stemmer such as EnglishMinimalStemFilterFactory.

This is surprising to me. Snowball introduces new homonyms, meaning it
will lump e.g. "management" and "manage" into one index entry. Thus,
I'd expect a handful of false positives (but usually not too many).

That's a lossy index (loosely speaking) and could be fixed by
post-filtering (instead of introducing another index, which in
most cases would seem a waste of resources).

Is there no way in SOLR of filtering the results *after* the index
scan? I'd be disappointed!

Regards
-- tomás
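
The separate-field approach suggested above would look roughly like this in
schema.xml (the field and type names here are made up):

  <field name="body" type="text_stemmed" indexed="true" stored="true"/>
  <field name="body_exact" type="text_unstemmed" indexed="true" stored="false"/>
  <copyField source="body" dest="body_exact"/>

where text_unstemmed is the same analysis chain minus the snowball filter; the
exact-match condition can then be expressed as an extra clause or filter query
on body_exact.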


Re: Filtering results based on a set of values for a field

2011-08-19 Thread Tomas Zerolo
On Thu, Aug 18, 2011 at 02:32:48PM -0400, Erick Erickson wrote:
 Hmmm, I'm still not getting it...
 
 You have one or more lists. These lists change once a month or so. Are
 you trying
 to include or exclude the documents in these lists?

In our specific case, we want to include *only* the documents having a value of
an attribute (author) in this list (the user decides at query time
which of those lists to use). But we do expect the problem to become
more general over time...

 And do the authors you want
 to include or exclude change on a per-query basis or would you be all set if
 you just had a filter that applied to all the authors on a particular list?

No. ATM there are two fixed lists (in the sense that they are updated
roughly monthly). One problem: the document base itself is huge (around
3.5 million documents). Re-indexing is a painful exercise taking days,
so we tend not to do it too often ;-)

 But I *think* what you want is a SearchComponent that implements your
 Filter. You can see various examples of how to add components to a search
 handler in the solrconfig.xml file.

Thanks a lot for the pointer. Rushing to read up on it.

 WARNING: Haven't done this myself, so I'm partly guessing here.

Hey: I asked for pointers and you're giving me some, so I'm a happy
man now :-)

 Although here's a hint that someone else has used this approach:
 http://www.mail-archive.com/solr-user@lucene.apache.org/msg54240.html

Thanks again

 And you'll want to insure that the Filter is cached so you don't have to
 compute it more than once.

Yes, I hope that will be the trick giving us the needed boost. Somehow
we'll have to figure out how to drop the cache when a new version of the
list arrives (without killing everyone in the building).

I'll sure report back.

Regards
-- tomás
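
For the record, hooking a custom component into a handler is done in
solrconfig.xml roughly as follows (the component class name here is
hypothetical):

  <searchComponent name="authorListFilter" class="com.example.AuthorListFilterComponent"/>

  <requestHandler name="/select" class="solr.SearchHandler">
    <arr name="last-components">
      <str>authorListFilter</str>
    </arr>
  </requestHandler>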


Re: Filtering results based on a set of values for a field

2011-08-18 Thread Tomas Zerolo
On Thu, Aug 18, 2011 at 08:36:08AM -0400, Erick Erickson wrote:
 How does this list of authors get selected? The reason I'm asking is I'm
 wondering if you can define the problem away. In other words, I'm wondering
 if this is an XY problem (http://people.apache.org/~hossman/#xyproblem).

:-)

 I can't imagine you expect a user to specify up to 2k authors...

Of course not for each individual query.

  so there must
 be something programmatic going on here, perhaps you can index some clever
 information with the docs that'll make this more tractable...

Alas, they do provide a list of names on a regular basis -- which comes from
an external source (which changes slowly and is provided, e.g. once a month).

To be more precise, there will be a couple of those lists.

I don't see (yet) how to define the problem away, but I keep trying...

Thanks for the nudge
-- tomás


Re: Faceted Search Patent Lawsuit - Please Read

2011-08-17 Thread Tomas Zerolo
On Tue, Aug 16, 2011 at 03:58:29PM -0400, Grant Ingersoll wrote:
 I know you mean well and are probably wondering what to do next [...]

Still, a short heads-up like Johnson's would seem OK?

After all, this is of concern to us all.

Regards
-- tomás


Re: Filtering results based on a set of values for a field

2011-08-17 Thread Tomas Zerolo
On Tue, Aug 16, 2011 at 07:56:51AM +, tomas.zer...@axelspringer.de wrote:
 Hello, Solrs
 
 we are trying to filter out documents written by (one or more of) the authors 
 from
 a mediumish list (~2K). The document set itself is in the millions.

[...]

Sorry, forgot to say that we are still using SOLR 1.4. But any pointers, even if
they are 3.x-only, are highly appreciated.

Regards
-- tomás


Re: analyzer type

2010-11-12 Thread Tomas Fernandez Lobbe
For a field type, the analysis applied at index time (when you are adding
documents to Solr) can be slightly different from the analysis applied at
query time (when a user executes a query). For example, if you know you are
going to be indexing HTML pages, you might need to use the
HTMLStripCharFilterFactory to strip the HTML tags, but the user won't be
querying with HTML tags, right? So in that case you might need the
HTMLStripCharFilterFactory only at index time (on the index type analyzer).

If you don't specify the analyzer type, by default, the same analysis chain
(all the same token filters, char filters and the tokenizer) will be applied
to both indexing and querying.

I hope I made myself clear
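
As an illustration, a field type that strips HTML at index time only might look
like the following sketch (the type name and the rest of the chain are just an
example):

  <fieldType name="text_html" class="solr.TextField">
    <analyzer type="index">
      <charFilter class="solr.HTMLStripCharFilterFactory"/>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>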





From: gauravshetti gaurav.she...@tcs.com
To: solr-user@lucene.apache.org
Sent: Friday, 12 November 2010 13:46:49
Subject: analyzer type


Can you please help me distinguish between analyzer types. I am not able to
find documentation for the same.

I want to add solr.HTMLStripCharFilterFactory in the schema.xml file.

And I can see two types defined in my schema.xml for the analyzer:
<analyzer type="index">
<analyzer type="query">
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/analyzer-type-tp1890002p1890002.html
Sent from the Solr - User mailing list archive at Nabble.com.



  

Re: Searching with AND + OR and spaces

2010-11-12 Thread Tomas Fernandez Lobbe
Hi Jon, for the first query:

title:"Call of Duty" OR subhead:"Call of Duty"

If you are sure that you have documents with the same phrase, make sure you
don't have a problem with stop words and with token positions. I recommend
checking the analysis page at the Solr admin. Pay special attention to the
enablePositionIncrements attribute of the StopFilterFactory, which defaults to
false
(http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.StopFilterFactory).


for your second query:

title:Call of Duty OR subhead:Call of Duty AND type:4

make sure that you add parentheses like:

title:(Call of Duty) OR subhead:(Call of Duty) AND type:4

otherwise it will be translated to the query (supposing you have OR as your 
default parameter):

title:Call OR your_default_field:of OR  your_default_field:Duty OR subhead:Call 
OR  your_default_field:of OR  your_default_field:Duty AND type:4

Tomás





From: Jon Drukman j...@cluttered.com
To: solr-user@lucene.apache.org
Sent: Friday, 12 November 2010 15:22:21
Subject: Searching with AND + OR and spaces

I want to search two fields for the phrase "Call Of Duty".  I tried this:

(title:"Call of Duty" OR subhead:"Call of Duty")

No matches, despite the fact that there are many documents that should match.

So I left out the quotes, and it seems to work.  But now when I try doing things
like

title:Call of Duty OR subhead:Call of Duty AND type:4

I get a lot of things like "called it!" and "i'm taking calls" but "call of duty"
doesn't surface.

How can I get what I want?

-jsd-


  

Re: Search with accent

2010-11-10 Thread Tomas Fernandez Lobbe
I don't understand: when the user searches for perequê, you want results for
both perequê and pereque?

If that's the case, any field type with ISOLatin1AccentFilterFactory should work.
The accent should be removed at index time and at query time (make sure the
filter is being applied in both cases).

Tomás






From: Claudio Devecchi cdevec...@gmail.com
To: Lista Solr solr-user@lucene.apache.org
Sent: Wednesday, 10 November 2010 15:16:24
Subject: Search with accent

Hi all,

Does somebody know how I can configure Solr to make searches with and without
accents?

for example:

pereque and perequê


When I do it I need the same result, but it's not working.

tks
--



  

Re: Search with accent

2010-11-10 Thread Tomas Fernandez Lobbe
It looks like ISOLatin1AccentFilter is deprecated in Solr 1.4.1; if you are on
that version, you should use the ASCIIFoldingFilter instead.

Like with any other filter, to use it, you have to add the filter factory to the
analysis chain of the field type you are using:

<filter class="solr.ASCIIFoldingFilterFactory"/>

Make sure you add it to the query and index analysis chain, otherwise you'll
have strange results.

You'll have to perform a full reindex.

Tomás





From: Claudio Devecchi cdevec...@gmail.com
To: solr-user@lucene.apache.org
Sent: Wednesday, 10 November 2010 17:08:06
Subject: Re: Search with accent

Tomas,

Let me try to explain better.

For example.

- I have 10 documents, where 7 have the word pereque (without accent) and 3
have the word perequê (with accent)

When I do a search pereque, solr is returning just 7, and when I do a search
perequê solr is returning 3.

But for me, these words are the same, and when I do some search for perequê
or pereque, it should show me 10 results.


About the ISOLatin filter you mentioned, do you know how I can enable it?

tks,
Claudio









-- 
Claudio Devecchi
flickr.com/cdevecchi



  

Re: Search with accent

2010-11-10 Thread Tomas Fernandez Lobbe
That's what the ASCIIFoldingFilter does, it removes the accents; that's why you
have to add it to the query analysis chain and to the index analysis chain, to
search the same way you index.



You can see how it works from the Analysis page on Solr Admin.






From: Savvas-Andreas Moysidis savvas.andreas.moysi...@googlemail.com
To: solr-user@lucene.apache.org
Sent: Wednesday, 10 November 2010 17:27:24
Subject: Re: Search with accent

have you tried using a TokenFilter which removes accents both at
indexing and searching time? If you index terms without accents and
search the same
way you should be able to find all documents as you require.










  

Re: Search with accent

2010-11-10 Thread Tomas Fernandez Lobbe
You have to modify the field type you are using in your schema.xml file. This
is the "text" field type of the Solr 1.4.1 example with this filter added:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="stopwords.txt"
            enablePositionIncrements="true"
            />
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="1" catenateNumbers="1"
            catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"
            protected="protwords.txt"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="stopwords.txt"
            enablePositionIncrements="true"
            />
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="0" catenateNumbers="0"
            catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"
            protected="protwords.txt"/>
  </analyzer>
</fieldType>








From: Claudio Devecchi cdevec...@gmail.com
To: solr-user@lucene.apache.org
Sent: Wednesday, 10 November 2010 17:44:01
Subject: Re: Search with accent

Ok tks,

I'm new to Solr; my doubt is how I can enable this feature. Or is this
feature already working by default?

Is this something to configure in my schema.xml?

Tks!!



Re: How to use protwords.txt

2010-08-31 Thread Tomas
Shuai, are you using a WordDelimiterFilterFactory in the analysis? That's the
filter that might be transforming "met1" into "met" and "1", not the stemmer.
Check the Analysis page on the Solr admin.
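
As a sketch (check which of these attributes your Solr version supports),
keeping the filter from splitting on letter/digit boundaries would look like:

<filter class="solr.WordDelimiterFilterFactory"
        generateWordParts="1" generateNumberParts="1"
        splitOnNumerics="0" protected="protwords.txt"/>

splitOnNumerics="0" stops met1 from being split into met and 1, and the
protected word list here is honoured by this filter itself, independently of
the stemmer's protwords.txt.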





From: Shuai Weng sh...@genome.stanford.edu
To: solr-user@lucene.apache.org
Sent: Monday, 30 August 2010 20:00:41
Subject: How to use protwords.txt


Hey,

Currently we have indexed some biological fulltext files. I was wondering how to
configure the schema.xml such that the gene names (e.g., 'met1', 'met2', 'met3')
won't be stemmed into the same word ('met'). I added these gene names to the
protwords.txt file but it doesn't seem to work.

Am I missing anything?

Thanks for any info you may provide!

Shuai


  

Stress Test Solr

2010-08-02 Thread Tomas
Hi All, we've been building an open source tool for load tests on Solr
installations. The tool is called SolrMeter. It's on Google Code
at http://code.google.com/p/solrmeter/. Here is some information about it:

SolrMeter is a stress testing / performance benchmarking tool for Apache Solr
installations. It is licensed under ASL and developed using JavaSE and Swing
components, connected with Solr using SolrJ.
 
What can you do with SolrMeter?
The main goal of this open source project is to bring to the Apache Solr user
community a tool for dealing with Solr-specific issues regarding performance and
stress testing, like firing queries and adding documents, to make sure that your
Solr installation will support the real world's load and demands. With SolrMeter
you can simulate a workload over the Apache Solr installation and obtain useful
visual performance statistics and metrics.
Relevant Features:
* Execute queries against a Solr installation
* Execute dummy updates/inserts to the Solr installation; it can be the same
server as the queries or a different one.
* Configure number of queries to fire in a time period interval
* Configure the number of updates/inserts in a time period.
* Configure commits frequency during adds
* Monitor error counts when adding and committing documents.
* Perform and monitor index optimization
* Monitor query times online and visually
* Add filter queries into the test queries
* Add facet abilities into the test queries
* Import/Export test configuration
* Query time execution histogram chart
* Query times distribution chart
* Online error log and browsing capabilities
* Individual query graphical log and statistics
* and much more
 
What do you need to use SolrMeter?
This is one of the most interesting points about SolrMeter: the requirements are
minimal. It is simple to install and use.
* JRE version 1.6
* The Solr Server you want to test.
 
Who can use SolrMeter?
Everyone who needs to assess a Solr server's performance. To run the tool
you only need to know about SOLR.



Try it and tell us what you think . . . . .  

Solrmeter Group
mailto:solrme...@googlegroups.com

What's next?
We are now building version 0.2.0; the objective of this new version is to
evolve SolrMeter into a pluggable architecture to allow deeper customizations,
like adding custom statistics, extractors or executors.
We are also adding some usability improvements.

In future versions we want to add better interaction with Solr request
handlers; for example, showing cache statistics online and graphically on some
chart would be a great tool.
We also want to add more usability features to make SolrMeter a complete tool
for testing a Solr installation.
For more details on what's next, check the Issues page on the Google Code site.



  

Index size on disk

2010-03-11 Thread Tomas
Hello, I needed an easy way to see the index size (the actual size on disk, not
just the number of documents indexed) and, as I didn't find anything for doing
that in the documentation or on the list, I coded a quick solution.

I added the index size as a statistic of the searcher; that way the value can
be seen on the statistics page of the Solr admin. To do this I modified the
method

public NamedList getStatistics() {... 

on the class 

org.apache.solr.search.SolrIndexSearcher

by adding the line

lst.add("indexSize", this.calculateIndexSize(reader.directory()).toString() + " MB");

and added the methods: 

private BigDecimal calculateIndexSize(Directory directory) {
    long size = 0L;
    try {
        // sum the length of every file in the index directory
        for (String filePath : directory.listAll()) {
            size += directory.fileLength(filePath);
        }
    } catch (IOException e) {
        return new BigDecimal(-1);
    }
    return getSizeInMB(size, 2);
}

private BigDecimal getSizeInMB(long size, int scale) {
    BigDecimal divisor = new BigDecimal(1024);
    BigDecimal sizeKb = new BigDecimal(size).divide(divisor, scale + 1, BigDecimal.ROUND_HALF_UP);
    return sizeKb.divide(divisor, scale, BigDecimal.ROUND_HALF_UP);
}

I'm running Solr 1.4 on a JBoss 4.0.5 with Java 1.5 and this worked just fine.
Does anyone see a potential problem with this?

I'm assuming that the Solr index will never have directories inside (that's why
I'm just looping on the index parent directory); is there any case where this is
not true?

Tomás




Re: user feedback in solr

2010-02-05 Thread Tomas
I'm responding to this old mail because I implemented something like this,
similar to http://wiki.apache.org/solr/SolrSnmp. Maybe we could discuss whether
this is a good solution.

I'm using Solr 1.4 on a JBoss 4.0.5 and Java 1.5.
In my particular case, what I'm trying to find out is how often the user uses 
wildcards on his query.

I implemented a Servlet Filter that extracts the q parameter from the request
and logs it (to a different log file, just for queries). After that, the filter
passes the query to the MBean, which parses the query looking for wildcards, and
that's it.
This is not a complete solution but it might help.
What we are planning to do with this is to use the MBean to see live
information, and to parse the query.log for a more detailed analysis of usage.

I put the following code in a jar file in the server's lib directory:

-QueryFilter--

public class QueryFilter implements Filter {

    private static final Log log = LogFactory.getLog(QueryFilter.class);

    private QueryStatsMBean mbean;

    public void doFilter(ServletRequest request, ServletResponse response,
            FilterChain chain) throws IOException, ServletException {

        if (request != null) {
            // extract the q parameter, log it and hand it to the MBean
            String query = request.getParameter("q");
            if (query != null) {
                log.info("Query: " + query);
                mbean.parseQuery(query);
            }
        }
        chain.doFilter(request, response);
    }

    public void init(FilterConfig filterConfig) throws ServletException {
        log.info("init filter " + this);

        mbean = new QueryStats();
        try {
            mbean.create();
            registerMBean(mbean.getName(), mbean);
        } catch (Exception e) {
            throw new ServletException(e);
        }
    }

    private void registerMBean(String objectName, Object bean) throws ServletException {
        MBeanServer mbs = MBeanServerLocator.locateJBoss();
        try {
            ObjectName on = new ObjectName(objectName);
            mbs.registerMBean(bean, on);
            log.info("MBean registered");
        }
        catch (NotCompliantMBeanException e) { throw new ServletException(e); }
        catch (MBeanRegistrationException e) { throw new ServletException(e); }
        catch (InstanceAlreadyExistsException e) { throw new ServletException(e); }
        catch (MalformedObjectNameException e) { throw new ServletException(e); }
    }

    public void destroy() {
        log.info("Destroying Filter " + this);
        mbean.destroy();
    }
}
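
For completeness, the filter has to be registered in Solr's web.xml to be
loaded; a sketch (the package name is whatever the jar uses):

<filter>
  <filter-name>queryStatsFilter</filter-name>
  <filter-class>com.example.QueryFilter</filter-class>
</filter>
<filter-mapping>
  <filter-name>queryStatsFilter</filter-name>
  <url-pattern>/*</url-pattern>
</filter-mapping>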

---QueryStatsMBean--

public interface QueryStatsMBean extends org.jboss.system.ServiceMBean {

    public Integer getTotalQueries();

    public Integer getAsteriskQueries();

    public Integer getQuestionQueries();

    public Integer getDefaultQueries();

    public void parseQuery(String query);

    public void create() throws Exception;

    public void start() throws Exception;

    public void stop();

    public void destroy();

    public int fileReaded();
}




QueryStats 

public class QueryStats implements QueryStatsMBean {

    private Integer totalQueries;

    private Integer asteriskQueries;

    private Integer questionQueries;

    private Integer defaultQueries;

    public void create() throws Exception {
        totalQueries = new Integer(0);
        asteriskQueries = new Integer(0);
        questionQueries = new Integer(0);
        defaultQueries = new Integer(0);
    }

    public void parseQuery(String query) {
        // count every query, then classify it by wildcard usage
        totalQueries = new Integer(totalQueries.intValue() + 1);
        if (isDefaultQuery(query)) {
            defaultQueries = new Integer(defaultQueries.intValue() + 1);
        } else {
            if (hasAsterisk(query)) {
                asteriskQueries = new Integer(asteriskQueries.intValue() + 1);
            }
            if (hasQuestion(query)) {
                questionQueries = new Integer(questionQueries.intValue() + 1);
            }
        }
    }

    private boolean hasQuestion(String query) {
        return query.indexOf("?") != -1;
    }

    private boolean hasAsterisk(String query) {
        return query.indexOf("*") != -1;
    }

    private boolean isDefaultQuery(String defaultQuery) {
        return "*:*".equals(defaultQuery);
    }

    public Integer getTotalQueries() {
        return totalQueries;
    }