
[jira] [Commented] (SOLR-13434) OpenTracing support for Solr

2019-05-26 Thread Otis Gospodnetic (JIRA)



[ 
https://issues.apache.org/jira/browse/SOLR-13434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16848549#comment-16848549
 ] 

Otis Gospodnetic commented on SOLR-13434:
-

[~caomanhdat] Note that OpenTracing has merged with OpenCensus to form 
OpenTelemetry.

> OpenTracing support for Solr
> 
>
> Key: SOLR-13434
> URL: https://issues.apache.org/jira/browse/SOLR-13434
> Project: Solr
>  Issue Type: New Feature
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Shalin Shekhar Mangar
>Assignee: Cao Manh Dat
>Priority: Major
> Fix For: master (9.0), 8.2
>
> Attachments: SOLR-13434.patch
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> [OpenTracing|https://opentracing.io/] is a vendor-neutral API and 
> infrastructure for distributed tracing. Many OSS tracers such as Jaeger, 
> OpenZipkin, and Apache SkyWalking, as well as commercial tools, support 
> OpenTracing APIs. Ideally, we can implement it once and have integrations 
> for popular tracers, as we do with metrics and Prometheus.
> I'm aware of SOLR-9641, but HTrace has since been retired from the incubator 
> for lack of activity, so this is a fresh attempt at solving this problem.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Updated] (SOLR-12765) Possibly incorrect format in JMX cache stats

2018-09-13 Thread Otis Gospodnetic (JIRA)



 [ 
https://issues.apache.org/jira/browse/SOLR-12765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Otis Gospodnetic updated SOLR-12765:

Component/s: metrics

> Possibly incorrect format in JMX cache stats
> 
>
> Key: SOLR-12765
> URL: https://issues.apache.org/jira/browse/SOLR-12765
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: metrics
>Affects Versions: 7.4
>Reporter: Bojan Smid
>Priority: Major
>
> I posted a question on the ML 
> (https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201809.mbox/%3CCAGniRXR4Ps%3D03X0uiByCn5ecUT2VY4TLV4iNcxCde3dxBnmC-w%40mail.gmail.com%3E), 
> but didn't get feedback. Since it looks like a possible bug, I am opening 
> a ticket.
>  
>   It seems the format of cache mbeans changed with 7.4.0. From what I can 
> see, a similar change wasn't made for other mbeans, which suggests it was 
> accidental and may be a bug.
>  
>   In Solr 7.3.*, the format was (each attribute on its own, numeric type):
>  
> mbean:
> solr:dom1=core,dom2=gettingstarted,dom3=shard1,dom4=replica_n1,category=CACHE,scope=searcher,name=filterCache
>  
> attributes:
>   lookups java.lang.Long = 0
>   hits java.lang.Long = 0
>   cumulative_evictions java.lang.Long = 0
>   size java.lang.Long = 0
>   hitratio java.lang.Float = 0.0
>   evictions java.lang.Long = 0
>   cumulative_lookups java.lang.Long = 0
>   cumulative_hitratio java.lang.Float = 0.0
>   warmupTime java.lang.Long = 0
>   inserts java.lang.Long = 0
>   cumulative_inserts java.lang.Long = 0
>   cumulative_hits java.lang.Long = 0
>  
>   With 7.4.0 there is a single attribute "Value" (java.lang.Object):
>  
> mbean:
> solr:dom1=core,dom2=gettingstarted,dom3=shard1,dom4=replica_n1,category=CACHE,scope=searcher,name=filterCache
>  
> attributes:
>   Value java.lang.Object = {lookups=0, evictions=0, 
> cumulative_inserts=0, cumulative_hits=0, hits=0, cumulative_evictions=0, 
> size=0, hitratio=0.0, cumulative_lookups=0, cumulative_hitratio=0.0, 
> warmupTime=0, inserts=0}
>  
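For monitoring agents that have to cope with both formats, the 7.4.0 composite value can be unpacked on the client side. The following is a minimal Python sketch, assuming the attribute arrives as the string shown above; the function name `parse_cache_value` is hypothetical and not part of Solr.

```python
def parse_cache_value(text):
    """Turn '{lookups=0, hitratio=0.0, ...}' into a dict of numbers.

    Assumes the flat key=value layout shown in the issue; nested or
    quoted values are not handled.
    """
    body = text.strip().strip("{}")
    stats = {}
    for pair in body.split(","):
        if not pair.strip():
            continue
        key, _, raw = pair.partition("=")
        raw = raw.strip()
        try:
            value = int(raw)      # counters such as lookups, hits
        except ValueError:
            value = float(raw)    # ratios such as hitratio
        stats[key.strip()] = value
    return stats
```

With this, an agent can keep reading `hits` or `hitratio` by name regardless of whether the mbean exposes separate attributes or the single "Value" attribute.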




[jira] [Commented] (SOLR-12765) Possibly incorrect format in JMX cache stats

2018-09-12 Thread Otis Gospodnetic (JIRA)



[ 
https://issues.apache.org/jira/browse/SOLR-12765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16612047#comment-16612047
 ] 

Otis Gospodnetic commented on SOLR-12765:
-

[~ab] is this a bug?  If so, we could try to get you a patch/PR.





[jira] [Commented] (SOLR-8274) Add per-request MDC logging based on user-provided value.

2018-05-01 Thread Otis Gospodnetic (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-8274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16460158#comment-16460158
 ] 

Otis Gospodnetic commented on SOLR-8274:


Perhaps a more modern way to approach this is to instrument Solr.  OpenTracing 
comes to mind.  See 
[https://sematext.com/blog/opentracing-distributed-tracing-emerging-industry-standard/]
 for a quick overview.  See also [https://github.com/opentracing-contrib] 

> Add per-request MDC logging based on user-provided value.
> -
>
> Key: SOLR-8274
> URL: https://issues.apache.org/jira/browse/SOLR-8274
> Project: Solr
>  Issue Type: Improvement
>  Components: logging
>Reporter: Jason Gerlowski
>Priority: Minor
> Attachments: SOLR-8274.patch
>
>
> *Problem 1* Currently, there's no way (AFAIK) to find all log messages 
> associated with a particular request.
> *Problem 2* There's also no easy way for multi-tenant Solr setups to find all 
> log messages associated with a particular customer/tenant.
> Both of these problems would be more manageable if Solr could be configured 
> to record an MDC tag based on a header, or some other user provided value.
> This would allow admins to group together logs about a single request.  If 
> the same header value is repeated multiple times this functionality could 
> also be used to group together arbitrary requests, such as those that come 
> from a particular user, etc.
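The idea in the description can be sketched outside of Solr. The following is a Python analogue, not Solr or SLF4J code: a context variable plays the role of the MDC, and a logging filter stamps every record with it. The header name `X-Request-Id` and all function names are assumptions made for illustration.

```python
import contextvars
import logging

# Holds the tag of the request currently being handled (MDC-like).
request_tag = contextvars.ContextVar("request_tag", default="-")


class RequestTagFilter(logging.Filter):
    """Copy the current request tag onto every log record."""

    def filter(self, record):
        record.request_tag = request_tag.get()
        return True


def build_logger(stream):
    """Create a logger whose output lines start with the request tag."""
    logger = logging.getLogger("mdc_sketch")
    logger.handlers.clear()
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.addFilter(RequestTagFilter())
    handler.setFormatter(logging.Formatter("[%(request_tag)s] %(message)s"))
    logger.addHandler(handler)
    return logger


def handle_request(headers, logger):
    """Tag all log lines of one request with a user-provided header value."""
    token = request_tag.set(headers.get("X-Request-Id", "-"))
    try:
        logger.info("executing query")
    finally:
        request_tag.reset(token)  # do not leak the tag into the next request
```

Grouping by customer or tenant works the same way: if clients send the same header value on every request, all of their log lines share one tag.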




[jira] [Commented] (SOLR-11779) Basic long-term collection of aggregated metrics

2018-03-19 Thread Otis Gospodnetic (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-11779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16405324#comment-16405324
 ] 

Otis Gospodnetic commented on SOLR-11779:
-

IMHO don't do it.  Investing in APIs and building tools around Solr that 
consume Solr metrics, events, etc. is a much better investment than keeping 
things self-contained.  A platform and the ecosystem it enables win over a 
tool that tries to do everything.

> Basic long-term collection of aggregated metrics
> 
>
> Key: SOLR-11779
> URL: https://issues.apache.org/jira/browse/SOLR-11779
> Project: Solr
>  Issue Type: New Feature
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: metrics
>Affects Versions: 7.3, master (8.0)
>Reporter: Andrzej Bialecki 
>Assignee: Andrzej Bialecki 
>Priority: Major
>
> Tracking the key metrics over time is very helpful in understanding the 
> cluster and user behavior.
> Currently even basic metrics tracking requires setting up an external system 
> and either polling {{/admin/metrics}} or using {{SolrMetricReporter}}-s. The 
> advantage of this setup is that these external tools usually provide a lot of 
> sophisticated functionality. The downside is that they don't ship out of the 
> box with Solr and require additional admin effort to set up.
> Solr could collect some of the key metrics and keep their historical values 
> in a round-robin database (e.g. using RRD4j) to keep the size of the historic 
> data constant (e.g. ~64kB per metric), while at the same time providing 
> useful out-of-the-box insights into the basic system behavior over time. This 
> data could be persisted to the {{.system}} collection as blobs, and it could 
> also be presented in the Admin UI as graphs.
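The constant-size property of a round-robin database mentioned in the description can be illustrated with a fixed-capacity buffer. This is a minimal Python sketch, not RRD4j: real round-robin databases also consolidate old samples into coarser archives, which is omitted here, and the class name `RoundRobinSeries` is hypothetical.

```python
from collections import deque


class RoundRobinSeries:
    """Keep only the most recent `slots` samples of one metric."""

    def __init__(self, slots):
        # A deque with maxlen silently drops the oldest entry when full,
        # so memory use stays constant no matter how long we record.
        self._samples = deque(maxlen=slots)

    def record(self, timestamp, value):
        self._samples.append((timestamp, value))

    def samples(self):
        return list(self._samples)
```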




[jira] [Commented] (SOLR-11323) Expose cache maxSize and autowarm settings in JMX

2017-09-20 Thread Otis Gospodnetic (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-11323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16174234#comment-16174234
 ] 

Otis Gospodnetic commented on SOLR-11323:
-

[~ab] this is that 1-line change we briefly chatted about in Vegas.  It would 
be great if you could get this into the next Solr 7.x minor release. Thanks.

> Expose cache maxSize and autowarm settings in JMX
> -
>
> Key: SOLR-11323
> URL: https://issues.apache.org/jira/browse/SOLR-11323
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: JMX, metrics
>Affects Versions: 7.0, 7.1
>Reporter: Bojan Smid
>
> Before Solr 7.*, cache maxSize and autowarm settings were exposed in JMX 
> along with cache metrics. There was a textual attribute "description" which 
> could be parsed to extract maxSize and autowarm settings. This was very 
> useful for various monitoring tools since maxSize and autowarm could then be 
> displayed on monitoring charts (one could for example compare current size of 
> some cache to its maxSize without digging through configs to find this 
> setting).
> Ideally maxSize and autowarm count/% would be exposed as two separate 
> attributes, but having a single description field (which can be parsed) would 
> also be better than nothing.




[jira] [Updated] (SOLR-11323) Expose cache maxSize and autowarm settings in JMX

2017-09-05 Thread Otis Gospodnetic (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-11323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Otis Gospodnetic updated SOLR-11323:

Component/s: metrics
 JMX





[jira] [Created] (SOLR-10573) Hide ZooKeeper

2017-04-26 Thread Otis Gospodnetic (JIRA)

Otis Gospodnetic created SOLR-10573:
---

 Summary: Hide ZooKeeper
 Key: SOLR-10573
 URL: https://issues.apache.org/jira/browse/SOLR-10573
 Project: Solr
  Issue Type: New Feature
  Security Level: Public (Default Security Level. Issues are Public)
Reporter: Otis Gospodnetic


It may make sense to either embed ZK in Solr and allow running Solr instances 
with just ZK and no data, or do something else that hides ZK from Solr users...

This is based on the Solr poll that revealed lowish SolrCloud adoption and on 
comments in 
http://search-lucene.com/m/Solr/eHNlm8wPIKJ3v51?subj=Poll+Master+Slave+or+SolrCloud
 showing that people still find SolrCloud complex, at least partly because 
of the external ZK recommendation.

See also: 
http://search-lucene.com/m/Lucene/l6pAi11rBma0gNoI1?subj=SolrCloud+master+mode+planned+





[jira] [Created] (LUCENE-7806) Explore delta of delta encoding

2017-04-25 Thread Otis Gospodnetic (JIRA)

Otis Gospodnetic created LUCENE-7806:


 Summary: Explore delta of delta encoding
 Key: LUCENE-7806
 URL: https://issues.apache.org/jira/browse/LUCENE-7806
 Project: Lucene - Core
  Issue Type: New Feature
Reporter: Otis Gospodnetic


From 
http://search-lucene.com/m/Lucene/l6pAi1YEfXhuGGIl1?subj=Re+Delta+of+delta+encoding

{quote}
delta of delta encoding is one of the Facebook Gorilla tricks that allows it to 
compress 16 bytes into 1.37 bytes on average -- see section 4.1 that describes 
it -- http://www.vldb.org/pvldb/vol8/p1816-teller.pdf

This seems to be aimed at both time fields and numerical values.

https://github.com/burmanm/gorilla-tsc is a recent Java impl
{quote}
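To make the idea concrete, here is a minimal Python sketch of the delta-of-delta transform on integer timestamps. It shows only the arithmetic step; the variable-length bit packing that Gorilla applies on top of it (and that produces the actual compression) is not included, and the function names are made up for this example.

```python
def delta_of_delta_encode(timestamps):
    """Return [first, dod_1, dod_2, ...] for a list of integer timestamps."""
    if not timestamps:
        return []
    encoded = [timestamps[0]]
    prev, prev_delta = timestamps[0], 0
    for ts in timestamps[1:]:
        delta = ts - prev
        encoded.append(delta - prev_delta)  # change of the step, not the step
        prev, prev_delta = ts, delta
    return encoded


def delta_of_delta_decode(encoded):
    """Invert delta_of_delta_encode."""
    if not encoded:
        return []
    timestamps = [encoded[0]]
    delta = 0
    for dod in encoded[1:]:
        delta += dod
        timestamps.append(timestamps[-1] + delta)
    return timestamps
```

For regularly spaced samples almost every encoded value is 0, which is what makes a very short bit encoding possible.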

CC [~jpountz]




[jira] [Commented] (SOLR-10548) hyper-log-log based numBuckets for faceting

2017-04-22 Thread Otis Gospodnetic (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-10548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15980235#comment-15980235
 ] 

Otis Gospodnetic commented on SOLR-10548:
-

A new paper published in January introduced a new cardinality estimation 
algorithm called LogLog-Beta/β:

https://arxiv.org/abs/1612.02284

"The new algorithm uses only one formula and needs no additional bias
corrections for the entire range of cardinalities, therefore, it is more
efficient and simpler to implement. Our simulations show that the accuracy
provided by the new algorithm is as good as or better than the accuracy
provided by either of HyperLogLog or HyperLogLog++."
Some comments about its accuracy (graphs included) can be found in this PR: 
https://github.com/antirez/redis/pull/3677

> hyper-log-log based numBuckets for faceting
> ---
>
> Key: SOLR-10548
> URL: https://issues.apache.org/jira/browse/SOLR-10548
> Project: Solr
>  Issue Type: New Feature
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: Facet Module
>Reporter: Yonik Seeley
>
> numBuckets currently uses an estimate (same as the unique function detailed 
> at http://yonik.com/solr-count-distinct/ ).  We should either change 
> implementations or introduce a way to optionally select a hyper-log-log based 
> approach for a better estimate with high field cardinalities.




[jira] [Updated] (SOLR-10418) metrics should return JVM system properties

2017-04-06 Thread Otis Gospodnetic (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-10418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Otis Gospodnetic updated SOLR-10418:

Component/s: metrics

> metrics should return JVM system properties
> ---
>
> Key: SOLR-10418
> URL: https://issues.apache.org/jira/browse/SOLR-10418
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: metrics
>Reporter: Noble Paul
>Assignee: Andrzej Bialecki 
>
> We need to stop using the custom solution used in rules and start using 
> metrics for everything




[jira] [Commented] (SOLR-10359) User Interactions Logger Component

2017-03-27 Thread Otis Gospodnetic (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-10359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15944247#comment-15944247
 ] 

Otis Gospodnetic commented on SOLR-10359:
-

Solr *could* be used to process and store this data, but would it be better to 
think more about creating a "spec" for this sort of data and pluggable outputs, 
so that people can choose to push their data elsewhere, whether their own 
custom tooling or 3rd party services?

> User Interactions Logger Component
> --
>
> Key: SOLR-10359
> URL: https://issues.apache.org/jira/browse/SOLR-10359
> Project: Solr
>  Issue Type: New Feature
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Alessandro Benedetti
>  Labels: CTR, evaluation
>
> *Introduction*
> Being able to evaluate the quality of your search engine is becoming more and 
> more important day by day.
> This issue is to put a milestone to integrate online evaluation metrics with 
> Solr.
> *Scope*
> The scope of this issue is to provide a set of components able to:
> 1) Collect search result impressions (results shown per query)
> 2) Collect user interactions (user interactions on the search results per 
> query, e.g. clicks, bookmarking, etc.)
> 3) Calculate evaluation metrics on demand, such as Click Through Rate, DCG ...
> *Technical Design*
> A SearchComponent can be designed :
> *UsersEventsLoggerComponent*
> A property (such as storeDir) will define where the data collected will be 
> stored.
> Different data structures can be explored, to keep it simple, a first 
> implementation can be a Lucene Index.
> *Data Model*
> The user event can be modelled in the following way :
>  - the user query the event is related to
>  - the ID of the search result involved in the interaction
>  - the position in the ranking of the search result involved 
> in the interaction
>  - time when the interaction happened
>  - 0 for impressions, a value between 1-5 to identify the 
> type of user event, the semantic will depend on the domain and use cases
>  - this can identify a variant, in A/B testing
> *Impressions Logging*
> When the SearchComponent is assigned to a request handler, every time it 
> processes a request and returns a result set for a query to the user, the 
> component will collect the impressions (results returned) and index them in 
> the auxiliary Lucene index.
> This will happen in parallel, as soon as the results are returned, to avoid 
> affecting the query time.
> Of course, an impact on CPU load and memory is expected; it will be 
> interesting to minimise it.
> *User Events Logging*
> An UpdateHandler will be exposed to accept POST requests and collect user 
> events.
> Every time a request is sent, the user event will be indexed in the 
> underlying auxiliary Lucene index.
> *Stats Calculation*
> A RequestHandler will be exposed to calculate stats and aggregations for 
> the metrics:
> /evaluation?metric=ctr&stats=query&compare=testA,testB
> This request could calculate the CTR for our testA and testB to compare.
> Showing stats in total and per query ( to highlight the queries with 
> lower/higher CTR).
> The calculations will happen separating the  for an easy 
> comparison.
> It will be important to keep it as simple as possible for a first version, 
> and then extend it as much as we like.
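The click-through-rate comparison described above boils down to counting impressions and clicks per variant. A minimal Python sketch follows, assuming events arrive as (variant, event type) pairs where 0 marks an impression, as in the data model; treating event type 1 as a click is an assumption, since the description leaves the meaning of types 1-5 to the domain, and the function name is hypothetical.

```python
from collections import defaultdict


def ctr_by_variant(events, click_type=1):
    """Return {variant: clicks / impressions} for A/B comparison."""
    impressions = defaultdict(int)
    clicks = defaultdict(int)
    for variant, event_type in events:
        if event_type == 0:
            impressions[variant] += 1
        elif event_type == click_type:
            clicks[variant] += 1
    # Only variants with at least one impression appear in the result.
    return {variant: clicks[variant] / shown
            for variant, shown in impressions.items()}
```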




[jira] [Commented] (SOLR-10247) Support non-numeric metrics and a "compact" format of /admin/metrics

2017-03-15 Thread Otis Gospodnetic (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-10247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15926832#comment-15926832
 ] 

Otis Gospodnetic commented on SOLR-10247:
-

Short version - /cat provides table-like output - columns, with optional 
header, more or less verbose.  Handy for piping into sort and friends that 
humans like to use, but also handy for agents because its output is 
simpler/cheaper to parse than JSON.
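To illustrate why such output is cheap for agents to consume, here is a minimal Python sketch that parses a whitespace-separated table with a header row. The column names in the usage note are invented for the example; they are not actual Solr or ES output.

```python
def parse_table(text):
    """Parse column-style output (header row first) into a list of dicts."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return []
    header = lines[0].split()
    # Values stay strings; the caller converts the columns it cares about.
    return [dict(zip(header, line.split())) for line in lines[1:]]
```

For a hypothetical response such as "core requests\nshard1 10\nshard2 12", this yields one dict per row keyed by the header names, with no JSON parser involved.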

> Support non-numeric metrics and a "compact" format of /admin/metrics
> 
>
> Key: SOLR-10247
> URL: https://issues.apache.org/jira/browse/SOLR-10247
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: metrics
>Reporter: Andrzej Bialecki 
>Assignee: Andrzej Bialecki 
> Fix For: 6.5, master (7.0)
>
> Attachments: compactFormat.png, currentFormat.png, SOLR-10247.patch, 
> SOLR-10247.patch
>
>
> Metrics API currently supports only numeric values. However, it's useful also 
> to report non-numeric values such as eg. version, disk type, component state, 
> some system properties, etc.
> Codahale {{Gauge}} metric type can be used for this purpose, and 
> convenience methods can be added to {{SolrMetricManager}} to make it easier 
> to use.




[jira] [Commented] (SOLR-10247) Support non-numeric metrics and a "compact" format of /admin/metrics

2017-03-15 Thread Otis Gospodnetic (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-10247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15926791#comment-15926791
 ] 

Otis Gospodnetic commented on SOLR-10247:
-

bq. "compact" format of /admin/metrics
Something like /cat in ES or something different? /cat in ES is handy...





[jira] [Commented] (SOLR-10262) Collect request latency metrics for histograms

2017-03-10 Thread Otis Gospodnetic (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-10262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15905566#comment-15905566
 ] 

Otis Gospodnetic commented on SOLR-10262:
-

bq. This would have to be configured early on in solr.xml or even via system 
properties, which is a bit ugly.
Not sure what exactly you mean by this, but I don't think it should be the new 
default because of 
http://search-lucene.com/m/Lucene/l6pAi15LobI6m5Ny1?subj=Solr+JMX+changes+and+backwards+in+compatibility
 .  I am hoping it can be added to whatever is already there.  Then people and 
tools that monitor Solr can decide which data they want to collect.  The old 
stuff could be marked/announced as deprecated if we really don't want/need that 
data, and removed in one of the future releases.

> Collect request latency metrics for histograms
> --
>
> Key: SOLR-10262
> URL: https://issues.apache.org/jira/browse/SOLR-10262
> Project: Solr
>  Issue Type: New Feature
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: metrics
>Reporter: Otis Gospodnetic
>Assignee: Andrzej Bialecki 
>
> Since [~ab] is on a roll with metrics...
> There is no way to accurately compute request latency percentiles from 
> metrics exposed by Solr today. We should consider making that possible. c.f. 
> https://github.com/HdrHistogram/HdrHistogram




[jira] [Created] (SOLR-10262) Collect request latency metrics for histograms

2017-03-09 Thread Otis Gospodnetic (JIRA)

Otis Gospodnetic created SOLR-10262:
---

 Summary: Collect request latency metrics for histograms
 Key: SOLR-10262
 URL: https://issues.apache.org/jira/browse/SOLR-10262
 Project: Solr
  Issue Type: New Feature
  Security Level: Public (Default Security Level. Issues are Public)
  Components: metrics
Reporter: Otis Gospodnetic


Since [~ab] is on a roll with metrics...
There is no way to accurately compute request latency percentiles from metrics 
exposed by Solr today. We should consider making that possible. c.f. 
https://github.com/HdrHistogram/HdrHistogram
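Percentiles cannot be recovered from averages, but they can be read from bucket counts. The following is a rough Python sketch of a fixed-bucket latency histogram; it is not HdrHistogram's API or algorithm (HdrHistogram uses a different bucket layout with configurable precision), and the class and bucket bounds are assumptions for illustration.

```python
import bisect
import math


class LatencyHistogram:
    """Count latencies in buckets defined by their upper bounds (ms)."""

    def __init__(self, bounds_ms):
        self.bounds = sorted(bounds_ms)
        # One extra bucket collects values above the largest bound.
        self.counts = [0] * (len(self.bounds) + 1)

    def record(self, latency_ms):
        self.counts[bisect.bisect_left(self.bounds, latency_ms)] += 1

    def percentile(self, p):
        """Upper bound of the bucket containing the p-th percentile."""
        total = sum(self.counts)
        if total == 0:
            return None
        rank = max(1, math.ceil(total * p / 100))
        seen = 0
        for index, count in enumerate(self.counts):
            seen += count
            if seen >= rank:
                if index < len(self.bounds):
                    return self.bounds[index]
                return math.inf  # fell into the overflow bucket
```

Because bucket counts from several intervals or nodes can simply be added, a monitoring tool can compute percentiles over any window it likes, which is not possible with pre-computed averages.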





[jira] [Updated] (SOLR-10226) JMX metric avgTimePerRequest broken

2017-03-03 Thread Otis Gospodnetic (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-10226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Otis Gospodnetic updated SOLR-10226:

Description: 
JMX Metric avgTimePerRequest (of 
org.apache.solr.handler.component.SearchHandler) doesn't appear to behave 
correctly anymore. It was a cumulative value in pre-6.4 versions. Since 
totalTime metric was removed (which was a base for monitoring calculations), 
avgTimePerRequest seems like possible alternative to calculate "time spent in 
requests since last measurement", but it behaves strangely after 6.4.

I did a simple test on gettingstarted collection (just unpacked the Solr 6.4.1 
version and started it with "bin/solr start -e cloud -noprompt"). The query I 
used was:
http://localhost:8983/solr/gettingstarted/select?indent=on&q=*:*&wt=json
I ran it 30 times in a row (with approx. 1 sec between executions).

At the same time I was looking (with jconsole) at bean 
solr/gettingstarted_shard2_replica2:type=/select,id=org.apache.solr.handler.component.SearchHandler

Here is how metric was changing over time (first number is "requests" metric, 
second number is "avgTimePerRequest"):
10   6.6033
12   5.9557
13   0.9015---> 13th req would need negative duration if this was cumulative
15   6.7315
16   7.4873
17   0.8458---> same case with 17th request
23   6.1076

At the same time bean 
solr/gettingstarted_shard1_replica2:type=/select,id=org.apache.solr.handler.component.SearchHandler
  also showed strange values:
6   5.13482
8   10.5694
9   0.504
10  0.344
12  8.8121
18  3.3531

CC [~ab]

  was:
JMX Metric avgTimePerRequest (of 
org.apache.solr.handler.component.SearchHandler) doesn't appear to behave 
correctly anymore. It was a cumulative value in pre-6.4 versions. Since 
totalTime metric was removed (which was a base for monitoring calculations), 
avgTimePerRequest seems like possible alternative to calculate "time spent in 
requests since last measurement", but it behaves strangely after 6.4.

I did a simple test on gettingstarted collection (just unpacked the Solr 6.4.1 
version and started it with "bin/solr start -e cloud -noprompt"). The query I 
used was:
http://localhost:8983/solr/gettingstarted/select?indent=on&q=*:*&wt=json
I ran it 30 times in a row (with approx. 1 sec between executions).

At the same time I was looking (with jconsole) at bean 
solr/gettingstarted_shard2_replica2:type=/select,id=org.apache.solr.handler.component.SearchHandler

Here is how metric was changing over time (first number is "requests" metric, 
second number is "avgTimePerRequest"):
10   6.6033
12   5.9557
13   0.9015---> 13th req would need negative duration if this was cumulative
15   6.7315
16   7.4873
17   0.8458---> same case with 17th request
23   6.1076

At the same time bean 
solr/gettingstarted_shard1_replica2:type=/select,id=org.apache.solr.handler.component.SearchHandler
  also showed strange values:
6   5.13482
8   10.5694
9   0.504
10  0.344
12  8.8121
18  3.3531


> JMX metric avgTimePerRequest broken
> ---
>
> Key: SOLR-10226
> URL: https://issues.apache.org/jira/browse/SOLR-10226
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: metrics
>Affects Versions: 6.4.1
>Reporter: Bojan Smid
>
> JMX Metric avgTimePerRequest (of 
> org.apache.solr.handler.component.SearchHandler) doesn't appear to behave 
> correctly anymore. It was a cumulative value in pre-6.4 versions. Since 
> totalTime metric was removed (which was a base for monitoring calculations), 
> avgTimePerRequest seems like possible alternative to calculate "time spent in 
> requests since last measurement", but it behaves strangely after 6.4.
> I did a simple test on gettingstarted collection (just unpacked the Solr 
> 6.4.1 version and started it with "bin/solr start -e cloud -noprompt"). The 
> query I used was:
> http://localhost:8983/solr/gettingstarted/select?indent=on&q=*:*&wt=json
> I ran it 30 times in a row (with approx. 1 sec between executions).
> At the same time I was looking (with jconsole) at bean 
> solr/gettingstarted_shard2_replica2:type=/select,id=org.apache.solr.handler.component.SearchHandler
> Here is how metric was changing over time (first number is "requests" metric, 
> second number is "avgTimePerRequest"):
> 10   6.6033
> 12   5.9557
> 13   0.9015---> 13th req would need negative duration if this was 
> cumulative
> 15   6.7315
> 16   7.4873
> 17   0.8458---> same case with 17th request
> 23   6.1076
> At the same time bean 
> solr/gettingstarted_shard1_replica2:type=/select,id=org.apache.solr.handler.component.SearchHandler
>   also showed strange values:
> 65.13482
> 810.5694
> 90.504
> 10  0.344
> 12  8.8121
> 18  3.3531
> CC [~ab]




[jira] [Issue Comment Deleted] (SOLR-9898) Documentation for metrics collection and /admin/metrics

2017-03-02 Thread Otis Gospodnetic (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-9898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Otis Gospodnetic updated SOLR-9898:
---
Comment: was deleted

(was: While I love all the new metrics you are adding, I think metrics should 
be treated like code/features in terms of how backwards 
compatibility/deprecation is handled.  Otherwise, on upgrade, people's 
monitoring breaks and monitoring is kind of important... :)

Note: Looks like recent metrics changes broke/changed previously-existing 
MBeans...)

> Documentation for metrics collection and /admin/metrics
> ---
>
> Key: SOLR-9898
> URL: https://issues.apache.org/jira/browse/SOLR-9898
> Project: Solr
>  Issue Type: Task
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: documentation
>Affects Versions: 6.4, master (7.0)
>Reporter: Andrzej Bialecki 
>Assignee: Cassandra Targett
>
> Draft documentation follows.




[jira] [Commented] (SOLR-9898) Documentation for metrics collection and /admin/metrics

2017-03-02 Thread Otis Gospodnetic (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-9898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15892436#comment-15892436
 ] 

Otis Gospodnetic commented on SOLR-9898:


While I love all the new metrics you are adding, I think metrics should be 
treated like code/features in terms of how backwards compatibility/deprecation 
is handled.  Otherwise, on upgrade, people's monitoring breaks and 
monitoring is kind of important... :)

Note: Looks like recent metrics changes broke/changed previously-existing 
MBeans...

> Documentation for metrics collection and /admin/metrics
> ---
>
> Key: SOLR-9898
> URL: https://issues.apache.org/jira/browse/SOLR-9898
> Project: Solr
>  Issue Type: Task
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: documentation
>Affects Versions: 6.4, master (7.0)
>Reporter: Andrzej Bialecki 
>Assignee: Cassandra Targett
>
> Draft documentation follows.




[jira] [Updated] (SOLR-10214) clean up BlockCache metrics

2017-02-28 Thread Otis Gospodnetic (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-10214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Otis Gospodnetic updated SOLR-10214:

Component/s: metrics

> clean up BlockCache metrics
> ---
>
> Key: SOLR-10214
> URL: https://issues.apache.org/jira/browse/SOLR-10214
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: metrics
>Reporter: Yonik Seeley
> Attachments: SOLR-10214.patch
>
>
> Many (most) of the block cache metrics are unused (I assume just inherited 
> from Blur) and unmaintained (i.e. most will be 0).  Currently only the size 
> and number of evictions are tracked.
> We should remove unused stats and start tracking
> - number of lookups (or number of misses)
> - number of hits
> - number of inserts
> - number of store failures
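
A minimal sketch of the counters proposed above, assuming plain LongAdder fields; BlockCacheStats and its method names are hypothetical and are not taken from the attached patch or from Solr's BlockCache code.

```java
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical holder for the four proposed block cache counters. */
final class BlockCacheStats {
    final LongAdder lookups = new LongAdder();
    final LongAdder hits = new LongAdder();
    final LongAdder inserts = new LongAdder();
    final LongAdder storeFailures = new LongAdder();

    /** Records one lookup and whether it was served from the cache. */
    void recordLookup(boolean hit) {
        lookups.increment();
        if (hit) {
            hits.increment();
        }
    }

    /** Records one attempted store and whether the cache accepted the block. */
    void recordStore(boolean stored) {
        if (stored) {
            inserts.increment();
        } else {
            storeFailures.increment();
        }
    }

    /** Misses need no counter of their own: they are lookups minus hits. */
    long misses() {
        return lookups.sum() - hits.sum();
    }

    /** Fraction of lookups served from the cache, or 0 when nothing was looked up. */
    double hitRatio() {
        long n = lookups.sum();
        return n == 0 ? 0.0 : (double) hits.sum() / n;
    }
}
```

Tracking lookups and hits is enough to derive misses, which is why the list above offers "number of lookups (or number of misses)" as alternatives.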




[jira] [Updated] (SOLR-10029) Fix search link on http://lucene.apache.org/

2017-02-07 Thread Otis Gospodnetic (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-10029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Otis Gospodnetic updated SOLR-10029:

Summary: Fix search link on http://lucene.apache.org/  (was: Fix Search 
link in http://lucene.apache.org/ & http://lucene.apache.org/)

> Fix search link on http://lucene.apache.org/
> 
>
> Key: SOLR-10029
> URL: https://issues.apache.org/jira/browse/SOLR-10029
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: website
>Reporter: Tien Nguyen Manh
> Attachments: SOLR-10029.patch
>
>
> The current links to http://search-lucene.com are
> http://search-lucene.com/lucene?q=apache&searchProvider=sl
> http://search-lucene.com/solr?q=apache
> The project names should be capitalized:
> http://search-lucene.com/Lucene?q=apache&searchProvider=sl
> http://search-lucene.com/Solr?q=apache




[jira] [Commented] (SOLR-10091) Support for CDCR using an external queueing service

2017-02-03 Thread Otis Gospodnetic (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-10091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15852310#comment-15852310
 ] 

Otis Gospodnetic commented on SOLR-10091:
-

Would this create a dependency on (a specific version of) Kafka?  You may want 
to run that by dev@.

> Support for CDCR using an external queueing service
> ---
>
> Key: SOLR-10091
> URL: https://issues.apache.org/jira/browse/SOLR-10091
> Project: Solr
>  Issue Type: New Feature
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: CDCR
>Affects Versions: 6.x
>Reporter: Oliver Bates
>Priority: Minor
>  Labels: features
>
> The idea is to contribute part of the work presented here:
> https://www.youtube.com/watch?v=83vbY9f3nXA
> Specifically these components:
> - update processor that writes updates to an external queueing service 
> (abstracted by an interface)
> - a Kafka implementation of this interface (that goes into /contrib?) so 
> anyone using kafka can use this "out of the box"
> - a consumer application
> For the consumer application, the idea is an app that's queue-agnostic and 
> then the queue-specific consumer bit is loaded at runtime. In this case, 
> there's a "default" kafka consumer in there as well.
> I'm not exactly sure what the best structure would be for these pieces (the 
> kafka implementations and the general consumer app code), so I'll simply post 
> class definitions here and let the community decide where they should go.
> The core work is finished. I just need to clean it up a bit and convert the 
> tests to fit this repo (right now they're using an external framework).




[jira] [Resolved] (SOLR-4500) How can we integrate LDAP authentiocation with the Solr instance

2017-01-26 Thread Otis Gospodnetic (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-4500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Otis Gospodnetic resolved SOLR-4500.

Resolution: Invalid

Questions => mailing list

> How can we integrate LDAP authentiocation with the Solr instance
> 
>
> Key: SOLR-4500
> URL: https://issues.apache.org/jira/browse/SOLR-4500
> Project: Solr
>  Issue Type: Task
>Affects Versions: 4.1
>Reporter: Srividhya
>





[jira] [Commented] (SOLR-9880) Add Ganglia and Graphite metrics reporters

2016-12-21 Thread Otis Gospodnetic (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-9880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15768143#comment-15768143
 ] 

Otis Gospodnetic commented on SOLR-9880:


May I recommend a different approach?  With the current approach, somebody will 
always come in and ask for an additional reporter.  That typically won't rank 
high among Solr devs' interests, and it will require more work, additional 
dependencies, etc.  Moreover, if you do this, you have to think about things 
like the destination not being available, on-disk buffering so data is not 
lost, purging the buffered data if there is too much of it, and so on.  Solr 
doesn't want to be in the data-shipper business.

My suggestion, based on working with monitoring and logging for the last N 
years: just log metrics to a file.  There are a number of modern tools that 
know how to tail files, parse their content, ship it somewhere, buffer, write 
to multiple outputs, and so on.  Just make sure the data is nicely structured 
to make parsing easy, and laid out so that when you add more metrics you can do 
it in a backwards-compatible way.
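
To illustrate the "just log metrics to a file" idea, here is a minimal sketch using only the JDK. MetricsFileLogger, the key names, and the registry/metric names in main are all made up for the example, and it assumes names contain no quotes or backslashes (it does no JSON escaping).

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.time.Instant;
import java.util.Locale;

/** Appends one self-describing line per metric sample so a log shipper can tail and parse the file. */
final class MetricsFileLogger {

    /**
     * Formats a sample as a single JSON object. Because every value is keyed,
     * new keys can be added later without breaking parsers that ignore unknown keys.
     */
    static String formatLine(Instant ts, String registry, String metric, double value) {
        return String.format(Locale.ROOT,
                "{\"ts\":\"%s\",\"registry\":\"%s\",\"metric\":\"%s\",\"value\":%s}",
                ts, registry, metric, Double.toString(value));
    }

    /** Appends the line to the file, creating the file if it does not exist yet. */
    static void append(Path file, String line) throws IOException {
        Files.write(file,
                (line + System.lineSeparator()).getBytes(StandardCharsets.UTF_8),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("solr-metrics", ".log");
        // Hypothetical registry and metric names, for illustration only.
        append(file, formatLine(Instant.now(), "solr.core.gettingstarted", "select.requests", 23));
        System.out.println(Files.readAllLines(file, StandardCharsets.UTF_8));
    }
}
```

Buffering, retries and delivery are then left to whatever tool tails the file, which is the point of the suggestion.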

> Add Ganglia and Graphite metrics reporters
> --
>
> Key: SOLR-9880
> URL: https://issues.apache.org/jira/browse/SOLR-9880
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: metrics
>Reporter: Andrzej Bialecki 
>Assignee: Andrzej Bialecki 
>Priority: Minor
> Fix For: master (7.0), 6.4
>
>
> Originally SOLR-4735 provided implementations for these reporters (wrappers 
> for Dropwizard components to use {{SolrMetricReporter}} API).
> However, this functionality has been split into its own issue due to the 
> additional transitive dependencies that these reporters bring:
> * Ganglia:
> ** metrics-ganglia, ASL, 3kB
> ** gmetric4j (Ganglia RPC implementation), BSD, 29kB
> * Graphite
> ** metrics-graphite, ASL, 10kB
> ** amqp-client (RabbitMQ Java client, marked optional in pom?), ASL/MIT/GPL2, 
> 190kB
> IMHO these are not very large dependencies, and given the useful 
> functionality they provide it's worth adding them.




[jira] [Updated] (SOLR-9805) Use metrics-jvm library to instrument jvm internals

2016-12-17 Thread Otis Gospodnetic (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-9805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Otis Gospodnetic updated SOLR-9805:
---
Component/s: metrics

> Use metrics-jvm library to instrument jvm internals
> ---
>
> Key: SOLR-9805
> URL: https://issues.apache.org/jira/browse/SOLR-9805
> Project: Solr
>  Issue Type: New Feature
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: metrics
>Reporter: Shalin Shekhar Mangar
>Assignee: Shalin Shekhar Mangar
> Fix For: master (7.0), 6.4
>
> Attachments: SOLR-9805.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> See http://metrics.dropwizard.io/3.1.0/manual/jvm/




[jira] [Updated] (SOLR-8785) Use Metrics library for core metrics

2016-12-17 Thread Otis Gospodnetic (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-8785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Otis Gospodnetic updated SOLR-8785:
---
Component/s: metrics

> Use Metrics library for core metrics
> 
>
> Key: SOLR-8785
> URL: https://issues.apache.org/jira/browse/SOLR-8785
> Project: Solr
>  Issue Type: Improvement
>  Components: metrics
>Affects Versions: 4.1
>Reporter: Jeff Wartes
>Assignee: Shalin Shekhar Mangar
>  Labels: patch, patch-available
> Fix For: master (7.0), 6.4
>
> Attachments: SOLR-8785-increment.patch, SOLR-8785.patch, 
> SOLR-8785.patch, SOLR-8785.patch, SOLR_8775_rates_per_minute_fix.patch
>
>
> The Metrics library (https://dropwizard.github.io/metrics/3.1.0/) is a 
> well-known way to track metrics about applications. 
> In SOLR-1972, latency percentile tracking was added. The comment list is 
> long, so here’s my synopsis:
> 1. An attempt was made to use the Metrics library
> 2. That attempt failed due to a memory leak in Metrics v2.1.1
> 3. Large parts of Metrics were then copied wholesale into the 
> org.apache.solr.util.stats package space and that was used instead.
> Copy/pasting Metrics code into Solr may have been the correct solution at the 
> time, but I submit that it isn’t correct any more. 
> The leak in Metrics was fixed even before SOLR-1972 was released, and by 
> copy/pasting a subset of the functionality, we miss access to other important 
> things that the Metrics library provides, particularly the concept of a 
> Reporter. (https://dropwizard.github.io/metrics/3.1.0/manual/core/#reporters)
> Further, Metrics v3.0.2 is already packaged with Solr anyway, because it’s 
> used in two contrib modules. (map-reduce and morphines-core)
> I’m proposing that:
> 1. Metrics as bundled with Solr be upgraded to the current v3.1.2
> 2. Most of the org.apache.solr.util.stats package space be deleted outright, 
> or gutted and replaced with simple calls to Metrics. Due to the copy/paste 
> origin, the concepts should mostly map 1:1.
> I’d further recommend a usage pattern like:
> SharedMetricRegistries.getOrCreate(System.getProperty("solr.metrics.registry",
>  "solr-registry"))
> There are all kinds of areas in Solr that could benefit from metrics tracking 
> and reporting. This pattern allows diverse areas of code to track metrics 
> within a single, named registry. This well-known-name then becomes a handle 
> you can use to easily attach a Reporter and ship all of those metrics off-box.
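
The "single, named registry" idea can be sketched with JDK classes alone. Note that this is a stand-in written for illustration, not the Dropwizard Metrics classes: NamedRegistries and Registry are hypothetical, and only the system property name and its default come from the description above.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.LongAdder;

/** Stand-in for a shared, named metric registry; not the Dropwizard Metrics API. */
final class NamedRegistries {

    /** In this sketch a registry is just a map from metric name to counter. */
    static final class Registry {
        private final ConcurrentMap<String, LongAdder> counters = new ConcurrentHashMap<>();

        /** Returns the counter with this name, creating it on first use. */
        LongAdder counter(String name) {
            return counters.computeIfAbsent(name, n -> new LongAdder());
        }
    }

    private static final ConcurrentMap<String, Registry> REGISTRIES = new ConcurrentHashMap<>();

    /** Returns the registry with this name, creating it on first use. */
    static Registry getOrCreate(String name) {
        return REGISTRIES.computeIfAbsent(name, n -> new Registry());
    }

    public static void main(String[] args) {
        String name = System.getProperty("solr.metrics.registry", "solr-registry");
        // Two unrelated call sites reach the same registry through the well-known name.
        getOrCreate(name).counter("select.requests").increment();
        getOrCreate(name).counter("select.requests").increment();
        System.out.println(getOrCreate(name).counter("select.requests").sum());
    }
}
```

Because every call site resolves the registry by name, a reporter only needs that same name to see all metrics recorded anywhere in the process.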




[jira] [Updated] (SOLR-9812) Implement a /admin/metrics API

2016-12-17 Thread Otis Gospodnetic (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-9812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Otis Gospodnetic updated SOLR-9812:
---
Component/s: metrics

> Implement a /admin/metrics API
> --
>
> Key: SOLR-9812
> URL: https://issues.apache.org/jira/browse/SOLR-9812
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: metrics
>Reporter: Shalin Shekhar Mangar
>Assignee: Shalin Shekhar Mangar
> Fix For: master (7.0), 6.4
>
> Attachments: SOLR-9812.patch, SOLR-9812.patch
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> We added a bare bones metrics API in SOLR-9788 but due to limitations with 
> the metrics servlet supplied by the metrics library, it can show statistics 
> from only one metric registry. SOLR-4735 has added a hierarchy of metric 
> registries and the /admin/metrics API should support showing all of them as 
> well as be able to filter metrics from a given registry name.
> In this issue we will implement the improved /admin/metrics API.




[jira] [Updated] (SOLR-9788) Use instrumented jetty classes

2016-12-17 Thread Otis Gospodnetic (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-9788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Otis Gospodnetic updated SOLR-9788:
---
Component/s: metrics

> Use instrumented jetty classes
> --
>
> Key: SOLR-9788
> URL: https://issues.apache.org/jira/browse/SOLR-9788
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: metrics
>Reporter: Shalin Shekhar Mangar
>Assignee: Shalin Shekhar Mangar
> Fix For: master (7.0), 6.4
>
> Attachments: SOLR_9788.patch, SOLR_9788.patch, SOLR_9788.patch
>
>
> Dropwizard metrics library integrated in SOLR-8785 provides a set of 
> instrumented equivalents of Jetty classes. This allows us to collect 
> statistics on  Jetty's connector, thread pool and handlers.




[jira] [Commented] (SOLR-9599) DocValues performance regression with new iterator API

2016-12-15 Thread Otis Gospodnetic (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-9599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15752588#comment-15752588
 ] 

Otis Gospodnetic commented on SOLR-9599:


[~ysee...@gmail.com] all sub-tasks seem to be done/resolved.  Should this 
then be resolved, too?

> DocValues performance regression with new iterator API
> --
>
> Key: SOLR-9599
> URL: https://issues.apache.org/jira/browse/SOLR-9599
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: master (7.0)
>Reporter: Yonik Seeley
> Fix For: master (7.0)
>
>
> I did a quick performance comparison of faceting indexed fields (i.e. 
> docvalues are not stored) using method=dv before and after the new docvalues 
> iterator went in (LUCENE-7407).
> 5M document index, 21 segments, single valued string fields w/ no missing 
> values.
> || field cardinality || new_time / old_time ||
> |10|2.01|
> |1000|2.02|
> |1|1.85|
> |10|1.56|
> |100|1.31|
> So unfortunately, often twice as slow.
> See followup messages for tests using real docvalues as well.




[jira] [Resolved] (LUCENE-7253) Make sparse doc values and segments merging more efficient

2016-12-15 Thread Otis Gospodnetic (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-7253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Otis Gospodnetic resolved LUCENE-7253.
--
Resolution: Duplicate
  Assignee: Michael McCandless

LUCENE-7457 and many others actually took care of the issue reported here.

> Make sparse doc values and segments merging more efficient 
> ---
>
> Key: LUCENE-7253
> URL: https://issues.apache.org/jira/browse/LUCENE-7253
> Project: Lucene - Core
>  Issue Type: Improvement
>Affects Versions: 5.5, 6.0
>Reporter: Pawel Rog
>Assignee: Michael McCandless
>  Labels: performance
> Fix For: master (7.0)
>
>
> Doc Values were optimized recently to efficiently store sparse data. 
> Unfortunately there is still a big problem with Doc Values merges for sparse 
> fields. For a 1 billion document index, it seems not to matter whether all 
> documents have a value for this field or only 1 document does: segment merge 
> time is the same in both cases. In most cases this is not a problem, but 
> there are several cases in which one can expect to have many fields with 
> sparse doc values.
> I can describe an example. During performance tests of a system with a large 
> number of sparse fields I realized that Doc Values merges are a bottleneck. I 
> had hundreds of different numeric fields. Each document contained only a small 
> subset of all fields; the average document contained 5-7 different numeric 
> values. As you can see, data was very sparse in these fields. It turned out 
> that the ingestion process was CPU-bound. Most of the CPU time was spent in 
> DocValues related methods (SingletonSortedNumericDocValues#setDocument, 
> DocValuesConsumer$10$1#next, DocValuesConsumer#isSingleValued, 
> DocValuesConsumer$4$1#setNext, ...), mostly while merging segments.
> Adrien Grand suggested reducing the number of sparse fields and replacing them 
> with a smaller number of denser fields. This helped a lot but complicated 
> field naming. 
> I am not very familiar with the Doc Values source code, but I have a small 
> suggestion for how to improve Doc Values merges for sparse fields. I realized 
> that Doc Values producers and consumers use Iterators. Let's take the example 
> of numeric Doc Values. Would it be possible to replace the Iterator that 
> "travels" through all documents with an Iterator over the collection of 
> non-empty values? Of course this would require storing an object (instead of 
> a numeric) that contains the value and the document ID. Such an iterator could 
> significantly improve merge time of sparse Doc Values fields. IMHO this won't 
> cause big overhead for dense structures, but it can be a game changer for 
> sparse structures.
> This is what happens in NumericDocValuesWriter on flush:
> {code}
> dvConsumer.addNumericField(fieldInfo,
>     new Iterable<Number>() {
>       @Override
>       public Iterator<Number> iterator() {
>         return new NumericIterator(maxDoc, values, docsWithField);
>       }
>     });
> {code}
> Before this, during addValue, this loop is executed to fill holes:
> {code}
> // Fill in any holes:
> for (int i = (int)pending.size(); i < docID; ++i) {
>   pending.add(MISSING);
> }
> {code}
> It turns out that the variable called pending is used only internally in 
> NumericDocValuesWriter. I know pending is a PackedLongValues and it wouldn't 
> be good to replace it with a different class (some kind of list) because this 
> may hurt DV performance for dense fields. I hope someone can suggest 
> interesting solutions for this problem :).
> It would be great if a discussion about sparse Doc Values merge performance 
> could start here.
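
The difference between the two layouts discussed above can be sketched as follows. This is only an illustration of the idea, not Lucene code: SparseNumericValues, its methods and the MISSING value are hypothetical, and docIds is assumed to be sorted in ascending order.

```java
import java.util.ArrayList;
import java.util.List;

/** Contrasts a dense, hole-filled layout with a sparse (docID, value) layout. */
final class SparseNumericValues {
    static final long MISSING = 0L;

    /** Dense layout: one slot per document, holes filled with MISSING. */
    static List<Long> dense(int maxDoc, int[] docIds, long[] values) {
        List<Long> pending = new ArrayList<>(maxDoc);
        int next = 0;
        for (int doc = 0; doc < maxDoc; doc++) {
            if (next < docIds.length && docIds[next] == doc) {
                pending.add(values[next++]);
            } else {
                pending.add(MISSING); // fill the hole
            }
        }
        return pending;
    }

    /** Sparse layout: one {docID, value} pair per document that actually has a value. */
    static List<long[]> sparse(int[] docIds, long[] values) {
        List<long[]> pairs = new ArrayList<>(docIds.length);
        for (int i = 0; i < docIds.length; i++) {
            pairs.add(new long[] {docIds[i], values[i]});
        }
        return pairs;
    }

    public static void main(String[] args) {
        int[] docIds = {3, 900_000};
        long[] values = {42L, 7L};
        System.out.println(dense(1_000_000, docIds, values).size() + " dense entries vs "
                + sparse(docIds, values).size() + " sparse entries");
    }
}
```

A merge that walks the dense list does work proportional to maxDoc, while one that walks the sparse list does work proportional to the number of documents with a value, at the cost of storing a docID next to each value.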




[jira] [Commented] (SOLR-4587) Implement Saved Searches a la ElasticSearch Percolator

2016-12-15 Thread Otis Gospodnetic (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-4587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15752039#comment-15752039
 ] 

Otis Gospodnetic commented on SOLR-4587:


http://search-lucene.com/m/Solr/eHNlnz4JxwIMSo1?subj=Deep+dive+on+the+topic+streaming+expression
 for anyone who wants to follow.

> Implement Saved Searches a la ElasticSearch Percolator
> --
>
> Key: SOLR-4587
> URL: https://issues.apache.org/jira/browse/SOLR-4587
> Project: Solr
>  Issue Type: New Feature
>  Components: SearchComponents - other, SolrCloud
>Reporter: Otis Gospodnetic
> Fix For: 6.0
>
>
> Use Lucene MemoryIndex for this.




[jira] [Commented] (LUCENE-6966) Contribution: Codec for index-level encryption

2016-11-21 Thread Otis Gospodnetic (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-6966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15685728#comment-15685728
 ] 

Otis Gospodnetic commented on LUCENE-6966:
--

Uh, silence. :( I have not looked into the implementation and have only skimmed 
comments here in the past.  My general feeling, though, is that until/unless 
this gets committed most people won't bother looking, and once it's in it may 
get worked on by more interested parties.  I think we saw similar behaviour 
with Solr CDCR, which was WIP in JIRA and labeled as such for a long time, but 
now that it's in I hear of more and more people using it: 
http://search-lucene.com/?q=cdcr

> Contribution: Codec for index-level encryption
> --
>
> Key: LUCENE-6966
> URL: https://issues.apache.org/jira/browse/LUCENE-6966
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/other
>Reporter: Renaud Delbru
>  Labels: codec, contrib
> Attachments: Encryption Codec Documentation.pdf, LUCENE-6966-1.patch, 
> LUCENE-6966-2-docvalues.patch, LUCENE-6966-2.patch
>
>
> We would like to contribute a codec that enables the encryption of sensitive 
> data in the index that has been developed as part of an engagement with a 
> customer. We think that this could be of interest for the community.
> Below is a description of the project.
> h1. Introduction
> In comparison with approaches where all data is encrypted (e.g., file system 
> encryption, index output / directory encryption), encryption at a codec level 
> enables more fine-grained control on which block of data is encrypted. This 
> is more efficient since less data has to be encrypted. This also gives more 
> flexibility such as the ability to select which field to encrypt.
> Some of the requirements for this project were:
> * The performance impact of the encryption should be reasonable.
> * The user can choose which field to encrypt.
> * Key management: During the life cycle of the index, the user can provide a 
> new version of his encryption key. Multiple key versions should co-exist in 
> one index.
> h1. What is supported ?
> - Block tree terms index and dictionary
> - Compressed stored fields format
> - Compressed term vectors format
> - Doc values format (prototype based on an encrypted index output) - this 
> will be submitted as a separate patch
> - Index upgrader: command to upgrade all the index segments with the latest 
> key version available.
> h1. How it is implemented ?
> h2. Key Management
> One index segment is encrypted with a single key version. An index can have 
> multiple segments, each one encrypted using a different key version. The key 
> version for a segment is stored in the segment info.
> The provided codec is abstract, and a subclass is responsible for providing an 
> implementation of the cipher factory. The cipher factory is responsible for 
> creating a cipher instance based on a given key version.
> h2. Encryption Model
> The encryption model is based on AES/CBC with padding. The initialisation 
> vector (IV) is reused for performance reasons, but only on a per-format and 
> per-segment basis.
> While IV reuse is usually considered bad practice, the CBC mode is somewhat 
> resilient to IV reuse. The only "leak" of information that this could lead to 
> is being able to know that two encrypted blocks of data start with the same 
> prefix. However, it is unlikely that two data blocks in an index segment will 
> start with the same data:
> - Stored Fields Format: Each encrypted data block is a compressed block 
> (~4kb) of one or more documents. It is unlikely that two compressed blocks 
> start with the same data prefix.
> - Term Vectors: Each encrypted data block is a compressed block (~4kb) of 
> terms and payloads from one or more documents. It is unlikely that two 
> compressed blocks start with the same data prefix.
> - Term Dictionary Index: The term dictionary index is encoded and encrypted 
> in one single data block.
> - Term Dictionary Data: Each data block of the term dictionary encodes a set 
> of suffixes. It is unlikely to have two dictionary data blocks sharing the 
> same prefix within the same segment.
> - DocValues: A DocValues file will be composed of multiple encrypted data 
> blocks. It is unlikely to have two data blocks sharing the same prefix within 
> the same segment (each one encodes a list of values associated with a 
> field).
> To the best of our knowledge, this model should be safe. However, it would be 
> good if someone with security expertise in the community could review and 
> validate it. 
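
The prefix "leak" described above can be reproduced with the JDK's own cipher classes. This is a small demonstration, not the contributed codec: CbcIvReuseDemo is a made-up name, PKCS5Padding is an assumption (the description only says "with padding"), and the all-zero key and IV are for demonstration only.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.util.Arrays;
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

/** Shows what AES/CBC reveals when the same key and IV encrypt two messages. */
final class CbcIvReuseDemo {

    /** Encrypts with AES/CBC; the padding scheme is an assumption of this sketch. */
    static byte[] encrypt(byte[] key, byte[] iv, byte[] plaintext) {
        try {
            Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
            cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"), new IvParameterSpec(iv));
            return cipher.doFinal(plaintext);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    /** True if the first 16-byte ciphertext blocks are identical. */
    static boolean sameFirstBlock(byte[] a, byte[] b) {
        return Arrays.equals(Arrays.copyOf(a, 16), Arrays.copyOf(b, 16));
    }

    public static void main(String[] args) {
        byte[] key = new byte[16]; // all-zero demo key, never do this in practice
        byte[] iv = new byte[16];  // the same IV is reused for both messages
        // Both plaintexts share their first 16 bytes and differ afterwards.
        byte[] c1 = encrypt(key, iv, "0123456789abcdef-first-tail".getBytes(StandardCharsets.US_ASCII));
        byte[] c2 = encrypt(key, iv, "0123456789abcdef-other-tail".getBytes(StandardCharsets.US_ASCII));
        System.out.println("first blocks equal: " + sameFirstBlock(c1, c2));
        System.out.println("whole ciphertexts equal: " + Arrays.equals(c1, c2));
    }
}
```

With a shared key and IV, identical leading plaintext blocks produce identical leading ciphertext blocks, so an observer learns that the two messages start the same way; once the plaintexts diverge, the ciphertexts diverge too.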
> h1. Performance
> We report here a performance benchmark we did on an early prototype based on 
> Lucene 4.x. The benchmark was performed on the Wikipedia dataset where all 
> t

[jira] [Commented] (LUCENE-7407) Explore switching doc values to an iterator API

2016-11-15 Thread Otis Gospodnetic (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-7407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15669305#comment-15669305
 ] 

Otis Gospodnetic commented on LUCENE-7407:
--

I had a quick look at [~yo...@apache.org]'s SOLR-9599 and then at [~jpountz]'s 
patch in LUCENE-7462 that makes the search-time work less expensive.  The last 
comment from Yonik reporting a faceting regression in Solr was from October 18.  
Adrien's patch was committed on October 24.  Maybe things are working better 
for Solr now?

If not, in the interest of moving forward, what do people think about Yonik's 
suggestion:
bq. Perhaps we should have both a random access API as well as an iterator API?
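
To make the "both APIs" suggestion concrete, here is a rough sketch of one sparse store exposing a forward-only cursor and a random-access lookup side by side. None of this is Lucene's API: SparseDocValues, Cursor, the NO_MORE_DOCS constant and the method names are hypothetical, and docIds is assumed to be sorted in ascending order.

```java
import java.util.Arrays;

/** One sparse store offering both iterator-style and random-access reads. */
final class SparseDocValues {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    private final int[] docIds;  // sorted ascending, one entry per document with a value
    private final long[] values; // values[i] belongs to docIds[i]

    SparseDocValues(int[] docIds, long[] values) {
        this.docIds = docIds;
        this.values = values;
    }

    /** Iterator-style access: forward only, visits just the documents that have a value. */
    final class Cursor {
        private int pos = -1;

        /** Moves to the next document with a value, or returns NO_MORE_DOCS when exhausted. */
        int nextDoc() {
            if (pos < docIds.length) {
                pos++;
            }
            return pos < docIds.length ? docIds[pos] : NO_MORE_DOCS;
        }

        /** Value of the current document; only valid after nextDoc() returned a real docID. */
        long longValue() {
            return values[pos];
        }
    }

    /** Each cursor is use-once and must not be shared between threads. */
    Cursor iterator() {
        return new Cursor();
    }

    /** Random-access style: any docID in any order, with a caller-chosen value for misses. */
    long get(int docID, long missing) {
        int i = Arrays.binarySearch(docIds, docID);
        return i >= 0 ? values[i] : missing;
    }
}
```

In this sketch the cursor pays only for documents that have a value, while get() pays a binary search per call, which hints at the kind of cost difference a dual API would have to weigh.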

> Explore switching doc values to an iterator API
> ---
>
> Key: LUCENE-7407
> URL: https://issues.apache.org/jira/browse/LUCENE-7407
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>  Labels: docValues
> Fix For: master (7.0)
>
> Attachments: LUCENE-7407.patch
>
>
> I think it could be compelling if we restricted doc values to use an
> iterator API at read time, instead of the more general random access
> API we have today:
>   * It would make doc values disk usage more of a "you pay for what
> you actually use", like postings, which is a compelling
> reduction for sparse usage.
>   * I think codecs could compress better and maybe speed up decoding
> of doc values, even in the non-sparse case, since the read-time
> API is more restrictive "forward only" instead of random access.
>   * We could remove {{getDocsWithField}} entirely, since that's
> implicit in the iteration, and the awkward "return 0 if the
> document didn't have this field" would go away.
>   * We can remove the annoying thread locals we must make today in
> {{CodecReader}}, and close the trappy "I accidentally shared a
> single XXXDocValues instance across threads", since an iterator is
> inherently "use once".
>   * We could maybe leverage the numerous optimizations we've done for
> postings over time, since the two problems ("iterate over doc ids
> and store something interesting for each") are very similar.
> This idea has come up many times in the past, e.g. LUCENE-7253 is a recent
> example, and very early iterations of doc values started with exactly
> this ;)
> However, it's a truly enormous change, likely 7.0 only.  Or maybe we
> could have the new iterator APIs also ported to 6.x side by side with
> the deprecated existing random-access APIs.




[jira] [Commented] (LUCENE-7474) Improve doc values writers

2016-10-10 Thread Otis Gospodnetic (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-7474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15564060#comment-15564060
 ] 

Otis Gospodnetic commented on LUCENE-7474:
--

I was wondering how one could compare Lucene indexing (and searching) 
performance before and after this change.  Is there a way to add a sparse 
dataset for the nightly benchmark and use it for both trunk and 6.x branch, so 
one can see the performance difference of Lucene 6.x with sparse data vs. 
Lucene 7.x with sparse data?

> Improve doc values writers
> --
>
> Key: LUCENE-7474
> URL: https://issues.apache.org/jira/browse/LUCENE-7474
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Assignee: Adrien Grand
>Priority: Minor
> Fix For: master (7.0)
>
> Attachments: LUCENE-7474.patch
>
>
> One of the goals of the new iterator-based API is to better handle sparse 
> data. However, the current doc values writers still use a dense 
> representation, and some of them perform naive linear scans in the nextDoc 
> implementation.




[jira] [Commented] (LUCENE-7474) Improve doc values writers

2016-10-05 Thread Otis Gospodnetic (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-7474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15548058#comment-15548058
 ] 

Otis Gospodnetic commented on LUCENE-7474:
--

yhooo! :)
Do the nightly builds have any tests that will exercise these new writers, the 
new 7.0 Codec, etc., so one can see how much speed this change gains?

> Improve doc values writers
> --
>
> Key: LUCENE-7474
> URL: https://issues.apache.org/jira/browse/LUCENE-7474
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Assignee: Adrien Grand
>Priority: Minor
> Fix For: master (7.0)
>
> Attachments: LUCENE-7474.patch
>
>
> One of the goals of the new iterator-based API is to better handle sparse 
> data. However, the current doc values writers still use a dense 
> representation, and some of them perform naive linear scans in the nextDoc 
> implementation.




[jira] [Commented] (SOLR-2883) Add QParser boolean hint for filter queries

2016-09-23 Thread Otis Gospodnetic (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15516375#comment-15516375
 ] 

Otis Gospodnetic commented on SOLR-2883:


We hit this in Solr-Redis: 
https://github.com/sematext/solr-redis/issues/38#issuecomment-249184074

> Add QParser boolean hint for filter queries
> ---
>
> Key: SOLR-2883
> URL: https://issues.apache.org/jira/browse/SOLR-2883
> Project: Solr
>  Issue Type: Improvement
>  Components: search
>Reporter: David Smiley
>
> It would be useful if there was a QParser hint of some kind that indicated 
> that the score isn't needed. This would be set by Solr in QueryComponent when 
> processing the fq param, and some field types could check for this and return 
> more efficient Query implementations from FieldType.getFieldQuery(). For 
> example, a geospatial field could return a ConstantScoreQuery(Filter) 
> implementation when only filtering is needed, or return a query that returns 
> a geospatial distance for a document's score. I think there are probably 
> other opportunities for this flag to have its use but I'm not sure.
> As an example solution, a local param of needScore=false could be added.  It 
> should be functionally equivalent to fq={!needScore=false}.
> Here is a modified portion of QueryComponent at line 135 to illustrate what 
> the change would be. I haven't tested it but it compiles.
> {code:java}
> for (String fq : fqs) {
>   if (fq != null && fq.trim().length() != 0) {
>     QParser fqp = QParser.getParser(fq, null, req);
>     SolrParams localParams = fqp.getLocalParams();
>     SolrParams defaultLocalParams =
>         new MapSolrParams(Collections.singletonMap("needScore", "false"));
>     SolrParams newLocalParams = new DefaultSolrParams(localParams, defaultLocalParams);
>     fqp.setLocalParams(newLocalParams);
>     filters.add(fqp.getQuery());
>   }
> }
> {code}
> It would probably be best to define the "needScore" constant somewhere better 
> but this is it in a nutshell.
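
The hint described above can be sketched in plain Java without Solr on the classpath. This is only an illustration under stated assumptions: `NeedScoreSketch`, `withDefaults`, `chooseQuery` and the returned strings are hypothetical stand-ins, not Solr APIs; only the `needScore` local-param name comes from the issue.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a field type that honours a needScore hint. */
class NeedScoreSketch {

  /** Layers defaults under explicit local params, similar to how the quoted patch wraps them. */
  static Map<String, String> withDefaults(Map<String, String> localParams,
                                          Map<String, String> defaults) {
    Map<String, String> merged = new HashMap<>(defaults);
    if (localParams != null) {
      merged.putAll(localParams); // explicit local params win over defaults
    }
    return merged;
  }

  /** Picks a cheaper query flavour when the caller says scores are not needed. */
  static String chooseQuery(Map<String, String> params) {
    boolean needScore = Boolean.parseBoolean(params.getOrDefault("needScore", "true"));
    return needScore ? "distance-scoring query" : "constant-score filter";
  }

  public static void main(String[] args) {
    Map<String, String> fqDefaults = Collections.singletonMap("needScore", "false");
    // An fq with no local params: the default hint applies.
    System.out.println(chooseQuery(withDefaults(null, fqDefaults)));
    // An fq that sets needScore=true explicitly: the explicit value overrides the default.
    System.out.println(chooseQuery(
        withDefaults(Collections.singletonMap("needScore", "true"), fqDefaults)));
  }
}
```

The point of layering the defaults under the user's local params is that a filter query can still ask for scores explicitly, while every other fq gets the cheaper path.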




[jira] [Updated] (LUCENE-7253) Make sparse doc values and segments merging more efficient

2016-09-21 Thread Otis Gospodnetic (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-7253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Otis Gospodnetic updated LUCENE-7253:
-
Fix Version/s: master (7.0)

> Make sparse doc values and segments merging more efficient 
> ---
>
> Key: LUCENE-7253
> URL: https://issues.apache.org/jira/browse/LUCENE-7253
> Project: Lucene - Core
>  Issue Type: Improvement
>Affects Versions: 5.5, 6.0
>Reporter: Pawel Rog
>  Labels: performance
> Fix For: master (7.0)
>
>
> Doc Values were optimized recently to efficiently store sparse data. 
> Unfortunately there is still a big problem with Doc Values merges for sparse 
> fields. For an index of 1 billion documents, it doesn't seem to matter 
> whether all documents have a value for this field or only 1 document has 
> one: segment merge time is the same in both cases. In most cases this is 
> not a problem, but there are several cases in which one can expect to have many 
> fields with sparse doc values.
> I can describe an example. During performance tests of a system with a large 
> number of sparse fields I realized that Doc Values merges are a bottleneck. I 
> had hundreds of different numeric fields. Each document contained only a small 
> subset of all fields. The average document contained 5-7 different numeric values. 
> As you can see, data was very sparse in these fields. It turned out that the 
> ingestion process was CPU-bound. Most of the CPU time was spent in DocValues 
> related methods (SingletonSortedNumericDocValues#setDocument, 
> DocValuesConsumer$10$1#next, DocValuesConsumer#isSingleValued, 
> DocValuesConsumer$4$1#setNext, ...) - mostly during segment merging.
> Adrien Grand suggested reducing the number of sparse fields and replacing them 
> with a smaller number of denser fields. This helped a lot but complicated 
> field naming. 
> I am not very familiar with the Doc Values source code but I have a small 
> suggestion for improving Doc Values merges for sparse fields. I realized 
> that Doc Values producers and consumers use Iterators. Let's take the example 
> of numeric Doc Values. Would it be possible to replace the Iterator which 
> "travels" through all documents with an Iterator over the collection of non-empty 
> values? Of course this would require storing an object (instead of a number) 
> that contains the value and document ID. Such an iterator could significantly 
> improve merge time of sparse Doc Values fields. IMHO this won't cause big 
> overhead for dense structures but it can be a game changer for sparse 
> structures.
> This is what happens in NumericDocValuesWriter on flush
> {code}
> dvConsumer.addNumericField(fieldInfo,
>     new Iterable<Number>() {
>       @Override
>       public Iterator<Number> iterator() {
>         return new NumericIterator(maxDoc, values, docsWithField);
>       }
>     });
> {code}
> Before this happens during addValue, this loop is executed to fill holes.
> {code}
> // Fill in any holes:
> for (int i = (int)pending.size(); i < docID; ++i) {
>   pending.add(MISSING);
> }
> {code}
> It turns out that the variable called pending is used only internally in 
> NumericDocValuesWriter. I know pending is PackedLongValues and it wouldn't be 
> good to replace it with a different class (some kind of list) because this may 
> break DV performance for dense fields. I hope someone can suggest interesting 
> solutions for this problem :).
> It would be great if discussion about sparse Doc Values merge performance could 
> start here.
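
To make the cost difference in the description above concrete, here is a minimal sketch that contrasts the hole-filling loop with a buffer that stores only (docID, value) pairs. It is not Lucene code: `SparseNumericSketch`, `denseAdd` and `Sparse` are hypothetical names, and plain `ArrayList`s stand in for `PackedLongValues`.

```java
import java.util.ArrayList;
import java.util.List;

/** Contrasts dense hole-filling with a sparse (docID, value) buffer. Names are illustrative. */
class SparseNumericSketch {
  static final long MISSING = 0L;

  /** Dense buffering: one slot per document up to docID, as in the hole-filling loop. */
  static List<Long> denseAdd(List<Long> pending, int docID, long value) {
    for (int i = pending.size(); i < docID; ++i) {
      pending.add(MISSING); // placeholder for every document without a value
    }
    pending.add(value);
    return pending;
  }

  /** Sparse buffering: parallel lists hold only the documents that have a value. */
  static final class Sparse {
    final List<Integer> docIDs = new ArrayList<>();
    final List<Long> values = new ArrayList<>();

    void add(int docID, long value) {
      docIDs.add(docID);
      values.add(value);
    }

    /** Number of entries a consumer would have to visit at flush or merge time. */
    int size() {
      return docIDs.size();
    }
  }

  public static void main(String[] args) {
    List<Long> dense = new ArrayList<>();
    Sparse sparse = new Sparse();
    int[] docs = {3, 100_000}; // only two documents carry a value
    for (int doc : docs) {
      denseAdd(dense, doc, 42L);
      sparse.add(doc, 42L);
    }
    System.out.println("dense entries:  " + dense.size());
    System.out.println("sparse entries: " + sparse.size());
  }
}
```

With two values spread over 100,001 documents, the dense buffer grows with the highest docID while the sparse one grows with the number of values, which is the asymmetry the issue is about. Whether a sparse buffer would hurt dense fields is exactly the open question raised in the description.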




[jira] [Commented] (LUCENE-7407) Explore switching doc values to an iterator API

2016-09-20 Thread Otis Gospodnetic (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-7407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15507892#comment-15507892
 ] 

Otis Gospodnetic commented on LUCENE-7407:
--

bq. there is a lot of fun improvements we can make here, in follow-on issues, 
so that e.g. LUCENE-7253 (merging of sparse doc values fields) is fixed.

So LUCENE-7253 is where the new Codec work for trunk will go?
Did you maybe create the other issues you mentioned?  Asking because I'm 
curious what you have in mind and so I can link+watch.

> Explore switching doc values to an iterator API
> ---
>
> Key: LUCENE-7407
> URL: https://issues.apache.org/jira/browse/LUCENE-7407
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>  Labels: docValues
> Attachments: LUCENE-7407.patch
>
>
> I think it could be compelling if we restricted doc values to use an
> iterator API at read time, instead of the more general random access
> API we have today:
>   * It would make doc values disk usage more of a "you pay for what
> you actually use", like postings, which is a compelling
> reduction for sparse usage.
>   * I think codecs could compress better and maybe speed up decoding
> of doc values, even in the non-sparse case, since the read-time
> API is more restrictive "forward only" instead of random access.
>   * We could remove {{getDocsWithField}} entirely, since that's
> implicit in the iteration, and the awkward "return 0 if the
> document didn't have this field" would go away.
>   * We can remove the annoying thread locals we must make today in
> {{CodecReader}}, and close the trappy "I accidentally shared a
> single XXXDocValues instance across threads", since an iterator is
> inherently "use once".
>   * We could maybe leverage the numerous optimizations we've done for
> postings over time, since the two problems ("iterate over doc ids
> and store something interesting for each") are very similar.
> This idea has come up many times in the past, e.g. LUCENE-7253 is a recent
> example, and very early iterations of doc values started with exactly
> this ;)
> However, it's a truly enormous change, likely 7.0 only.  Or maybe we
> could have the new iterator APIs also ported to 6.x side by side with
> the deprecated existing random-access APIs.
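
As a rough illustration of the forward-only read API proposed in the description above, the sketch below iterates only over documents that have a value. It is not taken from the attached patch: `DocValuesIteratorSketch` and its fields are hypothetical, and only the `nextDoc()` / `NO_MORE_DOCS` convention is borrowed from Lucene's `DocIdSetIterator`.

```java
/** Forward-only doc values iterator over a sparse field; all names are illustrative. */
class DocValuesIteratorSketch {
  static final int NO_MORE_DOCS = Integer.MAX_VALUE;

  private final int[] docIDs;   // sorted IDs of the documents that have a value
  private final long[] values;  // values[i] belongs to docIDs[i]
  private int index = -1;

  DocValuesIteratorSketch(int[] docIDs, long[] values) {
    this.docIDs = docIDs;
    this.values = values;
  }

  /** Moves to the next document that has a value, or returns NO_MORE_DOCS. */
  int nextDoc() {
    index++;
    return index < docIDs.length ? docIDs[index] : NO_MORE_DOCS;
  }

  /** Value for the current document; only valid after nextDoc() returned a real ID. */
  long longValue() {
    return values[index];
  }

  public static void main(String[] args) {
    DocValuesIteratorSketch it =
        new DocValuesIteratorSketch(new int[] {2, 7, 40}, new long[] {10L, 20L, 30L});
    for (int doc = it.nextDoc(); doc != NO_MORE_DOCS; doc = it.nextDoc()) {
      System.out.println("doc " + doc + " -> " + it.longValue());
    }
  }
}
```

Because the caller never asks for an arbitrary docID, there is no need for a separate "docs with field" lookup or a "return 0 if missing" convention: a document without a value is simply never returned by `nextDoc()`.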




[jira] [Commented] (LUCENE-7407) Explore switching doc values to an iterator API

2016-08-31 Thread Otis Gospodnetic (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-7407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15454278#comment-15454278
 ] 

Otis Gospodnetic commented on LUCENE-7407:
--

Once these changes are made do you think one will be able to just replace the 
Lucene jar in e.g. ES 5.x?

> Explore switching doc values to an iterator API
> ---
>
> Key: LUCENE-7407
> URL: https://issues.apache.org/jira/browse/LUCENE-7407
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>  Labels: docValues
>
> I think it could be compelling if we restricted doc values to use an
> iterator API at read time, instead of the more general random access
> API we have today:
>   * It would make doc values disk usage more of a "you pay for what
> you actually use", like postings, which is a compelling
> reduction for sparse usage.
>   * I think codecs could compress better and maybe speed up decoding
> of doc values, even in the non-sparse case, since the read-time
> API is more restrictive "forward only" instead of random access.
>   * We could remove {{getDocsWithField}} entirely, since that's
> implicit in the iteration, and the awkward "return 0 if the
> document didn't have this field" would go away.
>   * We can remove the annoying thread locals we must make today in
> {{CodecReader}}, and close the trappy "I accidentally shared a
> single XXXDocValues instance across threads", since an iterator is
> inherently "use once".
>   * We could maybe leverage the numerous optimizations we've done for
> postings over time, since the two problems ("iterate over doc ids
> and store something interesting for each") are very similar.
> This idea has come up many times in the past, e.g. LUCENE-7253 is a recent
> example, and very early iterations of doc values started with exactly
> this ;)
> However, it's a truly enormous change, likely 7.0 only.  Or maybe we
> could have the new iterator APIs also ported to 6.x side by side with
> the deprecated existing random-access APIs.




[jira] [Commented] (LUCENE-7407) Explore switching doc values to an iterator API

2016-08-05 Thread Otis Gospodnetic (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-7407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15410479#comment-15410479
 ] 

Otis Gospodnetic commented on LUCENE-7407:
--

Can I label this with #AWESOME!!! ? Could Adrien's LUCENE-6928 piggyback on 
this?

> Explore switching doc values to an iterator API
> ---
>
> Key: LUCENE-7407
> URL: https://issues.apache.org/jira/browse/LUCENE-7407
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>  Labels: docValues
>
> I think it could be compelling if we restricted doc values to use an
> iterator API at read time, instead of the more general random access
> API we have today:
>   * It would make doc values disk usage more of a "you pay for what
> you actually use", like postings, which is a compelling
> reduction for sparse usage.
>   * I think codecs could compress better and maybe speed up decoding
> of doc values, even in the non-sparse case, since the read-time
> API is more restrictive "forward only" instead of random access.
>   * We could remove {{getDocsWithField}} entirely, since that's
> implicit in the iteration, and the awkward "return 0 if the
> document didn't have this field" would go away.
>   * We can remove the annoying thread locals we must make today in
> {{CodecReader}}, and close the trappy "I accidentally shared a
> single XXXDocValues instance across threads", since an iterator is
> inherently "use once".
>   * We could maybe leverage the numerous optimizations we've done for
> postings over time, since the two problems ("iterate over doc ids
> and store something interesting for each") are very similar.
> This idea has come up many times in the past, e.g. LUCENE-7253 is a recent
> example, and very early iterations of doc values started with exactly
> this ;)
> However, it's a truly enormous change, likely 7.0 only.  Or maybe we
> could have the new iterator APIs also ported to 6.x side by side with
> the deprecated existing random-access APIs.




[jira] [Updated] (LUCENE-7407) Explore switching doc values to an iterator API

2016-08-05 Thread Otis Gospodnetic (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-7407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Otis Gospodnetic updated LUCENE-7407:
-
Labels: docValues  (was: )

> Explore switching doc values to an iterator API
> ---
>
> Key: LUCENE-7407
> URL: https://issues.apache.org/jira/browse/LUCENE-7407
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>  Labels: docValues
>
> I think it could be compelling if we restricted doc values to use an
> iterator API at read time, instead of the more general random access
> API we have today:
>   * It would make doc values disk usage more of a "you pay for what
> you actually use", like postings, which is a compelling
> reduction for sparse usage.
>   * I think codecs could compress better and maybe speed up decoding
> of doc values, even in the non-sparse case, since the read-time
> API is more restrictive "forward only" instead of random access.
>   * We could remove {{getDocsWithField}} entirely, since that's
> implicit in the iteration, and the awkward "return 0 if the
> document didn't have this field" would go away.
>   * We can remove the annoying thread locals we must make today in
> {{CodecReader}}, and close the trappy "I accidentally shared a
> single XXXDocValues instance across threads", since an iterator is
> inherently "use once".
>   * We could maybe leverage the numerous optimizations we've done for
> postings over time, since the two problems ("iterate over doc ids
> and store something interesting for each") are very similar.
> This idea has come up many times in the past, e.g. LUCENE-7253 is a recent
> example, and very early iterations of doc values started with exactly
> this ;)
> However, it's a truly enormous change, likely 7.0 only.  Or maybe we
> could have the new iterator APIs also ported to 6.x side by side with
> the deprecated existing random-access APIs.




[jira] [Commented] (SOLR-7452) json facet api returning inconsistent counts in cloud set up

2016-07-13 Thread Otis Gospodnetic (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-7452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15375418#comment-15375418
 ] 

Otis Gospodnetic commented on SOLR-7452:


[~yo...@apache.org] We've gotten inquiries about this bug/patch/fix at 
Sematext, but if you're working on this then maybe it's better for us not to 
meddle, so like a few others above, I'm curious about the status of this.

> json facet api returning inconsistent counts in cloud set up
> 
>
> Key: SOLR-7452
> URL: https://issues.apache.org/jira/browse/SOLR-7452
> Project: Solr
>  Issue Type: Bug
>  Components: Facet Module
>Affects Versions: 5.1
>Reporter: Vamsi Krishna D
>  Labels: count, facet, sort
> Fix For: 5.2
>
>   Original Estimate: 96h
>  Remaining Estimate: 96h
>
> While using the newly added feature of json term facet api 
> (http://yonik.com/json-facet-api/#TermsFacet) I am encountering inconsistent 
> returns of counts of faceted value ( Note I am running on a cloud mode of 
> solr). For example consider that i have txns_id(unique field or key), 
> consumer_number and amount. Now for a 10 million such records , lets say i 
> query for 
> q=*:*&rows=0&
>  json.facet={
>biskatoo:{
>type : terms,
>field : consumer_number,
>limit : 20,
>   sort : {y:desc},
>   numBuckets : true,
>   facet:{
>y : "sum(amount)"
>}
>}
>  }
> the results are as follows ( some are omitted ):
> "facets":{
> "count":6641277,
> "biskatoo":{
>   "numBuckets":3112708,
>   "buckets":[{
>   "val":"surya",
>   "count":4,
>   "y":2.264506},
>   {
>   "val":"raghu",
>   "COUNT":3,   // capitalised for recognition 
>   "y":1.8},
> {
>   "val":"malli",
>   "count":4,
>   "y":1.78}]}}}
> but if i restrict the query to 
> q=consumer_number:raghu&rows=0&
>  json.facet={
>biskatoo:{
>type : terms,
>field : consumer_number,
>limit : 20,
>   sort : {y:desc},
>   numBuckets : true,
>   facet:{
>y : "sum(amount)"
>}
>}
>  }
> i get :
>   "facets":{
> "count":4,
> "biskatoo":{
>   "numBuckets":1,
>   "buckets":[{
>   "val":"raghu",
>   "COUNT":4,
>   "y":2429708.24}]}}}
> One can see the count results are inconsistent ( and I found many occasions 
> of inconsistencies).
> I have tried the patch https://issues.apache.org/jira/browse/SOLR-7412 but 
> still the issue seems not resolved




[jira] [Commented] (SOLR-7341) xjoin - join data from external sources

2016-07-05 Thread Otis Gospodnetic (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-7341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15362942#comment-15362942
 ] 

Otis Gospodnetic commented on SOLR-7341:


[~adamgamble] - vote for it :)

> xjoin - join data from external sources
> ---
>
> Key: SOLR-7341
> URL: https://issues.apache.org/jira/browse/SOLR-7341
> Project: Solr
>  Issue Type: New Feature
>  Components: search
>Reporter: Tom Winch
>Priority: Minor
> Fix For: 4.10.3, 5.3.2, 6.0
>
> Attachments: SOLR-7341.patch-4.10.3, SOLR-7341.patch-4_10, 
> SOLR-7341.patch-5.3.2, SOLR-7341.patch-5_3, SOLR-7341.patch-master, 
> SOLR-7341.patch-trunk, SOLR-7341.patch-trunk
>
>
> h2. XJoin
> The "xjoin" SOLR contrib allows external results to be joined with SOLR 
> results in a query and the SOLR result set to be filtered by the results of 
> an external query. Values from the external results are made available in the 
> SOLR results and may also be used to boost the scores of corresponding 
> documents during the search. The contrib consists of the Java classes 
> XJoinSearchComponent, XJoinValueSourceParser and XJoinQParserPlugin (and 
> associated classes), which must be configured in solrconfig.xml, and the 
> interfaces XJoinResultsFactory and XJoinResults, which are implemented by the 
> user to provide the link between SOLR and the external results source (but 
> see below for details of how to use the in-built SimpleXJoinResultsFactory 
> implementation). External results and SOLR documents are matched via a single 
> configurable attribute (the "join field").
> To include the XJoin contrib classes, add the following config to 
> solrconfig.xml:
> {code:xml}
> <config>
>   ..
>   <lib dir="…" regex=".*\.jar" />
>   <lib dir="…" regex="solr-xjoin-\d.*\.jar" />
>   ..
> </config>
> {code}
> Note that any JARs containing implementations of the XJoinResultsFactory must 
> also be included.
> h2. Java classes and interfaces
> h3. XJoinResultsFactory
> The user implementation of this interface is responsible for connecting to an 
> external source to perform a query (or otherwise collect results). Parameters 
> with prefix "<component name>.external." are passed from the SOLR query URL 
> to parameterise the search. The interface has the following methods:
> * void init(NamedList args) - this is called during SOLR initialisation, and 
> passed parameters from the search component configuration (see below)
> * XJoinResults getResults(SolrParams params) - this is called during a SOLR 
> search to generate external results, and is passed parameters from the SOLR 
> query URL (as above)
> For example, the implementation might perform queries of an external source 
> based on the 'q' SOLR query URL parameter (in full, <component name>.external.q).
> h3. XJoinResults
> A user implementation of this interface is returned by the getResults() 
> method of the XJoinResultsFactory implementation. It has methods:
> * Object getResult(String joinId) - this should return a particular result 
> given the value of the join attribute
> * Iterable getJoinIds() - this should return an ordered (ascending) 
> list of the join attribute values for all results of the external search
> h3. XJoinSearchComponent
> This is the central Java class of the contrib. It is a SOLR search component, 
> configured in solrconfig.xml and included in one or more SOLR request 
> handlers. There is one XJoin search component per external source, and each 
> has two main responsibilities:
> * Before the SOLR search, it connects to the external source and retrieves 
> results, storing them in the SOLR request context
> * After the SOLR search, it matches SOLR documents in the results set and 
> external results via the join field, adding attributes from the external 
> results to documents in the SOLR results set
> It takes the following initialisation parameters:
> * factoryClass - this specifies the user-supplied class implementing 
> XJoinResultsFactory, used to generate external results
> * joinField - this specifies the attribute on which to join between SOLR 
> documents and external results
> * external - this parameter set is passed to configure the 
> XJoinResultsFactory implementation
> For example, in solrconfig.xml:
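> {code:xml}
> <searchComponent name="…" class="org.apache.solr.search.xjoin.XJoinSearchComponent">
>   <str name="factoryClass">test.TestXJoinResultsFactory</str>
>   <str name="joinField">id</str>
>   <lst name="external">
>     <str name="values">1,2,3</str>
>   </lst>
> </searchComponent>
> {code}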
> Here, the search component instantiates a new TestXJoinResultsFactory during 
> initialisation, and passes it the "values" parameter (1, 2, 3) to configure 
> it. To properly use the XJoinSearchComponent in a request handler, it must be 
> included at the start and end of the component list, and may be configured 
> with the following query parameters:
> * results - a comma-separated list of attributes from the XJoinResults 
> implementation (created by

[jira] [Comment Edited] (LUCENE-2605) queryparser parses on whitespace

2016-06-14 Thread Otis Gospodnetic (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15330012#comment-15330012
 ] 

Otis Gospodnetic edited comment on LUCENE-2605 at 6/14/16 8:31 PM:
---

[~steve_rowe] you are about to become everyone's hero and a household name! :)
Is this going to be in the upcoming 6.1?



was (Author: otis):
[~steve_rowe] you are about to become everyone's here and a household name! :)
Is this going to be in the upcoming 6.1?


> queryparser parses on whitespace
> 
>
> Key: LUCENE-2605
> URL: https://issues.apache.org/jira/browse/LUCENE-2605
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/queryparser
>Reporter: Robert Muir
>Assignee: Steve Rowe
> Attachments: LUCENE-2605.patch, LUCENE-2605.patch, LUCENE-2605.patch
>
>
> The queryparser parses input on whitespace, and sends each whitespace 
> separated term to its own independent token stream.
> This breaks the following at query-time, because they can't see across 
> whitespace boundaries:
> * n-gram analysis
> * shingles 
> * synonyms (especially multi-word for whitespace-separated languages)
> * languages where a 'word' can contain whitespace (e.g. vietnamese)
> It's also rather unexpected, as users think their 
> charfilters/tokenizers/tokenfilters will do the same thing at index and 
> querytime, but
> in many cases they can't. Instead, preferably the queryparser would parse 
> around only real 'operators'.
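
A toy illustration of the problem described above, in plain Java rather than with Lucene's analysis classes. The "analysis chain" here is a hypothetical bigram shingler (`WhitespaceSplitSketch.shingles`), chosen only because shingles are one of the cases the issue lists; it is a sketch of the effect, not of the query parser itself.

```java
import java.util.ArrayList;
import java.util.List;

/** Shows why pre-splitting on whitespace hides multi-word context; names are illustrative. */
class WhitespaceSplitSketch {

  /** Toy "analysis chain": emits word bigrams (shingles) from one input string. */
  static List<String> shingles(String text) {
    String[] words = text.trim().split("\\s+");
    List<String> out = new ArrayList<>();
    for (int i = 0; i + 1 < words.length; i++) {
      out.add(words[i] + " " + words[i + 1]);
    }
    return out;
  }

  /** Mimics the behaviour described above: each whitespace-separated chunk is analyzed alone. */
  static List<String> shinglesAfterPreSplit(String query) {
    List<String> out = new ArrayList<>();
    for (String chunk : query.trim().split("\\s+")) {
      out.addAll(shingles(chunk)); // each chunk is a single word, so no bigram can form
    }
    return out;
  }

  public static void main(String[] args) {
    System.out.println(shingles("new york city"));              // whole string analyzed at once
    System.out.println(shinglesAfterPreSplit("new york city")); // chunks analyzed independently
  }
}
```

The same chain produces bigrams when it sees the whole string, as it would at index time, and nothing when it is fed one chunk at a time, which is why index-time and query-time analysis can disagree.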




[jira] [Commented] (LUCENE-2605) queryparser parses on whitespace

2016-06-14 Thread Otis Gospodnetic (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15330012#comment-15330012
 ] 

Otis Gospodnetic commented on LUCENE-2605:
--

[~steve_rowe] you are about to become everyone's here and a household name! :)
Is this going to be in the upcoming 6.1?


> queryparser parses on whitespace
> 
>
> Key: LUCENE-2605
> URL: https://issues.apache.org/jira/browse/LUCENE-2605
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/queryparser
>Reporter: Robert Muir
>Assignee: Steve Rowe
> Fix For: 4.9, 6.0
>
> Attachments: LUCENE-2605.patch, LUCENE-2605.patch, LUCENE-2605.patch
>
>
> The queryparser parses input on whitespace, and sends each whitespace 
> separated term to its own independent token stream.
> This breaks the following at query-time, because they can't see across 
> whitespace boundaries:
> * n-gram analysis
> * shingles 
> * synonyms (especially multi-word for whitespace-separated languages)
> * languages where a 'word' can contain whitespace (e.g. vietnamese)
> It's also rather unexpected, as users think their 
> charfilters/tokenizers/tokenfilters will do the same thing at index and 
> querytime, but
> in many cases they can't. Instead, preferably the queryparser would parse 
> around only real 'operators'.




[jira] [Updated] (LUCENE-7253) Make sparse doc values and segments merging more efficient

2016-05-07 Thread Otis Gospodnetic (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-7253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Otis Gospodnetic updated LUCENE-7253:
-
Summary: Make sparse doc values and segments merging more efficient   (was: 
Sparse data in doc values and segments merging )

> Make sparse doc values and segments merging more efficient 
> ---
>
> Key: LUCENE-7253
> URL: https://issues.apache.org/jira/browse/LUCENE-7253
> Project: Lucene - Core
>  Issue Type: Improvement
>Affects Versions: 5.5, 6.0
>Reporter: Pawel Rog
>  Labels: performance
>
> Doc Values were optimized recently to efficiently store sparse data. 
> Unfortunately there is still a big problem with Doc Values merges for sparse 
> fields. For an index of 1 billion documents, it doesn't seem to matter 
> whether all documents have a value for this field or only 1 document has 
> one: segment merge time is the same in both cases. In most cases this is 
> not a problem, but there are several cases in which one can expect to have many 
> fields with sparse doc values.
> I can describe an example. During performance tests of a system with a large 
> number of sparse fields I realized that Doc Values merges are a bottleneck. I 
> had hundreds of different numeric fields. Each document contained only a small 
> subset of all fields. The average document contained 5-7 different numeric values. 
> As you can see, data was very sparse in these fields. It turned out that the 
> ingestion process was CPU-bound. Most of the CPU time was spent in DocValues 
> related methods (SingletonSortedNumericDocValues#setDocument, 
> DocValuesConsumer$10$1#next, DocValuesConsumer#isSingleValued, 
> DocValuesConsumer$4$1#setNext, ...) - mostly during segment merging.
> Adrien Grand suggested reducing the number of sparse fields and replacing them 
> with a smaller number of denser fields. This helped a lot but complicated 
> field naming. 
> I am not very familiar with the Doc Values source code but I have a small 
> suggestion for improving Doc Values merges for sparse fields. I realized 
> that Doc Values producers and consumers use Iterators. Let's take the example 
> of numeric Doc Values. Would it be possible to replace the Iterator which 
> "travels" through all documents with an Iterator over the collection of non-empty 
> values? Of course this would require storing an object (instead of a number) 
> that contains the value and document ID. Such an iterator could significantly 
> improve merge time of sparse Doc Values fields. IMHO this won't cause big 
> overhead for dense structures but it can be a game changer for sparse 
> structures.
> This is what happens in NumericDocValuesWriter on flush
> {code}
> dvConsumer.addNumericField(fieldInfo,
>     new Iterable<Number>() {
>       @Override
>       public Iterator<Number> iterator() {
>         return new NumericIterator(maxDoc, values, docsWithField);
>       }
>     });
> {code}
> Before this happens during addValue, this loop is executed to fill holes.
> {code}
> // Fill in any holes:
> for (int i = (int)pending.size(); i < docID; ++i) {
>   pending.add(MISSING);
> }
> {code}
> It turns out that the variable called pending is used only internally in 
> NumericDocValuesWriter. I know pending is PackedLongValues and it wouldn't be 
> good to replace it with a different class (some kind of list) because this may 
> break DV performance for dense fields. I hope someone can suggest interesting 
> solutions for this problem :).
> It would be great if discussion about sparse Doc Values merge performance could 
> start here.




[jira] [Commented] (LUCENE-7253) Sparse data in doc values and segments merging

2016-05-03 Thread Otis Gospodnetic (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-7253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15268972#comment-15268972
 ] 

Otis Gospodnetic commented on LUCENE-7253:
--

My take on this:
# sparse fields are indeed not an abuse case
# my understanding of what Robert is saying is that he agrees with 1), but that 
the current implementation is not geared for 1), and that if the existing DV 
code were just modified slightly to improve performance, it would not be the 
right implementation
# Robert didn't actually mention -1 explicitly until David brought that up, 
although we all know that Robert could always throw in his -1 in the end, after 
a contributor has already spent hours or days making changes, just to have them 
rejected (but this is a general Lucene project problem that, I think, 
nobody has actually tried solving directly because it'd be painful)
# Robert actually proposed "The correct solution is to have a more next/advance 
type api geared at forward iteration rather than one that mimics an array. Then 
nulls can be handled in typical ways in various situations (eg rle). It should 
be possible esp that scoring is in order.", so my take is that if a contributor 
did exactly what Robert wants then this could potentially be accepted
# I assume the "correct approach" involves more changes and more coding and 
time.  I assume it would be useful to make a simpler and maybe not acceptable 
change first in order to get some numbers and see if it's even worth investing 
time in the "correct approach"
# If the numbers look good then, because of a potential -1 from Robert, whoever 
takes on this challenge would have to be very clear, before any additional dev 
work, about what Robert wants, what he would -1, and what he would let in
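
To make the "next/advance" idea a bit more tangible, here is a rough sketch of a forward-only iterator that visits only the documents that have a value. The class `SparseNumericIterator`, its method names, and the `NO_MORE_DOCS` sentinel are assumptions made for this example; it is not the API Robert has in mind and not existing Lucene code.

```java
import java.util.Arrays;

/** Hypothetical forward-only iterator over the documents that have a value. */
final class SparseNumericIterator {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    private final int[] docIDs;  // sorted ascending, one entry per doc with a value
    private final long[] values; // values[i] belongs to docIDs[i]
    private int pos = -1;        // -1 means "not positioned yet"

    SparseNumericIterator(int[] docIDs, long[] values) {
        if (docIDs.length != values.length) {
            throw new IllegalArgumentException("docIDs and values differ in length");
        }
        this.docIDs = docIDs;
        this.values = values;
    }

    /** Current document, -1 before the first call, NO_MORE_DOCS when exhausted. */
    int docID() {
        if (pos < 0) {
            return -1;
        }
        return pos < docIDs.length ? docIDs[pos] : NO_MORE_DOCS;
    }

    /** Moves to the next document that has a value. */
    int nextDoc() {
        if (pos < docIDs.length) {
            pos++;
        }
        return docID();
    }

    /** Moves to the first document after the current one whose ID is >= target. */
    int advance(int target) {
        if (pos >= docIDs.length) {
            return NO_MORE_DOCS;
        }
        int idx = Arrays.binarySearch(docIDs, pos + 1, docIDs.length, target);
        pos = idx >= 0 ? idx : -idx - 1; // insertion point when target is absent
        return docID();
    }

    /** Value of the current document; only valid while positioned on a document. */
    long longValue() {
        return values[pos];
    }
}
```

A merge written against an interface like this does work proportional to the number of values rather than to the number of documents, which is the property the sparse case needs.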

> Sparse data in doc values and segments merging 
> ---
>
> Key: LUCENE-7253
> URL: https://issues.apache.org/jira/browse/LUCENE-7253
> Project: Lucene - Core
>  Issue Type: Improvement
>Affects Versions: 5.5, 6.0
>Reporter: Pawel Rog
>  Labels: performance
>
> Doc Values were optimized recently to efficiently store sparse data. 
> Unfortunately there is still a big problem with Doc Values merges for sparse 
> fields. When we imagine a 1 billion document index, it seems it doesn't 
> matter if all documents have a value for this field or there is only 1 
> document with a value. Segment merge time is the same for both cases. In most 
> cases this is not a problem, but there are several cases in which one can 
> expect to have many fields with sparse doc values.
> I can describe an example. During performance tests of a system with a large 
> number of sparse fields I realized that Doc Values merges are a bottleneck. I 
> had hundreds of different numeric fields. Each document contained only a small 
> subset of all fields. The average document contains 5-7 different numeric 
> values. As you can see, data was very sparse in these fields. It turned out 
> that the ingestion process was CPU-bound. Most of the CPU time was spent in 
> DocValues related methods (SingletonSortedNumericDocValues#setDocument, 
> DocValuesConsumer$10$1#next, DocValuesConsumer#isSingleValued, 
> DocValuesConsumer$4$1#setNext, ...) - mostly during merging segments.
> Adrien Grand suggested reducing the number of sparse fields and replacing them 
> with a smaller number of denser fields. This helped a lot but complicated 
> field naming. 
> I am not very familiar with the Doc Values source code, but I have a small 
> suggestion for how to improve Doc Values merges for sparse fields. I realized 
> that Doc Values producers and consumers use Iterators. Let's take numeric 
> Doc Values as an example. Would it be possible to replace the Iterator which 
> "travels" through all documents with an Iterator over the collection of 
> non-empty values? Of course this would require storing an object (instead of 
> a numeric) which contains the value and the document ID. Such an iterator 
> could significantly improve the merge time of sparse Doc Values fields. IMHO 
> this won't cause big overhead for dense structures, but it can be a game 
> changer for sparse structures.
> This is what happens in NumericDocValuesWriter on flush
> {code}
> dvConsumer.addNumericField(fieldInfo,
>     new Iterable<Number>() {
>       @Override
>       public Iterator<Number> iterator() {
>         return new NumericIterator(maxDoc, values, docsWithField);
>       }
>     });
> {code}
> Before this happens during addValue, this loop is executed to fill holes.
> {code}
> // Fill in any holes:
> for (int i = (int)pending.size(); i < docID; ++i) {
>   pending.add(MISSING);
> }
> {code}
> It turns out that variable called pending is used on

[jira] [Commented] (SOLR-6568) Join Discovery Contrib

2016-03-02 Thread Otis Gospodnetic (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-6568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15177046#comment-15177046
 ] 

Otis Gospodnetic commented on SOLR-6568:


Hey [~joel.bernstein], has this been superseded by other joins?

> Join Discovery Contrib
> --
>
> Key: SOLR-6568
> URL: https://issues.apache.org/jira/browse/SOLR-6568
> Project: Solr
>  Issue Type: New Feature
>Reporter: Joel Bernstein
>Assignee: Joel Bernstein
>Priority: Minor
> Fix For: 5.0
>
>
> This contribution was commissioned by the *NCBI* (National Center for 
> Biotechnology Information). 
> The Join Discovery Contrib is a set of Solr plugins that support large scale 
> joins and "join facets" between Solr cores. 
> There are two different Join implementations included in this contribution. 
> Both implementations are designed to work with integer join keys. It is very 
> common in large BioInformatic and Genomic databases to use integer primary 
> and foreign keys. Integer keys allow Bioinformatic and Genomic search engines 
> and discovery tools to perform complex operations on large data sets very 
> efficiently. 
> The Join Discovery Contrib provides features that will be applicable to 
> anyone working with the freely available databases from the NCBI and likely a 
> large number of other BioInformatic and Genomic databases. These features are 
> not specific though to Bioinformatics and Genomics, they can be used in any 
> datasets where integer keys are used to define the primary and foreign keys.
> What is included in this contrib:
> 1) A new JoinComponent. This component is used instead of the standard 
> QueryComponent. It facilitates very large scale relational joins between two 
> Solr indexes (cores). The join algorithm used in this component is known as a 
> *parallel partitioned merge join*. This is an algorithm which partitions the 
> results from both sides of the join and then sorts and merges the partitions 
> in parallel. 
>  Below are some of its features:
> * Sub-second performance on very large joins. The parallel join algorithm is 
> capable of sub-second performance on joins with tens of millions of records 
> on both sides of the join.
> * The JoinComponent returns "tuples" with fields from both sides of the join. 
> The initial release returns the primary keys from both sides of the join and 
> the join key. 
> * The tuples also include, and are ranked by, a combined score from both 
> sides of the join.
> * Special purpose memory-mapped on-disk indexes to support \*:\* joins. This 
> makes it possible to join an entire index with a sub-set of another index 
> with sub-second performance. 
> * Support for very fast one-to-one, one-to-many and many-to-many joins. Fast 
> many-to-many joins make it possible to join between indexes on multi-value 
> fields. 
> 2) A new JoinFacetComponent. This component provides facets for both indexes 
> involved in the join. 
> 3) The BitSetJoinQParserPlugin. A very fast parallel filter join based on 
> bitsets that supports infinite levels of nesting. It can be used as a filter 
> query in combination with the JoinComponent or with the standard query 
> component. 
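
As a rough illustration of the "parallel partitioned merge join" on integer keys described above, the sketch below partitions both key sets, sorts each partition, and merges matching partitions in parallel. It is not the contrib's code: `PartitionedMergeJoinSketch` is a hypothetical name, partitioning by key modulo is an assumption, and it returns only the matching join keys rather than full tuples with scores.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

/** Hypothetical sketch of a partitioned sort-merge join over integer join keys. */
final class PartitionedMergeJoinSketch {

    /** Returns the sorted, distinct keys that occur on both sides of the join. */
    static int[] join(int[] left, int[] right, int partitions) {
        if (partitions <= 0) {
            throw new IllegalArgumentException("partitions must be positive");
        }
        return IntStream.range(0, partitions)
            .parallel() // each partition pair is sorted and merged independently
            .mapToObj(p -> mergeJoin(partition(left, p, partitions),
                                     partition(right, p, partitions)))
            .flatMapToInt(Arrays::stream)
            .sorted()
            .toArray();
    }

    /** Selects the keys belonging to partition p and sorts them. */
    private static int[] partition(int[] keys, int p, int partitions) {
        return Arrays.stream(keys)
            .filter(k -> Math.floorMod(k, partitions) == p)
            .sorted()
            .toArray();
    }

    /** Classic merge step over two sorted arrays, emitting each common key once. */
    private static int[] mergeJoin(int[] a, int[] b) {
        int[] out = new int[Math.min(a.length, b.length)];
        int i = 0, j = 0, n = 0;
        while (i < a.length && j < b.length) {
            if (a[i] < b[j]) {
                i++;
            } else if (a[i] > b[j]) {
                j++;
            } else {
                int key = a[i];
                out[n++] = key;
                // Skip duplicates of the matched key on both sides.
                while (i < a.length && a[i] == key) i++;
                while (j < b.length && b[j] == key) j++;
            }
        }
        return Arrays.copyOf(out, n);
    }
}
```

Because equal keys always land in the same partition, no comparison across partitions is needed, which is what lets the merges run in parallel.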




[jira] [Commented] (SOLR-8228) Facet Telemetry

2015-11-21 Thread Otis Gospodnetic (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-8228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15020494#comment-15020494
 ] 

Otis Gospodnetic commented on SOLR-8228:


{quote}
Although not necessarily part of this initial issue, we should think about how 
to get information about certain requests that does not involve modifying the 
actual request or response. For example, "log telemetry data for the next N 
requests that match this pattern". Something like that would more naturally 
point to method 1 for returning the data (i.e. separate from the response).
{quote}

Yes. Think operations and monitoring, and tools that need to collect this data, 
but are obviously not Solr clients issuing queries and collecting this info 
from responses.  So, logs, JMX, stats API, that sort of stuff.
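
A minimal sketch of the "log telemetry data for the next N requests that match this pattern" idea, kept separate from the request and response. Everything here is hypothetical: `TelemetrySamplerSketch` is not a Solr class, and matching on the raw request string with a regular expression is just an assumption for the example.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.regex.Pattern;

/** Hypothetical sampler: says "log this one" for the next N matching requests. */
final class TelemetrySamplerSketch {
    private final Pattern pattern;
    private final AtomicInteger remaining;

    TelemetrySamplerSketch(String regex, int n) {
        this.pattern = Pattern.compile(regex);
        this.remaining = new AtomicInteger(n);
    }

    /** Returns true if telemetry for this request should be written to the log. */
    boolean shouldLog(String request) {
        if (!pattern.matcher(request).find()) {
            return false;
        }
        // Atomically consume one of the remaining slots, never going below zero.
        return remaining.getAndUpdate(n -> n > 0 ? n - 1 : 0) > 0;
    }
}
```

An operations tool could arm such a sampler through JMX or an admin API and then read the resulting log lines, without touching any client.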

> Facet Telemetry
> ---
>
> Key: SOLR-8228
> URL: https://issues.apache.org/jira/browse/SOLR-8228
> Project: Solr
>  Issue Type: New Feature
>  Components: Facet Module
>Reporter: Yonik Seeley
> Fix For: Trunk
>
>
> As the JSON Facet API becomes more complex and has more optimizations, it 
> would be nice to get a better view of what is going on in faceting... what 
> methods/algorithms are being used and what is taking up the most time or 
> memory.
>   - the strategy/method used to facet the field
>   - number of unique values in facet field
>   - memory usage of facet field itself
>   - memory usage for request (count arrays, etc)
>   - timing of various parts of facet request (finding top N, executing 
> sub-facets, etc)
> This will also help with unit tests, making sure we have proper coverage of 
> various optimizations.
> Some of this information collection may make sense to happen all the time, 
> while other information may be calculated only if requested.
> When adding facet info to a response, it could be done one of two ways:
>  1. in the existing debug block in the response, along with other debug info, 
> structured like 
>  2. directly in the facet response (i.e. in something like "\_debug\_" that 
> is a sibling of "buckets")
> We need to also consider how to merge distributed debug info (and add more 
> info about the distributed phase as well).  Given this, (2) may be simpler 
> (adding directly to facet response) as we already have a framework for 
> merging.
> Although not necessarily part of this initial issue, we should think about 
> how to get information about certain requests that does not involve modifying 
> the actual request or response.  For example, "log telemetry data for the 
> next N requests that match this pattern".  Something like that would more 
> naturally point to method 1 for returning the data (i.e. separate from the 
> response).




[jira] [Commented] (SOLR-8095) Allow disabling HDFS Locality Metrics

2015-09-24 Thread Otis Gospodnetic (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-8095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14907335#comment-14907335
 ] 

Otis Gospodnetic commented on SOLR-8095:


But why does having these metrics bother anyone?  Never heard of turning 
metrics on/off.  If they're just sitting there in JMX, they shouldn't bother 
anyone, unless they are very expensive to compute?

> Allow disabling HDFS Locality Metrics
> -
>
> Key: SOLR-8095
> URL: https://issues.apache.org/jira/browse/SOLR-8095
> Project: Solr
>  Issue Type: Bug
>  Components: hdfs
>Reporter: Mike Drob
>  Labels: metrics
> Fix For: Trunk
>
> Attachments: SOLR-8095.patch
>
>
> We added metrics, but not a way to configure/turn them off.




[jira] [Commented] (SOLR-5379) Query-time multi-word synonym expansion

2015-09-01 Thread Otis Gospodnetic (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-5379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14726326#comment-14726326
 ] 

Otis Gospodnetic commented on SOLR-5379:


There is a patch for 4.10.3, but it was not committed, so this is still not 
available in Solr AFAIK.  Would be great to get this into 5.x.

> Query-time multi-word synonym expansion
> ---
>
> Key: SOLR-5379
> URL: https://issues.apache.org/jira/browse/SOLR-5379
> Project: Solr
>  Issue Type: Improvement
>  Components: query parsers
>Reporter: Tien Nguyen Manh
>  Labels: multi-word, queryparser, synonym
> Fix For: 4.9, Trunk
>
> Attachments: conf-test-files-4_8_1.patch, quoted-4_8_1.patch, 
> quoted.patch, solr-5379-version-4.10.3.patch, synonym-expander-4_8_1.patch, 
> synonym-expander.patch
>
>
> While dealing with synonyms at query time, Solr fails to work with multi-word 
> synonyms for two reasons:
> - First, the Lucene queryparser tokenizes the user query by space, so it 
> splits a multi-word term into separate terms before feeding it to the synonym 
> filter, so the synonym filter can't recognize the multi-word term to expand it.
> - Second, if the synonym filter expands into multiple terms which contain a 
> multi-word synonym, SolrQueryParserBase currently uses MultiPhraseQuery to 
> handle synonyms. But MultiPhraseQuery doesn't work with terms that have 
> different numbers of words.
> For the first one, we can quote all multi-word synonyms in the user query 
> so that the Lucene queryparser doesn't split them. There is a JIRA task 
> related to this one: https://issues.apache.org/jira/browse/LUCENE-2605.
> For the second, we can replace MultiPhraseQuery with an appropriate 
> BooleanQuery of SHOULD clauses containing multiple PhraseQuery instances in 
> case the token stream has a multi-word synonym.
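
The second proposal can be illustrated without any Lucene classes: each alternative, whatever its word count, becomes its own quoted phrase, and the phrases are OR-ed together. This is only a sketch of the query shape, not the attached patch; `SynonymPhraseExpansionSketch` and the synonym map are hypothetical. In Lucene terms each quoted phrase would correspond to a PhraseQuery added to a BooleanQuery as a SHOULD clause.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: expands a term into OR-ed phrase alternatives. */
final class SynonymPhraseExpansionSketch {

    /** Returns the term and its synonyms as separately quoted phrases joined by OR. */
    static String expand(String term, Map<String, List<String>> synonyms) {
        List<String> quoted = new ArrayList<>();
        quoted.add('"' + term + '"');
        for (String synonym : synonyms.getOrDefault(term, Collections.emptyList())) {
            // Each synonym keeps all of its words inside one phrase.
            quoted.add('"' + synonym + '"');
        }
        return "(" + String.join(" OR ", quoted) + ")";
    }
}
```

Because every alternative is an independent phrase, alternatives with different numbers of words no longer need to line up position by position.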




[jira] [Issue Comment Deleted] (SOLR-7143) MoreLikeThis Query Parser does not handle multiple field names

2015-07-02 Thread Otis Gospodnetic (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-7143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Otis Gospodnetic updated SOLR-7143:
---
Comment: was deleted

(was: Not sure how this ended in my private e-mail.

We were going to suggest that they upgrade to 5.x so that, amongst other fixes
and improvements, they can use this new MLT QueryParser to solve the problem
with the non-functioning MLT handler in cloud mode (
https://issues.apache.org/jira/browse/SOLR-788), but now it seems that even
5 would have to be patched (if those patches work).
On Thu, Jul 2, 2015 at 11:59 PM Otis Gospodnetic (JIRA) 

)

> MoreLikeThis Query Parser does not handle multiple field names
> --
>
> Key: SOLR-7143
> URL: https://issues.apache.org/jira/browse/SOLR-7143
> Project: Solr
>  Issue Type: Bug
>  Components: query parsers
>Affects Versions: 5.0
>Reporter: Jens Wille
>Assignee: Anshum Gupta
> Attachments: SOLR-7143.patch, SOLR-7143.patch, SOLR-7143.patch, 
> SOLR-7143.patch, SOLR-7143.patch
>
>
> The newly introduced MoreLikeThis Query Parser (SOLR-6248) does not return 
> any results when supplied with multiple fields in the {{qf}} parameter.
> To reproduce within the techproducts example, compare:
> {code}
> curl 
> 'http://localhost:8983/solr/techproducts/select?q=%7B!mlt+qf=name%7DMA147LL/A'
> curl 
> 'http://localhost:8983/solr/techproducts/select?q=%7B!mlt+qf=features%7DMA147LL/A'
> curl 
> 'http://localhost:8983/solr/techproducts/select?q=%7B!mlt+qf=name,features%7DMA147LL/A'
> {code}
> The first two queries return 8 and 5 results, respectively. The third query 
> doesn't return any results (not even the matched document).
> In contrast, the MoreLikeThis Handler works as expected (accounting for the 
> default {{mintf}} and {{mindf}} values in SimpleMLTQParser):
> {code}
> curl 
> 'http://localhost:8983/solr/techproducts/mlt?q=id:MA147LL/A&mlt.fl=name&mlt.mintf=1&mlt.mindf=1'
> curl 
> 'http://localhost:8983/solr/techproducts/mlt?q=id:MA147LL/A&mlt.fl=features&mlt.mintf=1&mlt.mindf=1'
> curl 
> 'http://localhost:8983/solr/techproducts/mlt?q=id:MA147LL/A&mlt.fl=name,features&mlt.mintf=1&mlt.mindf=1'
> {code}
> After adding the following line to 
> {{example/techproducts/solr/techproducts/conf/solrconfig.xml}}:
> {code:language=XML}
> 
> {code}
> The first two queries return 7 and 4 results, respectively (excluding the 
> matched document). The third query returns 7 results, as one would expect.




[jira] [Commented] (SOLR-7143) MoreLikeThis Query Parser does not handle multiple field names

2015-07-02 Thread Otis Gospodnetic (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-7143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612595#comment-14612595
 ] 

Otis Gospodnetic commented on SOLR-7143:


[~anshumg] - which Solr version is this going in? Fix Version is empty.  Thanks.

> MoreLikeThis Query Parser does not handle multiple field names
> --
>
> Key: SOLR-7143
> URL: https://issues.apache.org/jira/browse/SOLR-7143
> Project: Solr
>  Issue Type: Bug
>  Components: query parsers
>Affects Versions: 5.0
>Reporter: Jens Wille
>Assignee: Anshum Gupta
> Attachments: SOLR-7143.patch, SOLR-7143.patch, SOLR-7143.patch, 
> SOLR-7143.patch, SOLR-7143.patch
>
>
> The newly introduced MoreLikeThis Query Parser (SOLR-6248) does not return 
> any results when supplied with multiple fields in the {{qf}} parameter.
> To reproduce within the techproducts example, compare:
> {code}
> curl 
> 'http://localhost:8983/solr/techproducts/select?q=%7B!mlt+qf=name%7DMA147LL/A'
> curl 
> 'http://localhost:8983/solr/techproducts/select?q=%7B!mlt+qf=features%7DMA147LL/A'
> curl 
> 'http://localhost:8983/solr/techproducts/select?q=%7B!mlt+qf=name,features%7DMA147LL/A'
> {code}
> The first two queries return 8 and 5 results, respectively. The third query 
> doesn't return any results (not even the matched document).
> In contrast, the MoreLikeThis Handler works as expected (accounting for the 
> default {{mintf}} and {{mindf}} values in SimpleMLTQParser):
> {code}
> curl 
> 'http://localhost:8983/solr/techproducts/mlt?q=id:MA147LL/A&mlt.fl=name&mlt.mintf=1&mlt.mindf=1'
> curl 
> 'http://localhost:8983/solr/techproducts/mlt?q=id:MA147LL/A&mlt.fl=features&mlt.mintf=1&mlt.mindf=1'
> curl 
> 'http://localhost:8983/solr/techproducts/mlt?q=id:MA147LL/A&mlt.fl=name,features&mlt.mintf=1&mlt.mindf=1'
> {code}
> After adding the following line to 
> {{example/techproducts/solr/techproducts/conf/solrconfig.xml}}:
> {code:language=XML}
> 
> {code}
> The first two queries return 7 and 4 results, respectively (excluding the 
> matched document). The third query returns 7 results, as one would expect.




[jira] [Comment Edited] (SOLR-7571) Return metrics with update requests to allow clients to self-throttle

2015-06-02 Thread Otis Gospodnetic (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-7571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14570150#comment-14570150
 ] 

Otis Gospodnetic edited comment on SOLR-7571 at 6/3/15 2:25 AM:


[~erickerickson] - Not sure yet, would have to see which numbers you'd end up 
having, but I'm guessing you could construct MBean names in such a way that the 
name of the leader would be a part of its name (think about parsing, allowed 
vs. forbidden chars!) and the metric you want to return would be the MBean 
value.

Now that I think about this, you should also probably think about how the size 
of the Solr response would be affected by more info being added to every 
response and how much that would affect the client that has to process it.  
Providing this via JMX, which does not stand between client and server on every 
request, and is checked independently of search requests, may in some ways be 
better.
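
On the "allowed vs. forbidden chars" point, one way to build such a name is to quote the leader name with `ObjectName.quote`, which is part of the standard `javax.management` API. The sketch below is only an illustration: the domain `solr.sketch`, the `type=updateMetrics` key, and the class name are made up for the example and are not existing Solr MBean names.

```java
import javax.management.MalformedObjectNameException;
import javax.management.ObjectName;

/** Hypothetical sketch: embeds a leader name in an MBean ObjectName. */
final class LeaderMBeanNameSketch {

    /** Builds an ObjectName whose "leader" key holds the quoted leader name. */
    static ObjectName forLeader(String leaderName) {
        try {
            // quote() escapes characters such as ',', '=', ':' and '*' that
            // would otherwise break ObjectName parsing.
            return new ObjectName(
                "solr.sketch:type=updateMetrics,leader=" + ObjectName.quote(leaderName));
        } catch (MalformedObjectNameException e) {
            throw new IllegalArgumentException("Cannot build MBean name for " + leaderName, e);
        }
    }

    /** Recovers the original leader name from an ObjectName built above. */
    static String leaderOf(ObjectName name) {
        return ObjectName.unquote(name.getKeyProperty("leader"));
    }
}
```

A monitoring tool can then read the leader back with `leaderOf` instead of parsing the name by hand.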


was (Author: otis):
[~erickerickson] - Not sure yet, would have to see which numbers you'd end up 
having, but I'm guessing you could construct MBean names in such a way that the 
name of the leader would be a part of its name (think about parsing, allowed 
vs. forbidden chars!) and the metric you want to return would be the MBean 
value.

> Return metrics with update requests to allow clients to self-throttle
> -
>
> Key: SOLR-7571
> URL: https://issues.apache.org/jira/browse/SOLR-7571
> Project: Solr
>  Issue Type: Improvement
>Affects Versions: 4.10.3
>Reporter: Erick Erickson
>Assignee: Erick Erickson
>
> I've assigned this to myself to keep track of it, anyone who wants please 
> feel free to take this.
> I've recently seen a setup with 10 shards and 4 replicas. The SolrJ client 
> (and post.jar for json files for that matter) firehose updates (150 separate 
> threads in total) at Solr. Eventually, replicas (not leaders) go into 
> recovery and the state cascades and eventually the entire cluster becomes 
> unusable. SOLR-5850 delays the behavior, but it still occurs. There are no 
> errors in the follower's logs; this is leader-initiated recovery because of a 
> timeout.
> I think the root problem is that the client is just sending too many requests 
> to the cluster, and ConcurrentUpdateSolrClient/Server (used by the leader to 
> distribute update requests to all the followers) (this was observed in Solr 
> 4.10.3+).  I see thread counts of 500+ when this happens.
> So assuming that this is the root cause, the obvious "cure" is "don't index 
> that fast". This is unsatisfactory since "that fast" is variable; the only 
> recourse is to set that threshold low enough that the Solr cluster isn't 
> being driven as fast as it can be.
> We should provide some mechanism for having the client throttle itself. The 
> number of outstanding update threads is one possibility. The client could 
> then slow down sending updates to Solr. 
> I'm not sure there's a good way to deal with this on the server. Once the 
> timeout is encountered, you don't know whether the doc has actually been 
> indexed on the follower (actually, in this case it _is_ indexed, it just takes 
> a while). Ideally we'd just manage it all magically, but an alternative to 
> let clients dynamically throttle themselves seems do-able.
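
A minimal sketch of what client-side self-throttling could look like, assuming the server reported its number of outstanding update threads with each update response. No such field exists in Solr today as far as this thread shows; `SelfThrottleSketch`, the threshold, and the doubling/halving policy are all assumptions for illustration.

```java
/** Hypothetical client-side throttle driven by a server-reported thread count. */
final class SelfThrottleSketch {
    private final int threshold;       // outstanding threads considered "too many"
    private final long maxDelayMillis; // upper bound for the pause between updates
    private long delayMillis;          // current pause, starts at zero

    SelfThrottleSketch(int threshold, long maxDelayMillis) {
        this.threshold = threshold;
        this.maxDelayMillis = maxDelayMillis;
    }

    /** Updates and returns the pause to apply before sending the next update. */
    long onResponse(int outstandingUpdateThreads) {
        if (outstandingUpdateThreads > threshold) {
            // Server is falling behind: back off exponentially, up to the cap.
            delayMillis = Math.min(maxDelayMillis, Math.max(1L, delayMillis * 2));
        } else {
            // Server has headroom: recover gradually toward full speed.
            delayMillis /= 2;
        }
        return delayMillis;
    }
}
```

The client would sleep for the returned number of milliseconds before its next update, so "that fast" adapts to the cluster instead of being a fixed, conservative setting.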




[jira] [Commented] (SOLR-7571) Return metrics with update requests to allow clients to self-throttle

2015-06-02 Thread Otis Gospodnetic (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-7571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14570150#comment-14570150
 ] 

Otis Gospodnetic commented on SOLR-7571:


[~erickerickson] - Not sure yet, would have to see which numbers you'd end up 
having, but I'm guessing you could construct MBean names in such a way that the 
name of the leader would be a part of its name (think about parsing, allowed 
vs. forbidden chars!) and the metric you want to return would be the MBean 
value.

> Return metrics with update requests to allow clients to self-throttle
> -
>
> Key: SOLR-7571
> URL: https://issues.apache.org/jira/browse/SOLR-7571
> Project: Solr
>  Issue Type: Improvement
>Affects Versions: 4.10.3
>Reporter: Erick Erickson
>Assignee: Erick Erickson
>
> I've assigned this to myself to keep track of it, anyone who wants please 
> feel free to take this.
> I've recently seen a setup with 10 shards and 4 replicas. The SolrJ client 
> (and post.jar for json files for that matter) firehose updates (150 separate 
> threads in total) at Solr. Eventually, replicas (not leaders) go into 
> recovery and the state cascades and eventually the entire cluster becomes 
> unusable. SOLR-5850 delays the behavior, but it still occurs. There are no 
> errors in the follower's logs; this is leader-initiated recovery because of a 
> timeout.
> I think the root problem is that the client is just sending too many requests 
> to the cluster, and ConcurrentUpdateSolrClient/Server (used by the leader to 
> distribute update requests to all the followers) (this was observed in Solr 
> 4.10.3+).  I see thread counts of 500+ when this happens.
> So assuming that this is the root cause, the obvious "cure" is "don't index 
> that fast". This is unsatisfactory since "that fast" is variable; the only 
> recourse is to set that threshold low enough that the Solr cluster isn't 
> being driven as fast as it can be.
> We should provide some mechanism for having the client throttle itself. The 
> number of outstanding update threads is one possibility. The client could 
> then slow down sending updates to Solr. 
> I'm not sure there's a good way to deal with this on the server. Once the 
> timeout is encountered, you don't know whether the doc has actually been 
> indexed on the follower (actually, in this case it _is_ indexed, it just takes 
> a while). Ideally we'd just manage it all magically, but an alternative to 
> let clients dynamically throttle themselves seems do-able.




[jira] [Commented] (SOLR-7571) Return metrics with update requests to allow clients to self-throttle

2015-06-02 Thread Otis Gospodnetic (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-7571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14569459#comment-14569459
 ] 

Otis Gospodnetic commented on SOLR-7571:


If you are going to be keeping track of any (new) metrics around this, in 
addition to possibly returning them to clients, please expose them via JMX, so 
monitoring tools can expose what is going on with the Solr cluster, too.  This 
can then trigger alert events, and alert events can trigger actions, such as 
reducing the indexing rate.

> Return metrics with update requests to allow clients to self-throttle
> -
>
> Key: SOLR-7571
> URL: https://issues.apache.org/jira/browse/SOLR-7571
> Project: Solr
>  Issue Type: Improvement
>Affects Versions: 4.10.3
>Reporter: Erick Erickson
>Assignee: Erick Erickson
>
> I've assigned this to myself to keep track of it, anyone who wants please 
> feel free to take this.
> I've recently seen a setup with 10 shards and 4 replicas. The SolrJ client 
> (and post.jar for json files for that matter) firehose updates (150 separate 
> threads in total) at Solr. Eventually, replicas (not leaders) go into 
> recovery and the state cascades and eventually the entire cluster becomes 
> unusable. SOLR-5850 delays the behavior, but it still occurs. There are no 
> errors in the follower's logs; this is leader-initiated recovery because of a 
> timeout.
> I think the root problem is that the client is just sending too many requests 
> to the cluster, and ConcurrentUpdateSolrClient/Server (used by the leader to 
> distribute update requests to all the followers) (this was observed in Solr 
> 4.10.3+).  I see thread counts of 500+ when this happens.
> So assuming that this is the root cause, the obvious "cure" is "don't index 
> that fast". This is unsatisfactory since "that fast" is variable; the only 
> recourse is to set that threshold low enough that the Solr cluster isn't 
> being driven as fast as it can be.
> We should provide some mechanism for having the client throttle itself. The 
> number of outstanding update threads is one possibility. The client could 
> then slow down sending updates to Solr. 
> I'm not sure there's a good way to deal with this on the server. Once the 
> timeout is encountered, you don't know whether the doc has actually been 
> indexed on the follower (actually, in this case it _is_ indexed, it just takes 
> a while). Ideally we'd just manage it all magically, but an alternative to 
> let clients dynamically throttle themselves seems do-able.




[jira] [Updated] (SOLR-7457) Make DirectoryFactory publishing MBeanInfo extensible

2015-04-23 Thread Otis Gospodnetic (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-7457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Otis Gospodnetic updated SOLR-7457:
---
Labels: metrics  (was: )

> Make DirectoryFactory publishing MBeanInfo extensible
> -
>
> Key: SOLR-7457
> URL: https://issues.apache.org/jira/browse/SOLR-7457
> Project: Solr
>  Issue Type: Improvement
>  Components: JMX
>Affects Versions: 5.0
>Reporter: Mike Drob
>  Labels: metrics
> Fix For: Trunk, 5.2
>
> Attachments: SOLR-7457.patch
>
>
> In SOLR-6766, we added JMX to the HdfsDirectoryFactory. However, the 
> implementation is pretty brittle and difficult to extend.
> It is conceivable that any implementation of DirectoryFactory might have 
> MInfoBeans that it would like to expose, so we should explicitly accommodate 
> that instead of relying on a side effect of the SolrResourceLoader's 
> behaviour.




[jira] [Updated] (SOLR-7458) Expose HDFS Block Locality Metrics

2015-04-23 Thread Otis Gospodnetic (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-7458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Otis Gospodnetic updated SOLR-7458:
---
Labels: metrics  (was: )

> Expose HDFS Block Locality Metrics
> --
>
> Key: SOLR-7458
> URL: https://issues.apache.org/jira/browse/SOLR-7458
> Project: Solr
>  Issue Type: Improvement
>  Components: SolrCloud
>Reporter: Mike Drob
>  Labels: metrics
> Attachments: SOLR-7458.patch
>
>
> We should publish block locality metrics when using HDFS.




[jira] [Commented] (SOLR-7344) Use two thread pools, one for internal requests and one for external, to avoid distributed deadlock and decrease the number of threads that need to be created.

2015-04-21 Thread Otis Gospodnetic (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-7344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14505658#comment-14505658
 ] 

Otis Gospodnetic commented on SOLR-7344:


[~hgadre] - didn't check the patch, but does that mean that we will now be 
able to see request metrics (counts, latencies) for internal vs. external 
requests separately?  That would be awesome because current metrics don't make 
this distinction.

> Use two thread pools, one for internal requests and one for external, to 
> avoid distributed deadlock and decrease the number of threads that need to be 
> created.
> ---
>
> Key: SOLR-7344
> URL: https://issues.apache.org/jira/browse/SOLR-7344
> Project: Solr
>  Issue Type: Improvement
>  Components: SolrCloud
>Reporter: Mark Miller
> Attachments: SOLR-7344.patch
>
>





[jira] [Updated] (SOLR-2867) Problem Wtih solr Score Display

2015-04-08 Thread Otis Gospodnetic (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-2867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Otis Gospodnetic updated SOLR-2867:
---
Labels:   (was: patch)

> Problem Wtih solr Score Display
> ---
>
> Key: SOLR-2867
> URL: https://issues.apache.org/jira/browse/SOLR-2867
> Project: Solr
>  Issue Type: Bug
>  Components: SearchComponents - other
>Affects Versions: 3.1
> Environment: Linux and Mysql
>Reporter: Pragyanjeet Rout
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> We are firing a solr query and checking its relevancy score.
> But the problem with the relevancy score is that for some results the value 
> of the score is being truncated.
> Example: I have a query as below
> http://localhost:8983/solr/mywork/select/?q=( contractLength:12 speedScore:[4 
> TO 7] dataScore:[2 TO *])&fq=( ( connectionType:"Cable" 
> connectionType:"Naked")AND ( monthlyCost:[* TO *])AND ( speedScore:[4 TO 
> *])AND ( dataScore:[2 TO 
> *]))&version=2.2&start=0&rows=500&indent=on&sort=score desc, planType asc, 
> monthlyCost1 asc, monthlyCost2  asc
> Below is the xml returned from solr:
> 
> 3.6897283
> 12
> 3
> ABC
> 120.9
> 7
> 
> 
> 3.689728
> 12
> 2
> DEF
> 49.95
> 6
> 
> I have used "debugQuery=true" in the query and I saw that solr is calculating 
> the correct score (PSB) but somehow it is truncating the last digit, i.e. "3", 
> from the second result.
> Because of this my ranking order gets disturbed and I get wrong results while 
> displaying.
> 
> 3.6897283 = (MATCH) sum of:3.1476827 = (MATCH) weight(contractLength:€#0;#12; 
> in 51), product of:0.92363054 = queryWeight(contractLength:€#0;#12;), product 
> of:3.4079456 = idf(docFreq=8, maxDocs=100)  0.27102268 = queryNorm 3.4079456 
> = (MATCH) fieldWeight(contractLength:€#0;#12; in 51), product of:1.0 = 
> tf(termFreq(contractLength:€#0;#12;)=1) 3.4079456 = idf(docFreq=8, 
> maxDocs=100)
>   1.0 = fieldNorm(field=contractLength, doc=51)  0.27102268 = (MATCH) 
> ConstantScore(speedScore:[€#0;#4; TO *]), product of:
> 1.0 = boost  0.27102268 = queryNorm  0.27102268 = (MATCH) 
> ConstantScore(dataScore:[€#0;#2; TO *]), product of: 1.0 = boost   0.27102268 
> = queryNorm
> 
> 
> 3.6897283 = (MATCH) sum of: 3.1476827 = (MATCH) 
> weight(contractLength:€#0;#12; in 97), product of: 0.92363054 = 
> queryWeight(contractLength:€#0;#12;), product of: 3.4079456 = idf(docFreq=8, 
> maxDocs=100)  0.27102268 = queryNorm 3.4079456 = (MATCH) 
> fieldWeight(contractLength:€#0;#12; in 97), product of: 1.0 = 
> tf(termFreq(contractLength:€#0;#12;)=1) 3.4079456 = idf(docFreq=8, 
> maxDocs=100)  1.0 = fieldNorm(field=contractLength, doc=97)  0.27102268 = 
> (MATCH) ConstantScore(speedScore:[€#0;#4; TO *]), product of: 1.0 = boost
> 0.27102268 = queryNorm  0.27102268 = (MATCH) 
> ConstantScore(dataScore:[€#0;#2; TO *]), product of:1.0 = boost
> 0.27102268 = queryNorm
> 
> Please explain the above behaviour of solr.




[jira] [Issue Comment Deleted] (SOLR-7319) Workaround the "Four Month Bug" causing GC pause problems

2015-03-27 Thread Otis Gospodnetic (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-7319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Otis Gospodnetic updated SOLR-7319:
---
Comment: was deleted

(was: bq. tools like jstat will no longer function

Sounds problematic, no?)

> Workaround the "Four Month Bug" causing GC pause problems
> -
>
> Key: SOLR-7319
> URL: https://issues.apache.org/jira/browse/SOLR-7319
> Project: Solr
>  Issue Type: Bug
>  Components: scripts and tools
>Affects Versions: 5.0
>Reporter: Shawn Heisey
>Assignee: Shawn Heisey
> Fix For: 5.1
>
> Attachments: SOLR-7319.patch, SOLR-7319.patch, SOLR-7319.patch
>
>
> A Twitter engineer found a bug in the JVM that contributes to GC pause 
> problems:
> http://www.evanjones.ca/jvm-mmap-pause.html
> Problem summary (in case the blog post disappears):  The JVM calculates 
> statistics on things like garbage collection and writes them to a file in the 
> temp directory using MMAP.  If there is a lot of other MMAP write activity, 
> which is precisely how Lucene accomplishes indexing and merging, it can 
> result in a GC pause because the mmap write to the temp file is delayed.
> We should implement the workaround in the solr start scripts (disable 
> creation of the mmap statistics tempfile) and document the impact in 
> CHANGES.txt.




[jira] [Commented] (SOLR-7319) Workaround the "Four Month Bug" causing GC pause problems

2015-03-27 Thread Otis Gospodnetic (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-7319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14384472#comment-14384472
 ] 

Otis Gospodnetic commented on SOLR-7319:


bq. tools like jstat will no longer function

Sounds problematic, no?

> Workaround the "Four Month Bug" causing GC pause problems
> -
>
> Key: SOLR-7319
> URL: https://issues.apache.org/jira/browse/SOLR-7319
> Project: Solr
>  Issue Type: Bug
>  Components: scripts and tools
>Affects Versions: 5.0
>Reporter: Shawn Heisey
>Assignee: Shawn Heisey
> Fix For: 5.1
>
> Attachments: SOLR-7319.patch, SOLR-7319.patch, SOLR-7319.patch
>
>
> A Twitter engineer found a bug in the JVM that contributes to GC pause 
> problems:
> http://www.evanjones.ca/jvm-mmap-pause.html
> Problem summary (in case the blog post disappears):  The JVM calculates 
> statistics on things like garbage collection and writes them to a file in the 
> temp directory using MMAP.  If there is a lot of other MMAP write activity, 
> which is precisely how Lucene accomplishes indexing and merging, it can 
> result in a GC pause because the mmap write to the temp file is delayed.
> We should implement the workaround in the solr start scripts (disable 
> creation of the mmap statistics tempfile) and document the impact in 
> CHANGES.txt.




[jira] [Commented] (SOLR-7296) Reconcile facetting implementations

2015-03-25 Thread Otis Gospodnetic (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-7296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14381276#comment-14381276
 ] 

Otis Gospodnetic commented on SOLR-7296:


IIRC it requires a sidecar index, which is probably its main negative.

> Reconcile facetting implementations
> ---
>
> Key: SOLR-7296
> URL: https://issues.apache.org/jira/browse/SOLR-7296
> Project: Solr
>  Issue Type: Task
>  Components: faceting
>Reporter: Steve Molloy
>
> SOLR-7214 introduced a new way of controlling faceting, and the umbrella 
> SOLR-6348 brings a lot of improvements in facet functionality, namely around 
> pivots. Both make a lot of sense from a user perspective, but currently have 
> completely different implementations. With the analytics components, this 
> makes 3 implementations of the same logic, which are bound to behave 
> differently as time goes by. We should reconcile all implementations to ease 
> maintenance and offer consistent behaviour no matter how parameters are 
> passed to the API.




[jira] [Commented] (SOLR-7121) Solr nodes should go down based on configurable thresholds and not rely on resource exhaustion

2015-03-02 Thread Otis Gospodnetic (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-7121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14344566#comment-14344566
 ] 

Otis Gospodnetic commented on SOLR-7121:


bq. I will surely add these metrics to JMX but can we handle that in a 
follow-up ticket to this one?

Sure, if that works for you.

> Solr nodes should go down based on configurable thresholds and not rely on 
> resource exhaustion
> --
>
> Key: SOLR-7121
> URL: https://issues.apache.org/jira/browse/SOLR-7121
> Project: Solr
>  Issue Type: New Feature
>Reporter: Sachin Goyal
> Attachments: SOLR-7121.patch, SOLR-7121.patch, SOLR-7121.patch, 
> SOLR-7121.patch, SOLR-7121.patch
>
>
> Currently, there is no way to control when a Solr node goes down.
> If the server is having high GC pauses or too many threads or is just getting 
> too many queries due to some bad load-balancer, the cores in the machine keep 
> on serving unless they exhaust the machine's resources and everything comes 
> to a stall.
> Such a slow-dying core can affect other cores as well by taking huge time to 
> serve their distributed queries.
> There should be a way to specify some threshold values beyond which the 
> targeted core can detect its ill-health and proactively go down to recover.
> When the load improves, the core should come up automatically.




[jira] [Commented] (SOLR-7082) Streaming Aggregation for SolrCloud

2015-03-01 Thread Otis Gospodnetic (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-7082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14342655#comment-14342655
 ] 

Otis Gospodnetic commented on SOLR-7082:


Thanks Joel.  Re 1) -- but conceptually and functionally speaking, would you 
say this is more or less the same as ES aggregations?

> Streaming Aggregation for SolrCloud
> ---
>
> Key: SOLR-7082
> URL: https://issues.apache.org/jira/browse/SOLR-7082
> Project: Solr
>  Issue Type: New Feature
>  Components: SolrCloud
>Reporter: Joel Bernstein
> Fix For: Trunk, 5.1
>
> Attachments: SOLR-7082.patch, SOLR-7082.patch
>
>
> This issue provides a general purpose streaming aggregation framework for 
> SolrCloud. An overview of how it works can be found at this link:
> http://heliosearch.org/streaming-aggregation-for-solrcloud/
> This functionality allows SolrCloud users to perform operations that were 
> typically done using map/reduce or a parallel computing platform.
> Here is a brief explanation of how the framework works:
> There is a new Solrj *io* package found in: *org.apache.solr.client.solrj.io*
> Key classes:
> *Tuple*: Abstracts a document in a search result as a Map of key/value pairs.
> *TupleStream*: is the base class for all of the streams. Abstracts search 
> results as a stream of Tuples.
> *SolrStream*: connects to a single Solr instance. You call the read() method 
> to iterate over the Tuples.
> *CloudSolrStream*: connects to a SolrCloud collection and merges the results 
> based on the sort param. The merge takes place in CloudSolrStream itself.
> *Decorator Streams*: wrap other streams to gather *Metrics* on streams and 
> *transform* the streams. Some examples are the MetricStream, RollupStream, 
> GroupByStream, UniqueStream, MergeJoinStream, HashJoinStream, MergeStream, 
> FilterStream.
> *Going parallel with the ParallelStream and  "Worker Collections"*
> The io package also contains the *ParallelStream*, which wraps a TupleStream 
> and sends it to N worker nodes. The workers are chosen from a SolrCloud 
> collection. These "Worker Collections" don't have to hold any data, they can 
> just be used to execute TupleStreams.
> *The StreamHandler*
> The Worker nodes have a new RequestHandler called the *StreamHandler*. The 
> ParallelStream serializes a TupleStream, before it is opened, and sends it to 
> the StreamHandler on the Worker Nodes.
> The StreamHandler on each Worker node deserializes the TupleStream, opens the 
> stream, iterates the tuples and streams them back to the ParallelStream. The 
> ParallelStream performs the final merge of Metrics and can be wrapped by 
> other Streams to handle the final merged TupleStream.
> *Sorting and Partitioning search results (Shuffling)*
> Each Worker node is shuffled 1/N of the document results. There is a 
> "partitionKeys" parameter that can be included with each TupleStream to 
> ensure that Tuples with the same partitionKeys are shuffled to the same 
> Worker. The actual partitioning is done with a filter query using the 
> HashQParserPlugin. The DocSets from the HashQParserPlugin can be cached in 
> the filter cache which provides extremely high performance hash partitioning. 
> Many of the stream transformations rely on the sort order of the TupleStreams 
> (GroupByStream, MergeJoinStream, UniqueStream, FilterStream etc.). To 
> accommodate this the search results can be sorted by specific keys. The 
> "/export" handler can be used to sort entire result sets efficiently.
> By specifying the sort order of the results and the partition keys, documents 
> will be sorted and partitioned inside of the search engine. So when the 
> tuples hit the network they are already sorted, partitioned and headed 
> directly to the correct worker node.
> *Extending The Framework*
> To extend the framework you create new TupleStream Decorators, that gather 
> custom metrics or perform custom stream transformations.
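
The shuffling and merging described in the issue can be sketched in a few lines. This is only an illustration in Python, not the Solr implementation: the helper names `worker_for`, `partition` and `merge_sorted` are made up, CRC32 is an assumed stand-in for whatever hash the HashQParserPlugin really uses, and plain dicts stand in for Tuples.

```python
import heapq
import zlib


def worker_for(tup, partition_keys, num_workers):
    """Pick a worker index from the values of the partition keys (hypothetical helper)."""
    key = "|".join(str(tup[k]) for k in partition_keys)
    # CRC32 is only a stand-in for a stable hash; it is an assumption, not Solr's hash.
    return zlib.crc32(key.encode("utf-8")) % num_workers


def partition(tuples, partition_keys, num_workers):
    """Split tuples into num_workers lists, preserving input order within each list."""
    buckets = [[] for _ in range(num_workers)]
    for tup in tuples:
        buckets[worker_for(tup, partition_keys, num_workers)].append(tup)
    return buckets


def merge_sorted(streams, sort_key):
    """Merge streams that are each already sorted on sort_key (k-way merge)."""
    return list(heapq.merge(*streams, key=lambda t: t[sort_key]))


# Results arrive sorted on the partition key, as the description requires.
tuples = sorted(
    [{"user": u, "price": p} for u, p in
     [("b", 3), ("a", 5), ("c", 1), ("a", 2), ("b", 7), ("c", 4)]],
    key=lambda t: t["user"],
)
buckets = partition(tuples, ["user"], num_workers=3)
merged = merge_sorted(buckets, "user")
```

Because every tuple with the same `user` value hashes to the same bucket and each bucket keeps the incoming sort order, a worker could group or de-duplicate its share without seeing the other workers' data, and the final merge only has to interleave sorted inputs.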




[jira] [Commented] (SOLR-6832) Queries be served locally rather than being forwarded to another replica

2015-02-25 Thread Otis Gospodnetic (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-6832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14337817#comment-14337817
 ] 

Otis Gospodnetic commented on SOLR-6832:


Hmm, I didn't examine the patch or try this functionality, but based on 
your description... here are some comments.

bq. This helps only where there is over-sharding.

That in itself should be avoided whenever possible in my experience. Overhead 
around memory and communication during querying.  Could be related to your 
deadlocks.  Or maybe you do a ton more writes so distributing writes across all 
nodes is worth the query-time overhead of over-sharding?

bq. Since all queries were sent to other nodes, we were getting hit with 
distributed deadlocks more often when one or more nodes were slow/overloaded.

Hmm... if that is truly happening, then isn't that a separate issue to be 
fixed?

bq. So this patch is a slight optimization and a reduction of likelihood of 
getting bogged down by other slow nodes when the parent query node has the core.

But VERY slight, right? (hence my Q about whether you've quantified the 
improvement from this patch)

Intuitively, querying the local data makes sense - why would one not do that if 
the data is right there?  I just wonder how much you really benefit if you are 
saving just 1 (or a very small number of) network calls in a request that ends 
up dispatching NN requests to NN other nodes in the cluster.

> Queries be served locally rather than being forwarded to another replica
> 
>
> Key: SOLR-6832
> URL: https://issues.apache.org/jira/browse/SOLR-6832
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
>Affects Versions: 4.10.2
>Reporter: Sachin Goyal
>Assignee: Timothy Potter
> Fix For: Trunk, 5.1
>
> Attachments: SOLR-6832.patch, SOLR-6832.patch, SOLR-6832.patch, 
> SOLR-6832.patch
>
>
> Currently, I see that code flow for a query in SolrCloud is as follows:
> For distributed query:
> SolrCore -> SearchHandler.handleRequestBody() -> HttpShardHandler.submit()
> For non-distributed query:
> SolrCore -> SearchHandler.handleRequestBody() -> QueryComponent.process()
> \\
> \\
> \\
> For a distributed query, the request is always sent to all the shards even if 
> the originating SolrCore (handling the original distributed query) is a 
> replica of one of the shards.
> If the original Solr-Core can check itself before sending http requests for 
> any shard, we can probably save some network hopping and gain some 
> performance.
> \\
> \\
> We can change SearchHandler.handleRequestBody() or HttpShardHandler.submit() 
> to fix this behavior (most likely the former and not the latter).




[jira] [Commented] (SOLR-7110) Optimize JavaBinCodec to minimize string Object creation

2015-02-25 Thread Otis Gospodnetic (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-7110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14337788#comment-14337788
 ] 

Otis Gospodnetic commented on SOLR-7110:


Possibly better ways to test:
* use something like SPM or VisualVM or anything that gives you visualization 
of:
** various memory pools (size + utilization) in the heap
** GC activity (frequency, avg time, max time, size, etc.)
** CPU usage
* enable GC logging, grep for FullGC, or run jstat

Do all of this over time - not just a few minutes, but longer runs before the 
patch vs. after the patch.  Then you can really see what difference this makes.

> Optimize JavaBinCodec to minimize string Object creation
> 
>
> Key: SOLR-7110
> URL: https://issues.apache.org/jira/browse/SOLR-7110
> Project: Solr
>  Issue Type: Improvement
>Reporter: Noble Paul
>Assignee: Noble Paul
>Priority: Minor
> Attachments: SOLR-7110.patch, SOLR-7110.patch
>
>
> In JavaBinCodec we already optimize string creation if strings are 
> repeated in the same payload. If we use a cache it is possible to avoid 
> string creation across objects as well.




[jira] [Commented] (SOLR-6832) Queries be served locally rather than being forwarded to another replica

2015-02-25 Thread Otis Gospodnetic (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-6832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14337776#comment-14337776
 ] 

Otis Gospodnetic commented on SOLR-6832:


bq. The performance gain increases if coresPerMachine is > 1 and a single JVM 
has cores from 'k' shards.

Ever managed to measure how much this feature helps in various scenarios?

bq. For a distributed query, the request is always sent to all the shards even 
if the originating SolrCore (handling the original distributed query) is a 
replica of one of the shards.  If the original Solr-Core can check itself 
before sending http requests for any shard, we can probably save some network 
hopping and gain some performance.

This sounds like it saves only N network calls out of M, where M > N, N is 
the number of replicas that could be queried locally, and M is the total 
number of primary shards in the cluster that are to be queried.  Is this 
correct?

So say there are 20 shards spread evenly over 20 nodes (i.e., 1 shard per node) 
and a query request comes in: the node that got the request will send 19 
requests to the remaining 19 nodes and thus save just one network trip by 
querying a local shard?  I must be missing something...
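
The arithmetic behind the 20-shard example can be written out. A small sketch, assuming one request per shard for each distributed query and assuming that a shard with a replica on the receiving node is served without a network call; the function names are made up for illustration.

```python
def remote_requests(total_shards, local_shards):
    """Requests that still cross the network when local replicas are queried in-process."""
    if not 0 <= local_shards <= total_shards:
        raise ValueError("local_shards must be between 0 and total_shards")
    return total_shards - local_shards


def fraction_saved(total_shards, local_shards):
    """Share of per-shard requests that no longer need a network hop."""
    return local_shards / total_shards


# 20 shards, 1 shard per node: the receiving node holds exactly one shard.
remote = remote_requests(20, 1)
saved = fraction_saved(20, 1)
print(remote, saved)
```

Under the same assumptions, packing several shards onto each node raises the saving: with 20 shards over 5 nodes (4 per node) the receiving node would skip 4 of 20 requests, which is the coresPerMachine > 1 case mentioned in the quoted comment.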

> Queries be served locally rather than being forwarded to another replica
> 
>
> Key: SOLR-6832
> URL: https://issues.apache.org/jira/browse/SOLR-6832
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
>Affects Versions: 4.10.2
>Reporter: Sachin Goyal
>Assignee: Timothy Potter
> Fix For: Trunk, 5.1
>
> Attachments: SOLR-6832.patch, SOLR-6832.patch, SOLR-6832.patch, 
> SOLR-6832.patch
>
>
> Currently, I see that code flow for a query in SolrCloud is as follows:
> For distributed query:
> SolrCore -> SearchHandler.handleRequestBody() -> HttpShardHandler.submit()
> For non-distributed query:
> SolrCore -> SearchHandler.handleRequestBody() -> QueryComponent.process()
> \\
> \\
> \\
> For a distributed query, the request is always sent to all the shards even if 
> the originating SolrCore (handling the original distributed query) is a 
> replica of one of the shards.
> If the original Solr-Core can check itself before sending http requests for 
> any shard, we can probably save some network hopping and gain some 
> performance.
> \\
> \\
> We can change SearchHandler.handleRequestBody() or HttpShardHandler.submit() 
> to fix this behavior (most likely the former and not the latter).




[jira] [Commented] (SOLR-7121) Solr nodes should go down based on configurable thresholds and not rely on resource exhaustion

2015-02-25 Thread Otis Gospodnetic (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-7121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14337766#comment-14337766
 ] 

Otis Gospodnetic commented on SOLR-7121:


[~sachingoyal] - would it make sense to expose all metrics you rely on via JMX? 
 That way monitoring tools would be able to extract this data and graph it 
which, in addition to logs, would help people do post-mortem, understand what 
happened, which metric(s) went up or down, what it/their historical values 
were, maybe set alerts based on that, etc.

> Solr nodes should go down based on configurable thresholds and not rely on 
> resource exhaustion
> --
>
> Key: SOLR-7121
> URL: https://issues.apache.org/jira/browse/SOLR-7121
> Project: Solr
>  Issue Type: New Feature
>Reporter: Sachin Goyal
> Attachments: SOLR-7121.patch, SOLR-7121.patch, SOLR-7121.patch
>
>
> Currently, there is no way to control when a Solr node goes down.
> If the server is having high GC pauses or too many threads or is just getting 
> too many queries due to some bad load-balancer, the cores in the machine keep 
> on serving unless they exhaust the machine's resources and everything comes 
> to a stall.
> Such a slow-dying core can affect other cores as well by taking huge time to 
> serve their distributed queries.
> There should be a way to specify some threshold values beyond which the 
> targeted core can detect its ill-health and proactively go down to recover.
> When the load improves, the core should come up automatically.




[jira] [Updated] (SOLR-7090) Cross collection join

2015-02-25 Thread Otis Gospodnetic (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-7090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Otis Gospodnetic updated SOLR-7090:
---
Issue Type: New Feature  (was: Bug)

> Cross collection join
> -
>
> Key: SOLR-7090
> URL: https://issues.apache.org/jira/browse/SOLR-7090
> Project: Solr
>  Issue Type: New Feature
>Reporter: Ishan Chattopadhyaya
> Fix For: 5.1
>
> Attachments: SOLR-7090.patch
>
>
> Although SOLR-4905 supports joins across collections in Cloud mode, there are 
> limitations, (i) the secondary collection must be replicated at each node 
> where the primary collection has a replica, (ii) the secondary collection 
> must be singly sharded.
> This issue explores ideas/possibilities of cross collection joins, even 
> across nodes. This will be helpful for users who wish to maintain boosts or 
> signals in a secondary, more frequently updated collection, and perform query 
> time join of these boosts/signals with results from the primary collection.




[jira] [Resolved] (LUCENE-6288) FieldCacheRangeFilter missing from MIGRATE.html

2015-02-25 Thread Otis Gospodnetic (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-6288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Otis Gospodnetic resolved LUCENE-6288.
--
Resolution: Invalid

Please ask on the mailing list.

> FieldCacheRangeFilter missing from MIGRATE.html
> ---
>
> Key: LUCENE-6288
> URL: https://issues.apache.org/jira/browse/LUCENE-6288
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/other
>Affects Versions: 5.0
>Reporter: Torsten Krah
>
> Hi,
> I am searching for the {{FieldCacheRangeFilter}} - it's not mentioned in the 
> {{https://lucene.apache.org/core/5_0_0/MIGRATE.html}} document and not 
> mentioned in the Changelog. Where can I find it? If it is gone, is it 
> possible to mention this in the migration guide please, and how to cope with 
> that in 5.x?




[jira] [Commented] (SOLR-7136) Add an AutoPhrasing TokenFilter

2015-02-25 Thread Otis Gospodnetic (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-7136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14337734#comment-14337734
 ] 

Otis Gospodnetic commented on SOLR-7136:


[~tedsullivan] do you know / have you tested and compared this with SOLR-5379 
to the point where you can say that the functionality provided in this issue is 
a *superset* of SOLR-5379?  Or is that not the case?  Or maybe you didn't test 
and compare enough to be able to say?  Thanks.

> Add an AutoPhrasing TokenFilter
> ---
>
> Key: SOLR-7136
> URL: https://issues.apache.org/jira/browse/SOLR-7136
> Project: Solr
>  Issue Type: New Feature
>Reporter: Ted Sullivan
> Attachments: SOLR-7136.patch, SOLR-7136.patch
>
>
> Adds an 'autophrasing' token filter which is designed to enable noun phrases 
> that represent a single entity to be tokenized in a singular fashion. Adds 
> support for ManagedResources and Query parser auto-phrasing support given 
> LUCENE-2605.
> The rationale for this Token Filter and its use in solving the long standing 
> multi-term synonym problem in Lucene Solr has been documented online. 
> http://lucidworks.com/blog/automatic-phrase-tokenization-improving-lucene-search-precision-by-more-precise-linguistic-analysis/
> https://lucidworks.com/blog/solution-for-multi-term-synonyms-in-lucenesolr-using-the-auto-phrasing-tokenfilter/




[jira] [Commented] (SOLR-7147) Introduce new TrackingShardHandlerFactory for monitoring what requests are sent to shards during tests

2015-02-25 Thread Otis Gospodnetic (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-7147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14337723#comment-14337723
 ] 

Otis Gospodnetic commented on SOLR-7147:


Is this TrackingShardHandlerFactory really useful only for tests?  Wouldn't 
this be a useful debugging tool in general?

> Introduce new TrackingShardHandlerFactory for monitoring what requests are 
> sent to shards during tests
> --
>
> Key: SOLR-7147
> URL: https://issues.apache.org/jira/browse/SOLR-7147
> Project: Solr
>  Issue Type: Improvement
>Reporter: Hoss Man
> Attachments: SOLR-7147.patch, SOLR-7147.patch, SOLR-7147.patch, 
> SOLR-7147.patch, SOLR-7147.patch, SOLR-7147.patch
>
>
> This is an idea Shalin proposed as part of the testing for SOLR-7128...
> bq. I created a TrackingShardHandlerFactory which can record shard requests 
> sent from any node. There are a few helper methods to get requests by shard 
> and by purpose.
> ...
> bq. I will likely move the TrackingShardHandlerFactory into its own issue 
> because it is helpful for other distributed tests as well. I also need to 
> decouple it from the MiniSolrCloudCluster abstraction.




[jira] [Commented] (SOLR-7159) Add httpclient connection stats to JMX report

2015-02-25 Thread Otis Gospodnetic (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-7159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14337716#comment-14337716
 ] 

Otis Gospodnetic commented on SOLR-7159:


[~vamsee] - yes, please split those into multiple attributes - I think this is 
more common/standard AND, importantly, easier for monitoring tools to work 
with. Think of MBeans in JMX as the API.  For example, what if in the next Solr 
version somebody decides to add another value to that "{...}" string?  Various 
tools out there that parsed this might break.
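
To illustrate the fragility, here is a rough sketch of what a monitoring tool would have to do with a single composite string. The sample text is in the shape of the httpclient debug line quoted in the issue; the helper `parse_pool_stats` and the attribute names it produces are invented for this example and are not part of Solr or HttpClient.

```python
import re

# Sample in the shape of the httpclient debug line quoted in the issue description.
STATS = "[total kept alive: 254; route allocated: 100 of 100; total allocated: 462 of 1]"


def parse_pool_stats(text):
    """Split the composite stats string into separate named integers (hypothetical helper)."""
    pattern = re.compile(
        r"total kept alive: (\d+); "
        r"route allocated: (\d+) of (\d+); "
        r"total allocated: (\d+) of (\d+)"
    )
    match = pattern.search(text)
    if match is None:
        # Any change to the string layout lands here, which is the fragility in question.
        raise ValueError("unrecognized stats format")
    names = ("kept_alive", "route_allocated", "route_max", "total_allocated", "total_max")
    return dict(zip(names, map(int, match.groups())))


stats = parse_pool_stats(STATS)
```

If a later release inserted one more field into the string, this parser would raise instead of returning values, whereas separate numeric attributes could be read by name with no parsing at all.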

> Add httpclient connection stats to JMX report
> -
>
> Key: SOLR-7159
> URL: https://issues.apache.org/jira/browse/SOLR-7159
> Project: Solr
>  Issue Type: Improvement
>Affects Versions: 4.10.3
>Reporter: Vamsee Yarlagadda
>Priority: Minor
> Attachments: SOLR-7159.patch, SOLR-7159v2.patch, Screen Shot 
> 2015-02-25 at 2.05.34 PM.png, Screen Shot 2015-02-25 at 2.05.45 PM.png, 
> jmx-layout.png
>
>
> Currently, we are logging the stats of httpclient as part of debug level.
> bq. 2015-01-20 13:47:48,640 DEBUG 
> org.apache.http.impl.conn.PoolingClientConnectionManager: Connection request: 
> [route: {}->http://plh04.wil.csc.local:8983][total kept alive: 254; route 
> allocated: 100 of 100; total allocated: 462 of 1]
> Instead, it would be good to expose these metrics via JMX for easy checking.




[jira] [Commented] (SOLR-7082) Streaming Aggregation for SolrCloud

2015-02-23 Thread Otis Gospodnetic (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-7082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14333490#comment-14333490
 ] 

Otis Gospodnetic commented on SOLR-7082:


This looks really nice, Joel.  2 questions:
* this looks a lot like ES aggregations.  Have you maybe made any comparisons 
in terms of speed or memory footprint? (ES aggregations love heap)
* is this all going to land in Solr or will some of it remain in Heliosearch?


> Streaming Aggregation for SolrCloud
> ---
>
> Key: SOLR-7082
> URL: https://issues.apache.org/jira/browse/SOLR-7082
> Project: Solr
>  Issue Type: New Feature
>  Components: SolrCloud
>Reporter: Joel Bernstein
> Fix For: Trunk, 5.1
>
> Attachments: SOLR-7082.patch, SOLR-7082.patch
>
>
> This issue provides a general purpose streaming aggregation framework for 
> SolrCloud. An overview of how it works can be found at this link:
> http://heliosearch.org/streaming-aggregation-for-solrcloud/
> This functionality allows SolrCloud users to perform operations that were 
> typically done using map/reduce or a parallel computing platform.
> Here is a brief explanation of how the framework works:
> There is a new Solrj *io* package found in: *org.apache.solr.client.solrj.io*
> Key classes:
> *Tuple*: Abstracts a document in a search result as a Map of key/value pairs.
> *TupleStream*: is the base class for all of the streams. Abstracts search 
> results as a stream of Tuples.
> *SolrStream*: connects to a single Solr instance. You call the read() method 
> to iterate over the Tuples.
> *CloudSolrStream*: connects to a SolrCloud collection and merges the results 
> based on the sort param. The merge takes place in CloudSolrStream itself.
> *Decorator Streams*: wrap other streams to gather *Metrics* on streams and 
> *transform* the streams. Some examples are the MetricStream, RollupStream, 
> GroupByStream, UniqueStream, MergeJoinStream, HashJoinStream, MergeStream, 
> FilterStream.
> *Going parallel with the ParallelStream and  "Worker Collections"*
> The io package also contains the *ParallelStream*, which wraps a TupleStream 
> and sends it to N worker nodes. The workers are chosen from a SolrCloud 
> collection. These "Worker Collections" don't have to hold any data, they can 
> just be used to execute TupleStreams.
> *The StreamHandler*
> The Worker nodes have a new RequestHandler called the *StreamHandler*. The 
> ParallelStream serializes a TupleStream, before it is opened, and sends it to 
> the StreamHandler on the Worker Nodes.
> The StreamHandler on each Worker node deserializes the TupleStream, opens the 
> stream, iterates the tuples and streams them back to the ParallelStream. The 
> ParallelStream performs the final merge of Metrics and can be wrapped by 
> other Streams to handle the final merged TupleStream.
> *Sorting and Partitioning search results (Shuffling)*
> Each Worker node is shuffled 1/N of the document results. There is a 
> "partitionKeys" parameter that can be included with each TupleStream to 
> ensure that Tuples with the same partitionKeys are shuffled to the same 
> Worker. The actual partitioning is done with a filter query using the 
> HashQParserPlugin. The DocSets from the HashQParserPlugin can be cached in 
> the filter cache which provides extremely high performance hash partitioning. 
> Many of the stream transformations rely on the sort order of the TupleStreams 
> (GroupByStream, MergeJoinStream, UniqueStream, FilterStream, etc.). To 
> accommodate this, the search results can be sorted by specific keys. The 
> "/export" handler can be used to sort entire result sets efficiently.
> By specifying the sort order of the results and the partition keys, documents 
> will be sorted and partitioned inside of the search engine. So when the 
> tuples hit the network they are already sorted, partitioned and headed 
> directly to the correct worker node.
> *Extending The Framework*
> To extend the framework you create new TupleStream Decorators, that gather 
> custom metrics or perform custom stream transformations.
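
The partition-then-merge flow described in the issue can be pictured with a small sketch. This is a language-neutral illustration in Python, not the Solrj io API: the helper names (worker_for, partition, merged_stream) and the use of zlib.crc32 as the hash are assumptions made for the example. The issue itself says the real partitioning is done with a filter query using the HashQParserPlugin.

```python
import heapq
import zlib


def worker_for(tup, partition_keys, num_workers):
    """Pick a worker by hashing the values of the partition keys.

    zlib.crc32 is only a stand-in hash for this sketch.
    """
    key = "|".join(str(tup[k]) for k in partition_keys)
    return zlib.crc32(key.encode("utf-8")) % num_workers


def partition(tuples, partition_keys, num_workers):
    """Shuffle tuples so equal partition keys land on the same worker."""
    buckets = [[] for _ in range(num_workers)]
    for tup in tuples:
        buckets[worker_for(tup, partition_keys, num_workers)].append(tup)
    return buckets


def merged_stream(buckets, sort_key):
    """Each worker emits its tuples in sort order; merge them into one stream."""
    sorted_streams = [sorted(b, key=lambda t: t[sort_key]) for b in buckets]
    return heapq.merge(*sorted_streams, key=lambda t: t[sort_key])


docs = [
    {"id": 1, "user": "a", "price": 30},
    {"id": 2, "user": "b", "price": 10},
    {"id": 3, "user": "a", "price": 20},
    {"id": 4, "user": "c", "price": 40},
]
buckets = partition(docs, ["user"], num_workers=2)
merged = list(merged_stream(buckets, "price"))
```

Because every tuple with the same "user" value goes to the same bucket, a per-worker group-by or unique step would see all of the tuples it needs, and the final merge only has to interleave streams that are already sorted.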




[jira] [Commented] (SOLR-5379) Query-time multi-word synonym expansion

2015-02-03 Thread Otis Gospodnetic (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-5379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14304161#comment-14304161
 ] 

Otis Gospodnetic commented on SOLR-5379:


bq. I am sure there is, but there are no working patches for 4.10 or 5.x thus 
far.

Right.  What I was trying to ask is whether any of the active Solr committers 
wants to commit this.  If there is no will to commit, I'd rather keep things 
simple on our end and ignore this issue.  But if there is a will to commit, I'd 
love to see this in Solr, as would 30+ other watchers, I imagine.

> Query-time multi-word synonym expansion
> ---
>
> Key: SOLR-5379
> URL: https://issues.apache.org/jira/browse/SOLR-5379
> Project: Solr
>  Issue Type: Improvement
>  Components: query parsers
>Reporter: Tien Nguyen Manh
>  Labels: multi-word, queryparser, synonym
> Fix For: 4.9, Trunk
>
> Attachments: conf-test-files-4_8_1.patch, quoted-4_8_1.patch, 
> quoted.patch, synonym-expander-4_8_1.patch, synonym-expander.patch
>
>
> When dealing with synonyms at query time, Solr fails to work with multi-word 
> synonyms for two reasons:
> - First, the Lucene queryparser tokenizes the user query on whitespace, so it 
> splits a multi-word term into separate terms before feeding them to the 
> synonym filter, and the synonym filter can't recognize the multi-word term to 
> expand it.
> - Second, if the synonym filter expands into multiple terms that contain a 
> multi-word synonym, SolrQueryParserBase currently uses MultiPhraseQuery to 
> handle synonyms. But MultiPhraseQuery doesn't work with terms that have 
> different numbers of words.
> For the first one, we can quote all multi-word synonyms in the user query 
> so that the Lucene queryparser doesn't split them. There is a related JIRA 
> task: https://issues.apache.org/jira/browse/LUCENE-2605.
> For the second, we can replace MultiPhraseQuery with an appropriate 
> BooleanQuery of SHOULD clauses containing multiple PhraseQuery instances when 
> the token stream has a 
> multi-word synonym.
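
One way to picture the second proposal, as a rough Python sketch rather than Lucene code: each synonym alternative becomes its own phrase, and the alternatives are OR-ed together, so alternatives with different word counts can coexist. The expand helper and the synonym table are made up for illustration. In Lucene terms the result would roughly correspond to a BooleanQuery whose SHOULD clauses are PhraseQuery objects.

```python
def expand(query, synonyms):
    """Return a list of alternative phrases (token lists) for the query.

    synonyms maps a phrase (tuple of tokens) to equivalent phrases.
    The alternatives are meant to be OR-ed (SHOULD) together.
    """
    tokens = query.lower().split()
    alternatives = [[]]
    i = 0
    while i < len(tokens):
        # Prefer the longest synonym entry starting at position i.
        match = None
        for length in range(len(tokens) - i, 0, -1):
            candidate = tuple(tokens[i:i + length])
            if candidate in synonyms:
                match = candidate
                break
        if match is None:
            choices = [(tokens[i],)]
            i += 1
        else:
            choices = [match] + [tuple(s) for s in synonyms[match]]
            i += len(match)
        alternatives = [alt + list(c) for alt in alternatives for c in choices]
    return alternatives


# Hypothetical synonym table for the example.
SYNONYMS = {("ny",): [("new", "york")], ("new", "york"): [("ny",)]}
phrases = expand("hotels in new york", SYNONYMS)
```

Here "new york" is kept together as one unit before expansion, which is the effect the first proposal (quoting multi-word synonyms) is after.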




[jira] [Commented] (SOLR-5379) Query-time multi-word synonym expansion

2015-02-02 Thread Otis Gospodnetic (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-5379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14301433#comment-14301433
 ] 

Otis Gospodnetic commented on SOLR-5379:


Is there any interest in committing this to 4.x or 5.x?  We have a client at 
Sematext who needs query-time synonym support for their Solr 4.x setup.  So we 
can make sure this patch works for 4.x.  If any of the Solr developers wants to 
commit this to 5.x, please leave a comment here.





[jira] [Commented] (LUCENE-6161) Applying deletes is sometimes dog slow

2015-01-12 Thread Otis Gospodnetic (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-6161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14273897#comment-14273897
 ] 

Otis Gospodnetic commented on LUCENE-6161:
--

I'd assume that while merges are now faster, they are using more of the 
computing resources (than before) needed for the rest of what Lucene is doing, 
hence no improvement in overall indexing time.

> Applying deletes is sometimes dog slow
> --
>
> Key: LUCENE-6161
> URL: https://issues.apache.org/jira/browse/LUCENE-6161
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Michael McCandless
> Fix For: 5.0, Trunk
>
> Attachments: LUCENE-6161.patch, LUCENE-6161.patch, LUCENE-6161.patch
>
>
> I hit this while testing various use cases for LUCENE-6119 (adding 
> auto-throttle to ConcurrentMergeScheduler).
> When I tested "always call updateDocument" (each add buffers a delete term), 
> with many indexing threads, opening an NRT reader once per second (forcing 
> all deleted terms to be applied), I see that 
> BufferedUpdatesStream.applyDeletes sometimes seems to take a long time, 
> e.g.:
> {noformat}
> BD 0 [2015-01-04 09:31:12.597; Lucene Merge Thread #69]: applyDeletes took 
> 339 msec for 10 segments, 117 deleted docs, 607333 visited terms
> BD 0 [2015-01-04 09:31:18.148; Thread-4]: applyDeletes took 5533 msec for 62 
> segments, 10989 deleted docs, 8517225 visited terms
> BD 0 [2015-01-04 09:31:21.463; Lucene Merge Thread #71]: applyDeletes took 
> 1065 msec for 10 segments, 470 deleted docs, 1825649 visited terms
> BD 0 [2015-01-04 09:31:26.301; Thread-5]: applyDeletes took 4835 msec for 61 
> segments, 14676 deleted docs, 9649860 visited terms
> BD 0 [2015-01-04 09:31:35.572; Thread-11]: applyDeletes took 6073 msec for 72 
> segments, 13835 deleted docs, 11865319 visited terms
> BD 0 [2015-01-04 09:31:37.604; Lucene Merge Thread #75]: applyDeletes took 
> 251 msec for 10 segments, 58 deleted docs, 240721 visited terms
> BD 0 [2015-01-04 09:31:44.641; Thread-11]: applyDeletes took 5956 msec for 64 
> segments, 15109 deleted docs, 10599034 visited terms
> BD 0 [2015-01-04 09:31:47.814; Lucene Merge Thread #77]: applyDeletes took 
> 396 msec for 10 segments, 137 deleted docs, 719914 visit
> {noformat}
> What this means is even though I want an NRT reader every second, often I 
> don't get one for up to ~7 or more seconds.
> This is on an SSD, machine has 48 GB RAM, heap size is only 2 GB.  12 
> indexing threads.
> As hideously complex as this code is, I think there are some inefficiencies, 
> but fixing them could be hard / make code even hairier ...
> Also, this code is mega-locked: holds IW's lock, holds BD's lock.  It blocks 
> things like merges kicking off or finishing...
> E.g., we pull the MergedIterator many times on the same set of sub-iterators. 
>  Maybe we can create the sorted terms up front and reuse that?
> Maybe we should go "term stride" (one term visits all N segments) not 
> "segment stride" (visit each segment, iterating all deleted terms for it).  
> Just iterating the terms to be deleted takes a sizable part of the time, and 
> we now do that once for every segment in the index.
> Also, the "isUnique" bit in LUCENE-6005 should help here, since if we know 
> the field is unique, we can stop seekExact once we found a segment that has 
> the deleted term, we can maybe pass false for removeDuplicates to 
> MergedIterator...
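
The difference between "segment stride" and "term stride" can be shown with a toy model. This is only a sketch in Python with made-up data structures (a segment is a dict from term to doc numbers); it ignores everything else the real code has to handle, such as locking and the ordering of deletes relative to segments. The visited counter counts term lookups in this toy model and is not the same thing as the "visited terms" figure in the log above.

```python
def apply_segment_stride(segments, deleted_terms):
    """Visit each segment and iterate every deleted term for it."""
    visited = 0
    deleted = set()
    for seg_id, postings in enumerate(segments):
        for term in deleted_terms:
            visited += 1
            for doc in postings.get(term, ()):
                deleted.add((seg_id, doc))
    return deleted, visited


def apply_term_stride(segments, deleted_terms, unique_field=False):
    """Visit each term once and look it up in the segments.

    If the field is known to be unique, stop at the first segment
    that contains the term.
    """
    visited = 0
    deleted = set()
    for term in deleted_terms:
        for seg_id, postings in enumerate(segments):
            visited += 1
            docs = postings.get(term, ())
            for doc in docs:
                deleted.add((seg_id, doc))
            if unique_field and docs:
                break
    return deleted, visited


# Three toy segments mapping an id term to the doc numbers holding it.
segments = [{"id1": [0], "id2": [1]}, {"id3": [0]}, {"id4": [0]}]
terms = ["id1", "id3"]
seg_result = apply_segment_stride(segments, terms)
term_result = apply_term_stride(segments, terms, unique_field=True)
```

Both strategies mark the same documents as deleted. In this toy case the term-stride variant with the unique-field shortcut does half as many lookups, because it stops as soon as a term is found.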




[jira] [Updated] (SOLR-6273) Cross Data Center Replication

2015-01-05 Thread Otis Gospodnetic (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-6273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Otis Gospodnetic updated SOLR-6273:
---
Summary: Cross Data Center Replication  (was: Cross Data Center Replicaton)

> Cross Data Center Replication
> -
>
> Key: SOLR-6273
> URL: https://issues.apache.org/jira/browse/SOLR-6273
> Project: Solr
>  Issue Type: New Feature
>Reporter: Yonik Seeley
> Attachments: SOLR-6273.patch
>
>
> This is the master issue for Cross Data Center Replication (CDCR)
> described at a high level here: 
> http://heliosearch.org/solr-cross-data-center-replication/




[jira] [Commented] (SOLR-6675) Solr webapp deployment is very slow with <jmx/> in solrconfig.xml

2014-12-02 Thread Otis Gospodnetic (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-6675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14232555#comment-14232555
 ] 

Otis Gospodnetic commented on SOLR-6675:


I've never heard of or seen this before.  Have you tried the latest Solr 4.10.x?
Which JVM is this on?


> Solr webapp deployment is very slow with <jmx/> in solrconfig.xml
> -
>
> Key: SOLR-6675
> URL: https://issues.apache.org/jira/browse/SOLR-6675
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
>Affects Versions: 4.7
> Environment: Linux Redhat 64bit
>Reporter: Forest Soup
>Priority: Critical
>  Labels: performance
> Attachments: callstack.png
>
>
> We have a SolrCloud cluster running Solr 4.7 on Tomcat 7, and our Solr 
> indexes (cores) are big (50~100G per core). 
> When we start up Tomcat, the Solr webapp deployment is very slow. According to 
> Tomcat's catalina log, it takes about 10 minutes to deploy every time. 
> After analyzing a Java core dump, we noticed this is because the loading 
> process cannot finish until the MBean calculation for the large index is done.
>  
> So we tried removing <jmx/> from solrconfig.xml; after that, loading the 
> Solr webapp takes only about 1 minute. So we are sure the MBean calculation 
> for the large index is the root cause.
> Could you please tell me if there is any async way to do statistics 
> monitoring without <jmx/> in solrconfig.xml, or a way to have it do the 
> calculation after the deployment? Thanks!
> The callstack.png file in the attachment is the call stack of the long 
> blocking thread which is doing the statistics calculation.
> The catalina log of tomcat:
> INFO: Starting Servlet Engine: Apache Tomcat/7.0.54
> Oct 13, 2014 2:00:29 AM org.apache.catalina.startup.HostConfig deployWAR
> INFO: Deploying web application archive 
> /opt/ibm/solrsearch/tomcat/webapps/solr.war
> Oct 13, 2014 2:10:23 AM org.apache.catalina.startup.HostConfig deployWAR
> INFO: Deployment of web application archive 
> /opt/ibm/solrsearch/tomcat/webapps/solr.war has finished in 594,325 ms 
> < Time taken for solr app Deployment is about 10 minutes 
> ---
> Oct 13, 2014 2:10:23 AM org.apache.catalina.startup.HostConfig deployDirectory
> INFO: Deploying web application directory 
> /opt/ibm/solrsearch/tomcat/webapps/manager
> Oct 13, 2014 2:10:26 AM org.apache.catalina.startup.HostConfig deployDirectory
> INFO: Deployment of web application directory 
> /opt/ibm/solrsearch/tomcat/webapps/manager has finished in 2,035 ms
> Oct 13, 2014 2:10:26 AM org.apache.catalina.startup.HostConfig deployDirectory
> INFO: Deploying web application directory 
> /opt/ibm/solrsearch/tomcat/webapps/examples
> Oct 13, 2014 2:10:27 AM org.apache.catalina.startup.HostConfig deployDirectory
> INFO: Deployment of web application directory 
> /opt/ibm/solrsearch/tomcat/webapps/examples has finished in 1,789 ms
> Oct 13, 2014 2:10:27 AM org.apache.catalina.startup.HostConfig deployDirectory
> INFO: Deploying web application directory 
> /opt/ibm/solrsearch/tomcat/webapps/docs
> Oct 13, 2014 2:10:28 AM org.apache.catalina.startup.HostConfig deployDirectory
> INFO: Deployment of web application directory 
> /opt/ibm/solrsearch/tomcat/webapps/docs has finished in 1,037 ms
> Oct 13, 2014 2:10:28 AM org.apache.catalina.startup.HostConfig deployDirectory
> INFO: Deploying web application directory 
> /opt/ibm/solrsearch/tomcat/webapps/ROOT
> Oct 13, 2014 2:10:29 AM org.apache.catalina.startup.HostConfig deployDirectory
> INFO: Deployment of web application directory 
> /opt/ibm/solrsearch/tomcat/webapps/ROOT has finished in 948 ms
> Oct 13, 2014 2:10:29 AM org.apache.catalina.startup.HostConfig deployDirectory
> INFO: Deploying web application directory 
> /opt/ibm/solrsearch/tomcat/webapps/host-manager
> Oct 13, 2014 2:10:30 AM org.apache.catalina.startup.HostConfig deployDirectory
> INFO: Deployment of web application directory 
> /opt/ibm/solrsearch/tomcat/webapps/host-manager has finished in 951 ms
> Oct 13, 2014 2:10:31 AM org.apache.coyote.AbstractProtocol start
> INFO: Starting ProtocolHandler ["http-bio-8080"]
> Oct 13, 2014 2:10:31 AM org.apache.coyote.AbstractProtocol start
> INFO: Starting ProtocolHandler ["ajp-bio-8009"]
> Oct 13, 2014 2:10:31 AM org.apache.catalina.startup.Catalina start
> INFO: Server startup in 601506 ms
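
To spot the slow deployment in a catalina log like the one quoted above, the "has finished in N ms" entries can be pulled out mechanically. A small Python sketch, assuming each entry sits on one line in the real log file (the quoted copy is wrapped by the mail client); the helper name deployment_times is made up for the example.

```python
import re

# Matches "<path> has finished in 594,325 ms" and captures path and duration.
FINISHED = re.compile(r"(\S+) has finished in ([\d,]+) ms")


def deployment_times(log_text):
    """Return {path: milliseconds} for every 'has finished in' entry."""
    times = {}
    for path, millis in FINISHED.findall(log_text):
        times[path] = int(millis.replace(",", ""))
    return times


sample = """
INFO: Deployment of web application archive /opt/ibm/solrsearch/tomcat/webapps/solr.war has finished in 594,325 ms
INFO: Deployment of web application directory /opt/ibm/solrsearch/tomcat/webapps/manager has finished in 2,035 ms
INFO: Deployment of web application directory /opt/ibm/solrsearch/tomcat/webapps/ROOT has finished in 948 ms
"""
times = deployment_times(sample)
slowest = max(times, key=times.get)
```

In the quoted log, solr.war accounts for 594,325 ms of the 601,506 ms total server startup.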




[jira] [Resolved] (SOLR-6674) Solr webapp deployment is very slow with <jmx/> in solrconfig.xml

2014-12-02 Thread Otis Gospodnetic (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-6674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Otis Gospodnetic resolved SOLR-6674.

Resolution: Duplicate

Dupe of SOLR-6675

> Solr webapp deployment is very slow with <jmx/> in solrconfig.xml
> -
>
> Key: SOLR-6674
> URL: https://issues.apache.org/jira/browse/SOLR-6674
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
>Affects Versions: 4.7
> Environment: Linux Redhat 64bit
>Reporter: Forest Soup
>Priority: Critical
>  Labels: performance
>
> We have a SolrCloud cluster running Solr 4.7 on Tomcat 7, and our Solr 
> indexes (cores) are big (50~100G per core). 
> When we start up Tomcat, the Solr webapp deployment is very slow. According to 
> Tomcat's catalina log, it takes about 10 minutes to deploy every time. 
> After analyzing a Java core dump, we noticed this is because the loading 
> process cannot finish until the MBean calculation for the large index is done.
>  
> So we tried removing <jmx/> from solrconfig.xml; after that, loading the 
> Solr webapp takes only about 1 minute. So we are sure the MBean calculation 
> for the large index is the root cause.
> Could you please tell me if there is any async way to do statistics 
> monitoring without <jmx/> in solrconfig.xml, or a way to have it do the 
> calculation after the deployment? Thanks!




[jira] [Commented] (LUCENE-6053) Serbian Analyzer

2014-11-24 Thread Otis Gospodnetic (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-6053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14224016#comment-14224016
 ] 

Otis Gospodnetic commented on LUCENE-6053:
--

Hm, calling this Serbian is a bit limiting - languages from all ex-Yugoslavian 
countries use the *exact-same* diacritic characters (the 
"abcčćddžđefghijklljmnnjoprsštuvzž" ones, not the Cyrillic ones).  [~nikola] - 
do you think you could reorganize things a bit to isolate the Cyrillic part and 
thus make the rest reusable?


> Serbian Analyzer
> 
>
> Key: LUCENE-6053
> URL: https://issues.apache.org/jira/browse/LUCENE-6053
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Nikola Smolenski
> Fix For: 5.0, Trunk
>
> Attachments: LUCENE-Serbian-1.patch
>
>
> This is an analyzer for the Serbian language, so far consisting only of a 
> normalizer. Serbian uses both the Cyrillic and Latin alphabets, so the 
> normalizer works with both alphabets.
> In the future, I'll look into adding stopwords, a stemmer and so on.




[jira] [Commented] (SOLR-6766) Switch o.a.s.store.blockcache.Metrics to use JMX

2014-11-24 Thread Otis Gospodnetic (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-6766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14223754#comment-14223754
 ] 

Otis Gospodnetic commented on SOLR-6766:


Is this aimed at 4.10.3?

> Switch o.a.s.store.blockcache.Metrics to use JMX
> 
>
> Key: SOLR-6766
> URL: https://issues.apache.org/jira/browse/SOLR-6766
> Project: Solr
>  Issue Type: Bug
>Reporter: Mike Drob
>  Labels: metrics
> Attachments: SOLR-6766.patch, SOLR-6766.patch
>
>
> The Metrics class currently reports to hadoop metrics, but it would be better 
> to report to JMX.




[jira] [Created] (SOLR-6788) Expose Solr version via MBean

2014-11-24 Thread Otis Gospodnetic (JIRA)

Otis Gospodnetic created SOLR-6788:
--

 Summary: Expose Solr version via MBean
 Key: SOLR-6788
 URL: https://issues.apache.org/jira/browse/SOLR-6788
 Project: Solr
  Issue Type: Improvement
Reporter: Otis Gospodnetic
 Fix For: 4.10.3


Solr should expose its version via an MBean so tools know which version of Solr 
they are talking to.  When the MBean structure changes, tools depend on this 
information to know which MBeans to look for, how to parse/interpret their 
values, etc.
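
On the tool side, the idea could look roughly like the sketch below: once the version is known, the tool picks the MBean layout to read. This is only an illustration in Python. The MBean names are placeholders, not real Solr names, and how the version string is actually obtained (the proposed MBean) is left out.

```python
def parse_version(version):
    """Turn '4.10.3' or '5.0-SNAPSHOT' into a comparable tuple like (4, 10, 3)."""
    numeric = version.split("-")[0]
    return tuple(int(part) for part in numeric.split("."))


# Hypothetical layouts, newest first. The names are placeholders only.
LAYOUTS = [
    ((5, 0), "placeholder:type=cacheStatsV2"),
    ((4, 0), "placeholder:type=cacheStatsV1"),
]


def mbean_for(version):
    """Pick the newest layout whose minimum version is <= the reported version."""
    parsed = parse_version(version)
    for minimum, mbean_name in LAYOUTS:
        if parsed >= minimum:
            return mbean_name
    raise ValueError("unsupported Solr version: " + version)


old_layout = mbean_for("4.10.3")
new_layout = mbean_for("5.0-SNAPSHOT")
```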




[jira] [Commented] (SOLR-6058) Solr needs a new website

2014-11-13 Thread Otis Gospodnetic (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-6058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14211840#comment-14211840
 ] 

Otis Gospodnetic commented on SOLR-6058:


[~sar...@syr.edu] - the search box at the top goes to 
http://search-lucene.com/lucene?q=foo&searchProvider=sl , but it should go to 
http://search-lucene.com/solr?q=monkey (note /lucene => /solr and removal of 
&searchProvider=sl which is not needed) .  Do you think you could include this 
little change?

> Solr needs a new website
> 
>
> Key: SOLR-6058
> URL: https://issues.apache.org/jira/browse/SOLR-6058
> Project: Solr
>  Issue Type: Task
>Reporter: Grant Ingersoll
>Assignee: Grant Ingersoll
> Attachments: HTML.rar, SOLR-6058, SOLR-6058.location-fix.patchfile, 
> SOLR-6058.offset-fix.patch, Solr_Icons.pdf, Solr_Logo_on_black.pdf, 
> Solr_Logo_on_black.png, Solr_Logo_on_orange.pdf, Solr_Logo_on_orange.png, 
> Solr_Logo_on_white.pdf, Solr_Logo_on_white.png, Solr_Styleguide.pdf
>
>
> Solr needs a new website:  better organization of content, less verbose, more 
> pleasing graphics, etc.




[jira] [Commented] (SOLR-4587) Implement Saved Searches a la ElasticSearch Percolator

2014-11-07 Thread Otis Gospodnetic (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-4587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14202585#comment-14202585
 ] 

Otis Gospodnetic commented on SOLR-4587:


bq.  I believe that much of the job in luwak also comes from the realization 
that the number of documents must be reduced prior to looping

That's correct.  In our work with Luwak this is the key.  You can have 1M 
queries, but if you *really* need to run incoming documents against all 1M 
queries, expect to have VERY low throughput and VERY HIGH match latencies.  We 
are working with 1-2M queries and reducing those to a few thousand queries with 
Luwak's Presearcher, and still have latencies of a few hundred milliseconds.
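
The presearch idea can be sketched in a few lines. This is a toy Python illustration and not Luwak's API: queries are treated as a plain AND of their terms, and the helper names (build_index, presearch, match) are invented for the example. The point is only that the document selects a small candidate set of stored queries before any query is actually run.

```python
def required_terms(query):
    """Toy stand-in: treat a query as an AND of its whitespace-separated terms."""
    return set(query.lower().split())


def build_index(queries):
    """Map each term to the ids of the stored queries that require it."""
    index = {}
    for qid, query in queries.items():
        for term in required_terms(query):
            index.setdefault(term, set()).add(qid)
    return index


def presearch(index, document):
    """Return ids of queries sharing at least one term with the document."""
    candidates = set()
    for term in set(document.lower().split()):
        candidates |= index.get(term, set())
    return candidates


def match(queries, index, document):
    """Run only the candidate queries against the document."""
    doc_terms = set(document.lower().split())
    return sorted(
        qid for qid in presearch(index, document)
        if required_terms(queries[qid]) <= doc_terms
    )


queries = {"q1": "solr cloud", "q2": "lucene merge", "q3": "solr percolator"}
index = build_index(queries)
doc = "saved searches for solr cloud"
candidates = presearch(index, doc)
matches = match(queries, index, doc)
```

With a large set of stored queries, the cost per incoming document then depends on the size of the candidate set rather than on the total number of queries.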

> Implement Saved Searches a la ElasticSearch Percolator
> --
>
> Key: SOLR-4587
> URL: https://issues.apache.org/jira/browse/SOLR-4587
> Project: Solr
>  Issue Type: New Feature
>  Components: SearchComponents - other, SolrCloud
>Reporter: Otis Gospodnetic
> Fix For: Trunk
>
>
> Use Lucene MemoryIndex for this.




[jira] [Commented] (SOLR-6717) SolrCloud indexing performance when sending updates to incorrect core is terrible

2014-11-07 Thread Otis Gospodnetic (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-6717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14202392#comment-14202392
 ] 

Otis Gospodnetic commented on SOLR-6717:


Here's the full thread: http://search-lucene.com/m/QTPaWzeof

> SolrCloud indexing performance when sending updates to incorrect core is 
> terrible
> -
>
> Key: SOLR-6717
> URL: https://issues.apache.org/jira/browse/SOLR-6717
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
>Affects Versions: 4.10.2
>Reporter: Shawn Heisey
> Fix For: 5.0, Trunk
>
>
> A user on the mailing list was sending document updates to a random node/core 
> in his SolrCloud.  Performance was not scaling anywhere close to what was 
> expected.  Basically, indexing performance was not scaling when adding shards 
> and servers.
> As soon as the user implemented a smart router that was aware of the cloud 
> structure and could send to the proper shard leader, performance scaled 
> exactly as expected.  It's not Java code, so CloudSolrServer was not an 
> option.
> There will always be some overhead involved when sending update requests to 
> the wrong shard replica, but hopefully something can be done about the 
> performance hit.
> http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201411.mbox/%3CCALswpfDQT4+_eZ6416gMyVHkuhdTYtxXxwxQabR6xeTZ8Lx=t...@mail.gmail.com%3E
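
A "smart router" of the kind the user wrote could be sketched like this. It is a simplified Python illustration under stated assumptions: the hash ranges are split evenly, zlib.crc32 stands in for Solr's own document hash, and the leader URLs are hypothetical. A real client would read the shard ranges and current leaders from the cluster state instead of computing them.

```python
import zlib


def build_ranges(shard_leaders):
    """Split the 32-bit hash space evenly across shards.

    shard_leaders maps shard name -> leader URL (hypothetical values).
    """
    names = sorted(shard_leaders)
    step = 2 ** 32 // len(names)
    ranges = []
    for i, name in enumerate(names):
        upper = 2 ** 32 if i == len(names) - 1 else (i + 1) * step
        ranges.append((i * step, upper, name))
    return ranges


def leader_for(doc_id, ranges, shard_leaders):
    """Return the leader URL of the shard whose hash range covers doc_id."""
    h = zlib.crc32(doc_id.encode("utf-8"))  # stand-in for Solr's real hash
    for lower, upper, name in ranges:
        if lower <= h < upper:
            return shard_leaders[name]
    raise AssertionError("hash ranges must cover the whole space")


leaders = {
    "shard1": "http://host1:8983/solr/coll_shard1_replica1",
    "shard2": "http://host2:8983/solr/coll_shard2_replica1",
    "shard3": "http://host3:8983/solr/coll_shard3_replica1",
}
ranges = build_ranges(leaders)
target = leader_for("doc-42", ranges, leaders)
```

Sending each update straight to its target leader avoids the extra forwarding hop that happens when the update lands on a random node.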




[jira] [Commented] (SOLR-6699) To enable SPDY in a SolrCloud setup

2014-11-03 Thread Otis Gospodnetic (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-6699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14195672#comment-14195672
 ] 

Otis Gospodnetic commented on SOLR-6699:


Is this supposed to bring performance or scalability benefits?  If so, do you 
have any numbers you can share?

> To enable SPDY in a SolrCloud setup
> ---
>
> Key: SOLR-6699
> URL: https://issues.apache.org/jira/browse/SOLR-6699
> Project: Solr
>  Issue Type: Improvement
>Reporter: Harsh Prasad
> Attachments: SOLR-6699.patch
>
>
> Solr has a lot of inter-node communication happening during distributed 
> searching or indexing. The benefits of SPDY are as follows: 
> - Multiple requests can be sent in parallel (multiplexing) and responses can 
> be received out of order.
> - Headers are compressed and optimized.
> This implementation will use clear-text SPDY and not the usual TLS-layer 
> SPDY.




[jira] [Commented] (SOLR-6699) To enable SPDY in a SolrCloud setup

2014-11-03 Thread Otis Gospodnetic (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-6699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14195670#comment-14195670
 ] 

Otis Gospodnetic commented on SOLR-6699:


I looked at only about the top 50-100 lines of this patch and saw a number of 
versions of various libraries being downgraded, which seems strange, no?





[jira] [Updated] (SOLR-6288) Create a parser and rule engine for the rules syntax

2014-07-31 Thread Otis Gospodnetic (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-6288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Otis Gospodnetic updated SOLR-6288:
---

Summary: Create a parser and rule engine for the rules syntax  (was: Create 
a parser and rule engine for the rules sytax)

> Create a parser and rule engine for the rules syntax
> 
>
> Key: SOLR-6288
> URL: https://issues.apache.org/jira/browse/SOLR-6288
> Project: Solr
>  Issue Type: Sub-task
>  Components: SolrCloud
>Reporter: Noble Paul
>Assignee: Noble Paul
>
> The proposed syntax needs to be parsed and, given the tags for a bunch of 
> nodes, it should be able to assign replicas to nodes or just bail out if that 
> is not possible




[jira] [Commented] (SOLR-6233) Provide basic command line tools for checking Solr status and health.

2014-07-20 Thread Otis Gospodnetic (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-6233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14068146#comment-14068146
 ] 

Otis Gospodnetic commented on SOLR-6233:


Consider adding a non-JSON format that's easier for grepping, piping, etc.  See 
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/cat.html


> Provide basic command line tools for checking Solr status and health.
> -
>
> Key: SOLR-6233
> URL: https://issues.apache.org/jira/browse/SOLR-6233
> Project: Solr
>  Issue Type: Improvement
>Reporter: Timothy Potter
>Priority: Minor
>
> As part of the start script development work SOLR-3617, example restructuring 
> SOLR-3619, and the overall curb appeal work SOLR-4430, I'd like to have an 
> option on the SystemInfoHandler that gives a shorter, well formatted JSON 
> synopsis of essential information. I know "essential" is vague ;-) but right 
> now using curl to http://host:port/solr/admin/info/system?wt=json gives too 
> much information when I just want a synopsis of a Solr server. 
> Maybe something like &overview=true?
> Result would be:
> {noformat}
> {
>   "address": "http://localhost:8983/solr",
>   "mode": "solrcloud",
>   "zookeeper": "localhost:2181/foo",
>   "uptime": "2 days, 3 hours, 4 minutes, 5 seconds",
>   "version": "5.0-SNAPSHOT",
>   "status": "healthy",
>   "memory": "4.2g of 6g"
> }
> {noformat}
> Now of course, one may argue all this information can be easily parsed from 
> the JSON but consider cross-platform command-line tools that don't have 
> immediate access to a JSON parser, such as the bin/solr start script.
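
For the grep-friendly output suggested in the comment, one option is to flatten the synopsis into one "key value" line per field. A minimal Python sketch, assuming the flat JSON shape proposed in the issue (the &overview=true response is only a proposal there); the function name to_cat_lines is made up for the example.

```python
import json


def to_cat_lines(synopsis_json):
    """Render a flat JSON object as 'key value' lines, one per field."""
    data = json.loads(synopsis_json)
    return [f"{key} {value}" for key, value in data.items()]


synopsis = """
{
  "address": "http://localhost:8983/solr",
  "mode": "solrcloud",
  "zookeeper": "localhost:2181/foo",
  "uptime": "2 days, 3 hours, 4 minutes, 5 seconds",
  "version": "5.0-SNAPSHOT",
  "status": "healthy",
  "memory": "4.2g of 6g"
}
"""
lines = to_cat_lines(synopsis)
```

Printed one per line, this output can be filtered with grep (for example on "status") without a JSON parser, which is the situation the issue describes for the bin/solr start script.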




[jira] [Comment Edited] (SOLR-6233) Provide basic command line tools for checking Solr status and health.

2014-07-20 Thread Otis Gospodnetic (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-6233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14068146#comment-14068146
 ] 

Otis Gospodnetic edited comment on SOLR-6233 at 7/21/14 3:12 AM:
-

Please consider adding a non-JSON format that's easier for grepping, piping, 
etc.  See 
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/cat.html



was (Author: otis):
Consider adding a non-JSON format that's easier for grepping, piping, etc.  See 
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/cat.html


> Provide basic command line tools for checking Solr status and health.
> -
>
> Key: SOLR-6233
> URL: https://issues.apache.org/jira/browse/SOLR-6233
> Project: Solr
>  Issue Type: Improvement
>Reporter: Timothy Potter
>Priority: Minor
>
> As part of the start script development work SOLR-3617, example restructuring 
> SOLR-3619, and the overall curb appeal work SOLR-4430, I'd like to have an 
> option on the SystemInfoHandler that gives a shorter, well formatted JSON 
> synopsis of essential information. I know "essential" is vague ;-) but right 
> now using curl to http://host:port/solr/admin/info/system?wt=json gives too 
> much information when I just want a synopsis of a Solr server. 
> Maybe something like &overview=true?
> Result would be:
> {noformat}
> {
>   "address": "http://localhost:8983/solr",
>   "mode": "solrcloud",
>   "zookeeper": "localhost:2181/foo",
>   "uptime": "2 days, 3 hours, 4 minutes, 5 seconds",
>   "version": "5.0-SNAPSHOT",
>   "status": "healthy",
>   "memory": "4.2g of 6g"
> }
> {noformat}
> Now of course, one may argue all this information can be easily parsed from 
> the JSON but consider cross-platform command-line tools that don't have 
> immediate access to a JSON parser, such as the bin/solr start script.
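
As a rough illustration of the grep-friendly, non-JSON output suggested in the comment above, the proposed overview could be flattened into one tab-separated key/value pair per line. This is only a sketch: the function name overview_to_lines is hypothetical, and the field names come from the sample result in the issue description, not from any existing Solr handler.

```python
import json

# Sample taken from the issue description; the values are illustrative only.
SAMPLE_OVERVIEW = """{
  "address": "http://localhost:8983/solr",
  "mode": "solrcloud",
  "zookeeper": "localhost:2181/foo",
  "uptime": "2 days, 3 hours, 4 minutes, 5 seconds",
  "version": "5.0-SNAPSHOT",
  "status": "healthy",
  "memory": "4.2g of 6g"
}"""


def overview_to_lines(overview_json):
    """Flatten a one-level JSON object into tab-separated key/value lines."""
    data = json.loads(overview_json)
    return [f"{key}\t{value}" for key, value in data.items()]


if __name__ == "__main__":
    # One pair per line, so `grep ^status` or `cut -f2` work without a JSON parser.
    print("\n".join(overview_to_lines(SAMPLE_OVERVIEW)))
```

Tabs are used as the separator here because several values (uptime, memory) contain spaces.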




[jira] [Commented] (LUCENE-5674) A new token filter: SubSequence

2014-05-25 Thread Otis Gospodnetic (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-5674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14008476#comment-14008476
 ] 

Otis Gospodnetic commented on LUCENE-5674:
--

Didn't look at this, but I remember needing/writing something like this 10+ 
years ago. I think back then I wanted the output to be something like: 
com, com.google, com.google.www - i.e. tokenized, but in reversed order.

> A new token filter: SubSequence
> ---
>
> Key: LUCENE-5674
> URL: https://issues.apache.org/jira/browse/LUCENE-5674
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/other
>Reporter: Nitzan Shaked
>Priority: Minor
> Attachments: subseqfilter.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> A new configurable token filter which, given a token, breaks it into sub-parts 
> and outputs consecutive sub-sequences of those sub-parts.
> Useful, for example, during indexing to generate variations on 
> domain names, so that "www.google.com" can be found by searching for 
> "google.com", or "www.google.com".
> Parameters:
> sepRegexp: A regular expression used to split incoming tokens into sub-parts.
> glue: A string used to concatenate sub-parts together when creating 
> sub-sequences.
> minLen: Minimum length (in sub-parts) of output sub-sequences
> maxLen: Maximum length (in sub-parts) of output sub-sequences (0 for 
> unlimited; negative numbers for token length in sub-parts minus specified 
> length)
> anchor: Anchor.START to output only prefixes, or Anchor.END to output only 
> suffixes, or Anchor.NONE to output any sub-sequence
> withOriginal: whether to output also the original token
> EDIT: now includes tests for filter and for factory.
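
The parameter list above can be sketched outside Lucene as a plain function. This is not the code from subseqfilter.patch; it is a minimal illustration of the described behaviour, and the function name, argument names, default values and output order are assumptions.

```python
import re


def sub_sequences(token, sep_regexp=r"\.", glue=".", min_len=1, max_len=0,
                  anchor="NONE", with_original=False):
    """Return consecutive sub-sequences of the sub-parts of `token`.

    max_len follows the description in the issue: 0 means unlimited, and a
    negative value means "number of sub-parts minus that many".
    anchor is "START" (prefixes only), "END" (suffixes only) or "NONE".
    """
    parts = [p for p in re.split(sep_regexp, token) if p]
    n = len(parts)
    if max_len <= 0:
        max_len = n + max_len
    max_len = min(max_len, n)

    out = []
    for length in range(max(min_len, 1), max_len + 1):
        for start in range(0, n - length + 1):
            end = start + length
            if anchor == "START" and start != 0:
                continue
            if anchor == "END" and end != n:
                continue
            out.append(glue.join(parts[start:end]))
    if with_original and token not in out:
        out.append(token)
    return out


if __name__ == "__main__":
    # Suffixes of at least two sub-parts, as in the domain-name example.
    print(sub_sequences("www.google.com", min_len=2, anchor="END"))
    # Reversed-order variant mentioned in the comment: reverse the parts
    # first, then emit prefixes only.
    reversed_host = ".".join(reversed("www.google.com".split(".")))
    print(sub_sequences(reversed_host, anchor="START"))
```

Reversing the sub-parts before applying a START anchor is one way to get the com, com.google, com.google.www output described in the comment.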




[jira] [Commented] (SOLR-4962) Allow for analytic functions to be performed through altered collectors

2014-05-14 Thread Otis Gospodnetic (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-4962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13995257#comment-13995257
 ] 

Otis Gospodnetic commented on SOLR-4962:


Greg - the parent issue is Closed.  Should this be Closed, too?  Not sure how 
it related to SOLR-5073.

> Allow for analytic functions to be performed through altered collectors
> ---
>
> Key: SOLR-4962
> URL: https://issues.apache.org/jira/browse/SOLR-4962
> Project: Solr
>  Issue Type: Sub-task
>  Components: search
>Reporter: Greg Bowyer
> Fix For: 4.9, 5.0
>
>
> This is a split from SOLR-4465, in that issue the ability to create 
> customised collectors that allow for aggregate functions was born, but 
> suffers from being unable to work well with queryResultCaching and grouping.
> Migrating out this functionality into a collector component within solr, and 
> perhaps pushing down some of the logic towards lucene seems to be the way to 
> go.




[jira] [Commented] (SOLR-4465) Configurable Collectors

2014-05-14 Thread Otis Gospodnetic (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-4465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13995249#comment-13995249
 ] 

Otis Gospodnetic commented on SOLR-4465:


[~joel.bernstein] maybe this should be closed so it's not confusing people 
(because there have been a LOT of JIRAs in this post-filter/configurable 
collector/pluggable ranking collector space)

> Configurable Collectors
> ---
>
> Key: SOLR-4465
> URL: https://issues.apache.org/jira/browse/SOLR-4465
> Project: Solr
>  Issue Type: New Feature
>  Components: search
>Affects Versions: 4.1
>Reporter: Joel Bernstein
> Fix For: 4.8
>
> Attachments: SOLR-4465.patch, SOLR-4465.patch, SOLR-4465.patch, 
> SOLR-4465.patch, SOLR-4465.patch, SOLR-4465.patch, SOLR-4465.patch, 
> SOLR-4465.patch, SOLR-4465.patch, SOLR-4465.patch, SOLR-4465.patch, 
> SOLR-4465.patch, SOLR-4465.patch, SOLR-4465.patch, SOLR-4465.patch, 
> SOLR-4465.patch, SOLR-4465.patch
>
>
> This ticket provides a patch to add pluggable collectors to Solr. This patch 
> was generated and tested with Solr 4.1.
> This is how the patch functions:
> Collectors are plugged into Solr in the solrconfig.xml using the new 
> collectorFactory element. For example:
> 
> 
> The elements above define two collector factories. The first one is the 
> "default" collectorFactory. The class attribute points to 
> org.apache.solr.handler.component.CollectorFactory, which implements logic 
> that returns the default TopScoreDocCollector and TopFieldCollector. 
> To create your own collectorFactory you must subclass the default 
> CollectorFactory and at a minimum override the getCollector method to return 
> your new collector. 
> The parameter "cl" turns on pluggable collectors:
> cl=true
> If cl is not in the parameters, Solr will automatically use the default 
> collectorFactory.
> *Pluggable Doclist Sorting With the Docs Collector*
> You can specify two types of pluggable collectors. The first type is the docs 
> collector. For example:
> cl.docs=
> The above param points to a named collectorFactory in the solrconfig.xml to 
> construct the collector. The docs collectorFactories must return a collector 
> that extends the TopDocsCollector base class. Docs collectors are responsible 
> for collecting the doclist.
> You can specify only one docs collector per query.
> You can pass parameters to the docs collector using local params syntax. For 
> example:
> cl.docs=\{! sort=mycustomesort\}mycollector
> If cl=true and a docs collector is not specified, Solr will use the default 
> collectorFactory to create the docs collector.
> *Pluggable Custom Analytics With Delegating Collectors*
> You can also specify any number of custom analytic collectors with the 
> "cl.analytic" parameter. Analytic collectors are designed to collect 
> something else besides the doclist. Typically this would be some type of 
> custom analytic. For example:
> cl.analytic=sum
> The parameter above specifies an analytic collector named sum. Like the docs 
> collectors, "sum" points to a named collectorFactory in the solrconfig.xml. 
> You can specify any number of analytic collectors by adding additional 
> cl.analytic parameters.
> Analytic collector factories must return Collector instances that extend 
> DelegatingCollector. 
> A sample analytic collector is provided in the patch through the 
> org.apache.solr.handler.component.SumCollectorFactory.
> This collectorFactory provides a very simple DelegatingCollector that groups 
> by a field and sums a column of floats. The sum collector is not designed to 
> be a fully functional sum function but to be a proof of concept for pluggable 
> analytics through delegating collectors.
> You can send parameters to analytic collectors with solr local param syntax.
> For example:
> cl.analytic=\{! id=1 groupby=field1 column=field2\}sum
> The "id" parameter is mandatory for analytic collectors and is used to 
> identify the output from the collector. In this example the "groupby" and 
> "column" params tell the sum collector which field to group by and sum.
> Analytic collectors are passed a reference to the ResponseBuilder and can 
> place maps with analytic output directly into the SolrQueryResponse with the 
> add() method.
> Maps that are placed in the SolrQueryResponse are automatically added to the 
> outgoing response. The response will include a list named cl.analytic.<id>, 
> where id is specified in the local param.
> *Distributed Search*
> The CollectorFactory also has a method called merge(). This method aggregates 
> the results from each of the shards during distributed search. The "default" 
> CollectorFactory implements the default merge logic for merging documents 
> from each shard. If you define a different docs collector you can override 
> the default merge method to merge d

[jira] [Resolved] (SOLR-1680) Provide an API to specify custom Collectors

2014-05-13 Thread Otis Gospodnetic (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-1680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Otis Gospodnetic resolved SOLR-1680.


Resolution: Won't Fix

Apparently covered by SOLR-5973.

> Provide an API to specify custom Collectors
> ---
>
> Key: SOLR-1680
> URL: https://issues.apache.org/jira/browse/SOLR-1680
> Project: Solr
>  Issue Type: Sub-task
>  Components: search
>Affects Versions: 1.3
>Reporter: Martijn van Groningen
> Fix For: 4.9, 5.0
>
> Attachments: SOLR-1680.patch, field-collapse-core.patch
>
>
> The issue is dedicated to incorporate fieldcollapse's changes to the Solr's 
> core code. 
> We want to make it possible for components to specify custom Collectors in 
> SolrIndexSearcher methods.




[jira] [Commented] (SOLR-4465) Configurable Collectors

2014-05-12 Thread Otis Gospodnetic (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-4465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13995250#comment-13995250
 ] 

Otis Gospodnetic commented on SOLR-4465:


Oops, already closed, ignore me!

> Configurable Collectors
> ---
>
> Key: SOLR-4465
> URL: https://issues.apache.org/jira/browse/SOLR-4465
> Project: Solr
>  Issue Type: New Feature
>  Components: search
>Affects Versions: 4.1
>Reporter: Joel Bernstein
> Fix For: 4.8
>
> Attachments: SOLR-4465.patch, SOLR-4465.patch, SOLR-4465.patch, 
> SOLR-4465.patch, SOLR-4465.patch, SOLR-4465.patch, SOLR-4465.patch, 
> SOLR-4465.patch, SOLR-4465.patch, SOLR-4465.patch, SOLR-4465.patch, 
> SOLR-4465.patch, SOLR-4465.patch, SOLR-4465.patch, SOLR-4465.patch, 
> SOLR-4465.patch, SOLR-4465.patch
>
>
> This ticket provides a patch to add pluggable collectors to Solr. This patch 
> was generated and tested with Solr 4.1.
> This is how the patch functions:
> Collectors are plugged into Solr in the solrconfig.xml using the new 
> collectorFactory element. For example:
> 
> 
> The elements above define two collector factories. The first one is the 
> "default" collectorFactory. The class attribute points to 
> org.apache.solr.handler.component.CollectorFactory, which implements logic 
> that returns the default TopScoreDocCollector and TopFieldCollector. 
> To create your own collectorFactory you must subclass the default 
> CollectorFactory and at a minimum override the getCollector method to return 
> your new collector. 
> The parameter "cl" turns on pluggable collectors:
> cl=true
> If cl is not in the parameters, Solr will automatically use the default 
> collectorFactory.
> *Pluggable Doclist Sorting With the Docs Collector*
> You can specify two types of pluggable collectors. The first type is the docs 
> collector. For example:
> cl.docs=
> The above param points to a named collectorFactory in the solrconfig.xml to 
> construct the collector. The docs collectorFactories must return a collector 
> that extends the TopDocsCollector base class. Docs collectors are responsible 
> for collecting the doclist.
> You can specify only one docs collector per query.
> You can pass parameters to the docs collector using local params syntax. For 
> example:
> cl.docs=\{! sort=mycustomesort\}mycollector
> If cl=true and a docs collector is not specified, Solr will use the default 
> collectorFactory to create the docs collector.
> *Pluggable Custom Analytics With Delegating Collectors*
> You can also specify any number of custom analytic collectors with the 
> "cl.analytic" parameter. Analytic collectors are designed to collect 
> something else besides the doclist. Typically this would be some type of 
> custom analytic. For example:
> cl.analytic=sum
> The parameter above specifies an analytic collector named sum. Like the docs 
> collectors, "sum" points to a named collectorFactory in the solrconfig.xml. 
> You can specify any number of analytic collectors by adding additional 
> cl.analytic parameters.
> Analytic collector factories must return Collector instances that extend 
> DelegatingCollector. 
> A sample analytic collector is provided in the patch through the 
> org.apache.solr.handler.component.SumCollectorFactory.
> This collectorFactory provides a very simple DelegatingCollector that groups 
> by a field and sums a column of floats. The sum collector is not designed to 
> be a fully functional sum function but to be a proof of concept for pluggable 
> analytics through delegating collectors.
> You can send parameters to analytic collectors with solr local param syntax.
> For example:
> cl.analytic=\{! id=1 groupby=field1 column=field2\}sum
> The "id" parameter is mandatory for analytic collectors and is used to 
> identify the output from the collector. In this example the "groupby" and 
> "column" params tell the sum collector which field to group by and sum.
> Analytic collectors are passed a reference to the ResponseBuilder and can 
> place maps with analytic output directly into the SolrQueryResponse with the 
> add() method.
> Maps that are placed in the SolrQueryResponse are automatically added to the 
> outgoing response. The response will include a list named cl.analytic.<id>, 
> where id is specified in the local param.
> *Distributed Search*
> The CollectorFactory also has a method called merge(). This method aggregates 
> the results from each of the shards during distributed search. The "default" 
> CollectorFactory implements the default merge logic for merging documents 
> from each shard. If you define a different docs collector you can override 
> the default merge method to merge documents in accordance with how they are 
> collected at the shard level.
> With analytic collectors, you'll need to override the merge method to merge 
> the ana
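
The sum collector and the merge step described above can be pictured with a small stand-alone sketch. It does not use any Solr or Lucene classes and is not taken from the patch; the function names, the dict-based documents and the shape of the per-shard result are assumptions made purely for illustration.

```python
from collections import defaultdict


def collect_sums(docs, groupby, column):
    """Per-shard step: group documents by one field and sum another field."""
    sums = defaultdict(float)
    for doc in docs:
        sums[doc[groupby]] += float(doc[column])
    return dict(sums)


def merge(shard_results):
    """Distributed step: combine the per-shard sums into one result."""
    merged = defaultdict(float)
    for result in shard_results:
        for group, total in result.items():
            merged[group] += total
    return dict(merged)


if __name__ == "__main__":
    # Field names mirror the groupby=field1 column=field2 example above.
    shard1 = collect_sums(
        [{"field1": "a", "field2": 1.5}, {"field1": "b", "field2": 2.0}],
        groupby="field1", column="field2")
    shard2 = collect_sums(
        [{"field1": "a", "field2": 0.5}],
        groupby="field1", column="field2")
    print(merge([shard1, shard2]))
```

Summation merges cleanly because per-shard partial sums can simply be added; other analytics would need their own merge logic.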

[jira] [Commented] (SOLR-5973) Pluggable Ranking Collectors

2014-05-12 Thread Otis Gospodnetic (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-5973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13995254#comment-13995254
 ] 

Otis Gospodnetic commented on SOLR-5973:


[~joel.bernstein] how does this play with SOLR-1680?  I suspect it doesn't at 
all, right?  Do you think SOLR-1680 has any merit or should it be Won't Fix-ed?


> Pluggable Ranking Collectors
> 
>
> Key: SOLR-5973
> URL: https://issues.apache.org/jira/browse/SOLR-5973
> Project: Solr
>  Issue Type: New Feature
>  Components: search
>Reporter: Joel Bernstein
>Assignee: Joel Bernstein
>Priority: Minor
> Fix For: 4.9
>
> Attachments: SOLR-5973.patch, SOLR-5973.patch, SOLR-5973.patch, 
> SOLR-5973.patch, SOLR-5973.patch, SOLR-5973.patch, SOLR-5973.patch, 
> SOLR-5973.patch, SOLR-5973.patch, SOLR-5973.patch, SOLR-5973.patch, 
> SOLR-5973.patch, SOLR-5973.patch
>
>
> This ticket introduces a new RankQuery and MergeStrategy to Solr. By 
> extending the RankQuery class, and implementing its interface, you can 
> specify a custom ranking collector (TopDocsCollector) and distributed merge 
> strategy for a Solr query. 
> A new "rq" http parameter was added to support specifying a rank query using 
> a custom QParserPlugin.
> Sample syntax:
> {code}
> q=*:*&wt=json&indent=true&rq={!myranker}
> {code}
> In the sample above the param: {code}rq={!myranker}{code} points to a 
> QParserPlugin that returns a Query that extends RankQuery. 




[jira] [Comment Edited] (SOLR-4962) Allow for analytic functions to be performed through altered collectors

2014-05-12 Thread Otis Gospodnetic (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-4962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13995257#comment-13995257
 ] 

Otis Gospodnetic edited comment on SOLR-4962 at 5/12/14 5:12 PM:
-

Greg - the parent issue is Closed.  Should this be Closed, too?  Not sure how 
it relates to SOLR-5973.


was (Author: otis):
Greg - the parent issue is Closed.  Should this be Closed, too?  Not sure how 
it related to SOLR-5073.

> Allow for analytic functions to be performed through altered collectors
> ---
>
> Key: SOLR-4962
> URL: https://issues.apache.org/jira/browse/SOLR-4962
> Project: Solr
>  Issue Type: Sub-task
>  Components: search
>Reporter: Greg Bowyer
> Fix For: 4.9, 5.0
>
>
> This is a split from SOLR-4465, in that issue the ability to create 
> customised collectors that allow for aggregate functions was born, but 
> suffers from being unable to work well with queryResultCaching and grouping.
> Migrating out this functionality into a collector component within solr, and 
> perhaps pushing down some of the logic towards lucene seems to be the way to 
> go.




[jira] [Commented] (SOLR-5468) Option to enforce a majority quorum approach to accepting updates in SolrCloud

2014-05-05 Thread Otis Gospodnetic (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-5468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13990144#comment-13990144
 ] 

Otis Gospodnetic commented on SOLR-5468:


I just skimmed the comments here the other day.  I could be wrong, but aren't 
you guys describing Hinted Handoff?  If so, haven't applications like Voldemort 
and Cassandra and maybe others already dealt with this and may have code or at 
least approaches that have been in production for a while and that could be 
followed/used?   Maybe ES deals with this too, though I can't recall at the 
moment.  [~gro] do you know?

> Option to enforce a majority quorum approach to accepting updates in SolrCloud
> --
>
> Key: SOLR-5468
> URL: https://issues.apache.org/jira/browse/SOLR-5468
> Project: Solr
>  Issue Type: New Feature
>  Components: SolrCloud
>Affects Versions: 4.5
> Environment: All
>Reporter: Timothy Potter
>Assignee: Timothy Potter
>Priority: Minor
> Attachments: SOLR-5468.patch
>
>
> I've been thinking about how SolrCloud deals with write-availability using 
> in-sync replica sets, in which writes will continue to be accepted so long as 
> there is at least one healthy node per shard.
> For a little background (and to verify my understanding of the process is 
> correct), SolrCloud only considers active/healthy replicas when acknowledging 
> a write. Specifically, when a shard leader accepts an update request, it 
> forwards the request to all active/healthy replicas and only considers the 
> write successful if all active/healthy replicas ack the write. Any down / 
> gone replicas are not considered and will sync up with the leader when they 
> come back online using peer sync or snapshot replication. For instance, if a 
> shard has 3 nodes, A, B, C with A being the current leader, then writes to 
> the shard will continue to succeed even if B & C are down.
> The issue is that if a shard leader continues to accept updates even if it 
> loses all of its replicas, then we have acknowledged updates on only 1 node. 
> If that node, call it A, then fails and one of the previous replicas, call it 
> B, comes back online before A does, then any writes that A accepted while the 
> other replicas were offline are at risk of being lost. 
> SolrCloud does provide a safe-guard mechanism for this problem with the 
> leaderVoteWait setting, which puts any replicas that come back online before 
> node A into a temporary wait state. If A comes back online within the wait 
> period, then all is well as it will become the leader again and no writes 
> will be lost. As a side note, sys admins definitely need to be made more 
> aware of this situation as when I first encountered it in my cluster, I had 
> no idea what it meant.
> My question is whether we want to consider an approach where SolrCloud will 
> not accept writes unless there is a majority of replicas available to accept 
> the write? For my example, under this approach, we wouldn't accept writes if 
> both B&C failed, but would if only C did, leaving A & B online. Admittedly, 
> this lowers the write-availability of the system, so may be something that 
> should be tunable?
> From Mark M: Yeah, this is kind of like one of many little features that we 
> have just not gotten to yet. I’ve always planned for a param that lets you 
> say how many replicas an update must be verified on before responding 
> success. Seems to make sense to fail that type of request early if you notice 
> there are not enough replicas up to satisfy the param to begin with.
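
The proposal above boils down to a check made before a write is accepted. This is a minimal sketch of that check, not SolrCloud code: the function and the min_replicas argument are hypothetical, with min_replicas standing in for the tunable parameter mentioned in the last paragraph.

```python
def can_accept_write(total_replicas, live_replicas, min_replicas=None):
    """Decide whether a shard should accept an update.

    By default a strict majority of the shard's replicas (leader included)
    must be live. Passing min_replicas overrides that threshold.
    """
    if min_replicas is None:
        min_replicas = total_replicas // 2 + 1
    return live_replicas >= min_replicas


if __name__ == "__main__":
    # Shard with nodes A, B, C as in the description.
    print(can_accept_write(3, 1))  # B and C down, only A live
    print(can_accept_write(3, 2))  # only C down, A and B live
    print(can_accept_write(3, 1, min_replicas=1))  # current behaviour
```

With min_replicas=1 the check matches the behaviour described at the top of the issue, where a single healthy node per shard is enough; raising it trades write-availability for durability.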



