use multiple ssd in solr cloud

2017-11-06 Thread Amin Raeiszadeh
Hi
I want to use more than one SSD in each server of my Solr cluster, but I don't
know how to set multiple drives in the solr.xml configuration.
I set one SSD path in solr.xml by:
/media/ssd
but I can't set more than one SSD.
How should I do it?
Thanks.
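
(Not part of the original thread: one possible approach, sketched here as an
assumption, is to give each core its own dataDir in its core.properties and
spread the cores across drives; the core names and paths below are
illustrative.)

# core.properties of a core stored on the first SSD
name=products_ssd1
dataDir=/media/ssd/products/data

# core.properties of a core stored on the second SSD
name=products_ssd2
dataDir=/media/ssd2/products/data

(The CoreAdmin CREATE command also accepts a dataDir parameter, so the same
effect can be achieved at core-creation time.)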


Re: recent utf8 problems

2017-11-06 Thread Dr. Mario Michael Krell
Hi,

thank you for your time and trying to narrow down my problem.

1) When looking for Tübingen in the title, I am expecting the 3092484 results. 
That sounds like a reasonable number. Furthermore, when looking at some of the 
results, they are exactly what I am looking for.

2) I am testing them against the same Solr server. This is a very simple 
testing setup that reduces our problem to its core. Originally, we used a 
urllib.request.urlopen query to get the data in Python and then sent it to our 
webpage (http://search.mmcommons.org/) as a JSON object. I think I should 
explain my test more clearly. We use a web browser (Firefox or Chrome) to open 
the admin console of the search engine, which is at 
http://localhost:8983/solr/#/mmc_search3/query on my local device. This is 
the default behavior. In this browser, I use the query 
"title:T%C3%BCbingen" in the field "q" with /select as the "Request-Handler 
(qt)". This approach works like a charm (result with echoParams attached). 
Also, as asked by Rick, the request URL displayed in the upper left is just 
perfect:
http://localhost:8983/solr/mmc_search3/select?echoParams=all&q=title:T%C3%BCbingen&wt=python

The problems start to occur when I click on this URL:
{
  'responseHeader':{
    'status':0,
    'QTime':0,
    'params':{
      'q':u'title:T\u00fcbingen',
      'echoParams':'all',
      'wt':'python'}},
  'response':{'numFound':0,'start':0,'docs':[]
  }}
So it seems that internally, Solr (or a library it uses?) is changing the 
request. I just don't have any idea why. But I would like to get the more than 
3 million results. I could as well just enter the above URL into my browser, 
and the URL will be changed to
http://localhost:8983/solr/mmc_search3/select?echoParams=all&q=title:Tübingen&wt=python

and I get the same result (no documents found). So this is the problem. 
However, when I copy and paste the URL, it still displays the UTF-8 encoding. 
I think the "ü" in the URL is just the browser displaying it decoded.

The confusion with the different Solr versions comes from the fact that I am 
continuously trying to improve my search index and make it more efficient. 
Hence I have reindexed it several times, always with the latest version. The 
last reindexing was done with Solr 7.0.1, using the Lucene 7.0.1 index format. 
However, I performed the test on other versions as well, without any success.

3) As Rick said: "With the Yahoo Flickr Creative Commons 100 Million (YFCC100m) 
dataset, a great novel dataset was introduced to the computer vision and 
multimedia research community." — cool

My objective is to make it more usable, especially by providing different 
search modalities. The dataset consists of 99 million images and 800k videos, 
but I am only working on the Flickr metadata as well as generated metadata and 
try to add more and more metadata. The next big challenge is similarity search.

4) 
http://localhost:8983/solr/mmc_search3/select?echoParams=all&q=title:Tübingen&wt=python
is displayed, but it is actually
http://localhost:8983/solr/mmc_search3/select?echoParams=all&q=title:T%C3%BCbingen&wt=python.

5) I am searching for Tübingen. It is u-umlaut (LATIN SMALL LETTER U WITH 
DIAERESIS) as Rick said.

6) I am just clicking on it in the standard Solr admin interface. I could as 
well copy it into my web browser and open it. The result would be the same.

7) As you can see in the result, the document seems to be indexed correctly, 
doesn't it? If we can't figure anything out, I will try to reindex again, but 
this will take a while because of the large amount of data and my limited 
compute power.

8) Thanks for the hint with echoParams. The result is displayed above.

9) As shown in the attached search result, there are actually results correctly 
indexed.

10) The above example is now with Python.

11) @Rick: Shall I change the /select handler? I do not quite understand the 
problem with it. But maybe as an explanation: my original config was probably 
based on Solr 4.x. I basically just updated the Lucene version, and I had to 
replace/remove some parts because they were no longer supported.

12) For playing the "what changed previous to it being broken" game, I am 
wondering if Solr (6.5 or 7.0.1) has any dependencies other than Java. 
However, playing this game is quite difficult, because the human mind is not 
that good at it. We only tested once in a while whether requests with special 
symbols work, and we mainly tested in the GUI without actually clicking 
on the resulting link that is displayed. Later we tested with the webpage,
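
(An illustrative sketch, not from the thread: issuing the same query from
Python 3 with explicit UTF-8 percent-encoding; the host, core name, and field
come from the messages above, everything else is an assumption.)

import json
import urllib.parse
import urllib.request

# urlencode percent-encodes the UTF-8 bytes of "ü" by default in Python 3,
# so 'title:Tübingen' becomes 'title%3AT%C3%BCbingen' on the wire.
params = urllib.parse.urlencode({
    'q': 'title:Tübingen',
    'echoParams': 'all',
    'wt': 'json',
})
url = 'http://localhost:8983/solr/mmc_search3/select?' + params
with urllib.request.urlopen(url) as response:
    result = json.loads(response.read().decode('utf-8'))

# echoParams=all lets us verify which characters Solr actually received.
print(result['responseHeader']['params'])
print(result['response']['numFound'])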

Incorrect ngroup count

2017-11-06 Thread Zheng Lin Edwin Yeo
Hi,

I'm using Solr 6.5.1, and I'm facing an issue of incorrect ngroups counts
after grouping by a signature field.

Usually, the number of records returned is more than what is shown in
ngroups. For example, I may get an ngroups of 22, but 25 records are
returned.

Below is the part of solrconfig.xml that generates the signature used for
grouping:

  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">signature</str>
    <bool name="overwriteDupes">false</bool>
    <str name="fields">content</str>
    <str name="signatureClass">solr.processor.Lookup3Signature</str>
  </processor>
  <processor class="solr.DistributedUpdateProcessorFactory" />


This is where I set the grouping to true in the requestHandler:

  <bool name="group">true</bool>
  <str name="group.field">signature</str>
  <bool name="group.ngroups">true</bool>
  <str name="group.cache.percent">100</str>
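
(For context, a sketch of where such grouping defaults typically live,
assuming a /select SearchHandler; the handler name and wrapper structure are
assumptions, not from the original message.)

<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <bool name="group">true</bool>
    <str name="group.field">signature</str>
    <bool name="group.ngroups">true</bool>
    <str name="group.cache.percent">100</str>
  </lst>
</requestHandler>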

What could be the issue that causes this?

Regards,
Edwin


Re: recent utf8 problems

2017-11-06 Thread Rick Leir
Hoss
Clearly it is 
U+00FC  ü   c3 bc   LATIN SMALL LETTER U WITH DIAERESIS
As in Tübingen

"With the Yahoo Flickr Creative Commons 100 Million (YFCC100m) dataset, a great 
novel dataset was introduced to the computer vision and multimedia research 
community." -- cool

I think it is strange that the /select handler was completely default. In my 
experience there is some sort of config for it in solrconfig.xml.

In the SolrAdmin console, query pane, when you have entered some params and 
done a search, it shows you the complete URL above the results. Does that match 
your select query?

We could ask "what changed previous to it being broken". I suspect there was 
something other than the Java upgrade, but it will be interesting if that is 
actually the cause.
Cheers -- Rick
-- 
Sorry for being brief. Alternate email is rickleir at yahoo dot com

Re: Anyone have any comments on current solr monitoring favorites?

2017-11-06 Thread Atita Arora
Hi Daniel,

What version of Solr are you using?
We gave Prometheus + Jolokia + InfluxDB + Grafana a try, and that came out
well.
With Solr 6.6 the metrics are exposed through the /metrics API, but how do
we go about it for earlier versions? Please advise.
Specifically the cache monitoring.

Thanks in advance,
Atita

On Mon, Nov 6, 2017 at 2:19 PM, Daniel Ortega 
wrote:

> Hi Robert,
>
> We use the following stack:
>
> - Prometheus to scrape metrics (https://prometheus.io/)
> - Prometheus node exporter to export "machine metrics" (Disk, network
> usage, etc.) (https://github.com/prometheus/node_exporter)
> - Prometheus JMX exporter to export "Solr metrics" (Cache usage, QPS,
> Response times...) (https://github.com/prometheus/jmx_exporter)
> - Grafana to visualize all the data scraped by Prometheus (
> https://grafana.com/)
>
> Best regards
> Daniel Ortega
>
> 2017-11-06 20:13 GMT+01:00 Petersen, Robert (Contr) <
> robert.peters...@ftr.com>:
>
> > PS I knew sematext would be required to chime in here!  
> >
> >
> > Is there a non-expiring dev version I could experiment with? I think I
> did
> > sign up for a trial years ago from a different company... I was actually
> > wondering about hooking it up to my personal AWS based solr cloud
> instance.
> >
> >
> > Thanks
> >
> > Robi
> >
> > 
> > From: Emir Arnautović 
> > Sent: Thursday, November 2, 2017 2:05:10 PM
> > To: solr-user@lucene.apache.org
> > Subject: Re: Anyone have any comments on current solr monitoring
> favorites?
> >
> > Hi Robi,
> > Did you try Sematext’s SPM? It provides host, JVM and Solr metrics and
> > more. We use it for monitoring our Solr instances and for consulting.
> >
> > Disclaimer - see signature :)
> >
> > Emir
> > --
> > Monitoring - Log Management - Alerting - Anomaly Detection
> > Solr & Elasticsearch Consulting Support Training - http://sematext.com/
> >
> >
> >
> > > On 2 Nov 2017, at 19:35, Walter Underwood 
> wrote:
> > >
> > > We use New Relic for JVM, CPU, and disk monitoring.
> > >
> > > I tried the built-in metrics support in 6.4, but it just didn’t do what
> > we want. We want rates and percentiles for each request handler. That
> gives
> > us 95th percentile for textbooks suggest or for homework search results
> > page, etc. The Solr metrics didn’t do that. The Jetty metrics didn’t do
> > that.
> > >
> > > We built a dedicated servlet filter that goes in front of the Solr
> > webapp and reports metrics. It has some special hacks to handle some
> weird
> > behavior in SolrJ. A request to the “/srp” handler is sent as
> > “/select?qt=/srp”, so we normalize that.
> > >
> > > The metrics start with the cluster name, the hostname, and the
> > collection. The rest is generated like this:
> > >
> > > URL: GET /solr/textbooks/select?q=foo=/auto
> > > Metric: textbooks.GET./auto
> > >
> > > URL: GET /solr/textbooks/select?q=foo
> > > Metric: textbooks.GET./select
> > >
> > > URL: GET /solr/questions/auto
> > > Metric: questions.GET./auto
> > >
> > > So a full metric for the cluster “solr-cloud” and the host “search01"
> > would look like “solr-cloud.search01.solr.textbooks.GET./auto.m1_rate”.
> > >
> > > We send all that to InfluxDB. We’ve configured a template so that each
> > part of the metric name is mapped to a field, so we can write efficient
> > queries in InfluxQL.
> > >
> > > Metrics are graphed in Grafana. We have dashboards that mix Cloudwatch
> > (for the load balancer) and InfluxDB.
> > >
> > > I’m still working out the kinks in some of the more complicated
> queries,
> > but the data is all there. I also want to expand the servlet filter to
> > report HTTP response codes.
> > >
> > > wunder
> > > Walter Underwood
> > > wun...@wunderwood.org
> > > http://observer.wunderwood.org/  (my blog)
> > >
> > >
> > >> On Nov 2, 2017, at 9:30 AM, Petersen, Robert (Contr) <
> > robert.peters...@ftr.com> wrote:
> > >>
> > >> OK I'm probably going to open a can of worms here...  lol
> > >>
> > >>
> > >> In the old old days I used PSI probe to monitor solr running on tomcat
> > which worked ok on a machine by machine basis.
> > >>
> > >>
> > >> Later I had a grafana dashboard on top of graphite monitoring which
> was
> > really nice looking but kind of complicated to set up.
> > >>
> > >>
> > >> Even later I successfully just dropped in a newrelic java agent which
> > had solr monitors and a dashboard right out of the box, but it costs
> money
> > for the full tamale.
> > >>
> > >>
> > >> For basic JVM health and Solr QPS and time percentiles, does anyone
> > have any favorites or other alternative suggestions?
> > >>
> > >>
> > >> Thanks in advance!
> > >>
> > >> Robi
> > >>
> > >> 
> > >>
> > >> This communication is confidential. Frontier only sends and receives
> > email on the basis of the terms set out at
> http://www.frontier.com/email_
> > disclaimer.
> > >
> >
> >
>


Can someone help? Two level nested doc... ChildDocTransformerFactory syntax...

2017-11-06 Thread Petersen, Robert (Contr)
OK, no faceting, no filtering, I just want the hierarchy to come back in the 
results. Can't quite get it... googled all over the place too.


Doc:

{
  "id": "asdf",
  "type_s": "customer",
  "firstName_s": "Manny",
  "lastName_s": "Acevedo",
  "address_s": "123 Fourth Street",
  "city_s": "Gotham",
  "tn_s": "1234561234",
  "_childDocuments_": [
    {
      "id": "asdf_c1",
      "src_s": "CRM.Customer",
      "type_s": "customerSource",
      "_childDocuments_": [
        {
          "id": "asdf_c1_c1",
          "type_s": "customerSourceType",
          "key_s": "id",
          "value_s": "GUID"
        }
      ]
    },
    {
      "id": "asdf_c2",
      "src_s": "DPI.SalesOrder",
      "type_s": "customerSource",
      "_childDocuments_": [
        {
          "id": "asdf_c2_c1",
          "type_s": "customerSourceType",
          "key_s": "btn",
          "value_s": "4052328908"
        },
        {
          "id": "asdf_c2_c2",
          "type_s": "customerSourceType",
          "key_s": "seq",
          "value_s": "5"
        },
        {
          "id": "asdf_c2_c3",
          "type_s": "customerSourceType",
          "key_s": "env",
          "value_s": "MS"
        }
      ]
    }
  ]
}


Queries:

Gives all nested docs regardless of level as a flat set:
http://localhost:8983/solr/temptest/select?q=id:asdf&fl=id,[child%20parentFilter=type_s:customer]

Gives all nested child docs only:
http://localhost:8983/solr/temptest/select?q=id:asdf&fl=id,[child%20parentFilter=type_s:customer%20childFilter=type_s:customerSource]

How to get nested grandchild docs at the correct level?
Nope, exception:
http://localhost:8983/solr/temptest/select?q=id:asdf&fl=id,[child%20parentFilter=type_s:customer%20childFilter=type_s:customerSource],[child%20parentFilter=type_s:customerSource%20childFilter=type_s:customerSourceType]

Nope, exception:
http://localhost:8983/solr/temptest/select?q=id:asdf&fl=id,[child%20parentFilter=type_s:customer%20childFilter=type_s:customerSource],[child%20parentFilter=type_s:customerSource]


Nope, no exception, but it only gets the children again like above:
http://localhost:8983/solr/temptest/select?q=id:asdf&fl=id,[child%20parentFilter=type_s:customer%20childFilter=type_s:customerSource],[child%20parentFilter=type_s:customer*]

Nope, no exception, but it only gets the children again:

http://localhost:8983/solr/temptest/select?q=id:asdf&fl=id,[child%20parentFilter=type_s:customer%20childFilter=type_s:customerSource],[child%20parentFilter=type_s:customer*%20childFilter=type_s:customerSourceType]


Nope, same again... no grandchildren:

http://localhost:8983/solr/temptest/select?q=id:asdf&fl=id,p:[child%20parentFilter=type_s:customer%20childFilter=type_s:customerSource],q:[child%20parentFilter=-type_s:customer%20parentFilter=type_s:customerSource%20childFilter=type_s:customerSourceType]


Gives all, but flat; no child-to-grandchild hierarchy:

http://localhost:8983/solr/temptest/select?q=id:asdf&fl=id,p:[child%20parentFilter=type_s:customer%20childFilter=type_s:customerSource],q:[child%20parentFilter=type_s:customer%20childFilter=type_s:customerSourceType]


Thanks in advance,

Robi



This communication is confidential. Frontier only sends and receives email on 
the basis of the terms set out at http://www.frontier.com/email_disclaimer.


Re: Java 9

2017-11-06 Thread Petersen, Robert (Contr)
Actually, I can't believe they're deprecating UseConcMarkSweepGC. That was the 
one that finally made Solr 'sing' with no OOMs!


I guess they must have found something better; I'll have to look into that...


Robi


From: Chris Hostetter 
Sent: Monday, November 6, 2017 3:07:28 PM
To: solr-user@lucene.apache.org
Subject: Re: Java 9



: Anyone else been noticing this this msg when starting up solr with java 9? 
(This is just an FYI and not a real question)

: Java HotSpot(TM) 64-Bit Server VM warning: Option UseConcMarkSweepGC was 
deprecated in version 9.0 and will likely be removed in a future release.
: Java HotSpot(TM) 64-Bit Server VM warning: Option UseParNewGC was deprecated 
in version 9.0 and will likely be removed in a future release.

IIRC the default GC_TUNE options for Solr still assume java8, but also
work fine with java9 -- although they do cause those deprecation warnings
and result in using the JVM defaults.

You are free to customize this in your solr.in.sh if you are running java9 and
don't like the deprecation warnings ... and/or open a Jira w/suggestions
for what Solr's default GC_TUNE option should be when running in java9 (I
don't know if there is any community consensus on that yet -- but you're
welcome to try and build some)


-Hoss
http://www.lucidworks.com/



This communication is confidential. Frontier only sends and receives email on 
the basis of the terms set out at http://www.frontier.com/email_disclaimer.
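
(Following up on the GC_TUNE suggestion quoted above: a minimal sketch of what
such a solr.in.sh override could look like; the G1 flags and values are
illustrative assumptions, not an official Solr default.)

# solr.in.sh: replace the deprecated CMS/ParNew tuning with G1
# (flag values are illustrative, not a recommendation)
GC_TUNE="-XX:+UseG1GC \
-XX:+ParallelRefProcEnabled \
-XX:MaxGCPauseMillis=250"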


Re: recent utf8 problems

2017-11-06 Thread Chris Hostetter

: We recently discovered issues with solr with converting utf8 code in the 
search. One or two month ago everything was still working.
: 
: - What might have caused it is a Java update (Java 8 Update 151). 
: - We are using firefox as well as chrome for displaying results.
: - We tested it with Solr 6.5, Solr 7.0.0, 7.0.1, and 7.1.

Just to be clear: in the 2 examples you provide below...

 1) which situation do you consider "correct" ? 
 ("match lots of docs" or "match no docs")
 2) are you testing those against the same live solr server?

I ask Q #2 because you mentioned "One or two month ago everything was 
still working" ... but it's not clear what part of the "results" were 
different one or two months ago.

other things that are unclear/confusing about your question...

: We created a search engine based on the yfcc100m and in the normal 
: browser (http://localhost:8983/solr/#/mmc_search3/query 
: ), we can search for 
: "title:T%C3%BCbingen” in the query field and get more than 3 million 
: results:

 3) what is "yfcc100m" ?
 4) what is the actual URL you see in your browser?
 5) what is the underlying byte sequence / character sequence you are 
trying to search for?

ie: can you please explicitly name the UNICODE codepoints you are 
intending to search for?

: However, when we use the respective web-address, 
: http://localhost:8983/solr/mmc_search3/select?q=title:T%C3%BCbingen&wt=json


 6) define "use the respective web-address" ?
(how are you using it? what http client is hitting that url?)


Some general advice about debugging possible charset related issues:

 * the problem may be related to how the query is executed -- or it may 
have been related to how the data was originally indexed, if at that time 
the wrong byte sequences were sent.

 * you can use things like "echoParams=all" in a query to see exactly what 
unicode characters solr is receiving in the q param
 * assuming the field you are searching is stored=true, you can also send 
requests to search for one of the documents you expect by id, and verify 
what unicode characters were indexed.
 * in both types of requests, you can use "wt=python" to help see the 
underlying bytes being returned for each character (the python response 
writer escapes all characters outside of the ascii range)



-Hoss
http://www.lucidworks.com/

Re: Solr 7* Sorry, no dataimport-handler defined

2017-11-06 Thread Shawn Heisey
On 11/6/2017 4:26 PM, Shawn Heisey wrote:
> If I start Solr "normally" or with the cloud example, then the
> dataimport tab shows that error -- which is exactly as expected.

I have opened an improvement issue so that particular error message
isn't as vague.  It's been labeled with "newdev" because it's an easy
change.

https://issues.apache.org/jira/browse/SOLR-11613

Thanks,
Shawn



Re: Solr 7* Sorry, no dataimport-handler defined

2017-11-06 Thread Shawn Heisey
On 11/6/2017 1:47 PM, richardg wrote:
> I see where this was an issue w/ 6.4 and fixed.  I keep getting this error w/
> 7.0.1 and 7.1.0.  Works fine up until 6.6.2.  Could this issue have been
> reintroduced?  Is there somewhere to check what might be going on?  I don't
> see anything in the error logs.

You've included almost no information here.  I assume that the subject
is the error?  If so, where do you see it?  Please be aware that without
information, we can't help.

https://wiki.apache.org/solr/UsingMailingLists

My best guess is that you see that error in the admin UI when you click
on dataimport.  The error is pretty self-explanatory -- Solr couldn't
find any defined dataimport handlers.  At least one such handler must be
set up in the config before you can use the dataimport tab.

You mention that this was an issue with 6.4 and fixed ... but didn't
include a link to the issue.  Are you talking about SOLR-10035?

https://issues.apache.org/jira/browse/SOLR-10035

If I start a fresh Solr 7.1.0 download with the DIH example (bin/solr
start -e dih), clicking on the dataimport tab works on all five cores. 
If I start Solr "normally" or with the cloud example, then the
dataimport tab shows that error -- which is exactly as expected.  The
example configsets in the base Solr download do not have a dataimport
handler.  Only the configs specifically in the DIH example have such
handlers defined.
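
(For illustration, a minimal sketch of the kind of handler definition the DIH
example configs contain; the lib path and config file name follow the shipped
examples and are assumptions here.)

<lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-dataimporthandler-.*\.jar" />

<requestHandler name="/dataimport"
                class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <!-- data-config.xml defines the dataSource and entities to import -->
    <str name="config">data-config.xml</str>
  </lst>
</requestHandler>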

Have you created a core with a dataimport handler that you KNOW is
working, and seeing that error?  If so, we'll need more detailed
information.  Solr version, everything in the core's conf directory,
details about the error and steps taken to see it, etc.  Any error
messages in your logfile would be helpful as well.

Thanks,
Shawn



Re: Java 9

2017-11-06 Thread Shawn Heisey
On 11/6/2017 3:07 PM, Petersen, Robert (Contr) wrote:
> Anyone else been noticing this this msg when starting up solr with java 9? 
> (This is just an FYI and not a real question)
>
> Java HotSpot(TM) 64-Bit Server VM warning: Option UseConcMarkSweepGC was 
> deprecated in version 9.0 and will likely be removed in a future release.
> Java HotSpot(TM) 64-Bit Server VM warning: Option UseParNewGC was deprecated 
> in version 9.0 and will likely be removed in a future release.

I have not tried Java 9 yet.

Looks like G1 is now the default garbage collector.  I did not know that
they were deprecating CMS and ParNew ... that's a little surprising. 
Solr's default garbage collection tuning uses those two collectors.  It
is likely that those choices will be available in all versions of Java
9.  It would be very uncharacteristic for Oracle to take action on
removing them until version 10, possibly later.

If it were solely up to me, I would adjust Solr's startup script to use
the G1 collector by default, eliminating the warnings you're seeing. 
It's not just up to me though.  Lucene documentation says to NEVER use
the G1 collector because they believe it to be unpredictable and have
the potential to cause problems.  I personally have never had any issues
with it.  There is *one* Lucene issue mentioning problems with G1GC, and
that issue is *specific* to the 32-bit JVM, which is not recommended
because of the limited amount of memory it can use.

My experiments with GC tuning show the G1 collector (now default in Java
9) to have very good characteristics with Solr.  I have a personal page
on the Solr wiki that covers those experiments.

https://wiki.apache.org/solr/ShawnHeisey

Thanks,
Shawn



Re: Java 9

2017-11-06 Thread Chris Hostetter


: Anyone else been noticing this this msg when starting up solr with java 9? 
(This is just an FYI and not a real question)

: Java HotSpot(TM) 64-Bit Server VM warning: Option UseConcMarkSweepGC was 
deprecated in version 9.0 and will likely be removed in a future release.
: Java HotSpot(TM) 64-Bit Server VM warning: Option UseParNewGC was deprecated 
in version 9.0 and will likely be removed in a future release.

IIRC the default GC_TUNE options for Solr still assume java8, but also 
work fine with java9 -- although they do cause those deprecation warnings 
and result in using the JVM defaults.

You are free to customize this in your solr.in.sh if you are running java9 and 
don't like the deprecation warnings ... and/or open a Jira w/suggestions 
for what Solr's default GC_TUNE option should be when running in java9 (I 
don't know if there is any community consensus on that yet -- but you're 
welcome to try and build some)


-Hoss
http://www.lucidworks.com/




RE: Anyone have any comments on current solr monitoring favorites?

2017-11-06 Thread Davis, Daniel (NIH/NLM) [C]
I have used Java Melody for this purpose on past Java based servers, but I 
haven't tried to embed it in Jetty.

-Original Message-
From: Petersen, Robert (Contr) [mailto:robert.peters...@ftr.com] 
Sent: Monday, November 06, 2017 4:50 PM
To: solr-user@lucene.apache.org
Subject: Re: Anyone have any comments on current solr monitoring favorites?

Hi Walter,


OK now that sounds really interesting. I actually just turned on logging in 
Jetty and yes did see all the intra-cluster traffic there. I'm pushing our ELK 
team to pick out the get search requests across the cluster and aggregate them 
for me. We'll see how that looks but that would just be for user query analysis 
and not for real time analysis. Still looking for something to monitor real 
time since apparently my company has all its New Relic licenses tied up with 
other level one processes and doesn't want to buy any more of them at this 
time...  lol


And yes when I looked directly at the Graphite data backing Grafana at my last 
position it was just scary!


Thanks

Robi


PS: early adopter of InfluxDB in general, or just for this use case?


From: Walter Underwood 
Sent: Monday, November 6, 2017 1:44:01 PM
To: solr-user@lucene.apache.org
Subject: Re: Anyone have any comments on current solr monitoring favorites?

We use New Relic across the site, but it doesn’t split out traffic to different 
endpoints. It also cannot distinguish between search traffic to the cluster and 
intra-cluster traffic. With four shards, the total traffic is 4X bigger than 
the incoming traffic.

We have a bunch of business metrics (orders) and other stuff that is currently 
in Graphite. We’ll almost certainly move all that to InfluxDB and Grafana.

The Solr metrics were overloading the Graphite database, so we’re the first 
service that is trying InfluxDB.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Nov 6, 2017, at 1:31 PM, Petersen, Robert (Contr) 
>  wrote:
>
> Hi Walter,
>
>
> Yes, now I see it. I'm wondering about using Grafana and New Relic at 
> the same time since New Relic has a dashboard and also costs money for 
> corporate use. I guess after a reread you are using Grafana to 
> visualize the influxDB data and New Relic just for JVM right?  Did 
> this give you more control over the solr metrics you are monitoring? 
> (PS I've never heard of influxDB)
>
>
> Thanks
>
> Robi
>
> 
> From: Walter Underwood 
> Sent: Monday, November 6, 2017 11:26:07 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Anyone have any comments on current solr monitoring favorites?
>
> Look back down the string to my post. We use Grafana.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
>
>> On Nov 6, 2017, at 11:23 AM, Petersen, Robert (Contr) 
>>  wrote:
>>
>> Interesting! Finally a Grafana user... Thanks Daniel, I will follow your 
>> links. That looks promising.
>>
>>
>> Is anyone using Grafana over Graphite?
>>
>>
>> Thanks
>>
>> Robi
>>
>> 
>> From: Daniel Ortega 
>> Sent: Monday, November 6, 2017 11:19:10 AM
>> To: solr-user@lucene.apache.org
>> Subject: Re: Anyone have any comments on current solr monitoring favorites?
>>
>> Hi Robert,
>>
>> We use the following stack:
>>
>> - Prometheus to scrape metrics (https://prometheus.io/)
>> - Prometheus node exporter to export "machine metrics" (Disk, network 
>> usage, etc.) (https://github.com/prometheus/node_exporter)
>> - Prometheus JMX exporter to export "Solr metrics" (Cache usage, QPS, 
>> Response times...) (https://github.com/prometheus/jmx_exporter)
>> - Grafana to visualize all the data scraped by Prometheus (
>> https://grafana.com/)
>>
>> Best regards
>> Daniel Ortega
>>
>> 2017-11-06 20:13 GMT+01:00 Petersen, Robert (Contr) <
>> robert.peters...@ftr.com>:
>>
>>> PS I knew sematext would be required to chime in here!  
>>>
>>>
>>> Is there a non-expiring dev version I could experiment with? I think 
>>> I did sign up for a trial years ago from a different company... I 
>>> was actually wondering about hooking it up to my personal AWS based solr 
>>> cloud instance.
>>>
>>>
>>> Thanks
>>>
>>> Robi
>>>
>>> 
>>> From: Emir Arnautović 
>>> Sent: Thursday, November 2, 2017 2:05:10 PM
>>> To: solr-user@lucene.apache.org
>>> Subject: Re: Anyone have any comments on current solr monitoring favorites?
>>>
>>> Hi Robi,
>>> Did you try Sematext’s SPM? It provides host, JVM and Solr metrics 
>>> and more. We use it for monitoring our Solr instances and for consulting.
>>>
>>> Disclaimer - see signature :)
>>>
>>> Emir
>>> --
>>> Monitoring - Log Management - Alerting - Anomaly Detection Solr & 
>>> Elasticsearch Consulting Support Training 

Java 9

2017-11-06 Thread Petersen, Robert (Contr)
Hi Guys,


Anyone else been noticing this this msg when starting up solr with java 9? 
(This is just an FYI and not a real question)


Java HotSpot(TM) 64-Bit Server VM warning: Option UseConcMarkSweepGC was 
deprecated in version 9.0 and will likely be removed in a future release.
Java HotSpot(TM) 64-Bit Server VM warning: Option UseParNewGC was deprecated in 
version 9.0 and will likely be removed in a future release.


Robi



This communication is confidential. Frontier only sends and receives email on 
the basis of the terms set out at http://www.frontier.com/email_disclaimer.


Re: Anyone have any comments on current solr monitoring favorites?

2017-11-06 Thread Petersen, Robert (Contr)
Hi Walter,


OK now that sounds really interesting. I actually just turned on logging in 
Jetty and yes did see all the intra-cluster traffic there. I'm pushing our ELK 
team to pick out the get search requests across the cluster and aggregate them 
for me. We'll see how that looks but that would just be for user query analysis 
and not for real time analysis. Still looking for something to monitor real 
time since apparently my company has all its New Relic licenses tied up with 
other level one processes and doesn't want to buy any more of them at this 
time...  lol


And yes when I looked directly at the Graphite data backing Grafana at my last 
position it was just scary!


Thanks

Robi


PS: early adopter of InfluxDB in general, or just for this use case?


From: Walter Underwood 
Sent: Monday, November 6, 2017 1:44:01 PM
To: solr-user@lucene.apache.org
Subject: Re: Anyone have any comments on current solr monitoring favorites?

We use New Relic across the site, but it doesn’t split out traffic to different 
endpoints. It also cannot distinguish between search traffic to the cluster and 
intra-cluster traffic. With four shards, the total traffic is 4X bigger than 
the incoming traffic.

We have a bunch of business metrics (orders) and other stuff that is currently 
in Graphite. We’ll almost certainly move all that to InfluxDB and Grafana.

The Solr metrics were overloading the Graphite database, so we’re the first 
service that is trying InfluxDB.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Nov 6, 2017, at 1:31 PM, Petersen, Robert (Contr) 
>  wrote:
>
> Hi Walter,
>
>
> Yes, now I see it. I'm wondering about using Grafana and New Relic at the 
> same time since New Relic has a dashboard and also costs money for corporate 
> use. I guess after a reread you are using Grafana to visualize the influxDB 
> data and New Relic just for JVM right?  Did this give you more control over 
> the solr metrics you are monitoring? (PS I've never heard of influxDB)
>
>
> Thanks
>
> Robi
>
> 
> From: Walter Underwood 
> Sent: Monday, November 6, 2017 11:26:07 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Anyone have any comments on current solr monitoring favorites?
>
> Look back down the string to my post. We use Grafana.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
>
>> On Nov 6, 2017, at 11:23 AM, Petersen, Robert (Contr) 
>>  wrote:
>>
>> Interesting! Finally a Grafana user... Thanks Daniel, I will follow your 
>> links. That looks promising.
>>
>>
>> Is anyone using Grafana over Graphite?
>>
>>
>> Thanks
>>
>> Robi
>>
>> 
>> From: Daniel Ortega 
>> Sent: Monday, November 6, 2017 11:19:10 AM
>> To: solr-user@lucene.apache.org
>> Subject: Re: Anyone have any comments on current solr monitoring favorites?
>>
>> Hi Robert,
>>
>> We use the following stack:
>>
>> - Prometheus to scrape metrics (https://prometheus.io/)
>> - Prometheus node exporter to export "machine metrics" (Disk, network
>> usage, etc.) (https://github.com/prometheus/node_exporter)
>> - Prometheus JMX exporter to export "Solr metrics" (Cache usage, QPS,
>> Response times...) (https://github.com/prometheus/jmx_exporter)
>> - Grafana to visualize all the data scraped by Prometheus (
>> https://grafana.com/)
>>
>> Best regards
>> Daniel Ortega
>>
>> 2017-11-06 20:13 GMT+01:00 Petersen, Robert (Contr) <
>> robert.peters...@ftr.com>:
>>
>>> PS I knew sematext would be required to chime in here!  
>>>
>>>
>>> Is there a non-expiring dev version I could experiment with? I think I did
>>> sign up for a trial years ago from a different company... I was actually
>>> wondering about hooking it up to my personal AWS based solr cloud instance.
>>>
>>>
>>> Thanks
>>>
>>> Robi
>>>
>>> 
>>> From: Emir Arnautović 
>>> Sent: Thursday, November 2, 2017 2:05:10 PM
>>> To: solr-user@lucene.apache.org
>>> Subject: Re: Anyone have any comments on current solr monitoring favorites?
>>>
>>> Hi Robi,
>>> Did you try Sematext’s SPM? It provides host, JVM and Solr metrics and
>>> more. We use it for monitoring our Solr instances and for consulting.
>>>
>>> Disclaimer - see signature :)
>>>
>>> Emir
>>> --
>>> Monitoring - Log Management - Alerting - Anomaly Detection
>>> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>>>
>>>
>>>
 On 2 Nov 2017, at 19:35, Walter Underwood  wrote:

 We use New Relic for JVM, CPU, and disk monitoring.

 I tried the built-in metrics support in 6.4, but it just didn’t do what
>>> we want. We want rates and percentiles for each request handler. That gives
>>> us 95th percentile for 

Re: Anyone have any comments on current solr monitoring favorites?

2017-11-06 Thread Walter Underwood
We use New Relic across the site, but it doesn’t split out traffic to different 
endpoints. It also cannot distinguish between search traffic to the cluster and 
intra-cluster traffic. With four shards, the total traffic is 4X bigger than 
the incoming traffic.

We have a bunch of business metrics (orders) and other stuff that is currently 
in Graphite. We’ll almost certainly move all that to InfluxDB and Grafana.

The Solr metrics were overloading the Graphite database, so we’re the first 
service that is trying InfluxDB.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Nov 6, 2017, at 1:31 PM, Petersen, Robert (Contr) 
>  wrote:
> 
> Hi Walter,
> 
> 
> Yes, now I see it. I'm wondering about using Grafana and New Relic at the 
> same time since New Relic has a dashboard and also costs money for corporate 
> use. I guess after a reread you are using Grafana to visualize the influxDB 
> data and New Relic just for JVM right?  Did this give you more control over 
> the solr metrics you are monitoring? (PS I've never heard of influxDB)
> 
> 
> Thanks
> 
> Robi
> 
> 
> From: Walter Underwood 
> Sent: Monday, November 6, 2017 11:26:07 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Anyone have any comments on current solr monitoring favorites?
> 
> Look back down the string to my post. We use Grafana.
> 
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
> 
> 
>> On Nov 6, 2017, at 11:23 AM, Petersen, Robert (Contr) 
>>  wrote:
>> 
>> Interesting! Finally a Grafana user... Thanks Daniel, I will follow your 
>> links. That looks promising.
>> 
>> 
>> Is anyone using Grafana over Graphite?
>> 
>> 
>> Thanks
>> 
>> Robi
>> 
>> 
>> From: Daniel Ortega 
>> Sent: Monday, November 6, 2017 11:19:10 AM
>> To: solr-user@lucene.apache.org
>> Subject: Re: Anyone have any comments on current solr monitoring favorites?
>> 
>> Hi Robert,
>> 
>> We use the following stack:
>> 
>> - Prometheus to scrape metrics (https://prometheus.io/)
>> - Prometheus node exporter to export "machine metrics" (Disk, network
>> usage, etc.) (https://github.com/prometheus/node_exporter)
>> - Prometheus JMX exporter to export "Solr metrics" (Cache usage, QPS,
>> Response times...) (https://github.com/prometheus/jmx_exporter)
>> - Grafana to visualize all the data scraped by Prometheus (
>> https://grafana.com/)
>> 
>> Best regards
>> Daniel Ortega
>> 
>> 2017-11-06 20:13 GMT+01:00 Petersen, Robert (Contr) <
>> robert.peters...@ftr.com>:
>> 
>>> PS I knew sematext would be required to chime in here!  
>>> 
>>> 
>>> Is there a non-expiring dev version I could experiment with? I think I did
>>> sign up for a trial years ago from a different company... I was actually
>>> wondering about hooking it up to my personal AWS based solr cloud instance.
>>> 
>>> 
>>> Thanks
>>> 
>>> Robi
>>> 
>>> 
>>> From: Emir Arnautović 
>>> Sent: Thursday, November 2, 2017 2:05:10 PM
>>> To: solr-user@lucene.apache.org
>>> Subject: Re: Anyone have any comments on current solr monitoring favorites?
>>> 
>>> Hi Robi,
>>> Did you try Sematext’s SPM? It provides host, JVM and Solr metrics and
>>> more. We use it for monitoring our Solr instances and for consulting.
>>> 
>>> Disclaimer - see signature :)
>>> 
>>> Emir
>>> --
>>> Monitoring - Log Management - Alerting - Anomaly Detection
>>> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>>> 
>>> 
>>> 
 On 2 Nov 2017, at 19:35, Walter Underwood  wrote:
 
 We use New Relic for JVM, CPU, and disk monitoring.
 
 I tried the built-in metrics support in 6.4, but it just didn’t do what
>>> we want. We want rates and percentiles for each request handler. That gives
>>> us 95th percentile for textbooks suggest or for homework search results
>>> page, etc. The Solr metrics didn’t do that. The Jetty metrics didn’t do
>>> that.
 
 We built a dedicated servlet filter that goes in front of the Solr
>>> webapp and reports metrics. It has some special hacks to handle some weird
>>> behavior in SolrJ. A request to the “/srp” handler is sent as
>>> “/select?qt=/srp”, so we normalize that.
 
 The metrics start with the cluster name, the hostname, and the
>>> collection. The rest is generated like this:
 
 URL: GET /solr/textbooks/select?q=foo=/auto
 Metric: textbooks.GET./auto
 
 URL: GET /solr/textbooks/select?q=foo
 Metric: textbooks.GET./select
 
 URL: GET /solr/questions/auto
 Metric: questions.GET./auto
 
 So a full metric for the cluster “solr-cloud” and the host “search01"
>>> would look like “solr-cloud.search01.solr.textbooks.GET./auto.m1_rate”.
 
 We send all that to InfluxDB. 

Re: Anyone have any comments on current solr monitoring favorites?

2017-11-06 Thread Petersen, Robert (Contr)
Hi Walter,


Yes, now I see it. I'm wondering about using Grafana and New Relic at the same 
time since New Relic has a dashboard and also costs money for corporate use. I 
guess after a reread you are using Grafana to visualize the influxDB data and 
New Relic just for JVM right?  Did this give you more control over the solr 
metrics you are monitoring? (PS I've never heard of influxDB)


Thanks

Robi


From: Walter Underwood 
Sent: Monday, November 6, 2017 11:26:07 AM
To: solr-user@lucene.apache.org
Subject: Re: Anyone have any comments on current solr monitoring favorites?

Look back down the string to my post. We use Grafana.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Nov 6, 2017, at 11:23 AM, Petersen, Robert (Contr) 
>  wrote:
>
> Interesting! Finally a Grafana user... Thanks Daniel, I will follow your 
> links. That looks promising.
>
>
> Is anyone using Grafana over Graphite?
>
>
> Thanks
>
> Robi
>
> 
> From: Daniel Ortega 
> Sent: Monday, November 6, 2017 11:19:10 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Anyone have any comments on current solr monitoring favorites?
>
> Hi Robert,
>
> We use the following stack:
>
> - Prometheus to scrape metrics (https://prometheus.io/)
> - Prometheus node exporter to export "machine metrics" (Disk, network
> usage, etc.) (https://github.com/prometheus/node_exporter)
> - Prometheus JMX exporter to export "Solr metrics" (Cache usage, QPS,
> Response times...) (https://github.com/prometheus/jmx_exporter)
> - Grafana to visualize all the data scraped by Prometheus (
> https://grafana.com/)
>
> Best regards
> Daniel Ortega
>
> 2017-11-06 20:13 GMT+01:00 Petersen, Robert (Contr) <
> robert.peters...@ftr.com>:
>
>> PS I knew sematext would be required to chime in here!  
>>
>>
>> Is there a non-expiring dev version I could experiment with? I think I did
>> sign up for a trial years ago from a different company... I was actually
>> wondering about hooking it up to my personal AWS based solr cloud instance.
>>
>>
>> Thanks
>>
>> Robi
>>
>> 
>> From: Emir Arnautović 
>> Sent: Thursday, November 2, 2017 2:05:10 PM
>> To: solr-user@lucene.apache.org
>> Subject: Re: Anyone have any comments on current solr monitoring favorites?
>>
>> Hi Robi,
>> Did you try Sematext’s SPM? It provides host, JVM and Solr metrics and
>> more. We use it for monitoring our Solr instances and for consulting.
>>
>> Disclaimer - see signature :)
>>
>> Emir
>> --
>> Monitoring - Log Management - Alerting - Anomaly Detection
>> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>>
>>
>>
>>> On 2 Nov 2017, at 19:35, Walter Underwood  wrote:
>>>
>>> We use New Relic for JVM, CPU, and disk monitoring.
>>>
>>> I tried the built-in metrics support in 6.4, but it just didn’t do what
>> we want. We want rates and percentiles for each request handler. That gives
>> us 95th percentile for textbooks suggest or for homework search results
>> page, etc. The Solr metrics didn’t do that. The Jetty metrics didn’t do
>> that.
>>>
>>> We built a dedicated servlet filter that goes in front of the Solr
>> webapp and reports metrics. It has some special hacks to handle some weird
>> behavior in SolrJ. A request to the “/srp” handler is sent as
>> “/select?qt=/srp”, so we normalize that.
>>>
>>> The metrics start with the cluster name, the hostname, and the
>> collection. The rest is generated like this:
>>>
>>> URL: GET /solr/textbooks/select?q=foo=/auto
>>> Metric: textbooks.GET./auto
>>>
>>> URL: GET /solr/textbooks/select?q=foo
>>> Metric: textbooks.GET./select
>>>
>>> URL: GET /solr/questions/auto
>>> Metric: questions.GET./auto
>>>
>>> So a full metric for the cluster “solr-cloud” and the host “search01"
>> would look like “solr-cloud.search01.solr.textbooks.GET./auto.m1_rate”.
>>>
>>> We send all that to InfluxDB. We’ve configured a template so that each
>> part of the metric name is mapped to a field, so we can write efficient
>> queries in InfluxQL.
>>>
>>> Metrics are graphed in Grafana. We have dashboards that mix Cloudwatch
>> (for the load balancer) and InfluxDB.
>>>
>>> I’m still working out the kinks in some of the more complicated queries,
>> but the data is all there. I also want to expand the servlet filter to
>> report HTTP response codes.
>>>
>>> wunder
>>> Walter Underwood
>>> wun...@wunderwood.org
>>> http://observer.wunderwood.org/  (my blog)
>>>
>>>
 On Nov 2, 2017, at 9:30 AM, Petersen, Robert (Contr) <
>> robert.peters...@ftr.com> wrote:

 OK I'm probably going to open a can of worms here...  lol


 In the old old days I used PSI probe to monitor solr running on tomcat
>> which worked ok on a machine by machine basis.


 Later I had a grafana 

Solr 7* Sorry, no dataimport-handler defined

2017-11-06 Thread richardg
I see where this was an issue w/ 6.4 and fixed.  I keep getting this error w/
7.0.1 and 7.1.0.  Works fine up until 6.6.2.  Could this issue have been
reintroduced?  Is there somewhere to check what might be going on?  I don't
see anything in the error logs.



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Anyone have any comments on current solr monitoring favorites?

2017-11-06 Thread Emir Arnautović
Hi Robert,
There is a free plan with a limited number of nodes and 30 min retention. It 
should be straightforward to install it on an AWS based solr cloud instance, but 
if you run into some issues you can use the built-in chat to get in touch with 
somebody to help you set it up.

Regards,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 6 Nov 2017, at 20:13, Petersen, Robert (Contr)  
> wrote:
> 
> PS I knew sematext would be required to chime in here!  
> 
> 
> Is there a non-expiring dev version I could experiment with? I think I did 
> sign up for a trial years ago from a different company... I was actually 
> wondering about hooking it up to my personal AWS based solr cloud instance.
> 
> 
> Thanks
> 
> Robi
> 
> 
> From: Emir Arnautović 
> Sent: Thursday, November 2, 2017 2:05:10 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Anyone have any comments on current solr monitoring favorites?
> 
> Hi Robi,
> Did you try Sematext’s SPM? It provides host, JVM and Solr metrics and more. 
> We use it for monitoring our Solr instances and for consulting.
> 
> Disclaimer - see signature :)
> 
> Emir
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
> 
> 
> 
>> On 2 Nov 2017, at 19:35, Walter Underwood  wrote:
>> 
>> We use New Relic for JVM, CPU, and disk monitoring.
>> 
>> I tried the built-in metrics support in 6.4, but it just didn’t do what we 
>> want. We want rates and percentiles for each request handler. That gives us 
>> 95th percentile for textbooks suggest or for homework search results page, 
>> etc. The Solr metrics didn’t do that. The Jetty metrics didn’t do that.
>> 
>> We built a dedicated servlet filter that goes in front of the Solr webapp 
>> and reports metrics. It has some special hacks to handle some weird behavior 
>> in SolrJ. A request to the “/srp” handler is sent as “/select?qt=/srp”, so 
>> we normalize that.
>> 
>> The metrics start with the cluster name, the hostname, and the collection. 
>> The rest is generated like this:
>> 
>> URL: GET /solr/textbooks/select?q=foo=/auto
>> Metric: textbooks.GET./auto
>> 
>> URL: GET /solr/textbooks/select?q=foo
>> Metric: textbooks.GET./select
>> 
>> URL: GET /solr/questions/auto
>> Metric: questions.GET./auto
>> 
>> So a full metric for the cluster “solr-cloud” and the host “search01" would 
>> look like “solr-cloud.search01.solr.textbooks.GET./auto.m1_rate”.
>> 
>> We send all that to InfluxDB. We’ve configured a template so that each part 
>> of the metric name is mapped to a field, so we can write efficient queries 
>> in InfluxQL.
>> 
>> Metrics are graphed in Grafana. We have dashboards that mix Cloudwatch (for 
>> the load balancer) and InfluxDB.
>> 
>> I’m still working out the kinks in some of the more complicated queries, but 
>> the data is all there. I also want to expand the servlet filter to report 
>> HTTP response codes.
>> 
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>> 
>> 
>>> On Nov 2, 2017, at 9:30 AM, Petersen, Robert (Contr) 
>>>  wrote:
>>> 
>>> OK I'm probably going to open a can of worms here...  lol
>>> 
>>> 
>>> In the old old days I used PSI probe to monitor solr running on tomcat 
>>> which worked ok on a machine by machine basis.
>>> 
>>> 
>>> Later I had a grafana dashboard on top of graphite monitoring which was 
>>> really nice looking but kind of complicated to set up.
>>> 
>>> 
>>> Even later I successfully just dropped in a newrelic java agent which had 
>>> solr monitors and a dashboard right out of the box, but it costs money for 
>>> the full tamale.
>>> 
>>> 
>>> For basic JVM health and Solr QPS and time percentiles, does anyone have 
>>> any favorites or other alternative suggestions?
>>> 
>>> 
>>> Thanks in advance!
>>> 
>>> Robi
>>> 
>>> 
>>> 
>>> This communication is confidential. Frontier only sends and receives email 
>>> on the basis of the terms set out at 
>>> http://www.frontier.com/email_disclaimer.
>> 
> 



Re: Anyone have any comments on current solr monitoring favorites?

2017-11-06 Thread Walter Underwood
Look back down the string to my post. We use Grafana.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Nov 6, 2017, at 11:23 AM, Petersen, Robert (Contr) 
>  wrote:
> 
> Interesting! Finally a Grafana user... Thanks Daniel, I will follow your 
> links. That looks promising.
> 
> 
> Is anyone using Grafana over Graphite?
> 
> 
> Thanks
> 
> Robi
> 
> 
> From: Daniel Ortega 
> Sent: Monday, November 6, 2017 11:19:10 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Anyone have any comments on current solr monitoring favorites?
> 
> Hi Robert,
> 
> We use the following stack:
> 
> - Prometheus to scrape metrics (https://prometheus.io/)
> - Prometheus node exporter to export "machine metrics" (Disk, network
> usage, etc.) (https://github.com/prometheus/node_exporter)
> - Prometheus JMX exporter to export "Solr metrics" (Cache usage, QPS,
> Response times...) (https://github.com/prometheus/jmx_exporter)
> - Grafana to visualize all the data scraped by Prometheus (
> https://grafana.com/)
> 
> Best regards
> Daniel Ortega
> 
> 2017-11-06 20:13 GMT+01:00 Petersen, Robert (Contr) <
> robert.peters...@ftr.com>:
> 
>> PS I knew sematext would be required to chime in here!  
>> 
>> 
>> Is there a non-expiring dev version I could experiment with? I think I did
>> sign up for a trial years ago from a different company... I was actually
>> wondering about hooking it up to my personal AWS based solr cloud instance.
>> 
>> 
>> Thanks
>> 
>> Robi
>> 
>> 
>> From: Emir Arnautović 
>> Sent: Thursday, November 2, 2017 2:05:10 PM
>> To: solr-user@lucene.apache.org
>> Subject: Re: Anyone have any comments on current solr monitoring favorites?
>> 
>> Hi Robi,
>> Did you try Sematext’s SPM? It provides host, JVM and Solr metrics and
>> more. We use it for monitoring our Solr instances and for consulting.
>> 
>> Disclaimer - see signature :)
>> 
>> Emir
>> --
>> Monitoring - Log Management - Alerting - Anomaly Detection
>> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>> 
>> 
>> 
>>> On 2 Nov 2017, at 19:35, Walter Underwood  wrote:
>>> 
>>> We use New Relic for JVM, CPU, and disk monitoring.
>>> 
>>> I tried the built-in metrics support in 6.4, but it just didn’t do what
>> we want. We want rates and percentiles for each request handler. That gives
>> us 95th percentile for textbooks suggest or for homework search results
>> page, etc. The Solr metrics didn’t do that. The Jetty metrics didn’t do
>> that.
>>> 
>>> We built a dedicated servlet filter that goes in front of the Solr
>> webapp and reports metrics. It has some special hacks to handle some weird
>> behavior in SolrJ. A request to the “/srp” handler is sent as
>> “/select?qt=/srp”, so we normalize that.
>>> 
>>> The metrics start with the cluster name, the hostname, and the
>> collection. The rest is generated like this:
>>> 
>>> URL: GET /solr/textbooks/select?q=foo=/auto
>>> Metric: textbooks.GET./auto
>>> 
>>> URL: GET /solr/textbooks/select?q=foo
>>> Metric: textbooks.GET./select
>>> 
>>> URL: GET /solr/questions/auto
>>> Metric: questions.GET./auto
>>> 
>>> So a full metric for the cluster “solr-cloud” and the host “search01"
>> would look like “solr-cloud.search01.solr.textbooks.GET./auto.m1_rate”.
>>> 
>>> We send all that to InfluxDB. We’ve configured a template so that each
>> part of the metric name is mapped to a field, so we can write efficient
>> queries in InfluxQL.
>>> 
>>> Metrics are graphed in Grafana. We have dashboards that mix Cloudwatch
>> (for the load balancer) and InfluxDB.
>>> 
>>> I’m still working out the kinks in some of the more complicated queries,
>> but the data is all there. I also want to expand the servlet filter to
>> report HTTP response codes.
>>> 
>>> wunder
>>> Walter Underwood
>>> wun...@wunderwood.org
>>> http://observer.wunderwood.org/  (my blog)
>>> 
>>> 
 On Nov 2, 2017, at 9:30 AM, Petersen, Robert (Contr) <
>> robert.peters...@ftr.com> wrote:
 
 OK I'm probably going to open a can of worms here...  lol
 
 
 In the old old days I used PSI probe to monitor solr running on tomcat
>> which worked ok on a machine by machine basis.
 
 
 Later I had a grafana dashboard on top of graphite monitoring which was
>> really nice looking but kind of complicated to set up.
 
 
 Even later I successfully just dropped in a newrelic java agent which
>> had solr monitors and a dashboard right out of the box, but it costs money
>> for the full tamale.
 
 
 For basic JVM health and Solr QPS and time percentiles, does anyone
>> have any favorites or other alternative suggestions?
 
 
 Thanks in advance!
 
 Robi
 
 
 
 This communication is confidential. Frontier only sends 

Re: Anyone have any comments on current solr monitoring favorites?

2017-11-06 Thread Petersen, Robert (Contr)
Interesting! Finally a Grafana user... Thanks Daniel, I will follow your links. 
That looks promising.


Is anyone using Grafana over Graphite?


Thanks

Robi


From: Daniel Ortega 
Sent: Monday, November 6, 2017 11:19:10 AM
To: solr-user@lucene.apache.org
Subject: Re: Anyone have any comments on current solr monitoring favorites?

Hi Robert,

We use the following stack:

- Prometheus to scrape metrics (https://prometheus.io/)
- Prometheus node exporter to export "machine metrics" (Disk, network
usage, etc.) (https://github.com/prometheus/node_exporter)
- Prometheus JMX exporter to export "Solr metrics" (Cache usage, QPS,
Response times...) (https://github.com/prometheus/jmx_exporter)
- Grafana to visualize all the data scraped by Prometheus (
https://grafana.com/)

Best regards
Daniel Ortega

2017-11-06 20:13 GMT+01:00 Petersen, Robert (Contr) <
robert.peters...@ftr.com>:

> PS I knew sematext would be required to chime in here!  
>
>
> Is there a non-expiring dev version I could experiment with? I think I did
> sign up for a trial years ago from a different company... I was actually
> wondering about hooking it up to my personal AWS based solr cloud instance.
>
>
> Thanks
>
> Robi
>
> 
> From: Emir Arnautović 
> Sent: Thursday, November 2, 2017 2:05:10 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Anyone have any comments on current solr monitoring favorites?
>
> Hi Robi,
> Did you try Sematext’s SPM? It provides host, JVM and Solr metrics and
> more. We use it for monitoring our Solr instances and for consulting.
>
> Disclaimer - see signature :)
>
> Emir
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>
>
>
> > On 2 Nov 2017, at 19:35, Walter Underwood  wrote:
> >
> > We use New Relic for JVM, CPU, and disk monitoring.
> >
> > I tried the built-in metrics support in 6.4, but it just didn’t do what
> we want. We want rates and percentiles for each request handler. That gives
> us 95th percentile for textbooks suggest or for homework search results
> page, etc. The Solr metrics didn’t do that. The Jetty metrics didn’t do
> that.
> >
> > We built a dedicated servlet filter that goes in front of the Solr
> webapp and reports metrics. It has some special hacks to handle some weird
> behavior in SolrJ. A request to the “/srp” handler is sent as
> “/select?qt=/srp”, so we normalize that.
> >
> > The metrics start with the cluster name, the hostname, and the
> collection. The rest is generated like this:
> >
> > URL: GET /solr/textbooks/select?q=foo=/auto
> > Metric: textbooks.GET./auto
> >
> > URL: GET /solr/textbooks/select?q=foo
> > Metric: textbooks.GET./select
> >
> > URL: GET /solr/questions/auto
> > Metric: questions.GET./auto
> >
> > So a full metric for the cluster “solr-cloud” and the host “search01"
> would look like “solr-cloud.search01.solr.textbooks.GET./auto.m1_rate”.
> >
> > We send all that to InfluxDB. We’ve configured a template so that each
> part of the metric name is mapped to a field, so we can write efficient
> queries in InfluxQL.
> >
> > Metrics are graphed in Grafana. We have dashboards that mix Cloudwatch
> (for the load balancer) and InfluxDB.
> >
> > I’m still working out the kinks in some of the more complicated queries,
> but the data is all there. I also want to expand the servlet filter to
> report HTTP response codes.
> >
> > wunder
> > Walter Underwood
> > wun...@wunderwood.org
> > http://observer.wunderwood.org/  (my blog)
> >
> >
> >> On Nov 2, 2017, at 9:30 AM, Petersen, Robert (Contr) <
> robert.peters...@ftr.com> wrote:
> >>
> >> OK I'm probably going to open a can of worms here...  lol
> >>
> >>
> >> In the old old days I used PSI probe to monitor solr running on tomcat
> which worked ok on a machine by machine basis.
> >>
> >>
> >> Later I had a grafana dashboard on top of graphite monitoring which was
> really nice looking but kind of complicated to set up.
> >>
> >>
> >> Even later I successfully just dropped in a newrelic java agent which
> had solr monitors and a dashboard right out of the box, but it costs money
> for the full tamale.
> >>
> >>
> >> For basic JVM health and Solr QPS and time percentiles, does anyone
> have any favorites or other alternative suggestions?
> >>
> >>
> >> Thanks in advance!
> >>
> >> Robi
> >>
> >> 
> >>
> >> This communication is confidential. Frontier only sends and receives
> >> email on the basis of the terms set out at http://www.frontier.com/email_disclaimer.
> >
>
>



This communication is confidential. Frontier only sends and receives email on 
the basis of the terms set out at http://www.frontier.com/email_disclaimer.


Re: Anyone have any comments on current solr monitoring favorites?

2017-11-06 Thread Daniel Ortega
Hi Robert,

We use the following stack:

- Prometheus to scrape metrics (https://prometheus.io/)
- Prometheus node exporter to export "machine metrics" (Disk, network
usage, etc.) (https://github.com/prometheus/node_exporter)
- Prometheus JMX exporter to export "Solr metrics" (Cache usage, QPS,
Response times...) (https://github.com/prometheus/jmx_exporter)
- Grafana to visualize all the data scraped by Prometheus (
https://grafana.com/)

Best regards
Daniel Ortega

2017-11-06 20:13 GMT+01:00 Petersen, Robert (Contr) <
robert.peters...@ftr.com>:

> PS I knew sematext would be required to chime in here!  
>
>
> Is there a non-expiring dev version I could experiment with? I think I did
> sign up for a trial years ago from a different company... I was actually
> wondering about hooking it up to my personal AWS based solr cloud instance.
>
>
> Thanks
>
> Robi
>
> 
> From: Emir Arnautović 
> Sent: Thursday, November 2, 2017 2:05:10 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Anyone have any comments on current solr monitoring favorites?
>
> Hi Robi,
> Did you try Sematext’s SPM? It provides host, JVM and Solr metrics and
> more. We use it for monitoring our Solr instances and for consulting.
>
> Disclaimer - see signature :)
>
> Emir
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>
>
>
> > On 2 Nov 2017, at 19:35, Walter Underwood  wrote:
> >
> > We use New Relic for JVM, CPU, and disk monitoring.
> >
> > I tried the built-in metrics support in 6.4, but it just didn’t do what
> we want. We want rates and percentiles for each request handler. That gives
> us 95th percentile for textbooks suggest or for homework search results
> page, etc. The Solr metrics didn’t do that. The Jetty metrics didn’t do
> that.
> >
> > We built a dedicated servlet filter that goes in front of the Solr
> webapp and reports metrics. It has some special hacks to handle some weird
> behavior in SolrJ. A request to the “/srp” handler is sent as
> “/select?qt=/srp”, so we normalize that.
> >
> > The metrics start with the cluster name, the hostname, and the
> collection. The rest is generated like this:
> >
> > URL: GET /solr/textbooks/select?q=foo&qt=/auto
> > Metric: textbooks.GET./auto
> >
> > URL: GET /solr/textbooks/select?q=foo
> > Metric: textbooks.GET./select
> >
> > URL: GET /solr/questions/auto
> > Metric: questions.GET./auto
> >
> > So a full metric for the cluster “solr-cloud” and the host “search01"
> would look like “solr-cloud.search01.solr.textbooks.GET./auto.m1_rate”.
> >
> > We send all that to InfluxDB. We’ve configured a template so that each
> part of the metric name is mapped to a field, so we can write efficient
> queries in InfluxQL.
> >
> > Metrics are graphed in Grafana. We have dashboards that mix Cloudwatch
> (for the load balancer) and InfluxDB.
> >
> > I’m still working out the kinks in some of the more complicated queries,
> but the data is all there. I also want to expand the servlet filter to
> report HTTP response codes.
> >
> > wunder
> > Walter Underwood
> > wun...@wunderwood.org
> > http://observer.wunderwood.org/  (my blog)
> >
> >
> >> On Nov 2, 2017, at 9:30 AM, Petersen, Robert (Contr) <
> robert.peters...@ftr.com> wrote:
> >>
> >> OK I'm probably going to open a can of worms here...  lol
> >>
> >>
> >> In the old old days I used PSI probe to monitor solr running on tomcat
> which worked ok on a machine by machine basis.
> >>
> >>
> >> Later I had a grafana dashboard on top of graphite monitoring which was
> really nice looking but kind of complicated to set up.
> >>
> >>
> >> Even later I successfully just dropped in a newrelic java agent which
> had solr monitors and a dashboard right out of the box, but it costs money
> for the full tamale.
> >>
> >>
> >> For basic JVM health and Solr QPS and time percentiles, does anyone
> have any favorites or other alternative suggestions?
> >>
> >>
> >> Thanks in advance!
> >>
> >> Robi
> >>
> >> 
> >>
> >> This communication is confidential. Frontier only sends and receives
> >> email on the basis of the terms set out at http://www.frontier.com/email_disclaimer.
> >
>
>
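
For anyone wiring up the stack Daniel describes above: the JMX exporter normally runs as a Java agent inside the Solr JVM. A minimal sketch, assuming JMX is switched on in solr.in.sh and treating the jar path, port, and config file name as placeholders:

ENABLE_REMOTE_JMX_OPTS="true"
SOLR_OPTS="$SOLR_OPTS -javaagent:/opt/jmx_prometheus_javaagent.jar=9404:/opt/solr-jmx.yaml"

and the matching Prometheus scrape config (host name assumed):

scrape_configs:
  - job_name: 'solr'
    static_configs:
      - targets: ['solr-host:9404']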


Re: Anyone have any comments on current solr monitoring favorites?

2017-11-06 Thread Petersen, Robert (Contr)
PS I knew sematext would be required to chime in here!  


Is there a non-expiring dev version I could experiment with? I think I did sign 
up for a trial years ago from a different company... I was actually wondering 
about hooking it up to my personal AWS based solr cloud instance.


Thanks

Robi


From: Emir Arnautović 
Sent: Thursday, November 2, 2017 2:05:10 PM
To: solr-user@lucene.apache.org
Subject: Re: Anyone have any comments on current solr monitoring favorites?

Hi Robi,
Did you try Sematext’s SPM? It provides host, JVM and Solr metrics and more. We 
use it for monitoring our Solr instances and for consulting.

Disclaimer - see signature :)

Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 2 Nov 2017, at 19:35, Walter Underwood  wrote:
>
> We use New Relic for JVM, CPU, and disk monitoring.
>
> I tried the built-in metrics support in 6.4, but it just didn’t do what we 
> want. We want rates and percentiles for each request handler. That gives us 
> 95th percentile for textbooks suggest or for homework search results page, 
> etc. The Solr metrics didn’t do that. The Jetty metrics didn’t do that.
>
> We built a dedicated servlet filter that goes in front of the Solr webapp and 
> reports metrics. It has some special hacks to handle some weird behavior in 
> SolrJ. A request to the “/srp” handler is sent as “/select?qt=/srp”, so we 
> normalize that.
>
> The metrics start with the cluster name, the hostname, and the collection. 
> The rest is generated like this:
>
> URL: GET /solr/textbooks/select?q=foo&qt=/auto
> Metric: textbooks.GET./auto
>
> URL: GET /solr/textbooks/select?q=foo
> Metric: textbooks.GET./select
>
> URL: GET /solr/questions/auto
> Metric: questions.GET./auto
>
> So a full metric for the cluster “solr-cloud” and the host “search01" would 
> look like “solr-cloud.search01.solr.textbooks.GET./auto.m1_rate”.
>
> We send all that to InfluxDB. We’ve configured a template so that each part 
> of the metric name is mapped to a field, so we can write efficient queries in 
> InfluxQL.
>
> Metrics are graphed in Grafana. We have dashboards that mix Cloudwatch (for 
> the load balancer) and InfluxDB.
>
> I’m still working out the kinks in some of the more complicated queries, but 
> the data is all there. I also want to expand the servlet filter to report 
> HTTP response codes.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
>
>> On Nov 2, 2017, at 9:30 AM, Petersen, Robert (Contr) 
>>  wrote:
>>
>> OK I'm probably going to open a can of worms here...  lol
>>
>>
>> In the old old days I used PSI probe to monitor solr running on tomcat which 
>> worked ok on a machine by machine basis.
>>
>>
>> Later I had a grafana dashboard on top of graphite monitoring which was 
>> really nice looking but kind of complicated to set up.
>>
>>
>> Even later I successfully just dropped in a newrelic java agent which had 
>> solr monitors and a dashboard right out of the box, but it costs money for 
>> the full tamale.
>>
>>
>> For basic JVM health and Solr QPS and time percentiles, does anyone have any 
>> favorites or other alternative suggestions?
>>
>>
>> Thanks in advance!
>>
>> Robi
>>
>> 
>>
>> This communication is confidential. Frontier only sends and receives email 
>> on the basis of the terms set out at 
>> http://www.frontier.com/email_disclaimer.
>
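
To make Walter's servlet-filter idea above concrete, here is a stripped-down sketch. The class name and the reporting hook are hypothetical, and the real filter also handles SolrJ quirks and HTTP response codes; treat this as an illustration of the metric-naming scheme only:

import java.io.IOException;
import javax.servlet.*;
import javax.servlet.http.HttpServletRequest;

public class SolrMetricsFilter implements Filter {
    @Override public void init(FilterConfig config) {}
    @Override public void destroy() {}

    @Override
    public void doFilter(ServletRequest req, ServletResponse resp, FilterChain chain)
            throws IOException, ServletException {
        HttpServletRequest http = (HttpServletRequest) req;
        long start = System.nanoTime();
        try {
            chain.doFilter(req, resp);
        } finally {
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            // Request URI looks like /solr/<collection>/<handler>
            String[] parts = http.getRequestURI().split("/");
            String collection = parts.length > 2 ? parts[2] : "unknown";
            String handler = parts.length > 3 ? "/" + parts[3] : "/select";
            // SolrJ sends "/srp" as "/select?qt=/srp", so normalize that back
            String qt = http.getParameter("qt");
            if ("/select".equals(handler) && qt != null) {
                handler = qt;
            }
            // e.g. "textbooks.GET./auto"; prepend cluster and host, then ship to InfluxDB
            report(collection + "." + http.getMethod() + "." + handler, elapsedMs);
        }
    }

    private void report(String metric, long elapsedMs) { /* hypothetical metrics client */ }
}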



String payloads...

2017-11-06 Thread Petersen, Robert (Contr)
Hi Guys,


I was playing with the payloads example, as I had a possible use case of
alternate product titles for a product.

https://lucidworks.com/2017/09/14/solr-payloads/

bin/solr start
bin/solr create -c payloads
bin/post -c payloads -type text/csv -out yes -d $'id,vals_dpf\n1,one|1.0 two|2.0 three|3.0\n2,weig...

I saw you could do this:

http://localhost:8983/solr/payloads/query?q=*:*&wt=csv&fl=id,p:payload(vals_dpf,three)
id,p
1,3.0
2,0.0

So I wanted to do something similar with strings, and so I loaded solr with


./post -c payloads -type text/csv -out yes -d $'id,vals_dps\n1,one|thisisastring two|"this is a string" three|hi\n2,json|{asdf:123}'


http://localhost:8983/solr/payloads/query?q=vals_dps:json


[{"id":"2","vals_dps":"json|{asdf:123}","_version_":1583284597287813000}]


OK so here is my question, it seems like the payload function only works 
against numeric payloads. Further I can't see a way to get the payload to come 
out alone without the field value attached. What I would like is something like 
this, is this possible in any way? I know it would be easy enough to do some 
post query processing in a service layer but... just wondering about this. It 
seems like I should be able to get at the payload when it is a string.


http://localhost:8983/solr/payloads/query?q=vals_dps:json&fl=id,p:payloadvalue(vals_dps,json)


[{"id":"2","p":"{asdf:123}","_version_":1583284597287813000}]

Thanks

Robi




This communication is confidential. Frontier only sends and receives email on 
the basis of the terms set out at http://www.frontier.com/email_disclaimer.
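
For later readers of this thread: in the stock Solr 6.6+/7.x schema the *_dps dynamic field maps to a delimited_payloads_string type whose payloads pass through the identity encoder, i.e. the bytes are stored verbatim. Roughly (check your own managed-schema):

<fieldType name="delimited_payloads_string" stored="false" indexed="true" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.DelimitedPayloadTokenFilterFactory" encoder="identity"/>
  </analyzer>
</fieldType>

The payload() function only decodes float/integer payloads, though, so string payloads indexed this way can be matched but not returned through it; getting the bytes back out takes custom code (for example a custom ValueSource that reads PostingsEnum.getPayload()).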


Re: Faceting Word Count

2017-11-06 Thread Jokin C
He said that he's using it to get a word cloud. If it's not related to the
search and it's a generic word cloud of the whole index, using the Luke request
handler to get the top 250 or 500 words could work:

http://localhost:8983/solr/core/admin/luke?fl=text&numTerms=500&wt=json
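
The top terms come back grouped per field; the JSON shape is roughly as follows (structure from memory, field name and counts invented for illustration, so verify against your Solr version):

{
  "fields": {
    "text": {
      "topTerms": ["solr", 143120, "search", 98210, "..."]
    }
  }
}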


On Mon, Nov 6, 2017 at 4:42 PM, Erick Erickson 
wrote:

> _Why_ do you want to get the word counts? Faceting on all of the
> tokens for 100M docs isn't something Solr is ordinarily used for. As
> Emir says it'll take a huge amount of memory. You can use one of the
> function queries (termfreq IIRC) that will give you the count of any
> individual term you have and will be very fast.
>
> But getting all of the word counts in the index is probably not
> something I'd use Solr for.
>
> This may be an XY problem, you're asking how to do something specific
> (X) without explaining what the problem you're trying to solve is (Y).
> Perhaps there's another way to accomplish (Y) if we knew more about
> what it is.
>
> Best,
> Erick
>
>
>
> On Mon, Nov 6, 2017 at 4:15 AM, Emir Arnautović
>  wrote:
> > Hi Wael,
> > You are faceting on analyzed field. This results in field being
> uninverted - fieldValueCache being built - on first call after every
> commit. This is both time and memory consuming (you can check in admin
> console in stats how much memory it took).
> > What you need to do is to create multivalue string field (not text) and
> parse values (do analysis steps) on client side and store it like that.
> This will allow you to enable docValues on that field and avoid building
> fieldValueCache.
> >
> > HTH,
> > Emir
> > --
> > Monitoring - Log Management - Alerting - Anomaly Detection
> > Solr & Elasticsearch Consulting Support Training - http://sematext.com/
> >
> >
> >
> >> On 6 Nov 2017, at 13:06, Wael Kader  wrote:
> >>
> >> Hi,
> >>
> >> I am using a custom field. Below is the field definition.
> >> I am using this because I don't want stemming.
> >>
> >>
> >> <fieldType name="..." class="solr.TextField" positionIncrementGap="100">
> >>   <analyzer type="index">
> >>     <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
> >>     <tokenizer class="..."/>
> >>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
> >>     <filter class="solr.WordDelimiterFilterFactory" protected="protwords.txt" generateWordParts="0" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" preserveOriginal="1"/>
> >>   </analyzer>
> >>   <analyzer type="query">
> >>     <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
> >>     <tokenizer class="..."/>
> >>     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
> >>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
> >>     <filter class="solr.WordDelimiterFilterFactory" protected="protwords.txt" generateWordParts="0" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1" preserveOriginal="1"/>
> >>   </analyzer>
> >> </fieldType>
> >>
> >>
> >>
> >> Regards,
> >> Wael
> >>
> >> On Mon, Nov 6, 2017 at 10:29 AM, Emir Arnautović <
> >> emir.arnauto...@sematext.com> wrote:
> >>
> >>> Hi Wael,
> >>> Can you provide your field definition and sample query.
> >>>
> >>> Thanks,
> >>> Emir
> >>> --
> >>> Monitoring - Log Management - Alerting - Anomaly Detection
> >>> Solr & Elasticsearch Consulting Support Training -
> http://sematext.com/
> >>>
> >>>
> >>>
>  On 6 Nov 2017, at 08:30, Wael Kader  wrote:
> 
>  Hello,
> 
>  I am having an index with around 100 Million documents.
>  I have a multivalued column that I am saving big chunks of text data
> in.
> >>> It
>  has around 20 GB of RAM and 4 CPU's.
> 
>  I was doing faceting on it to get word cloud but it was taking around
> 1
>  second to retrieve when the data was 5-10 Million .
>  Now I have more data and its taking minutes to get the results (that
> is
> >>> if
>  it gets it and SOLR doesn't crash). Whats the best way to make it run
> or
>  maybe its not scalable to make it run on my current schema and design
> >>> with
>  News articles.
> 
>  I am looking to find the best solution for this. Maybe create another
> >>> index
>  to split the data while inserting it or maybe if I change some
> settings
> >>> in
>  SolrConfig or add some RAM, it would perform better.
> 
>  --
>  Regards,
>  Wael
> >>>
> >>>
> >>
> >>
> >> --
> >> Regards,
> >> Wael
> >
>


Re: recent utf8 problems

2017-11-06 Thread Dr. Mario Michael Krell
Hi Rick, Hi Solr Experts,

Thank you for this reply!

My solr database is supposed to be(come) open source. Hence, I am willing to 
share any information. Since I am new to solr, I just did not know what to 
share. But in the meantime, I put some of the information online.

My current configuration can be found at
https://github.com/multimedia-berkeley/mmc_search_solr/blob/master/mmc_search3/conf/solrconfig.xml
and I also put it at the end of this mail. The rest of my search engine's
configuration can also be found at this repository. Apart from updating the
luceneMatchVersion from time to time, I didn’t make any major changes. I once
tried to switch to the current solrconfig template but could not get it running;
I can’t remember why anymore. I also tested a similar configuration with Lucene
and Solr version 6.5, where I knew that umlauts as in T%C3%BCbingen and Chinese
characters had originally worked, but even there I got the previously described
weird behavior.

At https://multimediacommons.wordpress.com/getting-started/#solr, I describe how
I set up my search engine, and theoretically it should be possible to download
it. However, it is quite big. The search engine is used as the backend for a
webpage that I am working on (http://search.mmcommons.org/). If required, I
could open the firewall port of the public solr server for a short time, but
this would have to be arranged through personal communication.

Best,

Mario





[The solrconfig.xml pasted here was mangled by the mail archive; only scattered
values survive: a luceneMatchVersion of 7.0.1, the defaults "search" and "*:*",
an updateLog dir of ${solr.ulog.dir:}, and the values 600 and false (apparently
an autoCommit setting). The full file is at the GitHub link above.]

> On Nov 6, 2017, at 04:35, Rick Leir  wrote:
> 
> Dr. Krell
> You could look at your /select query handler, and compare it with the /query 
> query handler in the Admin config. 
> 
> Did you upgrade from a previous version of Solr? Or change your config ( no, 
> you must have thought of that). If it is a bug related to the Java upgrade 
> then you need to show your config before folks can help.
> Cheers -- Rick
> 
> 
> On November 4, 2017 5:11:36 PM EDT, "Dr. Mario Michael Krell" 
>  wrote:
>> Hi,
>> 
>> We recently discovered issues with Solr converting utf8 code in
>> the search. One or two months ago everything was still working.
>> 
>> - What might have caused it is a Java update (Java 8 Update 151). 
>> - We are using firefox as well as chrome for displaying results.
>> - We tested it with Solr 6.5, Solr 7.0.0, 7.0.1, and 7.1.
>> 
>> We created a search engine based on the yfcc100m and in the normal
>> browser (http://localhost:8983/solr/#/mmc_search3/query), we can search for
>> “title:T%C3%BCbingen” in the query field and get more than 3 million
>> results:
>> 
>> {
>> "responseHeader":{
>>   "status":0,
>>   "QTime":103},
>> "response":{"numFound":3092484,"start":0,"docs":[
>> {
>>   "photoid":"6182384834",
>> 
>> However, when we use the respective web-address,
>> http://localhost:8983/solr/mmc_search3/select?q=title:T%C3%BCbingen&wt=json
>> 
>> The results are reduced to zero:
>> {
>> "responseHeader":{
>>   "status":0,
>>   "QTime":0},
>> "response":{"numFound":0,"start":0,"docs":[]
>> }}
>> 
>> responseHeader
>>   status    0
>>   QTime     0
>> response
>>   numFound  0
>>   start     0
>>   docs      []
>> 
>> I would be happy for any suggestions on how to fix this problem. For me
>> it seems like a bug in solr caused by Java.
>> 
>> Best,
>> 
>> Mario
> 
> -- 
> Sorry for being brief. Alternate email is rickleir at yahoo dot com



Re: Faceting Word Count

2017-11-06 Thread Erick Erickson
_Why_ do you want to get the word counts? Faceting on all of the
tokens for 100M docs isn't something Solr is ordinarily used for. As
Emir says it'll take a huge amount of memory. You can use one of the
function queries (termfreq IIRC) that will give you the count of any
individual term you have and will be very fast.

But getting all of the word counts in the index is probably not
something I'd use Solr for.

This may be an XY problem, you're asking how to do something specific
(X) without explaining what the problem you're trying to solve is (Y).
Perhaps there's another way to accomplish (Y) if we knew more about
what it is.

Best,
Erick
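
In practice, Erick's termfreq suggestion is a pseudo-field in the fl list, along these lines (the field name and term are placeholders):

http://localhost:8983/solr/core/select?q=*:*&fl=id,c:termfreq(text,'solr')

There is also ttf(text,'solr') for the total count of a single term across the whole index.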



On Mon, Nov 6, 2017 at 4:15 AM, Emir Arnautović
 wrote:
> Hi Wael,
> You are faceting on analyzed field. This results in field being uninverted - 
> fieldValueCache being built - on first call after every commit. This is both 
> time and memory consuming (you can check in admin console in stats how much 
> memory it took).
> What you need to do is to create multivalue string field (not text) and parse 
> values (do analysis steps) on client side and store it like that. This will 
> allow you to enable docValues on that field and avoid building 
> fieldValueCache.
>
> HTH,
> Emir
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>
>
>
>> On 6 Nov 2017, at 13:06, Wael Kader  wrote:
>>
>> Hi,
>>
>> I am using a custom field. Below is the field definition.
>> I am using this because I don't want stemming.
>>
>>
>> <fieldType name="..." class="solr.TextField" positionIncrementGap="100">
>>   <analyzer type="index">
>>     <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
>>     <tokenizer class="..."/>
>>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
>>     <filter class="solr.WordDelimiterFilterFactory" protected="protwords.txt" generateWordParts="0" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" preserveOriginal="1"/>
>>   </analyzer>
>>   <analyzer type="query">
>>     <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
>>     <tokenizer class="..."/>
>>     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
>>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
>>     <filter class="solr.WordDelimiterFilterFactory" protected="protwords.txt" generateWordParts="0" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1" preserveOriginal="1"/>
>>   </analyzer>
>> </fieldType>
>>
>>
>>
>> Regards,
>> Wael
>>
>> On Mon, Nov 6, 2017 at 10:29 AM, Emir Arnautović <
>> emir.arnauto...@sematext.com> wrote:
>>
>>> Hi Wael,
>>> Can you provide your field definition and sample query.
>>>
>>> Thanks,
>>> Emir
>>> --
>>> Monitoring - Log Management - Alerting - Anomaly Detection
>>> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>>>
>>>
>>>
 On 6 Nov 2017, at 08:30, Wael Kader  wrote:

 Hello,

 I am having an index with around 100 Million documents.
 I have a multivalued column that I am saving big chunks of text data in.
>>> It
 has around 20 GB of RAM and 4 CPU's.

 I was doing faceting on it to get word cloud but it was taking around 1
 second to retrieve when the data was 5-10 Million .
 Now I have more data and its taking minutes to get the results (that is
>>> if
 it gets it and SOLR doesn't crash). Whats the best way to make it run or
 maybe its not scalable to make it run on my current schema and design
>>> with
 News articles.

 I am looking to find the best solution for this. Maybe create another
>>> index
 to split the data while inserting it or maybe if I change some settings
>>> in
 SolrConfig or add some RAM, it would perform better.

 --
 Regards,
 Wael
>>>
>>>
>>
>>
>> --
>> Regards,
>> Wael
>


Re: recent utf8 problems

2017-11-06 Thread Rick Leir
Dr. Krell
You could look at your /select query handler, and compare it with the /query 
query handler in the Admin config. 

Did you upgrade from a previous version of Solr? Or change your config ( no, 
you must have thought of that). If it is a bug related to the Java upgrade then 
you need to show your config before folks can help.
Cheers -- Rick


On November 4, 2017 5:11:36 PM EDT, "Dr. Mario Michael Krell" 
 wrote:
>Hi,
>
>We recently discovered issues with Solr converting utf8 code in
>the search. One or two months ago everything was still working.
>
>- What might have caused it is a Java update (Java 8 Update 151). 
>- We are using firefox as well as chrome for displaying results.
>- We tested it with Solr 6.5, Solr 7.0.0, 7.0.1, and 7.1.
>
>We created a search engine based on the yfcc100m and in the normal
>browser (http://localhost:8983/solr/#/mmc_search3/query), we can search for
>“title:T%C3%BCbingen” in the query field and get more than 3 million
>results:
>
>{
>  "responseHeader":{
>"status":0,
>"QTime":103},
>  "response":{"numFound":3092484,"start":0,"docs":[
>  {
>"photoid":"6182384834",
>
>However, when we use the respective web-address,
>http://localhost:8983/solr/mmc_search3/select?q=title:T%C3%BCbingen&wt=json
>
>The results are reduced to zero:
>{
>  "responseHeader":{
>"status":0,
>"QTime":0},
>  "response":{"numFound":0,"start":0,"docs":[]
>  }}
>
>responseHeader
>  status    0
>  QTime     0
>response
>  numFound  0
>  start     0
>  docs      []
>
>I would be happy for any suggestions on how to fix this problem. For me
>it seems like a bug in solr caused by Java.
>
>Best,
>
>Mario

-- 
Sorry for being brief. Alternate email is rickleir at yahoo dot com 

Re: Fwd: configuring Solr with Tesseract

2017-11-06 Thread Admin eLawJournal
Thanks Rick, minutes of CPU is definitely going to break my site. I'm
looking for someone to hire as I have no coding knowledge. Please let me
know if you are up for it.

On Mon, Nov 6, 2017 at 8:05 PM, Rick Leir  wrote:

> Anand,
> As Charlie says you should have a separate process for this. Also, if you
> go back about ten months in this mailing list you will see some discussion
> about how OCR can take minutes of CPU per page, and needs some
> preprocessing with Imagemagick or Graphicsmagick. You will want to do some
> fine tuning with this, then save your OCR output in a DB or the filesystem.
> Then you will want to be able to re-index Solr easily as you fine tune Solr.
>
> Yes, use Python or your preferred Scripting language.
> Cheers -- Rick
>
> On November 6, 2017 4:05:42 AM EST, Charlie Hull 
> wrote:
> >On 03/11/2017 15:32, Admin eLawJournal wrote:
> >> Hi,
> >> I have read that we can use tesseract with solr to index image files.
> >I
> >> would like some guidance on setting this up.
> >>
> >> Currently, I am using solr for searching my wordpress installation
> >via the
> >> WPSOLR plugin.
> >>
> >> I have Solr 6.6 installed on ubuntu 14.04 which is working fine with
> >> wordpress.
> >>
> >> I have also installed tesseract but have no clue on configuring it.
> >>
> >>
> >> I am new to solr so will greatly appreciate a detailed step by step
> >> instruction.
> >
> >Hi,
> >
> >I'm guessing if you're using a preconfigured Solr plugin for WP you
> >probably haven't got your hands properly dirty with Solr yet.
> >
> >One way to use Tesseract would be via Apache Tika
> >https://wiki.apache.org/tika/TikaOCR which is an awesome library for
> >extracting plain text from many different document formats and types.
> >There's a direct way to use Tesseract from within Solr (the
> >ExtractingRequestHandler
> >https://lucene.apache.org/solr/guide/6_6/uploading-data-
> with-solr-cell-using-apache-tika.html#uploading-data-with-
> solr-cell-using-apache-tika)
> >
> >but we don't generally recommend this, as dodgy files can sometimes eat
> >
> >all your resources during parsing and if Tika dies then so does Solr.
> >We
> >usually process the files externally and the feed them to Solr using
> >its
> >HTTP API.
> >
> >Here's one way to do it - a simple server wrapper around Tika
> >https://github.com/mattflax/dropwizard-tika-server written by my
> >colleague Matt Pearce.
> >
> >So you're going to need to do some coding I think - Python would be a
> >good choice - to feed your source files to Tika for OCR and extraction,
> >
> >and then the resulting text to Solr for indexing.
> >
> >Cheers
> >
> >Charlie
> >
> >>
> >> Thank you very much
> >>
> >
> >
> >--
> >Charlie Hull
> >Flax - Open Source Enterprise Search
> >
> >tel/fax: +44 (0)8700 118334
> >mobile:  +44 (0)7767 825828
> >web: www.flax.co.uk
>
> --
> Sorry for being brief. Alternate email is rickleir at yahoo dot com


Re: Faceting Word Count

2017-11-06 Thread Emir Arnautović
Hi Wael,
You are faceting on analyzed field. This results in field being uninverted - 
fieldValueCache being built - on first call after every commit. This is both 
time and memory consuming (you can check in admin console in stats how much 
memory it took). 
What you need to do is to create multivalue string field (not text) and parse 
values (do analysis steps) on client side and store it like that. This will 
allow you to enable docValues on that field and avoid building fieldValueCache.

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/
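
A sketch of what Emir describes, using a hypothetical word_ss field: do the analysis on the client, index the resulting words into a docValues-backed string field, and facet on that, e.g.:

<field name="word_ss" type="string" indexed="true" stored="false" multiValued="true" docValues="true"/>

...&facet=true&facet.field=word_ss&facet.limit=200&facet.mincount=10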



> On 6 Nov 2017, at 13:06, Wael Kader  wrote:
> 
> Hi,
> 
> I am using a custom field. Below is the field definition.
> I am using this because I don't want stemming.
> 
> 
> <fieldType name="..." class="solr.TextField" positionIncrementGap="100">
>   <analyzer type="index">
>     <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
>     <tokenizer class="..."/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
>     <filter class="solr.WordDelimiterFilterFactory" protected="protwords.txt" generateWordParts="0" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" preserveOriginal="1"/>
>   </analyzer>
>   <analyzer type="query">
>     <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
>     <tokenizer class="..."/>
>     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
>     <filter class="solr.WordDelimiterFilterFactory" protected="protwords.txt" generateWordParts="0" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1" preserveOriginal="1"/>
>   </analyzer>
> </fieldType>
>
> 
> 
> Regards,
> Wael
> 
> On Mon, Nov 6, 2017 at 10:29 AM, Emir Arnautović <
> emir.arnauto...@sematext.com> wrote:
> 
>> Hi Wael,
>> Can you provide your field definition and sample query.
>> 
>> Thanks,
>> Emir
>> --
>> Monitoring - Log Management - Alerting - Anomaly Detection
>> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>> 
>> 
>> 
>>> On 6 Nov 2017, at 08:30, Wael Kader  wrote:
>>> 
>>> Hello,
>>> 
>>> I am having an index with around 100 Million documents.
>>> I have a multivalued column that I am saving big chunks of text data in.
>> It
>>> has around 20 GB of RAM and 4 CPU's.
>>> 
>>> I was doing faceting on it to get word cloud but it was taking around 1
>>> second to retrieve when the data was 5-10 Million .
>>> Now I have more data and its taking minutes to get the results (that is
>> if
>>> it gets it and SOLR doesn't crash). Whats the best way to make it run or
>>> maybe its not scalable to make it run on my current schema and design
>> with
>>> News articles.
>>> 
>>> I am looking to find the best solution for this. Maybe create another
>> index
>>> to split the data while inserting it or maybe if I change some settings
>> in
>>> SolrConfig or add some RAM, it would perform better.
>>> 
>>> --
>>> Regards,
>>> Wael
>> 
>> 
> 
> 
> -- 
> Regards,
> Wael



Re: Faceting Word Count

2017-11-06 Thread Wael Kader
Hi,

I am using a custom field. Below is the field definition.
I am using this because I don't want stemming.

<fieldType name="..." class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
    <tokenizer class="..."/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory" protected="protwords.txt" generateWordParts="0" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" preserveOriginal="1"/>
  </analyzer>
  <analyzer type="query">
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
    <tokenizer class="..."/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory" protected="protwords.txt" generateWordParts="0" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1" preserveOriginal="1"/>
  </analyzer>
</fieldType>



Regards,
Wael

On Mon, Nov 6, 2017 at 10:29 AM, Emir Arnautović <
emir.arnauto...@sematext.com> wrote:

> Hi Wael,
> Can you provide your field definition and sample query.
>
> Thanks,
> Emir
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>
>
>
> > On 6 Nov 2017, at 08:30, Wael Kader  wrote:
> >
> > Hello,
> >
> > I am having an index with around 100 Million documents.
> > I have a multivalued column that I am saving big chunks of text data in.
> It
> > has around 20 GB of RAM and 4 CPU's.
> >
> > I was doing faceting on it to get word cloud but it was taking around 1
> > second to retrieve when the data was 5-10 Million .
> > Now I have more data and its taking minutes to get the results (that is
> if
> > it gets it and SOLR doesn't crash). Whats the best way to make it run or
> > maybe its not scalable to make it run on my current schema and design
> with
> > News articles.
> >
> > I am looking to find the best solution for this. Maybe create another
> index
> > to split the data while inserting it or maybe if I change some settings
> in
> > SolrConfig or add some RAM, it would perform better.
> >
> > --
> > Regards,
> > Wael
>
>


-- 
Regards,
Wael


Re: Fwd: configuring Solr with Tesseract

2017-11-06 Thread Rick Leir
Anand,
As Charlie says you should have a separate process for this. Also, if you go 
back about ten months in this mailing list you will see some discussion about 
how OCR can take minutes of CPU per page, and needs some preprocessing with 
Imagemagick or Graphicsmagick. You will want to do some fine tuning with this, 
then save your OCR output in a DB or the filesystem. Then you will want to be 
able to re-index Solr easily as you fine tune Solr. 

Yes, use Python or your preferred Scripting language.
Cheers -- Rick
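
A typical preprocessing step of the kind Rick mentions, using ImageMagick (the flags are illustrative only; tune them against your own scans):

convert scan.jpg -density 300 -colorspace Gray -normalize -sharpen 0x1 prepped.tif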

On November 6, 2017 4:05:42 AM EST, Charlie Hull  wrote:
>On 03/11/2017 15:32, Admin eLawJournal wrote:
>> Hi,
>> I have read that we can use tesseract with solr to index image files.
>I
>> would like some guidance on setting this up.
>> 
>> Currently, I am using solr for searching my wordpress installation
>via the
>> WPSOLR plugin.
>> 
>> I have Solr 6.6 installed on ubuntu 14.04 which is working fine with
>> wordpress.
>> 
>> I have also installed tesseract but have no clue on configuring it.
>> 
>> 
>> I am new to solr so will greatly appreciate a detailed step by step
>> instruction.
>
>Hi,
>
>I'm guessing if you're using a preconfigured Solr plugin for WP you 
>probably haven't got your hands properly dirty with Solr yet.
>
>One way to use Tesseract would be via Apache Tika 
>https://wiki.apache.org/tika/TikaOCR which is an awesome library for 
>extracting plain text from many different document formats and types. 
>There's a direct way to use Tesseract from within Solr (the 
>ExtractingRequestHandler 
>https://lucene.apache.org/solr/guide/6_6/uploading-data-with-solr-cell-using-apache-tika.html#uploading-data-with-solr-cell-using-apache-tika)
>
>but we don't generally recommend this, as dodgy files can sometimes eat
>
>all your resources during parsing and if Tika dies then so does Solr.
>We 
>usually process the files externally and then feed them to Solr using
>its 
>HTTP API.
>
>Here's one way to do it - a simple server wrapper around Tika 
>https://github.com/mattflax/dropwizard-tika-server written by my 
>colleague Matt Pearce.
>
>So you're going to need to do some coding I think - Python would be a 
>good choice - to feed your source files to Tika for OCR and extraction,
>
>and then the resulting text to Solr for indexing.
>
>Cheers
>
>Charlie
>
>> 
>> Thank you very much
>> 
>
>
>-- 
>Charlie Hull
>Flax - Open Source Enterprise Search
>
>tel/fax: +44 (0)8700 118334
>mobile:  +44 (0)7767 825828
>web: www.flax.co.uk

-- 
Sorry for being brief. Alternate email is rickleir at yahoo dot com 

Re: Fwd: configuring Solr with Tesseract

2017-11-06 Thread Admin eLawJournal
Hi Charlie,

Thanks for the reply. You're right. I haven't got my hands dirty with solr
yet. I am not from an IT background and learnt everything I know through
lots of reading online. However, all the documentation on solr assumes that
the reader has advanced IT knowledge. In fact, it took me a week to learn
to install and configure solr index to work with WordPress.

Getting solr to ocr appears to be beyond me. And I can't code.

*Would you consider setting this up for me for a fee?*

And also with a step by step guide for dummies in case I intend to upgrade
in the future.

I also noticed that Tika 1.14 is capable of ocr by itself. I would be okay
with a setup of solr using Tika 1.14 to ocr the PDF if that is possible.

Best regards,
Anand


On Nov 6, 2017 5:05 PM, "Charlie Hull"  wrote:

On 03/11/2017 15:32, Admin eLawJournal wrote:

> Hi,
> I have read that we can use tesseract with solr to index image files. I
> would like some guidance on setting this up.
>
> Currently, I am using solr for searching my wordpress installation via the
> WPSOLR plugin.
>
> I have Solr 6.6 installed on ubuntu 14.04 which is working fine with
> wordpress.
>
> I have also installed tesseract but have no clue on configuring it.
>
>
> I am new to solr so will greatly appreciate a detailed step by step
> instruction.
>

Hi,

I'm guessing if you're using a preconfigured Solr plugin for WP you
probably haven't got your hands properly dirty with Solr yet.

One way to use Tesseract would be via Apache Tika
https://wiki.apache.org/tika/TikaOCR which is an awesome library for
extracting plain text from many different document formats and types.
There's a direct way to use Tesseract from within Solr (the
ExtractingRequestHandler
https://lucene.apache.org/solr/guide/6_6/uploading-data-with-solr-cell-using-apache-tika.html#uploading-data-with-solr-cell-using-apache-tika)
but we don't
generally recommend this, as dodgy files can sometimes eat all your
resources during parsing and if Tika dies then so does Solr. We usually
process the files externally and then feed them to Solr using its HTTP API.

Here's one way to do it - a simple server wrapper around Tika
https://github.com/mattflax/dropwizard-tika-server written by my colleague
Matt Pearce.

So you're going to need to do some coding I think - Python would be a good
choice - to feed your source files to Tika for OCR and extraction, and then
the resulting text to Solr for indexing.

Cheers

Charlie


> Thank you very much
>
>

-- 
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk


Re: Faceting Word Count

2017-11-06 Thread Emir Arnautović
Hi Wael,
Can you provide your field definition and sample query.

Thanks,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 6 Nov 2017, at 08:30, Wael Kader  wrote:
> 
> Hello,
> 
> I am having an index with around 100 Million documents.
> I have a multivalued column that I am saving big chunks of text data in. It
> has around 20 GB of RAM and 4 CPU's.
> 
> I was doing faceting on it to get word cloud but it was taking around 1
> second to retrieve when the data was 5-10 Million .
> Now I have more data and its taking minutes to get the results (that is if
> it gets it and SOLR doesn't crash). Whats the best way to make it run or
> maybe its not scalable to make it run on my current schema and design with
> News articles.
> 
> I am looking to find the best solution for this. Maybe create another index
> to split the data while inserting it or maybe if I change some settings in
> SolrConfig or add some RAM, it would perform better.
> 
> -- 
> Regards,
> Wael



Re: Fwd: configuring Solr with Tesseract

2017-11-06 Thread Charlie Hull

On 03/11/2017 15:32, Admin eLawJournal wrote:

Hi,
I have read that we can use tesseract with solr to index image files. I
would like some guidance on setting this up.

Currently, I am using solr for searching my wordpress installation via the
WPSOLR plugin.

I have Solr 6.6 installed on ubuntu 14.04 which is working fine with
wordpress.

I have also installed tesseract but have no clue on configuring it.


I am new to solr so will greatly appreciate a detailed step by step
instruction.


Hi,

I'm guessing if you're using a preconfigured Solr plugin for WP you 
probably haven't got your hands properly dirty with Solr yet.


One way to use Tesseract would be via Apache Tika 
https://wiki.apache.org/tika/TikaOCR which is an awesome library for 
extracting plain text from many different document formats and types. 
There's a direct way to use Tesseract from within Solr (the 
ExtractingRequestHandler 
https://lucene.apache.org/solr/guide/6_6/uploading-data-with-solr-cell-using-apache-tika.html#uploading-data-with-solr-cell-using-apache-tika) 
but we don't generally recommend this, as dodgy files can sometimes eat 
all your resources during parsing and if Tika dies then so does Solr. We 
usually process the files externally and then feed them to Solr using its
HTTP API.


Here's one way to do it - a simple server wrapper around Tika 
https://github.com/mattflax/dropwizard-tika-server written by my 
colleague Matt Pearce.


So you're going to need to do some coding I think - Python would be a 
good choice - to feed your source files to Tika for OCR and extraction, 
and then the resulting text to Solr for indexing.


Cheers

Charlie



Thank you very much




--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk
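
Charlie suggests Python; since Tika's native API is Java, the same pipeline can also be sketched in a few lines of Java. The core name, field names, and id scheme below are assumptions, and Tika only performs OCR when the tesseract binary is on the PATH; treat this as a sketch, not the recommended production setup:

import java.io.File;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;
import org.apache.tika.Tika;

public class OcrIndexer {
    public static void main(String[] args) throws Exception {
        Tika tika = new Tika();  // delegates image files to the Tesseract OCR parser if available
        try (HttpSolrClient solr =
                new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build()) {
            for (String path : args) {
                File f = new File(path);
                String text = tika.parseToString(f);  // plain text, OCR'd for image files
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", f.getName());
                doc.addField("content_txt", text);
                solr.add(doc);
            }
            solr.commit();  // make the new documents searchable
        }
    }
}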