Re: SolrCloud scaling/optimization for high request rate

2018-10-26 Thread Shawn Heisey

On 10/26/2018 9:55 AM, Sofiya Strochyk wrote:


We have a SolrCloud setup with the following configuration:



I'm late to this party.  You've gotten some good replies already.  I 
hope I can add something useful.



  * 4 nodes (3x128GB RAM Intel Xeon E5-1650v2, 1x64GB RAM Intel Xeon
E5-1650v2, 12 cores, with SSDs)
  * One collection, 4 shards, each has only a single replica (so 4
replicas in total), using compositeId router
  * Total index size is about 150M documents/320GB, so about 40M/80GB
per node



With 80GB of index data and one node that only has 64GB of memory, the 
full index won't fit into memory on that one server. With approximately 
56GB of memory (assuming there's nothing besides Solr running on these 
servers and the size of all Java heaps on the system is 8GB) to cache 
80GB of index data, performance might be good.  Or it might be 
terrible.  It's impossible to predict effectively.



  * Heap size is set to 8GB.



I'm not sure that an 8GB heap is large enough.  Especially given what 
you said later about experiencing OOM and seeing a lot of full GCs.


If properly tuned, the G1 collector is overall more efficient than CMS, 
but CMS can be quite good.  If GC is not working well with CMS, chances 
are that switching to G1 will not help.  The root problem is likely to 
be something that a different collector can't fix -- like the heap being 
too small.


I wrote the page you referenced for GC tuning.  I have *never* had a 
single problem using G1 with Solr.


Target query rate is up to 500 qps, maybe 300, and we need to keep 
response time at <200ms. But at the moment we only see very good 
search performance with up to 100 requests per second. Whenever it 
grows to about 200, average response time abruptly increases to 0.5-1 
second. (Also it seems that request rate reported by SOLR in admin 
metrics is 2x higher than the real one, because for every query, every 
shard receives 2 requests: one to obtain IDs and second one to get 
data by IDs; so target rate for SOLR metrics would be 1000 qps).




Getting 100 requests per second on a single replica is quite good, 
especially with a sharded index.  I never could get performance like 
that.  To handle hundreds of requests per second, you need several replicas.


If you can reduce the number of shards, the amount of work involved for 
a single request will decrease, which MIGHT increase the queries per 
second your hardware can handle.  With four shards, one query typically 
is actually 9 requests: the original request, plus two sub-requests per 
shard (one to obtain the matching IDs, one to fetch the documents by ID).


Unless your clients are all Java-based, you also need a load balancer to 
avoid a single point of failure.  (The Java client, CloudSolrClient, is 
cluster-aware: it talks to the entire SolrCloud cluster via ZooKeeper and 
doesn't need a load balancer.)
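
For reference, a minimal SolrJ sketch of that cluster-aware client (the 
ZooKeeper address and collection name below are placeholders, and Builder 
signatures vary slightly across SolrJ 7.x versions):

    import java.util.Collections;
    import java.util.Optional;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class CloudQueryExample {
        public static void main(String[] args) throws Exception {
            // CloudSolrClient reads cluster state from ZooKeeper and load-balances
            // requests across live nodes itself, so no external LB is required.
            try (CloudSolrClient client = new CloudSolrClient.Builder(
                    Collections.singletonList("zk1.example.com:2181"), Optional.empty()).build()) {
                client.setDefaultCollection("mycollection");
                QueryResponse rsp = client.query(new SolrQuery("title:dog"));
                System.out.println("numFound: " + rsp.getResults().getNumFound());
            }
        }
    }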


What you are seeing where there is a sharp drop in performance from a 
relatively modest load increase is VERY common.  This is the way that 
almost all software systems behave when faced with extreme loads.  
Search for "knee" on this page:


https://www.oreilly.com/library/view/the-art-of/9780596155858/ch04.html

During high request load, CPU usage increases dramatically on the SOLR 
nodes. It doesn't reach 100% but averages at 50-70% on 3 servers and 
about 93% on 1 server (random server each time, not the smallest one).




Very likely the one with a higher load is the one that is aggregating 
shard requests for a full result.


The documentation mentions replication to spread the load between the 
servers. We tested replicating to smaller servers (32GB RAM, Intel 
Core i7-4770). However, when we tested it, the replicas were going out 
of sync all the time (possibly during commits) and reported errors 
like "PeerSync Recovery was not successful - trying replication." Then 
they proceed with replication which takes hours and the leader handles 
all requests singlehandedly during that time. Also both leaders and 
replicas started encountering OOM errors (heap space) for unknown reason.




With only 32GB of memory, assuming 8GB is allocated to the heap, there's 
only 24GB to cache the 80GB of index data.  That's not enough, and 
performance would be MUCH worse than your 64GB or 128GB machines.


I would suspect extreme GC pauses and/or general performance issues from 
not enough cache memory to be the root cause of the sync and recovery 
problems.


Heap dump analysis shows that most of the memory is consumed by [J 
(array of long) type, my best guess would be that it is "_version_" 
field, but it's still unclear why it happens.




I'm not familiar enough with how Lucene allocates memory internally to 
have any hope of telling you exactly what that memory structure is.


Also, even though with replication request rate and CPU usage drop 2 
times, it doesn't seem to affect mean_ms, stddev_ms or p95_ms numbers 
(p75_ms is much smaller on nodes with replication, but still not as 
low as under load of <100 requests/s).


Garbage collection is much more active during high load as well. Full 
GC happens almost exclusively during 

Re: SolrCloud scaling/optimization for high request rate

2018-10-26 Thread Toke Eskildsen
Sofiya Strochyk  wrote:
> Target query rate is up to 500 qps, maybe 300, and we need
> to keep response time at <200ms. But at the moment we only
> see very good search performance with up to 100 requests
> per second. Whenever it grows to about 200, average response
> time abruptly increases to 0.5-1 second. 

Keep in mind that upping the number of concurrent searches in Solr does not 
raise throughput if the system is already saturated. On the contrary, it 
will lower throughput due to thread and memory congestion.

As your machines have 12 cores (including HyperThreading) and IO does not seem 
to be an issue, 500 or even just 200 concurrent searches seem likely to result 
in lower throughput than (really guessing here) 100 concurrent searches. As 
Walter points out, the end result is collapse, but the slowdown happens before that.

Consider putting a proxy in front with a maximum number of concurrent connections 
and a sensible queue, preferably after a bit of testing to locate where the 
highest throughput is. It won't make you hit your overall goal, but it can move 
you closer to it.

- Toke Eskildsen


Re: SolrCloud scaling/optimization for high request rate

2018-10-26 Thread Erick Erickson
Sofiya:

I haven't said so before, but it's a great pleasure to work with
someone who's done a lot of homework before pinging the list. The only
unfortunate bit is that it usually means the simple "Oh, I can fix
that without thinking about it much" doesn't work ;)

2.  I'll clarify a bit here. Any TLOG replica can become the leader.
Here's the process for an update:
> doc comes in to the leader (may be TLOG)
> doc is forwarded to all TLOG replicas, _but it is not indexed there_.
> If the leader fails, the other TLOG replicas have enough documents in _their_ 
> tlogs to "catch up" and one is elected
> You're totally right that PULL replicas cannot become leaders
> having all TLOG replicas means that the CPU cycles otherwise consumed by 
> indexing are available for query processing.

The point here is that TLOG replicas don't need to expend CPU cycles
to index documents, freeing up all those cycles for serving queries.
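
As a hedged sketch (collection and config names are placeholders), setting up 
a collection with TLOG leaders plus PULL query-only replicas from SolrJ might 
look like this:

    import java.util.Collections;
    import java.util.Optional;
    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.client.solrj.request.CollectionAdminRequest;

    public class CreateTlogPullCollection {
        public static void main(String[] args) throws Exception {
            try (CloudSolrClient client = new CloudSolrClient.Builder(
                    Collections.singletonList("zk1.example.com:2181"), Optional.empty()).build()) {
                // 4 shards; per shard: 0 NRT, 1 TLOG replica (leader-eligible, does the
                // indexing) and 1 PULL replica (pulls segments, serves queries only).
                CollectionAdminRequest
                        .createCollection("mycollection", "myconfig", 4, 0, 1, 1)
                        .process(client);
            }
        }
    }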

Now, that said you report that QPS rate doesn't particularly seem to
be affected by whether you're indexing or not, so that makes using
TLOG and PULL replicas less likely to solve your problem. I was
thinking about your statement that you index as fast as possible


6. This is a little surprising. Here's my guess: you're indexing in
large batches, and each batch is only really occupying a thread or two,
so it's effectively serialized and thus not consuming a huge amount of
resources.

So unless G1 really solves a lot of problems, more replicas are
indicated. On machines with large amounts of RAM and lots of CPUs, one
other option is to run multiple JVMs per physical node; that's
sometimes helpful.

One other possibility: in Solr 7.5 you have a ton of metrics
available. If you hit the admin/metrics endpoint you'll see 150-200
available metrics. Apart from running a profiler to see what's
consuming the most cycles, the metrics can give you a view into what
Solr is doing and may help you pinpoint what's using the most cycles.
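
For example, a rough sketch using plain JDK HTTP (host, port and metric prefix 
are assumptions) that pulls just the /select handler's timing metrics for the 
cores on one node:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;

    public class MetricsProbe {
        public static void main(String[] args) throws Exception {
            // group=core restricts output to per-core metrics; the prefix narrows it to
            // the /select request-time histogram (mean, percentiles, rates).
            URL url = new URL("http://localhost:8983/solr/admin/metrics"
                    + "?group=core&prefix=QUERY./select.requestTimes&wt=json");
            try (BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()))) {
                String line;
                while ((line = in.readLine()) != null) {
                    System.out.println(line);
                }
            }
        }
    }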

Best,
Erick
On Fri, Oct 26, 2018 at 12:23 PM Toke Eskildsen  wrote:
>
> David Hastings  wrote:
> > Would adding the docValues in the schema, but not reindexing, cause
> > errors?  IE, only apply the doc values after the next reindex, but in the
> > meantime keep functioning as there were none until then?
>
> As soon as you specify in the schema that a field has docValues=true, Solr 
> treats all existing documents as having docValues enabled for that field. As 
> there is no docValue content, DocValues-aware functionality such as sorting 
> and faceting will not work for that field until the documents have been 
> re-indexed.
>
> - Toke Eskildsen


Re: LTR features on solr

2018-10-26 Thread Kamuela Lau
I have never done such a thing myself, but I think that a dynamic field would
probably be the way to go.

I've not used it myself, but you might also be able to do what you want
with payloads:

https://lucene.apache.org/solr/guide/7_5/function-queries.html#payload-function

https://lucidworks.com/2017/09/14/solr-payloads/

Hope that answers your question.

Fri, Oct 26, 2018 at 18:28, Midas A :

> *Thanks for the reply. Please find my answers below inline.*
>
>
> On Fri, Oct 26, 2018 at 2:41 PM Kamuela Lau  wrote:
>
> > Hi,
> >
> > Just to confirm, are you asking about the following?
> >
> > For a particular query, you have a list of documents, and for each
> > document, you have data
> > on the number of times the document was clicked on, added to a cart, and
> > ordered, and you
> > would like to use this data for features. Is this correct?
> > *[ME] :Yes*
> > If this is the case, are you indexing that data?
> >
>*[ME]* : *Yes, we are planning to index the data, but my question is how we
> should store it in Solr.*
> * Should I create a dynamic field to store the click, cart and order
> data per query for each document?*
> * Please guide me on how we should store it. *
>
> >
> > I believe that the features which can be used for the LTR module is
> > information that is either indexed,
> > or indexed information which has been manipulated through the use of
> > function queries.
> >
> > https://lucene.apache.org/solr/guide/7_5/learning-to-rank.html
> >
> > It seems to me that you would have to frequently index the click data, if
> > you need to refresh the data frequently
> >
>   *  [ME] : we are planning to refresh this data weekly.*
>
> >
> > On Fri, Oct 26, 2018 at 4:24 PM Midas A  wrote:
> >
> > > Hi  All,
> > >
> > > I am new to implementing Solr LTR, so I am facing a few challenges.
> > > Broadly, we have 3 kinds of features:
> > > a) Based on query
> > > b) Based on document
> > > *c) Based on query-document clicks, cart additions and orders from tracker
> > data.*
> > >
> > > So my question here is how to store the c) type of features
> > >- old queries and corresponding clicks (query-clicks),
> > > - old query - cart addition data, and
> > >   - old query - order data
> > >  into Solr to run the LTR model,
> > > and secondly how to build features for query-clicks, query-cart and
> > > query-orders, because we need to refresh this data frequently.
> > >
> > > What approach should I follow?
> > >
> > > Hope I am able to explain my problem.
> > >
> >
>


Re: Tesseract language

2018-10-26 Thread Rohan Kasat
Hi Martin,

Are you using it for image formats? I think you can try tess4j and set
TESSDATA_PREFIX as the home for the Tesseract configs.

I have tried it and it works pretty well on my local machine.

I have used Java 8 and Tesseract 3 for the same.

Regards,
Rohan Kasat

On Fri, Oct 26, 2018 at 12:31 PM Martin Frank Hansen (MHQ) 
wrote:

> Hi Tim,
>
> You were right.
>
> When I called `tesseract testing/eurotext.png testing/eurotext-dan -l
> dan`, I got an error message so I downloaded "dan.traineddata" and added it
> to the Tesseract-OCR/tessdata folder. Furthermore I added the
> 'TESSDATA_PREFIX' variable to the path-variables pointing to
> "Tesseract-OCR/tessdata".
>
> Now Tesseract works with Danish language from the CMD, but now I can't
> make the code work in Java, not even with default settings (which I could
> before). Am I missing something or just mixing some things up?
>
>
>
> -Original Message-
> From: Tim Allison 
> Sent: 26. oktober 2018 19:58
> To: solr-user@lucene.apache.org
> Subject: Re: Tesseract language
>
> Tika relies on you to install tesseract and all the language libraries
> you'll need.
>
> If you can successfully call `tesseract testing/eurotext.png
> testing/eurotext-dan -l dan`, Tika _should_ be able to specify "dan"
> with your code above.
> On Fri, Oct 26, 2018 at 10:49 AM Martin Frank Hansen (MHQ) 
> wrote:
> >
> > Hi again,
> >
> > Now I moved the OCR part to Tika, but I still can't make it work with
> Danish. It works when using default language settings and it seems like
> Tika is missing Danish dictionary.
> >
> > My java code looks like this:
> >
> > {
> > File file = new File(pathfilename);
> >
> > Metadata meta = new Metadata();
> >
> > InputStream stream = TikaInputStream.get(file);
> >
> > Parser parser = new AutoDetectParser();
> > BodyContentHandler handler = new
> > BodyContentHandler(Integer.MAX_VALUE);
> >
> > TesseractOCRConfig config = new TesseractOCRConfig();
> > config.setLanguage("dan"); // code works if this phrase is
> commented out.
> >
> > ParseContext parseContext = new ParseContext();
> >
> >  parseContext.set(TesseractOCRConfig.class, config);
> >
> > parser.parse(stream, handler, meta, parseContext);
> > System.out.println(handler.toString());
> > }
> >
> > Hope that someone can help here.
> >
> > -Original Message-
> > From: Martin Frank Hansen (MHQ) 
> > Sent: 22. oktober 2018 07:58
> > To: solr-user@lucene.apache.org
> > Subject: SV: Tesseract language
> >
> > Hi Erick,
> >
> > Thanks for the help! I will take a look at it.
> >
> >
> > Martin Frank Hansen, Senior Data Analytiker
> >
> > Data, IM & Analytics
> >
> >
> >
> > Lautrupparken 40-42, DK-2750 Ballerup
> > E-mail m...@kmd.dk  Web www.kmd.dk
> > Mobil +4525571418
> >
> > -Oprindelig meddelelse-
> > Fra: Erick Erickson 
> > Sendt: 21. oktober 2018 22:49
> > Til: solr-user 
> > Emne: Re: Tesseract language
> >
> > Here's a skeletal program that uses Tika in a stand-alone client. Rip
> the RDBMS parts out
> >
> > https://lucidworks.com/2012/02/14/indexing-with-solrj/
> > On Sun, Oct 21, 2018 at 1:13 PM Alexandre Rafalovitch <
> arafa...@gmail.com> wrote:
> > >
> > > Usually, we just say to do a custom solution using SolrJ client to
> > > connect. This gives you maximum flexibility and allows to integrate
> > > Tika either inside your code or as a server. Latest Tika actually
> > > has some off-thread handling I believe, to make it safer to embed.
> > >
> > > For DIH alternatives, if you want configuration over custom code,
> > > you could look at something like Apache NiFI. It can push data into
> Solr.
> > > Obviously it is a bigger solution, but it is correspondingly more
> > > robust too.
> > >
> > > Regards,
> > >Alex.
> > > On Sun, 21 Oct 2018 at 11:07, Martin Frank Hansen (MHQ) 
> wrote:
> > > >
> > > > Hi Alexandre,
> > > >
> > > > Thanks for your reply.
> > > >
> > > > Yes right now it is just for testing the possibilities of Solr and
> Tesseract.
> > > >
> > > > I will take a look at the Tika documentation to see if I can make it
> work.
> > > >
> > > > You said that DIH are not recommended for production usage, what is
> the recommended method(s) to upload data to a Solr instance?
> > > >
> > > > Best regards
> > > >
> > > > Martin Frank Hansen
> > > >
> > > > -Oprindelig meddelelse-
> > > > Fra: Alexandre Rafalovitch 
> > > > Sendt: 21. oktober 2018 16:26
> > > > Til: solr-user 
> > > > Emne: Re: Tesseract language
> > > >
> > > > There is a couple of things mixed in here:
> > > > 1) Extract handler is not recommended for production usage. It is
> great for a quick test, just like you did it, but going to production,
> running it externally is better. Tika - especially with large files can use
> up a lot of memory and trip up the Solr inst

RE: Tesseract language

2018-10-26 Thread Martin Frank Hansen (MHQ)
Hi Tim,

You were right.

When I called `tesseract testing/eurotext.png testing/eurotext-dan -l dan`, I 
got an error message so I downloaded "dan.traineddata" and added it to the 
Tesseract-OCR/tessdata folder. Furthermore I added the 'TESSDATA_PREFIX' 
variable to the path-variables pointing to "Tesseract-OCR/tessdata".

Now Tesseract works with Danish language from the CMD, but now I can't make the 
code work in Java, not even with default settings (which I could before). Am I 
missing something or just mixing some things up?



-Original Message-
From: Tim Allison 
Sent: 26. oktober 2018 19:58
To: solr-user@lucene.apache.org
Subject: Re: Tesseract language

Tika relies on you to install tesseract and all the language libraries you'll 
need.

If you can successfully call `tesseract testing/eurotext.png 
testing/eurotext-dan -l dan`, Tika _should_ be able to specify "dan"
with your code above.
On Fri, Oct 26, 2018 at 10:49 AM Martin Frank Hansen (MHQ)  wrote:
>
> Hi again,
>
> Now I moved the OCR part to Tika, but I still can't make it work with Danish. 
> It works when using default language settings and it seems like Tika is 
> missing Danish dictionary.
>
> My java code looks like this:
>
> {
> File file = new File(pathfilename);
>
> Metadata meta = new Metadata();
>
> InputStream stream = TikaInputStream.get(file);
>
> Parser parser = new AutoDetectParser();
> BodyContentHandler handler = new
> BodyContentHandler(Integer.MAX_VALUE);
>
> TesseractOCRConfig config = new TesseractOCRConfig();
> config.setLanguage("dan"); // code works if this phrase is 
> commented out.
>
> ParseContext parseContext = new ParseContext();
>
>  parseContext.set(TesseractOCRConfig.class, config);
>
> parser.parse(stream, handler, meta, parseContext);
> System.out.println(handler.toString());
> }
>
> Hope that someone can help here.
>
> -Original Message-
> From: Martin Frank Hansen (MHQ) 
> Sent: 22. oktober 2018 07:58
> To: solr-user@lucene.apache.org
> Subject: SV: Tesseract language
>
> Hi Erick,
>
> Thanks for the help! I will take a look at it.
>
>
> Martin Frank Hansen, Senior Data Analytiker
>
> Data, IM & Analytics
>
>
>
> Lautrupparken 40-42, DK-2750 Ballerup
> E-mail m...@kmd.dk  Web www.kmd.dk
> Mobil +4525571418
>
> -Oprindelig meddelelse-
> Fra: Erick Erickson 
> Sendt: 21. oktober 2018 22:49
> Til: solr-user 
> Emne: Re: Tesseract language
>
> Here's a skeletal program that uses Tika in a stand-alone client. Rip the 
> RDBMS parts out
>
> https://lucidworks.com/2012/02/14/indexing-with-solrj/
> On Sun, Oct 21, 2018 at 1:13 PM Alexandre Rafalovitch  
> wrote:
> >
> > Usually, we just say to do a custom solution using SolrJ client to
> > connect. This gives you maximum flexibility and allows to integrate
> > Tika either inside your code or as a server. Latest Tika actually
> > has some off-thread handling I believe, to make it safer to embed.
> >
> > For DIH alternatives, if you want configuration over custom code,
> > you could look at something like Apache NiFI. It can push data into Solr.
> > Obviously it is a bigger solution, but it is correspondingly more
> > robust too.
> >
> > Regards,
> >Alex.
> > On Sun, 21 Oct 2018 at 11:07, Martin Frank Hansen (MHQ)  wrote:
> > >
> > > Hi Alexandre,
> > >
> > > Thanks for your reply.
> > >
> > > Yes right now it is just for testing the possibilities of Solr and 
> > > Tesseract.
> > >
> > > I will take a look at the Tika documentation to see if I can make it work.
> > >
> > > You said that DIH are not recommended for production usage, what is the 
> > > recommended method(s) to upload data to a Solr instance?
> > >
> > > Best regards
> > >
> > > Martin Frank Hansen
> > >
> > > -Oprindelig meddelelse-
> > > Fra: Alexandre Rafalovitch 
> > > Sendt: 21. oktober 2018 16:26
> > > Til: solr-user 
> > > Emne: Re: Tesseract language
> > >
> > > There is a couple of things mixed in here:
> > > 1) Extract handler is not recommended for production usage. It is great 
> > > for a quick test, just like you did it, but going to production, running 
> > > it externally is better. Tika - especially with large files can use up a 
> > > lot of memory and trip up the Solr instance it is running within.
> > > 2) If you are still just testing, you can configure Tika within Solr but 
> > > specifying parseContent.config file as shown at the link and described 
> > > further down in the same document:
> > > https://lucene.apache.org/solr/guide/7_5/uploading-data-with-solr-
> > > ce
> > > ll-using-apache-tika.html#configuring-the-solr-extractingrequestha
> > > nd ler You still need to check with Tika documentation with
> > > Tesseract can take its configuration from the parseContext file.
> > > 3) If you are still testing with multiple files, Data Import Handler can 
> > > iterate through files and then - as a nested enti

Re: SolrCloud scaling/optimization for high request rate

2018-10-26 Thread Toke Eskildsen
David Hastings  wrote:
> Would adding the docValues in the schema, but not reindexing, cause
> errors?  IE, only apply the doc values after the next reindex, but in the
> meantime keep functioning as there were none until then?

As soon as you specify in the schema that a field has docValues=true, Solr 
treats all existing documents as having docValues enabled for that field. As 
there is no docValue content, DocValues-aware functionality such as sorting and 
faceting will not work for that field until the documents have been re-indexed.
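
For reference, enabling docValues is a per-field schema attribute; a minimal 
sketch (the field name and type are only examples):

    <field name="price_i" type="pint" indexed="true" stored="true" docValues="true"/>

After the change, documents have to be re-indexed before sorting or faceting 
on that field will see any values.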

- Toke Eskildsen


Re: SolrCloud scaling/optimization for high request rate

2018-10-26 Thread David Hastings
Would adding the docValues in the schema, but not reindexing, cause
errors?  I.e., only apply the doc values after the next reindex, but in the
meantime keep functioning as if there were none until then?

On Fri, Oct 26, 2018 at 2:15 PM Toke Eskildsen  wrote:

> Sofiya Strochyk  wrote:
> > 5. Yes, docValues are enabled for the fields we sort on
> > (except score which is an internal field); [...]
>
> I am currently working on
> https://issues.apache.org/jira/browse/LUCENE-8374
> which speeds up DocValues-operations for indexes with many documents.
>
> What "many" means is hard to say, but as can be seen in the JIRA, Tim
> Underwood sees a nice speed up for faceting with his 80M doc index.
> Hopefully it can also benefit your 40M doc (per shard) index with sorting
> on (I infer) multiple DocValued fields. I'd be happy to assist, should you
> need help with the patch.
>
> - Toke Eskildsen
>


Re: SolrCloud scaling/optimization for high request rate

2018-10-26 Thread Toke Eskildsen
Sofiya Strochyk  wrote:
> 5. Yes, docValues are enabled for the fields we sort on
> (except score which is an internal field); [...]

I am currently working on
https://issues.apache.org/jira/browse/LUCENE-8374
which speeds up DocValues-operations for indexes with many documents.

What "many" means is hard to say, but as can be seen in the JIRA, Tim Underwood 
sees a nice speed up for faceting with his 80M doc index. Hopefully it can also 
benefit your 40M doc (per shard) index with sorting on (I infer) multiple 
DocValued fields. I'd be happy to assist, should you need help with the patch.

- Toke Eskildsen


Re: Tesseract language

2018-10-26 Thread Tim Allison
Tika relies on you to install tesseract and all the language libraries
you'll need.

If you can successfully call `tesseract testing/eurotext.png
testing/eurotext-dan -l dan`, Tika _should_ be able to specify "dan"
with your code above.
On Fri, Oct 26, 2018 at 10:49 AM Martin Frank Hansen (MHQ)  wrote:
>
> Hi again,
>
> Now I moved the OCR part to Tika, but I still can't make it work with Danish. 
> It works when using default language settings and it seems like Tika is 
> missing Danish dictionary.
>
> My java code looks like this:
>
> {
> File file = new File(pathfilename);
>
> Metadata meta = new Metadata();
>
> InputStream stream = TikaInputStream.get(file);
>
> Parser parser = new AutoDetectParser();
> BodyContentHandler handler = new 
> BodyContentHandler(Integer.MAX_VALUE);
>
> TesseractOCRConfig config = new TesseractOCRConfig();
> config.setLanguage("dan"); // code works if this phrase is 
> commented out.
>
> ParseContext parseContext = new ParseContext();
>
>  parseContext.set(TesseractOCRConfig.class, config);
>
> parser.parse(stream, handler, meta, parseContext);
> System.out.println(handler.toString());
> }
>
> Hope that someone can help here.
>
> -Original Message-
> From: Martin Frank Hansen (MHQ) 
> Sent: 22. oktober 2018 07:58
> To: solr-user@lucene.apache.org
> Subject: SV: Tesseract language
>
> Hi Erick,
>
> Thanks for the help! I will take a look at it.
>
>
> Martin Frank Hansen, Senior Data Analytiker
>
> Data, IM & Analytics
>
>
>
> Lautrupparken 40-42, DK-2750 Ballerup
> E-mail m...@kmd.dk  Web www.kmd.dk
> Mobil +4525571418
>
> -Oprindelig meddelelse-
> Fra: Erick Erickson 
> Sendt: 21. oktober 2018 22:49
> Til: solr-user 
> Emne: Re: Tesseract language
>
> Here's a skeletal program that uses Tika in a stand-alone client. Rip the 
> RDBMS parts out
>
> https://lucidworks.com/2012/02/14/indexing-with-solrj/
> On Sun, Oct 21, 2018 at 1:13 PM Alexandre Rafalovitch  
> wrote:
> >
> > Usually, we just say to do a custom solution using SolrJ client to
> > connect. This gives you maximum flexibility and allows to integrate
> > Tika either inside your code or as a server. Latest Tika actually has
> > some off-thread handling I believe, to make it safer to embed.
> >
> > For DIH alternatives, if you want configuration over custom code, you
> > could look at something like Apache NiFI. It can push data into Solr.
> > Obviously it is a bigger solution, but it is correspondingly more
> > robust too.
> >
> > Regards,
> >Alex.
> > On Sun, 21 Oct 2018 at 11:07, Martin Frank Hansen (MHQ)  wrote:
> > >
> > > Hi Alexandre,
> > >
> > > Thanks for your reply.
> > >
> > > Yes right now it is just for testing the possibilities of Solr and 
> > > Tesseract.
> > >
> > > I will take a look at the Tika documentation to see if I can make it work.
> > >
> > > You said that DIH are not recommended for production usage, what is the 
> > > recommended method(s) to upload data to a Solr instance?
> > >
> > > Best regards
> > >
> > > Martin Frank Hansen
> > >
> > > -Oprindelig meddelelse-
> > > Fra: Alexandre Rafalovitch 
> > > Sendt: 21. oktober 2018 16:26
> > > Til: solr-user 
> > > Emne: Re: Tesseract language
> > >
> > > There is a couple of things mixed in here:
> > > 1) Extract handler is not recommended for production usage. It is great 
> > > for a quick test, just like you did it, but going to production, running 
> > > it externally is better. Tika - especially with large files can use up a 
> > > lot of memory and trip up the Solr instance it is running within.
> > > 2) If you are still just testing, you can configure Tika within Solr but 
> > > specifying parseContent.config file as shown at the link and described 
> > > further down in the same document:
> > > https://lucene.apache.org/solr/guide/7_5/uploading-data-with-solr-ce
> > > ll-using-apache-tika.html#configuring-the-solr-extractingrequesthand
> > > ler You still need to check with Tika documentation with Tesseract
> > > can take its configuration from the parseContext file.
> > > 3) If you are still testing with multiple files, Data Import Handler can 
> > > iterate through files and then - as a nested entity - feed it to Tika 
> > > processor for further extraction. I think one of the examples shows that.
> > > However, I am not sure you can pass parseContext that way and DIH is also 
> > > not recommended for production.
> > >
> > > I hope this helps,
> > > Alex.
> > >
> > > On Sun, 21 Oct 2018 at 09:24, Martin Frank Hansen (MHQ)  
> > > wrote:
> > >
> > > > Hi again,
> > > >
> > > >
> > > >
> > > > Is there anyone who has some experience of using Tesseract’s OCR
> > > > module within Solr? The files I am trying to read into Solr is
> > > > Danish Tiff documents.
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > *Martin Frank Hansen*, Senior Data Analytiker
> > > >
> > > > Dat

Re: SolrCloud scaling/optimization for high request rate

2018-10-26 Thread Walter Underwood
The G1 collector should improve 95th percentile performance, because it limits 
the length of pauses.

With the CMS/ParNew collector, I ran very large Eden spaces, 2 Gb out of an 8 
Gb heap. Nearly all of the allocations in Solr have the lifetime of one 
request, so you don’t want any of those allocations to be promoted to tenured 
space. Tenured space should be mostly cache evictions and should grow slowly.
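
A sketch of what that layout might look like as JVM options (the sizes are 
only illustrative, taken from the numbers above; in a standard Solr install 
these usually go in bin/solr.in.sh):

    SOLR_HEAP="8g"
    GC_TUNE="-XX:+UseConcMarkSweepGC -XX:+UseParNewGC \
             -XX:NewSize=2g -XX:MaxNewSize=2g \
             -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly"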

For our clusters, when we hit 70% CPU, we add more CPUs. If we drive Solr much 
harder than that, it goes into congestion collapse. That is totally expected. 
When you use all of a resource, things get slow. Request more than all of a 
resource and things get very, very slow.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Oct 26, 2018, at 10:21 AM, Sofiya Strochyk  wrote:
> 
> Thanks Erick,
> 
> 1. We already use Solr 7.5, upgraded some of our nodes only recently to see 
> if this eliminates the difference in performance (it doesn't, but I'll test 
> and see if the situation with replicas syncing/recovery has improved since 
> then)
> 2. Yes, we only open searcher once every 30 minutes so it is not an NRT case. 
> But it is only recommended to use NRT/TLOG/TLOG+PULL replica types together 
> (currently we have all NRT 
> replicas), would you suggest we change leaders to TLOG and slaves to PULL? 
> And this would also eliminate the redundancy provided by replication because 
> PULL replicas can't become leaders, right?
> 3. Yes but then it would be reflected in iowait metric, which is almost 
> always near zero on our servers. Is there anything else Solr could be waiting 
> for, and is there a way to check it? If we are going to need even more 
> servers for faster response and faceting then there must be a way to know 
> which resource we should get more of.
> 5. Yes, docValues are enabled for the fields we sort on (except score which 
> is an internal field); _version_ is left at default i think (type="long" 
> indexed="false" stored="false", and it's also marked as having DocValues in 
> the admin UI)
> 6. QPS and response time seem to be about the same with and without indexing; 
> server load also looks about the same so i assume indexing doesn't take up a 
> lot of resources (a little strange, but possible if it is limited by network 
> or some other things from point 3).
> 
> 7. Will try using G1 if nothing else helps... Haven't tested it yet because 
> it is considered unsafe and i'd like to have all other options exhausted 
> first. (And even then it is probably going to be a minor improvement? How 
> much more efficient could it possibly be?)
> 
> On 26.10.18 19:18, Erick Erickson wrote:
>> Some ideas:
>> 
>> 1> What version of Solr? Solr 7.3 completely re-wrote Leader Initiated
>> Recovery and 7.5 has other improvements for recovery, we're hoping
>> that the recovery situation is much improved.
>> 
>> 2> In the 7x code line, there are TLOG and PULL replicas. As of 7.5,
>> you can set up so the queries are served by replica type, see:
>> https://issues.apache.org/jira/browse/SOLR-11982 
>> . This might help you
>> out. This moves all the indexing to the leader and reserves the rest
>> of the nodes for queries only, using old-style replication. I'm
>> assuming from your commit rate that latency between when updates
>> happen and the updates are searchable isn't a big concern.
>> 
>> 3> Just because the CPU isn't 100% doesn't mean Solr is running flat
>> out. There's I/O waits while sub-requests are serviced and the like.
>> 
>> 4> As for how to add faceting without slowing down querying, there's
>> no way. Extra work is extra work. Depending on _what_ you're faceting
>> on, you may be able to do some tricks, but without details it's hard
>> to say. You need to get the query rate target first though ;)
>> 
>> 5> OOMs Hmm, you say you're doing complex sorts, are all fields
>> involved in sorts docValues=true? They have to be to be used in
>> function queries of course, but what about any fields that aren't?
>> What about your _version_ field?
>> 
>> 6>  bq. "...indexed 2 times/day, as fast as the SOLR allows..." One
>> experiment I'd run is to test your QPS rate when there was _no_
>> indexing going on. That would give you a hint as to whether the
>> TLOG/PULL configuration would be helpful. There's been talk of
>> separate thread pools for indexing and querying to give queries a
>> better shot at the CPU, but that's not in place yet.
>> 
>> 7> G1GC may also help rather than CMS, but as you're well aware GC
>> tuning "is more art than science" ;).
>> 
>> Good luck!
>> Erick
>> 
>> On Fri, Oct 26, 2018 at 8:55 AM Sofiya Strochyk  
>>  wrote:
>>> Hi everyone,
>>> 
>>> We have a SolrCloud setup with the following configuration:
>>> 
>>> 4 nodes (3

Re: A different result with filters

2018-10-26 Thread Владислав Властовский
Andrey, thx. You are an expert!

Fri, Oct 26, 2018 at 18:40, Kydryavtsev Andrey :

> There is supposed to be support for a child "filters" attribute in recent
> releases. Try it out.
>
>  Link
> https://lucene.apache.org/solr/guide/7_3/other-parsers.html#filtering-and-tagging-2
>
> 26.10.2018, 16:34, "Владислав Властовский" :
> > Andrey, ok
> >
> > How can I tag the filter then?
> >
> > I send:
> > {
> >   "query": "*:*",
> >   "limit": 1000,
> >   "filter": [
> > "{!parent which=kind_s:edition}condition_s:0 AND
> {!tag=price}price_i:[*
> > TO 75]"
> >   ]
> > }
> >
> > I got:
> > {
> >   "error": {
> > "metadata": [
> >   "error-class",
> >   "org.apache.solr.common.SolrException",
> >   "root-error-class",
> >   "org.apache.solr.parser.ParseException"
> > ],
> > "msg": "org.apache.solr.search.SyntaxError: Cannot parse
> 'price_i:[*':
> > Encountered \"\" at line 1, column 10.\nWas expecting:\n \"TO\"
> > ...\n ",
> > "code": 400
> >   }
> > }
> >
> Fri, Oct 26, 2018 at 16:23, Kydryavtsev Andrey :
> >
>>  These two queries are not equivalent. If you have a parent with two children -
>>  "{condition_s:0, price_i: 100}" and "{condition_s:1, price_i: 10}", it
>>  will be matched by the first query, but it won't be matched by the second.
> >>
> >>  26.10.2018, 09:50, "Владислав Властовский" :
> >>  > Hi, I use 7.5.0 Solr
> >>  >
> >>  > Why do I get two different results for similar requests?
> >>  >
> >>  > First req/res:
> >>  > {
> >>  > "query": "*:*",
> >>  > "limit": 0,
> >>  > "filter": [
> >>  > "{!parent which=kind_s:edition}condition_s:0",
> >>  > "{!parent which=kind_s:edition}price_i:[* TO 75]"
> >>  > ]
> >>  > }
> >>  >
> >>  > {
> >>  > "response": {
> >>  > "numFound": 453,
> >>  > "start": 0,
> >>  > "docs": []
> >>  > }
> >>  > }
> >>  >
> >>  > And second query:
> >>  > {
> >>  > "query": "*:*",
> >>  > "limit": 0,
> >>  > "filter": [
> >>  > "{!parent which=kind_s:edition}condition_s:0 AND price_i:[* TO
> >>  75]"
> >>  > ]
> >>  > }
> >>  >
> >>  > {
> >>  > "response": {
> >>  > "numFound": 452,
> >>  > "start": 0,
> >>  > "docs": []
> >>  > }
> >>  > }
>


Re: SolrCloud scaling/optimization for high request rate

2018-10-26 Thread Sofiya Strochyk

Thanks Erick,

1. We already use Solr 7.5, upgraded some of our nodes only recently to 
see if this eliminates the difference in performance (it doesn't, but 
I'll test and see if the situation with replicas syncing/recovery has 
improved since then)


2. Yes, we only open searcher once every 30 minutes so it is not an NRT 
case. But it is only recommended to use NRT/TLOG/TLOG+PULL replica types 
together (currently we have all 
NRT replicas), would you suggest we change leaders to TLOG and slaves to 
PULL? And this would also eliminate the redundancy provided by 
replication because PULL replicas can't become leaders, right?


3. Yes but then it would be reflected in iowait metric, which is almost 
always near zero on our servers. Is there anything else Solr could be 
waiting for, and is there a way to check it? If we are going to need 
even more servers for faster response and faceting then there must be a 
way to know which resource we should get more of.


5. Yes, docValues are enabled for the fields we sort on (except score 
which is an internal field); _version_ is left at default i think 
(type="long" indexed="false" stored="false", and it's also marked as 
having DocValues in the admin UI)


6. QPS and response time seem to be about the same with and without 
indexing; server load also looks about the same so i assume indexing 
doesn't take up a lot of resources (a little strange, but possible if it 
is limited by network or some other things from point 3).


7. Will try using G1 if nothing else helps... Haven't tested it yet 
because it is considered unsafe and i'd like to have all other options 
exhausted first. (And even then it is probably going to be a minor 
improvement? How much more efficient could it possibly be?)



On 26.10.18 19:18, Erick Erickson wrote:

Some ideas:

1> What version of Solr? Solr 7.3 completely re-wrote Leader Initiated
Recovery and 7.5 has other improvements for recovery, we're hoping
that the recovery situation is much improved.

2> In the 7x code line, there are TLOG and PULL replicas. As of 7.5,
you can set up so the queries are served by replica type, see:
https://issues.apache.org/jira/browse/SOLR-11982. This might help you
out. This moves all the indexing to the leader and reserves the rest
of the nodes for queries only, using old-style replication. I'm
assuming from your commit rate that latency between when updates
happen and the updates are searchable isn't a big concern.

3> Just because the CPU isn't 100% doesn't mean Solr is running flat
out. There's I/O waits while sub-requests are serviced and the like.

4> As for how to add faceting without slowing down querying, there's
no way. Extra work is extra work. Depending on _what_ you're faceting
on, you may be able to do some tricks, but without details it's hard
to say. You need to get the query rate target first though ;)

5> OOMs Hmm, you say you're doing complex sorts, are all fields
involved in sorts docValues=true? They have to be to be used in
function queries of course, but what about any fields that aren't?
What about your _version_ field?

6>  bq. "...indexed 2 times/day, as fast as the SOLR allows..." One
experiment I'd run is to test your QPS rate when there was _no_
indexing going on. That would give you a hint as to whether the
TLOG/PULL configuration would be helpful. There's been talk of
separate thread pools for indexing and querying to give queries a
better shot at the CPU, but that's not in place yet.

7> G1GC may also help rather than CMS, but as you're well aware GC
tuning "is more art than science" ;).

Good luck!
Erick

On Fri, Oct 26, 2018 at 8:55 AM Sofiya Strochyk  wrote:

Hi everyone,

We have a SolrCloud setup with the following configuration:

4 nodes (3x128GB RAM Intel Xeon E5-1650v2, 1x64GB RAM Intel Xeon E5-1650v2, 12 
cores, with SSDs)
One collection, 4 shards, each has only a single replica (so 4 replicas in 
total), using compositeId router
Total index size is about 150M documents/320GB, so about 40M/80GB per node
Zookeeper is on a separate server
Documents consist of about 20 fields (most of them are both stored and 
indexed), average document size is about 2kB
Queries are mostly 2-3 words in the q field, with 2 fq parameters, with complex 
sort expression (containing IF functions)
We don't use faceting due to performance reasons but need to add it in the 
future
Majority of the documents are reindexed 2 times/day, as fast as the SOLR 
allows, in batches of 1000-1 docs. Some of the documents are also deleted 
(by id, not by query)
autoCommit is set to maxTime of 1 minute with openSearcher=false and 
autoSoftCommit maxTime is 30 minutes with openSearcher=true. Commits from 
clients are ignored.
Heap size is set to 8GB.

Target query rate is up to 500 qps, maybe 300, and we need to keep response time 
at <200ms. But at the moment we only see very go

Re: SolrCloud scaling/optimization for high request rate

2018-10-26 Thread Erick Erickson
Some ideas:

1> What version of Solr? Solr 7.3 completely re-wrote Leader Initiated
Recovery and 7.5 has other improvements for recovery, we're hoping
that the recovery situation is much improved.

2> In the 7x code line, there are TLOG and PULL replicas. As of 7.5,
you can set up so the queries are served by replica type, see:
https://issues.apache.org/jira/browse/SOLR-11982. This might help you
out. This moves all the indexing to the leader and reserves the rest
of the nodes for queries only, using old-style replication. I'm
assuming from your commit rate that latency between when updates
happen and the updates are searchable isn't a big concern.

3> Just because the CPU isn't 100% doesn't mean Solr is running flat
out. There's I/O waits while sub-requests are serviced and the like.

4> As for how to add faceting without slowing down querying, there's
no way. Extra work is extra work. Depending on _what_ you're faceting
on, you may be able to do some tricks, but without details it's hard
to say. You need to get the query rate target first though ;)

5> OOMs Hmm, you say you're doing complex sorts, are all fields
involved in sorts docValues=true? They have to be to be used in
function queries of course, but what about any fields that aren't?
What about your _version_ field?

6>  bq. "...indexed 2 times/day, as fast as the SOLR allows..." One
experiment I'd run is to test your QPS rate when there was _no_
indexing going on. That would give you a hint as to whether the
TLOG/PULL configuration would be helpful. There's been talk of
separate thread pools for indexing and querying to give queries a
better shot at the CPU, but that's not in place yet.

7> G1GC may also help rather than CMS, but as you're well aware GC
tuning "is more art than science" ;).

Good luck!
Erick

On Fri, Oct 26, 2018 at 8:55 AM Sofiya Strochyk  wrote:
>
> Hi everyone,
>
> We have a SolrCloud setup with the following configuration:
>
> 4 nodes (3x128GB RAM Intel Xeon E5-1650v2, 1x64GB RAM Intel Xeon E5-1650v2, 
> 12 cores, with SSDs)
> One collection, 4 shards, each has only a single replica (so 4 replicas in 
> total), using compositeId router
> Total index size is about 150M documents/320GB, so about 40M/80GB per node
> Zookeeper is on a separate server
> Documents consist of about 20 fields (most of them are both stored and 
> indexed), average document size is about 2kB
> Queries are mostly 2-3 words in the q field, with 2 fq parameters, with 
> complex sort expression (containing IF functions)
> We don't use faceting due to performance reasons but need to add it in the 
> future
> Majority of the documents are reindexed 2 times/day, as fast as the SOLR 
> allows, in batches of 1000-1 docs. Some of the documents are also deleted 
> (by id, not by query)
> autoCommit is set to maxTime of 1 minute with openSearcher=false and 
> autoSoftCommit maxTime is 30 minutes with openSearcher=true. Commits from 
> clients are ignored.
> Heap size is set to 8GB.
>
> Target query rate is up to 500 qps, maybe 300, and we need to keep response 
> time at <200ms. But at the moment we only see very good search performance 
> with up to 100 requests per second. Whenever it grows to about 200, average 
> response time abruptly increases to 0.5-1 second. (Also it seems that request 
> rate reported by SOLR in admin metrics is 2x higher than the real one, 
> because for every query, every shard receives 2 requests: one to obtain IDs 
> and second one to get data by IDs; so target rate for SOLR metrics would be 
> 1000 qps).
>
> During high request load, CPU usage increases dramatically on the SOLR nodes. 
> It doesn't reach 100% but averages at 50-70% on 3 servers and about 93% on 1 
> server (random server each time, not the smallest one).
>
> The documentation mentions replication to spread the load between the 
> servers. We tested replicating to smaller servers (32GB RAM, Intel Core 
> i7-4770). However, when we tested it, the replicas were going out of sync all 
> the time (possibly during commits) and reported errors like "PeerSync 
> Recovery was not successful - trying replication." Then they proceed with 
> replication which takes hours and the leader handles all requests 
> singlehandedly during that time. Also both leaders and replicas started 
> encountering OOM errors (heap space) for unknown reason. Heap dump analysis 
> shows that most of the memory is consumed by [J (array of long) type, my best 
> guess would be that it is "_version_" field, but it's still unclear why it 
> happens. Also, even though with replication request rate and CPU usage drop 2 
> times, it doesn't seem to affect mean_ms, stddev_ms or p95_ms numbers (p75_ms 
> is much smaller on nodes with replication, but still not as low as under load 
> of <100 requests/s).
>
> Garbage collection is much more active during high load as well. Full GC 
> happens almost exclusively during those times. We have tried tuning GC 
> options like suggested here and it didn't change 

SolrCloud scaling/optimization for high request rate

2018-10-26 Thread Sofiya Strochyk

Hi everyone,

We have a SolrCloud setup with the following configuration:

 * 4 nodes (3x128GB RAM Intel Xeon E5-1650v2, 1x64GB RAM Intel Xeon
   E5-1650v2, 12 cores, with SSDs)
 * One collection, 4 shards, each has only a single replica (so 4
   replicas in total), using compositeId router
 * Total index size is about 150M documents/320GB, so about 40M/80GB
   per node
 * Zookeeper is on a separate server
 * Documents consist of about 20 fields (most of them are both stored
    and indexed), average document size is about 2kB
 * Queries are mostly 2-3 words in the q field, with 2 fq parameters,
   with complex sort expression (containing IF functions)
 * We don't use faceting due to performance reasons but need to add it
   in the future
 * Majority of the documents are reindexed 2 times/day, as fast as the
   SOLR allows, in batches of 1000-1 docs. Some of the documents
   are also deleted (by id, not by query)
 * autoCommit is set to maxTime of 1 minute with openSearcher=false and
   autoSoftCommit maxTime is 30 minutes with openSearcher=true. Commits
   from clients are ignored.
 * Heap size is set to 8GB.

Target query rate is up to 500 qps, maybe 300, and we need to keep 
response time at <200ms. But at the moment we only see very good search 
performance with up to 100 requests per second. Whenever it grows to 
about 200, average response time abruptly increases to 0.5-1 second. 
(Also it seems that request rate reported by SOLR in admin metrics is 2x 
higher than the real one, because for every query, every shard receives 
2 requests: one to obtain IDs and second one to get data by IDs; so 
target rate for SOLR metrics would be 1000 qps).


During high request load, CPU usage increases dramatically on the SOLR 
nodes. It doesn't reach 100% but averages at 50-70% on 3 servers and 
about 93% on 1 server (random server each time, not the smallest one).


The documentation mentions replication to spread the load between the 
servers. We tested replicating to smaller servers (32GB RAM, Intel Core 
i7-4770). However, when we tested it, the replicas were going out of 
sync all the time (possibly during commits) and reported errors like 
"PeerSync Recovery was not successful - trying replication." Then they 
proceed with replication which takes hours and the leader handles all 
requests singlehandedly during that time. Also both leaders and replicas 
started encountering OOM errors (heap space) for unknown reason. Heap 
dump analysis shows that most of the memory is consumed by [J (array of 
long) type, my best guess would be that it is "_version_" field, but 
it's still unclear why it happens. Also, even though with replication 
request rate and CPU usage drop 2 times, it doesn't seem to affect 
mean_ms, stddev_ms or p95_ms numbers (p75_ms is much smaller on nodes 
with replication, but still not as low as under load of <100 requests/s).


Garbage collection is much more active during high load as well. Full GC 
happens almost exclusively during those times. We have tried tuning GC 
options like suggested here and it didn't change things though.


My questions are

 * How do we increase throughput? Is replication the only solution?
 * if yes - then why doesn't it affect response times, considering that
   CPU is not 100% used and index fits into memory?
 * How to deal with OOM and replicas going into recovery?
 * Is memory or CPU the main problem? (When searching on the internet,
   i never see CPU as main bottleneck for SOLR, but our case might be
   different)
 * Or do we need smaller shards? Could segments merging be a problem?
 * How to add faceting without search queries slowing down too much?
 * How to diagnose these problems and narrow down to the real reason in
   hardware or setup?

Any help would be much appreciated.

Thanks!

--
Sofiia Strochyk
s...@interlogic.com.ua
InterLogic
www.interlogic.com.ua


Re: Edismax query returning the same number of results using AND as it does with OR

2018-10-26 Thread Shawn Heisey
Followup:

I had a theory that Nicky tested, and I think what was observed confirms the 
theory.

TL;DR:

In previous versions, I think there was a bug where the presence of boolean 
operators caused edismax to ignore the mm parameter, and only rely on the 
boolean operator(s).

After that bug got fixed, mm will apply to any SHOULD clauses in the query. A 
query of "a OR b" has two SHOULD clauses, and the mm value present in this 
query requires all clauses to match, so it is effectively the same as "a AND b".

A potential workaround that appears to work: Detect when the query contains a 
boolean operator, and in that situation, send mm=0 with the query. Alternately, 
just do that when the query contains "OR" - things work right with AND & NOT 
because these don't produce SHOULD clauses.
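
As a hedged illustration (handler and field names are placeholders), the 
workaround amounts to sending parameters like:

    defType=edismax
    qf=title text
    q=dog OR kiwi
    mm=0

for queries that contain an explicit OR, and leaving the configured mm in 
place for queries without boolean operators.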

Thanks,
Shawn



⁣Sent from TypeApp ​

On Oct 25, 2018, 15:24, at 15:24, Nicky Mastin  wrote:
>
>Oddity with edismax and queries involving boolean operators.  Here's
>the
>"parsedquery_toString" from two different queries:
>input:  "dog AND kiwi":
>https://apaste.info/gaQl
>input:  "dog OR kiwi":
>https://apaste.info/sBwa
>Both queries return the same number of results (389).  The query with
>OR was
>expected to have a much higher numFound.  Those pastes have a one week
>lifetime.
>The two parsed queries are almost identical.  The AND query has a
>couple of
>extra plus signs compared to the OR query, and the OR query has a ~2
>after a
>right paren that the AND query doesn't have.  I'm at a loss as to what
>this
>all means, except to say that it didn't turn out as expected.
>Should the two queries have returned different numbers of results?  If
>not,
>why is that the case?
>Here is the output from echoParams=all on the OR query:
>true
>text
>true
>LINE
>enum
>3
> 0.4
>5
>
>title^100 kw1ranked^100 kw1^100 keywordsranked_bm25_no_norms^50
>keywords_bm25_no_norms^50 authors text description species
>
>
>before
>after
>
>subdocuments,keywords,authors
>3<-1 6<-3 9<30%
>true
>html
>on
>
>
>max(recip(ms(NOW/DAY+1YEAR,dateint),3.16E-11,10,6),documentdatefix)
>
>rank
>
>true
>1000
>breakIterator
>true
>year
>2015
>spell_file
>true
>all
>
>id,title,description,url,objecttypeid,contexturl,defaultsourceid,sourceid,score
>
>false
>100
>5
>5
>
>
>
>{!ex=dt key="Last10yr"}dateint:[NOW/YEAR-10YEARS TO *]
>
>
>{!ex=dt key="Last5yr"}dateint:[NOW/YEAR-5YEARS TO *]
>
>
>{!ex=dt key="Last3yr"}dateint:[NOW/YEAR-3YEARS TO *]
>
>
>{!ex=dt key="Last1yr"}dateint:[NOW/YEAR-1YEAR TO *]
>
>
>edismax
>false
>enum
>xml
>true
>*:*
>
>folderid
>sourceid
>speciesid
>admin
>
>enum
>map
>0
>true
>25
>2
>true
>dog OR kiwi
>1970
>
>
>title~20^5000 keywordsranked_bm25_no_norms~20^5000 kw1ranked~10^5000
>keywords_bm25_no_norms~20^1500 kw1~10^500 authors^250 text~20^1000
>text~100^500 description^1
>
>1
>unified
>10
>
>title~22^1000 keywordsranked_bm25_no_norms~22^1000
>keywords_bm25_no_norms~12^500 kw1ranked~12^100 kw1~12^100 text~22^100
>
>authors~11 species~11
>on
>If anyone has any ideas about whether this behavior is expected or
>unexpected, I'd appreciate hearing them.  It is Solr 7.1.0 with a patch
>for
>SOLR-12243 applied.
>There might be information that would be helpful that isn't provided.
>If
>there is something else needed, please let me know, so I can provide
>it.



Re: A different result with filters

2018-10-26 Thread Kydryavtsev Andrey
There is supposed to be support for a child "filters" attribute in recent 
releases. Try it out.

 Link 
https://lucene.apache.org/solr/guide/7_3/other-parsers.html#filtering-and-tagging-2
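
A hedged sketch of what that might look like for the query from this thread 
(the childfq parameter name is arbitrary; both child clauses are applied to 
the same child document, and the price clause can now carry a tag):

    {
      "query": "*:*",
      "limit": 0,
      "filter": [
        "{!parent which=kind_s:edition filters=$childfq}condition_s:0"
      ],
      "params": {
        "childfq": "{!tag=price}price_i:[* TO 75]"
      }
    }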

26.10.2018, 16:34, "Владислав Властовский" :
> Andrey, ok
>
> How can I tag the filter then?
>
> I send:
> {
>   "query": "*:*",
>   "limit": 1000,
>   "filter": [
> "{!parent which=kind_s:edition}condition_s:0 AND {!tag=price}price_i:[*
> TO 75]"
>   ]
> }
>
> I got:
> {
>   "error": {
> "metadata": [
>   "error-class",
>   "org.apache.solr.common.SolrException",
>   "root-error-class",
>   "org.apache.solr.parser.ParseException"
> ],
> "msg": "org.apache.solr.search.SyntaxError: Cannot parse 'price_i:[*':
> Encountered \"\" at line 1, column 10.\nWas expecting:\n \"TO\"
> ...\n ",
> "code": 400
>   }
> }
>
> Fri, Oct 26, 2018 at 16:23, Kydryavtsev Andrey :
>
>>  These two queries are not equivalent. If you have a parent with two children -
>>  "{condition_s:0, price_i: 100}" and "{condition_s:1, price_i: 10}", it
>>  will be matched by the first query, but it won't be matched by the second.
>>
>>  26.10.2018, 09:50, "Владислав Властовский" :
>>  > Hi, I use 7.5.0 Solr
>>  >
>>  > Why do I get two different results for similar requests?
>>  >
>>  > First req/res:
>>  > {
>>  > "query": "*:*",
>>  > "limit": 0,
>>  > "filter": [
>>  > "{!parent which=kind_s:edition}condition_s:0",
>>  > "{!parent which=kind_s:edition}price_i:[* TO 75]"
>>  > ]
>>  > }
>>  >
>>  > {
>>  > "response": {
>>  > "numFound": 453,
>>  > "start": 0,
>>  > "docs": []
>>  > }
>>  > }
>>  >
>>  > And second query:
>>  > {
>>  > "query": "*:*",
>>  > "limit": 0,
>>  > "filter": [
>>  > "{!parent which=kind_s:edition}condition_s:0 AND price_i:[* TO
>>  75]"
>>  > ]
>>  > }
>>  >
>>  > {
>>  > "response": {
>>  > "numFound": 452,
>>  > "start": 0,
>>  > "docs": []
>>  > }
>>  > }


RE: Tesseract language

2018-10-26 Thread Martin Frank Hansen (MHQ)
Hi again,

Now I moved the OCR part to Tika, but I still can't make it work with Danish. 
It works when using default language settings, so it seems like Tika is missing 
the Danish dictionary.

My java code looks like this:

{
    // Requires Tika on the classpath; imports needed: java.io.File, java.io.InputStream,
    // org.apache.tika.io.TikaInputStream, org.apache.tika.metadata.Metadata,
    // org.apache.tika.parser.AutoDetectParser, org.apache.tika.parser.ParseContext,
    // org.apache.tika.parser.Parser, org.apache.tika.parser.ocr.TesseractOCRConfig,
    // org.apache.tika.sax.BodyContentHandler
    File file = new File(pathfilename);

    Metadata meta = new Metadata();

    InputStream stream = TikaInputStream.get(file);

    Parser parser = new AutoDetectParser();
    BodyContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);

    TesseractOCRConfig config = new TesseractOCRConfig();
    config.setLanguage("dan"); // code works if this line is commented out

    ParseContext parseContext = new ParseContext();
    parseContext.set(TesseractOCRConfig.class, config);

    parser.parse(stream, handler, meta, parseContext);
    System.out.println(handler.toString());
}

Hope that someone can help here.

-Original Message-
From: Martin Frank Hansen (MHQ) 
Sent: 22. oktober 2018 07:58
To: solr-user@lucene.apache.org
Subject: SV: Tesseract language

Hi Erick,

Thanks for the help! I will take a look at it.


Martin Frank Hansen, Senior Data Analytiker

Data, IM & Analytics



Lautrupparken 40-42, DK-2750 Ballerup
E-mail m...@kmd.dk  Web www.kmd.dk
Mobil +4525571418

-Oprindelig meddelelse-
Fra: Erick Erickson 
Sendt: 21. oktober 2018 22:49
Til: solr-user 
Emne: Re: Tesseract language

Here's a skeletal program that uses Tika in a stand-alone client. Rip the RDBMS 
parts out

https://lucidworks.com/2012/02/14/indexing-with-solrj/
On Sun, Oct 21, 2018 at 1:13 PM Alexandre Rafalovitch  
wrote:
>
> Usually, we just say to do a custom solution using SolrJ client to
> connect. This gives you maximum flexibility and allows to integrate
> Tika either inside your code or as a server. Latest Tika actually has
> some off-thread handling I believe, to make it safer to embed.
>
> For DIH alternatives, if you want configuration over custom code, you
> could look at something like Apache NiFI. It can push data into Solr.
> Obviously it is a bigger solution, but it is correspondingly more
> robust too.
>
> Regards,
>Alex.
> On Sun, 21 Oct 2018 at 11:07, Martin Frank Hansen (MHQ)  wrote:
> >
> > Hi Alexandre,
> >
> > Thanks for your reply.
> >
> > Yes right now it is just for testing the possibilities of Solr and 
> > Tesseract.
> >
> > I will take a look at the Tika documentation to see if I can make it work.
> >
> > You said that DIH are not recommended for production usage, what is the 
> > recommended method(s) to upload data to a Solr instance?
> >
> > Best regards
> >
> > Martin Frank Hansen
> >
> > -Original Message-
> > From: Alexandre Rafalovitch 
> > Sent: 21 October 2018 16:26
> > To: solr-user 
> > Subject: Re: Tesseract language
> >
> > There is a couple of things mixed in here:
> > 1) Extract handler is not recommended for production usage. It is great for 
> > a quick test, just like you did it, but going to production, running it 
> > externally is better. Tika - especially with large files can use up a lot 
> > of memory and trip up the Solr instance it is running within.
> > 2) If you are still just testing, you can configure Tika within Solr by 
> > specifying a parseContext.config file as shown at the link and described 
> > further down in the same document:
> > https://lucene.apache.org/solr/guide/7_5/uploading-data-with-solr-ce
> > ll-using-apache-tika.html#configuring-the-solr-extractingrequesthand
> > ler You still need to check the Tika documentation on whether Tesseract
> > can take its configuration from the parseContext file.
> > 3) If you are still testing with multiple files, Data Import Handler can 
> > iterate through files and then - as a nested entity - feed it to Tika 
> > processor for further extraction. I think one of the examples shows that.
> > However, I am not sure you can pass parseContext that way and DIH is also 
> > not recommended for production.
> >
> > I hope this helps,
> > Alex.
> >
> > On Sun, 21 Oct 2018 at 09:24, Martin Frank Hansen (MHQ)  wrote:
> >
> > > Hi again,
> > >
> > >
> > >
> > > Is there anyone who has some experience of using Tesseract’s OCR
> > > module within Solr? The files I am trying to read into Solr are
> > > Danish Tiff documents.
> > >
> > >
> > >
> > >
> > >
> > > *Martin Frank Hansen*, Senior Data Analytiker
> > >
> > > Data, IM & Analytics
> > >
> > >
> > >
> > > Lautrupparken 40-42, DK-2750 Ballerup E-mail m...@kmd.dk  Web
> > > www.kmd.dk Mobil +4525571418
> > >
> > >
> > >
> > > *Fra:* Martin Frank Hansen (MHQ) 
> > > *Sendt:* 18. oktober 2018 13:30
> > > *Til:* solr-user@lucene.apache.org
> > > *Emne:* Tesseract language
> > >
> > >
> > >
> > > Hi,
> > >
> > > I have been trying to use Tesseract through the
> > > data-import-handler in Solr and it actually works very well – with
> > > English. As the documents are in Danish, I nee

Solr IndexSearcher lifecycle

2018-10-26 Thread Xiaolong Zheng
Hi,

I would like to better understand the lifecycle of the IndexSearcher in 
Solr. I understand that for Lucene the IndexSearcher documentation recommends that “For 
performance reasons, if your index is unchanging, you should share a single 
IndexSearcher instance across multiple searches instead of creating a new one 
per-search.” (based on Lucene 4.6.1).

But when it comes to the Solr world, which is a Java webapp with a servlet 
dispatcher, do we also keep reusing the same IndexSearcher instance as long as 
the index is not changing?

I see “Hossman” gave a talk on the lifecycle of a Solr search 
request, but it doesn’t mention 
anything about how the IndexSearcher is handled/cleaned up.


Thanks,
Xiaolong
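
For illustration only (this is not an answer from the thread): inside Solr, plugin code normally never opens its own IndexSearcher. It borrows the shared, reference-counted SolrIndexSearcher that the core manages and reuses across requests until a commit opens a new one. A minimal sketch of a custom SearchComponent doing that (the class name is made up):

import java.io.IOException;
import org.apache.solr.handler.component.ResponseBuilder;
import org.apache.solr.handler.component.SearchComponent;
import org.apache.solr.search.SolrIndexSearcher;

public class SearcherIllustrationComponent extends SearchComponent {
  @Override
  public void prepare(ResponseBuilder rb) throws IOException {
    // nothing to prepare in this sketch
  }

  @Override
  public void process(ResponseBuilder rb) throws IOException {
    // The request hands out the core's current, shared searcher.
    SolrIndexSearcher searcher = rb.req.getSearcher();
    // Use it, but never close it here; the request lifecycle releases the reference.
  }

  @Override
  public String getDescription() {
    return "Illustration of borrowing the shared SolrIndexSearcher";
  }
}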





Re: A different result with filters

2018-10-26 Thread Владислав Властовский
Andrey, ok

How can I tag the filter then?

I send:
{
  "query": "*:*",
  "limit": 1000,
  "filter": [
"{!parent which=kind_s:edition}condition_s:0 AND {!tag=price}price_i:[*
TO 75]"
  ]
}

I got:
{
  "error": {
"metadata": [
  "error-class",
  "org.apache.solr.common.SolrException",
  "root-error-class",
  "org.apache.solr.parser.ParseException"
],
"msg": "org.apache.solr.search.SyntaxError: Cannot parse 'price_i:[*':
Encountered \"\" at line 1, column 10.\nWas expecting:\n\"TO\"
...\n",
"code": 400
  }
}
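
One possible fix (an untested sketch, not from the thread): local params such as {!tag=...} are parsed only at the start of a filter string, which is likely why the embedded {!tag=price} breaks the query. Putting the tag on the outer {!parent} local params should keep the whole filter as a single parseable query:

{
  "query": "*:*",
  "limit": 1000,
  "filter": [
    "{!parent tag=price which=kind_s:edition}condition_s:0 AND price_i:[* TO 75]"
  ]
}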

Fri, 26 Oct 2018 at 16:23, Kydryavtsev Andrey :

> This two queries are not similar.  If you have parent with two children -
> "{condition_s:0, price_i: 100}"  and "{condition_s:1, price_i: 10}", it
> will be matched by first query, it won't be matched by second.
>
>
>
>
> 26.10.2018, 09:50, "Владислав Властовский" :
> > Hi, I use 7.5.0 Solr
> >
> > Why do I get two different results for similar requests?
> >
> > First req/res:
> > {
> >   "query": "*:*",
> >   "limit": 0,
> >   "filter": [
> > "{!parent which=kind_s:edition}condition_s:0",
> > "{!parent which=kind_s:edition}price_i:[* TO 75]"
> >   ]
> > }
> >
> > {
> >   "response": {
> > "numFound": 453,
> > "start": 0,
> > "docs": []
> >   }
> > }
> >
> > And second query:
> > {
> >   "query": "*:*",
> >   "limit": 0,
> >   "filter": [
> > "{!parent which=kind_s:edition}condition_s:0 AND price_i:[* TO
> 75]"
> >   ]
> > }
> >
> > {
> >   "response": {
> > "numFound": 452,
> > "start": 0,
> > "docs": []
> >   }
> > }
>


Re: A different result with filters

2018-10-26 Thread Kydryavtsev Andrey
These two queries are not equivalent. If you have a parent with two children - 
"{condition_s:0, price_i: 100}" and "{condition_s:1, price_i: 10}" - it 
will be matched by the first query, but it won't be matched by the second. 
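
A concrete sketch of that situation, using the field names from the thread (the values are made up):

{
  "id": "edition-1",
  "kind_s": "edition",
  "_childDocuments_": [
    { "id": "offer-1", "condition_s": "0", "price_i": 100 },
    { "id": "offer-2", "condition_s": "1", "price_i": 10 }
  ]
}

With the two separate {!parent} filters, offer-1 satisfies condition_s:0 and offer-2 satisfies price_i:[* TO 75], so the parent passes both filters. The combined filter requires a single child to match both clauses, which neither child does, so that parent is dropped; this would explain the 453 vs. 452 counts.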
   



26.10.2018, 09:50, "Владислав Властовский" :
> Hi, I use 7.5.0 Solr
>
> Why do I get two different results for similar requests?
>
> First req/res:
> {
>   "query": "*:*",
>   "limit": 0,
>   "filter": [
> "{!parent which=kind_s:edition}condition_s:0",
> "{!parent which=kind_s:edition}price_i:[* TO 75]"
>   ]
> }
>
> {
>   "response": {
> "numFound": 453,
> "start": 0,
> "docs": []
>   }
> }
>
> And second query:
> {
>   "query": "*:*",
>   "limit": 0,
>   "filter": [
> "{!parent which=kind_s:edition}condition_s:0 AND price_i:[* TO 75]"
>   ]
> }
>
> {
>   "response": {
> "numFound": 452,
> "start": 0,
> "docs": []
>   }
> }


RE: Reading data using Tika to Solr

2018-10-26 Thread Martin Frank Hansen (MHQ)
Hi Tim,

Thanks again, I will update Tika and try it again.

-Original Message-
From: Tim Allison 
Sent: 26. oktober 2018 12:53
To: solr-user@lucene.apache.org
Subject: Re: Reading data using Tika to Solr

Ha...emails passed in the ether.

As you saw, we added the RecursiveParserWrapper a while back into Tika so no 
need to re-invent that wheel.  That’s my preferred method/format because it 
maintains metadata from attachments and lets you know about exceptions in 
embedded files. The legacy method concatenates contents, throws out attachment 
metadata and silently swallows attachment exceptions.

On Fri, Oct 26, 2018 at 6:25 AM Martin Frank Hansen (MHQ) 
wrote:

> Hi again,
>
> Never mind, I got manage to get the content of the msg-files as well
> using the following link as inspiration:
> https://wiki.apache.org/tika/RecursiveMetadata
>
> But thanks again for all your help!
>
> -Original Message-
> From: Martin Frank Hansen (MHQ) 
> Sent: 26. oktober 2018 10:14
> To: solr-user@lucene.apache.org
> Subject: RE: Reading data using Tika to Solr
>
> Hi Tim,
>
> It is msg files and I added tika-app-1.14.jar to the build path - and
> now it works 😊 But how do I get it to read the attachments as well?
>
> -Original Message-
> From: Tim Allison 
> Sent: 25. oktober 2018 21:57
> To: solr-user@lucene.apache.org
> Subject: Re: Reading data using Tika to Solr
>
> If you’re processing actual msg (not eml), you’ll also need poi and
> poi-scratchpad and their dependencies, but then those msgs could have
> attachments, at which point, you may as just add tika-app. :D
>
> On Thu, Oct 25, 2018 at 2:46 PM Martin Frank Hansen (MHQ) 
> wrote:
>
> > Hi Erick and Tim,
> >
> > Thanks for your answers, I can see that my mail got messed up on the
> > way through the server. It looked much more readable at my end 😉
> > The attachment simply included my build-path.
> >
> > @Erick I am compiling the program using Netbeans at the moment.
> >
> > I updated to tika-1.7 but that did not help, and I haven't tried
> > maven yet but will probably have to give that a chance. I just find
> > it a bit odd that I can see the dependencies are included in the jar
> > files I added to the project, but I must be missing something?
> >
> > My buildpath looks as follows:
> >
> > Tika-parsers-1.4.jar
> > Tika-core-1.4.jar
> > Commons-io-2.5.jar
> > Httpclient-4.5.3
> > Httpcore-4.4.6.jar
> > Httpmime-4.5.3.jar
> > Slf4j-api1-7-24.jar
> > Jcl-over--slf4j-1.7.24.jar
> > Solr-cell-7.5.0.jar
> > Solr-core-7.5.0.jar
> > Solr-solrj-7.5.0.jar
> > Noggit-0.8.jar
> >
> >
> >
> > -Original Message-
> > From: Tim Allison 
> > Sent: 25. oktober 2018 20:21
> > To: solr-user@lucene.apache.org
> > Subject: Re: Reading data using Tika to Solr
> >
> > To follow up w Erick’s point, there are a bunch of transitive
> > dependencies from tika-parsers. If you aren’t using maven or similar
> > build system to grab the dependencies, it can be tricky to get it
> > right. If you aren’t using maven, and you can afford the risks of
> > jar hell, consider using tika-app or, better perhaps, tika-server.
> >
> > Stay tuned for SOLR-11721...
> >
> > On Thu, Oct 25, 2018 at 1:08 PM Erick Erickson
> > 
> > wrote:
> >
> > > Martin:
> > >
> > > The mail server is pretty aggressive about stripping attachments,
> > > your png didn't come though. You might also get a more informed
> > > answer on the Tika mailing list.
> > >
> > > That said (and remember I can't see your png so this may be a
> > > silly question), how are you executing the program .vs. compiling
> > > it? You mentioned the "build path". I'm usually lazy and just
> > > execute it in IntelliJ for development and have forgotten to set
> > > my classpath on _numerous_ occasions when running it from a
> > > command line ;)
> > >
> > > Best,
> > > Erick
> > >
> > > On Thu, Oct 25, 2018 at 2:55 AM Martin Frank Hansen (MHQ)
> > > 
> > > wrote:
> > > >
> > > > Hi,
> > > >
> > > >
> > > >
> > > > I am trying to read content of msg-files using Tika and index
> > > > these in
> > > Solr, however I am having some problems with the OfficeParser(). I
> > > keep getting the error java.lang.NoClassDefFoundError for the
> > > OfficeParcer, even though both tika-core and tika-parsers are
> > > included
> > in the build path.
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > I am using Java with the following code:
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > public static void main(final String[] args) throws IOException, SAXException, TikaException {
> > > >
> > > >     processDocument(pathtofile)
> > > >
> > > > }
> > > >
> > > > private static void processDocument(String pathfilename) {
> > > >
> > > >     try {
> > > >
> > > > F

Re: Reading data using Tika to Solr

2018-10-26 Thread Tim Allison
Ha...emails passed in the ether.

As you saw, we added the RecursiveParserWrapper a while back into Tika so
no need to re-invent that wheel.  That’s my preferred method/format because
it maintains metadata from attachments and lets you know about exceptions
in embedded files. The legacy method concatenates contents, throws out
attachment metadata and silently swallows attachment exceptions.
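
A minimal sketch of that approach with a recent Tika (1.17 or later; the handler type and write limit are assumptions, and the classes live in org.apache.tika.parser and org.apache.tika.sax):

Parser parser = new AutoDetectParser();
RecursiveParserWrapper wrapper = new RecursiveParserWrapper(parser);
RecursiveParserWrapperHandler handler = new RecursiveParserWrapperHandler(
        new BasicContentHandlerFactory(BasicContentHandlerFactory.HANDLER_TYPE.TEXT, -1));
try (InputStream stream = TikaInputStream.get(new File(pathfilename))) {
    wrapper.parse(stream, handler, new Metadata(), new ParseContext());
}
// One Metadata object per document: index 0 is the container .msg,
// the rest are the attachments, each with its own extracted text and metadata.
List<Metadata> metadataList = handler.getMetadataList();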

On Fri, Oct 26, 2018 at 6:25 AM Martin Frank Hansen (MHQ) 
wrote:

> Hi again,
>
> Never mind, I got manage to get the content of the msg-files as well using
> the following link as inspiration:
> https://wiki.apache.org/tika/RecursiveMetadata
>
> But thanks again for all your help!
>
> -Original Message-
> From: Martin Frank Hansen (MHQ) 
> Sent: 26. oktober 2018 10:14
> To: solr-user@lucene.apache.org
> Subject: RE: Reading data using Tika to Solr
>
> Hi Tim,
>
> It is msg files and I added tika-app-1.14.jar to the build path - and now
> it works 😊 But how do I get it to read the attachments as well?
>
> -Original Message-
> From: Tim Allison 
> Sent: 25. oktober 2018 21:57
> To: solr-user@lucene.apache.org
> Subject: Re: Reading data using Tika to Solr
>
> If you’re processing actual msg (not eml), you’ll also need poi and
> poi-scratchpad and their dependencies, but then those msgs could have
> attachments, at which point, you may as just add tika-app. :D
>
> On Thu, Oct 25, 2018 at 2:46 PM Martin Frank Hansen (MHQ) 
> wrote:
>
> > Hi Erick and Tim,
> >
> > Thanks for your answers, I can see that my mail got messed up on the
> > way through the server. It looked much more readable at my end 😉 The
> > attachment simply included my build-path.
> >
> > @Erick I am compiling the program using Netbeans at the moment.
> >
> > I updated to tika-1.7 but that did not help, and I haven't tried maven
> > yet but will probably have to give that a chance. I just find it a bit
> > odd that I can see the dependencies are included in the jar files I
> > added to the project, but I must be missing something?
> >
> > My buildpath looks as follows:
> >
> > Tika-parsers-1.4.jar
> > Tika-core-1.4.jar
> > Commons-io-2.5.jar
> > Httpclient-4.5.3
> > Httpcore-4.4.6.jar
> > Httpmime-4.5.3.jar
> > Slf4j-api1-7-24.jar
> > Jcl-over--slf4j-1.7.24.jar
> > Solr-cell-7.5.0.jar
> > Solr-core-7.5.0.jar
> > Solr-solrj-7.5.0.jar
> > Noggit-0.8.jar
> >
> >
> >
> > -Original Message-
> > From: Tim Allison 
> > Sent: 25. oktober 2018 20:21
> > To: solr-user@lucene.apache.org
> > Subject: Re: Reading data using Tika to Solr
> >
> > To follow up w Erick’s point, there are a bunch of transitive
> > dependencies from tika-parsers. If you aren’t using maven or similar
> > build system to grab the dependencies, it can be tricky to get it
> > right. If you aren’t using maven, and you can afford the risks of jar
> > hell, consider using tika-app or, better perhaps, tika-server.
> >
> > Stay tuned for SOLR-11721...
> >
> > On Thu, Oct 25, 2018 at 1:08 PM Erick Erickson
> > 
> > wrote:
> >
> > > Martin:
> > >
> > > The mail server is pretty aggressive about stripping attachments,
> > > your png didn't come though. You might also get a more informed
> > > answer on the Tika mailing list.
> > >
> > > That said (and remember I can't see your png so this may be a silly
> > > question), how are you executing the program .vs. compiling it? You
> > > mentioned the "build path". I'm usually lazy and just execute it in
> > > IntelliJ for development and have forgotten to set my classpath on
> > > _numerous_ occasions when running it from a command line ;)
> > >
> > > Best,
> > > Erick
> > >
> > > On Thu, Oct 25, 2018 at 2:55 AM Martin Frank Hansen (MHQ)
> > > 
> > > wrote:
> > > >
> > > > Hi,
> > > >
> > > >
> > > >
> > > > I am trying to read content of msg-files using Tika and index
> > > > these in
> > > Solr, however I am having some problems with the OfficeParser(). I
> > > keep getting the error java.lang.NoClassDefFoundError for the
> > > OfficeParcer, even though both tika-core and tika-parsers are
> > > included
> > in the build path.
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > I am using Java with the following code:
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > public static void main(final String[] args) throws IOException, SAXException, TikaException {
> > > >
> > > >     processDocument(pathtofile)
> > > >
> > > > }
> > > >
> > > > private static void processDocument(String pathfilename) {
> > > >
> > > >     try {
> > > >
> > > >         File file = new File(pathfilename);
> > > >         Metadata meta = new Metadata();
> > > > I

Re: Reading data using Tika to Solr

2018-10-26 Thread Tim Allison
IIRC, somewhere between 1.14 and now (1.19.1), we changed the default behavior
for the AutoDetectParser from skipping attachments to including attachments.

So, two options: 1) upgrade to 1.19.1 and use the AutoDetectParser or 2)
pass an AutoDetectParser via the ParseContext to be used for attachments.

If you’re wondering why you might upgrade to 1.19.1, look no further than:
https://tika.apache.org/security.html
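
For option 2, a minimal sketch (the variable names follow the code already posted in this thread): registering a Parser in the ParseContext is what tells Tika to recurse into embedded documents such as attachments.

Parser parser = new AutoDetectParser();
ParseContext parseContext = new ParseContext();
// Without this line, older Tika versions skip embedded documents.
parseContext.set(Parser.class, parser);
parser.parse(stream, handler, meta, parseContext);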



On Fri, Oct 26, 2018 at 4:14 AM Martin Frank Hansen (MHQ) 
wrote:

> Hi Tim,
>
> It is msg files and I added tika-app-1.14.jar to the build path - and now
> it works 😊 But how do I get it to read the attachments as well?
>
> -Original Message-
> From: Tim Allison 
> Sent: 25. oktober 2018 21:57
> To: solr-user@lucene.apache.org
> Subject: Re: Reading data using Tika to Solr
>
> If you’re processing actual msg (not eml), you’ll also need poi and
> poi-scratchpad and their dependencies, but then those msgs could have
> attachments, at which point, you may as just add tika-app. :D
>


RE: Reading data using Tika to Solr

2018-10-26 Thread Martin Frank Hansen (MHQ)
Hi again,

Never mind, I managed to get the content of the msg-files as well, using the 
following link as inspiration: https://wiki.apache.org/tika/RecursiveMetadata

But thanks again for all your help!

-Original Message-
From: Martin Frank Hansen (MHQ) 
Sent: 26. oktober 2018 10:14
To: solr-user@lucene.apache.org
Subject: RE: Reading data using Tika to Solr

Hi Tim,

It is msg files and I added tika-app-1.14.jar to the build path - and now it 
works 😊 But how do I get it to read the attachments as well?

-Original Message-
From: Tim Allison 
Sent: 25. oktober 2018 21:57
To: solr-user@lucene.apache.org
Subject: Re: Reading data using Tika to Solr

If you’re processing actual msg (not eml), you’ll also need poi and 
poi-scratchpad and their dependencies, but then those msgs could have 
attachments, at which point, you may as just add tika-app. :D

On Thu, Oct 25, 2018 at 2:46 PM Martin Frank Hansen (MHQ) 
wrote:

> Hi Erick and Tim,
>
> Thanks for your answers, I can see that my mail got messed up on the
> way through the server. It looked much more readable at my end 😉 The
> attachment simply included my build-path.
>
> @Erick I am compiling the program using Netbeans at the moment.
>
> I updated to tika-1.7 but that did not help, and I haven't tried maven
> yet but will probably have to give that a chance. I just find it a bit
> odd that I can see the dependencies are included in the jar files I
> added to the project, but I must be missing something?
>
> My buildpath looks as follows:
>
> Tika-parsers-1.4.jar
> Tika-core-1.4.jar
> Commons-io-2.5.jar
> Httpclient-4.5.3
> Httpcore-4.4.6.jar
> Httpmime-4.5.3.jar
> Slf4j-api1-7-24.jar
> Jcl-over--slf4j-1.7.24.jar
> Solr-cell-7.5.0.jar
> Solr-core-7.5.0.jar
> Solr-solrj-7.5.0.jar
> Noggit-0.8.jar
>
>
>
> -Original Message-
> From: Tim Allison 
> Sent: 25. oktober 2018 20:21
> To: solr-user@lucene.apache.org
> Subject: Re: Reading data using Tika to Solr
>
> To follow up w Erick’s point, there are a bunch of transitive
> dependencies from tika-parsers. If you aren’t using maven or similar
> build system to grab the dependencies, it can be tricky to get it
> right. If you aren’t using maven, and you can afford the risks of jar
> hell, consider using tika-app or, better perhaps, tika-server.
>
> Stay tuned for SOLR-11721...
>
> On Thu, Oct 25, 2018 at 1:08 PM Erick Erickson
> 
> wrote:
>
> > Martin:
> >
> > The mail server is pretty aggressive about stripping attachments,
> > your png didn't come though. You might also get a more informed
> > answer on the Tika mailing list.
> >
> > That said (and remember I can't see your png so this may be a silly
> > question), how are you executing the program .vs. compiling it? You
> > mentioned the "build path". I'm usually lazy and just execute it in
> > IntelliJ for development and have forgotten to set my classpath on
> > _numerous_ occasions when running it from a command line ;)
> >
> > Best,
> > Erick
> >
> > On Thu, Oct 25, 2018 at 2:55 AM Martin Frank Hansen (MHQ)
> > 
> > wrote:
> > >
> > > Hi,
> > >
> > >
> > >
> > > I am trying to read content of msg-files using Tika and index
> > > these in
> > Solr, however I am having some problems with the OfficeParser(). I
> > keep getting the error java.lang.NoClassDefFoundError for the
> > OfficeParcer, even though both tika-core and tika-parsers are
> > included
> in the build path.
> > >
> > >
> > >
> > >
> > >
> > > I am using Java with the following code:
> > >
> > >
> > >
> > >
> > >
> > > public static void main(final String[] args) throws IOException, SAXException, TikaException {
> > >
> > >     processDocument(pathtofile)
> > >
> > > }
> > >
> > > private static void processDocument(String pathfilename) {
> > >
> > >     try {
> > >
> > >         File file = new File(pathfilename);
> > >         Metadata meta = new Metadata();
> > >         InputStream input = TikaInputStream.get(file);
> > >         BodyContentHandler handler = new BodyContentHandler();
> > >         Parser parser = new OfficeParser();
> > >         ParseContext context = new ParseContext();
> > >         parser.parse(input, handler, meta, context);
> > >         String doccontent = handler.toString();
> > >
> > >         System.out.println(doccontent);
> > >         System.out.println(meta);
> > >     }
> > > }
> > >
> > > In the buildpath I have the f

Re: LTR features on solr

2018-10-26 Thread Midas A
*Thanks for the reply. Please find my answers below inline.*


On Fri, Oct 26, 2018 at 2:41 PM Kamuela Lau  wrote:

> Hi,
>
> Just to confirm, are you asking about the following?
>
> For a particular query, you have a list of documents, and for each
> document, you have data
> on the number of times the document was clicked on, added to a cart, and
> ordered, and you
> would like to use this data for features. Is this correct?
> *[ME] :Yes*
> If this is the case, are you indexing that data?
>
   *[ME]* : *Yes, we are planning to index the data, but my question is how we
should store it in Solr.*
* Should I create a dynamic field to store the click, cart and order
data per query for each document?*
* Please guide me on how we should store it. *

>
> I believe that the features which can be used for the LTR module is
> information that is either indexed,
> or indexed information which has been manipulated through the use of
> function queries.
>
> https://lucene.apache.org/solr/guide/7_5/learning-to-rank.html
>
> It seems to me that you would have to frequently index the click data, if
> you need to refresh the data frequently
>
  *  [ME] : we are planning to refresh this data weekly.*

>
> On Fri, Oct 26, 2018 at 4:24 PM Midas A  wrote:
>
> > Hi  All,
> >
> > I am new in implementing solr LTR .  so facing few challenges
> > Broadly  we have 3 kind of features
> > a) Based on query
> > b) based on document
> > *c) Based on query-document from click ,cart and order  from tracker
> data.*
> >
> > So my question here is how to store c) type of features
> >- Old queries and corresponding clicks ((query-clicks)
> > - Old query -cart addition  and
> >   - Old query -order data
> >  into solr to run LTR model
> > and secoundly how to build features for query-clicks, query-cart and
> > query-orders because we need to refresh  this data frequently.
> >
> > What approch should i follow .
> >
> > Hope i am able to explain my problem.
> >
>


Re: LTR features on solr

2018-10-26 Thread Kamuela Lau
Hi,

Just to confirm, are you asking about the following?

For a particular query, you have a list of documents, and for each
document, you have data
on the number of times the document was clicked on, added to a cart, and
ordered, and you
would like to use this data for features. Is this correct?

If this is the case, are you indexing that data?

I believe that the features which can be used for the LTR module is
information that is either indexed,
or indexed information which has been manipulated through the use of
function queries.

https://lucene.apache.org/solr/guide/7_5/learning-to-rank.html

It seems to me that you would have to index the click data frequently, if
you need to refresh the data frequently.
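
If, for example, per-document totals (clicks aggregated across queries) are indexed as fields, a minimal sketch of feature definitions built on them could look like this (the field and feature names are assumptions, not from the thread):

[
  {
    "name": "documentClicks",
    "class": "org.apache.solr.ltr.feature.FieldValueFeature",
    "params": { "field": "click_count_i" }
  },
  {
    "name": "originalScore",
    "class": "org.apache.solr.ltr.feature.OriginalScoreFeature",
    "params": {}
  }
]

Such a JSON file is uploaded with a PUT to the collection's /schema/feature-store endpoint; if only the indexed counts change on each refresh, the feature definitions themselves can stay as they are.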

On Fri, Oct 26, 2018 at 4:24 PM Midas A  wrote:

> Hi  All,
>
> I am new in implementing solr LTR .  so facing few challenges
> Broadly  we have 3 kind of features
> a) Based on query
> b) based on document
> *c) Based on query-document from click ,cart and order  from tracker data.*
>
> So my question here is how to store c) type of features
>- Old queries and corresponding clicks ((query-clicks)
> - Old query -cart addition  and
>   - Old query -order data
>  into solr to run LTR model
> and secoundly how to build features for query-clicks, query-cart and
> query-orders because we need to refresh  this data frequently.
>
> What approch should i follow .
>
> Hope i am able to explain my problem.
>


RE: Reading data using Tika to Solr

2018-10-26 Thread Martin Frank Hansen (MHQ)
Hi Tim,

It is msg files and I added tika-app-1.14.jar to the build path - and now it 
works 😊 But how do I get it to read the attachments as well?

-Original Message-
From: Tim Allison 
Sent: 25. oktober 2018 21:57
To: solr-user@lucene.apache.org
Subject: Re: Reading data using Tika to Solr

If you’re processing actual msg (not eml), you’ll also need poi and 
poi-scratchpad and their dependencies, but then those msgs could have 
attachments, at which point, you may as just add tika-app. :D

On Thu, Oct 25, 2018 at 2:46 PM Martin Frank Hansen (MHQ) 
wrote:

> Hi Erick and Tim,
>
> Thanks for your answers, I can see that my mail got messed up on the
> way through the server. It looked much more readable at my end 😉 The
> attachment simply included my build-path.
>
> @Erick I am compiling the program using Netbeans at the moment.
>
> I updated to tika-1.7 but that did not help, and I haven't tried maven
> yet but will probably have to give that a chance. I just find it a bit
> odd that I can see the dependencies are included in the jar files I
> added to the project, but I must be missing something?
>
> My buildpath looks as follows:
>
> Tika-parsers-1.4.jar
> Tika-core-1.4.jar
> Commons-io-2.5.jar
> Httpclient-4.5.3
> Httpcore-4.4.6.jar
> Httpmime-4.5.3.jar
> Slf4j-api1-7-24.jar
> Jcl-over--slf4j-1.7.24.jar
> Solr-cell-7.5.0.jar
> Solr-core-7.5.0.jar
> Solr-solrj-7.5.0.jar
> Noggit-0.8.jar
>
>
>
> -Original Message-
> From: Tim Allison 
> Sent: 25. oktober 2018 20:21
> To: solr-user@lucene.apache.org
> Subject: Re: Reading data using Tika to Solr
>
> To follow up w Erick’s point, there are a bunch of transitive
> dependencies from tika-parsers. If you aren’t using maven or similar
> build system to grab the dependencies, it can be tricky to get it
> right. If you aren’t using maven, and you can afford the risks of jar
> hell, consider using tika-app or, better perhaps, tika-server.
>
> Stay tuned for SOLR-11721...
>
> On Thu, Oct 25, 2018 at 1:08 PM Erick Erickson
> 
> wrote:
>
> > Martin:
> >
> > The mail server is pretty aggressive about stripping attachments,
> > your png didn't come though. You might also get a more informed
> > answer on the Tika mailing list.
> >
> > That said (and remember I can't see your png so this may be a silly
> > question), how are you executing the program .vs. compiling it? You
> > mentioned the "build path". I'm usually lazy and just execute it in
> > IntelliJ for development and have forgotten to set my classpath on
> > _numerous_ occasions when running it from a command line ;)
> >
> > Best,
> > Erick
> >
> > On Thu, Oct 25, 2018 at 2:55 AM Martin Frank Hansen (MHQ)
> > 
> > wrote:
> > >
> > > Hi,
> > >
> > >
> > >
> > > I am trying to read content of msg-files using Tika and index
> > > these in
> > Solr, however I am having some problems with the OfficeParser(). I
> > keep getting the error java.lang.NoClassDefFoundError for the
> > OfficeParcer, even though both tika-core and tika-parsers are
> > included
> in the build path.
> > >
> > >
> > >
> > >
> > >
> > > I am using Java with the following code:
> > >
> > >
> > >
> > >
> > >
> > > public static void main(final String[] args) throws IOException, SAXException, TikaException {
> > >
> > >     processDocument(pathtofile)
> > >
> > > }
> > >
> > > private static void processDocument(String pathfilename) {
> > >
> > >     try {
> > >
> > >         File file = new File(pathfilename);
> > >         Metadata meta = new Metadata();
> > >         InputStream input = TikaInputStream.get(file);
> > >         BodyContentHandler handler = new BodyContentHandler();
> > >         Parser parser = new OfficeParser();
> > >         ParseContext context = new ParseContext();
> > >         parser.parse(input, handler, meta, context);
> > >         String doccontent = handler.toString();
> > >
> > >         System.out.println(doccontent);
> > >         System.out.println(meta);
> > >     }
> > > }
> > >
> > > In the buildpath I have the following dependencies:
> > >
> > >
> > >
> > >
> > >
> > > Any help is appreciated.
> > >
> > >
> > >
> > > Thanks in advance.
> > >
> > >
> > >
> > > Best regards,
> > >
> > >
> > >
> > > Martin Hansen
> > >
> > >
> > >
> 

Re: A different result with filters

2018-10-26 Thread Владислав Властовский
Emir, no

Fri, 26 Oct 2018 at 10:17, Emir Arnautović :

> Hi,
> The second query is equivalent to:
> > {
> >  "query": "*:*",
> >  "limit": 0,
> >  "filter": [
> >"{!parent which=kind_s:edition}condition_s:0",
> >"price_i:[* TO 75]"
> >  ]
> > }
>
>
> HTH,
> Emir
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>
>
>
> > On 26 Oct 2018, at 08:49, Владислав Властовский  wrote:
> >
> > Hi, I use 7.5.0 Solr
> >
> > Why do I get two different results for similar requests?
> >
> > First req/res:
> > {
> >  "query": "*:*",
> >  "limit": 0,
> >  "filter": [
> >"{!parent which=kind_s:edition}condition_s:0",
> >"{!parent which=kind_s:edition}price_i:[* TO 75]"
> >  ]
> > }
> >
> > {
> >  "response": {
> >"numFound": 453,
> >"start": 0,
> >"docs": []
> >  }
> > }
> >
> > And second query:
> > {
> >  "query": "*:*",
> >  "limit": 0,
> >  "filter": [
> >"{!parent which=kind_s:edition}condition_s:0 AND price_i:[* TO
> 75]"
> >  ]
> > }
> >
> > {
> >  "response": {
> >"numFound": 452,
> >"start": 0,
> >"docs": []
> >  }
> > }
>
>


LTR features on solr

2018-10-26 Thread Midas A
Hi All,

I am new to implementing Solr LTR, so I am facing a few challenges.
Broadly, we have 3 kinds of features:
a) Based on the query
b) Based on the document
*c) Based on the query-document pair, from click, cart and order tracker data.*

So my question here is how to store the c) type of features
   - old queries and corresponding clicks (query-clicks),
   - old query - cart additions, and
   - old query - order data
into Solr to run the LTR model,
and secondly, how to build features for query-clicks, query-cart and
query-orders, because we need to refresh this data frequently.

What approach should I follow?

I hope I am able to explain my problem.


Re: A different result with filters

2018-10-26 Thread Emir Arnautović
Hi,
The second query is equivalent to:
> {
>  "query": "*:*",
>  "limit": 0,
>  "filter": [
>"{!parent which=kind_s:edition}condition_s:0",
>"price_i:[* TO 75]"
>  ]
> }


HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 26 Oct 2018, at 08:49, Владислав Властовский  wrote:
> 
> Hi, I use 7.5.0 Solr
> 
> Why do I get two different results for similar requests?
> 
> First req/res:
> {
>  "query": "*:*",
>  "limit": 0,
>  "filter": [
>"{!parent which=kind_s:edition}condition_s:0",
>"{!parent which=kind_s:edition}price_i:[* TO 75]"
>  ]
> }
> 
> {
>  "response": {
>"numFound": 453,
>"start": 0,
>"docs": []
>  }
> }
> 
> And second query:
> {
>  "query": "*:*",
>  "limit": 0,
>  "filter": [
>"{!parent which=kind_s:edition}condition_s:0 AND price_i:[* TO 75]"
>  ]
> }
> 
> {
>  "response": {
>"numFound": 452,
>"start": 0,
>"docs": []
>  }
> }