Is it possible to make multiple joins on Solr?

2016-11-08 Thread Lucas Cotta
I was able to implement one join (https://wiki.apache.org/solr/Join) in my
query, but I couldn't find the correct syntax for using multiple joins... is
that possible? Could someone please give me an example?

Thanks!
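
For illustration, multiple joins are usually expressed as separate fq parameters,
each carrying its own {!join} - a rough SolrJ sketch, where the field names
(parent_id, owner_id) and the collection URL are made up:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/myCollection");
SolrQuery q = new SolrQuery("*:*");
// each filter query carries its own join; a document must satisfy both joins
q.addFilterQuery("{!join from=parent_id to=id}color:red");
q.addFilterQuery("{!join from=owner_id to=id}state:CA");
QueryResponse rsp = client.query(q);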


Re: SolrJ optimize method -- not returning immediately when the "wait" options are false

2016-11-08 Thread Yonik Seeley
https://issues.apache.org/jira/browse/SOLR-2018
There used to be a waitFlush parameter (wait until the IndexWriter has
written all the changes) as well as a waitSearcher parameter (wait
until a new searcher has been registered... i.e. whatever changes you
made will be guaranteed to be visible).
The waitFlush parameter was removed because it was never implemented
(we always wait until IW has flushed).  So no, you should not expect
to see an immediate return with waitSearcher=false since it only
represents the open-and-register-searcher part.

-Yonik


On Tue, Nov 8, 2016 at 5:55 PM, Shawn Heisey  wrote:
> I have this code in my SolrJ program:
>
>   LOG.info("{}: background optimizing", logPrefix);
>   myOptimizeSolrClient.optimize(myName, false, false);
>   elapsedMillis = (System.nanoTime() - startNanos) / 1000000; // nanos -> millis
>   LOG.info("{}: Background optimize completed, elapsed={}", logPrefix,
> elapsedMillis);
>
> This is what I get when this code runs.  I expected it to return
> immediately, but it took 49 seconds:
>
> INFO  - 2016-11-08 15:10:56.316;   409; shard.c.inc.inclive.optimize;
> shard.c.inc.inclive: Background optimize completed, elapsed=49339
>
> I'm using SolrJ 5.5.3, and the SolrClient object is HttpSolrClient.  I
> have not tried 6.x versions.  The server that this is talking to is
> 5.3.2-SNAPSHOT.
>
> I found this in solr.log:
>
> 2016-11-08 15:10:56.315 INFO  (qtp1164175787-708968) [   x:inclive]
> org.apache.solr.update.processor.LogUpdateProcessor [inclive]
> webapp=/solr path=/update
> params={optimize=true&maxSegments=1&waitSearcher=true&wt=javabin&version=2}
> {optimize=} 0 49338
>
> It looks like waitSearcher is not being set properly by the SolrJ code.
> I could not see any obvious problem in the master branch, which I
> realize is not the same as the 5.5 code I'm running.
>
> I did try the request manually, both with waitSearcher set to true and
> to false, and in both cases, the request DID wait until the optimize was
> finished before it returned a response.  So even if the SolrJ problem is
> fixed, Solr itself will not work the way I'm expecting.  Is it correct
> to expect an immediate return for optimize when waitSearcher is false?
>
> I am not in a position to try this in 6.x versions.  Is there anyone out
> there who does have a 6.x index they can try it on, see if it's still a
> problem?
>
> Thanks,
> Shawn
>


Re: High CPU Usage in export handler

2016-11-08 Thread Erick Erickson
Joel:

I did a little work with SOLR-9296 to try to reduce the number of
objects created, which would relieve GC pressure both at creation and
collection time. I didn't measure CPU utilization before/after, but I
did see up to an 11% increase in throughput.

It wouldn't hurt my feelings at all to have someone grab that JIRA
away from me since it's pretty obvious I'm not going to get back to it
for a while.

Erick

On Tue, Nov 8, 2016 at 11:44 AM, Ray Niu  wrote:
> Thanks Joel.
>
> 2016-11-08 11:43 GMT-08:00 Joel Bernstein :
>
>> It sounds like your scenario is around 25 queries per second, each pulling
>> entire results. This would be enough to drive up CPU usage, as you have more
>> concurrent requests than CPUs. Since there isn't much IO blocking
>> happening in the scenario you describe, I would expect some pretty busy
>> CPUs.
>>
>> That being said, I think it would be useful to understand exactly where the
>> hotspots are in Lucene to see if we can make this more efficient.
>>
>> Leading up to the 6.4 release I'll try to spend some time understanding the
>> Lucene hotspots with /export. I'll report back to this thread when I have
>> more info.
>>
>>
>> Joel Bernstein
>> http://joelsolr.blogspot.com/
>>
>> On Mon, Nov 7, 2016 at 3:44 PM, Ray Niu  wrote:
>>
>> > Hello:
>> >Any follow up?
>> >
>> > 2016-11-03 11:18 GMT-07:00 Ray Niu :
>> >
>> > > the soft commit is 15 seconds and hard commit is 10 minutes.
>> > >
>> > > 2016-11-03 11:11 GMT-07:00 Erick Erickson :
>> > >
>> > >> Followup question: You say you're indexing 100 docs/second.  How often
>> > >> are you _committing_? Either
>> > >> soft commit
>> > >> or
>> > >> hardcommit with openSearcher=true
>> > >>
>> > >> ?
>> > >>
>> > >> Best,
>> > >> Erick
>> > >>
>> > >> On Thu, Nov 3, 2016 at 11:00 AM, Ray Niu  wrote:
>> > >> > Thanks Joel
>> > >> > here is the information you requested.
>> > >> > Are you doing heavy writes at the time?
>> > >> > we are doing write very frequently, but not very heavy, we will
>> update
>> > >> > about 100 solr document per second.
>> > >> > How many concurrent reads are happening?
>> > >> > the concurrent reads are about 1000-2000 per minute per node
>> > >> > What version of Solr are you using?
>> > >> > we are using solr 5.5.2
>> > >> > What is the field definition for the double, is it docValues?
>> > >> > the field definition is
>> > >> > > > >> > docValues="true"/>
>> > >> >
>> > >> >
>> > >> > 2016-11-03 6:30 GMT-07:00 Joel Bernstein :
>> > >> >
>> > >> >> Are you doing heavy writes at the time?
>> > >> >>
>> > >> >> How many concurrent reads are happening?
>> > >> >>
>> > >> >> What version of Solr are you using?
>> > >> >>
>> > >> >> What is the field definition for the double, is it docValues?
>> > >> >>
>> > >> >>
>> > >> >>
>> > >> >>
>> > >> >> Joel Bernstein
>> > >> >> http://joelsolr.blogspot.com/
>> > >> >>
>> > >> >> On Thu, Nov 3, 2016 at 12:56 AM, Ray Niu 
>> > wrote:
>> > >> >>
>> > >> >> > Hello:
>> > >> >> >We are using export handler in Solr Cloud to get some data, we
>> > >> only
>> > >> >> > request for one field, which type is tdouble, it works well at
>> the
>> > >> >> > beginning, but recently we saw high CPU issue in all the solr
>> cloud
>> > >> >> nodes,
>> > >> >> > we took some thread dump and found following information:
>> > >> >> >
>> > >> >> >java.lang.Thread.State: RUNNABLE
>> > >> >> >
>> > >> >> > at java.lang.Thread.isAlive(Native Method)
>> > >> >> >
>> > >> >> > at
>> > >> >> > org.apache.lucene.util.CloseableThreadLocal.purge(
>> > >> >> > CloseableThreadLocal.java:115)
>> > >> >> >
>> > >> >> > - locked <0x0006e24d86a8> (a java.util.WeakHashMap)
>> > >> >> >
>> > >> >> > at
>> > >> >> > org.apache.lucene.util.CloseableThreadLocal.maybePurge(
>> > >> >> > CloseableThreadLocal.java:105)
>> > >> >> >
>> > >> >> > at
>> > >> >> > org.apache.lucene.util.CloseableThreadLocal.get(
>> > >> >> > CloseableThreadLocal.java:88)
>> > >> >> >
>> > >> >> > at
>> > >> >> > org.apache.lucene.index.CodecReader.getNumericDocValues(
>> > >> >> > CodecReader.java:143)
>> > >> >> >
>> > >> >> > at
>> > >> >> > org.apache.lucene.index.FilterLeafReader.getNumericDocValues(
>> > >> >> > FilterLeafReader.java:430)
>> > >> >> >
>> > >> >> > at
>> > >> >> > org.apache.lucene.uninverting.UninvertingReader.
>> > getNumericDocValues(
>> > >> >> > UninvertingReader.java:239)
>> > >> >> >
>> > >> >> > at
>> > >> >> > org.apache.lucene.index.FilterLeafReader.getNumericDocValues(
>> > >> >> > FilterLeafReader.java:430)
>> > >> >> >
>> > >> >> > Is this a known issue for export handler? As we only fetch up to
>> > 5000
>> > >> >> > documents, it should not be data volume issue.
>> > >> >> >
>> > >> >> > Can anyone help on that? Thanks a lot.
>> > >> 

Re: Parallelize Cursor approach

2016-11-08 Thread Erick Erickson
Hmm, that should work fine. Let us know what the logs show if anything
because this is weird.

Best,
Erick

On Tue, Nov 8, 2016 at 1:00 PM, Chetas Joshi  wrote:
> Hi Erick,
>
> This is how I use the streaming approach.
>
> Here is the solrconfig block.
>
> <requestHandler name="/export" class="solr.SearchHandler">
>   <lst name="invariants">
>     <str name="rq">{!xport}</str>
>     <str name="wt">xsort</str>
>     <str name="distrib">false</str>
>   </lst>
>   <arr name="components">
>     <str>query</str>
>   </arr>
> </requestHandler>
>
> And here is the code in which SolrJ is being used.
>
> String zkHost = args[0];
> String collection = args[1];
>
> Map<String, String> props = new HashMap<>();
> props.put("q", "*:*");
> props.put("qt", "/export");
> props.put("sort", "fieldA asc");
> props.put("fl", "fieldA,fieldB,fieldC");
>
> CloudSolrStream cloudstream = new CloudSolrStream(zkHost,collection,props);
>
> And then I iterate through the cloud stream (TupleStream).
> So I am using streaming expressions (SolrJ).
>
> I have not looked at the Solr logs from when I started getting the JSON parsing
> exceptions, but I will let you know what I see the next time I run into the
> same exceptions.
>
> Thanks
>
> On Sat, Nov 5, 2016 at 9:32 PM, Erick Erickson 
> wrote:
>
>> Hmmm, export is supposed to handle result sets in the tens of millions. I know
>> of a situation where the Streaming Aggregation functionality back-ported
>> to Solr 4.10 processes at that scale. So do you have any clue
>> what exactly is failing? Is there anything in the Solr logs?
>>
>> _How_ are you using /export, through Streaming Aggregation (SolrJ) or
>> just the raw xport handler? It might be worth trying to do this from
>> SolrJ if you're not, it should be a very quick program to write, just
>> to test we're talking 100 lines max.
>>
>> You could always roll your own cursor mark stuff by partitioning the
>> data amongst N threads/processes if you have any reasonable
>> expectation that you could form filter queries that partition the
>> result set anywhere near evenly.
>>
>> For example, let's say you have a field with random numbers between 0
>> and 100. You could spin off 10 cursorMark-aware processes each with
>> its own fq clause like
>>
>> fq=partition_field:[0 TO 10}
>> fq=[10 TO 20}
>> 
>> fq=[90 TO 100]
>>
>> Note the use of inclusive/exclusive end points
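
A minimal sketch of one such cursorMark-aware worker in SolrJ (partition_field,
the id sort field and the page size are placeholders; client is an already-built
SolrClient, and the sort must include the uniqueKey field for cursors to work):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.params.CursorMarkParams;

SolrQuery q = new SolrQuery("*:*");
q.setRows(100000);
q.addSort(SolrQuery.SortClause.asc("id"));      // cursors require a sort ending on the uniqueKey
q.addFilterQuery("partition_field:[0 TO 10}");  // this worker's slice of the data
String cursorMark = CursorMarkParams.CURSOR_MARK_START;
while (true) {
    q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark);
    QueryResponse rsp = client.query(q);
    // process rsp.getResults() here
    String next = rsp.getNextCursorMark();
    if (cursorMark.equals(next)) {
        break;                                  // same mark returned twice: this slice is exhausted
    }
    cursorMark = next;
}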
>>
>> Each one would be totally independent of all others with no
>> overlapping documents. And since the fq's would presumably be cached
>> you should be able to go as fast as you can drive your cluster. Of
>> course you lose query-wide sorting and the like, if that's important
>> you'd need to figure something out there.
>>
>> Do be aware of a potential issue. When regular doc fields are
>> returned, for each document returned, a 16K block of data will be
>> decompressed to get the stored field data. Streaming Aggregation
>> (/xport) reads docValues entries, which are held in MMapDirectory space,
>> so it will be much, much faster. As of Solr 5.5 you can override the
>> decompression stuff, see:
>> https://issues.apache.org/jira/browse/SOLR-8220 for fields that are
>> both stored and docvalues...
>>
>> Best,
>> Erick
>>
>> On Sat, Nov 5, 2016 at 6:41 PM, Chetas Joshi 
>> wrote:
>> > Thanks Yonik for the explanation.
>> >
>> > Hi Erick,
>> > I was using the /xport functionality. But it hasn't been stable (Solr
>> > 5.5.0). I started running into run time Exceptions (JSON parsing
>> > exceptions) while reading the stream of Tuples. This started happening as
>> > the size of my collection increased 3 times and I started running queries
>> > that return millions of documents (>10mm). I don't know if it is the
>> query
>> > result size or the actual data size (total number of docs in the
>> > collection) that is causing the instability.
>> >
>> > org.noggit.JSONParser$ParseException: Expected ',' or '}':
>> > char=5,position=110938 BEFORE='uuid":"0lG99s8vyaKB2I/
>> > I","space":"uuid","timestamp":1 5' AFTER='DB6 474294954},{"uuid":"
>> > 0lG99sHT8P5e'
>> >
>> > I won't be able to move to Solr 6.0 due to some constraints in our
>> > production environment and hence moving back to the cursor approach. Do
>> you
>> > have any other suggestion for me?
>> >
>> > Thanks,
>> > Chetas.
>> >
>> > On Fri, Nov 4, 2016 at 10:17 PM, Erick Erickson > >
>> > wrote:
>> >
>> >> Have you considered the /xport functionality?
>> >>
>> >> On Fri, Nov 4, 2016 at 5:56 PM, Yonik Seeley  wrote:
>> >> > No, you can't get cursor-marks ahead of time.
>> >> > They are the serialized representation of the last sort values
>> >> > encountered (hence not known ahead of time).
>> >> >
>> >> > -Yonik
>> >> >
>> >> >
>> >> > On Fri, Nov 4, 2016 at 8:48 PM, Chetas Joshi 
>> >> wrote:
>> >> >> Hi,
>> >> >>
>> >> >> I am using the cursor approach to fetch results from Solr (5.5.0).
>> Most
>> >> of
>> >> >> my queries return millions of results. Is there a way I can read the
>> >> pages
>> >> >> in parallel? Is there a way I can get all the cursors well in
>> advance?
>> >> >>
>> >> >> 

Re: SolrJ optimize method -- not returning immediately when the "wait" options are false

2016-11-08 Thread Shawn Heisey
On 11/8/2016 3:55 PM, Shawn Heisey wrote:
> I am not in a position to try this in 6.x versions. Is there anyone
> out there who does have a 6.x index they can try it on, see if it's
> still a problem?

I upgraded a dev version of the program to SolrJ 6.2.1 (the newest currently
available via ivy). The server still received waitSearcher=true, even
though my code explicitly said false.  Optimizing the index took 46 seconds.

It's the server side that I can't try on 6.x without a lot of work.

Thanks,
Shawn



Re: edismax, phrase field gets ignored for keyword tokenizer

2016-11-08 Thread Vincenzo D'Amore
Hi Stefan, I've been very busy today; I read your mail but had no time to
write an answer.
So now, at last, everybody around me is sleeping :)

Let's start from the very beginning. Sorry if I didn't get everything about
your first question; I just understood that you're unable to find the phone
number when KeywordTokenizerFactory is enabled.

Let me say again, there isn't anything strange in what you're experiencing
with solr.KeywordTokenizerFactory; you may just need to read up on how
analyzers, tokenizers and filters work.

https://cwiki.apache.org/confluence/display/solr/Understanding+Analyzers,+Tokenizers,+and+Filters

I'm not the best person here to explain your problem, but I'll do my best to
shed some light on it, or at least share what I know, while avoiding all the
complexity of partial matching, boosts and so on.

There are two terms you should know: "index time" and "query time". Index
time occurs when you save your text into a document field, i.e. when your
text is tokenized and stored. Query time occurs when a search is run and
your text is processed in order to be tokenized and searched.

Consider that searches in Solr are usually based on token matching, so query
tokens should match the index tokens as closely as possible, as controlled
by the mm (minimum should match) parameter.

So when you search something with edismax, your phrase is divided into its
component tokens (aka words or terms), and edismax will process all the
tokens across all the fields you have defined in the qf parameter.

So, when you ask for +49 1234 12345678, at search time your query is
divided into three tokens, and each token is searched across the fields (in
your case the field phone_number).

At index time KeywordTokenizerFactory does not tokenize the text, so you
have only one big token, '+49 1234 12345678'. At query time, on the other
hand, edismax is looking for three tokens: +49, 1234 and 12345678.

As you can see, not one of those three tokens matches the single token
stored in the phone_number field.

But when you use StandardTokenizerFactory, your input string '+49 1234
12345678' is tokenized into three tokens, just like edismax does at query
time. I think you can now see what's happening.

Your phrase query has no chance of matching if you don't tokenize the text
at index time in a way edismax will be able to search.

Hope this helps.

Best regards,
Vincenzo






On Tue, Nov 8, 2016 at 10:46 PM, Stefan Matheis 
wrote:

> Any more thoughts on this? The longer i look at this situation, the
> more i’m thinking i’m at fault here - expecting something that isn’t
> to be expected at all?
>
> Whatever is on your mind once you’ve read the mail - don’t keep it to yourself, let me
> know.
>
> -Stefan
>
>
> On November 7, 2016 at 5:23:58 PM, Stefan Matheis
> (matheis.ste...@gmail.com) wrote:
> > Which is everything fine by itself - but doesn’t shed more light on my
> initial question
> > Vincenzo, does it? probably i shouldn’t have mentioned partial matches in
> the first place,
> > that might have led into the wrong direction - they are not relevant
> for now / not for this
> > question.
> >
> > I’d like to know why & where edismax drops out phrase fields which are
> using a Keyword Tokenizer.
> > Maybe there is a larger idea behind this behavior, but i don’t see it
> (yet).
> >
> > -Stefan
> >
> >
> > On November 7, 2016 at 5:09:04 PM, Vincenzo D'Amore (v.dam...@gmail.com)
> wrote:
> > > If you don't want partial matches with edismax you should always use
> > > StandardTokenizerFactory and play with mm parameter.
> > >
> > > On Mon, Nov 7, 2016 at 4:50 PM, Stefan Matheis
> > > wrote:
> > >
> > > > Vincenzo,
> > > >
> > > > thanks for the response - i know that only the Keyword Tokenizer by
> > > > itself does not do anything. as pointed at the end of the initial
> > > > mail, i’m applying a pattern replace for everything non-numeric to
> > > > make it actually useful.
> > > >
> > > > and especially because of the tokenization based on whitespaces i’d
> > > > like to use the very same field once again as phrase field to get around
> > > > this issue. Shawn mentioned in #solr in the meantime that there is
> > > > SOLR-9185 which is similar and would be helpful, but currently very
> > > > very in-the-works.
> > > >
> > > > Standard Tokenizer you’ve mentioned does split on whitespace - as
> > > > edismax does by default in the first place. so i’m not sure how that
> > > > would help? For now, i don’t want to have partial matches on phone
> > > > numbers .. at least not yet.
> > > >
> > > > -Stefan
> > > >
> > > >
> > > > On November 7, 2016 at 4:41:50 PM, Vincenzo D'Amore (
> v.dam...@gmail.com)
> > > > wrote:
> > > > > Hi Stefan,
> > > > >
> > > > > I think the problem is solr.KeywordTokenizerFactory.
> > > > > This tokeniser does not apply any tokenisation to the string, it
> returns
> > > > > exactly what you have.
> > > > >
> > > > > '+49 1234 12345678' -> '+49 1234 12345678'
> > > > >
> > > > > 

SolrJ optimize method -- not returning immediately when the "wait" options are false

2016-11-08 Thread Shawn Heisey
I have this code in my SolrJ program:

  LOG.info("{}: background optimizing", logPrefix);
  myOptimizeSolrClient.optimize(myName, false, false);
  elapsedMillis = (System.nanoTime() - startNanos) / 1000000; // nanos -> millis
  LOG.info("{}: Background optimize completed, elapsed={}", logPrefix,
elapsedMillis);

This is what I get when this code runs.  I expected it to return
immediately, but it took 49 seconds:

INFO  - 2016-11-08 15:10:56.316;   409; shard.c.inc.inclive.optimize;
shard.c.inc.inclive: Background optimize completed, elapsed=49339

I'm using SolrJ 5.5.3, and the SolrClient object is HttpSolrClient.  I
have not tried 6.x versions.  The server that this is talking to is
5.3.2-SNAPSHOT.

I found this in solr.log:

2016-11-08 15:10:56.315 INFO  (qtp1164175787-708968) [   x:inclive]
org.apache.solr.update.processor.LogUpdateProcessor [inclive]
webapp=/solr path=/update
params={optimize=true&maxSegments=1&waitSearcher=true&wt=javabin&version=2}
{optimize=} 0 49338

It looks like waitSearcher is not being set properly by the SolrJ code. 
I could not see any obvious problem in the master branch, which I
realize is not the same as the 5.5 code I'm running.
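
For comparison, the same optimize can be issued by building the update request
explicitly, which makes it easy to inspect the parameters before they are sent -
a sketch, assuming the four-argument setAction overload (waitFlush, waitSearcher,
maxSegments) behaves like SolrClient.optimize():

import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
import org.apache.solr.client.solrj.request.UpdateRequest;
import org.apache.solr.client.solrj.response.UpdateResponse;

UpdateRequest req = new UpdateRequest();
// waitFlush=false, waitSearcher=false, maxSegments=1
req.setAction(AbstractUpdateRequest.ACTION.OPTIMIZE, false, false, 1);
UpdateResponse rsp = req.process(myOptimizeSolrClient, myName);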

I did try the request manually, both with waitSearcher set to true and
to false, and in both cases, the request DID wait until the optimize was
finished before it returned a response.  So even if the SolrJ problem is
fixed, Solr itself will not work the way I'm expecting.  Is it correct
to expect an immediate return for optimize when waitSearcher is false?

I am not in a position to try this in 6.x versions.  Is there anyone out
there who does have a 6.x index they can try it on, see if it's still a
problem?

Thanks,
Shawn



Re: edismax, phrase field gets ignored for keyword tokenizer

2016-11-08 Thread Stefan Matheis
Any more thoughts on this? The longer i look at this situation, the
more i’m thinking i’m at fault here - expecting something that isn’t
to be expected at all?

Whatever is on your mind once you’ve read the mail - don’t keep it to yourself, let me know.

-Stefan


On November 7, 2016 at 5:23:58 PM, Stefan Matheis
(matheis.ste...@gmail.com) wrote:
> Which is everything fine by itself - but doesn’t shed more light on my 
> initial question
> Vincenzo, does it? probably i shouldn’t have mentioned partial matches in the
> first place,
> that might have led into the wrong direction - they are not relevant for now
> / not for this
> question.
>
> I’d like to know why & where edismax drops out phrase fields which are using 
> a Keyword Tokenizer.
> Maybe there is a larger idea behind this behavior, but i don’t see it (yet).
>
> -Stefan
>
>
> On November 7, 2016 at 5:09:04 PM, Vincenzo D'Amore (v.dam...@gmail.com) 
> wrote:
> > If you don't want partial matches with edismax you should always use
> > StandardTokenizerFactory and play with mm parameter.
> >
> > On Mon, Nov 7, 2016 at 4:50 PM, Stefan Matheis
> > wrote:
> >
> > > Vincenzo,
> > >
> > > thanks for the response - i know that only the Keyword Tokenizer by
> > > itself does not do anything. as pointed at the end of the initial
> > > mail, i’m applying a pattern replace for everything non-numeric to
> > > make it actually useful.
> > >
> > > and especially because of the tokenization based on whitespaces i’d
> > > like to use the very same field once again as phrase field to get around
> > > this issue. Shawn mentioned in #solr in the meantime that there is
> > > SOLR-9185 which is similar and would be helpful, but currently very
> > > very in-the-works.
> > >
> > > Standard Tokenizer you’ve mentioned does split on whitespace - as
> > > edismax does by default in the first place. so i’m not sure how that
> > > would help? For now, i don’t want to have partial matches on phone
> > > numbers .. at least not yet.
> > >
> > > -Stefan
> > >
> > >
> > > On November 7, 2016 at 4:41:50 PM, Vincenzo D'Amore (v.dam...@gmail.com)
> > > wrote:
> > > > Hi Stefan,
> > > >
> > > > I think the problem is solr.KeywordTokenizerFactory.
> > > > This tokeniser does not apply any tokenisation to the string, it returns
> > > > exactly what you have.
> > > >
> > > > '+49 1234 12345678' -> '+49 1234 12345678'
> > > >
> > > > On the other hand, using edismax you are looking for '+49', '1234' and
> > > > '12345678' and none of these keywords match your phone_number field.
> > > >
> > > > Try using a different tokenizer like solr.StandardTokenizerFactory, this
> > > > should change your results.
> > > >
> > > > Bests,
> > > > Vincenzo
> > > >
> > > > On Mon, Nov 7, 2016 at 4:05 PM, Stefan Matheis
> > > > wrote:
> > > >
> > > > > I’m guessing that i’m missing something obvious here - so feel free to
> > > > > ask for more details as well as point out other directions i should
> > > > > be following.
> > > > >
> > > > > the problem goes as follows: the input in one case might be a phone
> > > > > number (like +49 1234 12345678), since we’re using edismax the parts
> > > > > get split on whitespaces - which is fine. bringing the same field
> > > > > (based on TextField) to the party (using qf) doesn’t change a thing.
> > > > >
> > > > > > responseHeader:
> > > > > > params:
> > > > > > q: '+49 1234 12345678'
> > > > > > defType: edismax
> > > > > > qf: person_mobile
> > > > > > pf: person_mobile^5
> > > > > > debug:
> > > > > > rawquerystring: '+49 1234 12345678'
> > > > > > querystring: '+49 1234 12345678'
> > > > > > parsedquery: '(+(+DisjunctionMaxQuery((person_mobile:49))
> > > > > DisjunctionMaxQuery((person_mobile:1234)) DisjunctionMaxQuery((person_
> > > mobile:12345678)))
> > > > > ())/no_coord'
> > > > > > parsedquery_toString: '+(+(person_mobile:49) (person_mobile:1234)
> > > > > (person_mobile:12345678)) ()’
> > > > >
> > > > > but .. as far as i was able to reduce the culprit, that only happens
> > > > > when i’m using solr.KeywordTokenizerFactory . as soon as i’m changing
> > > > > that to solr.StandardTokenizerFactory the phrase query appears as
> > > > > expected:
> > > > >
> > > > > > responseHeader:
> > > > > > params:
> > > > > > q: '+49 1234 12345678'
> > > > > > defType: edismax
> > > > > > qf: person_mobile
> > > > > > pf: person_mobile^5
> > > > > > debug:
> > > > > > rawquerystring: '+49 1234 12345678'
> > > > > > querystring: '+49 1234 12345678'
> > > > > > parsedquery: '(+(+DisjunctionMaxQuery((person_mobile:49))
> > > > > DisjunctionMaxQuery((person_mobile:1234)) DisjunctionMaxQuery((person_
> > > mobile:12345678)))
> > > > > DisjunctionMaxQuery(((person_mobile:"49 1234
> > > 12345678")^5.0)))/no_coord'
> > > > > > parsedquery_toString: '+(+(person_mobile:49) (person_mobile:1234)
> > > > > (person_mobile:12345678)) ((person_mobile:"49 1234 12345678")^5.0)’
> > > > >
> > > > > removing the + at the beginning, doesn’t make a difference either
> > > > > (just mentioning since 

Re: Challenges with new Solrcloud Backup/Restore functionality

2016-11-08 Thread Hrishikesh Gadre
Hi Stephen,

Thanks for the update.

Regarding SOLR-9527 - I think we need a unit test for verifying the
"createNodeSet" functionality. I will spend some time on it in the next
couple of days.

Also regarding #2, I found a similar issue (doc count mismatch after
restore) while testing with a large collection (~50GB index size). I have
opened SOLR-9598 to track this. Please take a look and comment if you have
any insight.

-Hrishikesh

On Tue, Nov 8, 2016 at 12:54 PM, Stephen Weiss  wrote:

> Just wanted to note that we tested out the patch from SOLR-9527 and it
> worked perfectly for the balancing issue - thank you so much for that!
>
> As for issue #2, we've resorted to doing a hard commit, stopping all
> indexing against the index, and then taking the backup, and we have a
> reasonably good success rate with that.  The system is set up to
> automatically delete and retry the backup/restore process if the cores
> don't match, so that's allowed us to smooth over that problem and get this
> process out into production.   We've been using it for several weeks now
> without any major issue!
>
> We just looked because Solr 6.3 was out, and wanted to know if we could
> upgrade without patching again, but it appears this ticket hasn't gone
> anywhere yet.  I know one users' testing is probably not enough, but given
> that it seems the patch works just fine, are there any plans to merge it
> into release yet?
>
> --
> Steve
>
> On Tue, Oct 4, 2016 at 6:46 PM, Stephen Lewis > wrote:
> Hi All,
>
> I have been experiencing error#1 too with the current branch_6_2 build. I
> started noticing after I applied my patch to that branch<
> https://issues.apache.org/jira/browse/SOLR-9527> (on issue #2), but it
> appears to occur without the patch as well. I haven't seen this issue with
> solr 6.1.0 despite extensive testing. I haven't confirmed if this occurs on
> the official 6.2.0 release build. I will try to confirm and gather more
> data soon.
>
> As with Stephen Weiss, I also am not seeing any errors logged in the index
> after backup and the task is marked as succeeded. However, after each
> backup which is missing a large amount of data, the restore command fails,
> in the sense that the collection is created, but the initialized cores are
> blank and the logs contain errors about "incomplete segments". I will try
> to research further and get back with more data soon.
>
>
>
> On Mon, Sep 26, 2016 at 11:26 AM, Hrishikesh Gadre  > wrote:
> Hi Stephen,
>
> regarding #1, can you verify the following steps during backup/restore?
>
> - Before backup command, make sure to run a "hard" commit on the original
> collection. The backup operation will capture only hard committed data.
> - After restore command, check the Solr web UI to verify that all replicas
> of the new (or restored) collection are in the "active" state. During my
> testing, I found that when one or more replicas are in "recovery" state,
> the doc count of the restored collection doesn't match the doc count of the
> original collection. But after the recovery is complete, the doc counts
> match. I will file a JIRA to fix this issue.
>
> Thanks
> Hrishikesh
>
> On Mon, Sep 26, 2016 at 9:34 AM, Stephen Weiss  > wrote:
>
> > #2 - that's great news.  I'll try to patch it in and test it out.
> >
> > #1 - In all cases, the backup and restore both appear successful.  There
> > are no failure messages for any of the shards, no warnings, etc - I
> didn't
> > even realize at first that data was missing until I noticed differences
> in
> > some of the query results when we were testing.  Either manual restore of
> > the data or using the restore API (with all data on one node), we see the
> > same, so I think it's more a problem in the backup process than the
> restore
> > process.
> >
> > If there's any kind of debugging output we can provide that can help
> solve
> > this, let me know.
> >
> > --
> > Steve
> >
> > On Sun, Sep 25, 2016 at 7:17 PM, Hrishikesh Gadre  >
> > wrote:
> >
> >> Hi Steve,
> >>
> >> Regarding the 2nd issue, a JIRA is already created and patch is uploaded
> >> (SOLR-9527). Can someone review and commit the patch?
> >>
> >> Regarding the 1st issue, does the backup command succeed? Also do you see any
> >> warning/error log messages? How about the restore command?
> >>
> >> Thanks
> >> Hrishikesh
> >>
> >>
> >>
> >> On Sat, Sep 24, 2016 at 12:14 PM, Stephen Weiss  >
> >> wrote:
> >>
> >>> Hi everyone,
> >>>
> >>> We're very excited about SolrCloud's new backup / restore collection
> >>> APIs, which should introduce some major new efficiencies into our
> indexing
> >>> workflow.  Unfortunately, we've run into some snags with it that are
> >>> preventing us from moving into production.  I was hoping 

Re: Parallelize Cursor approach

2016-11-08 Thread Chetas Joshi
Hi Erick,

This is how I use the streaming approach.

Here is the solrconfig block.

<requestHandler name="/export" class="solr.SearchHandler">
  <lst name="invariants">
    <str name="rq">{!xport}</str>
    <str name="wt">xsort</str>
    <str name="distrib">false</str>
  </lst>
  <arr name="components">
    <str>query</str>
  </arr>
</requestHandler>

And here is the code in which SolrJ is being used.

String zkHost = args[0];
String collection = args[1];

Map<String, String> props = new HashMap<>();
props.put("q", "*:*");
props.put("qt", "/export");
props.put("sort", "fieldA asc");
props.put("fl", "fieldA,fieldB,fieldC");

CloudSolrStream cloudstream = new CloudSolrStream(zkHost,collection,props);

And then I iterate through the cloud stream (TupleStream).
So I am using streaming expressions (SolrJ).
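
For reference, the iteration loop looks roughly like this (a minimal sketch;
the field name matches the fl above):

import org.apache.solr.client.solrj.io.Tuple;

cloudstream.open();
try {
    while (true) {
        Tuple tuple = cloudstream.read();
        if (tuple.EOF) {
            break;                      // the stream signals completion with an EOF tuple
        }
        String a = tuple.getString("fieldA");
        // process fieldB, fieldC the same way ...
    }
} finally {
    cloudstream.close();
}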

I have not looked at the Solr logs from when I started getting the JSON parsing
exceptions, but I will let you know what I see the next time I run into the
same exceptions.

Thanks

On Sat, Nov 5, 2016 at 9:32 PM, Erick Erickson 
wrote:

> Hmmm, export is supposed to handle result sets in the tens of millions. I know
> of a situation where the Streaming Aggregation functionality back-ported
> to Solr 4.10 processes at that scale. So do you have any clue
> what exactly is failing? Is there anything in the Solr logs?
>
> _How_ are you using /export, through Streaming Aggregation (SolrJ) or
> just the raw xport handler? It might be worth trying to do this from
> SolrJ if you're not, it should be a very quick program to write, just
> to test we're talking 100 lines max.
>
> You could always roll your own cursor mark stuff by partitioning the
> data amongst N threads/processes if you have any reasonable
> expectation that you could form filter queries that partition the
> result set anywhere near evenly.
>
> For example, let's say you have a field with random numbers between 0
> and 100. You could spin off 10 cursorMark-aware processes each with
> its own fq clause like
>
> fq=partition_field:[0 TO 10}
> fq=[10 TO 20}
> 
> fq=[90 TO 100]
>
> Note the use of inclusive/exclusive end points
>
> Each one would be totally independent of all others with no
> overlapping documents. And since the fq's would presumably be cached
> you should be able to go as fast as you can drive your cluster. Of
> course you lose query-wide sorting and the like, if that's important
> you'd need to figure something out there.
>
> Do be aware of a potential issue. When regular doc fields are
> returned, for each document returned, a 16K block of data will be
> decompressed to get the stored field data. Streaming Aggregation
> (/xport) reads docValues entries, which are held in MMapDirectory space,
> so it will be much, much faster. As of Solr 5.5 you can override the
> decompression stuff, see:
> https://issues.apache.org/jira/browse/SOLR-8220 for fields that are
> both stored and docvalues...
>
> Best,
> Erick
>
> On Sat, Nov 5, 2016 at 6:41 PM, Chetas Joshi 
> wrote:
> > Thanks Yonik for the explanation.
> >
> > Hi Erick,
> > I was using the /xport functionality. But it hasn't been stable (Solr
> > 5.5.0). I started running into run time Exceptions (JSON parsing
> > exceptions) while reading the stream of Tuples. This started happening as
> > the size of my collection increased 3 times and I started running queries
> > that return millions of documents (>10mm). I don't know if it is the
> query
> > result size or the actual data size (total number of docs in the
> > collection) that is causing the instability.
> >
> > org.noggit.JSONParser$ParseException: Expected ',' or '}':
> > char=5,position=110938 BEFORE='uuid":"0lG99s8vyaKB2I/
> > I","space":"uuid","timestamp":1 5' AFTER='DB6 474294954},{"uuid":"
> > 0lG99sHT8P5e'
> >
> > I won't be able to move to Solr 6.0 due to some constraints in our
> > production environment and hence moving back to the cursor approach. Do
> you
> > have any other suggestion for me?
> >
> > Thanks,
> > Chetas.
> >
> > On Fri, Nov 4, 2016 at 10:17 PM, Erick Erickson  >
> > wrote:
> >
> >> Have you considered the /xport functionality?
> >>
> >> On Fri, Nov 4, 2016 at 5:56 PM, Yonik Seeley  wrote:
> >> > No, you can't get cursor-marks ahead of time.
> >> > They are the serialized representation of the last sort values
> >> > encountered (hence not known ahead of time).
> >> >
> >> > -Yonik
> >> >
> >> >
> >> > On Fri, Nov 4, 2016 at 8:48 PM, Chetas Joshi 
> >> wrote:
> >> >> Hi,
> >> >>
> >> >> I am using the cursor approach to fetch results from Solr (5.5.0).
> Most
> >> of
> >> >> my queries return millions of results. Is there a way I can read the
> >> pages
> >> >> in parallel? Is there a way I can get all the cursors well in
> advance?
> >> >>
> >> >> Let's say my query returns 2M documents and I have set rows=100,000.
> >> >> Can I have multiple threads iterating over different pages like
> >> >> Thread1 -> docs 1 to 100K
> >> >> Thread2 -> docs 101K to 200K
> >> >> ..
> >> >> ..
> >> >>
> >> >> for this to happen, can I get all the cursorMarks for a given query
> so
> >> that
> >> >> I can 

Re: Challenges with new Solrcloud Backup/Restore functionality

2016-11-08 Thread Stephen Weiss
Just wanted to note that we tested out the patch from SOLR-9527 and it worked 
perfectly for the balancing issue - thank you so much for that!

As for issue #2, we've resorted to doing a hard commit, stopping all indexing 
against the index, and then taking the backup, and we have a reasonably good 
success rate with that.  The system is set up to automatically delete and retry 
the backup/restore process if the cores don't match, so that's allowed us to 
smooth over that problem and get this process out into production.   We've been 
using it for several weeks now without any major issue!

We just looked because Solr 6.3 was out, and wanted to know if we could upgrade 
without patching again, but it appears this ticket hasn't gone anywhere yet.  I 
know one user's testing is probably not enough, but given that it seems the
patch works just fine, are there any plans to merge it into release yet?

--
Steve

On Tue, Oct 4, 2016 at 6:46 PM, Stephen Lewis 
> wrote:
Hi All,

I have been experiencing error#1 too with the current branch_6_2 build. I 
started noticing after I applied my patch to that 
branch (on issue #2), but it 
appears to occur without the patch as well. I haven't seen this issue with solr 
6.1.0 despite extensive testing. I haven't confirmed if this occurs on the 
official 6.2.0 release build. I will try to confirm and gather more data soon.

As with Stephen Weiss, I also am not seeing any errors logged in the index 
after backup and the task is marked as succeeded. However, after each backup 
which is missing a large amount of data, the restore command fails, in the 
sense that the collection is created, but the initialized cores are blank and 
the logs contain errors about "incomplete segments". I will try to research 
further and get back with more data soon.



On Mon, Sep 26, 2016 at 11:26 AM, Hrishikesh Gadre 
> wrote:
Hi Stephen,

regarding #1, can you verify the following steps during backup/restore?

- Before backup command, make sure to run a "hard" commit on the original
collection. The backup operation will capture only hard committed data.
- After restore command, check the Solr web UI to verify that all replicas
of the new (or restored) collection are in the "active" state. During my
testing, I found that when one or more replicas are in "recovery" state,
the doc count of the restored collection doesn't match the doc count of the
original collection. But after the recovery is complete, the doc counts
match. I will file a JIRA to fix this issue.
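
For the hard-commit step and the post-restore check, a rough SolrJ sketch
(collection names are placeholders; solrClient/cloudClient are assumed to be
existing SolrClient/CloudSolrClient instances, and a complete check would also
confirm that each replica's node is live):

import org.apache.solr.common.cloud.DocCollection;
import org.apache.solr.common.cloud.Replica;

// Hard commit and wait for a new searcher so the backup captures all indexed data.
solrClient.commit("myCollection", /* waitFlush */ true, /* waitSearcher */ true);

// After the restore, confirm every replica of the restored collection is active
// before comparing document counts.
DocCollection coll = cloudClient.getZkStateReader()
        .getClusterState().getCollection("restoredCollection");
boolean allActive = coll.getReplicas().stream()
        .allMatch(r -> r.getState() == Replica.State.ACTIVE);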

Thanks
Hrishikesh

On Mon, Sep 26, 2016 at 9:34 AM, Stephen Weiss 
> wrote:

> #2 - that's great news.  I'll try to patch it in and test it out.
>
> #1 - In all cases, the backup and restore both appear successful.  There
> are no failure messages for any of the shards, no warnings, etc - I didn't
> even realize at first that data was missing until I noticed differences in
> some of the query results when we were testing.  Either manual restore of
> the data or using the restore API (with all data on one node), we see the
> same, so I think it's more a problem in the backup process than the restore
> process.
>
> If there's any kind of debugging output we can provide that can help solve
> this, let me know.
>
> --
> Steve
>
> On Sun, Sep 25, 2016 at 7:17 PM, Hrishikesh Gadre 
> >
> wrote:
>
>> Hi Steve,
>>
>> Regarding the 2nd issue, a JIRA is already created and patch is uploaded
>> (SOLR-9527). Can someone review and commit the patch?
>>
>> Regarding the 1st issue, does the backup command succeed? Also do you see any
>> warning/error log messages? How about the restore command?
>>
>> Thanks
>> Hrishikesh
>>
>>
>>
>> On Sat, Sep 24, 2016 at 12:14 PM, Stephen Weiss 
>> >
>> wrote:
>>
>>> Hi everyone,
>>>
>>> We're very excited about SolrCloud's new backup / restore collection
>>> APIs, which should introduce some major new efficiencies into our indexing
>>> workflow.  Unfortunately, we've run into some snags with it that are
>>> preventing us from moving into production.  I was hoping someone on the
>>> list could help.
>>>
>>> 1) Data inconsistencies
>>>
>>> There seems to be a problem getting all the data consistently.
>>> Sometimes, the backup will contain all of the data in the collection, and
>>> sometimes, large portions of the collection (as much as 40%) will be
>>> missing.  We haven't quite figured out what might cause this yet, although
>>> one thing I've noticed is the chances of success are greater when we are
>>> only backing up one collection at a time.  Unfortunately, for our workflow,
>>> it will be difficult to make that work, and there still doesn't seem to be
>>> a guarantee of success either way.
>>>
>>> 2) Shards are not distributed
>>>

Re: Re-register a deleted Collection SorlCloud

2016-11-08 Thread Chetas Joshi
I won't be able to achieve the correct mapping as I did not store the
mapping info anywhere. I don't know if core-node1 was mapped to
shard1_replica1 or shard2_replica1 in my old collection. But I am not
worried about that as I am not going to update any existing document.

 This is what I did.

I created a new collection with the same schema and the same config.
Shut the SolrCloud down.
Then I copied the data directory.


hadoop fs -cp hdfs://prod/solr53/collection_old/*
hdfs://prod/solr53/collection_new/


Re-started the SolrCloud and I could see documents in the Solr UI when I
queried using the "/select" handler.


Thanks!



On Mon, Nov 7, 2016 at 2:59 PM, Erick Erickson 
wrote:

> You've got it. You should be quite safe if you
> 1> create the same number of shards as you used to have
> 2> match the shard bits. I.e. collection1_shard1_replica1 as long as
> the collection1_shard# parts match you should be fine. If this isn't
> done correctly, the symptom will be that when you update an existing
> document, you may have two copies returned eventually.
>
> Best,
> Erick
>
> On Mon, Nov 7, 2016 at 1:47 PM, Chetas Joshi 
> wrote:
> > Thanks Erick.
> >
> > I had replicationFactor=1 in my old collection and going to have the same
> > config for the new collection.
> > When I create a new collection with number of Shards =20 and max shards
> per
> > node = 1, the shards are going to start on 20 hosts out of my 25 hosts
> Solr
> > cluster. When you say "get each shard's index to the corresponding shard
> on
> > your new collection", do you mean the following?
> >
> > shard1_replica1 -> core_node1 (old collection)
> > shard1_replica1 -> has to be core_node1 (new collection) (I don't have
> this
> > mapping for the old collection as the collection no longer exists!!)
> >
> > Thanks,
> > Chetas.
> >
> > On Mon, Nov 7, 2016 at 1:03 PM, Erick Erickson 
> > wrote:
> >
> >> That should work. The caveat here is that you need to get each
> >> shard's index to the corresponding shard on your new collection.
> >>
> >> Of course I'd back up _all_ of these indexes before even starting.
> >>
> >> And one other trick. First create your collection with 1 replica per
> >> shard (leader-only). Then copy the indexes (and, btw, I'd have the
> >> associated Solr nodes down during the copy) and verify the collection
> >> is as you'd expect.
> >>
> >> Now use ADDREPLICA to expand your collection, that'll handle the
> >> copying from the leader correctly.
> >>
> >> Best,
> >> Erick
> >>
> >> On Mon, Nov 7, 2016 at 12:49 PM, Chetas Joshi 
> >> wrote:
> >> > I have a Solr Cloud deployed on top of HDFS.
> >> >
> >> > I accidentally deleted a collection using the collection API. So,
> >> ZooKeeper
> >> > cluster has lost all the info related to that collection. I don't
> have a
> >> > backup that I can restore from. However, I have indices and
> transaction
> >> > logs on HDFS.
> >> >
> >> > If I create a new collection and copy the existing data directory to
> the
> >> > data directory path of the new collection I have created, will I be
> able
> >> to
> >> > go back to the state where I was? Is there anything else I would have
> to
> >> do?
> >> >
> >> > Thanks,
> >> >
> >> > Chetas.
> >>
>


Re: OOM Error

2016-11-08 Thread Susheel Kumar
Hello,

Ran into an OOM error again, right after two weeks. Below is the GC log viewer
graph.  The first time we ran into this was after 3 months, and the second
time was two weeks later. After the first incident we reduced the cache size
and increased the heap from 8 to 10G.  Interestingly, the query and ingestion
load is like any normal day, and heap utilisation remains stable and then
suddenly jumps to 2x.

We are looking to reproduce this in a test environment by producing similar
queries/ingestion, but we are wondering if we are running into some memory
leak or bug like "SOLR-8922 - DocSetCollector can allocate massive garbage on
large indexes" which could cause this issue.  Also, we have frequent updates
and wonder whether not optimizing the index can result in this situation.

Any thoughts ?

GC Viewer

https://www.dropbox.com/s/bb29ub5q2naljdl/gc_log_snapshot.png?dl=0




On Wed, Oct 26, 2016 at 10:47 AM, Susheel Kumar 
wrote:

> Hi Toke,
>
> I think your guess is right.  We have ingestion running in batches.  We
> have 6 shards & 6 replicas on 12 VM's each around 40+ million docs on each
> shard.
>
> Thanks everyone for the suggestions/pointers.
>
> Thanks,
> Susheel
>
> On Wed, Oct 26, 2016 at 1:52 AM, Toke Eskildsen 
> wrote:
>
>> On Tue, 2016-10-25 at 15:04 -0400, Susheel Kumar wrote:
>> > Thanks, Toke.  Analyzing GC logs helped to determine that it was a
>> > sudden
>> > death.
>>
>> > The peaks in last 20 mins... See   http://tinypic.com/r/n2zonb/9
>>
>> Peaks yes, but there is a pattern of
>>
>> 1) Stable memory use
>> 2) Temporary doubling of the memory used and a lot of GC
>> 3) Increased (relative to last stable period) but stable memory use
>> 4) Goto 2
>>
>> Should I guess, I would say that you are running ingests in batches,
>> which temporarily causes 2 searchers to be open at the same time. That
>> is 2 in the list above. After the batch ingest, the baseline moves up,
>> assumedly because your have added quite a lot of documents, relative to
>> the overall number of documents.
>>
>>
>> The temporary doubling of the baseline is hard to avoid, but I am
>> surprised of the amount of heap that you need in the stable periods.
>> Just to be clear: This is from a Solr with 8GB of heap handling only 1
>> shard of 20GB and you are using DocValues? How many documents do you
>> have in such a shard?
>>
>> - Toke Eskildsen, State and University Library, Denmark
>>
>
>


Solr boost function taking precedence over relevance boosting

2016-11-08 Thread ruby
I have the following query (which was working until I migrated to Solr 5.1)
with boost:

http://localhost:8983/solr/?wt=json={!boost+b=recip(ms(NOW,modification_date),3.16e-11,1,1)}{!boost+b=recip(ms(NOW,creation_date),3.16e-11,1,1)}Copy_field:(bolt)^10
OR object_name:(bolt)^300

The above query would list the objects with the name "bolt" first and then
other objects having the "bolt" value in other properties. If two objects had
"bolt" in the name, it would boost the ones with recent modification/creation
dates.

This worked until I moved to Solr 5.1. If I remove the date boost functions,
then objects having "bolt" in the name are listed first. However, I still
want to apply the date boost functions to make sure that if multiple objects
having "bolt" in the name are returned, the application shows the recent ones
first.

Did something change in the Solr date boost functions?
If not, then is there something wrong with my query?
Is my understanding of how the date boost functions combine with relevancy
boosting correct?

Thanks,
Ruby





Re: High CPU Usage in export handler

2016-11-08 Thread Ray Niu
Thanks Joel.

2016-11-08 11:43 GMT-08:00 Joel Bernstein :

> It sounds like your scenario is around 25 queries per second, each pulling
> entire results. This would be enough to drive up CPU usage, as you have more
> concurrent requests than CPUs. Since there isn't much IO blocking
> happening in the scenario you describe, I would expect some pretty busy
> CPUs.
>
> That being said, I think it would be useful to understand exactly where the
> hotspots are in Lucene to see if we can make this more efficient.
>
> Leading up to the 6.4 release I'll try to spend some time understanding the
> Lucene hotspots with /export. I'll report back to this thread when I have
> more info.
>
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Mon, Nov 7, 2016 at 3:44 PM, Ray Niu  wrote:
>
> > Hello:
> >Any follow up?
> >
> > 2016-11-03 11:18 GMT-07:00 Ray Niu :
> >
> > > the soft commit is 15 seconds and hard commit is 10 minutes.
> > >
> > > 2016-11-03 11:11 GMT-07:00 Erick Erickson :
> > >
> > >> Followup question: You say you're indexing 100 docs/second.  How often
> > >> are you _committing_? Either
> > >> soft commit
> > >> or
> > >> hardcommit with openSearcher=true
> > >>
> > >> ?
> > >>
> > >> Best,
> > >> Erick
> > >>
> > >> On Thu, Nov 3, 2016 at 11:00 AM, Ray Niu  wrote:
> > >> > Thanks Joel
> > >> > here is the information you requested.
> > >> > Are you doing heavy writes at the time?
> > >> > we are doing write very frequently, but not very heavy, we will
> update
> > >> > about 100 solr document per second.
> > >> > How many concurrent reads are happening?
> > >> > the concurrent reads are about 1000-2000 per minute per node
> > >> > What version of Solr are you using?
> > >> > we are using solr 5.5.2
> > >> > What is the field definition for the double, is it docValues?
> > >> > the field definition is
> > >> >  > >> > docValues="true"/>
> > >> >
> > >> >
> > >> > 2016-11-03 6:30 GMT-07:00 Joel Bernstein :
> > >> >
> > >> >> Are you doing heavy writes at the time?
> > >> >>
> > >> >> How many concurrent reads are happening?
> > >> >>
> > >> >> What version of Solr are you using?
> > >> >>
> > >> >> What is the field definition for the double, is it docValues?
> > >> >>
> > >> >>
> > >> >>
> > >> >>
> > >> >> Joel Bernstein
> > >> >> http://joelsolr.blogspot.com/
> > >> >>
> > >> >> On Thu, Nov 3, 2016 at 12:56 AM, Ray Niu 
> > wrote:
> > >> >>
> > >> >> > Hello:
> > >> >> >We are using export handler in Solr Cloud to get some data, we
> > >> only
> > >> >> > request for one field, which type is tdouble, it works well at
> the
> > >> >> > beginning, but recently we saw high CPU issue in all the solr
> cloud
> > >> >> nodes,
> > >> >> > we took some thread dump and found following information:
> > >> >> >
> > >> >> >java.lang.Thread.State: RUNNABLE
> > >> >> >
> > >> >> > at java.lang.Thread.isAlive(Native Method)
> > >> >> >
> > >> >> > at
> > >> >> > org.apache.lucene.util.CloseableThreadLocal.purge(
> > >> >> > CloseableThreadLocal.java:115)
> > >> >> >
> > >> >> > - locked <0x0006e24d86a8> (a java.util.WeakHashMap)
> > >> >> >
> > >> >> > at
> > >> >> > org.apache.lucene.util.CloseableThreadLocal.maybePurge(
> > >> >> > CloseableThreadLocal.java:105)
> > >> >> >
> > >> >> > at
> > >> >> > org.apache.lucene.util.CloseableThreadLocal.get(
> > >> >> > CloseableThreadLocal.java:88)
> > >> >> >
> > >> >> > at
> > >> >> > org.apache.lucene.index.CodecReader.getNumericDocValues(
> > >> >> > CodecReader.java:143)
> > >> >> >
> > >> >> > at
> > >> >> > org.apache.lucene.index.FilterLeafReader.getNumericDocValues(
> > >> >> > FilterLeafReader.java:430)
> > >> >> >
> > >> >> > at
> > >> >> > org.apache.lucene.uninverting.UninvertingReader.
> > getNumericDocValues(
> > >> >> > UninvertingReader.java:239)
> > >> >> >
> > >> >> > at
> > >> >> > org.apache.lucene.index.FilterLeafReader.getNumericDocValues(
> > >> >> > FilterLeafReader.java:430)
> > >> >> >
> > >> >> > Is this a known issue for export handler? As we only fetch up to
> > 5000
> > >> >> > documents, it should not be data volume issue.
> > >> >> >
> > >> >> > Can anyone help on that? Thanks a lot.
> > >> >> >
> > >> >>
> > >>
> > >
> > >
> >
>


Re: High CPU Usage in export handler

2016-11-08 Thread Joel Bernstein
It sounds like your scenario is around 25 queries per second, each pulling
entire results. This would be enough to drive up CPU usage, as you have more
concurrent requests than CPUs. Since there isn't much IO blocking
happening in the scenario you describe, I would expect some pretty busy
CPUs.

That being said, I think it would be useful to understand exactly where the
hotspots are in Lucene to see if we can make this more efficient.

Leading up to the 6.4 release I'll try to spend some time understanding the
Lucene hotspots with /export. I'll report back to this thread when I have
more info.


Joel Bernstein
http://joelsolr.blogspot.com/

On Mon, Nov 7, 2016 at 3:44 PM, Ray Niu  wrote:

> Hello:
>Any follow up?
>
> 2016-11-03 11:18 GMT-07:00 Ray Niu :
>
> > the soft commit is 15 seconds and hard commit is 10 minutes.
> >
> > 2016-11-03 11:11 GMT-07:00 Erick Erickson :
> >
> >> Followup question: You say you're indexing 100 docs/second.  How often
> >> are you _committing_? Either
> >> soft commit
> >> or
> >> hardcommit with openSearcher=true
> >>
> >> ?
> >>
> >> Best,
> >> Erick
> >>
> >> On Thu, Nov 3, 2016 at 11:00 AM, Ray Niu  wrote:
> >> > Thanks Joel
> >> > here is the information you requested.
> >> > Are you doing heavy writes at the time?
> >> > we are doing write very frequently, but not very heavy, we will update
> >> > about 100 solr document per second.
> >> > How many concurrent reads are happening?
> >> > the concurrent reads are about 1000-2000 per minute per node
> >> > What version of Solr are you using?
> >> > we are using solr 5.5.2
> >> > What is the field definition for the double, is it docValues?
> >> > the field definition is
> >> >  >> > docValues="true"/>
> >> >
> >> >
> >> > 2016-11-03 6:30 GMT-07:00 Joel Bernstein :
> >> >
> >> >> Are you doing heavy writes at the time?
> >> >>
> >> >> How many concurrent reads are happening?
> >> >>
> >> >> What version of Solr are you using?
> >> >>
> >> >> What is the field definition for the double, is it docValues?
> >> >>
> >> >>
> >> >>
> >> >>
> >> >> Joel Bernstein
> >> >> http://joelsolr.blogspot.com/
> >> >>
> >> >> On Thu, Nov 3, 2016 at 12:56 AM, Ray Niu 
> wrote:
> >> >>
> >> >> > Hello:
> >> >> >We are using export handler in Solr Cloud to get some data, we
> >> only
> >> >> > request for one field, which type is tdouble, it works well at the
> >> >> > beginning, but recently we saw high CPU issue in all the solr cloud
> >> >> nodes,
> >> >> > we took some thread dump and found following information:
> >> >> >
> >> >> >java.lang.Thread.State: RUNNABLE
> >> >> >
> >> >> > at java.lang.Thread.isAlive(Native Method)
> >> >> >
> >> >> > at
> >> >> > org.apache.lucene.util.CloseableThreadLocal.purge(
> >> >> > CloseableThreadLocal.java:115)
> >> >> >
> >> >> > - locked <0x0006e24d86a8> (a java.util.WeakHashMap)
> >> >> >
> >> >> > at
> >> >> > org.apache.lucene.util.CloseableThreadLocal.maybePurge(
> >> >> > CloseableThreadLocal.java:105)
> >> >> >
> >> >> > at
> >> >> > org.apache.lucene.util.CloseableThreadLocal.get(
> >> >> > CloseableThreadLocal.java:88)
> >> >> >
> >> >> > at
> >> >> > org.apache.lucene.index.CodecReader.getNumericDocValues(
> >> >> > CodecReader.java:143)
> >> >> >
> >> >> > at
> >> >> > org.apache.lucene.index.FilterLeafReader.getNumericDocValues(
> >> >> > FilterLeafReader.java:430)
> >> >> >
> >> >> > at
> >> >> > org.apache.lucene.uninverting.UninvertingReader.
> getNumericDocValues(
> >> >> > UninvertingReader.java:239)
> >> >> >
> >> >> > at
> >> >> > org.apache.lucene.index.FilterLeafReader.getNumericDocValues(
> >> >> > FilterLeafReader.java:430)
> >> >> >
> >> >> > Is this a known issue for export handler? As we only fetch up to
> 5000
> >> >> > documents, it should not be data volume issue.
> >> >> >
> >> >> > Can anyone help on that? Thanks a lot.
> >> >> >
> >> >>
> >>
> >
> >
>


[ANNOUNCE] Apache Solr 6.3.0 released

2016-11-08 Thread Shalin Shekhar Mangar
8 November 2016, Apache Solr 6.3.0 available

Solr is the popular, blazing fast, open source NoSQL search platform
from the Apache Lucene project. Its major features include powerful
full-text search, hit highlighting, faceted search and analytics, rich
document parsing, geospatial search, extensive REST APIs as well as
parallel SQL. Solr is enterprise grade, secure and highly scalable,
providing fault tolerant distributed search and indexing, and powers
the search and navigation features of many of the world's largest
internet sites.

Solr 6.3.0 is available for immediate download at:

 . http://lucene.apache.org/solr/mirrors-solr-latest-redir.html

Please read CHANGES.txt for a full list of new features and changes:

 . https://lucene.apache.org/solr/6_3_0/changes/Changes.html

Solr 6.3 Release Highlights:

DocValues, streaming, /export, machine learning:
* Optimize, store and deploy AI models in Solr
* Ability to add custom streaming expressions
* New streaming expressions such as "fetch", "executor", and "commit" added.
* Parallel SQL accepts <, >, =, etc., symbols.
* Support facet scoring with the scoreNodes expression
* Retrieving docValues as stored values was sped up by using the
proper leaf reader rather than asking for a global view.  In extreme
cases, this leads to a 100x speedup.

Faceting:
* facet.method=enum can bypass exact count calculation with
facet.exists=true; it just returns 1 for terms which exist in the result
docset
* Add "overrequest" parameter to JSON Facet API to control the amount of
overrequest on a distributed terms facet

Logging:
* You can now set Solr's log level through environment variable SOLR_LOG_LEVEL
* GC logs are rotated by JVM to a max of 9 files, and backed up via
bin/solr scripts
* Solr's logging verbosity at the INFO level has been greatly reduced
by moving much logging to DEBUG level
* The solr-8983-console.log file now only logs STDOUT and STDERR
output, not all log4j logs as before
* Solr's main log file, solr.log, is now written to SOLR_LOGS_DIR
without changing log4j.properties

Start scripts:
* Allow 180 seconds for shutdown before killing solr (configurable,
old limit 5s) (Unix only)
* Start scripts now exit with an informative message if using the wrong Java version
* Fixed the "bin/solr.cmd zk upconfig" command, which was broken on Windows
* You can now ask for DEBUG logging simply with the '-v' option, and for
WARN logging with the '-q' option

SolrCloud:
* The DELETEREPLICA API can accept a 'count' parameter and remove
"count" number of replicas from each shard if the shard name is not
provided
* The config API shows expanded useParams for request handlers inline
* Ability to create/delete/list snapshots at collection level
* The modify collection API now waits for the modified properties to
show up in the cluster state before returning
* Many bug fixes related to SolrCloud recovery for data safety and
faster recovery times.

Security:
* SolrJ now supports Kerberos delegation tokens
* Pooled SSL connections were not being re-used. This is now fixed.
* Fix for the blockUnknown property which made inter-node
communication impossible
* Support SOLR_AUTHENTICATION_OPTS and
SOLR_AUTHENTICATION_CLIENT_CONFIGURER in the Windows bin/solr.cmd script
* New parameter -u in bin/post to pass basic auth credentials

Misc changes:
* Optimizations to lower memory allocations when indexing JSON as well
as for replication between SolrCloud nodes.
* A new Excel workbook (.xlsx) response writer has been added. Use the
'wt=xlsx' request parameter on a query request to enable it.
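
For instance (collection and field names here are hypothetical), a request
such as the following would stream query results as an Excel workbook:

  http://localhost:8983/solr/mycollection/select?q=*:*&fl=id,name,price&rows=1000&wt=xlsx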

Further details of changes are available in the change log available
at: http://lucene.apache.org/solr/6_3_0/changes/Changes.html

Please report any feedback to the mailing lists
(http://lucene.apache.org/solr/discussion.html)

Note: The Apache Software Foundation uses an extensive mirroring
network for distributing releases. It is possible that the mirror you
are using may not have replicated the release yet. If that is the
case, please try another mirror. This also applies to Maven access.


-- 
Regards,
Shalin Shekhar Mangar.


Re: Backup to HDFS while running cluster on local disk

2016-11-08 Thread Hrishikesh Gadre
Hi Mike,

Thanks for bringing this up. You can certainly back up the index data stored
on the local file-system to HDFS.

The HDFS backup repository implementation uses the same configuration
properties as expected by the HDFS directory factory. Here is a
description of the parameters:

   - location (Optional) - This configuration parameter defines the default
   location where the backups can be stored. If this parameter is not
   configured, then you will need to explicitly specify the location parameter
   to your backup and restore commands.
   - solr.hdfs.home (Required) - This configuration parameter defines the
   fully qualified URI for the root path of HDFS. e.g. hdfs://name-node-1/. In
   case the index files are also stored on HDFS, this path refers to the
   directory used to store index files in HDFS e.g. hdfs://name-node-1/solr
   - solr.hdfs.confdir (Optional) - A directory (on local file-system)
   which contains the configuration files for HDFS (e.g. hdfs-site.xml,
   core-site.xml etc.)
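
As a concrete illustration, a filled-in repository entry in solr.xml could
look like the following (the namenode address, HDFS paths and Hadoop conf
directory below are placeholders, not values from this thread):

  <backup>
    <repository name="hdfs"
        class="org.apache.solr.core.backup.repository.HdfsBackupRepository"
        default="false">
      <str name="location">/backups/solr</str>
      <str name="solr.hdfs.home">hdfs://name-node-1:8020</str>
      <str name="solr.hdfs.confdir">/etc/hadoop/conf</str>
    </repository>
  </backup>

A backup can then be requested against this repository via the Collections
API, e.g. /admin/collections?action=BACKUP&name=nightly&collection=mycollection&repository=hdfs
(the backup and collection names are again placeholders).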


I will also update the docs accordingly.

-Hrishikesh


On Tue, Nov 8, 2016 at 3:36 AM, Mike Thomsen  wrote:

> We have SolrCloud running on bare metal but want the nightly snapshots to
> be written to HDFS. Can someone give me some help on configuring the
> HdfsBackupRepository?
>
> <backup>
>   <repository name="hdfs"
>       class="org.apache.solr.core.backup.repository.HdfsBackupRepository"
>       default="false">
>     <str name="location">${solr.hdfs.default.backup.path}</str>
>     <str name="solr.hdfs.home">${solr.hdfs.home:}</str>
>     <str name="solr.hdfs.confdir">${solr.hdfs.confdir:}</str>
>   </repository>
> </backup>
>
> Not sure how to proceed on configuring this because the documentation is a
> bit sparse on what some of those values mean in this context. The example
> looked geared toward someone using HDFS both to store the index and do
> backup/restore.
>
> Thanks,
>
> Mike
>


Re: Search across nested child docs

2016-11-08 Thread Vinod Singh
Yes, that works well. Somehow I missed mentioning the full condition in my
previous message. This is what I am looking for -

fq={!parent which=PARENT_DOC_TYPE:PARENT}((childA_field_1:234 AND
childA_field_2:3) OR (childA_field_1:432 AND childA_field_2:6))

Regards,
Vinod





Data indexing with full import issue

2016-11-08 Thread Aniket Khare
Hi,

I am facing an issue with delta indexing implemented via full import and
SortedMapBackedCache, as shown below.

https://wiki.apache.org/solr/DataImportHandlerDeltaQueryViaFullImport

cacheKey="id" cacheLookup="parent.id" processor="SqlEntityProcessor"
cacheImpl="SortedMapBackedCache"
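
For context, a fuller sketch of such a cached child entity in
data-config.xml (the entity names, SQL and column names below are
placeholders, not the actual configuration in use) would be along these
lines:

  <document>
    <entity name="parent"
            query="SELECT id, name, last_modified FROM parent
                   WHERE last_modified > '${dataimporter.last_index_time}'">
      <entity name="child"
              query="SELECT parent_id AS id, detail FROM child"
              cacheKey="id" cacheLookup="parent.id"
              processor="SqlEntityProcessor"
              cacheImpl="SortedMapBackedCache"/>
    </entity>
  </document>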

The delta is showing up in debug and verbose mode, but it is not reflected
in Solr. Please note that I am using commit=true along with debug and verbose.

-- 
Regards,

Aniket S. Khare


Re: Search across nested child docs

2016-11-08 Thread Mikhail Khludnev
Given 'across two child docs', I think you are looking for
fq={!parent which=PARENT_DOC_TYPE:PARENT}childA_field_1:432&fq={!parent
which=PARENT_DOC_TYPE:PARENT}childA_field_2:6

On Mon, Nov 7, 2016 at 9:45 PM, Vinod Singh  wrote:

> I have nested documents indexed in SOLR 6.2. The block join query works
> well
> on both parent and child documents. My use case has a scenario where a
> condition needs to be fulfilled across two child docs as shown below -
>
> fq={!parent which=PARENT_DOC_TYPE:PARENT}(childA_field_1:432 AND
> childA_field_2:6)
>
> But this does not give any results even though the indexed documents have
> data that fulfills the condition.
>
> How can I have search condition that spans multiple child docs ?
>
> Regards,
> Vinod
>
>
>
>



-- 
Sincerely yours
Mikhail Khludnev


Backup to HDFS while running cluster on local disk

2016-11-08 Thread Mike Thomsen
We have SolrCloud running on bare metal but want the nightly snapshots to
be written to HDFS. Can someone give me some help on configuring the
HdfsBackupRepository?



<backup>
  <repository name="hdfs"
      class="org.apache.solr.core.backup.repository.HdfsBackupRepository"
      default="false">
    <str name="location">${solr.hdfs.default.backup.path}</str>
    <str name="solr.hdfs.home">${solr.hdfs.home:}</str>
    <str name="solr.hdfs.confdir">${solr.hdfs.confdir:}</str>
  </repository>
</backup>



Not sure how to proceed on configuring this because the documentation is a
bit sparse on what some of those values mean in this context. The example
looked geared toward someone using HDFS both to store the index and do
backup/restore.

Thanks,

Mike


Edismax query parsing in Solr 4 vs Solr 6

2016-11-08 Thread Max Bridgewater
I am migrating a Solr-based app from Solr 4 to Solr 6. One of the
discrepancies I am noticing is around edismax query parsing. My code makes
the following call:


  String userQuery = "+(title:shirts isbn:shirts) +(id:20446 id:82876)";
  Query query = QParser.getParser(userQuery, "edismax", req).getQuery();


With Solr 4, query becomes:

+(+(title:shirt isbn:shirts) +(id:20446 id:82876))

With Solr 6 it however becomes:

+(+(+title:shirt +isbn:shirts) +(+id:20446 +id:82876))

Digging deeper, it appears that parseOriginalQuery() in
ExtendedDismaxQParser is adding those additional + signs.


Is there a way to prevent this altering of queries?
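
(One avenue worth checking, under the assumption that the extra '+'
operators come from a stricter q.op/mm default taking effect in 6.x rather
than from the parser itself, is to force those parameters explicitly on the
request before parsing. The sketch below only illustrates overriding the
request params in the style of the snippet above; it is not a confirmed fix.)

  import org.apache.lucene.search.Query;
  import org.apache.solr.common.params.ModifiableSolrParams;
  import org.apache.solr.search.QParser;

  // req and userQuery are the objects from the snippet above.
  ModifiableSolrParams params = new ModifiableSolrParams(req.getParams());
  params.set("q.op", "OR");   // or params.set("mm", "0%") to relax minimum-match
  req.setParams(params);
  Query query = QParser.getParser(userQuery, "edismax", req).getQuery();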

Thanks,
Max.