Re: MatchAllDocsQuery is much slower in solr5.3.1 compare to solr4.7

2015-11-06 Thread wei
Thanks Yonik.

A JIRA issue has been opened:
https://issues.apache.org/jira/browse/SOLR-8251

Wei

On Fri, Nov 6, 2015 at 7:10 PM, Yonik Seeley  wrote:

> On Fri, Nov 6, 2015 at 9:56 PM, wei  wrote:
> > Good point! I tried that, on solr5 the query time is around 100-110ms,
> and
> > on solr4 it is around 60-63ms(very consistent). Solr5 is slower.
>
> When it's something easy, there comes a point when it makes sense to
> stop asking more questions and just try it yourself...
> I just did this, and can confirm what you're seeing.   For me, 5.3.1
> is about 5x slower than 4.10 for this particular query.
> Thanks for your persistence / patience in reporting this.  Could you
> open a JIRA issue for it?
>
> -Yonik
>


Re: Is it impossible to update an index that is undergoing an optimize?

2015-11-06 Thread Ishan Chattopadhyaya
On Sat, Nov 7, 2015 at 9:09 AM, Yonik Seeley  wrote:

> On Fri, Nov 6, 2015 at 10:20 PM, Shawn Heisey  wrote:
> >  Is there a decent API for getting uniqueKey?
>
> Not off the top of my head.
> I deeply regret making it configurable and not just using "id" ;-)
>

Maybe this?
https://cwiki.apache.org/confluence/display/solr/Schema+API#SchemaAPI-ListUniqueKey
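For what it's worth, the same information is exposed at /schema/uniquekey, and a
recent SolrJ 5.x client can fetch it roughly like this (a sketch assuming the
SchemaRequest.UniqueKey helper; the core URL is just an example):

    SolrClient client = new HttpSolrClient("http://localhost:8983/solr/collection1");
    SchemaRequest.UniqueKey req = new SchemaRequest.UniqueKey();
    SchemaResponse.UniqueKeyResponse rsp = req.process(client);
    String uniqueKey = rsp.getUniqueKey();   // e.g. "id"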


>
> -Yonik
>


Re: Is it impossible to update an index that is undergoing an optimize?

2015-11-06 Thread Yonik Seeley
On Fri, Nov 6, 2015 at 10:20 PM, Shawn Heisey  wrote:
>  Is there a decent API for getting uniqueKey?

Not off the top of my head.
I deeply regret making it configurable and not just using "id" ;-)

-Yonik


Re: Is it impossible to update an index that is undergoing an optimize?

2015-11-06 Thread Shawn Heisey
On 11/6/2015 6:18 PM, Yonik Seeley wrote:
> On Wed, Nov 4, 2015 at 3:36 PM, Shawn Heisey  wrote:
>> The specific index update that fails during the optimize is the SolrJ
>> deleteByQuery call.
> 
> deleteByQuery may be the outlier here... we have to jump through extra
> hoops internally because we don't know which documents it will affect.
> Normal adds and deletes should proceed in parallel though.

I'm not doing the delete query on the uniqueKey field.  It's on a
separate (but also unique) "delete id" field.  Because I query each
shard before I actually do the delete, I could retrieve the uniqueKey
field and then issue a standard delete request with the IDs that I receive.

I would prefer to be able to query some Solr API to determine the
uniqueKey field name, so I don't need to keep that field name in my
configuration, and then use that info in the query.  I would definitely
prefer not to use the "files" API to retrieve schema.xml, because I
would then need to parse XML.  Is there a decent API for getting uniqueKey?

Thanks,
Shawn
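A minimal SolrJ sketch of the query-then-delete-by-id approach described above
(core URL, field names, and the query are hypothetical):

    SolrClient client = new HttpSolrClient("http://localhost:8983/solr/mycore");  // one client per shard
    SolrQuery q = new SolrQuery("delete_id:(101 102 103)");   // the separate "delete id" field
    q.setFields("id");                                        // the uniqueKey field
    q.setRows(1000);
    QueryResponse rsp = client.query(q);
    List<String> ids = new ArrayList<>();
    for (SolrDocument doc : rsp.getResults()) {
        ids.add(doc.getFieldValue("id").toString());
    }
    if (!ids.isEmpty()) {
        client.deleteById(ids);   // plain deletes, no delete-by-query involved
    }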



Re: MatchAllDocsQuery is much slower in solr5.3.1 compare to solr4.7

2015-11-06 Thread Yonik Seeley
On Fri, Nov 6, 2015 at 9:56 PM, wei  wrote:
> Good point! I tried that, on solr5 the query time is around 100-110ms, and
> on solr4 it is around 60-63ms(very consistent). Solr5 is slower.

When it's something easy, there comes a point when it makes sense to
stop asking more questions and just try it yourself...
I just did this, and can confirm what you're seeing.   For me, 5.3.1
is about 5x slower than 4.10 for this particular query.
Thanks for your persistence / patience in reporting this.  Could you
open a JIRA issue for it?

-Yonik


Re: MatchAllDocsQuery is much slower in solr5.3.1 compare to solr4.7

2015-11-06 Thread wei
Good point! I tried that; on solr5 the query time is around 100-110ms, and
on solr4 it is around 60-63ms (very consistent). Solr5 is slower.

Thanks,
Wei

On Fri, Nov 6, 2015 at 6:46 PM, Yonik Seeley  wrote:

> On Fri, Nov 6, 2015 at 9:30 PM, wei  wrote:
> > in solr 5.3.1, there is actually a boost, and the score is product of
> boost
> > & queryNorm.
>
> Hmmm, well, it's worth putting on the list of stuff to investigate.
> Boosting was also changed in lucene.
>
> What happens if you try this multiple times in a row?
>
> &rows=2&fl=id&q={!cache=false}*:*&fq=categoryIdsPath:1001
>
> (basically just add {!cache=false} as a prefix to the main query.)
>
> This would allow hotspot time to compile methods, and ensure that the
> filter query was cached, and do a better job of isolating the
> "filtered match-all-docs" part of the execution.
>
> -Yonik
>


Re: MatchAllDocsQuery is much slower in solr5.3.1 compare to solr4.7

2015-11-06 Thread Yonik Seeley
On Fri, Nov 6, 2015 at 9:30 PM, wei  wrote:
> in solr 5.3.1, there is actually a boost, and the score is product of boost
> & queryNorm.

Hmmm, well, it's worth putting on the list of stuff to investigate.
Boosting was also changed in lucene.

What happens if you try this multiple times in a row?

&rows=2&fl=id&q={!cache=false}*:*&fq=categoryIdsPath:1001

(basically just add {!cache=false} as a prefix to the main query.)

This would allow hotspot time to compile methods, and ensure that the
filter query was cached, and do a better job of isolating the
"filtered match-all-docs" part of the execution.

-Yonik
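A SolrJ version of that experiment, for anyone repeating it (core URL and the
number of iterations are arbitrary):

    SolrClient client = new HttpSolrClient("http://localhost:8983/solr/mycore");
    SolrQuery q = new SolrQuery("{!cache=false}*:*");
    q.addFilterQuery("categoryIdsPath:1001");
    q.setFields("id");
    q.setRows(2);
    for (int i = 0; i < 10; i++) {
        QueryResponse rsp = client.query(q);
        System.out.println("run " + i + " QTime=" + rsp.getQTime() + "ms");
    }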


Re: MatchAllDocsQuery is much slower in solr5.3.1 compare to solr4.7

2015-11-06 Thread wei
Hi Shawn,

I took care of the warm-up problem during the test. I set up a JMeter project,
took a query log from our production (>10 queries), and ran the same query
log through JMeter to hit the Solr instances at the same qps (about 40). I
removed the warmup queries from both Solr setups and also set the cache
autowarm counts to 0 in solrconfig. I ran the test for 1 hour. The two
instances are not serving other query traffic, but they both receive update
traffic. I disabled softCommit in solr5 and set the hardCommit interval to 2
minutes. The solr4 instance is a slave node replicating from a solr4 master
instance; the master also has a 2-minute commit cycle, and the test solr4
instance replicates the index every 2 minutes.

Solr5 is slower than solr4. After some investigation I realized that the
queries containing q=*:* seem to be causing the problem. I split the query
log into two log files, one with q=*:* and one without (almost all of our
queries have filter queries). When I ran the tests, solr5 was faster on the
log with query keywords, but much slower on the "q=*:*" log.

There is no other query traffic to either of the two instances (there is
index traffic). When I captured the query debug output in my first email, I
made sure the filter cache was empty (verified through the Solr admin
console; after a hard commit the filterCache is cleared).

I hope my email addresses your concerns about how I ran the test. What is
obvious to me is that solr5 is faster in one test (with query keywords) and
slower in the other test (without query keywords).

Thanks,
Wei

On Fri, Nov 6, 2015 at 1:41 PM, Shawn Heisey  wrote:

> On 11/6/2015 1:01 PM, wei wrote:
> > Thanks Jack and Shawn. I checked these Jira tickets, but I am not sure if
> > the slowness of MatchAllDocsQuery is also caused by the removal of
> > fieldcache. Can someone please explain a little bit?
>
> I only glanced at your full output in the message at the start of this
> thread.  I thought I saw facet output in it, but it turns out that the
> only mention of facets was the timing information from the debug, so
> that very likely rules out the FieldCache change as a culprit.
>
> I am suspecting that the 4.7 index is warmed better, and may have the
> specific filter query (categoryIdsPath:1001)already sitting in the
> filterCache.
>
> Try running that query a few of times on both versions, then restart
> Solr on both versions so they both start clean, and run the query *once*
> on each system, and see whether there's still a large discrepancy.
>
> If one of the systems is receiving queries from active clients and the
> other is not, then the comparison will be unfair, and biased towards the
> one that is getting additional queries.  Query activity, even if it
> seems unrelated to the query you are testing, has a tendency to reduce
> overall qtime values.
>
> Thanks,
> Shawn
>
>


Re: MatchAllDocsQuery is much slower in solr5.3.1 compare to solr4.7

2015-11-06 Thread wei
The explain output is different in solr4.7 and solr 5.3.1. In solr 4.7,
there is only the queryNorm line:

 
 1.0 = (MATCH) MatchAllDocsQuery, product of:
  1.0 = queryNorm
 1.0 = (MATCH) MatchAllDocsQuery, product of:
  1.0 = queryNorm
  

in solr 5.3.1, there is actually a boost, and the score is product of boost
& queryNorm.

Could that cause the problem, if solr5 needs to calculate the product for
all the hits? I am not sure where the boost comes from, and why it is
different from solr4.7.

  
 1.0 = *:*, product of:
  1.0 = boost
  1.0 = queryNorm
 1.0 = *:*, product of:
  1.0 = boost
  1.0 = queryNorm
  


Re: Is it impossible to update an index that is undergoing an optimize?

2015-11-06 Thread Walter Underwood
It is pretty handy, though. Great for expunging docs that are marked deleted or 
are expired.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Nov 6, 2015, at 5:31 PM, Alexandre Rafalovitch  wrote:
> 
> Elasticsearch removed deleteByQuery from the core all together.
> Definitely an outlier :-)
> 
> Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
> http://www.solr-start.com/
> 
> 
> On 6 November 2015 at 20:18, Yonik Seeley  wrote:
>> On Wed, Nov 4, 2015 at 3:36 PM, Shawn Heisey  wrote:
>>> The specific index update that fails during the optimize is the SolrJ
>>> deleteByQuery call.
>> 
>> deleteByQuery may be the outlier here... we have to jump through extra
>> hoops internally because we don't know which documents it will affect.
>> Normal adds and deletes should proceed in parallel though.
>> 
>> -Yonik



Re: MatchAllDocsQuery is much slower in solr5.3.1 compare to solr4.7

2015-11-06 Thread wei
Hi Jack,

I also ran the test with queries that have query terms (with filters too).
Solr5 is faster compared to solr4 in that test. I got the query set from
our production log; almost all of our queries have filters. So that
suggests to me that it is not the filter query that is slow.

I copied the fq query into the q field (I did not remove the fq, though);
solr5 is slightly faster than solr 4 for that query:

solr4:


   
  responseHeader: status=0, QTime=64
  params: fl=id, start=0, q=+categoryIdsPath:1001, fq=+categoryIdsPath:1001,
          rows=2, debug enabled
  docs returned: 36652255, 36651884
  rawquerystring / querystring / parsedquery / parsedquery_toString:
      +categoryIdsPath:1001
  explain (docs 19 and 44, identical scores):
      20.451632 = (MATCH) weight(categoryIdsPath:1001), result of:
        20.451632 = score(freq=1.0), product of:
          4.522348 = queryWeight, product of:
            4.522348 = idf(docFreq=610392, maxDocs=20670250)
            1.0 = queryNorm
          4.522348 = fieldWeight, product of:
            1.0 = tf(freq=1.0)
            4.522348 = idf(docFreq=610392, maxDocs=20670250)
            1.0 = fieldNorm
  QParser: LuceneQParser
  filter_queries / parsed_filter_queries: +categoryIdsPath:1001
  timing: 63.0 ms total
      prepare: 3.0 ms (query component 3.0, all others 0.0)
      process: 60.0 ms (query component 57.0, debug component 3.0, all others 0.0)


solr5:


   
  responseHeader: status=0, QTime=51
  params: fl=id, start=0, q=+categoryIdsPath:1001, fq=+categoryIdsPath:1001,
          rows=2, debug enabled
  docs returned: 36652255, 36651884
  rawquerystring / querystring / parsedquery / parsedquery_toString:
      +categoryIdsPath:1001
  explain (docs 20 and 49, identical scores):
      20.420362 = weight(categoryIdsPath:1001), result of:
        20.420362 = score(freq=1.0), product of:
          4.5188894 = queryWeight, product of:
            4.5188894 = idf(docFreq=602005, maxDocs=20315855)
            1.0 = queryNorm
          4.5188894 = fieldWeight, product of:
            1.0 = tf(freq=1.0)
            4.5188894 = idf(docFreq=602005, maxDocs=20315855)
            1.0 = fieldNorm
  QParser: LuceneQParser
  filter_queries / parsed_filter_queries: +categoryIdsPath:1001
  timing: 51.0 ms total
      prepare: 1.0 ms (query component 1.0, all others 0.0)
      process: 50.0 ms (query component 48.0, debug component 2.0, all others 0.0)



On Fri, Nov 6, 2015 at 12:12 PM, Jack Krupansky 
wrote:

> Just to be clear, I was suggesting that the filter query (fq) was slow, not
> the MatchAllDocsQuery, which should be just as speedy as before. You can
> test for yourself whether the MADQ by itself is any slower.
>
> You could also test using the fq as the main query (q) - with no fq
> parameter, and see if that is a

Re: Is it impossible to update an index that is undergoing an optimize?

2015-11-06 Thread Alexandre Rafalovitch
Elasticsearch removed deleteByQuery from the core all together.
Definitely an outlier :-)

Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/


On 6 November 2015 at 20:18, Yonik Seeley  wrote:
> On Wed, Nov 4, 2015 at 3:36 PM, Shawn Heisey  wrote:
>> The specific index update that fails during the optimize is the SolrJ
>> deleteByQuery call.
>
> deleteByQuery may be the outlier here... we have to jump through extra
> hoops internally because we don't know which documents it will affect.
> Normal adds and deletes should proceed in parallel though.
>
> -Yonik


Re: Is it impossible to update an index that is undergoing an optimize?

2015-11-06 Thread Yonik Seeley
On Wed, Nov 4, 2015 at 3:36 PM, Shawn Heisey  wrote:
> The specific index update that fails during the optimize is the SolrJ
> deleteByQuery call.

deleteByQuery may be the outlier here... we have to jump through extra
hoops internally because we don't know which documents it will affect.
Normal adds and deletes should proceed in parallel though.

-Yonik


Re: MatchAllDocsQuery is much slower in solr5.3.1 compare to solr4.7

2015-11-06 Thread Yonik Seeley
On Fri, Nov 6, 2015 at 3:12 PM, Jack Krupansky  wrote:
> Just to be clear, I was suggesting that the filter query (fq) was slow

That's a possibility.  Filters were actually removed in Lucene, so
it's a very different code path now.

In 4.10, filters were first class, and SolrIndexSearcher used methods like:
search(query, pf.filter, collector);
And BitSet based filters were pushed down to the leaves of a query
(which the filter generated from MatchAllDocsQuery would have been).

At some point, those were changed to use FilteredQuery instead.  But I
think at some point prior Lucene converted a Filter to a
FilteredQuery, so that change in Solr may not have mattered at that
point.

Then in LUCENE-6583, Filters were removed and the code in
SolrIndexSearcher was changed to use a BooleanQuery:
    if (pf.filter != null) {
      Query query = new BooleanQuery.Builder()
          .add(main, Occur.MUST)
          .add(pf.filter, Occur.FILTER)
          .build();
      search(query, collector);
    }

So... lots of changes over time, no idea which (if any) is the cause.

-Yonik
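For anyone who wants to isolate this at the Lucene level, a rough sketch of the
kind of query Solr now builds — MatchAllDocsQuery with the filter attached as a
FILTER clause (index path and field/value are just examples):

    DirectoryReader reader = DirectoryReader.open(FSDirectory.open(Paths.get("/path/to/index")));
    IndexSearcher searcher = new IndexSearcher(reader);
    Query q = new BooleanQuery.Builder()
        .add(new MatchAllDocsQuery(), BooleanClause.Occur.MUST)
        .add(new TermQuery(new Term("categoryIdsPath", "1001")), BooleanClause.Occur.FILTER)
        .build();
    long start = System.nanoTime();
    TopDocs hits = searcher.search(q, 2);
    System.out.println(hits.totalHits + " hits in "
        + (System.nanoTime() - start) / 1000000 + " ms");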


Re: data import extremely slow

2015-11-06 Thread Yangrui Guo
Thanks for the reply. I just removed CacheKeyLookUp and CachedKey and used
WHERE clause instead. Everything works fine now.

Yangrui

On Friday, November 6, 2015, Shawn Heisey  wrote:

> On 11/6/2015 10:32 AM, Yangrui Guo wrote:
> > 
> There's a good chance that JDBC is trying to read the entire result set
> (all three million rows) into memory before sending any of that info to
> Solr.
>
> Set the batchSize to -1 for MySQL so that it will stream results to Solr
> as soon as they are available, and not wait for all of them.  Here's
> more info on the situation, which frequently causes OutOfMemory problems
> for users:
>
>
> http://wiki.apache.org/solr/DataImportHandlerFaq?highlight=%28mysql%29|%28batchsize%29#I.27m_using_DataImportHandler_with_a_MySQL_database._My_table_is_huge_and_DataImportHandler_is_going_out_of_memory._Why_does_DataImportHandler_bring_everything_to_memory.3F
>
>
> Thanks,
> Shawn
>
>
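Under the hood, batchSize="-1" in the DIH JDBC dataSource amounts to MySQL's
streaming-result mode; a rough plain-JDBC equivalent of that behaviour
(connection details are placeholders):

    Connection conn = DriverManager.getConnection("jdbc:mysql://dbhost/imdb", "user", "pass");
    Statement stmt = conn.createStatement(ResultSet.TYPE_FORWARD_ONLY,
                                          ResultSet.CONCUR_READ_ONLY);
    stmt.setFetchSize(Integer.MIN_VALUE);   // tells Connector/J to stream rows instead of buffering them all
    ResultSet rs = stmt.executeQuery("SELECT id FROM movie");
    while (rs.next()) {
        // hand each row to Solr as it arrives
    }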


Re: Trying to apply patch for SOLR-7036

2015-11-06 Thread r b
Ah, thanks for that. The 4.10 branch was it. If I have time, I'll
study up on what this patch is doing and see if I can't port it to 5x.

On Fri, Nov 6, 2015 at 6:24 AM, Shawn Heisey  wrote:
> On 11/5/2015 7:04 PM, r b wrote:
>> I just wanted to double check that my steps were not too off base.
>>
>> I am trying to apply the patch from 8/May/15 and it seems to be
>> slightly off. Inside the working revision is 1658487 so I checked that
>> out from svn. This is what I did.
>>
>> svn checkout
>> http://svn.apache.org/repos/asf/lucene/dev/trunk@1658487 lucene_trunk
>> cd lucene_trunk/solr
>> curl 
>> https://issues.apache.org/jira/secure/attachment/12731517/SOLR-7036.patch
>> | patch -p0
>>
>> But `patch` still fails on a few hunks. I figured this patch was made
>> with `svn diff` so it should apply smoothly to that same revision,
>> shouldn't it?
>
> Erick had the same problem with the patch back in July, and asked the
> submitter to update the patch to trunk.  I tried applying the patch to
> branch_5x at the specified revision and that failed too.
>
> When I pulled down that specific revision of the lucene_solr_4_10
> branch, then it would cleanly apply.  There are vast differences between
> all 4.x branches/tags and the newer branches, which is why you cannot
> get the patch applied.  A huge amount of work went into the code for
> version 5.0.0, and the work on trunk and branch_5x since that release
> has been enormous.
>
> Getting this patch into 5x or trunk is going to require a lot of manual
> work.  The original patch author is best qualified to do that work.  If
> you want to tackle the job, feel free.  If you do so, please upload a
> new patch to the issue.
>
> Thanks,
> Shawn
>


Re: MatchAllDocsQuery is much slower in solr5.3.1 compare to solr4.7

2015-11-06 Thread Shawn Heisey
On 11/6/2015 1:01 PM, wei wrote:
> Thanks Jack and Shawn. I checked these Jira tickets, but I am not sure if
> the slowness of MatchAllDocsQuery is also caused by the removal of
> fieldcache. Can someone please explain a little bit?

I only glanced at your full output in the message at the start of this
thread.  I thought I saw facet output in it, but it turns out that the
only mention of facets was the timing information from the debug, so
that very likely rules out the FieldCache change as a culprit.

I am suspecting that the 4.7 index is warmed better, and may have the
specific filter query (categoryIdsPath:1001) already sitting in the
filterCache.

Try running that query a few times on both versions, then restart
Solr on both versions so they both start clean, and run the query *once*
on each system, and see whether there's still a large discrepancy.

If one of the systems is receiving queries from active clients and the
other is not, then the comparison will be unfair, and biased towards the
one that is getting additional queries.  Query activity, even if it
seems unrelated to the query you are testing, has a tendency to reduce
overall qtime values.

Thanks,
Shawn



Re: data import extremely slow

2015-11-06 Thread Shawn Heisey
On 11/6/2015 10:32 AM, Yangrui Guo wrote:
> http://wiki.apache.org/solr/DataImportHandlerFaq?highlight=%28mysql%29|%28batchsize%29#I.27m_using_DataImportHandler_with_a_MySQL_database._My_table_is_huge_and_DataImportHandler_is_going_out_of_memory._Why_does_DataImportHandler_bring_everything_to_memory.3F


Thanks,
Shawn



Re: Is it impossible to update an index that is undergoing an optimize?

2015-11-06 Thread Shawn Heisey
On 11/6/2015 2:23 PM, Pushkar Raste wrote:
> I may be wrong but I think 'delete' and 'optimize' can not be executed
> concurrently on a Lucene index

It certainly is looking that way.  After discussing it with Hoss on IRC,
I tried a manual test where I started an optimize and then did some
"add" requests from the admin UI.  Those succeeded, but whenever my
SolrJ code does a delete request on an index that is being optimized,
that request blocks until either the optimize finishes or the SO_TIMEOUT
(15 minutes) that I configured on HttpClient is reached.

My SolrJ code always starts with any deletes that need to happen, then
moves onto other changes like reinserts and adds.

If delete and optimize cannot happen at the same time, is that a bug? 
This is happening on both 4.9.1 and 5.2.1.

Thanks,
Shawn
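A rough SolrJ sketch of the manual reproduction described above (core URL, ids,
and the sleep are arbitrary; exception handling elided):

    HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/mycore");
    new Thread(() -> {
        try {
            client.optimize();                      // long-running optimize
        } catch (Exception e) {
            e.printStackTrace();
        }
    }).start();
    Thread.sleep(2000);                             // let the optimize get going
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "test-add-during-optimize");
    client.add(doc);                                // adds complete while optimizing
    client.deleteByQuery("id:test-delete-target");  // observed to block until the optimize ends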



Re: Is it impossible to update an index that is undergoing an optimize?

2015-11-06 Thread Pushkar Raste
I may be wrong but I think 'delete' and 'optimize' can not be executed
concurrently on a Lucene index

On 4 November 2015 at 15:36, Shawn Heisey  wrote:

> On 11/4/2015 1:17 PM, Yonik Seeley wrote:
> > On Wed, Nov 4, 2015 at 3:06 PM, Shawn Heisey 
> wrote:
> >> I had understood that since 4.0, Solr (Lucene) can continue to update an
> >> index even while that index is optimizing.
> > Yes, that should be the case.
> >
> >> I have discovered in the logs of my SolrJ index maintenance program that
> >> this does not appear to actually be true.
> > Hmmm, perhaps some other resource is getting exhausted, like number of
> > background merges hit the limit?
>
> I hope it's a misconfiguration, not a bug.
>
> Below is my indexConfig.  I have already increased maxMergeCount because
> without that, full-import from MySQL will stop processing updates during
> a large merge, and the pause is long enough that the JDBC connection
> times out and closes.
>
> <indexConfig>
>   <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
>     <int name="maxMergeAtOnce">35</int>
>     <int name="segmentsPerTier">35</int>
>     <int name="maxMergeAtOnceExplicit">105</int>
>   </mergePolicy>
>   <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
>     <int name="maxThreadCount">1</int>
>     <int name="maxMergeCount">6</int>
>   </mergeScheduler>
>   <ramBufferSizeMB>48</ramBufferSizeMB>
>   <infoStream>false</infoStream>
> </indexConfig>
>
> The specific index update that fails during the optimize is the SolrJ
> deleteByQuery call.
>
> Thanks,
> Shawn
>
>


Re: MatchAllDocsQuery is much slower in solr5.3.1 compare to solr4.7

2015-11-06 Thread Jack Krupansky
Just to be clear, I was suggesting that the filter query (fq) was slow, not
the MatchAllDocsQuery, which should be just as speedy as before. You can
test for yourself whether the MADQ by itself is any slower.

You could also test using the fq as the main query (q) - with no fq
parameter, and see if that is a lot faster, both with old and new Solr.

-- Jack Krupansky

On Fri, Nov 6, 2015 at 3:01 PM, wei  wrote:

> Thanks Jack and Shawn. I checked these Jira tickets, but I am not sure if
> the slowness of MatchAllDocsQuery is also caused by the removal of
> fieldcache. Can someone please explain a little bit?
>
> Thanks,
> Wei
>
> On Fri, Nov 6, 2015 at 7:15 AM, Shawn Heisey  wrote:
>
> > On 11/5/2015 10:25 PM, Jack Krupansky wrote:
> > > I vaguely recall some discussion concerning removal of the field cache
> in
> > > Lucene.
> >
> > The FieldCache wasn't exactly *removed* ... it's more like it was
> > renamed, improved, and sort of hidden in a miscellaneous package.  Some
> > things still require this functionality, so they use the hidden class
> > instead, which was changed to use the DocValues API.
> >
> > https://issues.apache.org/jira/browse/LUCENE-5666
> >
> > I am not qualified to discuss LUCENE-5666 beyond what I wrote in the
> > paragraph above, and it's possible that some of what I said is wrong
> > because I do not really understand the APIs involved.
> >
> > The change has caused problems for Solr.  End result from Solr's
> > perspective: Certain things which used to work perfectly fine (mostly
> > facets and grouping) in Solr 4.x have one of two problems in 5.x:
> > Either they don't work at all, or performance has gone way down.  Some
> > of these problems are documented in Jira.  These are the issues I know
> > about:
> >
> > https://issues.apache.org/jira/browse/SOLR-8088
> > https://issues.apache.org/jira/browse/SOLR-7495
> > https://issues.apache.org/jira/browse/SOLR-8096
> >
> > For fields where adding docValues is a viable option (most field types
> > other than solr.TextField), adding docValues and reindexing is very
> > likely to solve those problems.
> >
> > Sometimes adding docValues won't work, either because the field type
> > doesn't allow it, or because it's the indexed terms that are needed, not
> > the original field value.  For those situations, there is currently no
> > solution.
> >
> > Thanks,
> > Shawn
> >
> >
>


Re: MatchAllDocsQuery is much slower in solr5.3.1 compare to solr4.7

2015-11-06 Thread wei
Thanks Jack and Shawn. I checked these Jira tickets, but I am not sure if
the slowness of MatchAllDocsQuery is also caused by the removal of
fieldcache. Can someone please explain a little bit?

Thanks,
Wei

On Fri, Nov 6, 2015 at 7:15 AM, Shawn Heisey  wrote:

> On 11/5/2015 10:25 PM, Jack Krupansky wrote:
> > I vaguely recall some discussion concerning removal of the field cache in
> > Lucene.
>
> The FieldCache wasn't exactly *removed* ... it's more like it was
> renamed, improved, and sort of hidden in a miscellaneous package.  Some
> things still require this functionality, so they use the hidden class
> instead, which was changed to use the DocValues API.
>
> https://issues.apache.org/jira/browse/LUCENE-5666
>
> I am not qualified to discuss LUCENE-5666 beyond what I wrote in the
> paragraph above, and it's possible that some of what I said is wrong
> because I do not really understand the APIs involved.
>
> The change has caused problems for Solr.  End result from Solr's
> perspective: Certain things which used to work perfectly fine (mostly
> facets and grouping) in Solr 4.x have one of two problems in 5.x:
> Either they don't work at all, or performance has gone way down.  Some
> of these problems are documented in Jira.  These are the issues I know
> about:
>
> https://issues.apache.org/jira/browse/SOLR-8088
> https://issues.apache.org/jira/browse/SOLR-7495
> https://issues.apache.org/jira/browse/SOLR-8096
>
> For fields where adding docValues is a viable option (most field types
> other than solr.TextField), adding docValues and reindexing is very
> likely to solve those problems.
>
> Sometimes adding docValues won't work, either because the field type
> doesn't allow it, or because it's the indexed terms that are needed, not
> the original field value.  For those situations, there is currently no
> solution.
>
> Thanks,
> Shawn
>
>


Re: Solr results relevancy / scoring

2015-11-06 Thread Doug Turnbull
You might paste your URL into http://splainer.io and it will explain your
results ranking to you in a perhaps more helpful way

-Doug

On Fri, Nov 6, 2015 at 2:04 PM, Brian Narsi  wrote:

> I have a situation where.
>
> User search query
>
> q=15%
>
> Solr results contain several documents that are
>
> 15%
> 15%
> 15%
> 15%
> 15 (why?)
> 15%
> 15%
>
> I have debugged the query and can see that the score for 15 is higher than
> the ones below it.
>
> Why is that? Where can I read in detail about how the scoring is being
> done?
>
> Thanks
>



-- 
*Doug Turnbull **| *Search Relevance Consultant | OpenSource Connections
, LLC | 240.476.9983
Author: Relevant Search 
This e-mail and all contents, including attachments, is considered to be
Company Confidential unless explicitly stated otherwise, regardless
of whether attachments are marked as such.


Re: Solr results relevancy / scoring

2015-11-06 Thread Erick Erickson
I'm not sure what the question you're asking is. You say
that you have debugged the query and the score for 15 is
higher than the ones below it. What's surprising about that?

Are you saying you don't understand how the score is
calculated? Or the output when adding &debug=true
is inconsistent or what?

Best,
Erick

On Fri, Nov 6, 2015 at 11:04 AM, Brian Narsi  wrote:
> I have a situation where.
>
> User search query
>
> q=15%
>
> Solr results contain several documents that are
>
> 15%
> 15%
> 15%
> 15%
> 15 (why?)
> 15%
> 15%
>
> I have debugged the query and can see that the score for 15 is higher than
> the ones below it.
>
> Why is that? Where can I read in detail about how the scoring is being done?
>
> Thanks


Solr results relevancy / scoring

2015-11-06 Thread Brian Narsi
I have a situation where:

User search query

q=15%

Solr results contain several documents that are

15%
15%
15%
15%
15 (why?)
15%
15%

I have debugged the query and can see that the score for 15 is higher than
the ones below it.

Why is that? Where can I read in detail about how the scoring is being done?

Thanks


Re: solr-8983-console.log is huge

2015-11-06 Thread davidphilip cherian
From mail archives:

https://support.lucidworks.com/hc/en-us/articles/207072137-Solr-5-X-Console-Logging-solr-8983-console-log

On Fri, Nov 6, 2015 at 1:10 PM, Shawn Heisey  wrote:

> On 11/6/2015 9:13 AM, Alexandre Rafalovitch wrote:
> > What about the Garbage Collection output? I think we have the same
> > issue there. Frankly, I don't know how many people know what to do
> > with that in a first place.
>
> Turns out that Java has rotation capability built in to GC logging:
>
> http://stackoverflow.com/a/12277309/2665648
>
> Thanks,
> Shawn
>
>


Re: solr-8983-console.log is huge

2015-11-06 Thread Shawn Heisey
On 11/6/2015 9:13 AM, Alexandre Rafalovitch wrote:
> What about the Garbage Collection output? I think we have the same
> issue there. Frankly, I don't know how many people know what to do
> with that in a first place.

Turns out that Java has rotation capability built in to GC logging:

http://stackoverflow.com/a/12277309/2665648

Thanks,
Shawn
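The JVM-side flags behind that answer look roughly like this (log file name,
file count, and size are just examples):

    -Xloggc:logs/solr_gc.log -XX:+UseGCLogFileRotation
    -XX:NumberOfGCLogFiles=9 -XX:GCLogFileSize=20M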



Re: solr-8983-console.log is huge

2015-11-06 Thread Erick Erickson
Yep, I looked at the new JIRA and finally figured out what the
problem is.

It should be changed, but in the meantime one can go in and
take the CONSOLE appender out of the logging properties file.

Or restart Solr periodically. Ugly but it would work.
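In the log4j.properties shipped with Solr 5.x the change amounts to dropping
CONSOLE from the root logger, i.e. turning a line like

    log4j.rootLogger=INFO, file, CONSOLE

into

    log4j.rootLogger=INFO, file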

On Fri, Nov 6, 2015 at 8:13 AM, Alexandre Rafalovitch
 wrote:
> What about the Garbage Collection output? I think we have the same
> issue there. Frankly, I don't know how many people know what to do
> with that in a first place.
>
>
> 
> Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
> http://www.solr-start.com/
>
>
> On 6 November 2015 at 11:11, Upayavira  wrote:
>> Erick,
>>
>> bin/start pipes stdout to solr-$PORT-console.log or such. With no
>> rotation. So we are setting people up to fail right from the get-go.
>>
>> That's what I'm hoping the attached ticket will resolve.
>>
>> Upayavira
>>
>> On Fri, Nov 6, 2015, at 03:52 PM, Erick Erickson wrote:
>>> How do you start solr? If you pipe console output
>>> to a file it'll grow forever. Either pipe the
>>> output to dev/null or follow Sara's link and
>>> take the CONSOLE appender out of log4j.properties
>>>
>>> Best,
>>> Erick
>>>
>>> On Fri, Nov 6, 2015 at 2:12 AM, sara hajili 
>>> wrote:
>>> > You can change solr loglevel.bydefault solr logs for every thing.
>>> > You can change this by go in solrconsole.inlog/level and edit levels for
>>> > just error for example.
>>> > And this is temporary way.
>>> > You can also change solrconfig.insolr_home
>>> > In /log and change logging4j
>>> > Config.
>>> > For more info look at:
>>> > https://cwiki.apache.org/confluence/display/solr/Configuring+Logging
>>> > That log file is constantly growing. And it is now ~60GB. what can i 
>>> > change
>>> > to fix this?
>>> >
>>> >
>>> >
>>> > --
>>> > View this message in context:
>>> > http://lucene.472066.n3.nabble.com/solr-8983-console-log-is-huge-tp4238613.html
>>> > Sent from the Solr - User mailing list archive at Nabble.com.


data import extremely slow

2015-11-06 Thread Yangrui Guo
Hi

I'm using Solr's data import handler and MySQL 5.5 to index the IMDB database.
However, the data import takes a few minutes to process one document, and
there are over 3 million movies. This is going to take forever, yet I can
select the rows in MySQL in no time. Where am I going wrong? My
data-config.xml is like below:

[data-config.xml not preserved in the archive]

I created views for the database:

movie:

SELECT
`title`.`id` AS `id`
FROM
`title`

movie_actor:

SELECT
CONCAT('movie.',
`title`.`id`,
'.actor.',
`cast_info`.`person_id`) AS `id`,
`title`.`id` AS `parent`,
`name`.`name` AS `name`
FROM
((`title`
JOIN `cast_info` ON ((`cast_info`.`movie_id` = `title`.`id`)))
JOIN `name` ON ((`cast_info`.`person_id` = `name`.`id`)))
WHERE
(`cast_info`.`role_id` = 1)

movie_actress:

SELECT
CONCAT('movie.',
`title`.`id`,
'.actress.',
`cast_info`.`person_id`) AS `id`,
`title`.`id` AS `parent`,
`name`.`name` AS `name`
FROM
((`title`
JOIN `cast_info` ON ((`cast_info`.`movie_id` = `title`.`id`)))
JOIN `name` ON ((`cast_info`.`person_id` = `name`.`id`)))
WHERE
(`cast_info`.`role_id` = 2)

Thanks,

Yangrui


Boost query at search time according set of roles with least performance impact

2015-11-06 Thread Andrea Roggerone
Hi all,
I am working on a mechanism that applies additional boosts to documents
according to the role covered by the author. For instance we have

CEO|5 Architect|3 Developer|1 TeamLeader|2

keeping in mind that an author could cover multiple roles (e.g. for a
design document, a Team Leader could be also a Developer).

I am aware that it is possible to implement a function that leverages
payloads; however, the weights need to be configurable, so I can't store the
payload at index time.
Passing all the weights at query time is not an option as we have more than
20 roles and query readability and performance would be heavily affected.

Do we have any "out of the box mechanism" in Solr to implement the
described behavior? If not, what other options do we have?


Re: solr-8983-console.log is huge

2015-11-06 Thread Alexandre Rafalovitch
What about the Garbage Collection output? I think we have the same
issue there. Frankly, I don't know how many people know what to do
with that in a first place.



Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/


On 6 November 2015 at 11:11, Upayavira  wrote:
> Erick,
>
> bin/start pipes stdout to solr-$PORT-console.log or such. With no
> rotation. So we are setting people up to fail right from the get-go.
>
> That's what I'm hoping the attached ticket will resolve.
>
> Upayavira
>
> On Fri, Nov 6, 2015, at 03:52 PM, Erick Erickson wrote:
>> How do you start solr? If you pipe console output
>> to a file it'll grow forever. Either pipe the
>> output to dev/null or follow Sara's link and
>> take the CONSOLE appender out of log4j.properties
>>
>> Best,
>> Erick
>>
>> On Fri, Nov 6, 2015 at 2:12 AM, sara hajili 
>> wrote:
>> > You can change solr loglevel.bydefault solr logs for every thing.
>> > You can change this by go in solrconsole.inlog/level and edit levels for
>> > just error for example.
>> > And this is temporary way.
>> > You can also change solrconfig.insolr_home
>> > In /log and change logging4j
>> > Config.
>> > For more info look at:
>> > https://cwiki.apache.org/confluence/display/solr/Configuring+Logging
>> > That log file is constantly growing. And it is now ~60GB. what can i change
>> > to fix this?
>> >
>> >
>> >
>> > --
>> > View this message in context:
>> > http://lucene.472066.n3.nabble.com/solr-8983-console-log-is-huge-tp4238613.html
>> > Sent from the Solr - User mailing list archive at Nabble.com.


Re: solr-8983-console.log is huge

2015-11-06 Thread Upayavira
Erick,

bin/start pipes stdout to solr-$PORT-console.log or such. With no
rotation. So we are setting people up to fail right from the get-go.

That's what I'm hoping the attached ticket will resolve.

Upayavira

On Fri, Nov 6, 2015, at 03:52 PM, Erick Erickson wrote:
> How do you start solr? If you pipe console output
> to a file it'll grow forever. Either pipe the
> output to dev/null or follow Sara's link and
> take the CONSOLE appender out of log4j.properties
> 
> Best,
> Erick
> 
> On Fri, Nov 6, 2015 at 2:12 AM, sara hajili 
> wrote:
> > You can change solr loglevel.bydefault solr logs for every thing.
> > You can change this by go in solrconsole.inlog/level and edit levels for
> > just error for example.
> > And this is temporary way.
> > You can also change solrconfig.insolr_home
> > In /log and change logging4j
> > Config.
> > For more info look at:
> > https://cwiki.apache.org/confluence/display/solr/Configuring+Logging
> > That log file is constantly growing. And it is now ~60GB. what can i change
> > to fix this?
> >
> >
> >
> > --
> > View this message in context:
> > http://lucene.472066.n3.nabble.com/solr-8983-console-log-is-huge-tp4238613.html
> > Sent from the Solr - User mailing list archive at Nabble.com.


Re: No live SolrServers available to handle this request

2015-11-06 Thread Erick Erickson
The host may be running well, but my bet is that
you have an error in the schema.xml file so it's
no longer valid XML and the core did not load.

So while the solr instance is up and running, no
core using that schema is running, thus no
live servers.

Look at the admin UI, cloud>>graph view and
if the collection you're trying to operate on is
not green, then that's probably the issue.

Otherwise look through the Solr log file and
you should see some exceptions that may
point the way.

Best,
Erick

On Thu, Nov 5, 2015 at 11:58 PM, wilanjar .  wrote:
> Hi All,
>
> I'm very new handle the solrcloud.
> I've changed the schema.xml, adding a field to index, but after reloading the
> collection we got this error in the logging: "No live SolrServers available to
> handle this request".
>
> I have checked SolrCloud from localhost on each node and it is running well.
> I'm using Solr version 4.10.4, Lucene version 4.10.4,
> Tomcat 8.0.27,
> ZooKeeper 3.4.6.
>
> I have already googled but have not found a solution yet.
>
> Thank you.


Re: [SolrJ Clients] RequestWriter VS BinaryRequestWriter

2015-11-06 Thread Erick Erickson
And the other large benefit of CloudSolrClient is that it
routes documents directly to the correct leader, i.e. it does
the routing on the client rather than having the Solr
instances forward docs for routing. Using CloudSolrClient
should scale more nearly linearly with an increasing number
of shards.

Best,
Erick

On Fri, Nov 6, 2015 at 6:39 AM, Shawn Heisey  wrote:
> On 11/6/2015 7:15 AM, Vincenzo D'Amore wrote:
>> I have followed your same path, having a look at java source. I inherited
>> an installation with CloudSolrServer (I still had solrcloud 4.8) but I was
>> not sure it was the right choice instead of the (apparently) more appealing
>> ConcurrentUpdateSolrClient.
>>
>> As far as I understood, ConcurrentUpdateSolrClient is rooted with older
>> versions of solr, may be older than the cloud version.
>> Because of ConcurrentUpdateSolrClient constructors signature, they don't
>> accept a zookeeper client or host:port as parameter.
>>
>> On the other hand, well, I'm not sure that a concurrent client does a job
>> better than the standard CloudSolrServer.
>
> The concurrent client has one glaring flaw:  It puts all update requests
> into background threads, so any exceptions thrown by those requests are
> logged and ignored.  When you send an add or delete request, the client
> returns immediately to your program and indicates success (by not
> throwing an exception) ... even if the server you're talking to is
> completely offline.
>
> In a bulk insert situation, you might not care about error handling, but
> most people DO care about it.
>
> For most situations, you will want to use HttpSolrClient or
> CloudSolrClient, depending on whether the target is running SolrCloud.
>
> Thanks,
> Shawn
>
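A minimal sketch of the CloudSolrClient usage recommended here (ZooKeeper
ensemble string and collection name are placeholders):

    CloudSolrClient client = new CloudSolrClient("zk1:2181,zk2:2181,zk3:2181");
    client.setDefaultCollection("collection1");
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "doc-1");
    try {
        client.add(doc);     // routed on the client straight to the shard leader
        client.commit();
    } catch (SolrServerException | IOException e) {
        // unlike ConcurrentUpdateSolrClient, failures surface here and can be handled
    }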


Re: solr-8983-console.log is huge

2015-11-06 Thread Erick Erickson
How do you start solr? If you pipe console output
to a file it'll grow forever. Either pipe the
output to /dev/null or follow Sara's link and
take the CONSOLE appender out of log4j.properties

Best,
Erick

On Fri, Nov 6, 2015 at 2:12 AM, sara hajili  wrote:
> You can change solr loglevel.bydefault solr logs for every thing.
> You can change this by go in solrconsole.inlog/level and edit levels for
> just error for example.
> And this is temporary way.
> You can also change solrconfig.insolr_home
> In /log and change logging4j
> Config.
> For more info look at:
> https://cwiki.apache.org/confluence/display/solr/Configuring+Logging
> That log file is constantly growing. And it is now ~60GB. what can i change
> to fix this?
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/solr-8983-console-log-is-huge-tp4238613.html
> Sent from the Solr - User mailing list archive at Nabble.com.


RE: tikaparser docx file fails with exception

2015-11-06 Thread Allison, Timothy B.
Agree with all below, and don't hesitate to open a ticket on Tika's Jira and/or 
POI's bugzilla...especially if you can share the triggering document.

-Original Message-
From: Alexandre Rafalovitch [mailto:arafa...@gmail.com] 
Sent: Thursday, November 05, 2015 6:05 PM
To: solr-user 
Subject: Re: tikaparser docx file fails with exception

It is quite clear actually that the problem is this:
Caused by: java.io.CharConversionException: Characters larger than 4 bytes are 
not supported: byte 0xb7 implies a length of more than 4 bytes
  at 
org.apache.xmlbeans.impl.piccolo.xml.UTF8XMLDecoder.decode(UTF8XMLDecoder.java:162)
  at 
org.apache.xmlbeans.impl.piccolo.xml.XMLStreamReader$FastStreamDecoder.read(XMLStreamReader.java:762)
  at 
org.apache.xmlbeans.impl.piccolo.xml.XMLStreamReader.read(XMLStreamReader.java:162)
  at 
org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.yy_refill(PiccoloLexer.java:3477)

If you search for something like: PiccoloLexer.yy_refill Characters larger than 
4 bytes are not supported:
you get lots of various matches in different forums for different (java-based? 
tika-based?) software. Most likely Tika found something obscure in the document 
that there is no implementations for yet. E.g.
an image inside a text field inside a footer section. Just as an example

I would basically try standalone Tika and look for the most expressive debug 
flag. It should tell you which file inside the zip (which is what a docx actually 
is) caused the problem. That should give you some hint.

Regards,
   Alex.

Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/


On 5 November 2015 at 17:36, Aswath Srinivasan (TMS) 
 wrote:
> Thank you for attempting to answer. I will try out with solrj and standalone 
> java with tika parser. I completely understand that a bad document could 
> cause this, however, when I opened up the document I couldn't find anything 
> suspicious expect for some binary images/pictures embedded into the document.
>
>
> -Original Message-
> From: Erick Erickson [mailto:erickerick...@gmail.com]
> Sent: Wednesday, November 04, 2015 4:33 PM
> To: solr-user 
> Subject: Re: tikaparser docx file fails with exception
>
> Possibly a corrupt file? Tika does its best, but bad data is...bad data.
>
> You can experiment a bit with using Tika in Java, that might give you a 
> better idea of what's really going on, here's a SolrJ example:
>
> https://lucidworks.com/blog/2012/02/14/indexing-with-solrj/
>
> Best,
> Erick
>
> On Wed, Nov 4, 2015 at 3:49 PM, Aswath Srinivasan (TMS) 
>  wrote:
>>
>> Trying to index a document. A docx file. Ending up with the below exception. 
>> Not sure why it is erroring out. When I opened the docx I was able to see 
>> lots of binary data like embedded pictures etc., Is there a possible 
>> solution to this or is it a bug? Only one such file fails. Rest of the files 
>> are smoothly indexed.
>>
>> 2015-11-04 23:16:11.549 INFO  (coreLoadExecutor-6-thread-1) [   x:tika] 
>> o.a.s.c.CoreContainer registering core: tika
>> 2015-11-04 23:16:11.549 INFO  
>> (searcherExecutor-7-thread-1-processing-x:tika) [   x:tika] o.a.s.c.SolrCore 
>> QuerySenderListener sending requests to Searcher@1eb69b2[tika] 
>> main{ExitableDirectoryReader(UninvertingDirectoryReader())}
>> 2015-11-04 23:16:11.585 INFO  
>> (searcherExecutor-7-thread-1-processing-x:tika) [   x:tika] 
>> o.a.s.c.S.Request [tika] webapp=null path=null 
>> params={q=static+firstSearcher+warming+in+solrconfig.xml&distrib=false&event=firstSearcher}
>>  hits=0 status=0 QTime=34
>> 2015-11-04 23:16:11.586 INFO  
>> (searcherExecutor-7-thread-1-processing-x:tika) [   x:tika] o.a.s.c.SolrCore 
>> QuerySenderListener done.
>> 2015-11-04 23:16:11.586 INFO  
>> (searcherExecutor-7-thread-1-processing-x:tika) [   x:tika] 
>> o.a.s.h.c.SpellCheckComponent Loading spell index for spellchecker: default
>> 2015-11-04 23:16:11.586 INFO  
>> (searcherExecutor-7-thread-1-processing-x:tika) [   x:tika] 
>> o.a.s.h.c.SpellCheckComponent Loading spell index for spellchecker: wordbreak
>> 2015-11-04 23:16:11.586 INFO  
>> (searcherExecutor-7-thread-1-processing-x:tika) [   x:tika] 
>> o.a.s.h.c.SuggestComponent buildOnStartup: mySuggester
>> 2015-11-04 23:16:11.586 INFO  
>> (searcherExecutor-7-thread-1-processing-x:tika) [   x:tika] 
>> o.a.s.s.s.SolrSuggester SolrSuggester.build(mySuggester)
>> 2015-11-04 23:16:11.605 INFO  
>> (searcherExecutor-7-thread-1-processing-x:tika) [   x:tika] o.a.s.c.SolrCore 
>> [tika] Registered new searcher Searcher@1eb69b2[tika] 
>> main{ExitableDirectoryReader(UninvertingDirectoryReader())}
>> 2015-11-04 23:16:25.923 INFO  (qtp7980742-16) [   x:tika] 
>> o.a.s.h.d.DataImporter Loading DIH Configuration: tika-data-config.xml
>> 2015-11-04 23:16:25.937 INFO  (qtp7980742-16) [   x:tika] 
>> o.a.s.h.d.DataImporter Data Configuration loaded successfully
>> 2015-11-04 23:16:25.947 INFO  (qtp7980742-16) [   x:tika] o.a.s.c.S.Request 
>> [tik

Re: Securing field level access permission by filtering the query itself

2015-11-06 Thread Alessandro Benedetti
Are you basically saying that you are going to model 3 collections, 1 per
role?
Each collection's schema will contain only the sensitive fields that role is
allowed to see.
When you query, you simply search in the related collection and retrieve all
the fields.
Is that it?

Cheers

On 6 November 2015 at 15:05, Douglas McGilvray  wrote:

> You know what guys, I have had a change in perspective…
>
> I previously thought: do I want to index all these documents multiple
> times just to protect 3 fields
> I am now thinking: do I really want to try to parse all the fields in a
> query when there are only 3 roles.
>
> I have only 4k documents and 3 roles, so thats 8k more documents and I
> doubt I will need to cross query with the other documents …
>
> Until I have more or more complex roles, or more protected documents, I
> think multiple cores is the best option …
>
> Cheers
> D
>
>
> > On 5 Nov 2015, at 12:50, Alessandro Benedetti 
> wrote:
> >
> > Be careful to the suggester as well. You don't want to show suggestions
> > coming from sensitive fields.
> >
> > Cheers
> >
> > On 5 November 2015 at 15:28, Scott Stults <
> sstu...@opensourceconnections.com
> >> wrote:
> >
> >> Good to hear! Depending on how far you want to take it, you can then
> scan
> >> the initial request coming in from the client (and the final response)
> for
> >> raw Solr fields -- that shouldn't happen. I've used mod_security as a
> >> general-purpose application firewall and would recommend it.
> >>
> >> k/r,
> >> Scott
> >>
> >> On Wed, Nov 4, 2015 at 1:40 PM, Douglas McGilvray 
> wrote:
> >>
> >>>
> >>> Thanks Alessandro, I had overlooked the highlighting component.
> >>>
> >>> I will also add a reminder to exclude these fields from spellcheck
> >> fields,
> >>> (or maintain different spellcheck fields for different roles).
> >>>
> >>> @Scott - Once I started planning my code the penny finally dropped
> >>> regarding your point about aliasing the fields - it removes the need
> for
> >>> calculating which fields to request in the app itself.
> >>>
> >>> Regards,
> >>> D
> >>>
> >>>
>  On 4 Nov 2015, at 14:53, Alessandro Benedetti 
> >>> wrote:
> 
>  Of course it depends of all the query parameter you use and you
> process
> >>> in
>  the response.
>  The list you wrote should be ok if you use only those components.
> 
>  For example if you use highlight, it's not ok and you need to take
> care
> >>> of
>  the highlighted fields as well.
> 
>  Cheers
> 
>  On 30 October 2015 at 14:51, Douglas McGilvray 
> >> wrote:
> 
> >
> > Scott thanks for the reply. I like the idea of mapping all the
> >>> fieldnames
> > internally, adding security through obscurity. My question therefore
> >>> would
> > be what is the definitive list of query parameters that one must
> >> filter
> >>> to
> > ensure a particular field is not exposed in the query response? Am I
> > missing in the following?
> >
> > fl
> > facect.field
> > facet.pivot
> > json.facet
> > terms.fl
> >
> >
> > kr
> > Douglas
> >
> >
> >> On 30 Oct 2015, at 07:37, Scott Stults <
> > sstu...@opensourceconnections.com> wrote:
> >>
> >> Douglas,
> >>
> >> Managing a per-user-group whitelist of fields outside of Solr seems
> >> the
> >> best approach. When the query comes in you can then filter out any
> >>> fields
> >> not contained in the whitelist before you send the request to Solr.
> >> The
> >> easy part will be to do that on URL parameters like fl. Depending on
> >>> how
> >> your app generates the actual query string, you may want to also
> scan
> > that
> >> for fielded query clauses (eg "badfield:value") and localParams (eg
> >> "{!dismax qf=badfield}value").
> >>
> >> Secondly, you can map internal Solr fields to aliases using this
> >> syntax
> > in
> >> the fl parameter: "display_name:real_solr_name". So when the request
> > comes
> >> in from your app, first you'll map from the requested field alias
> >> names
> > to
> >> internal Solr names (while enforcing the whitelist), and then in the
> >> fl
> >> parameter supply the aliases you want sent in the response.
> >>
> >>
> >> k/r,
> >> Scott
> >>
> >> On Wed, Oct 28, 2015 at 6:58 PM, Douglas McGilvray  >
> > wrote:
> >>
> >>> Hi all,
> >>>
> >>> First I’d like to say the nested facets and the json facet api in
> >>> particular have made my world much better, I thank everyone
> >> involved,
> > you
> >>> are all awesome.
> >>>
> >>> In my implementation has much of the solr query building working on
> >>> the
> >>> browser, solr is behind a php server which acts as “proxy” and
> >>> doorman,
> >>> filtering at the document level according to user role and
> supplying
> > some
> >>> sensible maximums …
> >>>
> >>> However we now wish to filter just one or two potentia

Re: Adding SanderC to the ContributorsGroup

2015-11-06 Thread Shawn Heisey
On 11/6/2015 2:58 AM, Sander Clompen wrote:
> Could you please add me to the ContributorsGroup (username: SanderC)?
> 
> I would like to participate and contribute on the wiki page, I would like to 
> translate the wiki to Dutch (or French, German).

Done.



Re: MatchAllDocsQuery is much slower in solr5.3.1 compare to solr4.7

2015-11-06 Thread Shawn Heisey
On 11/5/2015 10:25 PM, Jack Krupansky wrote:
> I vaguely recall some discussion concerning removal of the field cache in
> Lucene.

The FieldCache wasn't exactly *removed* ... it's more like it was
renamed, improved, and sort of hidden in a miscellaneous package.  Some
things still require this functionality, so they use the hidden class
instead, which was changed to use the DocValues API.

https://issues.apache.org/jira/browse/LUCENE-5666

I am not qualified to discuss LUCENE-5666 beyond what I wrote in the
paragraph above, and it's possible that some of what I said is wrong
because I do not really understand the APIs involved.

The change has caused problems for Solr.  End result from Solr's
perspective: Certain things which used to work perfectly fine (mostly
facets and grouping) in Solr 4.x have one of two problems in 5.x:
Either they don't work at all, or performance has gone way down.  Some
of these problems are documented in Jira.  These are the issues I know
about:

https://issues.apache.org/jira/browse/SOLR-8088
https://issues.apache.org/jira/browse/SOLR-7495
https://issues.apache.org/jira/browse/SOLR-8096

For fields where adding docValues is a viable option (most field types
other than solr.TextField), adding docValues and reindexing is very
likely to solve those problems.

Sometimes adding docValues won't work, either because the field type
doesn't allow it, or because it's the indexed terms that are needed, not
the original field value.  For those situations, there is currently no
solution.

Thanks,
Shawn



Re: Securing field level access permission by filtering the query itself

2015-11-06 Thread Douglas McGilvray
You know what guys, I have had a change in perspective… 

I previously thought: do I want to index all these documents multiple times 
just to protect 3 fields
I am now thinking: do I really want to try to parse all the fields in a query 
when there are only 3 roles. 

I have only 4k documents and 3 roles, so that's 8k more documents, and I doubt I 
will need to cross query with the other documents … 

Until I have more or more complex roles, or more protected documents, I think 
multiple cores is the best option … 

Cheers
D


> On 5 Nov 2015, at 12:50, Alessandro Benedetti  wrote:
> 
> Be careful to the suggester as well. You don't want to show suggestions
> coming from sensitive fields.
> 
> Cheers
> 
> On 5 November 2015 at 15:28, Scott Stults > wrote:
> 
>> Good to hear! Depending on how far you want to take it, you can then scan
>> the initial request coming in from the client (and the final response) for
>> raw Solr fields -- that shouldn't happen. I've used mod_security as a
>> general-purpose application firewall and would recommend it.
>> 
>> k/r,
>> Scott
>> 
>> On Wed, Nov 4, 2015 at 1:40 PM, Douglas McGilvray  wrote:
>> 
>>> 
>>> Thanks Alessandro, I had overlooked the highlighting component.
>>> 
>>> I will also add a reminder to exclude these fields from spellcheck
>> fields,
>>> (or maintain different spellcheck fields for different roles).
>>> 
>>> @Scott - Once I started planning my code the penny finally dropped
>>> regarding your point about aliasing the fields - it removes the need for
>>> calculating which fields to request in the app itself.
>>> 
>>> Regards,
>>> D
>>> 
>>> 
 On 4 Nov 2015, at 14:53, Alessandro Benedetti 
>>> wrote:
 
 Of course it depends of all the query parameter you use and you process
>>> in
 the response.
 The list you wrote should be ok if you use only those components.
 
 For example if you use highlight, it's not ok and you need to take care
>>> of
 the highlighted fields as well.
 
 Cheers
 
 On 30 October 2015 at 14:51, Douglas McGilvray 
>> wrote:
 
> 
> Scott thanks for the reply. I like the idea of mapping all the
>>> fieldnames
> internally, adding security through obscurity. My question therefore
>>> would
> be what is the definitive list of query parameters that one must
>> filter
>>> to
> ensure a particular field is not exposed in the query response? Am I
> missing in the following?
> 
> fl
> facect.field
> facet.pivot
> json.facet
> terms.fl
> 
> 
> kr
> Douglas
> 
> 
>> On 30 Oct 2015, at 07:37, Scott Stults <
> sstu...@opensourceconnections.com> wrote:
>> 
>> Douglas,
>> 
>> Managing a per-user-group whitelist of fields outside of Solr seems
>> the
>> best approach. When the query comes in you can then filter out any
>>> fields
>> not contained in the whitelist before you send the request to Solr.
>> The
>> easy part will be to do that on URL parameters like fl. Depending on
>>> how
>> your app generates the actual query string, you may want to also scan
> that
>> for fielded query clauses (eg "badfield:value") and localParams (eg
>> "{!dismax qf=badfield}value").
>> 
>> Secondly, you can map internal Solr fields to aliases using this
>> syntax
> in
>> the fl parameter: "display_name:real_solr_name". So when the request
> comes
>> in from your app, first you'll map from the requested field alias
>> names
> to
>> internal Solr names (while enforcing the whitelist), and then in the
>> fl
>> parameter supply the aliases you want sent in the response.
>> 
>> 
>> k/r,
>> Scott
>> 
>> On Wed, Oct 28, 2015 at 6:58 PM, Douglas McGilvray 
> wrote:
>> 
>>> Hi all,
>>> 
>>> First I’d like to say the nested facets and the json facet api in
>>> particular have made my world much better, I thank everyone
>> involved,
> you
>>> are all awesome.
>>> 
>>> In my implementation has much of the solr query building working on
>>> the
>>> browser, solr is behind a php server which acts as “proxy” and
>>> doorman,
>>> filtering at the document level according to user role and supplying
> some
>>> sensible maximums …
>>> 
>>> However we now wish to filter just one or two potentially sensitive fields
>>> in one document type according to user role (as determined in the php
>>> proxy). Duplicating documents (or cores) seems like overkill for just two
>>> fields in one document type .. I wondered if it would be feasible (in the
>>> interests of preventing malicious activity) to filter the query itself
>>> whether it be parameters (fl, facet.fields, terms, etc) … or even deny any
>>> request in which fieldname occurs …
>>> 
>>> Is there some way someone might obscure a fieldname in a request?
>>> 
>>> Ki
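
For reference, the fl aliasing described in the quoted thread above uses
Solr's alias:fieldname syntax directly in the fl parameter. A minimal
sketch, with a made-up core and made-up field names:

    http://localhost:8983/solr/mycore/select?q=*:*&fl=name:internal_name_s,price:internal_price_f

The documents in the response then carry only "name" and "price" keys, so
the internal Solr field names are never exposed to the client.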

Re: solr-8983-console.log is huge

2015-11-06 Thread Shawn Heisey
On 11/6/2015 6:17 AM, Upayavira wrote:
> On Fri, Nov 6, 2015, at 10:12 AM, sara hajili wrote:
>> You can change the Solr log level. By default Solr logs everything.
>> You can change this temporarily by going into the Solr console, under
>> Logging/Level, and editing the levels to just ERROR, for example.
>> You can also change the Solr config: in solr_home, under /log, change the
>> log4j config.
>> For more info look at:
>> https://cwiki.apache.org/confluence/display/solr/Configuring+Logging
>> That log file is constantly growing. And it is now ~60GB. What can I
>> change to fix this?
> 
> I recently created this ticket:
> 
> https://issues.apache.org/jira/browse/SOLR-8232
> 
> It is all well and good saying you can change your logging to be less
> aggressive, but if the log file is never rotated, it WILL use up disk
> space one way or another. The correct way to fix this, I'd suggest, is to
> not log anything to the console, and use log4j.properties to send log
> events to a file that *is* rotated.

I just commented on SOLR-8232 with what I think is a viable solution to
the problem -- change CONSOLE logging in all the log4j.properties files
to only log at WARN severity or higher.  There is some value to a
console log, but only if it doesn't duplicate every single informational
message that goes into the main log.
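
As a rough sketch of that change (assuming the stock log4j.properties that
ships under server/resources in Solr 5.x; the appender names may differ in
your setup):

    # the file appender still receives full INFO logging and is rotated
    log4j.rootLogger=INFO, file, CONSOLE

    # only WARN and above reach the console log
    log4j.appender.CONSOLE.Threshold=WARN

Everything below WARN still ends up in the rotated solr.log, while the
console log stays small.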

Thanks,
Shawn



Re: [SolrJ Clients] RequestWriter VS BinaryRequestWriter

2015-11-06 Thread Shawn Heisey
On 11/6/2015 7:15 AM, Vincenzo D'Amore wrote:
> I have followed the same path as you, having a look at the Java source. I
> inherited an installation with CloudSolrServer (I still had SolrCloud 4.8)
> but I was not sure it was the right choice instead of the (apparently) more
> appealing ConcurrentUpdateSolrClient.
>
> As far as I understood, ConcurrentUpdateSolrClient is rooted in older
> versions of Solr, perhaps predating the cloud version.
> Because of the ConcurrentUpdateSolrClient constructor signatures, it doesn't
> accept a ZooKeeper client or host:port as a parameter.
> 
> On the other hand, well, I'm not sure that a concurrent client does a job
> better than the standard CloudSolrServer.

The concurrent client has one glaring flaw:  It puts all update requests
into background threads, so any exceptions thrown by those requests are
logged and ignored.  When you send an add or delete request, the client
returns immediately to your program and indicates success (by not
throwing an exception) ... even if the server you're talking to is
completely offline.

In a bulk insert situation, you might not care about error handling, but
most people DO care about it.

For most situations, you will want to use HttpSolrClient or
CloudSolrClient, depending on whether the target is running SolrCloud.
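
As an illustration, here is a minimal SolrJ sketch (the ZooKeeper addresses,
collection and field names are placeholders) of the error handling you get
with CloudSolrClient, which ConcurrentUpdateSolrClient's background threads
would hide from you:

    import java.io.IOException;
    import org.apache.solr.client.solrj.SolrServerException;
    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class IndexWithErrorHandling {
      public static void main(String[] args) {
        // zkHost and collection name are placeholders for your own cluster
        try (CloudSolrClient client = new CloudSolrClient("zk1:2181,zk2:2181/solr")) {
          client.setDefaultCollection("collection1");

          SolrInputDocument doc = new SolrInputDocument();
          doc.addField("id", "1");
          doc.addField("title_s", "hello");

          // add() and commit() throw if the update cannot be applied;
          // ConcurrentUpdateSolrClient would queue the add on a background
          // thread and only log such a failure.
          client.add(doc);
          client.commit();
        } catch (SolrServerException | IOException e) {
          // the indexing code actually sees the failure here
          e.printStackTrace();
        }
      }
    }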

Thanks,
Shawn



Re: Trying to apply patch for SOLR-7036

2015-11-06 Thread Shawn Heisey
On 11/5/2015 7:04 PM, r b wrote:
> I just wanted to double check that my steps were not too off base.
> 
> I am trying to apply the patch from 8/May/15 and it seems to be
> slightly off. Inside it, the working revision is 1658487, so I checked that
> out from svn. This is what I did.
> 
> svn checkout
> http://svn.apache.org/repos/asf/lucene/dev/trunk@1658487 lucene_trunk
> cd lucene_trunk/solr
> curl 
> https://issues.apache.org/jira/secure/attachment/12731517/SOLR-7036.patch
> | patch -p0
> 
> But `patch` still fails on a few hunks. I figured this patch was made
> with `svn diff` so it should apply smoothly to that same revision,
> shouldn't it?

Erick had the same problem with the patch back in July, and asked the
submitter to update the patch to trunk.  I tried applying the patch to
branch_5x at the specified revision and that failed too.

When I pulled down that specific revision of the lucene_solr_4_10
branch, then it would cleanly apply.  There are vast differences between
all 4.x branches/tags and the newer branches, which is why you cannot
get the patch applied.  A huge amount of work went into the code for
version 5.0.0, and the work on trunk and branch_5x since that release
has been enormous.
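
For what it's worth, the same commands against the 4.10 branch would look
roughly like this (an untested sketch; the branch path follows the standard
ASF svn layout):

    svn checkout http://svn.apache.org/repos/asf/lucene/dev/branches/lucene_solr_4_10@1658487 lucene_solr_4_10
    cd lucene_solr_4_10/solr
    curl https://issues.apache.org/jira/secure/attachment/12731517/SOLR-7036.patch | patch -p0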

Getting this patch into 5x or trunk is going to require a lot of manual
work.  The original patch author is best qualified to do that work.  If
you want to tackle the job, feel free.  If you do so, please upload a
new patch to the issue.

Thanks,
Shawn



Re: [SolrJ Clients] RequestWriter VS BinaryRequestWriter

2015-11-06 Thread Alessandro Benedetti
Hi Vincenzo,
according to our discoveries I would say CloudSolrClient is the most
efficient way to interact with a SolrCloud cluster.

ConcurrentUpdateSolrServer will be efficient for a single Solr instance,
but under the hood it uses the XML RequestWriter, even if you would prefer
to use the javabin one (which should be more efficient).
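
If the writer choice matters for your indexing throughput, here is a minimal
sketch (the URL and class name are made up) of how an HttpSolrClient can be
switched to the javabin writer explicitly:

    import org.apache.solr.client.solrj.impl.BinaryRequestWriter;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;

    public class JavabinClientFactory {
      // the base URL is a placeholder for your own core or collection
      static HttpSolrClient createClient() {
        HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/collection1");
        // updates are now serialized as javabin instead of the default XML
        client.setRequestWriter(new BinaryRequestWriter());
        return client;
      }
    }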

Cheers

On 6 November 2015 at 14:15, Vincenzo D'Amore  wrote:

> Hi Alessandro,
>
> I have followed the same path as you, having a look at the Java source. I
> inherited an installation with CloudSolrServer (I still had SolrCloud 4.8)
> but I was not sure it was the right choice instead of the (apparently) more
> appealing ConcurrentUpdateSolrClient.
>
> As far as I understood, ConcurrentUpdateSolrClient is rooted in older
> versions of Solr, perhaps predating the cloud version.
> Because of the ConcurrentUpdateSolrClient constructor signatures, it doesn't
> accept a ZooKeeper client or host:port as a parameter.
>
> On the other hand, well, I'm not sure that a concurrent client does a job
> better than the standard CloudSolrServer.
>
> Best,
> Vincenzo
>
>
> On Thu, Nov 5, 2015 at 12:30 PM, Alessandro Benedetti <
> abenede...@apache.org
> > wrote:
>
> > Hi guys,
> > I was taking a look at the implementation details to understand how Solr
> > requests are written by SolrJ APIs.
> > The interesting classes are :
> >
> > *org.apache.solr.client.solrj.request.RequestWriter*
> >
> > *org.apache.solr.client.solrj.impl.BinaryRequestWriter* ( wrong package
> ? )
> >
> > I discovered that :
> >
> > *CloudSolrClient *- is using the javabin format ( *BinaryRequestWriter*)
> > *HttpSolrClient *and* LBHttpSolrClient* - are using the *RequestWriter* (
> > which writes xml)
> >
> > As a consequence, the ConcurrentUpdateSolrClient is using the xml
> > RequestWriter as well.
> >
> > Is there any reason for this?
> > I did know that the javabin  format is the most efficient for Solr
> > requests.
> > Why the xml RequestWriter is still used as default with those
> SolrClients ?
> >
> > Cheers
> >
> > --
> > --
> >
> > Benedetti Alessandro
> > Visiting card : http://about.me/alessandro_benedetti
> >
> > "Tyger, tyger burning bright
> > In the forests of the night,
> > What immortal hand or eye
> > Could frame thy fearful symmetry?"
> >
> > William Blake - Songs of Experience -1794 England
> >
>
>
>
> --
> Vincenzo D'Amore
> email: v.dam...@gmail.com
> skype: free.dev
> mobile: +39 349 8513251
>



-- 
--

Benedetti Alessandro
Visiting card : http://about.me/alessandro_benedetti

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England


Re: [SolrJ Clients] RequestWriter VS BinaryRequestWriter

2015-11-06 Thread Vincenzo D'Amore
Hi Alessandro,

I have followed the same path as you, having a look at the Java source. I
inherited an installation with CloudSolrServer (I still had SolrCloud 4.8)
but I was not sure it was the right choice instead of the (apparently) more
appealing ConcurrentUpdateSolrClient.

As far as I understood, ConcurrentUpdateSolrClient is rooted in older
versions of Solr, perhaps predating the cloud version.
Because of the ConcurrentUpdateSolrClient constructor signatures, it doesn't
accept a ZooKeeper client or host:port as a parameter.

On the other hand, well, I'm not sure that a concurrent client does a job
better than the standard CloudSolrServer.

Best,
Vincenzo


On Thu, Nov 5, 2015 at 12:30 PM, Alessandro Benedetti  wrote:

> Hi guys,
> I was taking a look at the implementation details to understand how Solr
> requests are written by SolrJ APIs.
> The interesting classes are :
>
> *org.apache.solr.client.solrj.request.RequestWriter*
>
> *org.apache.solr.client.solrj.impl.BinaryRequestWriter* ( wrong package ? )
>
> I discovered that :
>
> *CloudSolrClient *- is using the javabin format ( *BinaryRequestWriter*)
> *HttpSolrClient *and* LBHttpSolrClient* - are using the *RequestWriter* (
> which writes xml)
>
> As a consequence, the ConcurrentUpdateSolrClient is using the xml
> RequestWriter as well.
>
> Is there any reason for this?
> I did know that the javabin  format is the most efficient for Solr
> requests.
> Why the xml RequestWriter is still used as default with those SolrClients ?
>
> Cheers
>
> --
> --
>
> Benedetti Alessandro
> Visiting card : http://about.me/alessandro_benedetti
>
> "Tyger, tyger burning bright
> In the forests of the night,
> What immortal hand or eye
> Could frame thy fearful symmetry?"
>
> William Blake - Songs of Experience -1794 England
>



-- 
Vincenzo D'Amore
email: v.dam...@gmail.com
skype: free.dev
mobile: +39 349 8513251


Re: solr-8983-console.log is huge

2015-11-06 Thread Upayavira


On Fri, Nov 6, 2015, at 10:12 AM, sara hajili wrote:
> You can change the Solr log level. By default Solr logs everything.
> You can change this temporarily by going into the Solr console, under
> Logging/Level, and editing the levels to just ERROR, for example.
> You can also change the Solr config: in solr_home, under /log, change the
> log4j config.
> For more info look at:
> https://cwiki.apache.org/confluence/display/solr/Configuring+Logging
> That log file is constantly growing. And it is now ~60GB. What can I
> change to fix this?

I recently created this ticket:

https://issues.apache.org/jira/browse/SOLR-8232

It is all well and good saying you can change your logging to be less
aggressive, but if the log file is never rotated, it WILL use up disk
space one way or another. The correct way to fix this, I'd suggest, is to
not log anything to the console, and use log4j.properties to send log
events to a file that *is* rotated.

Upayavira


Re: Child document and parent document with same key

2015-11-06 Thread Jamie Johnson
Thanks, that's what I suspected given what I'm seeing, but I wanted to make
sure. Again, thanks.
On Nov 5, 2015 1:08 PM, "Mikhail Khludnev" 
wrote:

> On Fri, Oct 16, 2015 at 10:41 PM, Jamie Johnson  wrote:
>
> > Is this expected to work?
>
>
> I think it is. I'm still not sure I understand the question. But let me
> bring some details from SOLR-3076:
> - Solr's uniqueKey relies on Lucene's "deleteTerm", which is supplied to
> indexWriter.updateDocument();
> - when a parent document has children, uniqueKey is not a deleteTerm, but
> its value is used as the "deleteTerm" for the field "_root_"; see
>
> https://github.com/apache/lucene-solr/blob/trunk/solr/core/src/java/org/apache/solr/update/DirectUpdateHandler2.java#L251
> - thus for block updates uniqueKey is (almost) meaningless.
> It lacks elegance, but that's it.
>
> --
> Sincerely yours
> Mikhail Khludnev
> Principal Engineer,
> Grid Dynamics
>
> 
> 
>
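
For anyone following along, here is a minimal SolrJ sketch (URL and field
names are made up) of the kind of parent/child block update being discussed;
the parent's uniqueKey value is what ends up driving the delete on the
_root_ field described above:

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class BlockUpdateExample {
      public static void main(String[] args) throws Exception {
        // the base URL is a placeholder for your own core or collection
        try (SolrClient client = new HttpSolrClient("http://localhost:8983/solr/collection1")) {
          SolrInputDocument parent = new SolrInputDocument();
          parent.addField("id", "parent-1");
          parent.addField("type_s", "parent");

          SolrInputDocument child = new SolrInputDocument();
          child.addField("id", "child-1");
          child.addField("type_s", "child");

          // the whole block (parent plus children) is indexed together;
          // re-adding "parent-1" later replaces the previous block via _root_
          parent.addChildDocument(child);
          client.add(parent);
          client.commit();
        }
      }
    }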


Adding SanderC to the ContributorsGroup

2015-11-06 Thread Sander Clompen
Hi,


Could you please add me to the ContributorsGroup (username: SanderC)?

I would like to participate and contribute to the wiki page; I would like to
translate the wiki into Dutch (or French, German).


Kind regards,

SanderC


Re: solr-8983-console.log is huge

2015-11-06 Thread sara hajili
You can change the Solr log level. By default Solr logs everything.
You can change this temporarily by going into the Solr console, under
Logging/Level, and editing the levels to just ERROR, for example.
You can also change the Solr config: in solr_home, under /log, change the
log4j config.
For more info look at:
https://cwiki.apache.org/confluence/display/solr/Configuring+Logging
That log file is constantly growing. And it is now ~60GB. What can I change
to fix this?





solr-8983-console.log is huge

2015-11-06 Thread CrazyDiamond
That log file is constantly growing. And it is now ~60GB. What can I change
to fix this?


