Re: Synonym Phrase

2013-07-27 Thread Mikhail Khludnev
Hello,

As far as I know,
http://nolanlawson.com/2012/10/31/better-synonym-handling-in-solr/ has seen
some usage in the industry.
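
For what it's worth, the application-level pre-expansion Jack describes
further down in this thread can start as small as the sketch below (the
synonym map, the field name "name", and the plain OR expansion are purely
illustrative assumptions; quoting a synonym turns it into a phrase query):

import java.util.*;

// Sketch only: expand known synonym phrases into an OR'd Solr query string
// before the query ever reaches Solr. Map contents and field name are made up.
public class SynonymExpander {

    static final Map<String, List<String>> SYNONYMS = new HashMap<String, List<String>>();
    static {
        SYNONYMS.put("cart", Arrays.asList("shopping cart", "market trolley"));
    }

    static String expand(String userQuery) {
        StringBuilder q = new StringBuilder("name:(" + userQuery + ")");
        for (Map.Entry<String, List<String>> e : SYNONYMS.entrySet()) {
            if (userQuery.toLowerCase().contains(e.getKey())) {
                for (String phrase : e.getValue()) {
                    q.append(" OR name:\"").append(phrase).append('"'); // quoted => phrase query
                }
            }
        }
        return q.toString();
    }

    public static void main(String[] args) {
        // prints: name:(cart) OR name:"shopping cart" OR name:"market trolley"
        System.out.println(expand("cart"));
    }
}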


On Fri, Jul 26, 2013 at 8:28 PM, Jack Krupansky wrote:

> Hmmm... Actually, I think there was also a solution where you could
> specify an alternate tokenizer for the synonym file which would not
> tokenize on space, so that the full phrase would be passed to the query
> parser/generator as a single "term" so that it would generate a phrase (if
> you have the autoGeneratePhraseQueries attribute of the field type set to
> true.) But, I don't recall the details... and it's not the default, which
> maybe it should be.
>
>
> -- Jack Krupansky
>
> -Original Message- From: Furkan KAMACI
> Sent: Friday, July 26, 2013 12:18 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Synonym Phrase
>
>
> Why does Solr not split those terms by *;*? I thought it splits by both
> *;* and the whitespace character?
>
> 2013/7/26 Jack Krupansky 
>
>  Well, that's one of the areas where Solr synonym support breaks down. The
>> LucidWorks Search query parser has a proprietary solution for that
>> problem,
>> but it won't help you with bare Solr. Some people have used shingles.
>>
>> In short, for query-time synonym phrases your best bet is to parse the
>> query at the application level and generate a Solr query that has the
>> synonyms pre-expanded.
>>
>> Application preprocessing could be as simple as scanning for the synonym
>> phrases and then adding "OR" terms for the synonym phrases.
>>
>> -- Jack Krupansky
>>
>> -Original Message- From: Furkan KAMACI
>> Sent: Friday, July 26, 2013 10:53 AM
>> To: solr-user@lucene.apache.org
>> Subject: Synonym Phrase
>>
>>
>> I have a synonyms file as like that:
>>
>> cart; shopping cart; market trolley
>>
>> When I analyse my query I see that when I search for cart these become
>> synonyms:
>>
>>
>> cart, shopping, market, trolley
>>
>> so cart is a synonym of shopping. How should I define my synonyms.txt file
>> so that it will understand that cart is a synonym of "shopping cart"?
>>
>>
>


-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics


 


Re: paging vs streaming. spawn from (Processing a lot of results in Solr)

2013-07-27 Thread Mikhail Khludnev
Otis,
You gave links to 'deep paging' when I asked about response streaming.
Let me understand. From my POV, deep paging is a special case for regular
search scenarios. We definitely need it in Solr. However, if we are talking
about data-analytics-like problems, where we need to select an "endless"
stream of responses (or store them in a file as Roman did), 'deep paging' is
a suboptimal hack.
What's your vision on this?
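
For readers of the archive, the 'deep paging' pattern in question is
basically the loop below (a SolrJ sketch with a placeholder core URL, using
addSortField from the 3.x/4.x API). Note how start keeps growing, so each
page re-ranks start+rows documents - which is exactly why it is a poor fit
for pulling an endless stream:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

// Classic start/rows paging: fine for navigation, increasingly expensive for exports.
public class DeepPaging {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1"); // placeholder URL
        int rows = 1000;
        for (int start = 0; ; start += rows) {
            SolrQuery q = new SolrQuery("*:*");
            q.setStart(start).setRows(rows);
            q.addSortField("id", SolrQuery.ORDER.asc);   // a stable sort is required for paging
            QueryResponse rsp = solr.query(q);
            if (rsp.getResults().isEmpty()) {
                break;
            }
            for (SolrDocument doc : rsp.getResults()) {
                // process(doc);
            }
        }
    }
}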


Re: processing documents in solr

2013-07-27 Thread Roman Chyla
Dear list,
I've written a special processor exactly for this kind of operation:

https://github.com/romanchyla/montysolr/tree/master/contrib/adsabs/src/java/org/apache/solr/handler/batch

This is how we use it
http://labs.adsabs.harvard.edu/trac/ads-invenio/wiki/SearchEngineBatch

It is capable of processing an index of 200 GB in a few minutes;
copying/streaming large amounts of data is normal.

If there is general interest, we can create a JIRA issue - but given my
current workload, it will take longer, and somebody else will *have to*
invest their time and energy in testing it, reporting, etc. Of course, feel
free to create the JIRA yourself or reuse the code - hopefully, you will
improve it and let me know ;-)

Roman
On 27 Jul 2013 01:03, "Joe Zhang"  wrote:

> Dear list:
>
> I have an ever-growing solr repository, and I need to process every single
> document to extract statistics. What would be a reasonable process that
> satisfies the following properties:
>
> - Exhaustive: I have to traverse every single document
> - Incremental: in other words, it has to allow me to divide and conquer ---
> if I have processed the first 20k docs, next time I can start with 20001.
>
> A simple "*:*" query would satisfy the 1st but not the 2nd property. In
> fact, given that the processing will take very long, and the repository
> keeps growing, it is not even clear that the exhaustiveness is achieved.
>
> I'm running solr 3.6.2 in a single-machine setting; no hadoop capability
> yet. But I guess the same issues still hold even if I have the solr cloud
> environment, right, say in each shard?
>
> Any help would be greatly appreciated.
>
> Joe
>


Re: paging vs streaming. spawn from (Processing a lot of results in Solr)

2013-07-27 Thread Roman Chyla
Mikhail,
If your solution gives lazy loading of Solr docs /and thus streaming of
huge result lists/ it should be a big YES!
Roman
On 27 Jul 2013 07:55, "Mikhail Khludnev"  wrote:

> Otis,
> You gave links to 'deep paging' when I asked about response streaming.
> Let me understand. From my POV, deep paging is a special case for regular
> search scenarios. We definitely need it in Solr. However, if we are talking
> about data analytic like problems, when we need to select an "endless"
> stream of responses (or store them in file as Roman did), 'deep paging' is
> a suboptimal hack.
> What's your vision on this?
>


RE: How to Make That Domains Should Be First?

2013-07-27 Thread Markus Jelsma
Hi - To make this work you'll need a homepage flag plus some specific hostname
analysis and function query boosting. I assume you're still using Nutch, so
detecting homepages is easy using NUTCH-1325. To actually get the homepage flag
into Solr you need to modify the indexer to ingest the HostDB, look for
HostDatum values in the reducer and set the homepage flag there. You can also
modify the CrawlDB update tool to read the HostDB so you'll have the homepage
flag in your CrawlDatums.

In Solr you need some analysis on the host field: split it on dots or make
n-grams. Then, using function queries, you can conditionally check for the
existence of the homepage flag and, if so, do a conditional query using the
user's search terms. If you set the operator to AND you'll make sure the
homepage only comes at position one if the user only types terms that occur in
the host field. So `wiki spain` won't boost the homepage at all.

Relying on URL length would not be a good idea because it doesn't allow for
longer hostnames, or for redirects when a homepage is not at /.

https://issues.apache.org/jira/browse/NUTCH-1325
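
As a rough illustration of the query side only (the field names "host" and
"homepage", the boost weight, and the edismax/bq wiring are my assumptions
here, not an exact recipe), it could end up looking something like this
SolrJ sketch:

import org.apache.solr.client.solrj.SolrQuery;

// Boost documents that are flagged as homepages AND whose host field matches
// ALL of the user's terms; "wiki spain" then gets no homepage boost at all.
public class HomepageBoostQuery {
    public static void main(String[] args) {
        SolrQuery q = new SolrQuery("wikipedia");
        q.set("defType", "edismax");
        q.set("qf", "title^2 content host");
        // The bq clause only matches (and boosts) when both parts match.
        q.set("bq", "(+homepage:true +_query_:\"{!edismax qf=host mm=100% v=$q}\")^10");
        System.out.println(q);
    }
}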
 
-Original message-
> From:Furkan KAMACI 
> Sent: Friday 26th July 2013 18:11
> To: solr-user@lucene.apache.org
> Subject: How to Make That Domains Should Be First?
> 
> When I search for wikipedia, the home page of Wikipedia is not the first result:
> 
> http://www.wikipedia.org/
> 
> The first result is this:
> 
> http://en.wikipedia.org/wiki/Spain
> 
> How can I make the domains of web sites come first in SolrCloud? (I
> want something like grouping on domains and boosting by URL length.)
> 


Re: problems about solr replication in 4.3

2013-07-27 Thread Erick Erickson
Well, a full import is going to re-import everything in the database, and the
presumption is that each and every document will be replaced (because
presumably your uniqueKey is the same). So every document
will be deleted and re-added, and essentially you'll get a completely
new index every time.

In 3.6 are you sure you weren't doing a delta query?

Best
Erick

On Thu, Jul 25, 2013 at 9:57 PM, xiaoqi  wrote:
> Thank you very much for replying.
>
> In fact, we traced the problem: when the master builds its index, it first
> cleans its own index. The slave syncs its index every minute, so it ends up
> destroying its own index folder.
>
> By the way, we build the index using
> dataimport0?command=full-import&clean=false,
> dataimport1?command=full-import&clean=false,
> dataimport2?command=full-import&clean=false .
>
> When we used Solr 3.6 there was no problem; it never deleted the index first.
>
> Does Solr 4 need any special configuration?
>
> Thanks a lot.
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/problems-about-solr-replication-in-4-3-tp4079665p4080480.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: SolrCloud 4.3.1 - "Failure to open existing log file (non fatal)" errors under high load

2013-07-27 Thread Erick Erickson
What is your autocommit limit? Is it possible that your transaction
logs are simply getting too large? tlogs are truncated whenever
you do a hard commit (autocommit), with openSearcher either
true or false - it doesn't matter.

FWIW,
Erick

On Fri, Jul 26, 2013 at 12:56 AM, Tim Vaillancourt  
wrote:
> Thanks Shawn and Yonik!
>
> Yonik: I noticed this error appears to be fairly trivial, but it is not
> appearing after a previous crash. Every time I run this high-volume test
> that produced my stack trace, I zero out the logs, Solr data and Zookeeper
> data and start over from scratch with a brand new collection and zero'd out
> logs.
>
> The test is mostly high volume (2000-4000 updates/sec) and at the start the
> SolrCloud runs decently for a good 20-60~ minutes, no errors in the logs at
> all. Then that stack trace occurs on all 3 nodes (staggered), I immediately
> get some replica down messages and then some "cannot connect" errors to all
> other cluster nodes, who have all crashed the same way. The tlog error could
> be a symptom of the problem of running out of threads perhaps.
>
> Shawn: thanks so much for sharing those details! Yes, they seem to be nice
> servers, for sure - I don't get to touch/see them but they're fast! I'll
> look into firmwares for sure and will try again after updating them. These
> Solr instances are not-bare metal and are actually KVM VMs so that's another
> layer to look into, although it is consistent between the two clusters.
>
> I am not currently increasing the 'nofiles' ulimit to above default like you
> are, but does Solr use 10,000+ file handles? It won't hurt to try it I guess
> :). To rule out Java 7, I'll probably also try Jetty 8 and Java 1.6 as an
> experiment as well.
>
> Thanks!
>
> Tim
>
>
> On 25/07/13 05:55 PM, Yonik Seeley wrote:
>>
>> On Thu, Jul 25, 2013 at 7:44 PM, Tim Vaillancourt
>> wrote:
>>>
>>> "ERROR [2013-07-25 19:34:24.264] [org.apache.solr.common.SolrException]
>>> Failure to open existing log file (non fatal)
>>>
>> That itself isn't necessarily a problem (and why it says "non fatal")
>> - it just means that most likely a transaction log file was
>> truncated from a previous crash.  It may be unrelated to the other
>> issues you are seeing.
>>
>> -Yonik
>> http://lucidworks.com


Re: Solr-4663 - Alternatives to use same data dir in different cores for optimal cache performance

2013-07-27 Thread Erick Erickson
You can certainly have multiple Solrs pointing to the same
underlying physical index if (and only if) you absolutely
guarantee that only one Solr will write to the index at a
time.

But I'm not sure whether this is premature optimization or not. The problem
is that your multiple Solrs are eating up the same physical
memory, so I'm not quite sure whether your different query
characteristics are best served by multiple cores or not.

Have you measured improvements with your proposed
architecture?

Best
Erick

On Fri, Jul 26, 2013 at 3:23 AM, Dominik Siebel  wrote:
> Hi,
>
> I just found SOLR-4663 being patched in the latest update I did.
> Does anyone know any other solution to use ONE physical index for various
> purposes?
>
> Why? I would like to use different solrconfig.xmls in terms of cache sizes,
> result window size, etc. per business case for optimal performance, while
> relying on the same data.
> This is due to the fact the queries are mostly completely different in
> structure and result size and we only have one unified search index
> (indexing performance).
> Any suggestions (besides replicating the index to another core on the same
> machine, of course ;) )?
>
>
> Cheers!
> Dom


Re: Querying a specific core in solr cloud

2013-07-27 Thread Erick Erickson
Not quite sure what's happening here. It would be interesting to
see whether the requests are actually going to the right IP, by
tailing out the logs.

It _may_ be that the &distrib=false isn't honored if there is no core
on the target machine (I haven't looked at the code). To test that,
go ahead and tail out the log on ip1 (in your first scenario) and then
send the query to ip2.

Actually, I'm asking you to do an experiment for me ...

distrib=false _may_ just be going to the first core specified that
can be found.
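
For reference, the kind of request being tested is easy to script - a SolrJ
sketch, where the host, port and core name from the scenario quoted below
are placeholders:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

// Query one specific node/core with distrib=false and report how many docs it sees.
public class NonDistribQuery {
    public static void main(String[] args) throws Exception {
        HttpSolrServer core = new HttpSolrServer("http://ip1:8983/solr/x"); // placeholder node/core
        SolrQuery q = new SolrQuery("*:*");
        q.set("distrib", "false");  // ask only the core this URL points at
        long found = core.query(q).getResults().getNumFound();
        System.out.println("numFound on this node: " + found);
    }
}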

Best
Erick

On Fri, Jul 26, 2013 at 3:41 AM, vicky desai  wrote:
> Hi Erick,
>
> First of all, sorry for the late reply.
>
> The scenario is as follows
> 1. Create a solr set up on two machines say (ip1 and ip2) with shard=1 and
> external zoo-keeper
> 2. Now if I create a core x on the machine with ip1 only and use the queries
> http://ip1:port1/solr/x/select?q=*:*&distrib=false
> http://ip2:port2/solr/x/select?q=*:*&distrib=false
>
> I get the same result, that is, the docs are visible. However, the core is
> actually not on the instance with ip2, so I was expecting that query to fail.
>
> 3. Now if I create the core on machine 2 as well and then hit those two
> queries, the second query gives me a response of 0 until it comes in sync
> with ip1. This behaviour is as expected.
>
> Please correct me if my expectations are wrong, and thanks for all the help
> provided until now.
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Querying-a-specific-core-in-solr-cloud-tp4079964p4080528.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: Sending shard requests to all replicas

2013-07-27 Thread Erick Erickson
This has been suggested, but so far it's not been implemented
as far as I know.

I'm curious though, how many shards are you dealing with? I
wonder if it would be a better idea to try to figure out _why_
you so often have a slow shard and whether the problem could
be cured with, say, better warming queries on the shards...

Best
Erick

On Fri, Jul 26, 2013 at 8:23 AM, Isaac Hebsh  wrote:
> Hi!
>
> When SolrClound executes a query, it creates shard requests, which is sent
> to one replica of each shard. Total QTime is determined by the slowest
> shard response (plus some extra time). [For simplicity, let's assume that
> no stored fields are requested.]
>
> I suffer from a situation where in every query, some shards are much slower
> than others.
>
> We might consider a different approach, which sends the shard request to
> > *ALL* replicas of each shard. Solr will continue when responses have arrived
> > from at least one replica of each shard.
>
> Of course, the amount of work that is wasted is big (multiplied by
> replicationFactor), but in my case, there are very few concurrent queries,
> and the most important performance is the qtime. Such a solution might
> improve qtime significantly.
>
>
> > Has anyone tried this before?
> Any tip from where should I start in the code?


Re: processing documents in solr

2013-07-27 Thread Joe Zhang
Thanks for sharing, Roman. I'll look into your code.

One more thought on your suggestion, Shawn. In fact, for the id, we need
more than "unique" and "rangeable"; we also need some sense of atomic
values. Your approach might run into risk with a text-based id field, say:

the id/key has values 'a', 'c', 'f', 'g', and our pagesize is 2. Your
suggestion would work fine. But with newly added documents, there is no
guarantee that they are not going to use the key value 'b'. And this new
document would be missed in your algorithm, right?


On Sat, Jul 27, 2013 at 5:32 AM, Roman Chyla  wrote:

> Dear list,
> I've written a special processor exactly for this kind of operation:
>
>
> https://github.com/romanchyla/montysolr/tree/master/contrib/adsabs/src/java/org/apache/solr/handler/batch
>
> This is how we use it
> http://labs.adsabs.harvard.edu/trac/ads-invenio/wiki/SearchEngineBatch
>
> It is capable of processing an index of 200 GB in a few minutes;
> copying/streaming large amounts of data is normal.
>
> If there is general interest, we can create jira issue - but given my
> current workload time, it will take longer and also somebody else will
> *have to* invest their time and energy in testing it, reporting, etc. Of
> course, feel free to create the jira yourself or reuse the code -
> hopefully, you will improve it and let me know ;-)
>
> Roman
> On 27 Jul 2013 01:03, "Joe Zhang"  wrote:
>
> > Dear list:
> >
> > I have an ever-growing solr repository, and I need to process every
> single
> > document to extract statistics. What would be a reasonable process that
> > satisfies the following properties:
> >
> > - Exhaustive: I have to traverse every single document
> > - Incremental: in other words, it has to allow me to divide and conquer
> ---
> > if I have processed the first 20k docs, next time I can start with 20001.
> >
> > A simple "*:*" query would satisfy the 1st but not the 2nd property. In
> > fact, given that the processing will take very long, and the repository
> > keeps growing, it is not even clear that the exhaustiveness is achieved.
> >
> > I'm running solr 3.6.2 in a single-machine setting; no hadoop capability
> > yet. But I guess the same issues still hold even if I have the solr cloud
> > environment, right, say in each shard?
> >
> > Any help would be greatly appreciated.
> >
> > Joe
> >
>


Re: processing documents in solr

2013-07-27 Thread Shawn Heisey
On 7/27/2013 11:17 AM, Joe Zhang wrote:
> Thanks for sharing, Roman. I'll look into your code.
> 
> One more thought on your suggestion, Shawn. In fact, for the id, we need
> more than "unique" and "rangeable"; we also need some sense of atomic
> values. Your approach might run into risk with a text-based id field, say:
> 
> the id/key has values 'a', 'c', 'f', 'g', and our pagesize is 2. Your
> suggestion would work fine. But with newly added documents, there is no
> guarantee that they are not going to use the key value 'b'. And this new
> document would be missed in your algorithm, right?

That's why I said that you would either have to not update the index or
ensure that (in your example) a 'b' document never gets added.  Because
you can't make that kind of guarantee in most situations, not updating
the index is safer.
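
To make the range-walk concrete, here is roughly what I have in mind in
SolrJ terms (the field name "id" and the core URL are assumptions; the
mixed-bracket range syntax is Solr 4.x - on 3.6 use an inclusive range and
skip the first document of each page):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.util.ClientUtils;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrDocumentList;

// Walk the whole index in id order: always ask for ids greater than the last
// one seen, and remember where you stopped so the next run can resume there.
public class IdRangeWalk {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");
        String lastId = "";                                  // persist between runs for incremental work
        while (true) {
            SolrQuery q = new SolrQuery("*:*");
            if (!lastId.isEmpty()) {
                q.addFilterQuery("id:{" + ClientUtils.escapeQueryChars(lastId) + " TO *]");
            }
            q.addSortField("id", SolrQuery.ORDER.asc);
            q.setRows(1000);
            SolrDocumentList page = solr.query(q).getResults();
            if (page.isEmpty()) {
                break;                                       // caught up with the index
            }
            for (SolrDocument doc : page) {
                // extractStatistics(doc);                   // processing goes here
                lastId = (String) doc.getFieldValue("id");
            }
        }
    }
}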

Thanks,
Shawn



Re: processing documents in solr

2013-07-27 Thread Joe Zhang
I have a constantly growing index, so not updating the index isn't
practical...

Going back to the beginning of this thread: when we use the vanilla
"*:*"+pagination approach, would the ordering of documents remain stable?
The index is dynamic: updates/insertions only, no deletions.


On Sat, Jul 27, 2013 at 10:28 AM, Shawn Heisey  wrote:

> On 7/27/2013 11:17 AM, Joe Zhang wrote:
> > Thanks for sharing, Roman. I'll look into your code.
> >
> > One more thought on your suggestion, Shawn. In fact, for the id, we need
> > more than "unique" and "rangeable"; we also need some sense of atomic
> > values. Your approach might run into risk with a text-based id field,
> say:
> >
> > the id/key has values 'a', 'c', 'f', 'g', and our pagesize is 2. Your
> > suggestion would work fine. But with newly added documents, there is no
> > guarantee that they are not going to use the key value 'b'. And this new
> > document would be missed in your algorithm, right?
>
> That's why I said that you would either have to not update the index or
> ensure that (in your example) a 'b' document never gets added.  Because
> you can't make that kind of guarantee in most situations, not updating
> the index is safer.
>
> Thanks,
> Shawn
>
>


Re: paging vs streaming. spawn from (Processing a lot of results in Solr)

2013-07-27 Thread Mikhail Khludnev
Roman,

Let me briefly explain the design.

A special RequestParser stores the servlet output stream in the context:
https://github.com/m-khl/solr-patches/compare/streaming#L7R22

Then a special component injects a special PostFilter/DelegatingCollector which
writes right into the output:
https://github.com/m-khl/solr-patches/compare/streaming#L2R146

Here is how it streams the doc; you can see it's lazy enough:
https://github.com/m-khl/solr-patches/compare/streaming#L2R181

I should mention that it disables later collectors:
https://github.com/m-khl/solr-patches/compare/streaming#L2R57
Hence, no facets with streaming yet, but also no extra memory consumption.

This test shows how it works:
https://github.com/m-khl/solr-patches/compare/streaming#L15R115

All the other code is there for distributed search.
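
For anyone who doesn't want to dig through the branch, the central trick
boils down to something like the toy collector below - written here against
Lucene 4.x's plain Collector API just to keep it self-contained; the actual
patch wires this in via PostFilter/DelegatingCollector and also loads the
stored fields:

import java.io.IOException;
import java.io.Writer;
import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.Scorer;

// Every matching docid is written straight to the output instead of being
// buffered into a DocList, so memory use stays flat no matter how many hits.
public class StreamingCollector extends Collector {

    private final Writer out;
    private int docBase;

    public StreamingCollector(Writer out) { this.out = out; }

    @Override public void setScorer(Scorer scorer) {}                   // scores not needed
    @Override public void setNextReader(AtomicReaderContext ctx) { docBase = ctx.docBase; }
    @Override public boolean acceptsDocsOutOfOrder() { return true; }

    @Override
    public void collect(int doc) throws IOException {
        out.write(Integer.toString(docBase + doc));                     // stream, don't buffer
        out.write('\n');
    }
}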



On Sat, Jul 27, 2013 at 4:44 PM, Roman Chyla  wrote:

> Mikhail,
> If your solution gives lazy loading of solr docs /and thus streaming of
> huge result lists/ it should be big YES!
> Roman
> On 27 Jul 2013 07:55, "Mikhail Khludnev" 
> wrote:
>
> > Otis,
> > You gave links to 'deep paging' when I asked about response streaming.
> > Let me understand. From my POV, deep paging is a special case for regular
> > search scenarios. We definitely need it in Solr. However, if we are
> talking
> > about data analytic like problems, when we need to select an "endless"
> > stream of responses (or store them in file as Roman did), 'deep paging'
> is
> > a suboptimal hack.
> > What's your vision on this?
> >
>



-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics


 


Re: processing documents in solr

2013-07-27 Thread Shawn Heisey
On 7/27/2013 11:38 AM, Joe Zhang wrote:
> I have a constantly growing index, so not updating the index can't be
> practical...
> 
> Going back to the beginning of this thread: when we use the vanilla
> "*:*"+pagination approach, would the ordering of documents remain stable?
>  the index is dynamic: update/insertion only, no deletion.

If you use a sort parameter with pagination, then you have stable
ordering, unless, as described with the 'b' example, a new document gets
inserted at a position in the sort sequence that's before the current
result page.

One thing that you could do is make a copy of your index, set up a
separate Solr installation that's not getting updates, and use that for
your inspection.

Thanks,
Shawn



Re: SolrCloud 4.3.1 - "Failure to open existing log file (non fatal)" errors under high load

2013-07-27 Thread Tim Vaillancourt

Thanks for the reply Erick,

Hard Commit - 15000ms, openSearcher=false
Soft Commit - 1000ms, openSearcher=true

15sec hard commit was sort of a guess, I could try a smaller number. 
When you say "getting too large" what limit do you think it would be 
hitting: a ulimit (nofiles), disk space, number of changes, a limit in 
Solr itself?


By my math there would be 15 tlogs max per core, but I don't really know
how it all works - if someone could fill me in or point me somewhere, that
would be great.


Cheers,

Tim

On 27/07/13 07:57 AM, Erick Erickson wrote:

What is your autocommit limit? Is it possible that your transaction
logs are simply getting too large? tlogs are truncated whenever
you do a hard commit (autocommit) with openSearcher either
true for false it doesn't matter.

FWIW,
Erick

On Fri, Jul 26, 2013 at 12:56 AM, Tim Vaillancourt  
wrote:

Thanks Shawn and Yonik!

Yonik: I noticed this error appears to be fairly trivial, but it is not
appearing after a previous crash. Every time I run this high-volume test
that produced my stack trace, I zero out the logs, Solr data and Zookeeper
data and start over from scratch with a brand new collection and zero'd out
logs.

The test is mostly high volume (2000-4000 updates/sec) and at the start the
SolrCloud runs decently for a good 20-60~ minutes, no errors in the logs at
all. Then that stack trace occurs on all 3 nodes (staggered), I immediately
get some replica down messages and then some "cannot connect" errors to all
other cluster nodes, who have all crashed the same way. The tlog error could
be a symptom of the problem of running out of threads perhaps.

Shawn: thanks so much for sharing those details! Yes, they seem to be nice
servers, for sure - I don't get to touch/see them but they're fast! I'll
look into firmwares for sure and will try again after updating them. These
Solr instances are not-bare metal and are actually KVM VMs so that's another
layer to look into, although it is consistent between the two clusters.

I am not currently increasing the 'nofiles' ulimit to above default like you
are, but does Solr use 10,000+ file handles? It won't hurt to try it I guess
:). To rule out Java 7, I'll probably also try Jetty 8 and Java 1.6 as an
experiment as well.

Thanks!

Tim


On 25/07/13 05:55 PM, Yonik Seeley wrote:

On Thu, Jul 25, 2013 at 7:44 PM, Tim Vaillancourt
wrote:

"ERROR [2013-07-25 19:34:24.264] [org.apache.solr.common.SolrException]
Failure to open existing log file (non fatal)


That itself isn't necessarily a problem (and why it says "non fatal")
- it just means that most likely a transaction log file was
truncated from a previous crash.  It may be unrelated to the other
issues you are seeing.

-Yonik
http://lucidworks.com


Early Access Release #4 for Solr 4.x Deep Dive book is now available for download on Lulu.com

2013-07-27 Thread Jack Krupansky
Okay, it’s hot off the e-presses: Solr 4.x Deep Dive, Early Access Release #4 
is now available for purchase and download as an e-book for $9.99 on Lulu.com 
at:

http://www.lulu.com/shop/jack-krupansky/solr-4x-deep-dive-early-access-release-1/ebook/product-21079719.html


(That link says “1”, but it apparently correctly redirects to EAR #4.)

My recent blog posts over the past week detailed the changes from EAR#3, but 
the main focus was completion of coverage of Solr 4.4 features, in addition to 
general improvements. I didn’t expect to publish this EAR until next week, but 
Solr 4.4 came out this week and I was essentially done with 4.4 coverage 
anyway. I’ll use the extra week so that I can spend next three full weeks 
trying to focus on a little more deeper coverage for EAR#5 on August 16, 2013. 
Topic(s) to be decided.

See:
http://basetechnology.blogspot.com/

If you have purchased EAR#1 or #2 or #3, there is no need to rush out and pick 
up EAR#4. I mean, the technical content changes were relatively modest (69 new 
pages), and EAR#5 will be out in another two weeks anyway. That said, EAR#4 is 
a significant improvement over EAR#1 and EAR#2 and EAR#3.

Thanks for your ongoing support!

-- Jack Krupansky

Re: paging vs streaming. spawn from (Processing a lot of results in Solr)

2013-07-27 Thread Roman Chyla
Hi Mikhail,

I can see it is lazy-loading, but I can't judge how complex it becomes
(presumably the filter dispatching mechanism is also doing other things -
it is there not only for streaming).

Let me explain better what I found when I dug inside Solr: documents
(results of the query) are loaded before they are passed into a writer - so
the writers expect to encounter the Solr documents, but these
documents were loaded by one of the components before rendering them - so
it is kinda 'hard-coded'. But if Solr were NOT loading these docs before
passing them to a writer, the writer could load them instead (hence lazy
loading, but the difference is in numbers - it could deal with hundreds of
thousands of docs instead of a few thousand now).

I see one crucial point: this could work without any new handler/servlet -
Solr would just gain a new parameter, something like 'lazy=true' ;) and
people could use whatever 'wt' they did before.

Disclaimer: I don't know whether that would break other stuff; I only know
that I am using the same idea to dump what I need without breaking things
(so far... ;-)) - but obviously, I didn't want to patch Solr core.

roman


On Sat, Jul 27, 2013 at 3:52 PM, Mikhail Khludnev <
mkhlud...@griddynamics.com> wrote:

> Roman,
>
> Let me briefly explain  the design
>
> special RequestParser stores servlet output stream into the context
> https://github.com/m-khl/solr-patches/compare/streaming#L7R22
>
> then special component injects special PostFilter/DelegatingCollector which
> writes right into output
> https://github.com/m-khl/solr-patches/compare/streaming#L2R146
>
> here is how it streams the doc, you see it's lazy enough
> https://github.com/m-khl/solr-patches/compare/streaming#L2R181
>
> I mention that it disables later collectors
> https://github.com/m-khl/solr-patches/compare/streaming#L2R57
> hence, no facets with streaming, yet as well as memory consumption.
>
> This test shows how it works
> https://github.com/m-khl/solr-patches/compare/streaming#L15R115
>
> all other code purposed for distributed search.
>
>
>
> On Sat, Jul 27, 2013 at 4:44 PM, Roman Chyla 
> wrote:
>
> > Mikhail,
> > If your solution gives lazy loading of solr docs /and thus streaming of
> > huge result lists/ it should be big YES!
> > Roman
> > On 27 Jul 2013 07:55, "Mikhail Khludnev" 
> > wrote:
> >
> > > Otis,
> > > You gave links to 'deep paging' when I asked about response streaming.
> > > Let me understand. From my POV, deep paging is a special case for
> regular
> > > search scenarios. We definitely need it in Solr. However, if we are
> > talking
> > > about data analytic like problems, when we need to select an "endless"
> > > stream of responses (or store them in file as Roman did), 'deep paging'
> > is
> > > a suboptimal hack.
> > > What's your vision on this?
> > >
> >
>
>
>
> --
> Sincerely yours
> Mikhail Khludnev
> Principal Engineer,
> Grid Dynamics
>
> 
>  
>


Re: processing documents in solr

2013-07-27 Thread Roman Chyla
On Sat, Jul 27, 2013 at 4:17 PM, Shawn Heisey  wrote:

> On 7/27/2013 11:38 AM, Joe Zhang wrote:
> > I have a constantly growing index, so not updating the index can't be
> > practical...
> >
> > Going back to the beginning of this thread: when we use the vanilla
> > "*:*"+pagination approach, would the ordering of documents remain stable?
> >  the index is dynamic: update/insertion only, no deletion.
>
> If you use a sort parameter with pagination, then you have stable
> ordering, unless as described with the 'b' example, a new document gets
> inserted into a position in the sort sequence that's before the current
> result page.
>
> One thing that you could do is make a copy of your index, set up a
> separate Solr installation that's not getting updates, and use that for
> your inspection.
>

Hi Shawn,

I guess if something prevents the current searcher from being recycled
(e.g. incrementing its ref count), it would be possible to re-use it for
the pagination - then the consumer would stay tied to the reader and the
order would be stable (seeing the same data), but there probably is no
mechanism for this (?), nor would it be very wise to have such a mechanism
(?).

roman


>
> Thanks,
> Shawn
>
>


Re: SolrCloud 4.3.1 - "Failure to open existing log file (non fatal)" errors under high load

2013-07-27 Thread Jack Krupansky
No hard numbers, but the general guidance is that you should set your hard 
commit interval to match your expectations for how quickly nodes should come 
up if they need to be restarted. Specifically, a hard commit assures that 
all changes have been committed to disk and are ready for immediate access 
on restart, but any and all soft commit changes since the last hard commit 
must be "replayed" (reexecuted) on restart of a node.


How long does it take to replay the changes in the update log? No firm 
numbers, but treat it as if all of those uncommitted updates had to be 
resent and reprocessed by Solr. It's probably faster than that, but you get 
the picture.


I would suggest thinking in terms of minutes rather than seconds for hard 
commits: 5, 10, 15, 20, or 30 minutes.


Hard commits may result in kicking off segment merges, so too rapid a rate 
of segment creation might cause problems or at least be counterproductive.


So, instead of 15 seconds, try 15 minutes.

OTOH, if you really need to handle 4,000 updates a second... you are clearly 
in "uncharted territory" and should expect to do some heavy-duty 
trial-and-error tuning on your own.


-- Jack Krupansky

-Original Message- 
From: Tim Vaillancourt

Sent: Saturday, July 27, 2013 4:21 PM
To: solr-user@lucene.apache.org
Subject: Re: SolrCloud 4.3.1 - "Failure to open existing log file (non 
fatal)" errors under high load


Thanks for the reply Erick,

Hard Commit - 15000ms, openSearcher=false
Soft Commit - 1000ms, openSearcher=true

15sec hard commit was sort of a guess, I could try a smaller number.
When you say "getting too large" what limit do you think it would be
hitting: a ulimit (nofiles), disk space, number of changes, a limit in
Solr itself?

By my math there would be 15 tlogs max per core, but I don't really know
how it all works if someone could fill me in/point me somewhere.

Cheers,

Tim

On 27/07/13 07:57 AM, Erick Erickson wrote:

What is your autocommit limit? Is it possible that your transaction
logs are simply getting too large? tlogs are truncated whenever
you do a hard commit (autocommit) with openSearcher either
true for false it doesn't matter.

FWIW,
Erick

On Fri, Jul 26, 2013 at 12:56 AM, Tim Vaillancourt 
wrote:

Thanks Shawn and Yonik!

Yonik: I noticed this error appears to be fairly trivial, but it is not
appearing after a previous crash. Every time I run this high-volume test
that produced my stack trace, I zero out the logs, Solr data and 
Zookeeper
data and start over from scratch with a brand new collection and zero'd 
out

logs.

The test is mostly high volume (2000-4000 updates/sec) and at the start 
the
SolrCloud runs decently for a good 20-60~ minutes, no errors in the logs 
at
all. Then that stack trace occurs on all 3 nodes (staggered), I 
immediately
get some replica down messages and then some "cannot connect" errors to 
all
other cluster nodes, who have all crashed the same way. The tlog error 
could

be a symptom of the problem of running out of threads perhaps.

Shawn: thanks so much for sharing those details! Yes, they seem to be 
nice

servers, for sure - I don't get to touch/see them but they're fast! I'll
look into firmwares for sure and will try again after updating them. 
These
Solr instances are not-bare metal and are actually KVM VMs so that's 
another

layer to look into, although it is consistent between the two clusters.

I am not currently increasing the 'nofiles' ulimit to above default like 
you
are, but does Solr use 10,000+ file handles? It won't hurt to try it I 
guess

:). To rule out Java 7, I'll probably also try Jetty 8 and Java 1.6 as an
experiment as well.

Thanks!

Tim


On 25/07/13 05:55 PM, Yonik Seeley wrote:

On Thu, Jul 25, 2013 at 7:44 PM, Tim Vaillancourt
wrote:

"ERROR [2013-07-25 19:34:24.264] [org.apache.solr.common.SolrException]
Failure to open existing log file (non fatal)


That itself isn't necessarily a problem (and why it says "non fatal")
- it just means that most likely a transaction log file was
truncated from a previous crash.  It may be unrelated to the other
issues you are seeing.

-Yonik
http://lucidworks.com 




Re: paging vs streaming. spawn from (Processing a lot of results in Solr)

2013-07-27 Thread Yonik Seeley
On Sat, Jul 27, 2013 at 4:30 PM, Roman Chyla  wrote:
> Let me just explain better what I found when I dug inside solr: documents
> (results of the query) are loaded before they are passed into a writer - so
> the writers are expecting to encounter the solr documents, but these
> documents were loaded by one of the components before rendering them

Hmmm, are you saying that it looks like documents are not being streamed?
Solr was designed to stream documents from a single server from day 1...
currently all that is collected up-front is the list of internal
docids (an int[]) and the stored fields are loaded and streamed back
one by one.

Of course it's certainly possible that someone introduced a bug, so we
should investigate if you think you see non-streaming action from a
single server.  Distributed is a different can of worms ;-)

-Yonik
http://lucidworks.com


Re: paging vs streaming. spawn from (Processing a lot of results in Solr)

2013-07-27 Thread Mikhail Khludnev
Hello,

Please find below


> Let me just explain better what I found when I dug inside solr: documents
> (results of the query) are loaded before they are passed into a writer - so
> the writers are expecting to encounter the solr documents, but these
> documents were loaded by one of the components before rendering them - so
> it is kinda 'hard-coded'.

There is code at
https://github.com/apache/lucene-solr/blob/trunk/solr/core/src/java/org/apache/solr/handler/component/QueryComponent.java#L445
which pulls documents into the document cache.
To achieve your goal you can try removing the document cache, or disabling
lazy field loading.


> But if solr was NOT loading these docs before
> passing them to a writer, writer can load them instead (hence lazy loading,
> but the difference is in numbers - it could deal with hundreds of thousands
> of docs, instead of few thousands now).
>

Anyway, even if the writer pulls docs one by one, it doesn't allow streaming
a billion of them. Solr writes out a DocList, which is really problematic even
in deep-paging scenarios.


>
>
> roman
>
>
> On Sat, Jul 27, 2013 at 3:52 PM, Mikhail Khludnev <
> mkhlud...@griddynamics.com> wrote:
>
> > Roman,
> >
> > Let me briefly explain  the design
> >
> > special RequestParser stores servlet output stream into the context
> > https://github.com/m-khl/solr-patches/compare/streaming#L7R22
> >
> > then special component injects special PostFilter/DelegatingCollector
> which
> > writes right into output
> > https://github.com/m-khl/solr-patches/compare/streaming#L2R146
> >
> > here is how it streams the doc, you see it's lazy enough
> > https://github.com/m-khl/solr-patches/compare/streaming#L2R181
> >
> > I mention that it disables later collectors
> > https://github.com/m-khl/solr-patches/compare/streaming#L2R57
> > hence, no facets with streaming, yet as well as memory consumption.
> >
> > This test shows how it works
> > https://github.com/m-khl/solr-patches/compare/streaming#L15R115
> >
> > all other code purposed for distributed search.
> >
> >
> >
> > On Sat, Jul 27, 2013 at 4:44 PM, Roman Chyla 
> > wrote:
> >
> > > Mikhail,
> > > If your solution gives lazy loading of solr docs /and thus streaming of
> > > huge result lists/ it should be big YES!
> > > Roman
> > > On 27 Jul 2013 07:55, "Mikhail Khludnev" 
> > > wrote:
> > >
> > > > Otis,
> > > > You gave links to 'deep paging' when I asked about response
> streaming.
> > > > Let me understand. From my POV, deep paging is a special case for
> > regular
> > > > search scenarios. We definitely need it in Solr. However, if we are
> > > talking
> > > > about data analytic like problems, when we need to select an
> "endless"
> > > > stream of responses (or store them in file as Roman did), 'deep
> paging'
> > > is
> > > > a suboptimal hack.
> > > > What's your vision on this?
> > > >
> > >
> >
> >
> >
> > --
> > Sincerely yours
> > Mikhail Khludnev
> > Principal Engineer,
> > Grid Dynamics
> >
> > 
> >  
> >
>



-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics


 


Re: SolrCloud 4.3.1 - "Failure to open existing log file (non fatal)" errors under high load

2013-07-27 Thread Erick Erickson
Tim:

15 seconds isn't unreasonable, I was mostly wondering if it was hours.

Take a look at the size of the tlogs as you're indexing; you should see them
truncate every 15 seconds or so. There'll be a varying number of tlogs kept
around, although under heavy indexing I'd only expect 1 or 2 inactive ones;
the internal rule is that enough tlogs are kept around to
hold the last 100 docs.

There should only be 1 open tlog/core as I understand it. When a commit
happens (hard, openSearcher = true or false doesn't matter) the current
tlog is closed and a new one opened. Then some cleanup happens so there
are only enough tlogs kept around to hold 100 docs.

Strange; I'm kind of out of ideas.
Erick

On Sat, Jul 27, 2013 at 4:41 PM, Jack Krupansky  wrote:
> No hard numbers, but the general guidance is that you should set your hard
> commit interval to match your expectations for how quickly nodes should come
> up if they need to be restarted. Specifically, a hard commit assures that
> all changes have been committed to disk and are ready for immediate access
> on restart, but any and all soft commit changes since the last hard commit
> must be "replayed" (reexecuted) on restart of a node.
>
> How long does it take to replay the changes in the update log? No firm
> numbers, but treat it as if all of those uncommitted updates had to be
> resent and reprocessed by Solr. It's probably faster than that, but you get
> the picture.
>
> I would suggest thinking in terms of minutes rather than seconds for hard
> commits 5 minutes, 10, 15, 20, 30 minutes.
>
> Hard commits may result in kicking off segment merges, so too rapid a rate
> of segment creation might cause problems or at least be counterproductive.
>
> So, instead of 15 seconds, try 15 minutes.
>
> OTOH, if you really need to handle 4,000 update a seconds... you are clearly
> in "uncharted territory" and need to expect to need to do some heavy duty
> trial and error tuning on your own.
>
> -- Jack Krupansky
>
> -Original Message- From: Tim Vaillancourt
> Sent: Saturday, July 27, 2013 4:21 PM
> To: solr-user@lucene.apache.org
> Subject: Re: SolrCloud 4.3.1 - "Failure to open existing log file (non
> fatal)" errors under high load
>
>
> Thanks for the reply Erick,
>
> Hard Commit - 15000ms, openSearcher=false
> Soft Commit - 1000ms, openSearcher=true
>
> 15sec hard commit was sort of a guess, I could try a smaller number.
> When you say "getting too large" what limit do you think it would be
> hitting: a ulimit (nofiles), disk space, number of changes, a limit in
> Solr itself?
>
> By my math there would be 15 tlogs max per core, but I don't really know
> how it all works if someone could fill me in/point me somewhere.
>
> Cheers,
>
> Tim
>
> On 27/07/13 07:57 AM, Erick Erickson wrote:
>>
>> What is your autocommit limit? Is it possible that your transaction
>> logs are simply getting too large? tlogs are truncated whenever
>> you do a hard commit (autocommit) with openSearcher either
>> true for false it doesn't matter.
>>
>> FWIW,
>> Erick
>>
>> On Fri, Jul 26, 2013 at 12:56 AM, Tim Vaillancourt
>> wrote:
>>>
>>> Thanks Shawn and Yonik!
>>>
>>> Yonik: I noticed this error appears to be fairly trivial, but it is not
>>> appearing after a previous crash. Every time I run this high-volume test
>>> that produced my stack trace, I zero out the logs, Solr data and
>>> Zookeeper
>>> data and start over from scratch with a brand new collection and zero'd
>>> out
>>> logs.
>>>
>>> The test is mostly high volume (2000-4000 updates/sec) and at the start
>>> the
>>> SolrCloud runs decently for a good 20-60~ minutes, no errors in the logs
>>> at
>>> all. Then that stack trace occurs on all 3 nodes (staggered), I
>>> immediately
>>> get some replica down messages and then some "cannot connect" errors to
>>> all
>>> other cluster nodes, who have all crashed the same way. The tlog error
>>> could
>>> be a symptom of the problem of running out of threads perhaps.
>>>
>>> Shawn: thanks so much for sharing those details! Yes, they seem to be
>>> nice
>>> servers, for sure - I don't get to touch/see them but they're fast! I'll
>>> look into firmwares for sure and will try again after updating them.
>>> These
>>> Solr instances are not-bare metal and are actually KVM VMs so that's
>>> another
>>> layer to look into, although it is consistent between the two clusters.
>>>
>>> I am not currently increasing the 'nofiles' ulimit to above default like
>>> you
>>> are, but does Solr use 10,000+ file handles? It won't hurt to try it I
>>> guess
>>> :). To rule out Java 7, I'll probably also try Jetty 8 and Java 1.6 as an
>>> experiment as well.
>>>
>>> Thanks!
>>>
>>> Tim
>>>
>>>
>>> On 25/07/13 05:55 PM, Yonik Seeley wrote:

 On Thu, Jul 25, 2013 at 7:44 PM, Tim Vaillancourt
 wrote:
>
> "ERROR [2013-07-25 19:34:24.264] [org.apache.solr.common.SolrException]
> Failure to open existing log file (non fatal)
>
 That itself isn't necessarily a problem (

Re: paging vs streaming. spawn from (Processing a lot of results in Solr)

2013-07-27 Thread Yonik Seeley
On Sat, Jul 27, 2013 at 5:05 PM, Mikhail Khludnev
 wrote:
> anyway, even if writer pulls docs one by one, it doesn't allow to stream a
> billion of them. Solr writes out DocList, which is really problematic even
> in deep-paging scenarios.

Which part is problematic... the creation of the DocList (the search),
or its memory requirements (an int per doc)?

-Yonik
http://lucidworks.com


Re: Sending shard requests to all replicas

2013-07-27 Thread Isaac Hebsh
Hi Erick, thanks.

I have about 40 shards. repFactor=2.
Finding the cause of the slower shards is very interesting, and this is the
main approach we took.
Note that in every query it is a different shard that is the slowest. In 20%
of the queries, the slowest shard takes about 4 times longer than the average
shard qtime.
While the investigation continues - bearing in mind it might be the
virtualization / storage access / network / GC / ... - I thought that reducing
the effect of the slow shards might be a good (temporary or permanent) solution.

I thought it should be an almost trivial code change (for proving the
concept), shouldn't it?


On Sat, Jul 27, 2013 at 6:11 PM, Erick Erickson wrote:

> This has been suggested, but so far it's not been implemented
> as far as I know.
>
> I'm curious though, how many shards are you dealing with? I
> wonder if it would be a better idea to try to figure out _why_
> you so often have a slow shard and whether the problem could
> be cured with, say, better warming queries on the shards...
>
> Best
> Erick
>
> On Fri, Jul 26, 2013 at 8:23 AM, Isaac Hebsh 
> wrote:
> > Hi!
> >
> > When SolrClound executes a query, it creates shard requests, which is
> sent
> > to one replica of each shard. Total QTime is determined by the slowest
> > shard response (plus some extra time). [For simplicity, let's assume that
> > no stored fields are requested.]
> >
> > I suffer from a situation where in every query, some shards are much
> slower
> > than others.
> >
> > We might consider a different approach, which sends the shard request to
> > *ALL* replicas of each shard. Solr will continue when responses are got
> > from at least one replica of each shard.
> >
> > Of course, the amount of work that is wasted is big (multiplied by
> > replicationFactor), but in my case, there are very few concurrent
> queries,
> > and the most important performance is the qtime. Such a solution might
> > improve qtime significantly.
> >
> >
> > Did someone tried this before?
> > Any tip from where should I start in the code?
>


Re: SolrCloud 4.3.1 - "Failure to open existing log file (non fatal)" errors under high load

2013-07-27 Thread Tim Vaillancourt

Thanks Jack/Erick,

I don't know if this is true or not, but I've read there is a tlog per 
soft commit, which is then truncated by the hard commit. If this were 
true, a 15sec hard-commit with a 1sec soft-commit could generate around 
15~ tlogs, but I've never checked. I like Erick's scenario more if it is 
1 tlog/core though. I'll try to find out some more.



A few other tests/things I really should try for sanity are:
- Java 1.6 and Jetty 8: just to rule things out (wouldn't actually
launch this way).
- ulimit for 'nofiles': the default is pretty high, but why not?
- Monitoring the size and number of tlogs.


I'll be sure to share findings and really appreciate the help guys!


PS: This is asking a lot, but if anyone can take a look at that thread 
dump, or give me some pointers on what to look for in a 
stall/thread-pile up thread dump like this, I would really appreciate 
it. I'm quite weak at deciphering those (I use Thread Dump Analyzer) but 
I'm sure it would tell a lot.



Cheers,


Tim


On 27/07/13 02:24 PM, Erick Erickson wrote:

Tim:

15 seconds isn't unreasonable, I was mostly wondering if it was hours.

Take a look at the size of the tlogs as you're indexing, you should see them
truncate every 15 seconds or so. There'll be a varying number of tlogs kept
around, although under heavy indexing I'd only expect 1 or 2 inactive ones,
the internal number is that there'll be enough tlogs kept around to
hold 100 docs.

There should only be 1 open tlog/core as I understand it. When a commit
happens (hard, openSearcher = true or false doesn't matter) the current
tlog is closed and a new one opened. Then some cleanup happens so there
are only enough tlogs kept around to hold 100 docs.

Strange, Im kind of out of ideas.
Erick

On Sat, Jul 27, 2013 at 4:41 PM, Jack Krupansky  wrote:

No hard numbers, but the general guidance is that you should set your hard
commit interval to match your expectations for how quickly nodes should come
up if they need to be restarted. Specifically, a hard commit assures that
all changes have been committed to disk and are ready for immediate access
on restart, but any and all soft commit changes since the last hard commit
must be "replayed" (reexecuted) on restart of a node.

How long does it take to replay the changes in the update log? No firm
numbers, but treat it as if all of those uncommitted updates had to be
resent and reprocessed by Solr. It's probably faster than that, but you get
the picture.

I would suggest thinking in terms of minutes rather than seconds for hard
commits 5 minutes, 10, 15, 20, 30 minutes.

Hard commits may result in kicking off segment merges, so too rapid a rate
of segment creation might cause problems or at least be counterproductive.

So, instead of 15 seconds, try 15 minutes.

OTOH, if you really need to handle 4,000 update a seconds... you are clearly
in "uncharted territory" and need to expect to need to do some heavy duty
trial and error tuning on your own.

-- Jack Krupansky

-Original Message- From: Tim Vaillancourt
Sent: Saturday, July 27, 2013 4:21 PM
To: solr-user@lucene.apache.org
Subject: Re: SolrCloud 4.3.1 - "Failure to open existing log file (non
fatal)" errors under high load


Thanks for the reply Erick,

Hard Commit - 15000ms, openSearcher=false
Soft Commit - 1000ms, openSearcher=true

15sec hard commit was sort of a guess, I could try a smaller number.
When you say "getting too large" what limit do you think it would be
hitting: a ulimit (nofiles), disk space, number of changes, a limit in
Solr itself?

By my math there would be 15 tlogs max per core, but I don't really know
how it all works if someone could fill me in/point me somewhere.

Cheers,

Tim

On 27/07/13 07:57 AM, Erick Erickson wrote:

What is your autocommit limit? Is it possible that your transaction
logs are simply getting too large? tlogs are truncated whenever
you do a hard commit (autocommit) with openSearcher either
true for false it doesn't matter.

FWIW,
Erick

On Fri, Jul 26, 2013 at 12:56 AM, Tim Vaillancourt
wrote:

Thanks Shawn and Yonik!

Yonik: I noticed this error appears to be fairly trivial, but it is not
appearing after a previous crash. Every time I run this high-volume test
that produced my stack trace, I zero out the logs, Solr data and
Zookeeper
data and start over from scratch with a brand new collection and zero'd
out
logs.

The test is mostly high volume (2000-4000 updates/sec) and at the start
the
SolrCloud runs decently for a good 20-60~ minutes, no errors in the logs
at
all. Then that stack trace occurs on all 3 nodes (staggered), I
immediately
get some replica down messages and then some "cannot connect" errors to
all
other cluster nodes, who have all crashed the same way. The tlog error
could
be a symptom of the problem of running out of threads perhaps.

Shawn: thanks so much for sharing those details! Yes, they seem to be
nice
servers, for sure - I don't get to touch/see them but they're fast! I'll
look into firmwares for sure and

Re: Sending shard requests to all replicas

2013-07-27 Thread Shawn Heisey
On 7/27/2013 3:33 PM, Isaac Hebsh wrote:
> I have about 40 shards. repFactor=2.
> The cause of slower shards is very interesting, and this is the main
> approach we took.
> Note that in every query, it is another shard which is the slowest. In 20%
> of the queries, the slowest shard takes about 4 times more than the average
> shard qtime.
> While continuing investigation, remember it might be the virtualization /
> storage-access / network / gc /..., so I thought that reducing the effect
> of the slow shards might be a good (temporary or permanent) solution.

Virtualization is not the best approach for Solr.  Assuming you're
dealing with your own hardware and not something based in the cloud like
Amazon, you can get better results by running on bare metal and having
multiple shards per host.

Garbage collection is a very likely source of this problem.

http://wiki.apache.org/solr/SolrPerformanceProblems#GC_pause_problems

> I thought it should be an almost trivial code change (for proving the
> concept). Isn't it?

I have no idea what you're saying/asking here.  Can you clarify?

It seems to me that sending requests to all replicas would just increase
the overall load on the cluster, with no real benefit.

Thanks,
Shawn



Re: Solr 4.3.1 only accepts UTF-8 encoded queries?

2013-07-27 Thread Shawn Heisey
On 7/26/2013 2:03 PM, Gustav wrote:
> The problem here is that in my client's application, the query being encoded
> in ISO-8859-1 is a *must*. So, this is kind of a problem here.
> I just don't get how this encoding could work on queries in version 3.5, but
> it doesn't in 4.3.

I brought up the issue on the dev list.  Allowing a user to change the
default character set would cause problems for SolrCloud or distributed
search, because the requests generated by the server are UTF-8.

The responder did say that he could imagine all the code for a solution
that involves an input encoding parameter.  I filed SOLR-5082 to track it.

https://issues.apache.org/jira/browse/SOLR-5082
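
In the meantime, the usual client-side workaround is to transcode before
sending - a rough sketch (the legacy bytes and the URL are only illustrative):

import java.net.URLEncoder;

// Decode the client's ISO-8859-1 bytes into a Java String, then URL-encode the
// query as UTF-8, which is what Solr 4.x expects on the wire.
public class TranscodeQuery {
    public static void main(String[] args) throws Exception {
        byte[] legacyBytes = "müller".getBytes("ISO-8859-1");  // pretend this came from the legacy app
        String query = new String(legacyBytes, "ISO-8859-1");
        String url = "http://localhost:8983/solr/collection1/select?q="
                   + URLEncoder.encode(query, "UTF-8");
        System.out.println(url);
    }
}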

Thanks,
Shawn



Re: Sending shard requests to all replicas

2013-07-27 Thread Isaac Hebsh
Shawn, thank you for the tips.
I know the significant cons of virtualization, but I don't want to turn
this thread into a virtualization pros/cons discussion for the Solr(Cloud) case.

I just asked what minimal code change should be made, in order to
examine whether this is a possible solution or not. :)
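
For what it's worth, while prototyping, the effect can be approximated
entirely on the client side without touching Solr's internals: fire the same
distrib=false shard request at every replica of a shard and keep whichever
answers first. A rough SolrJ sketch, with placeholder replica URLs:

import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

// Prototype only: query every replica of one shard and keep the fastest response.
public class FirstReplicaWins {
    public static void main(String[] args) throws Exception {
        List<String> replicas = Arrays.asList(               // placeholder replica URLs
                "http://host1:8983/solr/collection1_shard1_replica1",
                "http://host2:8983/solr/collection1_shard1_replica2");
        ExecutorService pool = Executors.newFixedThreadPool(replicas.size());
        CompletionService<QueryResponse> cs =
                new ExecutorCompletionService<QueryResponse>(pool);
        for (final String url : replicas) {
            cs.submit(new Callable<QueryResponse>() {
                public QueryResponse call() throws Exception {
                    SolrQuery q = new SolrQuery("*:*");
                    q.set("distrib", "false");               // ask only this core
                    return new HttpSolrServer(url).query(q);
                }
            });
        }
        QueryResponse first = cs.take().get();               // first replica to answer wins
        System.out.println("numFound=" + first.getResults().getNumFound());
        pool.shutdownNow();                                  // abandon the slower requests
    }
}

Doing this inside Solr itself would obviously be the cleaner path, but a
client-side harness like this is enough to measure how much qtime it buys.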


On Sun, Jul 28, 2013 at 1:06 AM, Shawn Heisey  wrote:

> On 7/27/2013 3:33 PM, Isaac Hebsh wrote:
> > I have about 40 shards. repFactor=2.
> > The cause of slower shards is very interesting, and this is the main
> > approach we took.
> > Note that in every query, it is another shard which is the slowest. In
> 20%
> > of the queries, the slowest shard takes about 4 times more than the
> average
> > shard qtime.
> > While continuing investigation, remember it might be the virtualization /
> > storage-access / network / gc /..., so I thought that reducing the effect
> > of the slow shards might be a good (temporary or permanent) solution.
>
> Virtualization is not the best approach for Solr.  Assuming you're
> dealing with your own hardware and not something based in the cloud like
> Amazon, you can get better results by running on bare metal and having
> multiple shards per host.
>
> Garbage collection is a very likely source of this problem.
>
> http://wiki.apache.org/solr/SolrPerformanceProblems#GC_pause_problems
>
> > I thought it should be an almost trivial code change (for proving the
> > concept). Isn't it?
>
> I have no idea what you're saying/asking here.  Can you clarify?
>
> It seems to me that sending requests to all replicas would just increase
> the overall load on the cluster, with no real benefit.
>
> Thanks,
> Shawn
>
>


Searching in stopwords

2013-07-27 Thread Rohit Kumar
I have a company search which uses stopwords at query time. In my
stopwords list I have entries like:

HR
Club
India
Pvt.
Ltd.



So if I search for a company like HR Club I get no results. Similarly, a
search for India HR gives no results. How can I get results for queries on
the following companies:

1. HR India
2. HR Club
3. HR India Pvt Ltd


I would still want to maintain the above list of stopwords, since these
terms occur heavily in company text.

Please guide me on whether I need to change my strategy itself.

[schema field type / analyzer XML did not survive the list archive]
Thanks
Rohit Kumar


Re: Searching in stopwords

2013-07-27 Thread Jack Krupansky
Edismax should be able to handle a query consisting of only query-time stop 
words.


What does your text field type analyzer look like?
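
Roughly, I mean a request along these lines (SolrJ sketch; the field name
"company_name" is an assumption about your schema):

import org.apache.solr.client.solrj.SolrQuery;

// Let edismax handle a query made up entirely of query-time stopwords.
public class StopwordQuery {
    public static void main(String[] args) {
        SolrQuery q = new SolrQuery("HR Club");
        q.set("defType", "edismax");
        q.set("qf", "company_name");   // assumed field name
        q.set("mm", "100%");
        System.out.println(q);
    }
}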

-- Jack Krupansky

-Original Message- 
From: Rohit Kumar

Sent: Saturday, July 27, 2013 9:59 PM
To: solr-user@lucene.apache.org
Subject: Searching in stopwords

I have a company search which uses stopwords at query time. In my
stopwords list I have entries like:

HR
Club
India
Pvt.
Ltd.



So if I search for a company like HR Club I get no results. Similarly, a
search for India HR gives no results. How can I get results for queries on
the following companies:

1. HR India
2. HR Club
3. HR India Pvt Ltd


I would still want to maintain the above list of stopwords since these
letters occur heavily in company text.

Please guide me on whether I need to change my strategy itself.

[schema field type / analyzer XML did not survive the list archive]


Thanks
Rohit Kumar 



Re: processing documents in solr

2013-07-27 Thread Maurizio Cucchiara
In both cases, for better performance, I'd first load just the IDs;
afterwards, during processing, I'd load each document.
As for the incremental requirement, it should not be difficult to
write a hash function which maps a non-numerical id to a value.
 On Jul 27, 2013 7:03 AM, "Joe Zhang"  wrote:

> Dear list:
>
> I have an ever-growing solr repository, and I need to process every single
> document to extract statistics. What would be a reasonable process that
> satisfies the following properties:
>
> - Exhaustive: I have to traverse every single document
> - Incremental: in other words, it has to allow me to divide and conquer ---
> if I have processed the first 20k docs, next time I can start with 20001.
>
> A simple "*:*" query would satisfy the 1st but not the 2nd property. In
> fact, given that the processing will take very long, and the repository
> keeps growing, it is not even clear that the exhaustiveness is achieved.
>
> I'm running solr 3.6.2 in a single-machine setting; no hadoop capability
> yet. But I guess the same issues still hold even if I have the solr cloud
> environment, right, say in each shard?
>
> Any help would be greatly appreciated.
>
> Joe
>


Re: paging vs streaming. spawn from (Processing a lot of results in Solr)

2013-07-27 Thread Mikhail Khludnev
On Sun, Jul 28, 2013 at 1:25 AM, Yonik Seeley  wrote:

>
> Which part is problematic... the creation of the DocList (the search),
>
Literally, DocList is a copy of TopDocs. Creating TopDocs is not a search
but a ranking.
And ranking costs log(rows+start) per hit, on top of the numFound work that
the search itself takes.
It's interesting that we still pay that log() even if we ask for collecting
docs as-is with _docid_.


> or it's memory requirements (an int per doc)?
>
TopXxxCollector as well as the XxxComparators allocate arrays of size [rows+start].

It's clear that once we have deep paging, we only need to handle heaps of
size rows (without start).
That's fairly OK if we use Solr as a site-navigation engine, but it's
'sub-optimal' for data-analytics use cases, where we need something like
SELECT * FROM ... in an RDBMS. In that case any per-document memory allocation
on a billion-doc index is a bummer. That's why I'm asking about removing the
heap-based collector/comparator.


> -Yonik
> http://lucidworks.com
>



-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics