Re: Using Solr Spatial in conjunction with HBASE/Hadoop

2013-01-20 Thread ashok joshi
Have you looked at Oracle NoSQL Database
http://www.oracle.com/us/products/database/nosql/overview/index.html, a
scalable key-value store?

Can Solr be integrated with it?

Thanks and warm regards.
ashok joshi
oracle





Re: Language Identification in index time

2013-01-20 Thread Jack Krupansky

It sounds like you want an update request processor:
http://wiki.apache.org/solr/UpdateRequestProcessor

But, it also sounds like you should probably be normalizing the encoding 
before sending the data to Solr.
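
As a rough illustration, an update request processor along these lines could do the conversion at index time. This is only a sketch; EncodingConverter stands in for your own identifier/converter code (it is not a Solr class), and "body" is just an example field name:

import java.io.IOException;
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

public class EncodingNormalizerProcessorFactory extends UpdateRequestProcessorFactory {
  @Override
  public UpdateRequestProcessor getInstance(SolrQueryRequest req,
      SolrQueryResponse rsp, UpdateRequestProcessor next) {
    return new UpdateRequestProcessor(next) {
      @Override
      public void processAdd(AddUpdateCommand cmd) throws IOException {
        SolrInputDocument doc = cmd.getSolrInputDocument();
        Object raw = doc.getFieldValue("body");   // the field holding page text
        if (raw instanceof String) {
          // EncodingConverter is a placeholder for your own detector/converter
          doc.setField("body", EncodingConverter.toUnicode((String) raw));
        }
        super.processAdd(cmd);                    // pass the doc down the chain
      }
    };
  }
}

The factory would then be registered in an updateRequestProcessorChain in solrconfig.xml and attached to your update handler, so the text is normalized before it ever reaches the analyzers.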


-- Jack Krupansky

-Original Message- 
From: Yewint Ko

Sent: Sunday, January 20, 2013 10:36 AM
To: solr-user@lucene.apache.org
Subject: Language Identification in index time

Hi all

I am very new to Solr and Nutch. Currently I have a requirement to develop a 
small search engine for local movie websites. Because a non-standard encoding 
system is currently used on many of our local websites, it has become necessary 
for us to develop an encoding identifier and converter for web crawling, 
indexing and query processing. The idea is that we will identify the encoding 
used on the website, convert it (if necessary) and store the index in standard 
Unicode.


We have developed our own identifier and converter (a Solr SearchComponent) 
that can be used at query time to identify the encoding of the user query 
and convert it to match the index.


The problem I am having is that I don't know how to intercept the request at 
indexing time for identification and conversion purposes. Is there something 
like a filter chain that can access the text before it is passed to the 
tokenizer, so that we can detect which encoding it is?


Thanks
yewint 



Re: Missing documents with ConcurrentUpdateSolrServer (vs. HttpSolrServer) ?

2013-01-20 Thread Erick Erickson
If this was in SolrCloud mode, there was a bug in 4.0 when submitting
batches of documents at once. Can't find it right now, but thought I'd
mention it just in case. Submitting the docs one-at-a-time doesn't
have the same problem.

May not be applicable, and entirely orthogonal to the discussion about
swallowing errors

Erick

On Tue, Jan 15, 2013 at 4:10 PM, Mark Bennett mbenn...@ideaeng.com wrote:
 First off, just reporting this:

 I wound up with approx 58% fewer documents after submitting via
 ConcurrentUpdateSolrServer. I went back and changed the code to use
 HttpSolrServer and had 100%.

 This was a long-running test, approx 12 hours, with gigabytes of data, so it's
 not conveniently shared / reproducible, but I at least wanted to email around,
 in part to get it on the record, and second to see if anybody else has
 seen this. I didn't see anything in JIRA.

 I realize that Concurrent update is asynchronous and I'm giving up the
 ability to monitor things, but since it works using the old server, there's
 nothing glaringly wrong at least.

 Here's a few more details:
 * Approx 2 M docs, submitted 1,000 at a time.
 * Solr 4.0.0 on Windows Server 2008
 * Solr server JVM configured with 4 Gigs of RAM
 * Submitting client JVM (SolrJ) configured with 10 Gigs of RAM
 * Didn't see any OOM (Out Of Memory) errors on the asynchronous /
 ConcurrentUpdateSolrServer run.  However, I didn't capture the entire log.
 Usually with OOM it's just before the run crashes, and the end of the log
 on the screen looked fine.
 * I also didn't think there were OOM issues on the Solr server side, for the
 same reason
 * When submitting the same data synchronously (via HttpSolrServer) it
 didn't have any problems

 Questions:

 The async client certainly finished faster, and since the underlying Solr
 server presumably didn't do the real work any faster, presumably a backlog
 built up somewhere.  Agreed?

 I'm guessing this backlog had something to do with the failure.  Or are
 there other areas to think about?

 Which process would get backlogged, the SolrJ client or the Solr server?
 I'd guess the server?

 And if async submits are accumulated in the Solr server, is there some
 mechanism to queue them onto disk, or does it try to hold them all in RAM?

 And *if* the backlog caused an OOM condition, wouldn't that JVM have mostly
 crashed (if not completely)?

 Any guesses on the most likely failure point, and where to look?

 Thanks,
 Mark

 --
 Mark Bennett / New Idea Engineering, Inc. / mbenn...@ideaeng.com
 Direct: 408-733-0387 / Main: 866-IDEA-ENG / Cell: 408-829-6513
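
For what it's worth, one way to at least see the errors that ConcurrentUpdateSolrServer would otherwise swallow is to override its handleError callback. A minimal SolrJ 4.0 sketch; the URL, queue size and thread count are just example values:

import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class LoudConcurrentUpdate {
  public static void main(String[] args) throws Exception {
    // queue up to 1000 docs, drain with 4 background threads
    ConcurrentUpdateSolrServer server =
        new ConcurrentUpdateSolrServer("http://localhost:8983/solr/collection1", 1000, 4) {
          @Override
          public void handleError(Throwable ex) {
            // the default just logs; make failures impossible to miss
            System.err.println("Update batch failed: " + ex);
          }
        };
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "doc-1");
    server.add(doc);
    server.blockUntilFinished();   // wait for the async queue to drain
    server.commit();
    server.shutdown();
  }
}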


Re: Solr load balancer

2013-01-20 Thread Erick Erickson
Hmmm, the first thing I'd look at is why you are having long GC
pauses. Here's a great place to start:

http://www.lucidimagination.com/blog/2011/03/27/garbage-collection-bootcamp-1-0/
and:
http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html

I've wondered about a similar approach, but by firing off the same
query to multiple nodes in your cluster, you'll effectively double
(at least) the load on your system, perhaps leading to more memory
issues in a non-virtuous cycle.

FWIW,
Erick

On Fri, Jan 18, 2013 at 5:41 AM, Phil Hoy p...@brightsolid.com wrote:
 Hi,

 I would like to experiment with some custom load balancers to help with query 
 latency in the face of long gc pauses and the odd time-consuming query that 
 we need to be able to support. At the moment setting the socket timeout via 
 the HttpShardHandlerFactory does help, but of course it can only be set to a 
 length of time as long as the most time consuming query we are likely to 
 receive.

 For example perhaps a load balancer that sends multiple queries concurrently 
 to all/some replicas and only keeps the first response might be effective. Or 
 maybe a load balancer which takes account of the frequency of timeouts would 
 be able to recognize zombies more effectively.

 To use alternative load balancer implementations cleanly and without having 
 to hack solr directly, I would need to be able to make the existing 
 LBHttpSolrServer and HttpShardHandlerFactory more amenable to extension, I 
 can then override the default load balancer using solr's plugin mechanism.

 So my question is, if I made a patch to make the load balancer more 
 pluggable, is this something that would be acceptable and if so what do I do 
 next?

 Phil



Re: Solr cache considerations

2013-01-20 Thread Erick Erickson
About your question about document cache: Typically the document cache
has a pretty low hit-ratio. I've rarely, if ever, seen it get hit very
often. And remember that this cache is only hit when assembling the
response for a few documents (your page size).

Bottom line: I wouldn't worry about this cache much. It's quite useful
for processing a particular query faster, but not really intended for
cross-query use.

Really, I think you're getting the cart before the horse here. Run it
up the flagpole and try it. Rely on the OS to do its job
(http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html).
Find a bottleneck _then_ tune. Premature optimization and all
that...

Several tens of millions of docs isn't that large unless the text
fields are enormous.

Best
Erick

On Sat, Jan 19, 2013 at 2:32 PM, Isaac Hebsh isaac.he...@gmail.com wrote:
 Ok. Thank you everyone for your helpful answers.
 I understand that fieldValueCache is not used for resolving queries.
 Is there any cache that can help this basic scenario (a lot of different
 queries, on a small set of fields)?
 Does Lucene's FieldCache help (implicitly)?
 How can I use RAM to reduce I/O in this type of queries?


 On Fri, Jan 18, 2013 at 4:09 PM, Tomás Fernández Löbbe 
 tomasflo...@gmail.com wrote:

 No, the fieldValueCache is not used for resolving queries. Only for
 multi-token faceting and apparently for the stats component too. The
 document cache maintains in memory the stored content of the fields you are
 retrieving or highlighting on. It'll hit if the same document matches the
 query multiple times and the same fields are requested, but as Eirck said,
 it is important for cases when multiple components in the same request need
 to access the same data.

 I think soft committing every 10 minutes is totally fine, but you should
 hard commit more often if you are going to be using transaction log.
 openSearcher=false will essentially tell Solr not to open a new searcher
 after the (hard) commit, so you won't see the new indexed data and caches
 wont be flushed. openSearcher=false makes sense when you are using
 hard-commits together with soft-commits, as the soft-commit is dealing
 with opening/closing searchers, you don't need hard commits to do it.

 Tomás


 On Fri, Jan 18, 2013 at 2:20 AM, Isaac Hebsh isaac.he...@gmail.com
 wrote:

  Unfortunately, it seems (
  http://lucene.472066.n3.nabble.com/Nrt-and-caching-td3993612.html) that
  these caches are not per-segment. In this case, I want to (soft) commit
  less frequently. Am I right?
 
  Tomás, as the fieldValueCache is very similar to lucene's FieldCache, I
  guess it has a big contribution to standard (not only faceted) queries
  time. SolrWiki claims that it primarily used by faceting. What that says
  about complex textual queries?
 
  documentCache:
  Erick, After a query processing is finished, doesn't some documents stay
 in
  the documentCache? can't I use it to accelerate queries that should
  retrieve stored fields of documents? In this case, a big documentCache
 can
  hold more documents..
 
  About commit frequency:
  HardCommit: openSearch=false seems as a nice solution. Where can I read
  about this? (found nothing but one unexplained sentence in SolrWiki).
  SoftCommit: In my case, the required index freshness is 10 minutes. The
  plan to soft commit every 10 minutes is similar to storing all of the
  documents in a queue (outside to Solr), an indexing a bulk every 10
  minutes.
 
  Thanks.
 
 
  On Fri, Jan 18, 2013 at 2:15 AM, Tomás Fernández Löbbe 
  tomasflo...@gmail.com wrote:
 
   I think fieldValueCache is not per segment, only fieldCache is.
 However,
   unless I'm missing something, this cache is only used for faceting on
   multivalued fields
  
  
   On Thu, Jan 17, 2013 at 8:58 PM, Erick Erickson 
 erickerick...@gmail.com
   wrote:
  
filterCache: This is bounded by 1M * (maxDoc) / 8 * (num filters in
cache). Notice the /8. This reflects the fact that the filters are
represented by a bitset on the _internal_ Lucene ID. UniqueId has no
bearing here whatsoever. This is, in a nutshell, why warming is
required, the internal Lucene IDs may change. Note also that it's
maxDoc, the internal arrays have holes for deleted documents.
   
Note this is an _upper_ bound, if there are only a few docs that
match, the size will be (num of matching docs) * sizeof(int)).
   
fieldValueCache. I don't think so, although I'm a bit fuzzy on this.
It depends on whether these are per-segment caches or not. Any per
segment cache is still valid.
   
Think of documentCache as intended to hold the stored fields while
various components operate on it, thus avoiding repeatedly fetching
the data from disk. It's _usually_ not too big a worry.
   
About hard-commits once a day. That's _extremely_ long. Think instead
of committing more frequently with openSearcher=false. If nothing
 else, your transaction log will grow...
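
To put rough numbers on the filterCache bound Erick describes above, here is a back-of-the-envelope sketch; maxDoc and the cache size are assumed values, not figures from this thread:

public class FilterCacheBound {
  public static void main(String[] args) {
    long maxDoc = 30000000L;            // assumed index size, including deleted docs
    int cacheSize = 512;                // assumed filterCache size in solrconfig.xml
    long bytesPerEntry = maxDoc / 8;    // worst case: one bit per internal Lucene ID
    long upperBoundMb = bytesPerEntry * cacheSize / (1024 * 1024);
    // ~3.6 MB per cached filter, ~1830 MB if every entry holds a full bitset;
    // sparse filters are stored as int arrays and can be much smaller.
    System.out.println(upperBoundMb + " MB upper bound");
  }
}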

RE: Solr 4.0 - timeAllowed in distributed search

2013-01-20 Thread Michael Ryan
(This is based on my knowledge of 3.6 - not sure if this has changed in 4.0)

You are using rows=30000, which requires retrieving 30,000 documents from disk. 
In a non-distributed search, the QTime will not include the time it takes to 
retrieve these documents, but in a distributed search, it will. For a *:* 
query, the document retrieval will almost always be the slowest part of the 
query. I'd suggest measuring how long it takes for the full response to be 
returned, or use rows=0.

The timeAllowed feature is very misleading. It only applies to a small portion 
of the query (which in my experience is usually not the part of the query that 
is actually slow). Do not depend on timeAllowed doing anything useful :)

-Michael
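
As a rough SolrJ sketch of that suggestion (measure the wall-clock time yourself and drop the rows), assuming the shard URL from the thread:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class TimeAllowedCheck {
  public static void main(String[] args) throws Exception {
    HttpSolrServer server =
        new HttpSolrServer("http://localhost:8983/solr/shard_2013-01-07");
    SolrQuery q = new SolrQuery("*:*");
    q.setRows(0);                  // skip document retrieval entirely
    q.set("timeAllowed", 500);     // only bounds part of the search

    long start = System.currentTimeMillis();
    QueryResponse rsp = server.query(q);
    long wallClock = System.currentTimeMillis() - start;

    // QTime is the server-side figure; wall-clock time also includes document
    // retrieval, serialization and transfer, which is where distributed
    // requests pay the extra cost.
    System.out.println("numFound=" + rsp.getResults().getNumFound()
        + " QTime=" + rsp.getQTime() + "ms wall=" + wallClock + "ms");
  }
}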

-Original Message-
From: Lyuba Romanchuk [mailto:lyuba.romanc...@gmail.com] 
Sent: Sunday, January 20, 2013 6:36 AM
To: solr-user@lucene.apache.org
Subject: Solr 4.0 - timeAllowed in distributed search

Hi,

I am trying to use timeAllowed in a query, both in a distributed search with one 
shard and directly against the same shard.
I send the same query with timeAllowed=500 :

   - directly to the shard then QTime ~= 600 ms
   - through distributes search to the same shard QTime ~= 7 sec.

I have two questions:

   - It seems that the timeAllowed parameter doesn't work for distributed
   search, does it?
   - What may be the reason that the query to the shard through distributed
   search takes much more time than the query to the shard directly (the
   same difference remains without the timeAllowed parameter in the query)?


Test results:

Ask one shard through distributed search:


http://localhost:8983/solr/shard_2013-01-07/select?q=*:*&rows=30000&shards=127.0.0.1%3A8983%2Fsolr%2Fshard_2013-01-07&timeAllowed=500&partialResults=true&shards.info=true&debugQuery=true
<response>
<lst name="responseHeader">
  <bool name="partialResults">true</bool>
  <int name="status">0</int>
  <int name="QTime">7307</int>
  <lst name="params">
    <str name="q">*:*</str>
    <str name="shards">127.0.0.1:8983/solr/shard_2013-01-07</str>
    <str name="partialResults">true</str>
    <str name="debugQuery">true</str>
    <str name="shards.info">true</str>
    <str name="rows">30000</str>
    <str name="timeAllowed">500</str>
  </lst>
</lst>
<lst name="shards.info">
  <lst name="127.0.0.1:8983/solr/shard_2013-01-07">
    <long name="numFound">29574223</long>
    <float name="maxScore">1.0</float>
    <long name="time">646</long>
  </lst>
</lst>
<result name="response" numFound="29574223" start="0" maxScore="1.0">
... 30,000 docs ...
</result>
<lst name="debug">
  <str name="rawquerystring">*:*</str>
  <str name="querystring">*:*</str>
  <str name="parsedquery">MatchAllDocsQuery(*:*)</str>
  <str name="parsedquery_toString">*:*</str>
  <str name="QParser">LuceneQParser</str>
  <lst name="timing">
    <double name="time">6141.0</double>
    <lst name="prepare">
      <double name="time">0.0</double>
      <lst name="org.apache.solr.handler.component.QueryComponent"><double name="time">0.0</double></lst>
      <lst name="org.apache.solr.handler.component.FacetComponent"><double name="time">0.0</double></lst>
      <lst name="org.apache.solr.handler.component.MoreLikeThisComponent"><double name="time">0.0</double></lst>
      <lst name="org.apache.solr.handler.component.HighlightComponent"><double name="time">0.0</double></lst>
      <lst name="org.apache.solr.handler.component.StatsComponent"><double name="time">0.0</double></lst>
      <lst name="org.apache.solr.handler.component.DebugComponent"><double name="time">0.0</double></lst>
    </lst>
    <lst name="process">
      <double name="time">6141.0</double>
      <lst name="org.apache.solr.handler.component.QueryComponent"><double name="time">6022.0</double></lst>
      <lst name="org.apache.solr.handler.component.FacetComponent"><double name="time">0.0</double></lst>
      <lst name="org.apache.solr.handler.component.MoreLikeThisComponent"><double name="time">0.0</double></lst>
      <lst name="org.apache.solr.handler.component.HighlightComponent"><double name="time">0.0</double></lst>
      <lst name="org.apache.solr.handler.component.StatsComponent"><double name="time">0.0</double></lst>
      <lst name="org.apache.solr.handler.component.DebugComponent"><double name="time">119.0</double></lst>
    </lst>
  </lst>

Ask the same shard directly:

http://localhost:8983/solr/shard_2013-01-07/select?q=*:*&rows=30000&timeAllowed=500&partialResults=true&shards.info=true&debugQuery=true
<lst name="responseHeader">
  <bool name="partialResults">true</bool>
  <int name="status">0</int>
  <int name="QTime">617</int>
  <lst name="params">
    <str name="q">*:*</str>
    <str name="partialResults">true</str>
    <str name="debugQuery">true</str>
    <str name="shards.info">true</str>
    <str name="rows">30000</str>
    <str name="timeAllowed">500</str>
  </lst>
</lst>
<result name="response" numFound="28687243" start="0" ...>
... 30,000 docs ...
<lst name="debug">
  <str name="rawquerystring">*:*</str>
  <str name="querystring">*:*</str>
  <str name="parsedquery">MatchAllDocsQuery(*:*)</str>
  <str name="parsedquery_toString">*:*</str>
  <str name="QParser">LuceneQParser</str>
  <lst name="timing">
    <double name="time">617.0</double>
    <lst name="prepare">
      <double name="time">0.0</double>
      <lst name="org.apache.solr.handler.component.QueryComponent"><double name="time">0.0</double></lst>
      <lst name="org.apache.solr.handler.component.FacetComponent"><double name="time">0.0</double></lst>
      <lst name="org.apache.solr.handler.component.MoreLikeThisComponent"><double name="time">0.0</double></lst>
      <lst name="org.apache.solr.handler.component.HighlightComponent"><double name="time">0.0</double></lst>
      <lst ...

Re: Long ParNew GC pauses - even when young generation is small

2013-01-20 Thread Shawn Heisey

On 1/18/2013 10:07 PM, Shawn Heisey wrote:

On my dev 4.1 server with Java 7u11, I am using the G1 collector with a
max pause target of 1500ms.  I was thinking that this collector was
producing long pauses too, but after reviewing the gc log with a closer
eye, I see that there are lines that specifically say pause ... and
all of THOSE lines are below half a second except one that took 1.4
seconds.  Does that mean that it's actually meeting the target, or are
the other lines that show quite long time values indicative of a
problem?  If only the lines that explicitly say pause are the ones I
need to worry about, then it looks like G1 is the clear winner.


Here's a paste showing a command and its output.  I included remark in 
the grep because I saw a presentation saying that remark in G1 is 
stop-the-world:


http://pastie.org/private/vygpvtjzicsl8uztg3drw

None of the matching log lines get close to my 5 second pain point.  If 
I check the entire unfiltered log for lines that exceed 3 seconds, I do 
find a few, but only one of them says pause and it's far enough below 
the 5 second level that it probably would not cause a problem:


http://pastie.org/private/wcessvbrditextxmoapksq

Here's the perl script used in the two outputs above:

http://pastie.org/private/itu9hbgiwugdjtmy3yg8g

The log was gathered during a full-import of six large shards, over 12 
million docs each.  The import took 7 hours.  I had the patches for 
LUCENE-4599 (Compressed TermVectors) applied to Solr 4.1 at the time.


What I'd like to know is whether a 'concurrent-mark-end' line indicates 
stop-the-world or not.  I suspect that it is done while the application 
is working.  If this is right, then I think I have found the right GC 
settings:


-XX:+UseG1GC -XX:MaxGCPauseMillis=1500 -XX:GCPauseIntervalMillis=4000

My production servers have more total memory, more CPU cores, and much 
faster I/O than the dev server where I have been running these tests, 
but they both use the same 8GB java heap.  One last question: Should I 
be worried about using the G1 collector on Oracle Java 6u38, which was 
released at the same time as 7u11?  This *might* be a good opportunity 
to upgrade to java 7 in production, actually.  I have two completely 
independent index chains, I could upgrade the secondary.


If anyone has any suggestions for my GC parsing perl script, or knows 
about a much more functional replacement, let me know.


Thanks,
Shawn



Re: Long ParNew GC pauses - even when young generation is small

2013-01-20 Thread Shawn Heisey

On 1/20/2013 11:33 AM, Shawn Heisey wrote:

On 1/18/2013 10:07 PM, Shawn Heisey wrote:

On my dev 4.1 server with Java 7u11, I am using the G1 collector with a
max pause target of 1500ms.  I was thinking that this collector was
producing long pauses too, but after reviewing the gc log with a closer
eye, I see that there are lines that specifically say pause ... and
all of THOSE lines are below half a second except one that took 1.4
seconds.  Does that mean that it's actually meeting the target, or are
the other lines that show quite long time values indicative of a
problem?  If only the lines that explicitly say pause are the ones I
need to worry about, then it looks like G1 is the clear winner.


Here's a paste showing a command and its output.  I included remark in
the grep because I saw a presentation saying that remark in G1 is
stop-the-world:

http://pastie.org/private/vygpvtjzicsl8uztg3drw

None of the matching log lines get close to my 5 second pain point.  If
I check the entire unfiltered log for lines that exceed 3 seconds, I do
find a few, but only one of them says pause and it's far enough below
the 5 second level that it probably would not cause a problem:

http://pastie.org/private/wcessvbrditextxmoapksq

Here's the perl script used in the two outputs above:

http://pastie.org/private/itu9hbgiwugdjtmy3yg8g


Here's the full gc log for anyone that feels compelled to fully investigate:

http://dl.dropbox.com/u/97770508/gc.log

Thanks,
Shawn



Re: Have the SolrCloud collection REST endpoints move or changed for 4.1?

2013-01-20 Thread Brett Hoerner
So the ticket I created wasn't related; there is a working patch for that
now, but my original issue remains: I get a 404 when trying to post updates to
a URL that worked fine in Solr 4.0.


On Sat, Jan 19, 2013 at 5:56 PM, Brett Hoerner br...@bretthoerner.com wrote:

 I'm actually wondering if this other issue I've been having is a problem:

 https://issues.apache.org/jira/browse/SOLR-4321

 The fact that some nodes don't get pieces of a collection could explain
 the 404.

 That said, even when a node has parts of a collection it reports 404
 sometimes. What's odd is that I can use curl to post a JSON document to the
 same URL and it will return 200.

 When I log every request I make from my indexer process (using SolrJ)
 it's about 50/50 between 404 and 200...


  On Sat, Jan 19, 2013 at 5:22 PM, Brett Hoerner br...@bretthoerner.com wrote:

 I was using Solr 4.0 but ran into a few problems using SolrCloud. I'm
 trying out 4.1 RC1 right now but the update URL I used to use is returning
 HTTP 404.

 For example, I would post my document updates to,

 http://localhost:8983/solr/collection1

 But that is 404ing now (collection1 exists according to the admin UI, all
 shards are green and happy, and data dirs exist on the nodes).

 I also tried the following,

 http://localhost:8983/solr/collection1/update

 And also received a 404 there.

 A specific example from the Java client:

 22:38:12.474 [pool-7-thread-14] ERROR com.massrel.faassolr.SolrBackend -
 Error while flushing to Solr.
 org.apache.solr.common.SolrException: Server at
 http://backfill-2d.i.massrel.com:8983/solr/15724/update returned non ok
 status:404, message:Not Found
  at
 org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:372)
 ~[solr-solrj-4.0.0.jar:4.0.0 1394950 - rmuir - 2012-10-06 03:05:44]
 at
 org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
 ~[solr-solrj-4.0.0.jar:4.0.0 1394950 - rmuir - 2012-10-06 03:05:44]
  at
 org.apache.solr.client.solrj.impl.LBHttpSolrServer.request(LBHttpSolrServer.java:438)
 ~[solr-solrj-4.0.0.jar:4.0.0 1394950 - rmuir - 2012-10-06 03:05:44]
 at
 org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
 ~[solr-solrj-4.0.0.jar:4.0.0 1394950 - rmuir - 2012-10-06 03:05:44]

 But I can hit that URL with a GET,

 $ curl http://backfill-1d.i.massrel.com:8983/solr/15724/update
 <?xml version="1.0" encoding="UTF-8"?>
 <response>
 <lst name="responseHeader"><int name="status">400</int><int name="QTime">2</int></lst>
 <lst name="error"><str name="msg">missing content stream</str><int name="code">400</int></lst>
 </response>

 Thoughts?

 Thanks.





Re: Long ParNew GC pauses - even when young generation is small

2013-01-20 Thread Shawn Heisey

On 1/18/2013 10:07 PM, Shawn Heisey wrote:

I may try the G1 collector with Java 6 in production, since I am on the
newest Oracle version.


I am giving this a try on my secondary server set.  An encouraging note: 
The -XX:+UnlockExperimentalVMOptions option is no longer required to use 
the G1 collector, at least on version 6u38.


Thanks,
Shawn



Re: Solr cache considerations

2013-01-20 Thread Isaac Hebsh
Wow Erick, the MMap article is a very fundamental one. It totally changed my
view. It must be mentioned in SolrPerformanceFactors in the SolrWiki...
I'm sorry I did not know about it before.
Thank you a lot.
I promise to share my results when my cart starts to fly :)


On Sun, Jan 20, 2013 at 6:08 PM, Erick Erickson erickerick...@gmail.com wrote:

 About your question about document cache: Typically the document cache
 has a pretty low hit-ratio. I've rarely, if ever, seen it get hit very
 often. And remember that this cache is only hit when assembling the
 response for a few documents (your page size).

 Bottom line: I wouldn't worry about this cache much. It's quite useful
 for processing a particular query faster, but not really intended for
 cross-query use.

 Really, I think you're getting the cart before the horse here. Run it
 up the flagpole and try it. Rely on the OS to do its job
 (http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html).
 Find  a bottleneck _then_ tune. Premature optimization and all
 that

 Several tens of millions of docs isn't that large unless the text
 fields are enormous.

 Best
 Erick

 On Sat, Jan 19, 2013 at 2:32 PM, Isaac Hebsh isaac.he...@gmail.com
 wrote:
  Ok. Thank you everyone for your helpful answers.
  I understand that fieldValueCache is not used for resolving queries.
  Is there any cache that can help this basic scenario (a lot of different
  queries, on a small set of fields)?
  Does Lucene's FieldCache help (implicitly)?
  How can I use RAM to reduce I/O in this type of queries?
 
 
  On Fri, Jan 18, 2013 at 4:09 PM, Tomás Fernández Löbbe 
  tomasflo...@gmail.com wrote:
 
  No, the fieldValueCache is not used for resolving queries. Only for
  multi-token faceting and apparently for the stats component too. The
  document cache maintains in memory the stored content of the fields you
 are
  retrieving or highlighting on. It'll hit if the same document matches
 the
  query multiple times and the same fields are requested, but as Eirck
 said,
  it is important for cases when multiple components in the same request
 need
  to access the same data.
 
  I think soft committing every 10 minutes is totally fine, but you should
  hard commit more often if you are going to be using transaction log.
  openSearcher=false will essentially tell Solr not to open a new searcher
  after the (hard) commit, so you won't see the new indexed data and
 caches
  wont be flushed. openSearcher=false makes sense when you are using
  hard-commits together with soft-commits, as the soft-commit is dealing
  with opening/closing searchers, you don't need hard commits to do it.
 
  Tomás
 
 
  On Fri, Jan 18, 2013 at 2:20 AM, Isaac Hebsh isaac.he...@gmail.com
  wrote:
 
   Unfortunately, it seems (
   http://lucene.472066.n3.nabble.com/Nrt-and-caching-td3993612.html)
 that
   these caches are not per-segment. In this case, I want to (soft)
 commit
   less frequently. Am I right?
  
   Tomás, as the fieldValueCache is very similar to lucene's FieldCache,
 I
   guess it has a big contribution to standard (not only faceted) queries
   time. SolrWiki claims that it primarily used by faceting. What that
 says
   about complex textual queries?
  
   documentCache:
   Erick, After a query processing is finished, doesn't some documents
 stay
  in
   the documentCache? can't I use it to accelerate queries that should
   retrieve stored fields of documents? In this case, a big documentCache
  can
   hold more documents..
  
   About commit frequency:
   HardCommit: openSearch=false seems as a nice solution. Where can I
 read
   about this? (found nothing but one unexplained sentence in SolrWiki).
   SoftCommit: In my case, the required index freshness is 10 minutes.
 The
   plan to soft commit every 10 minutes is similar to storing all of the
   documents in a queue (outside to Solr), an indexing a bulk every 10
   minutes.
  
   Thanks.
  
  
   On Fri, Jan 18, 2013 at 2:15 AM, Tomás Fernández Löbbe 
   tomasflo...@gmail.com wrote:
  
I think fieldValueCache is not per segment, only fieldCache is.
  However,
unless I'm missing something, this cache is only used for faceting
 on
multivalued fields
   
   
On Thu, Jan 17, 2013 at 8:58 PM, Erick Erickson 
  erickerick...@gmail.com
wrote:
   
 filterCache: This is bounded by 1M * (maxDoc) / 8 * (num filters
 in
 cache). Notice the /8. This reflects the fact that the filters are
 represented by a bitset on the _internal_ Lucene ID. UniqueId has
 no
 bearing here whatsoever. This is, in a nutshell, why warming is
 required, the internal Lucene IDs may change. Note also that it's
 maxDoc, the internal arrays have holes for deleted documents.

 Note this is an _upper_ bound, if there are only a few docs that
 match, the size will be (num of matching docs) * sizeof(int)).

 fieldValueCache. I don't think so, although I'm a bit fuzzy on
 this.
  It depends on whether these are ...

Re: Solr cache considerations

2013-01-20 Thread Walter Underwood
I routinely see hit rates over 75% on the document cache. Perhaps yours is too 
small. Mine is set at 10240 entries.

wunder

On Jan 20, 2013, at 8:08 AM, Erick Erickson wrote:

 About your question about document cache: Typically the document cache
 has a pretty low hit-ratio. I've rarely, if ever, seen it get hit very
 often. And remember that this cache is only hit when assembling the
 response for a few documents (your page size).
 
 Bottom line: I wouldn't worry about this cache much. It's quite useful
 for processing a particular query faster, but not really intended for
 cross-query use.
 
 Really, I think you're getting the cart before the horse here. Run it
 up the flagpole and try it. Rely on the OS to do its job
 (http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html).
 Find  a bottleneck _then_ tune. Premature optimization and all
 that
 
 Several tens of millions of docs isn't that large unless the text
 fields are enormous.
 
 Best
 Erick
 
 On Sat, Jan 19, 2013 at 2:32 PM, Isaac Hebsh isaac.he...@gmail.com wrote:
 Ok. Thank you everyone for your helpful answers.
 I understand that fieldValueCache is not used for resolving queries.
 Is there any cache that can help this basic scenario (a lot of different
 queries, on a small set of fields)?
 Does Lucene's FieldCache help (implicitly)?
 How can I use RAM to reduce I/O in this type of queries?
 
 
 On Fri, Jan 18, 2013 at 4:09 PM, Tomás Fernández Löbbe 
 tomasflo...@gmail.com wrote:
 
 No, the fieldValueCache is not used for resolving queries. Only for
 multi-token faceting and apparently for the stats component too. The
 document cache maintains in memory the stored content of the fields you are
 retrieving or highlighting on. It'll hit if the same document matches the
 query multiple times and the same fields are requested, but as Eirck said,
 it is important for cases when multiple components in the same request need
 to access the same data.
 
 I think soft committing every 10 minutes is totally fine, but you should
 hard commit more often if you are going to be using transaction log.
 openSearcher=false will essentially tell Solr not to open a new searcher
 after the (hard) commit, so you won't see the new indexed data and caches
 wont be flushed. openSearcher=false makes sense when you are using
 hard-commits together with soft-commits, as the soft-commit is dealing
 with opening/closing searchers, you don't need hard commits to do it.
 
 Tomás
 
 
 On Fri, Jan 18, 2013 at 2:20 AM, Isaac Hebsh isaac.he...@gmail.com
 wrote:
 
 Unfortunately, it seems (
 http://lucene.472066.n3.nabble.com/Nrt-and-caching-td3993612.html) that
 these caches are not per-segment. In this case, I want to (soft) commit
 less frequently. Am I right?
 
 Tomás, as the fieldValueCache is very similar to lucene's FieldCache, I
 guess it has a big contribution to standard (not only faceted) queries
 time. SolrWiki claims that it primarily used by faceting. What that says
 about complex textual queries?
 
 documentCache:
 Erick, After a query processing is finished, doesn't some documents stay
 in
 the documentCache? can't I use it to accelerate queries that should
 retrieve stored fields of documents? In this case, a big documentCache
 can
 hold more documents..
 
 About commit frequency:
 HardCommit: openSearch=false seems as a nice solution. Where can I read
 about this? (found nothing but one unexplained sentence in SolrWiki).
 SoftCommit: In my case, the required index freshness is 10 minutes. The
 plan to soft commit every 10 minutes is similar to storing all of the
 documents in a queue (outside to Solr), an indexing a bulk every 10
 minutes.
 
 Thanks.
 
 
 On Fri, Jan 18, 2013 at 2:15 AM, Tomás Fernández Löbbe 
 tomasflo...@gmail.com wrote:
 
 I think fieldValueCache is not per segment, only fieldCache is.
 However,
 unless I'm missing something, this cache is only used for faceting on
 multivalued fields
 
 
 On Thu, Jan 17, 2013 at 8:58 PM, Erick Erickson 
 erickerick...@gmail.com
 wrote:
 
 filterCache: This is bounded by 1M * (maxDoc) / 8 * (num filters in
 cache). Notice the /8. This reflects the fact that the filters are
 represented by a bitset on the _internal_ Lucene ID. UniqueId has no
 bearing here whatsoever. This is, in a nutshell, why warming is
 required, the internal Lucene IDs may change. Note also that it's
 maxDoc, the internal arrays have holes for deleted documents.
 
 Note this is an _upper_ bound, if there are only a few docs that
 match, the size will be (num of matching docs) * sizeof(int)).
 
 fieldValueCache. I don't think so, although I'm a bit fuzzy on this.
 It depends on whether these are per-segment caches or not. Any per
 segment cache is still valid.
 
 Think of documentCache as intended to hold the stored fields while
 various components operate on it, thus avoiding repeatedly fetching
 the data from disk. It's _usually_ not too big a worry.
 
  About hard-commits once a day. That's _extremely_ long. ...

Re: Solr 4.0 - timeAllowed in distributed search

2013-01-20 Thread Walter Underwood
If you are going to request 30,000 rows, you can give up on getting good 
performance. It is not going to happen.

Even without all the disk accesses, think about how much is sent over the 
network, then parsed by the client. The client cannot even start working with 
the data until it is all received and parsed.

wunder
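
If the client genuinely needs all 30,000 documents, one option is to fetch them in pages so nothing has to buffer the whole result at once. A minimal SolrJ sketch, with an assumed page size and the shard URL from the thread (note that very deep start offsets carry their own cost):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class PagedFetch {
  public static void main(String[] args) throws Exception {
    HttpSolrServer server =
        new HttpSolrServer("http://localhost:8983/solr/shard_2013-01-07");
    int pageSize = 500;
    for (int start = 0; start < 30000; start += pageSize) {
      SolrQuery q = new SolrQuery("*:*");
      q.setStart(start);
      q.setRows(pageSize);
      QueryResponse rsp = server.query(q);
      // process one page at a time instead of parsing 30,000 docs in one go
      System.out.println("page at " + start + ": " + rsp.getResults().size() + " docs");
      if (rsp.getResults().size() < pageSize) {
        break;   // ran out of results
      }
    }
  }
}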

On Jan 20, 2013, at 8:49 AM, Michael Ryan wrote:

 (This is based on my knowledge of 3.6 - not sure if this has changed in 4.0)
 
 You are using rows=30000, which requires retrieving 30,000 documents from 
 disk. In a non-distributed search, the QTime will not include the time it 
 takes to retrieve these documents, but in a distributed search, it will. For 
 a *:* query, the document retrieval will almost always be the slowest part of 
 the query. I'd suggest measuring how long it takes for the response to be 
 returned, or use rows=0.
 
 The timeAllowed feature is very misleading. It only applies to a small 
 portion of the query (which in my experience is usually not the part of the 
 query that is actually slow). Do not depend on timeAllowed doing anything 
 useful :)
 
 -Michael
 
 -Original Message-
 From: Lyuba Romanchuk [mailto:lyuba.romanc...@gmail.com] 
 Sent: Sunday, January 20, 2013 6:36 AM
 To: solr-user@lucene.apache.org
 Subject: Solr 4.0 - timeAllowed in distributed search
 
 Hi,
 
 I try to use timeAllowed in query both in distributed search with one shard 
 and directly to the same shard.
 I send the same query with timeAllowed=500 :
 
   - directly to the shard then QTime ~= 600 ms
   - through distributes search to the same shard QTime ~= 7 sec.
 
 I have two questions:
 
   - It seems that timeAllowed parameter doesn't work for distributes
   search, does it?
   - What may be the reason that causes the query to the shard through
   distributes search takes much more time than to the shard directly (the
   same distribution remains without timeAllowed parameter in the query)?
 
 

Re: Have the SolrCloud collection REST endpoints move or changed for 4.1?

2013-01-20 Thread Brett Hoerner
Sorry, I take it back. It looks like fixing
https://issues.apache.org/jira/browse/SOLR-4321 fixed my issue after all.


On Sun, Jan 20, 2013 at 2:21 PM, Brett Hoerner br...@bretthoerner.com wrote:

 So the ticket I created wasn't related, there is a working patch for that
 now but my original issue remains, I get 404 when trying to post updates to
 a URL that worked fine in Solr 4.0.


 On Sat, Jan 19, 2013 at 5:56 PM, Brett Hoerner br...@bretthoerner.com wrote:

 I'm actually wondering if this other issue I've been having is a problem:

 https://issues.apache.org/jira/browse/SOLR-4321

 The fact that some nodes don't get pieces of a collection could explain
 the 404.

 That said, even when a node has parts of a collection it reports 404
 sometimes. What's odd is that I can use curl to post a JSON document to the
 same URL and it will return 200.

 When I log every request I make from my indexer process (using solr4j)
 it's about 50/50 between 404 and 200...


  On Sat, Jan 19, 2013 at 5:22 PM, Brett Hoerner 
 br...@bretthoerner.com wrote:

 I was using Solr 4.0 but ran into a few problems using SolrCloud. I'm
 trying out 4.1 RC1 right now but the update URL I used to use is returning
 HTTP 404.

 For example, I would post my document updates to,

 http://localhost:8983/solr/collection1

 But that is 404ing now (collection1 exists according to the admin UI,
 all shards are green and happy, and data dirs exist on the nodes).

 I also tried the following,

 http://localhost:8983/solr/collection1/update

 And also received a 404 there.

 A specific example from the Java client:

 22:38:12.474 [pool-7-thread-14] ERROR com.massrel.faassolr.SolrBackend -
 Error while flushing to Solr.
 org.apache.solr.common.SolrException: Server at
 http://backfill-2d.i.massrel.com:8983/solr/15724/update returned non ok
 status:404, message:Not Found
  at
 org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:372)
 ~[solr-solrj-4.0.0.jar:4.0.0 1394950 - rmuir - 2012-10-06 03:05:44]
 at
 org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
 ~[solr-solrj-4.0.0.jar:4.0.0 1394950 - rmuir - 2012-10-06 03:05:44]
  at
 org.apache.solr.client.solrj.impl.LBHttpSolrServer.request(LBHttpSolrServer.java:438)
 ~[solr-solrj-4.0.0.jar:4.0.0 1394950 - rmuir - 2012-10-06 03:05:44]
 at
 org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
 ~[solr-solrj-4.0.0.jar:4.0.0 1394950 - rmuir - 2012-10-06 03:05:44]

 But I can hit that URL with a GET,

 $ curl http://backfill-1d.i.massrel.com:8983/solr/15724/update
  <?xml version="1.0" encoding="UTF-8"?>
  <response>
  <lst name="responseHeader"><int name="status">400</int><int name="QTime">2</int></lst>
  <lst name="error"><str name="msg">missing content stream</str><int name="code">400</int></lst>
  </response>

 Thoughts?

 Thanks.






RE: Long ParNew GC pauses - even when young generation is small

2013-01-20 Thread Markus Jelsma
Hi Shawn,

Although our heap spaces are much less than yours (256M for 2x 2.5GB cores per 
node) we saw decreased throughput and higher latency with G1 on Java 6. You can 
also expect higher CPU consumption. You can check it very well with VisualVM 
attached. 

Looking forward to your results.

Markus

 
 
-Original message-
 From:Shawn Heisey s...@elyograg.org
 Sent: Sun 20-Jan-2013 21:48
 To: solr-user@lucene.apache.org
 Subject: Re: Long ParNew GC pauses - even when young generation is small
 
 On 1/18/2013 10:07 PM, Shawn Heisey wrote:
  I may try the G1 collector with Java 6 in production, since I am on the
  newest Oracle version.
 
 I am giving this a try on my secondary server set.  An encouraging note: 
 The -XX:+UnlockExperimentalVMOptions option is no longer required to use 
 the G1 collector, at least on version 6u38.
 
 Thanks,
 Shawn
 
 


Re: Long ParNew GC pauses - even when young generation is small

2013-01-20 Thread Shawn Heisey

On 1/20/2013 2:13 PM, Markus Jelsma wrote:

Hi Shawn,

Although our heap spaces are much less than yours (256M for 2x 2.5GB cores per 
node) we saw decreased throughput and higher latency with G1 on Java 6. You can 
also expect higher CPU consumption. You can check it very well with VisualVM 
attached.

Looking forward to your results.


I don't have any really good test tools developed for testing throughput 
and latency.  I have some less-than-ideal tools for other purposes that 
I might be able to adapt.


Throughput is not a major issue for us - query volume is quite low.  I 
would be mildly surprised by 5 queries per second.  I don't have much of 
an idea of queries per second over the short term - the numbers 
available in 3.5 are limited.


As for latency, early indications from an old SOLR-1972 patch suggest 
that the QTime values might be a little higher.  The primary server 
stats (using CMS/ParNew) are over 1 million queries, and the secondary 
server stats (using G1) so far are only about 5000 queries.  The QTime 
values are steadily dropping as the number of queries goes up.


Here's a status page that gathers all the stats.  Chain A is using 
CMS/ParNew and is no longer receiving queries.  All the queries are now 
going to chain B, which is using G1.


http://dl.dropbox.com/u/97770508/g1-vs-cms-stats.png

The server CPU utilization graph doesn't have enough information yet to 
make any determination, but what little data is visible suggests that 
CPU may be higher.  The secondary servers also have slightly slower CPUs 
than the primary servers.  I was forced to make concessions on later 
purchases to keep the cost down.


Thanks,
Shawn



Re: build CMIS compatible Solr

2013-01-20 Thread Nicholas Li
I think this might be the one you are talking about:
https://github.com/sourcesense/solr-cmis

But I think Alfresco already has search functionality, similar to Solr.
Then why did you want to use it to index docs out of Alfresco?

On Fri, Jan 18, 2013 at 8:00 PM, Upayavira u...@odoko.co.uk wrote:

 A colleague of mine when I was working for Sourcesense made a CMIS
 plugin for Solr. It was one way, and we used it to index stuff out of
 Alfresco into Solr. I can't search for it now, let me know if you can't
 find it.

 Upayavira

 On Fri, Jan 18, 2013, at 05:35 AM, Nicholas Li wrote:
  I want to make something like Alfresco, but not having that many
  features.
  And I'd like to utilise the searching ability of Solr.
 
  On Fri, Jan 18, 2013 at 4:11 PM, Gora Mohanty g...@mimirtech.com
 wrote:
 
   On 18 January 2013 10:36, Nicholas Li nicholas...@yarris.com wrote:
hi
   
I am new to Solr and I would like to use Solr as my document server, plus
search engine. But Solr is not CMIS compatible (while it should not be, as
it is not built as a pure document management server). In that sense, I
would build another layer on top of Solr so that the exposed interface
would be CMIS compatible.
   [...]
  
   May I ask why? Solr is designed to be a search engine,
   which is a very different beast from a document repository.
   In the open-source world, Alfresco ( http://www.alfresco.com/ )
   already exists, can index into Solr, and supports CMIS-based
   access.
  
   Regards,
   Gora
  



Re: Long ParNew GC pauses - even when young generation is small

2013-01-20 Thread Shawn Heisey
Unfortunately, G1 on Java 6 was a bust.  Several times GC pauses made my 
load balancer think the server was down, just like with CMS/ParNew.


Either there's something about my production query patterns that doesn't 
get along with any of the garbage collection methods, or I need to 
upgrade to Java 7.


I have tried lowering my max heap before.  That results in OOM problems 
when I do full-import with DIH.




On 1/20/2013 2:13 PM, Markus Jelsma wrote:

Hi Shawn,

Although our heap spaces are much less than yours (256M for 2x 2.5GB cores per 
node) we saw decreased throughput and higher latency with G1 on Java 6. You can 
also expect higher CPU consumption. You can check it very well with VisualVM 
attached.

Looking forward to your results.

Markus



-Original message-

From:Shawn Heisey s...@elyograg.org
Sent: Sun 20-Jan-2013 21:48
To: solr-user@lucene.apache.org
Subject: Re: Long ParNew GC pauses - even when young generation is small

On 1/18/2013 10:07 PM, Shawn Heisey wrote:

I may try the G1 collector with Java 6 in production, since I am on the
newest Oracle version.


I am giving this a try on my secondary server set.  An encouraging note:
The -XX:+UnlockExperimentalVMOptions option is no longer required to use
the G1 collector, at least on version 6u38.

Thanks,
Shawn






Re: Long ParNew GC pauses - even when young generation is small

2013-01-20 Thread giltene
 I don't see any info on your website about pricing, so I can't make any 
 decisions about whether it would be right for me.  Can you give me 
 long-term pricing information?

As is the case with much of enterprise software (including getting a
supported version of Oracle HotSpot), this is a sales-person conversation
that we'd be happy to have. You can ask for someone to contact you about
this right on the site, or if you want, you can contact me at gil at
azulsystems dot com and I'll make sure we get you the information you
need.

 Chances are that once I inform management of the cost, it'd never fly.

You may be surprised. You seem to assume that Zing is expensive for some
reason, while it's probably on par or cheaper than other supported JVMs for
this sort of thing. It's certainly flown with management for others
running into the exact same problems with both Solr and Lucene. Saved them
both time and money in the process of forever removing GC headaches.
 





Re: Long ParNew GC pauses - even when young generation is small

2013-01-20 Thread giltene
If you believe the logs, using -XX:+PrintGCApplicationStoppedTime is probably
the easiest way to avoid having to try to parse pause times from various
formats. But remember, GC logs can [often unintentionally] lie (I've seen
them under-report by multi-second gaps).

If you want to actually measure your JVM pauses (GC or others), you can use
something like jHiccup (http://www.azulsystems.com/jHiccup). It is a free
(as in beer) and public domain (CC0) tool that will show you any
blip/glitch/hiccup that your JVM experiences while running your application,
and report it in both time-based and detailed percentile form. What jHiccup
shows you is a best-case response time for your application as it runs
(the response time the application would have shown if it had completed all
of its actual work in zero time).

It's near-trivial to add jHiccup to your environment (as either a java agent
or wrapper script). It would be interesting to see the percentile histograms
(jHiccup's .hgrm text output) for your environment.





Re: Tokenized keywords

2013-01-20 Thread Dikchant Sahi
Can you please elaborate a bit more on what you are trying to achieve?

Tokenizers work on the indexed field and don't affect how the values will be
displayed. The response value comes from the stored field. If you want to see
how your query is being tokenized, you can do it using the analysis interface
or enable debugQuery to see how your query is being formed.


On Mon, Jan 21, 2013 at 11:06 AM, Romita Saha
romita.s...@sg.panasonic.com wrote:

 Hi,

 I use some tokenizers to tokenize the query. I want to see the tokenized
 query words displayed in the response. Could you kindly help me do that.

 Thanks and regards,
 Romita


Re: Tokenized keywords

2013-01-20 Thread Romita Saha
What I am trying to achieve is as follows.

I query "Search for all the Laptops" and my tokenized keywords are
"search laptop" (I apply a stopword filter to filter out words like
"for", "all", "the", and I also use a lowercase filter).
I want to display these tokenized keywords using debugQuery.

Thanks and regards,
Romita 



From:   Dikchant Sahi contacts...@gmail.com
To: solr-user@lucene.apache.org, 
Date:   01/21/2013 02:26 PM
Subject:Re: Tokenized keywords



Can you please elaborate a more on what you are trying to achieve.

Tokenizers work on indexed field and doesn't effect how the values will be
displayed. The response value comes from stored field. If you want to see
how your query is being tokenized, you can do it using analysis interface
or enable debugQuery to see how your query is being formed.


On Mon, Jan 21, 2013 at 11:06 AM, Romita Saha
romita.s...@sg.panasonic.com wrote:

 Hi,

 I use some tokenizers to tokenize the query. I want to see the tokenized
 query words displayed in the response.Could you kindly help me do 
that.

 Thanks and regards,
 Romita



Data import handler start bulging the memory after completing 1 million

2013-01-20 Thread vijeshnair
http://lucene.472066.n3.nabble.com/file/n4034949/ScreenShot034.jpg 

You may refer to this snapshot to get an understanding of the resource
consumption. I am trying to index a total of 13 million documents
from MySQL into Solr. The first 1 million documents completed very smoothly
in the first 2 minutes; later it started bulging the RAM, and the memory never
gets released in between. I have tried all the known tricks and tactics and am
still failing to rectify this issue. I am using Solr 4.0, using DIH to import from
MySQL 5.5. Any help will be much appreciated, and I am trying to find any
loophole in my schema and config files.





Re: Tokenized keywords

2013-01-20 Thread Mikhail Khludnev
Romita,
That's exactly what the debugQuery output shows. If you can't find it there,
paste the output here and let's try to find it together. Also pay attention to
the explainOther debug parameter and the analysis page in the admin UI.
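
For example, a small SolrJ sketch that surfaces the parsed (analyzed) form of the query from the debug output; the core URL and default search field are assumptions:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class ShowParsedQuery {
  public static void main(String[] args) throws Exception {
    HttpSolrServer server =
        new HttpSolrServer("http://localhost:8983/solr/collection1");
    SolrQuery q = new SolrQuery("Search for all the Laptops");
    q.set("debugQuery", true);
    QueryResponse rsp = server.query(q);
    // "parsedquery" shows the query after analysis, e.g. text:search text:laptop
    System.out.println(rsp.getDebugMap().get("parsedquery"));
  }
}
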
On 21.01.2013 10:50, Romita Saha romita.s...@sg.panasonic.com wrote:

 What I am trying to achieve is as follows.

 I query Search for all the Laptops and my tokenized key words are
 search laptop (I apply stopword filter to filter out words like
 for,all,the and i also user lowercase filter).
 I want to display these tokenized keywords using debugQuery.

 Thanks and regards,
 Romita



 From:   Dikchant Sahi contacts...@gmail.com
 To: solr-user@lucene.apache.org,
 Date:   01/21/2013 02:26 PM
 Subject:Re: Tokenized keywords



 Can you please elaborate a more on what you are trying to achieve.

 Tokenizers work on indexed field and doesn't effect how the values will be
 displayed. The response value comes from stored field. If you want to see
 how your query is being tokenized, you can do it using analysis interface
 or enable debugQuery to see how your query is being formed.


 On Mon, Jan 21, 2013 at 11:06 AM, Romita Saha
 romita.s...@sg.panasonic.com wrote:

  Hi,
 
  I use some tokenizers to tokenize the query. I want to see the tokenized
  query words displayed in the response.Could you kindly help me do
 that.
 
  Thanks and regards,
  Romita