Re: Hierarchical faceting
I realize you want to avoid putting depth details into the field values, but something has to imply the depth. So with that in mind, here is another approach (with the assumption that you are chasing down a single branch of a tree, and all its subbranch offshoots):

- Use dynamic fields
- Step from one level to the next with a simple increment
- Build the facet for the next level on the call
- The UI needs only know the current level

This would possibly be as so: step_fieldname_n, with a dynamic field configuration of: step_*

The content of the step_fieldname_n field would be either the string of the field value or the delimited path of the current level (as suited to taste). Either way, most likely a fieldType of string (or some variation thereof).

The UI would then call:

  facet.field=step_fieldname_n+1

And the UI would need to be aware to carry the n+1 into the fq link verbiage:

  fq=step_fieldname_n+1:facetvalue

The trick of all of this is that you must build your index with the depth of your hierarchy in mind to place the values into the suitable fields. You could, of course, write an UpdateProcessor to accomplish this if that seems fitting.

Jason

On Nov 17, 2014, at 12:22 PM, Alexandre Rafalovitch arafa...@gmail.com wrote:

You might be able to stick in a couple of PatternReplaceFilterFactory in a row with regular expressions to catch different levels. Something like:

  <filter class="solr.PatternReplaceFilterFactory" pattern="^[^0-9][^/]+/[^/]/[^/]+$" replacement="2$0" />
  <filter class="solr.PatternReplaceFilterFactory" pattern="^[^0-9][^/]+/[^/]$" replacement="1$0" />
  ...

I did not test this; you may need to escape some things or put explicit groups in there.

Regards,
Alex.

P.s. http://www.solr-start.com/javadoc/solr-lucene/org/apache/lucene/analysis/pattern/PatternReplaceFilterFactory.html

Personal: http://www.outerthoughts.com/ and @arafalov
Solr resources and newsletter: http://www.solr-start.com/ and @solrstart
Solr popularizers community: https://www.linkedin.com/groups?gid=6713853

On 17 November 2014 15:01, rashmy1 rashmy.appanerava...@siemens.com wrote:

Hi Alexandre,

Yes, I've read this post and that's the 'Option1' listed in my initial post. I'm looking to see if Solr has any in-built tokenizer that splits the tokens and prepends the depth information. I'd like to avoid building depth information into the field values if Solr already has something that can be used.

Thanks!
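A minimal sketch of the dynamic-field approach Jason describes (the field names, values, and levels below are hypothetical):

In schema.xml:

  <dynamicField name="step_*" type="string" indexed="true" stored="true" multiValued="true"/>

At index time, a document on the branch electronics > cameras > slr would carry one value per level:

  step_category_0: electronics
  step_category_1: electronics/cameras
  step_category_2: electronics/cameras/slr

A UI currently sitting at level 1 would then request the next level's facet while filtering on the current selection:

  facet.field=step_category_2&fq=step_category_1:"electronics/cameras"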
Re: openSearcher, default commit settings
Boon,

I expect you will find many definitions of “proper usage” depending upon context and expected results. Personally, I don’t believe this is Solr’s job to enforce, and there are many ways through the use of directives in the servlet container layer that can allow restrictions if you feel this is required.

I would recommend considering an abstraction layer if you feel your development team may (accidentally) abuse the system they are permitted to use. I’ve seen this employed very well with minimal latency and cost in extremely large corporations that have many multiple development teams using the same search infrastructure.

Jason

On Jun 2, 2014, at 3:53 AM, Boon Low boon@dctfh.com wrote:

Thanks for clearing this up. The wiki, being an authoritative reference, needs to be corrected.

Re. default commit settings. I agree educating developers is very essential. But in reality, you can't rely on this as the sole mechanism for ensuring proper usage of the update API, especially for calls such as commit, optimize, and expungeDeletes, which can be very expensive for large indexes on a shared infrastructure. The issue is, there's no control mechanism in Solr for update calls (cf. rewriting calls via load-balancer). Once you expose the update handler to the developers, they could send 10 commit/optimise ops per minute, opening new searchers for each of those calls (openSearcher is only configurable for autocommit). And there is nothing you can do about it in Solr, even as an immediate stopgap while a fix is being implemented for the next sprint.

It'd be good to have some consistency in terms of configuring handlers, i.e. having default/invariant settings for both the search and update handlers.

Thanks,

Boon

-
Boon Low
Search Engineer, DCT Family History

On 29 May 2014, at 18:03, Shawn Heisey s...@elyograg.org wrote:

On 5/29/2014 9:21 AM, Boon Low wrote:

1. openSearcher (autoCommit)

According to the Apache Solr reference, autoCommit/openSearcher is set to false by default.

https://cwiki.apache.org/confluence/display/solr/UpdateHandlers+in+SolrConfig

But on Solr v4.8.1, if openSearcher is omitted from the autoCommit config, new searchers are opened and warmed post auto-commits. Is this behaviour intended or is the wiki wrong?

I am reasonably certain that the default for openSearcher if it is not specified will always be true. My understanding and your actual experience say that the documentation is wrong.

Additional note: The docs for autoSoftCommit are basically a footnote on autoCommit, which I think is a mistake -- it should have its own section, and the docs should mention that openSearcher does not apply.

I think the code confirms this. From SolrConfig.java:

  protected UpdateHandlerInfo loadUpdatehandlerInfo() {
    return new UpdateHandlerInfo(get("updateHandler/@class", null),
        getInt("updateHandler/autoCommit/maxDocs", -1),
        getInt("updateHandler/autoCommit/maxTime", -1),
        getBool("updateHandler/autoCommit/openSearcher", true),
        getInt("updateHandler/commitIntervalLowerBound", -1),
        getInt("updateHandler/autoSoftCommit/maxDocs", -1),
        getInt("updateHandler/autoSoftCommit/maxTime", -1),
        getBool("updateHandler/commitWithin/softCommit", true));
  }

2. openSearcher and other default commit settings

From previous posts, I know it's not possible to disable commits completely in Solr config (without coding). But is there a way to configure the default settings of hard/explicit commits for the update handler? If not, it makes sense to have a configuration mechanism.
Currently, a simple commit call seems to be hard-wired with the following options:

  commit{,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}

There's no server-side option, e.g. to set openSearcher=false as a default or invariant (cf. searchHandler) to prevent new searchers from opening. I found that at times it is necessary to have better server- or infrastructure-side controls for updates/commits, especially in agile teams. Client/UI developers do not necessarily have complete Solr knowledge. Unintended commits from misbehaving client-side updates may be the norm (e.g. 10 times per minute!).

Since you want to handle commits automatically, you'll want to educate your developers and tell them that they should never send commits -- let Solr handle it. If the code that talks to Solr is Java and uses SolrJ, you might want to consider using forbidden-apis in your project so that a build will fail if the commit method gets used.

https://code.google.com/p/forbidden-apis/

Thanks,
Shawn
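For readers following along, the autocommit route mentioned above is configured in solrconfig.xml. A minimal sketch (the intervals are illustrative) that commits hard without opening searchers, and makes changes visible only via soft commits:

  <updateHandler class="solr.DirectUpdateHandler2">
    <autoCommit>
      <maxTime>60000</maxTime>
      <openSearcher>false</openSearcher>
    </autoCommit>
    <autoSoftCommit>
      <maxTime>300000</maxTime>
    </autoSoftCommit>
  </updateHandler>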
Re: Boost documents having a field value
Hakim,

That is what Boost Query (bq=) does.

http://wiki.apache.org/solr/DisMaxQParserPlugin#bq_.28Boost_Query.29

Jason

On Jun 2, 2014, at 10:58 AM, Hakim Benoudjit h.benoud...@gmail.com wrote:

Hi guys,

Is it possible in Solr to boost documents having a field value (e.g. field:value)? I know that it's possible to boost a field above other fields at query-time, but I want to boost a field value, not the field name. And if so, is the boosting done at query time or at indexing time?

--
Hakim Benoudjit.
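A minimal example with dismax (the field name and boost value are illustrative); the boost is applied at query time, not at indexing:

  q=ipod&defType=dismax&qf=title description&bq=color:white^10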
Re: SolrCloud: Understanding Replication
Marc,

Fundamentally it’s a good solution design to always be capable of reposting (reindexing) your data to Solr. You are demonstrating a classic use case of this, which is upgrade. Is there a critical reason why you are avoiding this step?

Jason

On May 30, 2014, at 10:38 AM, Marc Campeau cam...@gmail.com wrote:

2014-05-30 12:24 GMT-04:00 Erick Erickson erickerick...@gmail.com:

Let's back up a bit here. Why are you copying your indexes around? SolrCloud does all this for you. I suspect you've somehow made a mis-step.

I started by copying the index around because my 4.5.1 instance is not set up as Cloud and I wanted to avoid reindexing all my data when migrating to my new 4.8.1 SolrCloud setup. I've now put that aside and I'm just trying to get replication happening when I populate an empty collection.

So here's what I'd do by preference; Just set up a new collection and re-index. Make sure all of the nodes are up and then just go ahead and index to any of them. If you're using SolrJ, CloudSolrServer will be a bit more efficient than sending the docs to random nodes, but that's not necessary.

I've been trying that this morning. Stop the instances, deleted the contents of /data on all my 4.8.1 instances, then started them again... they all show up in a 1-shard cluster as 4 replicas and one is the leader... they're still shown as down in clusterstate. Then I sent a document to be added to one of the nodes specifically. Only that node now contains the document. It hasn't been replicated to the other instances. When I issue queries to the collection for that document through my load balancer it works roughly 1/4 times, in accordance with the fact that it's only on the instance where it was added.

Must I use the Collections API to create this new collection or can I just do it old style by creating a subfolder in the /solr directory with my confs?

Here's the log of these operations.

LOG of instance where document was added:

2758138 [qtp1781256139-14] INFO org.apache.solr.update.processor.LogUpdateProcessor – [mycollection] webapp=/solr path=/update/ params={indent=on&version=2.2&wt=json} {add=[Listing_3446279]} 0 271
2769177 [qtp1781256139-12] INFO org.apache.solr.core.SolrCore – [mycollection] webapp=/solr path=/admin/ping params={} hits=0 status=0 QTime=1
[... More Pings ... ]
2773138 [commitScheduler-7-thread-1] INFO org.apache.solr.update.UpdateHandler – start commit{,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=true,prepareCommit=false}
2773377 [commitScheduler-7-thread-1] INFO org.apache.solr.search.SolrIndexSearcher – Opening Searcher@175816a5[mycollection] main
2773389 [searcherExecutor-5-thread-1] INFO org.apache.solr.core.SolrCore – QuerySenderListener sending requests to Searcher@175816a5[mycollection] main{StandardDirectoryReader(segments_1:3:nrt _0(4.8):C1)}
2773389 [searcherExecutor-5-thread-1] INFO org.apache.solr.core.SolrCore – QuerySenderListener done.
2773390 [searcherExecutor-5-thread-1] INFO org.apache.solr.core.SolrCore – [mycollection] Registered new searcher Searcher@175816a5[mycollection] main{StandardDirectoryReader(segments_1:3:nrt _0(4.8):C1)}
2773390 [commitScheduler-7-thread-1] INFO org.apache.solr.update.UpdateHandler – end_commit_flush
[... More Pings ...]
2799792 [qtp1781256139-18] INFO org.apache.solr.update.UpdateHandler – start commit{,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
2799883 [qtp1781256139-18] INFO org.apache.solr.core.SolrCore – SolrDeletionPolicy.onCommit: commits: num=2
commit{dir=NRTCachingDirectory(MMapDirectory@/opt/solr-4.8.0/example/solr/mycollection/data/index lockFactory=NativeFSLockFactory@/opt/solr-4.8.0/example/solr/mycollection/data/index; maxCacheMB=48.0 maxMergeSizeMB=4.0),segFN=segments_1,generation=1}
commit{dir=NRTCachingDirectory(MMapDirectory@/opt/solr-4.8.0/example/solr/mycollection/data/index lockFactory=NativeFSLockFactory@/opt/solr-4.8.0/example/solr/mycollection/data/index; maxCacheMB=48.0 maxMergeSizeMB=4.0),segFN=segments_2,generation=2}
2799884 [qtp1781256139-18] INFO org.apache.solr.core.SolrCore – newest commit generation = 2
2799887 [qtp1781256139-18] INFO org.apache.solr.core.SolrCore – SolrIndexSearcher has not changed - not re-opening: org.apache.solr.search.SolrIndexSearcher
2799887 [qtp1781256139-18] INFO org.apache.solr.update.UpdateHandler – end_commit_flush
2799888 [qtp1781256139-18] INFO org.apache.solr.update.processor.LogUpdateProcessor – [mycollection] webapp=/solr path=/update params={update.distrib=FROMLEADER&waitSearcher=true&openSearcher=true&commit=true&softCommit=false&distrib.from=http://192.168.150.90:8983/solr/mycollection/&commit_end_point=true&wt=javabin&version=2&expungeDeletes=false} {commit=} 0 96
2800051 [qtp1781256139-14] INFO
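On the question of creating the collection: in SolrCloud a new collection is normally created through the Collections API rather than by hand-placing core directories. A minimal sketch (host, names, and counts are illustrative, and this assumes the config set has already been uploaded to ZooKeeper; it can be selected with collection.configName):

  http://localhost:8983/solr/admin/collections?action=CREATE&name=mycollection&numShards=1&replicationFactor=4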
Re: Error enquiry- exceeded limit of maxWarmingSearchers=2
I’m also not sure I understand the practical purpose of your hard/soft auto commit settings. You are stating the following:

Every 10 seconds I want data written to disk, but not be searchable.
Every 15 seconds I want data to be written into memory and searchable.

I would consider whether your soft commit window is too long, or if you can lengthen your hard commit period. It’s typical to see hard commits occur *less* frequently than soft commits.

On May 30, 2014, at 11:04 AM, Shawn Heisey s...@elyograg.org wrote:

On 5/29/2014 9:55 PM, M, Arjun (NSN - IN/Bangalore) wrote:

Thanks a lot for your nice explanation.. Now I understood the difference between autoCommit and autoSoftCommit.. Now my config looks like below.

  <autoCommit>
    <maxDocs>1</maxDocs>
    <openSearcher>false</openSearcher>
  </autoCommit>
  <autoSoftCommit>
    <maxTime>15000</maxTime>
  </autoSoftCommit>

With this now I am getting some other error like this.

org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: version conflict for 140142167803912812800030383128128 expected=1469497192978841608 actual=1469497212082847746

This sounds like you are including the _version_ field in your document when you index. You probably shouldn't be doing that. Here's what that field is for, and how it works:

http://heliosearch.org/solr/optimistic-concurrency/

Thanks,
Shawn
Re: Error enquiry- exceeded limit of maxWarmingSearchers=2
I just realized I failed my own reading comprehension :)  You have maxDocs, not maxTime, for hard commit. Please disregard.

On May 30, 2014, at 1:46 PM, Jason Hellman jhell...@innoventsolutions.com wrote:

I’m also not sure I understand the practical purpose of your hard/soft auto commit settings. You are stating the following:

Every 10 seconds I want data written to disk, but not be searchable.
Every 15 seconds I want data to be written into memory and searchable.

I would consider whether your soft commit window is too long, or if you can lengthen your hard commit period. It’s typical to see hard commits occur *less* frequently than soft commits.

On May 30, 2014, at 11:04 AM, Shawn Heisey s...@elyograg.org wrote:

On 5/29/2014 9:55 PM, M, Arjun (NSN - IN/Bangalore) wrote:

Thanks a lot for your nice explanation.. Now I understood the difference between autoCommit and autoSoftCommit.. Now my config looks like below.

  <autoCommit>
    <maxDocs>1</maxDocs>
    <openSearcher>false</openSearcher>
  </autoCommit>
  <autoSoftCommit>
    <maxTime>15000</maxTime>
  </autoSoftCommit>

With this now I am getting some other error like this.

org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: version conflict for 140142167803912812800030383128128 expected=1469497192978841608 actual=1469497212082847746

This sounds like you are including the _version_ field in your document when you index. You probably shouldn't be doing that. Here's what that field is for, and how it works:

http://heliosearch.org/solr/optimistic-concurrency/

Thanks,
Shawn
Re: Enforcing a hard timeout on shard requests?
Gregg,

I don’t have an answer to your question, but I’m very curious what use case you have that permits such arbitrary partial results. Is it just an edge case or do you want to permit a common occurrence?

Jason

On May 30, 2014, at 3:05 PM, Gregg Donovan gregg...@gmail.com wrote:

I'd like to add a hard timeout on some of my sharded requests. E.g.: for about 30% of the requests, I want to wait no longer than 120ms before a response comes back, but aggregating results from as many shards as possible in that 120ms.

My first attempt was to use timeAllowed=120&shards.tolerant=true. This sort of works, in that I'll see partial results occasionally, but slow shards will still take much longer than my timeout to return, sometimes up to 700ms. I imagine if the CPU is busy or the node is GC-ing that it won't be able to enforce the timeAllowed and return.

Is there a way to enforce this timeout without failing the request entirely? I'd still like to get as many shards to return in 120ms as I can, even if they have partialResults.

Thanks.

--Gregg
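For reference, the combination Gregg describes looks like this on a request (values illustrative); when the time limit trips, Solr flags the response header with partialResults=true:

  q=some query&timeAllowed=120&shards.tolerant=true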
Re: Solr interface
This. And so much this. As much this as you can muster.

On Apr 7, 2014, at 1:49 PM, Michael Della Bitta michael.della.bi...@appinions.com wrote:

The speed of ingest via HTTP improves greatly once you do two things:

1. Batch multiple documents into a single request.
2. Index with multiple threads at once.

Michael Della Bitta
Applications Developer
o: +1 646 532 3062
appinions inc. | The Science of Influence Marketing
18 East 41st Street, New York, NY 10017
t: @appinions https://twitter.com/Appinions | g+: plus.google.com/appinions
w: appinions.com http://www.appinions.com/

On Mon, Apr 7, 2014 at 12:40 PM, Daniel Collins danwcoll...@gmail.com wrote:

I have to agree with Shawn. We have a SolrCloud setup with 256 shards, ~400M documents in total, with 4-way replication (so it's quite a big setup!). I had thought that HTTP would slow things down, so we recently trialed a JNI approach (clients are C++) so we could call SolrJ and get the benefits of JavaBin encoding for our indexing.

Once we had done benchmarks with both solutions, I think we saved about 1ms per document (on average) with JNI, so it wasn't as big a gain as we were expecting. There are other benefits of SolrJ (zookeeper integration, better routing, etc.) and we were doing local HTTP (so it was literally just a TCP port to localhost, no actual net traffic), but that just goes to prove what other posters have said here. Check whether HTTP really *is* the bottleneck before you try to replace it!

On 7 April 2014 17:05, Shawn Heisey s...@elyograg.org wrote:

On 4/7/2014 5:52 AM, Jonathan Varsanik wrote:

Do you mean to tell me that the people on this list that are indexing 100s of millions of documents are doing this over http? I have been using custom Lucene code to index files, as I thought this would be faster for many documents and I wanted some non-standard OCR and index fields. Is there a better way? To the OP: You can also use Lucene to locally index files for Solr.

My sharded index has 94 million docs in it. All normal indexing and maintenance is done with SolrJ, over http. Currently full rebuilds are done with the dataimport handler loading from MySQL, but that is legacy. This is NOT a SolrCloud installation. It is also not a replicated setup -- my indexing program keeps both copies up to date independently, similar to what happens behind the scenes with SolrCloud.

The single-thread DIH is very well optimized, and is faster than what I have written myself -- also single-threaded. The real reason that we still use DIH for rebuilds is that I can run the DIH simultaneously on all shards. A full rebuild that way takes about 5 hours. A SolrJ process feeding all shards with a single thread would take a lot longer. Once I have time to work on it, I can make the SolrJ rebuild multi-threaded, and I expect it will be similar to DIH in rebuild speed. Hopefully I can make it faster.

There is always overhead with HTTP. On a gigabit LAN, I don't think it's high enough to matter.

Using Lucene to index files for Solr is an option -- but that requires writing a custom Lucene application, and knowledge about how to turn the Solr schema into Lucene code. A lot of users on this list (me included) do not have the skills required. I know SolrJ reasonably well, but Lucene is a nut that I haven't cracked.

Thanks,
Shawn
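A minimal SolrJ sketch of Michael's two points, batching documents into a single request and indexing from several threads at once (the URL, field names, and sizes are illustrative; the SolrJ 4.x API is assumed):

  import java.util.ArrayList;
  import java.util.List;
  import java.util.concurrent.ExecutorService;
  import java.util.concurrent.Executors;
  import org.apache.solr.client.solrj.impl.HttpSolrServer;
  import org.apache.solr.common.SolrInputDocument;

  public class BatchIndexer {
      public static void main(String[] args) throws Exception {
          // One server object can be shared across threads.
          final HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");
          ExecutorService pool = Executors.newFixedThreadPool(4); // point 2: several threads
          for (int b = 0; b < 100; b++) {
              final int batch = b;
              pool.submit(new Runnable() {
                  public void run() {
                      try {
                          // point 1: many documents in a single request
                          List<SolrInputDocument> docs = new ArrayList<SolrInputDocument>();
                          for (int i = 0; i < 1000; i++) {
                              SolrInputDocument doc = new SolrInputDocument();
                              doc.addField("id", batch + "-" + i);
                              doc.addField("title", "document " + i);
                              docs.add(doc);
                          }
                          server.add(docs); // one HTTP round trip for the whole batch
                      } catch (Exception e) {
                          e.printStackTrace();
                      }
                  }
              });
          }
          pool.shutdown(); // commits are left to the server's autoCommit settings
      }
  }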
Re: Exact fragment length in highlighting
Juan,

Pay close attention to the boundary scanner you’re employing:

http://wiki.apache.org/solr/HighlightingParameters#hl.boundaryScanner

You can be explicit to indicate a type (hl.bs.type) with options such as CHARACTER, WORD, SENTENCE, and LINE. The default is WORD (as the wiki indicates) and I presume this is what you are employing.

Be careful about using explicit characters. I had an interesting case of highlight returns that looked like this:

  This is a highlight
  Here is another highlight
  Yes, another one

etc… It was a bit maddening trying to figure out why an unexpected character was in the highlight…turned out it was XML content and the character boundary clipped a trailing character based on the boundary rules.

In any case, you should be able to achieve a pretty flexible result depending on what you’re really after with the right combination of settings.

Jason

On Feb 19, 2014, at 7:53 AM, Juan Carlos Serrano jcserran...@gmail.com wrote:

Hello everybody,

I'm using Solr 4.6.1 and I'd like to know if there's a way to determine exactly the number of characters of a fragment used in highlights. If I use hl.fragsize=70, the length of the fragments I get varies, and I often get results 90 characters in length.

Regards and thanks in advance,

Juan Carlos
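For concreteness, a request along these lines might look as follows (field name illustrative). Note that the hl.bs.* parameters apply to the FastVectorHighlighter, which requires the field to be indexed with termVectors, termPositions, and termOffsets:

  hl=true&hl.fl=content&hl.fragsize=70&hl.useFastVectorHighlighter=true&hl.bs.type=SENTENCE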
Re: Caching Solr boost functions?
Gregg,

The queryResultCache caches a sorted int array of results matching a query. This should overlap very nicely with your desired behavior, as a hit in this cache will not perform a Lucene query nor need to calculate scores.

Now, ‘for the life of the Searcher’ is the trick here. You can size your cache large enough to ensure it can fit every possible query, but at some point this is untenable. I would argue that high volatility of query parameters would invalidate the need for caching anyway, but that’s clearly debatable. Nevertheless, this should work admirably well to solve your needs.

Jason

On Feb 18, 2014, at 11:32 AM, Gregg Donovan gregg...@gmail.com wrote:

We're testing out a new handler that uses edismax with three different boost functions. One has a random() function in it, so is not very cacheable, but the other two boost functions do not change from query to query. I'd like to tell Solr to cache those boost queries for the life of the Searcher so they don't get recomputed every time. Is there any way to do that out of the box?

In a different custom QParser we have, we wrote a CachingValueSource that wrapped a ValueSource with a custom ValueSource cache. Would it make sense to implement that as a standard Solr function so that one could do:

  boost=cache(expensiveFunctionQuery())

Thanks.

--Gregg
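The cache Jason refers to is sized in solrconfig.xml; a sketch with illustrative values:

  <queryResultCache class="solr.LRUCache"
                    size="4096"
                    initialSize="1024"
                    autowarmCount="256"/>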
Re: Solr Autosuggest - Strange issue with leading numbers in query
Here’s a rather obvious question: have you rebuilt your spell index recently? Is it possible the offending numbers snuck into the spell dictionary? The terms component will show you what’s in your current, searchable field…but not the dictionary.

If my memory serves correctly, with collate=true this would allow for such behavior to occur, especially with onlyMorePopular set to false (which would ensure the resulting collation has a query count greater than the current query). Have you flipped onlyMorePopular to true to confirm?

On Feb 18, 2014, at 10:16 AM, bbi123 bbar...@gmail.com wrote:

Thanks a lot for your response Erik.

I was trying to find if I have any suggestion starting with numbers using the terms component but I couldn't find any.. It's very strange!!!

Anyways, thanks again for your response.
Re: block join and atomic updates
Thinking in terms of normalized data in the context of a Lucene index is dangerous. It is not a relational data model technology, and the join behaviors available to you have limited use. Each approach requires compromises that are likely impermissible for certain use cases.

If it is at all reasonable to consider, you will likely be best served by de-normalizing the data. Of course, your specific details may prove an exception to this rule…but generally this approach works very well.

On Feb 18, 2014, at 4:19 AM, Mikhail Khludnev mkhlud...@griddynamics.com wrote:

absolutely.

On Tue, Feb 18, 2014 at 1:20 PM, m...@preselect-media.com wrote:

But isn't query time join much slower when it comes to a large amount of documents?

Quoting Mikhail Khludnev mkhlud...@griddynamics.com:

Hello,

It sounds like you need to switch to query time join.

On 15.02.2014 at 21:57, m...@preselect-media.com wrote:

Any suggestions?

Quoting m...@preselect-media.com:

Yonik Seeley yo...@heliosearch.com:

On Thu, Feb 13, 2014 at 8:25 AM, m...@preselect-media.com wrote:

Is there any workaround to perform atomic updates on blocks or do I have to re-index the parent document and all its children always again if I want to update a field?

The latter, unfortunately.

Is there any plan to change this behavior in the near future?

So, I'm thinking of alternatives without losing the benefit of block join. I try to explain an idea I just thought about:

Let's say I have a parent document A with a number of fields I want to update regularly and a number of child documents AC_1 ... AC_n which are only indexed once and aren't going to change anymore. So, if I index A and AC_* in a block and I update A, the block is gone. But if I create an additional document AF which only contains something like a foreign key to A and index AF + AC_* as a block (not A + AC_* anymore), could I perform a {!parent ... } query on AF + AC_* and make a join from the results to get A?

Does this make any sense and is it even possible? ;-) And if it's possible, how can I do it?

Thanks,
- Moritz

--
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics
http://www.griddynamics.com
mkhlud...@griddynamics.com
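For readers unfamiliar with the syntax under discussion, a block join parent query looks roughly like this (field names and values are hypothetical; it assumes every parent document carries a marker field identifying all parents):

  q={!parent which="doc_type:parent"}child_color:white

This returns parent documents whose children match child_color:white. The query-time join Mikhail mentions is the separate {!join from=... to=...} query parser, which does not require the documents to be indexed together in blocks.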
Re: Solr server requirements for 100+ million documents
Whether you use the same machines as Solr or separate machines is a matter suited to taste. If you are the CTO, then you should make this decision. If not, inform management that risk conditions are greater when you share function and control on a single piece of hardware. A single failure of a replica + zookeeper node will be more impactful than a single failure of a replica *or* a zookeeper node. Let them earn the big bucks to make the risk decision.

The good news is, zookeeper hardware can be extremely lightweight for SolrCloud. Commodity hardware should work just fine…and thus scaling to 5 nodes for zookeeper is not that hard at all.

Jason

On Feb 11, 2014, at 3:00 PM, svante karlsson s...@csi.se wrote:

ZK needs a quorum to stay functional, so 3 servers handles one failure and 5 handles 2 node failures. If you run Solr with 1 replica per shard then stick to 3 ZK. If you use 2 replicas use 5 ZK.
Re: Memory Usage on Windows Os while indexing
To a very large extent, the capability of a platform is measurable by the skill of the team administering it. If core competencies lie in Windows OS then I would wager heavily the platform will outperform a similar Linux OS installation in the long haul.

All things being equal, it’s really hard to argue with Linux. But nothing is ever equal.

On Jan 21, 2014, at 8:57 PM, Shawn Heisey s...@elyograg.org wrote:

On 1/21/2014 2:17 AM, onetwothree wrote:

Does Solr on Linux have better memory management than on Windows, or can you neglect this comparison?

As Toke said, this is indeed debatable. I personally believe that Linux is better at almost everything, but if you're running a recent 64-bit Windows Server OS, you may not actually see a lot of difference. Microsoft has VERY talented people working for them, and even though I won't use it for most server applications, Windows is a very capable platform.

If you ignore personal bias and proceed with the idea that Linux and Windows are approximately equal in terms of real-world performance, then one factor that might be critical is price. Linux can be installed for zero cost; a standalone bare-metal Windows Server license is several hundred dollars, sometimes more.

Thanks,
Shawn
Re: how to best convert some term in q to a fq
I second this notion. My reasoning focuses mostly on maintainability, where I posit that your client code will be far easier to extend/modify/troubleshoot than any effort spent attempting to do this within Solr.

Jason

On Dec 23, 2013, at 12:07 PM, Joel Bernstein joels...@gmail.com wrote:

I would suggest handling this in the client. You could write custom Solr code also, but it would be more complicated because you'd be working with Solr's APIs.

Joel Bernstein
Search Engineer at Heliosearch

On Mon, Dec 23, 2013 at 2:36 PM, jmlucjav jmluc...@gmail.com wrote:

Hi,

I have this scenario that I think is not unusual: Solr will get a user-entered query string like 'apple pear france'. I need to do this: if any of the terms is a country, then change the query params to move that term to a fq, i.e.:

  q=apple pear france

becomes

  q=apple pear&fq=country:france

What do you guys think would be the best way to implement this?

- custom SearchComponent or query parser
- servlet in same jetty as solr
- client code

To simplify, consider countries are just a single term. Any pointer to an example to base this on would be great.

thanks
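A minimal sketch of the client-side option (the country list is illustrative, and URL-encoding is omitted for readability):

  import java.util.ArrayList;
  import java.util.Arrays;
  import java.util.HashSet;
  import java.util.List;
  import java.util.Locale;
  import java.util.Set;

  public class QueryRewriter {
      // Illustrative list; a real one would come from a gazetteer or database.
      private static final Set<String> COUNTRIES =
          new HashSet<String>(Arrays.asList("france", "germany", "spain"));

      // Turns "apple pear france" into "q=apple pear&fq=country:france".
      public static String rewrite(String userQuery) {
          List<String> keep = new ArrayList<String>();
          StringBuilder filters = new StringBuilder();
          for (String term : userQuery.trim().split("\\s+")) {
              if (COUNTRIES.contains(term.toLowerCase(Locale.ROOT))) {
                  filters.append("&fq=country:").append(term); // move country terms to fq
              } else {
                  keep.add(term);
              }
          }
          StringBuilder q = new StringBuilder("q=");
          for (int i = 0; i < keep.size(); i++) {
              if (i > 0) q.append(' ');
              q.append(keep.get(i));
          }
          return q.toString() + filters;
      }

      public static void main(String[] args) {
          System.out.println(rewrite("apple pear france")); // q=apple pear&fq=country:france
      }
  }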
Re: Problem with size of segments
David,

I find Mike McCandless’ blog article to be very informative. Give it a go and let us know if you are still seeking clarification:

http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html

Jason

On Nov 7, 2013, at 5:09 AM, david.dav...@correo.aeat.es wrote:

Hi,

I have a very big index, 337 GB more or less, and I am using Solr 4.2. The problem we have is related to the size of segments. This is the size of the biggest ones: 324 GB, 3.7 GB, 3.6 GB, 1.6 GB, 1.6 GB, 465 MB ...

We have LogByteSizeMergePolicy with 10 as mergeFactor in our solrconfig. Really the issue is not a problem, but at least I would like to know why my segments have this size. According to what I have read in papers, if I have a mergeFactor of 10, each level within the index should be one order of magnitude bigger than the previous. So, I can't understand why I have a segment of 324 GB while the others are only 3 GB; this is 2 orders of magnitude bigger. Is this correct or is it a problem with my index? Where can I read a good explanation about the merge policy?

Thank you very much,

Regards,

David Dávila
AEAT
Re: Function query matching
You can, of course, use a function range query:

  select?q=text:news&fq={!frange l=0 u=100}sum(x,y)

http://lucene.apache.org/solr/4_5_1/solr-core/org/apache/solr/search/FunctionRangeQParserPlugin.html

This will give you a bit more flexibility to meet your goal.

On Nov 7, 2013, at 7:26 AM, Erik Hatcher erik.hatc...@gmail.com wrote:

Function queries score (all) documents, but don't filter them. All documents effectively match a function query.

Erik

On Nov 7, 2013, at 1:48 PM, Peter Keegan peterlkee...@gmail.com wrote:

Why does this function query return docs that don't match the embedded query?

  select?qq=text:news&q={!func}sum(query($qq),0)
Re: Replacing Google Mini Search Appliance with Solr?
Nutch is an excellent option. It should feel very comfortable for people migrating away from the Google appliances. Apache Droids is another possible approach, and I’ve found people using Heritrix or ManifoldCF for various use cases (and usually in combination with other use cases where the extra overhead was worth the trouble).

I think the simplest approach will be Nutch…it’s absolutely worth taking a shot at it. DO NOT write a crawler! That is a rabbit hole you do not want to peer down into :)

On Oct 30, 2013, at 10:54 AM, Markus Jelsma markus.jel...@openindex.io wrote:

Hi Eric,

We have also helped some government institutions to replace their expensive GSA with open source software. In our case we use Apache Nutch 1.7 to crawl the websites and index to Apache Solr. It is very effective, robust and scales easily with Hadoop if you have to. Nutch may not be the easiest tool for the job but is very stable, feature rich and has an active community here at Apache.

Cheers,

-Original message-
From: Palmer, Eric epal...@richmond.edu
Sent: Wednesday 30th October 2013 18:48
To: solr-user@lucene.apache.org
Subject: Replacing Google Mini Search Appliance with Solr?

Hello all,

Been lurking on the list for awhile. We are at the end of life for replacing two Google Mini search appliances used to index our public web sites. Google is no longer selling the Mini appliances and buying the big appliance is not cost beneficial.

http://search.richmond.edu/

We would run a Solr replacement on Linux (CentOS, RedHat, similar) with OpenJDK or Oracle Java.

Background
==
~130 sites
only ~12,000 pages (at a depth of 3)
probably ~40,000 pages if we go to a depth of 4

We use key matches a lot. In Solr terms these are elevated documents (elevations).

We would code a search query form in PHP and wrap it into our design (http://www.richmond.edu).

I have played with and love LucidWorks and know that their $ solution works for our use cases, but the cost model is not attractive for such a small collection.

So with Solr what are my open source options, and what are people's experiences crawling and indexing web sites with Solr + a crawler? I understand there is no crawler bundled with Solr, so getting one working would be first up. We can code in Java, PHP, Python etc. if we have to, but we don't want to write a crawler if we can avoid it.

thanks in advance for any information.

--
Eric Palmer
Web Services
U of Richmond
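If it helps anyone sizing this up: with Nutch 1.x of that era, a crawl that posts into Solr was roughly a one-liner (paths, URL, and depth are illustrative, and this assumes the seed URLs are listed in a urls/ directory):

  bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 3 -topN 50000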
Re: When is/should qf different from pf?
It is probable that with no additional boost to pf fields the sum of the scores will be higher. But it is *possible* that they are not, and adding a boost to pf gives greater probability that they will be.

All of this bears testing to confirm what search use cases merit what level of boost. No boost value is universally right…so YMMV, etc...

On Oct 29, 2013, at 9:30 AM, xavier jmlucjav jmluc...@gmail.com wrote:

I am confused: wouldn't a doc that matches both the phrase and the term queries have a better score than a doc matching only the term query, even if qf and pf are the same??

On Mon, Oct 28, 2013 at 7:54 PM, Upayavira u...@odoko.co.uk wrote:

There'd be no point having them the same. You're likely to include boosts in your pf, so that docs that match the phrase query as well as the term query score higher than those that just match the term query. Such as:

  qf=text description&pf=text^2 description^4

Upayavira

On Mon, Oct 28, 2013, at 05:44 PM, Amit Nithian wrote:

Thanks Erick. Numeric fields make sense, as I guess would strictly string fields too, since it's one term? In the normal text searching case, though, does it make sense to have qf and pf differ?

Thanks
Amit

On Oct 28, 2013 3:36 AM, Erick Erickson erickerick...@gmail.com wrote:

The facetious answer is when phrases aren't important in the fields. If you're doing a simple boolean match, adding phrase fields will add expense to no good purpose, etc. Phrases on numeric fields seem wrong.

FWIW,
Erick

On Mon, Oct 28, 2013 at 1:03 AM, Amit Nithian anith...@gmail.com wrote:

Hi all,

I have been using Solr for years but never really stopped to wonder: when using the dismax/edismax handler, when do you have the qf different from the pf?

I have always set them to be the same (maybe different weights) but I was wondering if there is a situation where you would have a field in the qf not in the pf or vice versa.

My understanding from the docs is that qf is a term-wise hard filter while pf is a phrase-wise boost of documents that made it past the qf filter.

Thanks!
Amit
Re: Reclaiming disk space from (large, optimized) segments
If I gauge Otis’ intent here, it is to create shards on the basis of intervals of time. A shard represents a single interval (let’s say a year’s worth of data) and when that data is no longer necessary it is simply shut down and no longer included in queries.

So, for example, you could have three shards spanning the years 2011, 2012, and 2013 respectively. When you no longer need 2011 you simply remove the shard. My example is simple…adjust based upon your needs.

On Oct 29, 2013, at 8:42 AM, Gun Akkor gun.ak...@carbonblack.com wrote:

Otis,

Thank you for your response. Could you elaborate a bit more on what you have in mind when you say time-based indices?

Gun

---
Senior Software Engineer
Carbon Black, Inc.
gun.ak...@carbonblack.com

On Thu, Oct 24, 2013 at 11:56 PM, Otis Gospodnetic otis.gospodne...@gmail.com wrote:

Only skimmed your email, but "purge every 4 hours" jumped out at me. Would it make sense to have time-based indices that can be periodically dropped instead of being purged?

Otis
Solr & ElasticSearch Support
http://sematext.com/

On Oct 23, 2013 10:33 AM, Scott Lundgren scott.lundg...@carbonblack.com wrote:

*Background:*

- Our use case is to use SOLR as a massive FIFO queue.
- Document additions and updates happen continuously. Documents are being added at a sustained rate of 50 - 100 documents per second.
- About 50% of these documents are updates to existing docs, indexed using atomic updates: the original doc is thus deleted and re-added.
- There is a separate purge operation running every four hours that deletes the oldest docs, if required based on a number of unrelated configuration parameters.
- At some time in the past, a manual force merge / optimize with maxSegments=2 was run to troubleshoot high disk I/O and remove "too many segments" as a potential variable. Currently, the largest fdts are 74G and 43G. There are 47 total segments; the largest other sizes are all around 2G.
- Merge policies are all at Solr 4 defaults. Index size is currently ~50M maxDocs, ~35M numDocs, 276GB.

*Issue:*

The background purge operation is deleting docs on schedule, but the disk space is not being recovered.

*Presumptions:*

I presume, but have not confirmed (how?), the 15M deleted documents are predominately in the two large segments. Because they are largely in the two large segments, and those large segments still have (some/many) live documents, the segment backing files are not deleted.

*Questions:*

- When will those segments get merged and documents recovered? Does it happen when _all_ the documents in those segments are deleted? Some percentage of the segment is filled with deleted documents?
- Is there a way to do it right now vs. just waiting?
- In some cases, the purge delete conditional is _just_ free disk space: when index free space is low, delete oldest. Those setups are now in scenarios where index free space is low, and getting worse. How does low disk space affect the above two questions?
- Is there a way for me to determine stats on a per-segment basis? For example, how many deleted documents are in a particular segment?
- On the flip side, can I determine in what segment a particular document is located?

Thank you,

Scott

--
Scott Lundgren
Director of Engineering
Carbon Black, Inc.
(210) 204-0483 | scott.lundg...@carbonblack.com
Re: SOLRJ replace document
Keep in mind that DataStax has a custom update handler, and as such isn't exactly a vanilla Solr implementation (even though in many ways it still is). Since updates are co-written to Cassandra and Solr, you should always tread a bit carefully when slightly outside what they perceive to be norms.

On Oct 18, 2013, at 7:21 PM, Brent Ryan brent.r...@gmail.com wrote:

So I think the issue might be related to the tech stack we're using, which is SOLR within DataStax Enterprise, which doesn't support atomic updates. But I think it must have some sort of bug around this because it doesn't appear to work correctly for this use case when using SolrJ... Anyways, I've contacted support so let's see what they say.

On Fri, Oct 18, 2013 at 5:51 PM, Shawn Heisey s...@elyograg.org wrote:

On 10/18/2013 3:36 PM, Brent Ryan wrote:

My schema is pretty simple and has a string field called solr_id as my unique key. Once I get back to my computer I'll send some more details.

If you are trying to use a Map object as the value of a field, that is probably why it is interpreting your add request as an atomic update. If this is the case, and you're doing it because you have a multivalued field, you can use a List object rather than a Map.

If this doesn't sound like what's going on, can you share your code, or a simplification of the SolrJ parts of it?

Thanks,
Shawn
Re: field title_ngram was indexed without position data; cannot run PhraseQuery
If you consider what n-grams do, this should make sense to you. Consider the following piece of data:

  White iPod

If the field is fed through a bigram filter (n-gram with size of 2) the resulting token stream would appear as such:

  wh hi it te ip po od

The usual use of n-grams is to match those partial tokens, essentially giving you a great deal of power in creating non-wildcard partial matches. How you use this is up to your imagination, but one easy use is in partial matches for autosuggest features.

I can't speak for the intent behind the way it's coded, but it makes a great deal of sense to me that positional data would be seen as unnecessary, since the intent of n-grams typically doesn't collide with phrase searches. If you need both behaviors it's far better to use copyField and have one field dedicated to standard tokenization and token filters, and another field for n-grams.

I hope that's useful to you.

On Oct 15, 2013, at 6:14 AM, MC videm...@gmail.com wrote:

Hello,

Could someone explain (or perhaps provide a documentation link) what the following error means:

  field title_ngram was indexed without position data; cannot run PhraseQuery

I'll do some more searching online; I was just wondering if anyone has encountered this error before, and what the possible solution might be. I've recently upgraded my version of Solr from 3.6.0 to 4.5.0, I'm not sure if this has any bearing or not.

Thanks,

M
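A sketch of the copyField arrangement described above (type and field names are illustrative): phrase queries would go against title, partial-match queries against title_ngram.

  <fieldType name="text_ngram" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="2"/>
    </analyzer>
  </fieldType>

  <field name="title" type="text_general" indexed="true" stored="true"/>
  <field name="title_ngram" type="text_ngram" indexed="true" stored="false"/>
  <copyField source="title" dest="title_ngram"/>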
Re: Concurent indexing
The limitations on how many threads you can use to load data are primarily driven by factors on your hardware: CPU, heap usage, I/O, and the like. It is common for most index load processes to be able to handle more incoming data on the Solr side of the equation than can typically be loaded from the source repository. You'll have to explore a bit to find the limits, but if your hardware is sufficient you can likely load a great deal.

As for commits, they will indeed commit anything added to Solr regardless of the thread of the update. Keep this in mind if you have a rollback concept in mind, or if you're measuring your incremental load to restart in case of error/failure. Presuming you want more control, and if you are multi-threading index updates, it may be useful to have a delegate handle the commit process…or on a large data load, consider a commit at the end.

On Oct 14, 2013, at 6:44 AM, maephisto my_sky...@yahoo.com wrote:

Hi,

I have a collection (numShards=3, replicationFactor=2) split on 2 machines. Since the amount of data I have to index is huge, I would like to start multiple instances of the same process that would index data to Solr. Is there any limitation or counter-indication in this area?

The indexing client is custom built by me and parses files (each instance parses a different file), and the uniqueId is auto-generated.

Would a commit in a process also commit the uncommitted changes created by another process?
Re: Update existing documents when using ExtractingRequestHandler?
As an endorsement of the option Erick likes, the primary benefit I see to processing through your own code is better error-, exception-, and logging-handling, which is trivial for you to write.

Consider that your code could reside on any server, either receiving the data through a PUSH or PULLing it from your web server (as suits your needs), and thus offloads the effort from your busy web server. In the long run, this will be a more flexible, adaptable solution that meets future needs with minimal effort. Further, it typically doesn't require a Solr expert to write, so you can find plenty of people to help on this as future needs dictate.

On Oct 10, 2013, at 4:21 AM, Erick Erickson erickerick...@gmail.com wrote:

1 - puts the work on the Solr server though.
2 - This is just a SolrJ program, could be run anywhere. See: http://searchhub.org/dev/2012/02/14/indexing-with-solrj/ It would give you the most flexibility to offload the Tika processing to N other machines.
3 - This could work, but you'd then be indexing every document twice as well as loading the server with the Tika work. And you'd have to store all the fields.

Personally I like 2...

FWIW,
Erick

On Wed, Oct 9, 2013 at 11:50 AM, Jeroen Steggink jer...@stegg-inc.com wrote:

Hi,

In a content management system I have a document and an attachment. The document contains the meta data and the attachment the actual data. I would like to combine data of both in one Solr document.

I have thought of several options:

1. Using ExtractingRequestHandler I would extract the data (extractOnly) and combine it with the meta data and send it to Solr. But this might be inefficient and increase the network traffic.
2. Separate Tika installation and use that to extract and send the data to Solr. This would stress an already busy web server.
3. First upload the file using ExtractingRequestHandler, then use atomic updates to add the other fields.

Or is there another way? First add the meta data and later use the ExtractingRequestHandler to add the file contents?

Cheers,

Jeroen

--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
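Along the lines of option 2, a rough sketch of a standalone SolrJ-plus-Tika indexer (the URL, file, and field names are hypothetical; SolrJ/Tika APIs of the 4.x era assumed):

  import java.io.File;
  import java.io.FileInputStream;
  import java.io.InputStream;
  import org.apache.solr.client.solrj.impl.HttpSolrServer;
  import org.apache.solr.common.SolrInputDocument;
  import org.apache.tika.metadata.Metadata;
  import org.apache.tika.parser.AutoDetectParser;
  import org.apache.tika.parser.ParseContext;
  import org.apache.tika.sax.BodyContentHandler;

  public class ExtractAndIndex {
      public static void main(String[] args) throws Exception {
          HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");

          // Run Tika locally, away from the busy web server.
          AutoDetectParser parser = new AutoDetectParser();
          BodyContentHandler text = new BodyContentHandler(-1); // -1 = no write limit
          Metadata metadata = new Metadata();
          InputStream in = new FileInputStream(new File("attachment.pdf"));
          try {
              parser.parse(in, text, metadata, new ParseContext());
          } finally {
              in.close();
          }

          // Combine CMS metadata and extracted attachment text in one document.
          SolrInputDocument doc = new SolrInputDocument();
          doc.addField("id", "doc-1");
          doc.addField("title", "title from the CMS record");
          doc.addField("content", text.toString());
          server.add(doc);
          server.commit();
      }
  }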
Re: Field with default value and stored=false, will be reset back to the default value in case of updating other fields
The best use case I see for atomic updates typically involves avoiding transmission of large documents for small field updates. If you are updating a readCount field of a PDF document that is 1MB in size, you avoid resending the 1MB PDF document's data in order to increment the readCount field.

If, instead, we're talking about 5K database records then there's plenty of argument to be made that the whole document should just be retransmitted, and thus avoid the (potentially) unnecessary cost of storing all fields. As in everything, we face compromises…the question is which one better suits your needs.

On Oct 10, 2013, at 5:07 AM, Erick Erickson erickerick...@gmail.com wrote:

bq: so what is the point of having atomic updates if i need to update everything?

_nobody_ claims this is ideal; it does solve a certain use-case. We'd all like true partial updates that didn't require stored fields. The use-case here is that you don't have access to the system-of-record so you don't have a choice.

See the JIRA about stacked segments for update-without-storing-fields work.

Best,
Erick

On Thu, Oct 10, 2013 at 12:09 AM, Shawn Heisey elyog...@elyograg.org wrote:

On 10/9/2013 8:39 PM, deniz wrote:

Billnbell wrote
You have to update the whole record including all fields...

so what is the point of having atomic updates if i need to update everything?

If you have any regular fields that are not stored, atomic updates will not work -- unstored field data will be lost. If you have copyField destination fields that *are* stored, atomic updates will not work as expected with those fields. The wiki spells out the requirements:

http://wiki.apache.org/solr/Atomic_Updates#Caveats_and_Limitations

An atomic update is just a shortcut for "read all existing fields from the original document, apply the atomic updates, and re-insert the document, overwriting the original."

Thanks,
Shawn
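For concreteness, the readCount increment described above would look like this as a JSON atomic update (the id and field name are illustrative); Solr carries over all other stored fields:

  curl 'http://localhost:8983/solr/update?commit=true' -H 'Content-type:application/json' -d '
  [{"id": "pdf-42", "readCount": {"inc": 1}}]'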
Re: Solr auto suggestion not working
Very specifically, what is the field definition that is being used for the suggestions?

On Oct 10, 2013, at 5:49 AM, Furkan KAMACI furkankam...@gmail.com wrote:

What is your configuration for auto suggestion?

2013/10/10 ar...@skillnetinc.com:

Hi,

We are encountering an issue in the Solr search auto-suggestion feature. Here is the problem statement with an example:

We have a product named 'Apple iphone 5s - 16 GB'. Now when we type 'Apple' or 'iphone' in the search box, this product name comes up in the suggestion list. But when we type 'iphone 5s', no result comes in the suggestion list. Even when we type only '5s', no result comes.

Please help us in resolving this issue; it is occurring on the production environment and impacting the client's business.

Regards,
Arun
Re: How to achieve distributed spelling check in SolrCloud ?
The shards.qt parameter is the easiest one to forget, with the most dramatic of consequences!

On Oct 8, 2013, at 11:10 AM, shamik sham...@gmail.com wrote:

James,

Thanks for your reply. The shards.qt did the trick. I read the documentation earlier but was not clear on the implementation; now it totally makes sense.

Appreciate your help.

Regards,
Shamik
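For anyone landing here later: shards.qt tells the shard sub-requests which handler to hit, so a distributed spellcheck against a /spell handler looks roughly like this (hosts and handler name are illustrative):

  /spell?q=helo world&spellcheck=true&shards=host1:8983/solr,host2:8983/solr&shards.qt=/spell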
Re: Delete a field - Atomic updates (SOLR 4.1.0) without using null=true
I don't know if there's a way to accomplish your goal directly, but as a pure workaround, you can write a routine to fetch all the stored values and resubmit the document without the field in question. This is what atomic updates do, minus the overhead of the transmission.

On Oct 7, 2013, at 11:15 AM, SolrLover bbar...@gmail.com wrote:

I am using SOLR 4.1.0 and perform atomic updates on SOLR documents. Unfortunately there is a bug in 4.1.0 (https://issues.apache.org/jira/browse/SOLR-4297) that blocks me from using null=true for deleting a field through the atomic update functionality.

Is there any other way to delete a field other than using this syntax?

FYI, I won't be able to migrate to the latest version now due to a company code freeze, hence trying to figure out a temporary workaround.
Re: Adding OR operator in querystring and grouping fields?
fq=here:there OR this:that

For the lurker: an AND should be:

  fq=here:there&fq=this:that

While you can, technically, pass:

  fq=here:there AND this:that

Solr will cache the separate fq= parameters and reuse them in any context. The AND(ed) filter will be cached as a single entry and only used when the same AND construct is sent. Perhaps useful, not as generally desirable.

On Oct 7, 2013, at 2:10 PM, Jack Krupansky j...@basetechnology.com wrote:

Combine the two filter queries with an explicit OR operator.

-- Jack Krupansky

-Original Message-
From: PeterKerk
Sent: Monday, October 07, 2013 1:50 PM
To: solr-user@lucene.apache.org
Subject: Re: Adding OR operator in querystring and grouping fields?

Ok thanks.

"you must combine them into one filter query parameter", how would I do that? Can I simply change the URL structure or must I change my schema.xml and/or data-config.xml?
Re: Some text not indexed in solr4.4
Utkarsh,

Check to see if the value is actually indexed into the field by using the terms request handler:

  http://localhost:8983/solr/terms?terms.fl=text&terms.prefix=d

(adjust the prefix to whatever you're looking for)

This should get you going in the right direction.

Jason

On Sep 17, 2013, at 2:20 PM, Utkarsh Sengar utkarsh2...@gmail.com wrote:

I have a copyField called allText with type text_general: https://gist.github.com/utkarsh2012/6167128#file-schema-xml-L68

I have ~100 documents which have the text: dyson and dc44 or dc41 etc. For example:

title: Dyson DC44 Animal Digital Slim Cordless Vacuum
description: The DC44 Animal is the new Dyson Digital Slim vacuum cleaner, the cordless machine that doesn’t lose suction. It has been engineered for floor to ceiling cleaning. DC44 Animal has a detachable long-reach wand which is balanced for floor to ceiling cleaning. The motorized floor tool has twice the power of the DC35 floor tool to drive the bristles deeper into the carpet pile with more force. It attaches to the wand or directly to the machine for cleaning awkward spaces. The brush bar has carbon fiber filaments for removing fine dust from hard floors. DC44 Animal has a run time of 20 minutes or 8 minutes on Boost mode. Powered by the Dyson digital motor, DC44 Animal has a fade-free nickel manganese cobalt battery and Root Cyclone technology for constant powerful suction.
UPC: 0879957006362

The documents are indexed. Analysis says it's indexed: http://i.imgur.com/O52ino1.png

But when I search for allText:dyson dc44, I get no results. Response: http://pastie.org/8334220

Any suggestions about the problem? I am out of ideas about how to debug this.

--
Thanks,
-Utkarsh
Re: JSON update request handler commitWithin
They have modified the mechanisms for committing documents…Solr in DSE is not stock Solr...so you are likely encountering a boundary where stock Solr behavior is not fully supported. I would definitely reach out to them to find out if they support the request.

On Sep 5, 2013, at 8:27 AM, Ryan, Brent br...@cvent.com wrote:

Ya, looks like this is a bug in DataStax Enterprise 3.1.2. I'm using their enterprise cluster search product which is built on SOLR 4. :(

On 9/5/13 11:24 AM, Jack Krupansky j...@basetechnology.com wrote:

I just tried commitWithin with the standard Solr example in Solr 4.4 and it works fine. Can you reproduce your problem using the standard Solr example in Solr 4.4?

-- Jack Krupansky

From: Ryan, Brent
Sent: Thursday, September 05, 2013 10:39 AM
To: solr-user@lucene.apache.org
Subject: JSON update request handler commitWithin

I'm prototyping a search product for us and I was trying to use the commitWithin parameter for posting updated JSON documents like so:

  curl -v 'http://localhost:8983/solr/proposal.solr/update/json?commitWithin=1' --data-binary @rfp.json -H 'Content-type:application/json'

However, the commit never seems to happen, as you can see below there are still 2 docsPending (even 1 hour later). Is there a trick to getting this to work with submitting to the json update request handler?
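Against stock Solr, commitWithin can also be set through SolrJ rather than as a URL parameter; a minimal sketch (the URL, id, and interval are illustrative):

  import org.apache.solr.client.solrj.impl.HttpSolrServer;
  import org.apache.solr.client.solrj.request.UpdateRequest;
  import org.apache.solr.common.SolrInputDocument;

  public class CommitWithinExample {
      public static void main(String[] args) throws Exception {
          HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");
          SolrInputDocument doc = new SolrInputDocument();
          doc.addField("id", "rfp-1");

          UpdateRequest req = new UpdateRequest();
          req.add(doc);
          req.setCommitWithin(10000); // ask Solr to commit within 10 seconds
          req.process(server);
      }
  }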
Re: data/index naming format
The circumstance in which I've most typically seen the index.timestamp directory show up is when an update is sent to a slave server. The replication then appears to preserve the updated slave index in a separate folder while still respecting the correct data from the master.

On Sep 5, 2013, at 8:03 PM, Shawn Heisey s...@elyograg.org wrote:

On 9/5/2013 6:48 PM, Aditya Sakhuja wrote:

I am running Solr 4.1 for now, and am confused about the structure and naming of the contents of the data dir. I do not see the index.properties being generated on a fresh Solr node start either.

Can someone clarify when one should expect to see data/index vs. data/index.timestamp, and the index.properties along with the second version?

I have never seen an index.properties file get created. I've used versions from 1.4.0 through 4.4.0. Generally when you have an index.timestamp directory, it's because you're doing replication. There may be other circumstances when it appears, but I do not know what those are.

As for the other files in the index directory, here's Lucene's file format documentation:

http://lucene.apache.org/core/4_4_0/core/org/apache/lucene/codecs/lucene42/package-summary.html#package_description

Thanks,
Shawn
Re: SolrCloud Set up
One additional thought here: from a paranoid risk-management perspective it's not a good idea to have two critical services dependent upon a single point of failure if the hardware fails. Obviously risk management is suited to taste, so you may feel the cost/benefit does not merit the separation. But it's good to make that decision consciously…you'd hate to have to justify a failure here after-the-fact as something overlooked :)

On Aug 30, 2013, at 9:40 AM, Shawn Heisey s...@elyograg.org wrote:

On 8/30/2013 9:43 AM, Jared Griffith wrote:

One last thing. Is there any real benefit in running SolrCloud and Zookeeper separately? I am seeing some funkiness with the separation of the two, funkiness I wasn't seeing when running SolrCloud + Zookeeper together as outlined in the Wiki.

For a robust install, you want zookeeper to be a separate process. It can run on the same server as Solr, but the embedded zookeeper (-DzkRun) should not be used except for dev and proof-of-concept work.

The reason is simple. Zookeeper is the central coordinator for SolrCloud. In order for it to remain stable, it should not be restarted without good reason. If you are running zookeeper as part of Solr, then you will be affecting zookeeper operation anytime you restart that instance of Solr.

Making changes to your Solr setup often requires that you restart Solr. This includes upgrading Solr and changing some aspects of its configuration. Some configuration aspects can be changed with just a collection reload, but others require a full application restart.

Thanks,
Shawn
Re: Indexing hangs when more than 1 server in a cluster
Kevin, I wouldn't have considered using softCommits at all based on what I understand from your use case. You appear to be loading in large batches, and softCommits are better aligned to NRT search where there is a steady stream of smaller updates that need to be available immediately. As Erick pointed out, soft commits are all about avoiding constant reopening of the index searcher…where by constant we mean every few seconds. Provided you can wait until your batch is completed, and that frequency is roughly a minute or more, you likely will find an old-fashioned hard commit (with openSearcher=true) will work just fine (YMMV). Jason On Aug 14, 2013, at 4:51 AM, Erick Erickson erickerick...@gmail.com wrote: right, SOLR-5081 is possible but somewhat unlikely given the fact that you actually don't have very many nodes in your cluster. soft commits aren't relevant to the tlog, but here's the thing. Your tlogs may get replayed when you restart solr. If they're large, this may take a long time. When you said you restarted Solr after killing it, you might have triggered this. The way to keep tlogs small is to hard commit more frequently (you should look at their size before worrying about it though!). If you set openSearcher=false, this is pretty inexpensive, all it really does is close the current segment files, open new ones, and start a new tlog file. It does _not_ invalidate caches, do autowarming, all that expensive stuff. Your soft commit does _not_ improve performance! It is just less expensive than a hard commit with openSearcher=true. It _does_ invalidate caches, fire off autowarming, etc. So it does improve performance over doing hard commits with openSearcher=true with the same frequency, but it still isn't free. It's still good to have the soft commit interval as long as you can tolerate. It's perfectly reasonable to have a hard commit interval that's much shorter than your soft commit interval. As Yonik explained once, soft commits are about visibility but hard commits are about durability. Best Erick On Wed, Aug 14, 2013 at 2:20 AM, Kevin Osborn kevin.osb...@cbsi.com wrote: Interesting, that did work. Do you or anyone else have any ideas or what I should look at? While soft commit is not a requirement in my project, my understanding is that it should help performance. On the same index, I will be doing both a large number of queries as well as updates. If I have to disable autoCommit, should I increase the chunk size? Of course, I will have to run a more large scale test tomorrow, but I saw this problem fairly consistently in my smaller test. In a previous experiment, I applied the SOLR-4816 patch that someone indicated might help. I also reduced the CSV upload chunk size to 500. It seemed like things got a little better, but still eventually hung. I also see SOLR-5081, but I don't know if that is my issue or not. At least in my test, the index writes are not parallel as in the ticket. -Kevin On Tue, Aug 13, 2013 at 8:40 PM, Jason Hellman jhell...@innoventsolutions.com wrote: While I don't have a past history of this issue to use as reference, if I were in your shoes I would consider trying your updates with softCommit disabled. My suspicion is you're experiencing some issue with the transaction logging and how it's managed when your hard commit occurs. If you can give that a try and let us know how that fares we might have some further input to share. On Aug 13, 2013, at 11:54 AM, Kevin Osborn kevin.osb...@cbsi.com wrote: I am using Solr Cloud 4.4. 
It is pretty much a base configuration. We have 2 servers and 3 collections. Collection1 is 1 shard and Collection2 and Collection3 both have 2 shards. Both servers are identical. So, here is my process: I do a lot of queries on Collection1 and Collection2. I then do a bunch of inserts into Collection3. I am doing CSV uploads. I am also doing custom shard routing. All the products in a single upload will have the same shard key. All Solr interaction is through SolrJ with full Zookeeper awareness. My uploads are also using soft commits. I tried this on a record set of 936 products. Everything worked fine. I then sent over a record set of 300k products. The upload into Collection3 is chunked. I tried both 1000 and 200,000 with similar results. The first upload to Solr would just hang. There would simply be no response from Solr. A few of the products from this request would make it into the index, but not many. In this state, queries continued to work, but deletes did not. My only solution was to kill each Solr process. As an experiment, I did the large catalog first. First, I reset everything. With a chunk size of 1000, about 110,000 out of 300,000 records made it into Solr before the process hung. Again, queries worked, but deletes did not and I had to kill Solr. It hung after about 30 seconds.
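For reference, a sketch of the commit arrangement suggested above (intervals illustrative; tune against your own batch cadence):

<autoCommit>
  <maxTime>15000</maxTime>            <!-- truncate tlogs and flush segments regularly -->
  <openSearcher>false</openSearcher>  <!-- cheap: no cache invalidation or autowarming -->
</autoCommit>

…which keeps transaction logs small without the expensive searcher work, followed by an explicit hard commit (solr/update?commit=true) once the batch completes to make it all visible.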
Re: Facet field display name
It's been my experience that using the convenient feature to change the output key still doesn't save you from having to map it back to the underlying field name in order to trigger the filter query. With that in mind, it just makes more sense to me to leave the effort in the View portion of the design.

On Aug 12, 2013, at 6:34 AM, Peter Sturge peter.stu...@gmail.com wrote: 2c worth: We do lots of facet lookups to allow 'prettyprint' versions of facet names. We do this on the client-side, though. The reason is that the lookups can then be different for different locations/users etc. - makes it easy for localization. It's also very easy to implement such a lookup, without having to disturb the innards of Solr...

On Mon, Aug 12, 2013 at 2:25 PM, Erick Erickson erickerick...@gmail.com wrote: Have you seen the key parameter here: http://wiki.apache.org/solr/SimpleFacetParameters#key_:_Changing_the_output_key It allows you to label the output key anything you want, and since these are field names, this seems do-able. Best, Erick

On Mon, Aug 12, 2013 at 4:02 AM, Aleksander Akerø aleksan...@gurusoft.no wrote: Hi I wondered if there was some way to configure a display name for facet fields. Either that or some way to display nordic letters without it messing up the faceting. Say I wanted a facet field called område (Norwegian; area in English). Then I would have to create the field something like this in schema.xml: <field name="omrade" type="string" indexed="true" stored="true" required="false"/> But then I would have to do a replace to show a prettier name in the frontend. It would be preferred not to do this sort of hardcoding, as I would have to do this for all the facet fields. Either that or I could try encoding the 'å' like this: <field name="omr&#229;de" type="string" indexed="true" stored="true" required="false"/> Then it will show up with a pretty name, but the faceting will fail. Maybe this is due to encoding issues, seen as the frontend is encoded with ISO-8859-1? So does anyone have a good practice for getting this sort of problem working properly, or a way to define an alternative display name for a facet field that I could display instead of the field.name? *Aleksander Akerø* Systemkonsulent Mobil: 944 89 054 E-post: aleksan...@gurusoft.no *Gurusoft AS* Telefon: 92 44 09 99 Østre Kullerød www.gurusoft.no
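As a concrete illustration of the mapping problem (borrowing the omrade field from the thread above; facet value illustrative):

facet.field={!key=Område}omrade

will label the facet Område in the response, but the UI must still map that label back to the real field when it builds the filter link:

fq=omrade:somevalue

…which is why the View layer ends up owning the key-to-field mapping either way.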
Re: Indexing hangs when more than 1 server in a cluster
While I don't have a past history of this issue to use as reference, if I were in your shoes I would consider trying your updates with softCommit disabled. My suspicion is you're experiencing some issue with the transaction logging and how it's managed when your hard commit occurs. If you can give that a try and let us know how that fares we might have some further input to share.

On Aug 13, 2013, at 11:54 AM, Kevin Osborn kevin.osb...@cbsi.com wrote: I am using Solr Cloud 4.4. It is pretty much a base configuration. We have 2 servers and 3 collections. Collection1 is 1 shard and Collection2 and Collection3 both have 2 shards. Both servers are identical. So, here is my process: I do a lot of queries on Collection1 and Collection2. I then do a bunch of inserts into Collection3. I am doing CSV uploads. I am also doing custom shard routing. All the products in a single upload will have the same shard key. All Solr interaction is through SolrJ with full Zookeeper awareness. My uploads are also using soft commits. I tried this on a record set of 936 products. Everything worked fine. I then sent over a record set of 300k products. The upload into Collection3 is chunked. I tried both 1000 and 200,000 with similar results. The first upload to Solr would just hang. There would simply be no response from Solr. A few of the products from this request would make it into the index, but not many. In this state, queries continued to work, but deletes did not. My only solution was to kill each Solr process. As an experiment, I did the large catalog first. First, I reset everything. With a chunk size of 1000, about 110,000 out of 300,000 records made it into Solr before the process hung. Again, queries worked, but deletes did not and I had to kill Solr. It hung after about 30 seconds. Timing-wise, this is at about the second autocommit cycle, given the default autocommit of 15 seconds. I am not sure if this is related or not. As an additional experiment, I ran the entire test with just a single node in the cluster. This time, everything ran fine. Does anyone have any ideas? Everything is pretty default. These servers are Azure VMs, although I have seen similar behavior running two Solr instances on a single internal server as well. I had also noticed similar behavior before with Solr 4.3. It definitely has something to do with the clustering, but I am not sure what. And I don't see any error message (or really anything else) in the Solr logs. Thanks. -- *KEVIN OSBORN* LEAD SOFTWARE ENGINEER CNET Content Solutions OFFICE 949.399.8714 CELL 949.310.4677 SKYPE osbornk 5 Park Plaza, Suite 600, Irvine, CA 92614
Re: Spelling suggestions.
The majority of the behavior outlined in that wiki page should work just fine with 3.5.0. Note that only a few items there are marked as Solr 4.0 only (DirectSolrSpellChecker and WordBreakSolrSpellChecker, for example).

On Aug 9, 2013, at 6:26 AM, Kamaljeet Kaur kamal.kaur...@gmail.com wrote: Hello, I have just configured apache-solr with my django project, and it's working fine with very simple and basic searching. I want to add spelling suggestions if the user misspells any word in the string entered. In this particular mailing-list, I searched for it. Many have given the link: http://wiki.apache.org/solr/SpellCheckComponent#head-78f5afcf43df544832809abc68dd36b98152670c But I am using version 3.5.0, and that page is for version 1.3. Should I follow this tutorial, or is there one available for solr version 3.5.0? Thanks Kamaljeet Kaur -- View this message in context: http://lucene.472066.n3.nabble.com/Spelling-suggestions-tp4083519.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Phrase query with prefix query
Or shingles, presuming you want to tokenize and output unigrams. On Aug 2, 2013, at 11:33 AM, Walter Underwood wun...@wunderwood.org wrote: Search against a field using edge N-grams. --wunder On Aug 2, 2013, at 11:16 AM, T. Kuro Kurosaka wrote: Is there a query parser that supports a phrase query with prefix query at the end, such as San Fran* ? -- - T. Kuro Kurosaka • Senior Software Engineer
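A rough sketch of the edge n-gram approach mentioned here (type name, field wiring, and gram sizes all illustrative):

<fieldType name="text_prefix" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

With the city name indexed into such a field, a plain query for san fran matches San Francisco with no wildcard at all, because every prefix of the indexed value was emitted as a term.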
Re: restricting a query by a set of field values
Ben, This could be constructed as so:

fl=date_deposited&fq=date:[2013-07-01T00:00:00Z TO 2013-07-31T23:59:00Z]&fq=collection_id:(1 2 n)&q.op=OR

The parentheses around the 1 2 n set indicate a boolean query, and we're ensuring they are OR'd together via the q.op parameter. This should get you the result set you desire. Be aware that a very large boolean set (your IN(…) parameter) may be expensive to run. Jason

On Jul 29, 2013, at 7:33 AM, Benjamin Ryan benjamin.r...@manchester.ac.uk wrote: Hi, Is it possible to construct a query in SOLR to perform a query that is restricted to only those documents that have a field value in a particular set of values, similar to what would be done in Postgres with the SQL query: SELECT date_deposited FROM stats WHERE date BETWEEN '2013-07-01 00:00:00' AND '2013-07-31 23:59:00' AND collection_id IN () In my SOLR schema.xml date_deposited is a TrieDateField and collection_id is an IntField Regards, Ben -- Dr Ben Ryan Jorum Technical Manager 5.12 Roscoe Building The University of Manchester Oxford Road Manchester M13 9PL Tel: 0160 275 6039 E-mail: benjamin.r...@manchester.ac.uk https://outlook.manchester.ac.uk/owa/redir.aspx?C=b28b5bdd1a91425abf8e32748c93f487&URL=mailto%3abenjamin.ryan%40manchester.ac.uk --
Re: Solr 4.3.1 - query does not return documents, just numFounds, 2 shards, replication Factor 1
Nitin, You need to ensure the fields you wish to see are marked stored=true in your schema.xml file, and you should include fields in your fl= parameter (fl=*,score is a good place to start). Jason On Jul 29, 2013, at 8:08 AM, Nitin Agarwal 2nitinagar...@gmail.com wrote: Hi, I am using Solr 4.3.1 with 2 Shards and replication factor of 1, running on apache tomcat 7.0.42 with external zookeeper 3.4.5. When I query select?q=*:* I only get the number of documents found, but no actual document. When I query with rows=0, I do get correct count of documents in the index. Faceting queries as well as group by queries also work with rows=0. However, when rows is not equal to 0 I do not get any documents. When I query the index I see that a query is being sent to both shards, and subsequently I see a query being sent with just ids, however, after that query returns I do not see any documents back. Not sure what do I need to change, please help. Thanks, Nitin
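For reference, the two halves of that advice look like this (field name illustrative):

<field name="title" type="text_general" indexed="true" stored="true"/>

…in schema.xml, and fl=*,score on the select request.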
Re: solr - set fileds as default search field
Or use the copyField technique to a single searchable field and set df= to that field. The example schema does this with the field called text.

On Jul 29, 2013, at 8:35 AM, Ahmet Arslan iori...@yahoo.com wrote: Hi, df is a single valued parameter. Only one field can be a default field. To query multiple fields use the (e)dismax query parser: http://wiki.apache.org/solr/ExtendedDisMax#qf_.28Query_Fields.29

From: Mysurf Mail stammail...@gmail.com To: solr-user@lucene.apache.org Sent: Monday, July 29, 2013 6:31 PM Subject: solr - set fileds as default search field The following query works well for me http://[]:8983/solr/vault/select?q=VersionComments%3AWhite returns all the documents where version comments includes White. I try to omit the field name and put it as a default value as follows: In solr config I write:

<requestHandler name="/select" class="solr.SearchHandler">
  <!-- default values for query parameters can be specified, these will be overridden by parameters in the request -->
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <int name="rows">10</int>
    <str name="df">PackageName</str>
    <str name="df">Tag</str>
    <str name="df">VersionComments</str>
    <str name="df">VersionTag</str>
    <str name="df">Description</str>
    <str name="df">SKU</str>
    <str name="df">SKUDesc</str>
  </lst>
</requestHandler>

I restart the solr and create a full import. Then I try using http://[]:8983/solr/vault/select?q=White (where http://[]:8983/solr/vault/select?q=VersionComments%3AWhite still works). But I don't get any document as an answer. What am I doing wrong?
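A sketch of the copyField approach (assuming the catch-all field is named text, as in the example schema):

<copyField source="PackageName" dest="text"/>
<copyField source="Tag" dest="text"/>
<copyField source="VersionComments" dest="text"/>
<!-- ...one line per searchable source field... -->

…and then a single <str name="df">text</str> in the handler defaults.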
Re: solr 4.3, autocommit, maxdocs
Jonathan, Please note the openSearcher=false part of your configuration. This is why you don't see documents. The commits are occurring, and being written to segments on disk, but they are not visible to the search engine because a new Solr searcher has not been opened over them. You can either change the value to true, or alternatively issue an explicit commit at the end of your load (a solr/update?commit=true will default to openSearcher=true). Hope that's of use! Jason

On Jul 15, 2013, at 9:52 AM, Jonathan Rochkind rochk...@jhu.edu wrote: I have a solr 4.3 instance I am in the process of standing up. It started out with an empty index. I have in its solrconfig.xml:

<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxDocs>100000</maxDocs>
    <openSearcher>false</openSearcher>
  </autoCommit>
</updateHandler>

I have an index process running, that has currently added around 400k documents to Solr. I had expected that a 'commit' would be run every 100k documents, from the above configuration, so 4 commits would have been run by now, and I'd see documents in the index. However, when I look in the Solr admin interface, at my core's 'overview' page, it still says num docs 0, segment count 0, when I expected num docs 400k at this point. Is there something I'm misunderstanding about the configuration or the admin interface? Or am I right in my expectations, but something else must be going wrong? Thanks for any advice, Jonathan
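For reference, the explicit commit at the end of a load is a one-liner (core URL illustrative):

curl 'http://localhost:8983/solr/update?commit=true'

or, equivalently, as an update message: curl http://localhost:8983/solr/update -H 'Content-type:text/xml' --data-binary '<commit/>'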
Re: Using the Schema API from SolrJ
Steven, Some information can be gleaned from the system admin request handler: http://localhost:8983/solr/admin/system I am specifically looking at this: lst name=corestr name=schemaexample/str Mind you, that is a manually-set value in the schema file. But just in case you want to get crazy you can also call the file admin request handler: http://localhost:8983/solr/admin/file?file=schema.xml …and parse the whole stinking thing :) Jason On Jul 6, 2013, at 1:59 PM, Steven Glass steven.gl...@zekira.com wrote: Does anyone have any idea how I can access the schema version info using SolrJ? Thanks. On Jul 3, 2013, at 4:16 PM, Steven Glass wrote: I'm using a Solr 4.3 server and accessing it from both a Java based desktop application using SolrJ and an Android based mobile application using my home-grown REST adaptor. I'm trying to make sure that versions of the application are synchronized with updates to the server (too often testers forget to update an app when the server changes). I want to read the schema version from the server and make sure it is the expected value. This was very easy to do using my home-grown REST adaptor. The wiki examples at http://wiki.apache.org/solr/SchemaRESTAPI were sufficient. Unfortunately, I cannot figure out how to do the equivalent with SolrJ. I suspect that there is a really simple approach but I'm just missing it. Thanks in advance for any guidance you can offer. Best regards, Steven Glass
Re: Surprising score?
Also consider using the SweetSpotSimilarityFactory class, which allows you to still engage normalization but control how intrusive it is. This, combined with the ability to set a custom Similarity class on a per-fieldType basis, may be extremely useful. More info: http://lucene.apache.org/solr/4_3_1/solr-core/org/apache/solr/search/similarities/SweetSpotSimilarityFactory.html Jason

On Jul 5, 2013, at 5:59 AM, pravesh suyalprav...@yahoo.com wrote: Is there a way to omitNorms and still be able to use {!boost b=boost}? OR you could leave omitNorms=false as usual and have your custom Similarity implementation with the length normalization method overridden to use a constant value of 1. Regards Pravesh -- View this message in context: http://lucene.472066.n3.nabble.com/Surprising-score-tp4075436p4075722.html Sent from the Solr - User mailing list archive at Nabble.com.
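A sketch of what that per-fieldType wiring might look like (parameter values purely illustrative; tune them to your own document length distribution):

<fieldType name="text_sweet" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <similarity class="solr.SweetSpotSimilarityFactory">
    <!-- documents between min and max terms take no length-norm penalty -->
    <int name="lengthNormMin">1</int>
    <int name="lengthNormMax">50</int>
    <float name="lengthNormSteepness">0.5</float>
  </similarity>
</fieldType>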
Re: 2.1billion+ document
Saqib: At the simplest level:
1) Source the machine
2) Install Java
3) Install a servlet container of your choice
4) Copy your Solr WAR and conf directories as desired (probably a rough mirror of your current single server)
5) Start it up and start sending data there
6) Query both by simply adding: shards=host1/solr/collection,host2/solr/collection (a fuller example follows below)
7) Profit

Or, in shorthand:
1) Install a new Solr instance and start indexing data there
2) Add the shards parameter to your queries with both (or more) servers
3) …
4) Profit

Now…we usually want to be concerned about how to manage the data so that we don't send duplicates. Without SolrCloud it is our responsibility to delegate traffic for updates and deletes. We also like to think a bit more about how to take advantage of our lovely parallelism to improve indexing or query throughput. We should also consider strategies to isolate domain data to single shards, so as to allow isolated queries against dedicated data models. But if you just want the basics, it really is as easy as described above. Jason

On Jul 5, 2013, at 7:36 PM, Ali, Saqib docbook@gmail.com wrote: Hello Otis, I was thinking more in terms of Solr DistributedSearch rather than SolrCloud. I was hoping to add another Solr instance when the time comes. This is a low use application, but with a lot of data. Uptime and query speed are not of importance. However we would like to be able to index more than 2.1b documents when the time comes. Any advice will be highly appreciated. Thanks!!! :) Saqib

On Fri, Jul 5, 2013 at 6:23 PM, Otis Gospodnetic otis.gospodne...@gmail.com wrote: Hi, It's a broad question, but it starts with getting a few servers, putting Solr 4.3.1 on it (soon 4.4), setting up Zookeeper, creating a Solr Collection (index) with N shards and M replicas, and reindexing your old data to this new cluster, which you can expand with new nodes over time. If you have specific questions... Otis -- Solr ElasticSearch Support -- http://sematext.com/ Performance Monitoring -- http://sematext.com/spm

On Fri, Jul 5, 2013 at 8:42 PM, Ali, Saqib docbook@gmail.com wrote: Question regarding the 2.1 billion+ document. I understand that a single instance of solr has a limit of 2.1 billion documents. We currently have a single solr server. If we reach the 2.1 billion document limit, what is involved in moving to Solr DistributedSearch? Thanks! :)
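To make the shards step concrete (hostnames and ports illustrative), the fanned-out query looks like:

http://host1:8983/solr/collection1/select?q=*:*&shards=host1:8983/solr/collection1,host2:8983/solr/collection1

Each listed shard is queried in parallel and the results are merged by the server receiving the request.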
Re: how to replicate Solr Cloud
Kevin, I can imagine this working if you consider your second data center a pure slave relationship to your SolrCloud cluster. I haven't tried it, but I don't see why the solrconfig.xml can't identify the node as a master, allowing you to call any of your cores in the cluster to replicate out. That being said, this idea doesn't facilitate a SolrCloud cluster in the second data center…just a slave that could be a repeater. You say that sending the data in both directions is not ideal, but it works and is conceptually very simple. What is the reasoning behind wanting to get away from that approach? Jason

On Jun 25, 2013, at 10:07 AM, Kevin Osborn kevin.osb...@cbsi.com wrote: We are going to have two datacenters, each with their own SolrCloud and ZooKeeper quorums. The end result will be that they should be replicas of each other. One method that has been mentioned is that we should add documents to each cluster separately. For various reasons, this may not be ideal for us. Instead, we are playing around with the idea of always indexing to one datacenter. And then having that replicate to the other datacenter. And this is where I am having some trouble on how to proceed. The nice thing about SolrCloud is that there are no masters and slaves. Each node is equal, has the same configs, etc. But in this case, I want to have a node in one datacenter poll for changes in another datacenter. Before SolrCloud, I would have used slave/master replication. But in the SolrCloud world, I am not sure how to configure this setup. Or are there any better ideas on how to use replication to push or pull data from one datacenter to another? In my case, NRT is not a requirement. And I will also be dealing with about 3 collections and 5 or 6 shards. Thanks. -- *KEVIN OSBORN* LEAD SOFTWARE ENGINEER CNET Content Solutions OFFICE 949.399.8714 CELL 949.310.4677 SKYPE osbornk 5 Park Plaza, Suite 600, Irvine, CA 92614
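A sketch of the master half of that idea, untested as noted above (file list illustrative):

<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="replicateAfter">commit</str>
    <str name="confFiles">schema.xml,stopwords.txt</str>
  </lst>
</requestHandler>

…with the core in the second data center configured as a slave (or repeater) whose masterUrl points at the corresponding core in the cluster.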
Re: [solr cloud] solr hangs when indexing large number of documents from multiple threads
Vinay, What autoCommit settings do you have for your indexing process? Jason On Jun 24, 2013, at 1:28 PM, Vinay Pothnis poth...@gmail.com wrote: Here is the ulimit -a output: core file size (blocks, -c) 0 data seg size(kbytes, -d) unlimited scheduling priority (-e) 0 file size (blocks, -f) unlimited pending signals (-i) 179963 max locked memory(kbytes, -l) 64 max memory size (kbytes, -m) unlimited open files (-n) 32769 pipe size (512 bytes, -p) 8 POSIX message queues (bytes, -q) 819200 real-time priority (-r) 0 stack size (kbytes, -s) 10240 cpu time(seconds, -t) unlimited max user processes (-u) 14 virtual memory (kbytes, -v) unlimited file locks (-x) unlimited On Mon, Jun 24, 2013 at 12:47 PM, Yago Riveiro yago.rive...@gmail.comwrote: Hi, I have the same issue too, and the deploy is quasi exact like than mine, http://lucene.472066.n3.nabble.com/updating-docs-in-solr-cloud-hangs-td4067388.html#a4067862 With some concurrence and batches of 10 solr apparently have some deadlock distributing updates Can you dump the configuration of the ulimit on your servers?, some people had the same issues because they are reach the ulimit maximum defined for descriptor and process. -- Yago Riveiro Sent with Sparrow (http://www.sparrowmailapp.com/?sig) On Monday, June 24, 2013 at 7:49 PM, Vinay Pothnis wrote: Hello All, I have the following set up of solr cloud. * solr version 4.3.1 * 3 node solr cloud + replciation factor 2 * 3 zoo keepers * load balancer in front of the 3 solr nodes I am seeing this strange behavior when I am indexing a large number of documents (10 mil). When I have more than 3-5 threads sending documents (in batch of 20) to solr, sometimes solr goes into a hung state. After this all the update requests get timed out. What we see via AppDynamics (a performance monitoring tool) is that there are a number of threads that are stalled. The stack trace for one of the threads is shown below. The cluster has to be restarted to recover from this. When I reduce the concurrency to 1, 2, 3 threads, then the indexing goes through smoothly. Any pointers as to what could be wrong here? We send the updates to one of the nodes in the solr cloud through a load balancer. Thanks Vinay Thread Name:qtp2141131052-78 ID:78 Time:Fri Jun 21 23:20:22 GMT 2013 State:WAITING Priority:5 sun.misc.Unsafe.park(Native Method) java.util.concurrent.locks. 
LockSupport.park(LockSupport.java:186) java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834) java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:994) java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1303) java.util.concurrent.Semaphore.acquire(Semaphore.java:317) org.apache.solr.util.AdjustableSemaphore.acquire(AdjustableSemaphore.java:61) org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:418) org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:368) org.apache.solr.update.SolrCmdDistributor.flushAdds(SolrCmdDistributor.java:300) org.apache.solr.update.SolrCmdDistributor.finish(SolrCmdDistributor.java:96) org.apache.solr.update.processor.DistributedUpdateProcessor.doFinish(DistributedUpdateProcessor.java:462) org.apache.solr.update.processor.DistributedUpdateProcessor.finish(DistributedUpdateProcessor.java:1178) org.apache.solr.update.processor.LogUpdateProcessor.finish(LogUpdateProcessorFactory.java:179) org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:83) org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135) org.apache.solr.core.SolrCore.execute(SolrCore.java:1820) org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:656) org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:359) org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155) org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1423) org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:450) org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:138) org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:564) org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:213) org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1083) org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:379)
Re: [solr cloud] solr hangs when indexing large number of documents from multiple threads
Vinay, You may wish to pay attention to how many transaction logs are being created along the way to your hard autoCommit, which should truncate the open handles for those files. I might suggest setting a maxDocs value in parallel with your maxTime value (you can use both) to ensure the commit occurs at either breakpoint. 30 seconds is plenty of time for 5 parallel processes of 20 document submissions to push you over the edge. Jason On Jun 24, 2013, at 2:21 PM, Vinay Pothnis poth...@gmail.com wrote: I have 'softAutoCommit' at 1 second and 'hardAutoCommit' at 30 seconds. On Mon, Jun 24, 2013 at 1:54 PM, Jason Hellman jhell...@innoventsolutions.com wrote: Vinay, What autoCommit settings do you have for your indexing process? Jason On Jun 24, 2013, at 1:28 PM, Vinay Pothnis poth...@gmail.com wrote: Here is the ulimit -a output: core file size (blocks, -c) 0 data seg size (kbytes, -d) unlimited scheduling priority (-e) 0 file size (blocks, -f) unlimited pending signals (-i) 179963 max locked memory(kbytes, -l) 64 max memory size (kbytes, -m) unlimited open files (-n) 32769 pipe size (512 bytes, -p) 8 POSIX message queues (bytes, -q) 819200 real-time priority (-r) 0 stack size (kbytes, -s) 10240 cpu time(seconds, -t) unlimited max user processes (-u) 14 virtual memory (kbytes, -v) unlimited file locks (-x) unlimited On Mon, Jun 24, 2013 at 12:47 PM, Yago Riveiro yago.rive...@gmail.com wrote: Hi, I have the same issue too, and the deploy is quasi exact like than mine, http://lucene.472066.n3.nabble.com/updating-docs-in-solr-cloud-hangs-td4067388.html#a4067862 With some concurrence and batches of 10 solr apparently have some deadlock distributing updates Can you dump the configuration of the ulimit on your servers?, some people had the same issues because they are reach the ulimit maximum defined for descriptor and process. -- Yago Riveiro Sent with Sparrow (http://www.sparrowmailapp.com/?sig) On Monday, June 24, 2013 at 7:49 PM, Vinay Pothnis wrote: Hello All, I have the following set up of solr cloud. * solr version 4.3.1 * 3 node solr cloud + replciation factor 2 * 3 zoo keepers * load balancer in front of the 3 solr nodes I am seeing this strange behavior when I am indexing a large number of documents (10 mil). When I have more than 3-5 threads sending documents (in batch of 20) to solr, sometimes solr goes into a hung state. After this all the update requests get timed out. What we see via AppDynamics (a performance monitoring tool) is that there are a number of threads that are stalled. The stack trace for one of the threads is shown below. The cluster has to be restarted to recover from this. When I reduce the concurrency to 1, 2, 3 threads, then the indexing goes through smoothly. Any pointers as to what could be wrong here? We send the updates to one of the nodes in the solr cloud through a load balancer. Thanks Vinay Thread Name:qtp2141131052-78 ID:78 Time:Fri Jun 21 23:20:22 GMT 2013 State:WAITING Priority:5 sun.misc.Unsafe.park(Native Method) java.util.concurrent.locks. 
LockSupport.park(LockSupport.java:186) java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834) java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:994) java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1303) java.util.concurrent.Semaphore.acquire(Semaphore.java:317) org.apache.solr.util.AdjustableSemaphore.acquire(AdjustableSemaphore.java:61) org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:418) org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:368) org.apache.solr.update.SolrCmdDistributor.flushAdds(SolrCmdDistributor.java:300) org.apache.solr.update.SolrCmdDistributor.finish(SolrCmdDistributor.java:96) org.apache.solr.update.processor.DistributedUpdateProcessor.doFinish(DistributedUpdateProcessor.java:462) org.apache.solr.update.processor.DistributedUpdateProcessor.finish(DistributedUpdateProcessor.java:1178) org.apache.solr.update.processor.LogUpdateProcessor.finish(LogUpdateProcessorFactory.java:179) org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:83) org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135) org.apache.solr.core.SolrCore.execute(SolrCore.java:1820) org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:656) org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:359
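Tying this back to the advice above, a sketch of the dual-trigger autoCommit (both limits illustrative; whichever threshold is hit first triggers the commit):

<autoCommit>
  <maxTime>30000</maxTime>   <!-- at most 30 seconds between hard commits -->
  <maxDocs>10000</maxDocs>   <!-- ...or at most 10,000 uncommitted documents -->
  <openSearcher>false</openSearcher>
</autoCommit>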
Re: [solr cloud] solr hangs when indexing large number of documents from multiple threads
Scott, My comment was meant to be a bit tongue-in-cheek, but my intent in the statement was to represent hard failure along the lines Vinay is seeing. We're talking about OutOfMemoryException conditions, total cluster paralysis requiring restart, or other similarly disastrous conditions. Where that line sits is impossible to define generically, but trivial to reach. What any of us running Solr has to achieve is a realistic simulation of our desired production load (probably well above peak) and to see what limits are reached. Armed with that information we tweak. In this case, we look at finding the point where data ingestion reaches a natural limit. For some that may be JVM GC, for others memory buffer size on the client load, and for yet others it may be I/O limits on multithreaded reads from a database or file system.

In old Solr days we had a little less to worry about. We might play with a commitWithin parameter, ramBufferSizeMB tweaks, or contemplate partial commits and rollback recoveries. But with 4.x we now have more durable write options and NRT to consider, and SolrCloud begs to use them. So we have to consider transaction logs, the file handles they leave open until commit operations occur, and how we want to manage writing to all cores simultaneously instead of a narrower master/slave relationship. It's all manageable, all predictable (with some load testing), and all filled with many possibilities to meet our specific needs.

Considering that each person's data model, ingestion pipeline, request processors, and field analysis steps will be different, 5 threads of input at face value doesn't really contemplate the whole problem. We have to measure our actual data against our expectations and find where the weak chain links are in order to strengthen them. The symptoms aren't necessarily predictable in advance of this testing, but they're likely addressable and not difficult to decipher. For what it's worth, SolrCloud is new enough that we're still experiencing some uncharted territory with unknown ramifications, but with continued dialog through channels like these there are fewer territories without good cartography :) Hope that's of use! Jason

On Jun 24, 2013, at 7:12 PM, Scott Lundgren scott.lundg...@carbonblack.com wrote: Jason, Regarding your statement "push you over the edge" - what does that mean? Does it mean uncharted territory with unknown ramifications or something more like specific, known symptoms? I ask because our use is similar to Vinay's in some respects, and we want to be able to push the capabilities of write perf - but not over the edge! In particular, I am interested in knowing the symptoms of failure, to help us troubleshoot the underlying problems if and when they arise. Thanks, Scott

On Monday, June 24, 2013, Jason Hellman wrote: Vinay, You may wish to pay attention to how many transaction logs are being created along the way to your hard autoCommit, which should truncate the open handles for those files. I might suggest setting a maxDocs value in parallel with your maxTime value (you can use both) to ensure the commit occurs at either breakpoint. 30 seconds is plenty of time for 5 parallel processes of 20 document submissions to push you over the edge. Jason

On Jun 24, 2013, at 2:21 PM, Vinay Pothnis poth...@gmail.com wrote: I have 'softAutoCommit' at 1 second and 'hardAutoCommit' at 30 seconds.

On Mon, Jun 24, 2013 at 1:54 PM, Jason Hellman jhell...@innoventsolutions.com wrote: Vinay, What autoCommit settings do you have for your indexing process?
Jason On Jun 24, 2013, at 1:28 PM, Vinay Pothnis poth...@gmail.com wrote: Here is the ulimit -a output: core file size (blocks, -c) 0 data seg size (kbytes, -d) unlimited scheduling priority (-e) 0 file size (blocks, -f) unlimited pending signals (-i) 179963 max locked memory(kbytes, -l) 64 max memory size (kbytes, -m) unlimited open files (-n) 32769 pipe size (512 bytes, -p) 8 POSIX message queues (bytes, -q) 819200 real-time priority (-r) 0 stack size (kbytes, -s) 10240 cpu time(seconds, -t) unlimited max user processes (-u) 14 virtual memory (kbytes, -v) unlimited file locks (-x) unlimited On Mon, Jun 24, 2013 at 12:47 PM, Yago Riveiro yago.rive...@gmail.com wrote: Hi, I have the same issue too, and the deploy is quasi exact like than mine, http://lucene.472066.n3.nabble.com/updating-docs-in-solr-cloud-hangs-td4067388.html#a4067862 With some concurrence and batches of 10 solr apparently have some deadlock distributing updates Can you dump the configuration of the ulimit on your servers?, some people had the same issues because they are reach the ulimit maximum defined for descriptor and process
Re: Restarting SOLR will remove all cache?
Shalin, There's one point to test without caches, which is to establish how much value a cache actually provides. For me, this primarily means providing a benchmark by which to decide when to stop obsessing over caches. But yes, for load testing I definitely agree :) Jason On Jun 21, 2013, at 11:01 AM, Shalin Shekhar Mangar shalinman...@gmail.com wrote: There are no disk caches as such. There is no point in testing without caches. Also, Lucene has field caches required for sorting which cannot be turned off. On Fri, Jun 21, 2013 at 11:22 PM, Learner bbar...@gmail.com wrote: I have a very simple question. Does restarting SOLR removes all caches (including disk caches if any?). I have disabled all caches in solrconfig.xml but even then I see that there is some caching happening all the time. I am currently doing some performance testing and I dont want cache to play any role now.. -- View this message in context: http://lucene.472066.n3.nabble.com/Restarting-SOLR-will-remove-all-cache-tp4072200.html Sent from the Solr - User mailing list archive at Nabble.com. -- Regards, Shalin Shekhar Mangar.
Re: in Solr 3.5, optimization increase the index size to double
And let's not forget the interesting bug in MMapDirectory: http://lucene.apache.org/core/old_versioned_docs/versions/3_5_0/api/all/org/apache/lucene/store/MMapDirectory.html

NOTE: memory mapping uses up a portion of the virtual memory address space in your process equal to the size of the file being mapped. Before using this class, be sure you have plenty of virtual address space, e.g. by using a 64 bit JRE, or a 32 bit JRE with indexes that are guaranteed to fit within the address space. On 32 bit platforms also consult setMaxChunkSize(int) if you have problems with mmap failing because of fragmented address space. If you get an OutOfMemoryException, it is recommended to reduce the chunk size, until it works. Due to this bug in Sun's JRE, MMapDirectory's IndexInput.close() is unable to close the underlying OS file handle. Only when GC finally collects the underlying objects, which could be quite some time later, will the file handle be closed. This will consume additional transient disk usage: on Windows, attempts to delete or overwrite the files will result in an exception; on other platforms, which typically have a delete on last close semantics, while such operations will succeed, the bytes are still consuming space on disk. For many applications this limitation is not a problem (e.g. if you have plenty of disk space, and you don't rely on overwriting files on Windows) but it's still an important limitation to be aware of.

If you're measuring by directory size (and not explicitly by the viewable files) you may very well be seeing this. Jason

On Jun 16, 2013, at 4:53 AM, Erick Erickson erickerick...@gmail.com wrote: Optimizing will _temporarily_ double the index size, but it shouldn't be permanent. Is it possible that you have inadvertently told Solr to keep an extra snapshot? I think it's numberToKeep in your replication handler, but I'm going from memory here. Best Erick

On Fri, Jun 14, 2013 at 2:15 AM, Montu v Boda montu.b...@highqsolutions.com wrote: Hi, I have replicated my index from 1.4 to 3.5, and after replication I tried to optimize the index in 3.5 with the below URL: http://localhost:9002/solr35/collection1/update?optimize=true&commit=true When I optimize the index in 3.5, it increases the index size to double. In 1.4 the size of the index is 428GB, and after optimization in 3.5 it becomes 791GB. Thanks Regards Montu v Boda -- View this message in context: http://lucene.472066.n3.nabble.com/in-Solr-3-5-optimization-increase-the-index-size-to-double-tp4070433.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Filtering down terms in suggest
Aloke, It may be best to simply run a query to populate the suggestion list. While not as fast as the terms component (and suggester offshoots) it can still be tuned to be very, very fast. In this way, you can generate any fq/q combination required to meet your needs. You can play with wildcard searches, or better yet NGram (EdgeNGram) behavior to get the right suggestion data back. I would suggest an additional core to accomplish this (fed via replication) to avoid cache entry collision with your normal queries. Hope that's useful to you. Jason On Jun 12, 2013, at 7:43 AM, Aloke Ghoshal alghos...@gmail.com wrote: Barani - the fq option doesn't work. Jason - the dynamic field option won't work due to the high number of groups and users. On Wed, Jun 12, 2013 at 1:12 AM, Jason Hellman jhell...@innoventsolutions.com wrote: Aloke, If you do not have a factorial problem in the combination of userid and groupid (which I can imagine you might) you could consider creating a field for each combination (u1g1, u2g2) which can easily be done via dynamic fields. Use CopyField to get data into these various constructs (again, easily configured via wildcard patterns) and then send the suggestion query to the right field. Obviously this will get out of hand if you have too many of these...so this has limits. Jason On Jun 11, 2013, at 8:29 AM, Aloke Ghoshal alghos...@gmail.com wrote: Hi, Trying to find a way to filter down the suggested terms set based on the term value of another indexed field? Let's say we have the following documents indexed in Solr: userid:1, groupid:1, content:alpha beta gamma userid:2, groupid:1, content:alternate better garden userid:3, groupid:2, content:altruism bent garner Now a query on (with a dictionary built using terms in the content field): q:groupid:1 AND content:al should suggest alpha alternate, (not altruism, since it has a different groupid). The option to have a separate dictionary per group gets ruled out due to the high number of distinct groups (50K+). Kindly suggest ways to get this working. Thanks, Aloke
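A sketch of such a suggestion query (assuming content is copied into an edge-ngrammed field, here hypothetically named suggest, on the dedicated core):

/solr/suggestcore/select?q=suggest:al&fq=groupid:1&fl=content&rows=10

The fq clause performs the per-group restriction that the terms component cannot.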
Re: Filtering down terms in suggest
Aloke, If you do not have a factorial problem in the combination of userid and groupid (which I can imagine you might) you could consider creating a field for each combination (u1g1, u2g2) which can easily be done via dynamic fields. Use CopyField to get data into these various constructs (again, easily configured via wildcard patterns) and then send the suggestion query to the right field. Obviously this will get out of hand if you have too many of these...so this has limits. Jason On Jun 11, 2013, at 8:29 AM, Aloke Ghoshal alghos...@gmail.com wrote: Hi, Trying to find a way to filter down the suggested terms set based on the term value of another indexed field? Let's say we have the following documents indexed in Solr: userid:1, groupid:1, content:alpha beta gamma userid:2, groupid:1, content:alternate better garden userid:3, groupid:2, content:altruism bent garner Now a query on (with a dictionary built using terms in the content field): q:groupid:1 AND content:al should suggest alpha alternate, (not altruism, since it has a different groupid). The option to have a separate dictionary per group gets ruled out due to the high number of distinct groups (50K+). Kindly suggest ways to get this working. Thanks, Aloke
Re: Two instances of solr - the same datadir?
Roman, Could you be more specific as to why replication doesn't meet your requirements? It was geared explicitly for this purpose, including the automatic discovery of changes to the data on the index master. Jason

On Jun 4, 2013, at 1:50 PM, Roman Chyla roman.ch...@gmail.com wrote: OK, so I have verified the two instances can run alongside, sharing the same datadir. All update handlers are inaccessible in the read-only master:

<updateHandler class="solr.DirectUpdateHandler2" enable="${solr.can.write:true}">

java -Dsolr.can.write=false .

And I can reload the index manually: curl http://localhost:5005/solr/admin/cores?wt=json&action=RELOAD&core=collection1 But this is not an ideal solution; I'd like for the read-only server to discover index changes on its own. Any pointers? Thanks, roman

On Tue, Jun 4, 2013 at 2:01 PM, Roman Chyla roman.ch...@gmail.com wrote: Hello, I need your expert advice. I am thinking about running two instances of solr that share the same data directory. The *reason* being: the indexing instance is constantly rebuilding its cache after every commit (we have a big cache) and this slows it down. But indexing doesn't need much RAM, only the search does (and the server has lots of CPUs). So, it is like having two solr instances: 1. solr-indexing-master 2. solr-read-only-master. In the solrconfig.xml I can disable update components; it should be fine. However, I don't know how to 'trigger' index re-opening on (2) after the commit happens on (1). Ideally, the second instance could monitor the disk and re-open the index after new files appear there. Do I have to implement a custom IndexReaderFactory? Or something else? Please note: I know about replication; this usecase is IMHO slightly different - in fact, the write-only-master (1) is also a replication master. Googling turned out only this http://comments.gmane.org/gmane.comp.jakarta.lucene.solr.user/71912 - no pointers there. But if I am approaching the problem wrongly, please don't hesitate to 're-educate' me :) Thanks! roman
Re: Can mm (min-match) be specified by field in dismax or edismax?
Well, there is a hack(ish) way to do it:

_query_:"{!type=edismax qf='someField' v='$q' mm='100%'}"

This is clearly not a solrconfig.xml setting, but part of your query string using LocalParams behavior. This is going to get really messy if you have plenty of fields you'd like to search, where you'd need a similar construct for each. I cannot attest to performance at scale with such a construct…but it shows a way you can go about this if you feel compelled enough to do so. Jason

On Jun 3, 2013, at 8:08 AM, Jack Krupansky j...@basetechnology.com wrote: No, but you can with the LucidWorks Search query parser: f1:(cat dog fox bat fish cow)~50% f2:(cat dog fox bat fish zebra)~2 See: http://docs.lucidworks.com/display/lweug/Minimum+Match+for+Simple+Queries -- Jack Krupansky

-Original Message- From: Eric Wilson Sent: Monday, June 03, 2013 10:30 AM To: solr-user@lucene.apache.org Subject: Can mm (min-match) be specified by field in dismax or edismax? I would like to have the min-match set differently for different fields in my dismax handler. Is this possible?
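A sketch of how this multiplies per field (field names and mm values illustrative; $qq dereferences a qq request parameter):

q=_query_:"{!type=edismax qf='title' v=$qq mm='100%'}" OR _query_:"{!type=edismax qf='body' v=$qq mm='50%'}"&qq=cat dog fox

…one clause per field, which is exactly where the messiness comes from.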
Re: Getting tons of EofException with jetty/SolrCloud
Those are the defaults, though autoSoftCommit is commented out by default. Keep in mind about the hard commit running every 15 seconds: it is not updating your searchable data (due to the openSearcher=false setting). In theory, your data should be searchable due to autoSoftCommit running every 1 second. Every 15 seconds the hard commit comes along to truncate the transaction logs and persist the data to Lucene segments, but searches are still being served from a combination of the last hard commit with openSearcher=true plus all the soft-committed data in memory. At some point it's useful to call a hard commit with openSearcher=true. This will essentially set the state of all searchable data to the segment data from Lucene. Also, the 15 second default isn't intended to be a one-size-fits-all policy. You need to find a good balance here, and testing this out with simulated load is the right way to do it. Others reading this thread may be able to provide better empirical or anecdotal suggestions to you on settings, but be sure to test!

On May 31, 2013, at 12:14 PM, ltenny lte...@gmail.com wrote:

<autoCommit>
  <maxTime>15000</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>
<autoSoftCommit>
  <maxTime>1000</maxTime>
</autoSoftCommit>

I think these are close to the default values...not sure if I changed them. These mean a hard commit every 15 seconds...right? Seems sort of reasonable since we get a few hundred doc inserts in 15 seconds. Not sure...any advice is very welcome. -- View this message in context: http://lucene.472066.n3.nabble.com/Getting-tons-of-EofException-with-jetty-SolrCloud-tp4067427p4067433.html Sent from the Solr - User mailing list archive at Nabble.com.
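When that moment comes, the deterministic commit can be issued directly (URL illustrative; openSearcher=true is the default for an explicit commit, shown here only for emphasis):

curl 'http://localhost:8983/solr/update?commit=true&openSearcher=true'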
Re: 2 VM setup for SOLRCLOUD?
Jamey, You will need a load balancer on the front end to direct traffic into one of your SolrCore entry points. Technically it doesn't matter which one, though you will find benefits to narrowing traffic to fewer nodes (for purposes of better cache management). Internally, SolrCloud will round-robin requests to the other shards once a query begins execution. But you do need an external entry point defined through your load balancer. Hope this is useful! Jason

On May 30, 2013, at 12:48 PM, James Dulin jdu...@crelate.com wrote: Working to setup SolrCloud in Windows Azure. I have read over the SolrCloud wiki, but am a little confused about some of the deployment options. I am attaching an image of what I am thinking we want to do: 2 VMs that will have 2 shards spanning across them, 4 nodes total across the two machines, and a zookeeper on each VM. I think this is feasible, but I am a little confused about how each node knows how to respond to requests (do I need a load balancer in front, or can we just reference the “collection” etc.). Thanks! Jamey
Re: Nested Facets and distributed shard system.
You have mentioned Pivot Facets, but have you looked at the Path Hierarchy Tokenizer Factory: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.PathHierarchyTokenizerFactory This matches your use case, as best as I understand it. Jason On May 28, 2013, at 12:47 PM, vibhoreng04 vibhoren...@gmail.com wrote: Hi Erick and Markus, Any Idea on this ? can we resolve this by group by queries? -- View this message in context: http://lucene.472066.n3.nabble.com/Nested-Facets-and-distributed-shard-system-tp4065847p4066583.html Sent from the Solr - User mailing list archive at Nabble.com.
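A sketch of the field type in question (names and delimiter illustrative):

<fieldType name="text_path" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.PathHierarchyTokenizerFactory" delimiter="/"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
  </analyzer>
</fieldType>

Indexing a value such as Books/NonFiction/Science emits the tokens Books, Books/NonFiction, and Books/NonFiction/Science, which makes level-by-level facet drill-down straightforward, including across shards.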
Re: split document or not
You may wish to explore the concept of using the Result Grouping (Field Collapsing) feature, in which your paragraphs are individual documents that share a field to group them by (the ID of the document/book/article/whatever). http://wiki.apache.org/solr/FieldCollapsing This will net you absolutely isolated results for paragraphs, and give you a great deal of flexibility in how to query the results in cases where you do or do not need them grouped. Jason

On May 28, 2013, at 3:10 PM, Hard_Club meddn...@gmail.com wrote: Thanks, Alexandre. But I need to know which paragraph matched the request. I need it because paragraphs are bound to some extra data that I need to output on the result page, so I need to know the paragraphs' ids. How can I bind such an attribute to a multivalued field? -- View this message in context: http://lucene.472066.n3.nabble.com/split-document-or-not-tp4066170p4066629.html Sent from the Solr - User mailing list archive at Nabble.com.
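A sketch of the query side (field names illustrative, assuming each paragraph document carries the id of its parent document):

/select?q=content:whale&group=true&group.field=parent_id&group.limit=3

Each group returns its top-matching paragraphs, and any per-paragraph extra data rides along as ordinary stored fields.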
Re: filter query by string length or word count?
Sam, I would highly suggest counting the words in your external pipeline and sending that value in as a specific field. It can then be queried quite simply with a: wordcount:{80 TO *] (Note the { next to 80, excluding the value of 80) Jason

On May 22, 2013, at 11:37 AM, Sam Lee skyn...@gmail.com wrote: I have schema.xml

<field name="body" type="text_en_html" indexed="true" stored="true" omitNorms="true"/>
...
<fieldType name="text_en_html" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_en.txt" enablePositionIncrements="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_en.txt" enablePositionIncrements="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>

How can I query docs whose body has more than 80 words (or 80 characters)?
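The supporting pieces are small (field name illustrative): a schema.xml entry such as

<field name="wordcount" type="int" indexed="true" stored="true"/>

populated by the pipeline at index time, then filtered exactly as above with fq=wordcount:{80 TO *]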
Re: Not able to search Spanish word with ascent in solr
And use the /terms request handler to view what is present in the field: /solr/terms?terms.fl=text_es&terms.prefix=a You're looking to ensure the index does, in fact, have the accented characters present. It's just a sanity check, but could possibly save you a little (sanity, that is). Jason

On May 20, 2013, at 12:51 PM, Jack Krupansky j...@basetechnology.com wrote: Try the Solr Admin UI Analysis page - enter text for both index and query for your field and see whether the final terms still have their accents. -- Jack Krupansky

-Original Message- From: jignesh Sent: Monday, May 20, 2013 10:46 AM To: solr-user@lucene.apache.org Subject: Re: Not able to search Spanish word with ascent in solr Thanks for the reply. I am sending the below type of xml to solr:

<?xml version="1.0" encoding="UTF-8"?>
<add><doc>
<field name="id">15</field>
<field name="id_i">15</field>
<field name="name">Mis nuevos colgantes de PRIMARK</field>
<field name="features">&iquest;Alguna vez os hab&eacute;is pasado por la zona de bisuter&iacute;a de PRIMARK? Cada vez que me doy una vuelta y paso por delante no puedo evitar echar un vistazo a ver si encuentro alg&uacute;n detallito mono. Colgantes, pendientes, pulseras, diademas tienen de todo y siempre est&aacute; bien de precio. Hoy quer&iacute;a ense&ntilde;aros mis dos &uacute;ltimas compras: dos colgantes, uno con forma de b&uacute;ho y otro con un robot fashion. Y lo mejor es que s&oacute;lo me he gastado 5 euros. &iquest;Qu&eacute; os parecen? &iquest;Hab&eacute;is comprado alguna vez en esta tienda?</field>
</doc>

I am using the below url: http://localhost:8983/solr/select/?q=étnico&indent=on&qf=name&qf=features&defType=edismax&start=0&rows=50&wt=json waiting for reply. Thanks -- View this message in context: http://lucene.472066.n3.nabble.com/Not-able-to-search-Spanish-word-with-ascent-in-solr-tp4064404p4064651.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: multiple cache for same field
Most definitely not the number of unique elements in each segment. My 32 document sample index (built from the default example docs data) has the following: entry#0: 'StandardDirectoryReader(segments_b:29 _8(4.2.1):C32)'='manu_exact',class org.apache.lucene.index.SortedDocValues,0.5=org.apache.lucene.search.FieldCacheImpl$SortedDocValuesImpl#1778857102 There is no chance for there to be 1.8 billion unique elements in that index. On May 20, 2013, at 1:20 PM, Erick Erickson erickerick...@gmail.com wrote: Not sure, never had to worry about what they are.. On Mon, May 20, 2013 at 12:28 PM, J Mohamed Zahoor zah...@indix.com wrote: What is the number at the end? is it the no of unique elements in each segment? ./zahoor On 20-May-2013, at 7:37 PM, Erick Erickson erickerick...@gmail.com wrote: Because the same field is split amongst a number of segments. If you look in the index directory, you should see files like _3fgm.* and _3ffm.*. Each such group represents one segment. The number of segments changes with merging etc. Best Erick On Mon, May 20, 2013 at 6:43 AM, J Mohamed Zahoor zah...@indix.com wrote: Hi Why is that lucene field cache has multiple entries for the same field S_24. It is a dynamic field. 'SegmentCoreReader(owner=_3fgm(4.2.1):C7681)'='S_24',double,org.apache.lucene.search.FieldCache.NUMERIC_UTILS_DOUBLE_PARSER=org.apache.lucene.search.FieldCacheImpl$DoublesFromArray#1174240382 'SegmentCoreReader(owner=_3ffm(4.2.1):C1596758)'='S_24',double,org.apache.lucene.search.FieldCache.NUMERIC_UTILS_DOUBLE_PARSER=org.apache.lucene.search.FieldCacheImpl$DoublesFromArray#83384344 'SegmentCoreReader(owner=_3fgh(4.2.1):C2301)'='S_24',double,org.apache.lucene.search.FieldCache.NUMERIC_UTILS_DOUBLE_PARSER=org.apache.lucene.search.FieldCacheImpl$DoublesFromArray#1281331764 Also, the number at the end.. does it specified the no of entries in that cache bucket? ./zahoor
Re: Upgrading from SOLR 3.5 to 4.2.1 Results.
Rishi, Fantastic! Thank you so very much for sharing the details. Jason On May 17, 2013, at 12:29 PM, Rishi Easwaran rishi.easwa...@aol.com wrote: Hi All, It's Friday 3:00pm, warm and sunny outside, and it was a good week. Figured I'd share some good news. I work for the AOL mail team and we use SOLR for our mail search backend. We have been using it since pre-SOLR 1.4 and are strong supporters of the SOLR community. We deal with millions of indexes and billions of requests a day across our complex. We finished the full rollout of SOLR 4.2.1 into our production last week. Some key highlights: - ~75% reduction in search response times - ~50% reduction in SOLR disk busy, which in turn helped with a ~90% reduction in errors - Total garbage collection stop time reduced by over 50%, moving application throughput into the 99.8% - 99.9% range - ~15% reduction in CPU usage We did not tune our application moving from 3.5 to 4.2.1, nor update Java. For the most part it was a binary upgrade, with patches for our special use case. Now going forward we are looking at prototyping SOLR Cloud for our search system, upgrading Java and Tomcat, and tuning our application further. Lots of fun stuff :) Have a great weekend everyone. Thanks, Rishi.
Re: Deleting an entry from a collection when the key has : in it
The first rule of Solr without a Unique Key is that we don't talk about Solr without a Unique Key. The second rule... On May 16, 2013, at 8:47 PM, Jack Krupansky j...@basetechnology.com wrote: Technically, core Solr does not require a unique key. A lot of features in Solr do require unique keys, and it is recommended that you have unique keys, but it is not an absolute requirement. -- Jack Krupansky -Original Message- From: Daniel Baughman Sent: Thursday, May 16, 2013 1:50 PM To: solr-user@lucene.apache.org Subject: RE: Deleting an entry from a collection when the key has : in it Thanks for the idea http://localhost:8983/solr/docrepo/update/?stream.body=%3Cdelete%3E%3Cquery%3Ekey%3AD\:\\Webdocs\\sw4\\docRepo\\documents\\Hiring%20Manager\\Disciplinary\\asdfasdf\.docx%3C%2Fquery%3E%3C%2Fdelete%3E I do have :'s and \'s escaped, I believe. If in my schema I have the key field set to indexed=false, then is that maybe the issue? I'm going to try to set that to true and rebuild the repository and see if that does it. -Original Message- From: Jack Krupansky [mailto:j...@basetechnology.com] Sent: Thursday, May 16, 2013 11:20 AM To: solr-user@lucene.apache.org Subject: Re: Deleting an entry from a collection when the key has : in it You need to escape colons in queries, using either a backslash or enclosing the full query term in quotes. In your case, you have backslashes as well in your query, which the query parser will interpret as an escape! So, you need to escape those backslashes as well: D\:\\somedir\\somefile.pdf or "D:\\somedir\\somefile.pdf" -- Jack Krupansky -Original Message- From: Daniel Baughman Sent: Thursday, May 16, 2013 11:33 AM To: solr-user@lucene.apache.org Subject: Deleting an entry from a collection when the key has : in it Hi All, I seem to be really struggling to delete an entry from a search repository that has a : in the key. The key is the path to the file, i.e. D:\somedir\somefile.pdf. I want to use a query to delete it and I just can't seem to make it go away. I've been trying stuff like this: http://localhost:8983/solr/docrepo/update/?stream.body=%3Cdelete%3E%3Cquery%3Ekey%3AD\:\\Webdocs\\sw4\\docRepo\\documents\\Hiring%20Manager\\Disciplinary\\asdfasdf\.docx%3C%2Fquery%3E%3C%2Fdelete%3E http://localhost:8983/solr/docrepo/update/?stream.body=%3Cdelete%3E%3Cquery%3Ekey%3AD\:\\Webdocs\\sw4\\docRepo\\documents\\Hiring%20Manager\\Disciplinary\\asdfasdf\.docx%3C%2Fquery%3E%3C%2Fdelete%3E&version=2.2&start=0&rows=10&indent=on It doesn't throw an error but it doesn't delete the document either. Does anyone have any suggestions? Thanks, Dan
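A footnote on Dan's hunch: indexed=false on the key field is indeed fatal for delete-by-query, since a query can only match what is indexed. Separately, if key is declared as the uniqueKey, delete-by-id sidesteps query-parser escaping entirely, because the value is matched verbatim rather than parsed. A sketch of the unencoded request body (assuming key is a string-typed uniqueKey, which the thread suggests but does not confirm):

<delete><id>D:\Webdocs\sw4\docRepo\documents\Hiring Manager\Disciplinary\asdfasdf.docx</id></delete>

No backslash or colon escaping is needed in the id form; only standard URL encoding applies when sending it via stream.body.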
Re: Aggregate word counts over a subset of documents
David, A Pivot Facet could possibly provide these results by the following syntax: facet.pivot=category,includes We would presume that includes is a tokenized field, and thus a set of facet values would be rendered from the terms resulting from that tokenization. This would be nested in each category…and, of course, the entire set of documents considered for these facets is constrained by the current query. I think this maps to your requirement. Jason On May 16, 2013, at 12:29 PM, David Larochelle dlaroche...@cyber.law.harvard.edu wrote: Is there a way to get aggregate word counts over a subset of documents? For example, given the following data: { "id": 1, "category": "cat1", "includes": "The green car." }, { "id": 2, "category": "cat1", "includes": "The red car." }, { "id": 3, "category": "cat2", "includes": "The black car." } I'd like to be able to get total term frequency counts per category, e.g.: <category name="cat1"> <lst name="the">2</lst> <lst name="car">2</lst> <lst name="green">1</lst> <lst name="red">1</lst> </category> <category name="cat2"> <lst name="the">1</lst> <lst name="car">1</lst> <lst name="black">1</lst> </category> I was initially hoping to do this within Solr and I tried using the TermFrequencyComponent. This gives term frequencies for individual documents and term frequencies for the entire index but doesn't seem to help with subsets. For example, TermFrequencyComponent would tell me that "car" occurs 3 times over all documents in the index and 1 time in document 1, but not that it occurs 2 times over cat1 documents and 1 time over cat2 documents. Is there a good way to use Solr/Lucene to gather aggregate results like this? I've been focusing on just using Solr with XML files but I could certainly write Java code if necessary. Thanks, David
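Spelled out against the sample data, the request might look like this (a sketch; rows=0 suppresses the documents themselves, and facet.limit=-1 asks for every term):

http://localhost:8983/solr/select?q=*:*&rows=0&facet=true&facet.pivot=category,includes&facet.limit=-1

One caveat: pivot facet counts are document counts, not summed term frequencies. They match the desired numbers in this example only because no term repeats within a single document; a document containing "car car" would still count once for "car".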
Re: Solr - Best Java Combination for performance?
I have run across plenty of implementations using just about every common servlet container on the market, and haven't run across any common problems to dissuade you from any one of them. On the JVM front most people seem to use Oracle because of its ubiquity. But I have also run across a solid minority of OpenJDK deployments and they seem just fine. For that matter, more than a handful of custom JVMs (usually via IBM). The advice I always give on this topic leans heavily on practical considerations: which servlet container and JVM does your team know best how to address if a problem occurs? If you're unsure, I'd stick with Tomcat and Oracle since they are the most common, and you'll find metric tons of help via posts on the internet that may coincide with an issue or optimization you're considering. Hope that's useful! On May 11, 2013, at 4:56 AM, Spadez james_will...@hotmail.com wrote: Hi, I was wondering, what setup have people had the most luck with from a performance point of view? Tomcat vs Jetty Open JDK vs Oracle JDK I haven't been able to find any information online to back up any sort of performance claims. I am planning on using Tomcat with Open JDK; has anyone had any experience with this and is it a wise path to go down? Thanks! -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-Best-Java-Combination-for-performance-tp4062554.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Negative Boosting at Recent Versions of Solr?
You learn the gosh-darndest things: http://localhost:8983/solr/browse?q=ipod&bf=product(price,-2)&debugQuery=on …nets: -0.3797992 = (MATCH) sum of: 0.13510442 = (MATCH) max of: 0.045963455 = (MATCH) weight(text:ipod^0.5 in 4) [DefaultSimilarity], result of: 0.045963455 = score(doc=4,freq=3.0 = termFreq=3.0 ), product of: …blah blah blah… -0.5149036 = (MATCH) FunctionQuery(product(float(price),const(-2))), product of: -23.0 = product(float(price)=11.5,const(-2)) 1.0 = boost 0.022387113 = queryNorm …it works! Similarly with boost=: -3.1081805 = (MATCH) boost((id:ipod^10.0 | author:ipod^2.0 | title:ipod^10.0 | text:ipod^0.5 | cat:ipod^1.4 | keywords:ipod^5.0 | manu:ipod^1.1 | description:ipod^5.0 | resourcename:ipod | name:ipod^1.2 | features:ipod | sku:ipod^1.5),product(float(price),const(-2))), product of: 0.13513829 = (MATCH) max of: 0.045974977 = (MATCH) weight(text:ipod^0.5 in 4) [DefaultSimilarity], result of: 0.045974977 = score(doc=4,freq=3.0 = termFreq=3.0 ), product of: …more blah… -23.0 = product(float(price)=11.5,const(-2)) I wonder how fantastically this can be abused now? On May 10, 2013, at 7:22 AM, Dyer, James james.d...@ingramcontent.com wrote: Despite the discussion in SOLR-3823/SOLR-3278, my experience with Solr 4.2 is that it does indeed allow negative boosts on both bf and qf. I think the functionality was added under the radar, possibly with SOLR-4093, not sure though. In disbelief, I did some testing and it seems to really work. James Dyer Ingram Content Group (615) 213-4311 -Original Message- From: Jack Krupansky [mailto:j...@basetechnology.com] Sent: Thursday, May 09, 2013 5:41 PM To: solr-user@lucene.apache.org Subject: Re: Negative Boosting at Recent Versions of Solr? Solr does support both additive and multiplicative boosts. Although Solr doesn't support negative multiplicative boosts on query terms, it does support fractional multiplicative boosts (0.25) which do allow you to de-boost a term. The boosts for individual query terms and for the edismax qf parameter cannot be negative, but can be fractional. The edismax bf parameter gives a function query that provides an additive boost, which could be negative. The edismax boost parameter gives a function query that provides a multiplicative boost - which could also be negative, so it's not absolutely true that Solr doesn't support negative boosts. -- Jack Krupansky -Original Message- From: Furkan KAMACI Sent: Thursday, May 09, 2013 6:08 PM To: solr-user@lucene.apache.org Subject: Negative Boosting at Recent Versions of Solr? I know that whilst Lucene allows negative boosts, Solr does not. However, did it change with newer versions of Solr (I use Solr 4.2.1) or is it still the same?
Re: Looking for Best Practice of Spellchecker
Nicholas, Also consider that some misspellings are better handled through synonyms (or injected metadata). You can garner a great deal of value out of the spell checker by following the great advice James is giving here…but you'll find a well-placed helper synonym or metavalue can often save a lot of headache and time. Jason On May 10, 2013, at 7:32 AM, Dyer, James james.d...@ingramcontent.com wrote: Nicholas, It sounds like you might want to use WordBreakSolrSpellChecker, which gets obscure mention in the wiki. Read through this section: http://wiki.apache.org/solr/SpellCheckComponent#Configuration and you will see some information. Also, the Solr example shows how to configure this. See http://svn.apache.org/repos/asf/lucene/dev/branches/branch_4x/solr/example/solr/collection1/conf/solrconfig.xml Look for... <lst name="spellchecker"> <str name="name">wordbreak</str> ... </lst> ...and... <requestHandler name="/spell" ...> ... </requestHandler> Also, I'd recommend you take a look at each parameter in the /spell request handler and read its section on the SpellCheckComponent wiki page. You probably will want to set many of these parameters as well. You can get a query to return only spell results simply by specifying rows=0. However, it's one less query to just have it return the results also. If there are no results, your application can check for collations and re-issue a collation query. If there are both results and collations returned, you can give the user results with did-you-mean suggestions. James Dyer Ingram Content Group (615) 213-4311 -Original Message- From: Nicholas Ding [mailto:nicholas...@gmail.com] Sent: Friday, May 10, 2013 8:47 AM To: solr-user@lucene.apache.org Subject: Looking for Best Practice of Spellchecker Hi guys, I'm working on a local search project and I want to integrate a spellchecker for the search. Basically, my search engine is used to search local businesses. For example, a user could search for "wall mart" (note the typo), and I want the spellchecker to give me a collation for "walmart". My problems are: 1. I use DirectSolrSpellChecker on my BusinessNameField and pass "wall mart" as a phrase search, but I can't get a collation from the spellchecker. 2. I tried not passing a phrase search, but passing q=Wall AND Mart to force a 100% match, but the spellchecker can't give me a collation either. I read the documentation about the spellchecker on the Solr wiki, but it's very brief. I'm wondering, is there any best practice for spellcheckers? I believe they're widely used in search, right? And I have another idea, I don't know whether it's valid or not. I want to apply the spellchecker to everything before doing the search, so that I can rely on the spellchecker to tell me whether my search will get results or not. Thanks Nicholas
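For the "wall mart" to "walmart" case specifically, the word-break checker's combineWords feature is the relevant piece. A configuration sketch along the lines of the 4.x example solrconfig (the field name here is taken from the thread and may need adjusting):

<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <lst name="spellchecker">
    <str name="name">default</str>
    <str name="field">BusinessNameField</str>
    <str name="classname">solr.DirectSolrSpellChecker</str>
  </lst>
  <lst name="spellchecker">
    <str name="name">wordbreak</str>
    <str name="classname">solr.WordBreakSolrSpellChecker</str>
    <str name="field">BusinessNameField</str>
    <str name="combineWords">true</str>  <!-- "wall mart" -> "walmart" -->
    <str name="breakWords">true</str>    <!-- "walmart" -> "wall mart" -->
    <int name="maxChanges">10</int>
  </lst>
</searchComponent>

Then reference both dictionaries at query time, e.g. spellcheck=true&spellcheck.dictionary=default&spellcheck.dictionary=wordbreak&spellcheck.collate=true, so that collations can merge or split words as needed.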
Re: Sharing index data between two Solr instances
Milen, At some point you'll need to call a commit to search your data, either via an AutoCommit policy or deterministically. There are various schools of thought on which way to go, but something needs to do this. If you go the AutoCommit route, be sure to pay attention to the openSearcher value. A value of false will not cause an IndexSearcher to open the new data, and there is a strong use case for this…but if you're not aware you might be caught by surprise. Once the commit fires your search process will automatically see the new data, with no interruption to its queue of queries. You may also want to consider having a Master/Slave relationship via replication for higher availability. It is trivial to set up and works like a charm. Jason On May 10, 2013, at 8:14 AM, milen.ti...@materna.de wrote: Hello together! I've been googling on this topic but still couldn't find a definitive answer to my question. We have a setup of two machines, both running Solr 4.2 within Tomcat. We are considering sharing the index data between both webapps. One of the machines will be configured to update the index periodically; the other one will be accessing it read-only. Using native locking on a network-mounted NTFS, is it possible for the reader to detect when new index data has been imported, or do we need to signal it from the updating webapp and make a commit in order to open a new reader with the updated content? Thanks in advance! Milen Tilev Master of Science Softwareentwickler Business Unit Information MATERNA GmbH Information & Communications Voßkuhle 37 44141 Dortmund Deutschland Telefon: +49 231 5599-8257 Fax: +49 231 5599-98257 E-Mail: milen.ti...@materna.de | www.materna.de | Newsletter: http://www.materna.de/newsletter | Twitter: http://twitter.com/MATERNA_GmbH | XING: http://www.xing.com/companies/MATERNAGMBH | Facebook: http://www.facebook.com/maternagmbh Sitz der MATERNA GmbH: Voßkuhle 37, 44141 Dortmund Geschäftsführer: Dr. Winfried Materna, Helmut an de Meulen, Ralph Hartwig Amtsgericht Dortmund HRB 5839
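A sketch of the relevant solrconfig.xml block on the updating instance (the values are placeholders to adjust):

<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxTime>60000</maxTime>           <!-- hard commit every 60s to flush and fsync segments -->
    <openSearcher>false</openSearcher> <!-- do not expose the new data to searches yet -->
  </autoCommit>
</updateHandler>

With openSearcher=false the commit only makes the index durable; a searcher (on this instance or the read-only one) still needs an explicit commit, or a commit with openSearcher=true, before queries see the new documents.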
Re: Sharing index data between two Solr instances
Milen, It is possible to have the configuration shared amongst multiple cores, I have seen this…though I haven't seen multiple separate instances share the same solr.xml core configuration (and, for that matter, separate possible locking policies). It might work. Honestly, I don't like it. Your config is not likely changing often, and keeping these in sync should be relatively trivial for your data ingestion delegate. But all of this is what replication does for you. Of course, as you note, there is latency…and as such you may wish to consider SolrCloud instead. Or an NRT (non-SolrCloud) configuration. You have a lot of options! But the replication master/slave behavior is rock solid and does nearly everything you seek. Jason On May 10, 2013, at 8:40 AM, milen.ti...@materna.de wrote: Hello Jason, Thanks for your quick response! The alternative of using Solr replication is also still pending at this point, so we will consider its pros and cons, too. Fortunately, we are not using AutoCommit in our project, as we need to control the creation of new segments, so I will propose to my colleagues that we issue a manual commit on the read-only node after each successful index update. Just one more question: would it be possible in this case to use the same solrhome/conf directory (shared schema and solrconfig) and solr.xml file within both webapps? I guess we should then signal the read-only side each time the solr.xml has changed (additional cores may be added by the updating machine depending on the imported data). Thanks again and best regards! Milen -Original Message- From: Jason Hellman [mailto:jhell...@innoventsolutions.com] Sent: Friday, May 10, 2013 17:30 To: solr-user@lucene.apache.org Subject: Re: Sharing index data between two Solr instances Milen, At some point you'll need to call a commit to search your data, either via an AutoCommit policy or deterministically. There are various schools of thought on which way to go, but something needs to do this. If you go the AutoCommit route, be sure to pay attention to the openSearcher value. A value of false will not cause an IndexSearcher to open the new data, and there is a strong use case for this…but if you're not aware you might be caught by surprise. Once the commit fires your search process will automatically see the new data, with no interruption to its queue of queries. You may also want to consider having a Master/Slave relationship via replication for higher availability. It is trivial to set up and works like a charm. Jason On May 10, 2013, at 8:14 AM, milen.ti...@materna.de wrote: Hello together! I've been googling on this topic but still couldn't find a definitive answer to my question. We have a setup of two machines, both running Solr 4.2 within Tomcat. We are considering sharing the index data between both webapps. One of the machines will be configured to update the index periodically; the other one will be accessing it read-only. Using native locking on a network-mounted NTFS, is it possible for the reader to detect when new index data has been imported, or do we need to signal it from the updating webapp and make a commit in order to open a new reader with the updated content? Thanks in advance!
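Since replication keeps coming up in this thread, here is a minimal master/slave sketch of the /replication handler for 4.x (the master URL, core name, and poll interval are placeholders). On the updating (master) instance:

<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="replicateAfter">commit</str>
    <str name="confFiles">schema.xml,solrconfig.xml</str>
  </lst>
</requestHandler>

On the read-only (slave) instance:

<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">http://master-host:8983/solr/corename</str>
    <str name="pollInterval">00:00:60</str>
  </lst>
</requestHandler>

The confFiles entry also speaks to the shared-configuration question above: schema and solrconfig changes ride along with the index, and the slave opens a new searcher automatically after each successful pull.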
Re: SOLR guidance required
One more tip on the use of filter queries. DO: fq=name1:value1&fq=name2:value2&fq=namen:valuen DON'T: fq=name1:value1 AND name2:value2 AND name3:value3 Where OR operators apply, this does not matter. But your Solr cache will be much more savvy with the first construct: each separate fq clause is cached as its own filterCache entry and can be reused independently by later queries, while the ANDed version is cached as a single monolithic entry that only helps queries repeating the exact same combination. Jason On May 10, 2013, at 11:39 AM, pravesh suyalprav...@yahoo.com wrote: Aditya, As suggested by others, definitely you should use the filter queries directly to query SOLR. Just keep your indexes updated. Keep all your fields indexed/stored as per your requirements. Refer to the filter query wiki: http://wiki.apache.org/solr/CommonQueryParameters http://wiki.apache.org/solr/SimpleFacetParameters BTW, almost all the job sites out there (whether small/medium/big) use SOLR/lucene to power their searches :) Best Pravesh -- View this message in context: http://lucene.472066.n3.nabble.com/SOLR-guidance-required-tp4062188p4062422.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Does Distributed Search are Cached Only the By Node That Runs Query?
And for 10,000 documents across n shards, that can be significant! On May 10, 2013, at 11:43 AM, Joel Bernstein joels...@gmail.com wrote: How many shards are in your collection? The query aggregator node will pull back the results from each shard and hold them in memory. Then it will add the results to a priority queue to sort them. This queue will need to be as large as the page that is being generated. After the query is finished this memory should be collectable. On Thu, May 9, 2013 at 8:00 PM, Otis Gospodnetic otis.gospodne...@gmail.com wrote: You are looking at the JVM heap but attributing it to caching only. Not quite right...there are other things in that JVM heap. Otis Solr & ElasticSearch Support http://sematext.com/ On May 9, 2013 3:55 PM, Furkan KAMACI furkankam...@gmail.com wrote: I have Solr 4.2.1 and run it as SolrCloud. When I do a search on SolrCloud like this: ip_of_node_1:8983/solr/select?q=*:*&rows=1 and when I check the admin page I see that: I have 5 GB Java heap. 616.32 MB is dark gray, 3.13 GB is gray. Before my search it was something like: 150 MB dark gray, 500 MB gray. I understand that when I do a search like that, fields are cached. However when I look at other SolrCloud nodes' admin pages there are no differences. Why is that query cached only by the node that I run the query on?
Re: Use case for storing positions and offsets in index?
Consider further that term vector data and highlighting become very useful if you highlight externally to Solr. That is to say, you have the data stored externally and wish to re-parse positions of terms (especially synonyms) from source material. This is a (not too uncommon) technique used for extremely large articles, where storing the full text in the Lucene index as well would be redundant. On May 8, 2013, at 11:04 PM, Jack Krupansky j...@basetechnology.com wrote: Term positions in the index are used for phrase queries and span queries. There is a separate concept called term vectors that maintains positions as well. It is most useful for highlighting - you want to know exactly where a term started and ended. -- Jack Krupansky -Original Message- From: KnightRider Sent: Tuesday, May 07, 2013 12:58 PM To: solr-user@lucene.apache.org Subject: Use case for storing positions and offsets in index? Can someone please tell me the use case for storing term positions and offsets in the index? I am trying to understand the difference between storing positions/offsets vs indexing positions/offsets. Thanks KR - Thanks -K'Rider -- View this message in context: http://lucene.472066.n3.nabble.com/Use-case-for-storing-positions-and-offsets-in-index-tp4061376.html Sent from the Solr - User mailing list archive at Nabble.com.
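In schema.xml terms, the term vector flavor of positions and offsets is switched on per field; a sketch (the field and type names are placeholders):

<field name="content" type="text_general" indexed="true" stored="true"
       termVectors="true" termPositions="true" termOffsets="true"/>

This stores positions and offsets alongside the term vectors for fast highlighting lookups, as distinct from the positions the inverted index itself keeps for phrase and span queries.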
Re: Grouping search results by field returning all search results for a given query
Luis, I am presuming you do not have an overarching grouping value here…and simply wish to show a standard search result that shows 1 item per company. You should be able to accomplish your second page of desired items (the second item from each of your 20 represented companies) by using the group.offset parameter. This will shift the position in the returned array of documents to the value provided. Thus: group.limit=1&group.field=companyid&group.offset=1 …would return the second item in each companyid group matching your current query. Jason On May 9, 2013, at 10:30 AM, Luis Carlos Guerrero Covo lcguerreroc...@gmail.com wrote: Hi, I'm using solr to maintain an index of items that belong to different companies. I want the search results to be returned in a way that is fair to all companies, thus I wish to group the results such that each company has 1 item in each group, and the groups of results should be returned sorted by score. example: -- 20 companies first 100 results 1-20 results - (company1 highest score item, company2 highest score item, etc..) 20-40 results - (company1 second highest score item, company 2 second highest score item, etc..) ... -- I'm trying to use the field collapsing feature but I have only been able to create the first group of results by using group.limit=1,group.field=companyid. If I raise the group.limit value, I would be violating the 'fairness rule' because more than one result of a company would be returned in the first group of results. Can I achieve the desired search result using SOLR, or do I have to look at other options? thank you, Luis Guerrero
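As a full request for page two of this one-item-per-company view, something like the following (the query value is a placeholder; note that group=true is required to enable grouping at all):

http://localhost:8983/solr/select?q=<your query>&group=true&group.field=companyid&group.limit=1&group.offset=1

Groups are ordered by the sort criterion of their best-matching document (score by default), so the fairness ordering described above is preserved across pages.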
Re: 4.3 logging setup
From: http://lucene.apache.org/solr/4_3_0/changes/Changes.html#4.3.0.upgrading_from_solr_4.2.0 Slf4j/logging jars are no longer included in the Solr webapp. All logging jars are now in example/lib/ext. Changing logging impls is now as easy as updating the jars in this folder with those necessary for the logging impl you would like. If you are using another webapp container, these jars will need to go in the corresponding location for that container. In conjunction, the dist-excl-slf4j and dist-war-excl-slf4j build targets have been removed since they are redundant. See the Slf4j documentation, SOLR-3706, and SOLR-4651 for more details. It should just require that you provide your preferred logging jars within an appropriate classpath. On May 9, 2013, at 9:24 AM, richardg richa...@dvdempire.com wrote: On all prior index versions I set up my logging via the logging.properties file in /usr/local/tomcat/conf; it looked like this: # Licensed to the Apache Software Foundation (ASF) under one or more # contributor license agreements. See the NOTICE file distributed with # this work for additional information regarding copyright ownership. # The ASF licenses this file to You under the Apache License, Version 2.0 # (the "License"); you may not use this file except in compliance with # the License. You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # # Unless required by applicable law or agreed to in writing, software # distributed under the License is distributed on an "AS IS" BASIS, # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. handlers = 1catalina.org.apache.juli.FileHandler, 2localhost.org.apache.juli.FileHandler, 3manager.org.apache.juli.FileHandler, 4host-manager.org.apache.juli.FileHandler, java.util.logging.ConsoleHandler .handlers = 1catalina.org.apache.juli.FileHandler, java.util.logging.ConsoleHandler # Handler specific properties. # Describes specific configuration info for Handlers. 1catalina.org.apache.juli.FileHandler.level = WARNING 1catalina.org.apache.juli.FileHandler.directory = ${catalina.base}/logs 1catalina.org.apache.juli.FileHandler.prefix = catalina. 2localhost.org.apache.juli.FileHandler.level = FINE 2localhost.org.apache.juli.FileHandler.directory = ${catalina.base}/logs 2localhost.org.apache.juli.FileHandler.prefix = localhost. 3manager.org.apache.juli.FileHandler.level = FINE 3manager.org.apache.juli.FileHandler.directory = ${catalina.base}/logs 3manager.org.apache.juli.FileHandler.prefix = manager. 4host-manager.org.apache.juli.FileHandler.level = FINE 4host-manager.org.apache.juli.FileHandler.directory = ${catalina.base}/logs 4host-manager.org.apache.juli.FileHandler.prefix = host-manager. java.util.logging.ConsoleHandler.level = FINE java.util.logging.ConsoleHandler.formatter = java.util.logging.SimpleFormatter # Facility specific properties. # Provides extra control for each logger.
org.apache.catalina.core.ContainerBase.[Catalina].[localhost].level = INFO org.apache.catalina.core.ContainerBase.[Catalina].[localhost].handlers = 2localhost.org.apache.juli.FileHandler org.apache.catalina.core.ContainerBase.[Catalina].[localhost].[/manager].level = INFO org.apache.catalina.core.ContainerBase.[Catalina].[localhost].[/manager].handlers = 3manager.org.apache.juli.FileHandler org.apache.catalina.core.ContainerBase.[Catalina].[localhost].[/host-manager].level = INFO org.apache.catalina.core.ContainerBase.[Catalina].[localhost].[/host-manager].handlers = 4host-manager.org.apache.juli.FileHandler # For example, set the org.apache.catalina.util.LifecycleBase logger to log # each component that extends LifecycleBase changing state: #org.apache.catalina.util.LifecycleBase.level = FINE # To see debug messages in TldLocationsCache, uncomment the following line: #org.apache.jasper.compiler.TldLocationsCache.level = FINE After upgrading to 4.3 today the files defined above aren't being logged to. I know things have changed for logging with 4.3, but how can I get it set up like it was before? -- View this message in context: http://lucene.472066.n3.nabble.com/4-3-logging-setup-tp4061875.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: More Like This and Caching
Purely from empirical observation, both the DocumentCache and QueryResultCache are being populated and reused in reloads of a simple MLT search. You can see in the cache inserts how much extra-curricular activity is happening to populate the MLT data by how many inserts and lookups occur on the first load. (lifted right out of the MLT wiki http://wiki.apache.org/solr/MoreLikeThis ) http://localhost:8983/solr/select?q=apache&mlt=true&mlt.fl=manu,cat&mlt.mindf=1&mlt.mintf=1&fl=id,score There is no activity in the filterCache, fieldCache, or fieldValueCache - and that makes plenty of sense. On May 9, 2013, at 11:12 AM, David Parks davidpark...@yahoo.com wrote: I'm not the expert here, but perhaps what you're noticing is actually the OS's disk cache. The actual solr index isn't cached by solr, but as you read the blocks off disk the OS disk cache probably did cache those blocks for you. On the 2nd run the index blocks were read out of memory. There was a very extensive discussion on this list not long back titled: Re: SolrCloud loadbalancing, replication, and failover look that thread up and you'll get a lot of in-depth on the topic. David -Original Message- From: Giammarco Schisani [mailto:giamma...@schisani.com] Sent: Thursday, May 09, 2013 2:59 PM To: solr-user@lucene.apache.org Subject: More Like This and Caching Hi all, Could anybody explain which Solr cache (e.g. queryResultCache, documentCache, fieldCache, etc.) can be used by the More Like This handler? One of my colleagues had previously suggested that the More Like This handler does not take advantage of any of the Solr caches. However, if I issue two identical MLT requests to the same Solr instance, the second request will execute much faster than the first request (for example, the first request will execute in 200ms and the second request will execute in 20ms). This makes me believe that at least one of the Solr caches is being used by the More Like This handler. I think the documentCache is the cache that is most likely being used, but would you be able to confirm? As information, I am currently using Solr version 3.6.1. Kind regards, Giammarco Schisani
Re: 4.3 logging setup
If you nab the jars in example/lib/ext and place them within the appropriate folder in Tomcat (and this will somewhat depend on which version of Tomcat you are using…let's presume tomcat/lib as a brute-force approach) you should be back in business. On May 9, 2013, at 11:41 AM, richardg richa...@dvdempire.com wrote: Thanks for responding. My issue is I've never changed anything with logging; I have always used the built-in Juli. I've never messed with any jar files, just had to edit the logging.properties file. I don't know where I would get the jars for Juli or where to put them, if that is what is needed. I had read what you posted before; I just can't make any sense of it. Thanks -- View this message in context: http://lucene.472066.n3.nabble.com/4-3-logging-setup-tp4061875p4061901.html Sent from the Solr - User mailing list archive at Nabble.com.
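For reference, the logging jars that ship in example/lib/ext for 4.3 are the slf4j API plus bridges and a log4j binding (slf4j-api, slf4j-log4j12, jcl-over-slf4j, jul-to-slf4j, and log4j itself). Copy those into tomcat/lib, then give log4j a configuration on the classpath so it knows where to write. A minimal log4j.properties sketch (the log path is a placeholder; adjust to taste):

# route everything at INFO to a rolling file
log4j.rootLogger=INFO, file
log4j.appender.file=org.apache.log4j.RollingFileAppender
log4j.appender.file.File=/usr/local/tomcat/logs/solr.log
log4j.appender.file.MaxFileSize=10MB
log4j.appender.file.MaxBackupIndex=9
log4j.appender.file.layout=org.apache.log4j.PatternLayout
log4j.appender.file.layout.ConversionPattern=%d{ISO8601} %-5p [%t] %c - %m%n

Dropping that file into tomcat/lib puts it on the classpath, which replaces the old Juli logging.properties approach for Solr's own log output.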
Re: Grouping search results by field returning all search results for a given query
I would think pagination is resolved by obtaining the numFound value for your returned groups. If you have numFound=6 then each page of 20 items (one item per company) would imply a total of 6 pages. You'll have to arbitrate for the variance here…but it would seem to me you need as many pages as the highest value in the numFound field for all groups. This shouldn't require requerying but will definitely require a little intelligence on the web app to handle the groups that are less than the largest size. Hope that's useful! On May 9, 2013, at 12:23 PM, Luis Carlos Guerrero Covo lcguerreroc...@gmail.com wrote: Thank you for the prompt reply, Jason. The group.offset parameter is working for me; now I can iterate through all items for each company. The problem I'm having right now is pagination. Is there a way this can be implemented out of the box with Solr? Before, I was using group.main=true for easy pagination of results, but it seems like I'll have to ditch that and use the standard grouping format returned by Solr for the group.offset parameter to be useful. Since all groups don't have the same number of items, I'll have to carefully calculate the results that should be returned for each page of 20 items and probably make several Solr calls per page rendered. On Thu, May 9, 2013 at 1:07 PM, Jason Hellman jhell...@innoventsolutions.com wrote: Luis, I am presuming you do not have an overarching grouping value here…and simply wish to show a standard search result that shows 1 item per company. You should be able to accomplish your second page of desired items (the second item from each of your 20 represented companies) by using the group.offset parameter. This will shift the position in the returned array of documents to the value provided. Thus: group.limit=1&group.field=companyid&group.offset=1 …would return the second item in each companyid group matching your current query. Jason On May 9, 2013, at 10:30 AM, Luis Carlos Guerrero Covo lcguerreroc...@gmail.com wrote: Hi, I'm using solr to maintain an index of items that belong to different companies. I want the search results to be returned in a way that is fair to all companies, thus I wish to group the results such that each company has 1 item in each group, and the groups of results should be returned sorted by score. example: -- 20 companies first 100 results 1-20 results - (company1 highest score item, company2 highest score item, etc..) 20-40 results - (company1 second highest score item, company 2 second highest score item, etc..) ... -- I'm trying to use the field collapsing feature but I have only been able to create the first group of results by using group.limit=1,group.field=companyid. If I raise the group.limit value, I would be violating the 'fairness rule' because more than one result of a company would be returned in the first group of results. Can I achieve the desired search result using SOLR, or do I have to look at other options? thank you, Luis Guerrero -- Luis Carlos Guerrero Covo M.S. Computer Engineering (57) 3183542047
Re: disaster recovery scenarios for solr cloud and zookeeper
I have to imagine I'm quibbling with the original assertion that Solr 4.x is architected with a dependency on Zookeeper when I say the following: Solr 4.x is not architected with a dependency on Zookeeper. SolrCloud, however, is. As such, if a line of reasoning drives greater concern about Zookeeper than about Solr's resiliency, one can simply opt to use Solr 4.x without Zookeeper. I have to further imagine that isn't really the point of the original message. Unfortunately for me, somehow I'm obsessing on saying it :) On May 3, 2013, at 12:21 PM, Dennis Haller dhal...@talenttech.com wrote: Hi, Solr 4.x is architected with a dependency on Zookeeper, and Zookeeper is expected to have very high (perfect?) availability. With 3 or 5 zookeeper nodes, it is possible to manage zookeeper maintenance and online availability to be close to 100%. But what is the worst case for Solr if, for some unanticipated reason, all Zookeeper nodes go offline? Could someone comment on a couple of possible scenarios in which all ZK nodes are offline? What would happen to Solr and what would be needed to recover in each case? 1) brief interruption, say 2 minutes, 2) longer downtime, say 60 min Thanks Dennis